Add the initial version of the Polish stemmer #159

tomek-ai · 2021-12-08T15:51:34Z

No description provided.

ojwb · 2021-12-17T04:04:40Z

libstemmer/modules.txt

@@ -28,6 +28,7 @@ italian         UTF_8,ISO_8859_1        italian,it,ita
 lithuanian      UTF_8                   lithuanian,lt,lit
 nepali          UTF_8                   nepali,ne,nep
 norwegian       UTF_8,ISO_8859_1        norwegian,no,nor
+polish          UTF_8                   polish,pl,pol


I think this could be UTF_8,ISO_8859_2 (if we're going to have ISO-8859-2 support in libstemmer, we ought to have it available for languages for which it covers the alphabet).

ojwb · 2021-12-17T04:51:24Z

The tests need to pass for all programming languages, but currently this fails the tests for C (try make check), Ada and Rust, and passes for C#, Java, JavaScript, Python and Ruby.

(The Pascal backend currently only supports iso-8859-1 and I wasn't able to test Go as there's something up with my local Go setup.)

The pattern here is that it's failing for languages that use UTF-8 and working for those that use wide characters. I'll comment on a line of code where I think the problem is.

The CI should have shown this, but it hasn't run for this PR. I'm not sure why not as it ran for a push I just made to master, and "Build pushed pull requests" is on in the travis-ci settings. I'll try to get that fixed, but meanwhile please try to run at least the C tests locally (they shouldn't need anything beyond what you must have installed to have built the snowball compiler).

ojwb · 2021-12-17T05:06:24Z

algorithms/polish.sbl

+
+define remove_nouns as (
+			($(len > 7)
+			test ($pos = (len - 5)


The 5 here assumes that each character counts as 1, which isn't true when we're working in UTF-8. It's assumptions such as this which are causing the tests to fail when we're working in UTF-8.

I think this is going to need significant restructuring to fix satisfactorily.
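To make the mismatch concrete, here is a small Python illustration (not Snowball, and not code from this PR) of how the same offset means different things depending on whether the string is held as characters or as UTF-8 bytes. The example word is an assumption chosen only because every letter in it takes two bytes in UTF-8.

```python
# Illustration only: character count vs. UTF-8 byte count for a Polish word.
# The word is a hypothetical example, not taken from the PR's test data.
word = "żółć"

char_count = len(word)                  # number of code points
byte_count = len(word.encode("utf-8"))  # number of bytes once encoded as UTF-8

# The same numeric offset selects different text in the two representations.
suffix_chars = word[-2:]                                   # last two characters
suffix_bytes = word.encode("utf-8")[-2:].decode("utf-8")   # last two bytes

print(char_count, byte_count)
print(suffix_chars, suffix_bytes)
```

Here an offset of 2 covers two letters when counting characters but only one letter when counting bytes, which is the kind of discrepancy that would make offset arithmetic behave differently between the wide-character and UTF-8 backends.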

ojwb · 2021-12-17T05:08:58Z

algorithms/polish.sbl

+		($pos = (len -2)
+		hop pos
+		([tolimit] delete))
+		)


This would benefit from using snowball's among feature. This can be used to remove suffixes from the string without having to do all this tedious calculating of offsets, and that would fix a lot of the current problems when working in UTF-8.
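As a rough analogue of the idea (in Python rather than Snowball, so only a sketch of the approach), suffix removal can be written as "match the longest listed ending and drop it", with no offsets computed by hand. The suffix list and the function name below are hypothetical and are not the rules from this PR.

```python
# Hypothetical suffix list for illustration; not the PR's actual rules.
SUFFIXES = ("ami", "ach", "om")


def strip_suffix(word, suffixes=SUFFIXES):
    """Remove the longest listed suffix that the word ends with, if any."""
    # Try longer suffixes first so the longest match wins, which is also
    # how among chooses between overlapping alternatives.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word  # no listed suffix matched; leave the word unchanged


print(strip_suffix("domami"))
print(strip_suffix("kot"))
```

Because the match is expressed in terms of the suffix strings themselves, the same logic works regardless of how many bytes each character occupies.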

Add the initial version of the Polish stemmer

d55d5db

ojwb reviewed Dec 17, 2021

asdfMaciej mentioned this pull request Sep 3, 2024

Add support for Polish language stemming quickwit-oss/tantivy#2484

Open