Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the initial version of the Polish stemmer #159

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tomek-ai
Copy link

@tomek-ai tomek-ai commented Dec 8, 2021

No description provided.

@@ -28,6 +28,7 @@ italian UTF_8,ISO_8859_1 italian,it,ita
lithuanian UTF_8 lithuanian,lt,lit
nepali UTF_8 nepali,ne,nep
norwegian UTF_8,ISO_8859_1 norwegian,no,nor
polish UTF_8 polish,pl,pol
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be UTF_8,ISO_8859_2 (if we're going to have ISO-8859-2 support in libstemmer, we ought to have it available for languages for which it covers the alphabet).

@ojwb
Copy link
Member

ojwb commented Dec 17, 2021

The tests need to pass for all programming languages, but currently this fails the tests for C (try make check), Ada and Rust and passes for C#, Java, Javascript, Python and Ruby.

(The Pascal backend currently only supports iso-8859-1 and I wasn't able to test Go as there's something up with my local Go setup.)

The pattern here is that it's failing for languages that use UTF-8 and working for those that use wide characters. I'll comment on a line of code where I think the problem is.

The CI should have shown this, but it hasn't run for this PR. I'm not sure why not as it ran for a push I just made to master, and "Build pushed pull requests" is on in the travis-ci settings. I'll try to get that fixed, but meanwhile please try to run at least the C tests locally (they shouldn't need anything beyond what you must have installed to have built the snowball compiler).


define remove_nouns as (
($(len > 7)
test ($pos = (len - 5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 5 here assumes that each character counts as 1 which isn't true when we're working in UTF-8 - it's assumptions such as this which are causing the tests for fail when we're working in UTF-8.

I think this is going to need to significant restructuring to fix satisfactorily.

($pos = (len -2)
hop pos
([tolimit] delete))
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would benefit from using snowball's among feature. This can be used to remove suffixes from the string without having to do all this tedious calculating of offsets, and that would fix a lot of the current problems when working in UTF-8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants