Turkish Stemmer has problems #176

Open
ekinimo opened this issue Mar 1, 2023 · 2 comments

ekinimo commented Mar 1, 2023

odun -> odu (meaningless)
oda -> o (oda means room or you too, stemmer chooses you)
adam -> ada (adam means man or my island, stemmer chooses my island)
adamlar -> adam
odam -> oda

The stemmer should perhaps distinguish these cases somehow.

ojwb (Member) commented May 12, 2023

Note that while the stem is often a word in its own right, this is not always the case: it is not a requirement for text search systems, which are the intended field of use of Snowball.

So "odu" being meaningless is not a problem in itself. If other forms of the word "odun" don't stem to "odu" as well, that's a problem. If unrelated words also stem to "odu" that's a (probably worse) problem.
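Both checks can be sketched in a few lines of Python. This is only an illustration: the lookup table below holds the word/stem pairs reported at the top of this issue and stands in for the real stemmer, and the helper names `group_by_stem` and `stems_for_forms` are made up for this example.

```python
from collections import defaultdict

# Word -> stem pairs exactly as reported in this issue.
# Assumption: this table is a stand-in for calling the actual Turkish stemmer.
REPORTED_STEMS = {
    "odun": "odu",
    "oda": "o",
    "adam": "ada",
    "adamlar": "adam",
    "odam": "oda",
}


def group_by_stem(words, stem):
    """Map each stem to the sorted list of input words that produce it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return {s: sorted(ws) for s, ws in groups.items()}


def stems_for_forms(forms, stem):
    """Return the distinct stems produced for forms of a single word."""
    return {stem(form) for form in forms}


# Unrelated words sharing a stem would show up as a group with several members.
groups = group_by_stem(REPORTED_STEMS, REPORTED_STEMS.get)

# Forms of one word that do not conflate show up as more than one stem.
adam_stems = stems_for_forms(["adam", "adamlar"], REPORTED_STEMS.get)
print(groups)
print(adam_stems)
```

With the reported pairs, `adam` and `adamlar` end up under two different stems, which is the first kind of problem. To run the same checks against a real stemmer, any callable that maps a word to its stem can be passed as `stem`, for example `snowballstemmer.stemmer("turkish").stemWord` from the `snowballstemmer` package on PyPI.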

ojwb (Member) commented Aug 8, 2023

> If other forms of the word "odun" don't stem to "odu" as well, that's a problem.

I looked into the odun case some more, and its various forms stem to either odu or odun. Testing some other words suggests this "two stems" issue is more widespread. It's not terrible, as at least the many forms are conflated down to just two, but conflating them to one would clearly be better.

> If unrelated words also stem to "odu" that's a (probably worse) problem.

I didn't see any for this case, but the stemmer currently produces some very short stems (a single character in some cases), which results in conflating unrelated words. This is effectively a form of overstemming, and it is the worse problem because it leads to incorrectly matching irrelevant documents rather than possibly missing some relevant ones.

I've written both these issues up in more detail on the mailing list in the hopes someone with more knowledge of Turkish than me is up to the job of helping sort it out (many more people read the list than are likely to see a discussion here):

https://lists.tartarus.org/pipermail/snowball-discuss/2023-August/001755.html

#171 reported aile stemming to ai, which isn't the linguistic stem but is arguably another case of an overly short stem which e.g. could cause conflation with the initialism AI (Artificial Intelligence or Artificial Insemination).
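One rough way to hunt for overly short stems is to flag every word whose stem falls below a length threshold. The sketch below uses only the word/stem pairs reported in this issue and in #171. The function name `suspiciously_short` and the threshold of 3 characters are arbitrary choices for illustration, not anything defined by Snowball.

```python
# (word, stem) pairs as reported in this issue and in #171.
REPORTED_PAIRS = [
    ("odun", "odu"),
    ("oda", "o"),
    ("adam", "ada"),
    ("adamlar", "adam"),
    ("odam", "oda"),
    ("aile", "ai"),
]


def suspiciously_short(pairs, min_len=3):
    """Return the (word, stem) pairs whose stem has fewer than min_len characters.

    Assumption: min_len=3 is an arbitrary cut-off chosen for this example.
    """
    return [(word, stem) for word, stem in pairs if len(stem) < min_len]


flagged = suspiciously_short(REPORTED_PAIRS)
print(flagged)
```

A length check like this only surfaces candidates. Whether a short stem is actually wrong still needs someone who knows the language to judge.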
