Is it normal that comparatives and superlatives are not stemmed? #172

Open
raffaem opened this issue Jun 20, 2022 · 3 comments


raffaem commented Jun 20, 2022

>>> import Stemmer
>>> stemmer = Stemmer.Stemmer('english')
>>> print(stemmer.stemWord('poorer'))
poorer
>>> print(stemmer.stemWord('cleaner'))
cleaner
>>> print(stemmer.stemWord('cleanest'))
cleanest

ojwb transferred this issue from snowballstem/pystemmer Nov 15, 2022
Member

ojwb commented Nov 16, 2022

I've moved this ticket because the question here is really about the code in the snowball repo (pystemmer is just a thin wrapper layer on top of this).

This point doesn't seem to be explicitly covered in the algorithm documentation on the website, but I think these aren't done because the obvious rules for them would also trigger in cases where they'd be harmful. In the intended domain of use (generating index terms for information retrieval) overstemming (at least when it causes collisions between unrelated words) is much more problematic than understemming, so we tend to err on the side of understemming in such cases.

For example, tempest and temper would both be reduced to temp (and would also collide with temp meaning a temporary employee), wither would collide with with, etc.
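The collisions above can be reproduced with a minimal sketch of such an "obvious" rule. The `naive_strip` function is hypothetical and is not part of Snowball or PyStemmer; it only drops a trailing -est or -er.

```python
def naive_strip(word):
    # Hypothetical "obvious" rule: drop a trailing -est or -er,
    # as long as at least two letters remain.
    for suffix in ("est", "er"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

words = ["tempest", "temper", "temp", "wither", "with",
         "cleaner", "cleanest", "clean"]

# Group the words by the stem the naive rule produces.
groups = {}
for w in words:
    groups.setdefault(naive_strip(w), []).append(w)

for stem, members in groups.items():
    print(stem, members)
# temp ['tempest', 'temper', 'temp']
# with ['wither', 'with']
# clean ['cleaner', 'cleanest', 'clean']
```

The rule does conflate cleaner and cleanest with clean, but it also merges the unrelated words tempest, temper and temp, and wither with with.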

There is actually a rule to remove an -er suffix, but only in R2 (https://snowballstem.org/texts/r1r2.html). This means some longer comparatives are actually handled (e.g. yellower), and observer is conflated with observe, observed, observes, observing, etc.
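To illustrate why yellower is handled but poorer and cleaner are not, here is a rough sketch of how R1 and R2 are located, following the definition on the linked page. The function names are my own, and the sketch is simplified: it treats an initial y or a y after a vowel as a consonant, but leaves out any other special cases the real English stemmer applies when locating R1.

```python
VOWELS = set("aeiouy")

def mark_consonant_y(word):
    # Treat an initial y, or a y after a vowel, as a consonant ("Y").
    chars = list(word)
    for i, ch in enumerate(chars):
        if ch == "y" and (i == 0 or chars[i - 1] in VOWELS):
            chars[i] = "Y"
    return "".join(chars)

def region_start(word, start):
    # Index just past the first non-vowel that follows a vowel,
    # looking only at letters from position `start` onwards.
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def r1_r2(word):
    marked = mark_consonant_y(word)
    r1 = region_start(marked, 0)   # R1 starts here
    r2 = region_start(marked, r1)  # R2 is "R1 of R1"
    return word[r1:], word[r2:]

for w in ("yellower", "observer", "poorer", "cleaner"):
    print(w, r1_r2(w))
# yellower ('lower', 'er')
# observer ('server', 'ver')
# poorer ('er', '')
# cleaner ('er', '')
```

In yellower and observer the -er ending lies inside R2, so the rule can remove it. In poorer and cleaner R2 is empty, so the ending is left alone.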

If we were to add est where er is handled, there is the odd problematic case - e.g. interest -> inter (colliding with inter meaning to bury) and similarly disinterest -> disinter, but it mostly seems helpful so maybe that's worth considering.


ojwb commented Nov 7, 2023

I had a look for past discussion of this and found Martin posted about -est in https://lists.tartarus.org/pipermail/snowball-discuss/2003-December/000548.html (just under 20 years ago!):

Yes -est is not removed, there being too many words from which its removal
would be incorrect - behest, attest, request and so on. Removing -est from
longer words only is not really satisfactory, since the English comparitive
and superlative endings are only added to short adjectives anyway: "curioser
and curioser" is not, of course, correct English.

Similar comments also from Martin slightly more recently in https://lists.tartarus.org/pipermail/snowball-discuss/2009-November/001137.html :

Comparatives and superlatives in English have too many exceptions for them
to be usefully put into a general rule for suffix removal. Think of,

winter center after aether elder ...

divest detest digest attest ...

The usual way to to 'soften' a rule like this in the Porter stemmer is to
make it applicable to longer words only -- typically, those that have at
least a two syllable stem. But the problem there is that in English the
comparative and superlative endings are only added to short adjectives
anyway. So we have bigger, larger, fatter, but not giganticer, immenser,
enormouser.

The problem can only be solved by building up special word lists of
adjectives that can take these endings.

It's true that these comparative and superlative endings are only used on shorter adjectives, and that a rule for those seems impossible to implement except as a huge list of such adjectives. However, in a number of cases this "short" overlaps with Snowball's R2, so handling -est there still seems worth considering, even though it only addresses a minority of cases.

Here's an analysis of the changes for the sample vocabulary for removing -est in R2 (adding 'est' where 'er' is handled):

  • A total of 118 words changed stem
  • 21 words changed stem but aren't interesting
  • 87 merges of groups of stems:
    { ador adorable adoration adorations adore adored adorer adores adoring } + { adorest }
    { answer answerable answered answering answers } + { answerest }
    { behold beholder beholders beholding beholds } + { beholdest }
    { believ believe believed believer believers believes believing } + { believest }
    { bitter bitterer bitterly bitterness } + { bitterest }
    { clever cleverer cleverly cleverness } + { cleverest }
    { comfortabler } + { comfortablest }
    { common commoner commoners commonly commons } + { commonest }
    { complain complained complainer complaining complainings complains } + { complainest }
    { complete completed completely completeness completes completing completion } + { completest }
    { consort consorted consorting } + { consortest }
    { convert converted convertible converting convertion converts } + { convertest }
    { deceiv deceivable deceive deceived deceiver deceivers deceives deceiving } + { deceivest }
    { depart departed departing department departs } + { departest }
    { deserv deserve deserved deservedly deserver deservers deserves deserving deservings } + { deservest }
    { desir desirable desire desired desirers desires desiring desirous } + { desirest }
    { diffus diffused diffusing diffusion } + { diffusest }
    { discreet discreetly } + { discreetest }
    { dismal dismally } + { dismallest }
    { divin divination divine divined divinely divineness diviner divines divining divinities divinity } + { divinest }
    { easily } + { easiliest }
    { enforc enforce enforced enforcedly enforcement enforces enforcing } + { enforcest }
    { engross engrossed engrosser engrossing engrossments } + { engrossest }
    { exact exacted exacting exaction exactions exactly exactness exacts } + { exactest }
    { extreme extremely extremes extremities extremity } + { extremest }
    { flatter flattered flatterer flatterers flattering flatters } + { flatterest }
    { follow followed follower followers following follows } + { followest }
    { forlorn forlornly } + { forlornest }
    { generous generously } + { generousest }
    { genteel genteeler } + { genteelest }
    { handsome handsomely handsomeness handsomer } + { handsomest }
    { heartedness } + { heartedest }
    { honest honester honestly } + { honestest }
    { honour honourable honourables honourably honoured honourible honouring honours } + { honourest }
    { impress impressed impresses impressible impressing impression impressions impressive impressively } + { impressest }
    { intense intensely intensity } + { intensest }
    { inter interred } + { interest interested interesting interests }
    { interrupt interrupted interrupter interrupting interruption interruptions interrupts } + { interruptest }
    { junior juniors } + { juniorest }
    { knowingness } + { knowingest }
    { likelier } + { likeliest }
    { livelier liveliness } + { liveliest }
    { loathsome loathsomeness } + { loathsomest }
    { lovelier loveliness } + { loveliest }
    { master mastered masterful mastering masterly masters } + { masterest }
    { minute minutely minuteness minutes } + { minutest }
    { narrow narrowed narrower narrowing narrowness narrows } + { narrowest }
    { often oftener } + { oftenest }
    { outer } + { outerest }
    { perfect perfected perfecter perfection perfections perfectly perfectness } + { perfectest }
    { perish perishable perished perishing } + { perishest }
    { pleasant pleasanter pleasantly pleasantness } + { pleasantest }
    { polit polite politely politeness politic political politically politicly politics } + { politest }
    { precious preciously } + { preciousest }
    { profound profoundly } + { profoundest }
    { rather } + { ratherest }
    { receiv receive received receiver receives receiving } + { receivest }
    { refus refusal refusant refuse refused refuses refusing } + { refusest }
    { remote remotely remoteness remotion } + { remotest }
    { renew renewable renewal renewals renewed renewing renews } + { renewest }
    { report reported reporter reporters reporting reportingly reports } + { reportest }
    { return returned returning returns } + { returnest }
    { review reviewal reviewed reviewing reviews } + { reviewest }
    { sever several severally severals severance severe severed severely severer severing severities severity severs } + { severest }
    { shallow shallows } + { shallowest }
    { sincere sincerely sincerity } + { sincerest }
    { solemn solemnities solemnity solemnize solemnized solemnly } + { solemnest }
    { sorrow sorrowed sorrowful sorrowfully sorrowing sorrows } + { sorrowest }
    { sovereign sovereignly sovereigns } + { sovereignest }
    { stubborn stubbornly stubbornness } + { stubbornest }
    { stupid stupider stupidity stupidly stupids } + { stupidest }
    { supposal suppose supposed supposes supposing } + { supposest }
    { supreme supremely } + { supremest }
    { survey surveyed surveying surveys } + { surveyest }
    { tender tendered tenderer tendering tenderly tenderness tenders } + { tenderest }
    { tortur torture tortured torturer torturers tortures torturing } + { torturest }
    { travel traveler travelers traveling travell travelled traveller travellers travelling travels } + { travellest }
    { unkind unkindly unkindness } + { unkindest }
    { unworthier unworthiness unworthy } + { unworthiest }
    { vanish vanished vanishes vanishing } + { vanishest }
    { vanquish vanquished vanquisher } + { vanquishest }
    { vulgar vulgarities vulgarity vulgarly vulgars } + { vulgarest }
    { welcom welcome welcomed welcomer welcomes welcoming } + { welcomest }
    { wickeder wickedness } + { wickedest }
    { woeful woefull } + { woefullest }
    { worshipp worshipper worshippers } + { worshippest }
    { wretchedness } + { wretchedest }
  • 1 splits of groups of stems:
    { manifest manifested manifesting manifestly manifests | manifestation manifestations }
  • 3 words moving between stem groups:

    compare

Another thing I notice from this is that we probably don't want to remove -est if we already removed an ending (consider interested, interesting, interests, manifested, manifesting, manifestly, manifests, undigested) so handling it in the same place as -er is probably wrong. This difference is because -er can also occur as a suffix in other situations (e.g. observe/observer/observers).
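For anyone wanting to reproduce this kind of report on their own vocabulary, the following is a hypothetical sketch of how merges of stem groups could be detected. It is not the tool used for the analysis above, and `old_stem` and `new_stem` are toy stand-ins rather than the real stemmer.

```python
from collections import defaultdict

def group_by_stem(words, stem):
    # Map each stem to the set of vocabulary words that produce it.
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w)
    return groups

def find_merges(words, old_stem, new_stem):
    old = group_by_stem(words, old_stem)
    new = group_by_stem(words, new_stem)
    merges = []
    for members in new.values():
        # Old groups that ended up wholly inside this new group.
        parts = sorted(sorted(g) for g in old.values() if g <= members)
        if len(parts) > 1:
            merges.append(parts)
    return sorted(merges)

# Toy stand-ins for "before" and "after" stemmers (assumptions for this example).
def old_stem(w):
    return w[:-2] if w.endswith("ly") else w

def new_stem(w):
    w = old_stem(w)
    return w[:-3] if w.endswith("est") else w

vocab = ["common", "commonly", "commonest", "inter", "interest"]
for parts in find_merges(vocab, old_stem, new_stem):
    print(" + ".join("{ " + " ".join(p) + " }" for p in parts))
# { common commonly } + { commonest }
# { inter } + { interest }
```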


ojwb commented Jul 29, 2024

Another thing I notice from this is that we probably don't want to remove -est if we already removed an ending

Looking at this again, I'm not so convinced that's the right conclusion. We really don't want to remove est from interest (because then it collides with inter), and it's linguistically wrong, though not problematic, to remove it from manifest. If we get that part right, the rest works fine. The same is true for undigest, but that's a very rare word so its handling matters rather less.

I looked at the slightly more restricted change of removing est in step 4 unless preceded by er, f, or g (which empirically seems to exclude the problematic cases):

  • A total of 90 words changed stem
  • 15 words changed stem but aren't interesting
  • 75 merges of groups of stems:
    { ador adorable adoration adorations adore adored adorer adores adoring } + { adorest }
    { behold beholder beholders beholding beholds } + { beholdest }
    { believ believe believed believer believers believes believing } + { believest }
    { comfortabler } + { comfortablest }
    { common commoner commoners commonly commons } + { commonest }
    { complain complained complainer complaining complainings complains } + { complainest }
    { complete completed completely completeness completes completing completion } + { completest }
    { consort consorted consorting } + { consortest }
    { convert converted convertible converting convertion converts } + { convertest }
    { deceiv deceivable deceive deceived deceiver deceivers deceives deceiving } + { deceivest }
    { depart departed departing department departs } + { departest }
    { deserv deserve deserved deservedly deserver deservers deserves deserving deservings } + { deservest }
    { desir desirable desire desired desirers desires desiring desirous } + { desirest }
    { diffus diffused diffusing diffusion } + { diffusest }
    { discreet discreetly } + { discreetest }
    { dismal dismally } + { dismallest }
    { divin divination divine divined divinely divineness diviner divines divining divinities divinity } + { divinest }
    { easily } + { easiliest }
    { enforc enforce enforced enforcedly enforcement enforces enforcing } + { enforcest }
    { engross engrossed engrosser engrossing engrossments } + { engrossest }
    { exact exacted exacting exaction exactions exactly exactness exacts } + { exactest }
    { extreme extremely extremes extremities extremity } + { extremest }
    { follow followed follower followers following follows } + { followest }
    { forlorn forlornly } + { forlornest }
    { generous generously } + { generousest }
    { genteel genteeler } + { genteelest }
    { handsome handsomely handsomeness handsomer } + { handsomest }
    { heartedness } + { heartedest }
    { honest honester honestly } + { honestest }
    { honour honourable honourables honourably honoured honourible honouring honours } + { honourest }
    { impress impressed impresses impressible impressing impression impressions impressive impressively } + { impressest }
    { intense intensely intensity } + { intensest }
    { interrupt interrupted interrupter interrupting interruption interruptions interrupts } + { interruptest }
    { junior juniors } + { juniorest }
    { likelier } + { likeliest }
    { livelier liveliness } + { liveliest }
    { loathsome loathsomeness } + { loathsomest }
    { lovelier loveliness } + { loveliest }
    { minute minutely minuteness minutes } + { minutest }
    { narrow narrowed narrower narrowing narrowness narrows } + { narrowest }
    { often oftener } + { oftenest }
    { perfect perfected perfecter perfection perfections perfectly perfectness } + { perfectest }
    { perish perishable perished perishing } + { perishest }
    { pleasant pleasanter pleasantly pleasantness } + { pleasantest }
    { polit polite politely politeness politic political politically politicly politics } + { politest }
    { precious preciously } + { preciousest }
    { profound profoundly } + { profoundest }
    { receiv receive received receiver receives receiving } + { receivest }
    { refus refusal refusant refuse refused refuses refusing } + { refusest }
    { remote remotely remoteness remotion } + { remotest }
    { renew renewable renewal renewals renewed renewing renews } + { renewest }
    { report reported reporter reporters reporting reportingly reports } + { reportest }
    { return returned returning returns } + { returnest }
    { review reviewal reviewed reviewing reviews } + { reviewest }
    { shallow shallows } + { shallowest }
    { solemn solemnities solemnity solemnize solemnized solemnly } + { solemnest }
    { sorrow sorrowed sorrowful sorrowfully sorrowing sorrows } + { sorrowest }
    { sovereign sovereignly sovereigns } + { sovereignest }
    { stubborn stubbornly stubbornness } + { stubbornest }
    { stupid stupider stupidity stupidly stupids } + { stupidest }
    { supposal suppose supposed supposes supposing } + { supposest }
    { supreme supremely } + { supremest }
    { survey surveyed surveying surveys } + { surveyest }
    { tortur torture tortured torturer torturers tortures torturing } + { torturest }
    { travel traveler travelers traveling travell travelled traveller travellers travelling travels } + { travellest }
    { unkind unkindly unkindness } + { unkindest }
    { unworthier unworthiness unworthy } + { unworthiest }
    { vanish vanished vanishes vanishing } + { vanishest }
    { vanquish vanquished vanquisher } + { vanquishest }
    { vulgar vulgarities vulgarity vulgarly vulgars } + { vulgarest }
    { welcom welcome welcomed welcomer welcomes welcoming } + { welcomest }
    { wickeder wickedness } + { wickedest }
    { woeful woefull } + { woefullest }
    { worshipp worshipper worshippers } + { worshippest }
    { wretchedness } + { wretchedest }

That's adding 'est' ('er' or 'f' or 'g' or delete) as a new among case.

Many of the words with est suffixes in the sample vocabulary that this affects are archaic though. From a quick scan I see maybe 35 non-archaic cases above, and many of those seem likely to be rare. This seems to be in the area where the extra complexity (and so the extra time to stem a word) is harder to justify.

Also a superlative does affect the meaning more than most suffixes.

So I'm thinking of addressing this by improving the algorithm documentation to include an explicit note about comparatives and superlatives, covering the various points noted above.
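As a rough illustration of the restricted rule discussed in this comment (remove est in R2 unless preceded by er, f, or g), here is a sketch in Python. It applies that one rule in isolation rather than as part of step 4 of the real stemmer, the names `r2_start` and `strip_est` are my own, and the R2 calculation is simplified (no special handling of y or of particular prefixes).

```python
VOWELS = set("aeiouy")

def r2_start(word):
    # Apply "skip past the first non-vowel that follows a vowel" twice:
    # once to find the start of R1, then again within R1 to find R2.
    pos = 0
    for _ in range(2):
        nxt = len(word)
        for i in range(pos + 1, len(word)):
            if word[i] not in VOWELS and word[i - 1] in VOWELS:
                nxt = i + 1
                break
        pos = nxt
    return pos

def strip_est(word, blocked=("er", "f", "g")):
    if not word.endswith("est"):
        return word
    stem = word[:-3]
    if len(stem) < r2_start(word):
        return word  # "est" is not wholly inside R2
    if stem.endswith(blocked):
        return word  # preceded by er, f or g: leave alone
    return stem

for w in ("commonest", "stupidest", "interest", "manifest", "cleanest"):
    print(w, "->", strip_est(w), "| unrestricted:", strip_est(w, blocked=()))
# commonest -> common | unrestricted: common
# stupidest -> stupid | unrestricted: stupid
# interest -> interest | unrestricted: inter
# manifest -> manifest | unrestricted: manif
# cleanest -> cleanest | unrestricted: cleanest
```

The last line shows the limitation noted earlier in the thread: short superlatives such as cleanest fall outside R2, so neither variant touches them.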
