Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implosion/to_s problem with Enclitics #68

Open
n8 opened this issue Jan 13, 2014 · 2 comments
Open

Implosion/to_s problem with Enclitics #68

n8 opened this issue Jan 13, 2014 · 2 comments

Comments

@n8
Copy link

n8 commented Jan 13, 2014

    text = "It's about time."
    text = sentence(text).apply(:tokenize, :parse)
    puts text.to_s

Results in:

It 's about time.

Should that to_s without the extra space between It and `s?

@chrisanderton
Copy link

it's is a contraction - for tokenisation contractions are often considered two words (because they are really) - this is the case in Stanford Core - http://stackoverflow.com/questions/14058399/stanford-corenlp-split-words-ignoring-apostrophe

One option, as suggested in the above link, would be to handle imploding enclitics in the implode method - in treat this would be in module Treat::Entities::Entity::Stringable

@chrisanderton
Copy link

so - looks like the issue is with the current implode method on string able - although it attempts to handle enclitics then from what i can see in the current implementation then 'value' would already be blank, so calling strip! would make no difference - when the imploded parts are merged the space is still there (as it is outside the scope of the strip!)

here's a fixed version - modified the recursive call to pass the value string and operations are all performed on the string instead of multiple copies - but a disclaimer is that i only started looking at treat about 3 hours ago!

chrisanderton@d9b912f

for the same code, this now gives:

It's about time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants