'\u2028' not recognized in SpacesAfter #103

Open
maxtrem opened this issue Jun 12, 2019 · 3 comments

Comments

@maxtrem

maxtrem commented Jun 12, 2019

We used the tagger and tokenizer of UDPipe. Some of our files contained the character '\u2028' (Unicode LINE SEPARATOR), which was not recognized as a newline. This caused errors in other programs in our pipeline and also tokenization problems in UDPipe itself.
For example:

17 out.
What out.
what PRON WP PronType=Int _ _ _ _

Here '\u2028' directly follows the end of the sentence.

So it would be really cool if you could add this character to the list of newline characters.

@foxik
Member

foxik commented Jun 12, 2019

Good catch, the tokenizer does not consider '\u2028' to be a newline character. Furthermore, we do not recognize '\u2029' either -- we should fix both.

We might even consider adding new escape sequences to SpacesAfter: even though the CoNLL-U documentation states that only LF is used as the line separator, some tools might split on \u202[89] regardless. But maybe not... I will think about it.
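
To illustrate why a tool might split on these characters even when the format only specifies LF: Python's `str.splitlines()` treats U+2028 and U+2029 as line boundaries, whereas `str.split("\n")` does not. The following is a minimal sketch; the sample line is made up for illustration and is not taken from the files in this report.

```python
# Made-up CoNLL-U-like token line (illustrative only) whose FORM field
# ends with U+2028 (LINE SEPARATOR).
line = "1\tout.\u2028\tout\tADV\n"

# Splitting on LF only keeps the token line intact.
lf_only = line.rstrip("\n").split("\n")

# str.splitlines() also breaks on U+2028 and U+2029, so the same token
# line is cut into two pieces.
unicode_aware = line.splitlines()

print(len(lf_only), len(unicode_aware))
```

A reader that iterates over `splitlines()` would therefore see a truncated token line followed by a fragment that starts with a tab, which is one way a parser could end up failing on such input.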

@maxtrem
Author

maxtrem commented Jun 13, 2019

Thank you for your reply!
Yes, we actually used UUParser, which does split on '\u2028' and then crashes. So escaping would definitely help in that regard.

@foxik
Member

foxik commented Jun 13, 2019

Thanks for the feedback, escaping it is then :-)
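
A rough sketch of what such escaping could look like, in Python rather than UDPipe's own code. The first table follows the escapes listed for SpacesAfter in the UDPipe manual as far as I recall them (`\s`, `\t`, `\r`, `\n`, `\p`, `\\`). The second table is purely hypothetical: the thread does not say which escape sequences would be chosen for U+2028 and U+2029, so `\u2028` and `\u2029` are placeholders, and `escape_spaces` is an illustrative helper name.

```python
# Escapes recalled from the UDPipe manual for SpacesAfter (assumption:
# check the manual before relying on this table).
DOCUMENTED = {
    " ": "\\s",
    "\t": "\\t",
    "\r": "\\r",
    "\n": "\\n",
    "|": "\\p",
    "\\": "\\\\",
}

# HYPOTHETICAL escapes for the two Unicode separators discussed here;
# the actual sequences UDPipe would use are not stated in this thread.
HYPOTHETICAL = {
    "\u2028": "\\u2028",
    "\u2029": "\\u2029",
}

ESCAPES = {**DOCUMENTED, **HYPOTHETICAL}


def escape_spaces(text: str) -> str:
    """Escape whitespace-like characters so the value contains no raw separators."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


escaped = escape_spaces(" \u2028\n")
print(escaped)
```

With the separators escaped, the SpacesAfter value no longer contains a raw U+2028, so a tool that splits on Unicode line boundaries would keep the token line in one piece.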
