SpacesAfter= for unbreakable spaces etc. #917

Closed
jheinecke opened this issue Jan 19, 2023 · 5 comments

Comments

@jheinecke
Contributor

UD-based parsers may encounter unbreakable spaces (U+00A0) in texts. While they can tokenize this character correctly, what is, in your opinion, the proper information to include in the SpacesAfter= attribute in the MISC column? Currently, the options are \s, \n, \r, and \t for standard space, line feed, carriage return, and tab. Some parsers use SpacesAfter=X (with X being the unbreakable space Unicode code point). I am wondering whether a different encoding should be used for spaces other than the standard space (U+0020), so that the original text can be accurately reproduced from CoNLL-U data.
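To make the round-trip concern concrete, here is a minimal Python sketch of encoding and decoding a SpacesAfter value with only the four escapes named above. The function names are hypothetical and this is not UDPipe's code; any other character, including U+00A0, is passed through verbatim, and a token without a relevant MISC attribute is assumed to be followed by one plain space.

```python
# Escapes named in the thread; every other character is passed through unchanged.
ESCAPES = {" ": "\\s", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
UNESCAPES = {escape[1]: char for char, escape in ESCAPES.items()}


def encode_spaces(raw: str) -> str:
    """Turn the raw whitespace following a token into a SpacesAfter value."""
    return "".join(ESCAPES.get(ch, ch) for ch in raw)


def decode_spaces(value: str) -> str:
    """Invert encode_spaces; unknown backslash sequences are kept as written."""
    out = []
    i = 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value) and value[i + 1] in UNESCAPES:
            out.append(UNESCAPES[value[i + 1]])
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


def spaces_after(misc: str) -> str:
    """Return the whitespace that followed a token, given its MISC column."""
    if misc == "_":
        return " "  # assumption: no annotation means a single plain space
    attrs = dict(item.split("=", 1) for item in misc.split("|") if "=" in item)
    if "SpacesAfter" in attrs:
        return decode_spaces(attrs["SpacesAfter"])
    return "" if attrs.get("SpaceAfter") == "No" else " "
```

With this scheme, a raw U+00A0 survives the round trip only if every tool that touches the file leaves it alone, which is the uncertainty the question is about.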

@foxik
Member

foxik commented Jan 19, 2023

Hi! First, SpacesAfter is not really a UD thing but a UDPipe thing (i.e., an additional field in MISC capable of storing the non-token characters); it is described at https://ufal.mff.cuni.cz/udpipe/1/users-manual#run_udpipe_tokenizer_spaces

UDPipe currently includes it verbatim in SpacesAfter -- I think this does not violate CoNLL-U rules, which do not disallow U+00A0 in fields (unlike spaces, tabs, and newlines). However, some programs might consider it to be a space, so I understand we could have a special escape sequence for it.

A similar question was raised about U+2028 (a Unicode line break) in ufal/udpipe#103 -- some programs might consider a raw U+2028 a line break, causing problems during load. A possible approach is to escape all Zl (https://www.fileformat.info/info/unicode/category/Zl/list.htm), Zp (https://www.fileformat.info/info/unicode/category/Zp/list.htm), and Zs (https://www.fileformat.info/info/unicode/category/Zs/list.htm) characters in SpacesAfter; we could probably also include all control characters (ASCII < 32). That is the approach I plan to take in the next major version of UDPipe (but I have not yet decided on the exact encoding format; maybe a combination of \xXX and \uXXXX).
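A rough Python sketch of that approach, using the standard `unicodedata` module to pick out the Zs, Zl, and Zp categories plus ASCII control characters. The output format is an assumption, since the encoding has not been decided: this version keeps the four existing short escapes and writes everything else as \uXXXX. The function names are hypothetical.

```python
import unicodedata

# Existing short escapes take priority over the generic \uXXXX form.
SHORT_ESCAPES = {" ": "\\s", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
SEPARATOR_CATEGORIES = {"Zs", "Zl", "Zp"}


def escape_char(ch: str) -> str:
    """Escape one character if it is a separator or an ASCII control character."""
    if ch in SHORT_ESCAPES:
        return SHORT_ESCAPES[ch]
    if unicodedata.category(ch) in SEPARATOR_CATEGORIES or ord(ch) < 32:
        # Assumed format: four uppercase hex digits, enough for these characters.
        return f"\\u{ord(ch):04X}"
    return ch


def escape_separators(raw: str) -> str:
    """Escape every separator and control character in a SpacesAfter value."""
    return "".join(escape_char(ch) for ch in raw)
```

Four hex digits are sufficient here because all characters in the Zs, Zl, and Zp categories lie in the Basic Multilingual Plane.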

@jheinecke
Contributor Author

I thought this was less an annotation problem than a "noisy input text" problem. But since SpacesAfter is part of the UD encoding scheme, it would be nice to have a standard.
If you plan \xXX or \uXXXX for UDPipe, I'd vote for \uXXXX.

@foxik
Member

foxik commented Jan 20, 2023

I think SpacesAfter and SpacesBefore are not (yet) an official part of UD encoding, only SpaceAfter=No is -- see https://universaldependencies.org/format.html which contains only the latter, not the former.

I found https://universaldependencies.org/v2/conll-u.html mentioning that SpacesBefore and SpacesAfter will likely be standardized, but as far as I know, it had not yet happened -- please correct me if I am wrong.

But if we are standardizing it, we definitely need to properly escape the required characters...

@dan-zeman
Member

standardized, but as far as I know, it had not yet happened

It is not part of the UD standard. However, there is a page that tries to document MISC attributes that have been used in one or more corpora. It is recommended that if people want to annotate the same thing in a new corpus, they use the same encoding.

@foxik
Member

foxik commented Jan 20, 2023

Thanks, @dan-zeman. When I update how UDPipe does this, I will also create a pull request to update the mentioned page.

Regarding the original question, after thinking about it, I believe U+00A0 can easily be represented as the original character (I do not see any harm in doing so); the only possible harm I see is in the Unicode line separator (U+2028) and paragraph separator (U+2029), which I plan to escape. Using \u2028 and \u2029 seems like the most sensible approach, which means that general \uXXXX should be supported for decoding.
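As an illustration only (not the actual UDPipe implementation), the asymmetry described here could look like this in Python: the writer escapes just U+2028 and U+2029 and leaves U+00A0 verbatim, while the reader accepts any \uXXXX sequence. Both function names are hypothetical.

```python
import re

# Matches the six-character text "\uXXXX" with exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def escape_line_breaks(value: str) -> str:
    """Writer side: escape only the Unicode line and paragraph separators."""
    return value.replace("\u2028", "\\u2028").replace("\u2029", "\\u2029")


def decode_unicode_escapes(value: str) -> str:
    """Reader side: replace any \\uXXXX sequence with the character it names."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)
```

This sketch does not handle a literal backslash that happens to be followed by `u` and four hex digits; a complete format would also need a rule for escaping the backslash itself.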

3 participants