How to document script used for the data in treebank? #1032

Open
Abhishek-P opened this issue May 21, 2024 · 7 comments

Comments

@Abhishek-P
Contributor

Abhishek-P commented May 21, 2024

This is a case I came across when using the UD Sanskrit (https://universaldependencies.org/sa/index.html) treebanks.

The two treebanks use two different scripts: UFAL uses Devanagari, while Vedic uses Latin.

I suspect this may be true for some other languages as well (I have yet to do an audit). I am also assuming that script mixing within a treebank is not a case we have to worry about here.

Currently, the treebank page does not provide any explicit information about the script, although it can be inferred from the examples in the morphology overview section.

I think it would be nice to have that information surfaced more clearly on the treebank page, since it can be an important treebank characteristic to keep in mind for certain uses.

My preliminary proposal would be to add it as part of the description, similar to Genre, with a hyperlink to a scholarly source on scripts (scriptsource?):

License: CC BY-SA 4.0

Genre: fiction
Script: devanagari

@Abhishek-P Abhishek-P changed the title from "How to document script for the data in treebank" to "How to document script used for the data in treebank?" May 21, 2024
@dan-zeman dan-zeman added this to the v2.15 milestone May 21, 2024
@dan-zeman
Member

dan-zeman commented May 21, 2024

ISO 15924 provides codes suitable for such a metadata item. There are probably finer distinctions that could be made about the spelling rules in the treebank, but those would be difficult to capture systematically, and ISO codes of scripts would be an improvement over no info (current status).

@Abhishek-P
Contributor Author

I can make a change in the sa treebank pages using ISO 15924 codes, for a start.
Let me know if and where this needs to be discussed and documented (templatized) for future work.

@Abhishek-P
Contributor Author

Abhishek-P commented Jun 1, 2024

I did a quick check of the treebank comparison pages (those linked from the home page); sa is the only case I caught of treebanks having different scripts.

@dan-zeman
Member

> I did a quick check of the treebank comparison pages (those linked from the home page); sa is the only case I caught of treebanks having different scripts.

This is true at the moment as far as I know, but there are other languages that could use multiple writing systems, so it is definitely a property of the treebank rather than the language.

@dan-zeman
Member

> I can make a change in the sa treebank pages using ISO 15924 codes, for a start. Let me know if and where this needs to be discussed and documented (templatized) for future work.

I will raise this at a future meeting of the core group. Assuming there won't be objections, these are the next steps:

  • Document it next to other metadata in the Release checklist.
  • Add it to validation infrastructure (the line will be required and must contain valid values, e.g. Script: Latn for Latin-based alphabets).
  • Announce it to the mailing list and make sure that all treebanks have the new line in their READMEs. Also add it to the template used when creating new repositories.
  • Make sure that the script that generates treebank hub pages (at release time) copies this information to the pages.
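
The validation step in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `check_script_line`, the tiny code subset, the error messages, and the idea that several codes may be separated by spaces are all hypothetical and not taken from the actual UD validation infrastructure.

```python
import re

# Small illustrative subset of ISO 15924 four-letter codes (assumption:
# a real check would load the complete registry instead).
KNOWN_SCRIPTS = {"Latn", "Deva", "Cyrl", "Arab", "Grek", "Hebr", "Hani"}

# Matches a README metadata line such as "Script: Latn".
SCRIPT_LINE = re.compile(r"^Script:[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def check_script_line(readme_text):
    """Return a list of error messages (empty if the README passes)."""
    values = SCRIPT_LINE.findall(readme_text)
    if not values:
        return ["missing 'Script:' line"]
    if len(values) > 1:
        return ["more than one 'Script:' line"]
    codes = values[0].split()
    if not codes:
        return ["empty 'Script:' value"]
    # Assumption: multiple codes, if allowed at all, are space-separated.
    return [f"unknown script code: {code!r}"
            for code in codes if code not in KNOWN_SCRIPTS]
```

Under these assumptions, a README containing `Script: Deva` passes, while one containing `Script: devanagari` is rejected because the value is a script name rather than an ISO 15924 code.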

@dan-zeman dan-zeman self-assigned this Jun 1, 2024
@amir-zeldes
Contributor

> This is true at the moment as far as I know, but there are other languages that could use multiple writing systems, so it is definitely a property of the treebank rather than the language.

Another candidate is UD_Egyptian, which uses Schenkel transcription rather than hieroglyphs or Gardiner codes, either of which would be conceivable for Egyptian.

@robvanderg
Contributor

If this should be automated, it can be tricky to find code for it (the word "script" is ambiguous when searching for code). Here are some existing solutions (the last one is by me, optimized for speed rather than RAM):

https://github.com/cisnlp/GlotScript
https://robvanderg.github.io/scripts/scripts/
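
For a quick sanity check without either of those tools, a rough heuristic is possible with the Python standard library alone. The sketch below is not how GlotScript or the second tool works; it simply assumes that the first word of a character's Unicode name identifies its script, which holds for many scripts but not all. The mapping `NAME_PREFIX_TO_ISO`, the function name `guess_scripts`, and the use of `Zzzz` as a fallback are illustrative choices.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to an ISO 15924 code; only a handful of scripts are covered here.
NAME_PREFIX_TO_ISO = {
    "LATIN": "Latn",
    "DEVANAGARI": "Deva",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "ARABIC": "Arab",
    "HEBREW": "Hebr",
}


def guess_scripts(text):
    """Count ISO 15924 codes over the letters found in text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        prefix = name.split(" ", 1)[0] if name else ""
        # Letters outside the small mapping are counted under "Zzzz".
        counts[NAME_PREFIX_TO_ISO.get(prefix, "Zzzz")] += 1
    return counts
```

Calling `guess_scripts(...).most_common(1)` on the concatenated FORM column of a treebank would give a candidate value for the Script line, which a maintainer should still confirm by hand.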
