How to document script used for the data in treebank? #1032

Abhishek-P · 2024-05-21T18:25:59Z

I came across this case while using the UD Sanskrit (https://universaldependencies.org/sa/index.html) treebanks.

The two treebanks use different scripts: UFAL uses Devanagari, while Vedic uses Latin.

I suspect this may be true for some other languages (I have yet to do an audit). I am also assuming that script mixing is not a case we have to worry about here.

Currently, the treebank page does not provide any explicit information about the script, although it can be inferred from the examples in the morphology overview section.

I think it would be nice to surface that information more clearly on the treebank page, since it can be an important treebank characteristic to keep in mind for certain uses.

My preliminary proposal is to add it to the description, similar to Genre, with a hyperlink to a scholarly source on scripts (ScriptSource?):

License: CC BY-SA 4.0

Genre: fiction
Script: devanagari

dan-zeman · 2024-05-21T22:25:15Z

ISO 15924 provides codes suitable for such a metadata item. There are probably finer distinctions that could be made about the spelling rules in the treebank, but those would be difficult to capture systematically, and ISO codes of scripts would be an improvement over no info (current status).
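
To illustrate what such a metadata item could look like, here is a minimal sketch that turns the two scripts mentioned so far into their ISO 15924 four-letter codes. The `Script:` key follows the proposal above; the table and the helper name `script_line` are only illustrative and not part of any UD tooling.

```python
# ISO 15924 alphabetic codes for the two scripts used by the Sanskrit treebanks.
ISO_15924 = {
    "devanagari": "Deva",
    "latin": "Latn",
}


def script_line(script_name: str) -> str:
    """Render a README-style metadata line (hypothetical helper)."""
    code = ISO_15924[script_name.strip().lower()]
    return f"Script: {code}"


print(script_line("Devanagari"))  # Script: Deva
print(script_line("latin"))       # Script: Latn
```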

Abhishek-P · 2024-05-27T00:49:00Z

As a start, I can update the sa treebank pages using the ISO 15924 codes.
Let me know if and where this needs to be discussed and documented (templatized) for future work.

Abhishek-P · 2024-06-01T05:14:58Z

I did a quick check of the treebank comparison pages (those linked from the home page); sa is the only case I found of treebanks using different scripts.

dan-zeman · 2024-06-01T21:51:55Z

> I did a quick check of the treebank comparison pages (those linked from the home page); sa is the only case I found of treebanks using different scripts.

This is true at the moment as far as I know, but there are other languages that could use multiple writing systems, so it is definitely a property of the treebank rather than the language.

dan-zeman · 2024-06-01T22:19:36Z

> As a start, I can update the sa treebank pages using the ISO 15924 codes. Let me know if and where this needs to be discussed and documented (templatized) for future work.

I will raise this at a future meeting of the core group. Assuming there won't be objections, these are the next steps:

1. Document it next to other metadata in the Release checklist.
2. Add it to validation infrastructure (the line will be required and must contain valid values, e.g. Script: Latn for Latin-based alphabets).
3. Announce it to the mailing list and make sure that all treebanks have the new line in their READMEs. Also add it to the template used when creating new repositories.
4. Make sure that the script that generates treebank hub pages (at release time) copies this information to the pages.
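
The validation step could look roughly like the sketch below. It is not the actual UD validation code: the function name `check_script_line` is hypothetical, the README is assumed to hold one `Key: value` pair per line, and the set of accepted codes is a small illustrative subset rather than the full ISO 15924 registry.

```python
import re

# Shape of an ISO 15924 alphabetic code: one uppercase letter, three lowercase.
CODE_SHAPE = re.compile(r"[A-Z][a-z]{3}")

# Illustrative subset only; a real check would load the full ISO 15924 registry.
KNOWN_CODES = {"Latn", "Deva", "Cyrl", "Arab", "Grek", "Hebr", "Egyp"}


def check_script_line(readme_text: str, known_codes=KNOWN_CODES) -> list[str]:
    """Return error messages; an empty list means the Script line is valid."""
    values = [
        line.split(":", 1)[1].strip()
        for line in readme_text.splitlines()
        if line.startswith("Script:")
    ]
    if not values:
        return ["missing required 'Script:' line"]
    if len(values) > 1:
        return ["more than one 'Script:' line"]
    value = values[0]
    if not CODE_SHAPE.fullmatch(value):
        return [f"'{value}' does not look like an ISO 15924 code"]
    if value not in known_codes:
        return [f"unknown script code '{value}'"]
    return []


readme = "License: CC BY-SA 4.0\nGenre: fiction\nScript: Latn\n"
print(check_script_line(readme))  # []
```

Returning a list of messages rather than raising keeps the check easy to fold into a report that covers other metadata lines as well.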

amir-zeldes · 2024-06-03T13:58:45Z

> This is true at the moment as far as I know, but there are other languages that could use multiple writing systems, so it is definitely a property of the treebank rather than the language.

Another candidate is UD_Egyptian, which uses Schenkel transcription rather than hieroglyphs or Gardiner codes, either of which would be conceivable for Egyptian.

robvanderg · 2024-06-05T19:00:58Z

If this should be automated, it can be tricky to find code for it (because "script" is ambiguous when searching for code). Here are some existing solutions (the last one is mine, optimized for speed rather than RAM):

https://github.com/cisnlp/GlotScript
https://robvanderg.github.io/scripts/scripts/
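
For a rough idea of what automated detection involves, here is a standard-library-only sketch. It is not how either of the tools above works: Python's `unicodedata` module does not expose the Unicode Script property, so this heuristic uses the first word of each letter's Unicode character name as a stand-in, which is only an approximation. The function name `guess_scripts` is hypothetical.

```python
import unicodedata
from collections import Counter


def guess_scripts(text: str) -> Counter:
    """Count letters by the first word of their Unicode character name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, and combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


# Labels are Unicode name prefixes (e.g. LATIN, DEVANAGARI), not ISO 15924 codes.
print(guess_scripts("agnim").most_common(1))
print(guess_scripts("अग्निम्").most_common(1))
```

Taking the most frequent label over a whole treebank would give a candidate script, which would still need to be mapped to an ISO 15924 code and checked by a person.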

Abhishek-P changed the title ~~How to document script for the data in treebank~~ How to document script used for the data in treebank? May 21, 2024

dan-zeman added enhancement universal website labels May 21, 2024

dan-zeman added this to the v2.15 milestone May 21, 2024

dan-zeman self-assigned this Jun 1, 2024
