Langid: Train for remaining languages that weren't in opus-100 #213

unhammer · 2023-02-22T09:53:42Z

We have trained model files lid.beta.ftz and lid.release.ftz in the repo for languages that were in the opus-100 corpus. We should get corpora for the languages that weren't there and retrain (preferably in a fairly reproducible way, see scripts in ./ft-train).

Corpus suggestions: #207 (comment)

Missing in release:

Got only 35791 lines for oci oc
Got only 35907 lines for sme se
Got only 67312 lines for bel be
Got only 6961 lines for arg an
Got only 79927 lines for kaz kk
No corpus found for crh
No corpus found for frp
No corpus found for szl
No corpus found for zlm

Full missing-list for beta and release: https://github.com/apertium/apertium-apy/blob/master/ft-train/download-extract-corpus#L56
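
To keep track of which languages still need data, the log lines above can be tallied mechanically. The following is only a rough sketch: the two line formats are copied from the log quoted in this comment, while the function name `summarize` and the default target of 100,000 lines are assumptions made for illustration, not part of the scripts in ./ft-train.

```python
import re

# Line formats copied from the log quoted above.
GOT_ONLY = re.compile(r"^Got only (\d+) lines for (\S+) (\S+)$")
NO_CORPUS = re.compile(r"^No corpus found for (\S+)$")


def summarize(log_lines, target=100_000):
    """Return (shortfall per language, languages with no corpus).

    `target` is an assumed per-language line count, used here only
    to show how far each short corpus is from a chosen goal.
    """
    shortfall = {}
    missing = []
    for line in log_lines:
        line = line.strip()
        m = GOT_ONLY.match(line)
        if m:
            count, lang3 = int(m.group(1)), m.group(2)
            shortfall[lang3] = target - count
            continue
        m = NO_CORPUS.match(line)
        if m:
            missing.append(m.group(1))
    return shortfall, missing


if __name__ == "__main__":
    log = [
        "Got only 35791 lines for oci oc",
        "Got only 6961 lines for arg an",
        "No corpus found for crh",
    ]
    print(summarize(log))
```

Languages with no corpus at all are reported separately from those that merely fall short, since the former need a new source while the latter may already be usable.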

jonorthwash · 2023-11-01T23:45:03Z

How many lines of text do we need per language?

unhammer · 2023-11-02T08:47:57Z

The current script uses only the first 100,000 lines of text for each corpus. That number came from experiments with Scandinavian languages, which can have very similar spelling (and was then increased a bit). If your language is quite different from the rest of the set, I think you can get away with quite a bit less. As the above comment shows, we "just" have 35k lines for sme (whereas e.g. deu has 100k), but https://beta.apertium.org/apy/identifyLang?q=ja+leat still gets it right.
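
For anyone preparing a new corpus, the "first N lines per corpus" idea can be sketched roughly as below. This is not the actual code in ./ft-train: the helper name `labelled_head` is hypothetical, and the use of three-letter codes as labels is an assumption. The `__label__` prefix is fastText's default label marker for supervised training data.

```python
from itertools import islice


def labelled_head(lines, lang, max_lines=100_000):
    """Yield at most `max_lines` non-empty lines, each prefixed with
    a fastText-style label for `lang`.

    Hypothetical helper for illustration; the real training scripts
    may filter or label differently.
    """
    stripped = (ln.strip() for ln in lines)
    nonempty = (ln for ln in stripped if ln)
    for ln in islice(nonempty, max_lines):
        yield f"__label__{lang} {ln}"


if __name__ == "__main__":
    sample = ["ja leat\n", "\n", "mun lean\n", "extra line\n"]
    for row in labelled_head(sample, "sme", max_lines=2):
        print(row)
```

Because the cap is applied lazily with `islice`, a corpus shorter than the cap (such as the 35k-line sme one) simply contributes everything it has.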
