hipster-philology/greek-lemmatization-data

Ancient Greek Lemmatization and Morpho-Syntactic Data

Referentials

Lemmas are taken from Henry George Liddell and Robert Scott, A Greek-English Lexicon.

Scores

Models are available in releases.

Summary

task     Accuracy   Accuracy (ambiguous)
case     0.9612     0.8854
degree   0.9926     0.9596
gender   0.9436     0.8296
lemma    0.954      0.9097
mood     0.9913     0.957
num      0.9841     0.9589
pers     0.9864     0.9219
pos      0.9287     0.8805
tense    0.9917     0.9588
voice    0.9915     0.9606
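
The README does not define the two columns. The sketch below shows one plausible reading, under the assumption that an "ambiguous" token is a form seen with more than one gold label in the training data, and that the second column is accuracy restricted to those tokens. The function name `accuracies` and the toy data are hypothetical and are not taken from the repository or from the tool that produced the scores.

```python
from collections import defaultdict


def accuracies(train_pairs, test_triples):
    """Return (overall accuracy, accuracy on ambiguous forms).

    train_pairs:  iterable of (form, gold_label)
    test_triples: iterable of (form, gold_label, predicted_label)
    ASSUMPTION: a form is "ambiguous" if it carries more than one
    distinct gold label in the training data.
    """
    labels_seen = defaultdict(set)
    for form, label in train_pairs:
        labels_seen[form].add(label)
    ambiguous = {form for form, labels in labels_seen.items() if len(labels) > 1}

    total = correct = amb_total = amb_correct = 0
    for form, gold, predicted in test_triples:
        hit = int(gold == predicted)
        total += 1
        correct += hit
        if form in ambiguous:
            amb_total += 1
            amb_correct += hit

    overall = correct / total if total else 0.0
    on_ambiguous = amb_correct / amb_total if amb_total else 0.0
    return overall, on_ambiguous


# Toy data (hypothetical): one form has two possible case labels.
train = [("το", "nom"), ("το", "acc"), ("λογος", "nom")]
test = [
    ("το", "acc", "acc"),
    ("το", "nom", "acc"),
    ("λογος", "nom", "nom"),
    ("λογος", "nom", "nom"),
]
overall, on_ambiguous = accuracies(train, test)
print(f"accuracy={overall:.4f} ambiguous={on_ambiguous:.4f}")
```

Under this reading the second column is expected to be lower than the first, since it leaves out the forms that only ever take one label.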

Script

  1. Run build.py to get the "simple" training data
    • Warning: default output is NFKD
  2. Run build-normalized.py to get NFD and NFC data
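
A minimal sketch of how the three normalization forms differ for a Greek token, using only Python's standard `unicodedata` module. It does not reproduce build.py or build-normalized.py; the helper name `normalize_forms` and the sample word are assumptions chosen for illustration.

```python
import unicodedata


def normalize_forms(text):
    """Return the NFKD, NFD and NFC variants of the same string."""
    return {form: unicodedata.normalize(form, text) for form in ("NFKD", "NFD", "NFC")}


# "μῆνιν" written with the precomposed eta-with-perispomeni (U+1FC6).
word = "\u03bc\u1fc6\u03bd\u03b9\u03bd"
forms = normalize_forms(word)

for name, value in forms.items():
    code_points = " ".join(f"U+{ord(char):04X}" for char in value)
    print(f"{name}: {len(value)} code points: {code_points}")
```

NFD and NFKD split accented letters into a base letter plus combining marks, so the same word has more code points than in NFC. NFKD also applies compatibility mappings, so it can differ from NFD on some characters. Data in different forms should not be mixed without normalizing first, since visually identical tokens would not compare equal.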

Sources

Sources to check

These are sources whose annotation quality I do not know (gold? silver? bronze? wood?).

Licence

Lemmatization data

The licences are those of the original repositories. Converted data inherits the

Script

Mozilla Public Licence

Statistics

  • 1,068,131 tokens,
    • including 115,412 punctuation signs
  • 56,133 different sentences

91 chars found

Char Count
7743
" 4219
% 4
' 6745
( 704
) 702
, 142218
- 7085
. 66860
0 1
1 5727
2 3197
3 1616
4 2
: 7638
; 7268
< 72
> 74
? 137
[ 577
] 571
j 3
{ 1
~ 38
· 31204
ʽ 17
̀ 230277
́ 1123673
̄ 25
̆ 8
̈ 3682
̓ 584276
̔ 287290
͂ 249187
ͅ 38177
Α 24953
Β 1412
Γ 1957
Δ 4253
Ε 7741
Ζ 2358
Η 2125
Θ 2724
Ι 4642
Κ 9669
Λ 5939
Μ 6123
Ν 1777
Ξ 728
Ο 3754
Π 9063
Ρ 2739
Σ 6155
Τ 5237
Υ 586
Φ 3391
Χ 903
Ψ 34
Ω 346
α 957329
β 53775
γ 152992
δ 248067
ε 880724
ζ 23108
η 294280
θ 112297
ι 845411
κ 294851
λ 281371
μ 315232
ν 617318
ξ 30632
ο 968199
π 330404
ρ 379429
ς 479697
σ 271423
τ 541687
υ 398026
φ 81370
χ 95052
ψ 8992
ω 340318
ϝ 13
388
2
5404
4
74
4
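
The table above lists combining diacritics as separate characters, which is consistent with counting over decomposed text. A count of this kind could be produced roughly as follows. This is a sketch, not the repository's script: the function name `char_counts`, the three sample tokens, and the choice of NFKD (taken from the default output noted in the Script section) are assumptions.

```python
from collections import Counter
import unicodedata


def char_counts(tokens, form="NFKD"):
    """Count code points over all tokens after Unicode normalization."""
    counts = Counter()
    for token in tokens:
        counts.update(unicodedata.normalize(form, token))
    return counts


# Hypothetical sample: "μῆνιν", "ἄειδε" and a comma.
sample = ["\u03bc\u1fc6\u03bd\u03b9\u03bd", "\u1f04\u03b5\u03b9\u03b4\u03b5", ","]
counts = char_counts(sample)

print(f"{len(counts)} chars found")
for char, count in sorted(counts.items()):
    # Print the code point as well, since combining marks are hard to read alone.
    print(f"U+{ord(char):04X} {unicodedata.name(char, 'UNKNOWN')} {count}")
```

Printing the code point and Unicode name next to each count keeps combining marks and other non-printing characters identifiable in the output.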