
frak models in ocrd resmgr #404

Open
jbarth-ubhd opened this issue Dec 21, 2023 · 27 comments
Labels
question Further information is requested

Comments

@jbarth-ubhd

I've compared these frak models:

ocrd: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata from ocrd resmgr

ubma: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069.traineddata from https://ocr-bw.bib.uni-mannheim.de/faq/

size & md5sum:

-rw-rw-r-- 1 jb jb 3421140 Mär 27  2021 ocrd--frak2021-0.905.traineddata
234e8bb819042f615576bd01aa2419fd  ocrd--frak2021-0.905.traineddata
-rw-rw-r-- 1 jb jb 5060763 Dez  9  2021 ubma--frak2021_1.069.traineddata
9405b1603db21cb066e4e7614a405dd4  ubma--frak2021_1.069.traineddata

Content after combine_tessdata -u x.traineddata aa:

jb@nuc:~/models$ LC_ALL=C ls -lh ocrd ubma
ocrd:
total 3.3M
-rw-rw-r-- 1 jb jb 3.3M Dec 21 12:18 aa.lstm
-rw-rw-r-- 1 jb jb 2.8K Dec 21 12:18 aa.lstm-recoder
-rw-rw-r-- 1 jb jb  22K Dec 21 12:18 aa.lstm-unicharset
-rw-rw-r-- 1 jb jb   30 Dec 21 12:18 aa.version
-rw-rw-r-- 1 jb jb  345 Dec 21 12:18 extr.log

ubma:
total 4.9M
-rw-rw-r-- 1 jb jb 432K Dec 21 12:18 aa.lstm
-rw-rw-r-- 1 jb jb 6.3K Dec 21 12:18 aa.lstm-number-dawg
-rw-rw-r-- 1 jb jb 4.5K Dec 21 12:18 aa.lstm-punc-dawg
-rw-rw-r-- 1 jb jb 2.8K Dec 21 12:18 aa.lstm-recoder
-rw-rw-r-- 1 jb jb  22K Dec 21 12:18 aa.lstm-unicharset
-rw-rw-r-- 1 jb jb 4.4M Dec 21 12:18 aa.lstm-word-dawg
-rw-rw-r-- 1 jb jb   30 Dec 21 12:18 aa.version
-rw-rw-r-- 1 jb jb  553 Dec 21 12:18 extr.log

The ubma model includes .lstm-word-dawg (and further DAWGs), the ocrd one does not.

The ocrd .lstm is 3.3M, the ubma .lstm is 432K.

Shouldn't ocrd use the ubma file for Fraktur/Gothic?

@stweil
Collaborator

stweil commented Jan 12, 2024

Which one of the two models is "better", and how did you compare them?

@jbarth-ubhd
Author

Comparison in the sense of checking whether the model files have the same content.

@bertsky
Collaborator

bertsky commented Jan 18, 2024

That's strange indeed. It's not to be expected from the vanilla tesstrain rules (even the fast variant just does ConvertToInt). And the concrete wordlist looks very awkward: it contains 400k full forms, nearly half of which are made of strange punctuation characters indicative of absent tokenisation, and the actual tokens are clearly scraped off the web, not historic at all. I would understand if the wordlist from deu or frk had been used in frak2021, but that's not the case at all.

@stweil can you explain?

@stweil
Collaborator

stweil commented Jan 18, 2024

frak2021_1.069.traineddata was made from the original training result, but with additional components like a wordlist and number and punctuation hints (frak2021_1.069.lstm-word-dawg, frak2021_1.069.lstm-number-dawg, frak2021_1.069.lstm-punc-dawg). Those additional components are based on the components from a Tesseract standard model (as far as I remember on Fraktur.traineddata, but I'd have to check). Sort the word list before comparing it with other word lists.

Because of the additional components the file frak2021_1.069.traineddata is larger.

Typically, models with an (ideally domain-specific) wordlist can achieve slightly higher recognition rates, but sometimes this can also lead to OCR results which differ from the printed text.

And yes, this word list contains a lot of entries which should be removed. That's inherited from all standard Tesseract word lists.
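The "sort before comparing" advice above matters because line-oriented tools such as diff and comm match lines by position. A minimal sketch with made-up stand-in lists (the real lists would first be extracted from the .traineddata files, e.g. via combine_tessdata -u and dawg2wordlist):

```shell
# Hypothetical stand-ins for word lists extracted from two models:
printf 'weñ\nwelchẽ\nund\n'  > frak2021.wordlist
printf 'und\nwenn\nwelche\n' > Fraktur.wordlist
# comm(1) requires sorted input, so sort (and deduplicate) first:
sort -u frak2021.wordlist -o frak2021.sorted
sort -u Fraktur.wordlist  -o Fraktur.sorted
# Entries unique to either list (common entries suppressed):
comm -3 frak2021.sorted Fraktur.sorted
```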

@bertsky
Collaborator

bertsky commented Jan 18, 2024

Those additional components are based on the components from a Tesseract standard model (as far as I remember on Fraktur.traineddata, but I'd have to check)

No, the latter word list is about twice the size, also with texts from the web, but contains none of these strange words with punctuation (non-tokenised), and does contain ſ, which yours does not.

Of course it would be preferable to have a standard dictionary for (say) 18th century German. We could export the fullforms from DTA lexdb, for example. (But this must be accompanied by a thorough evaluation.)

Regardless, the word list in that model file looks exceptionally bad (much worse than the Tesseract word lists) and should be improved.

@bertsky
Collaborator

bertsky commented Jan 18, 2024

Of course it would be preferable to have a standard dictionary for (say) 18th century German. We could export the fullforms from DTA lexdb, for example. (But this must be accompanied by a thorough evaluation.)

I have now distilled a list of full forms, capped at different minimum frequencies:

  • >10: 314248 words
  • >50: 100516 words
  • >100: 60403 words

I filtered by part-of-speech, removing punctuation, numbers and non-words (XY):

select trim(u,'"') from csv where f > 100 and p != "$(" and p != "$," and p != "$." and p != "FM.xy" and p != "CARD" and p != "XY";

Furthermore, I removed those entries which have not been properly tokenised (indicated by leading punctuation) or are merely numbers (but still do not get p=CARD):

grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$'

The quality is very good!

Maybe I'll also recompose the number and punc DAWGs for the additional historic patterns (e.g. historic substitutes for the hyphen, or the solidus instead of the comma) and remove the contemporary signs.

I will try to use this with frak2021, but also GT4HistOCR and others.

I guess I'll do some recognition experiments and evaluation before publishing the modified models.

@stweil
Collaborator

stweil commented Jan 29, 2024

In my tests frak2021 is much better than GT4HistOCR, so using it with GT4HistOCR might not be worth the effort.
It would be more interesting to use it with german_print.

@bertsky
Collaborator

bertsky commented Jan 29, 2024

In my tests frak2021 is much better than GT4HistOCR, so using it with GT4HistOCR might not be worth the effort.
It would be more interesting to use it with german_print.

Sure, that's why it's among the models I build the dict into – see the full list of assets.

Some evaluation (which material, which model, with or without dict, and which frequency cap is preferable) will follow.

@jbarth-ubhd
Author

Here is my small tool for checking the word lists of .traineddata files:

https://gist.github.com/jbarth-ubhd/8d5ceb4035bf2d89700117a311209f20

@jbarth-ubhd
Author

@bertsky: but frak2021_dta10+100 do not contain »ſ«:

AMBIGIOUS (EXCERPT): 1sten A/ AP. As. AZ. Basalt- Bauers- Besitz- Bietsch- c. cas. Centralbl. Chrysost. cl. Corn. dial. Diener. Ding- Dinge. Ebd. Eigen- eigentl. Eisen. euch. Eurip. fgm. FML. fundam. g1 Gebiets- Geitz- Generals- G.n GOtts Griseb. Haubt- haus- HErre hsg. inst. Jahrbb. Jungfrau- k. Kg. Kiefer- Lactant. lap. legit. Loose. Magdalenen- Mai- Mehl= Namen. nat. neu- NJmb Normal- O1 Pall. pan. Pfand- Pfl. proc. Reb- redet. Rev. Rhodigin. Rich. Roman- Sc. Schulen. Schweine- Sed. SEin SJndt Spargel- Spitz- Strom. Syllog. Trauben- Trav. Trias- Trift- VIEUSSENS. VVilliam Wach- W.-B. wohl- Wolf. XCVII. y2 Ztg. zwei-

264677 lines
0.00 % lines with »ſ«
0.64 % lines all-UPPERCASE
3.51 % lines ambigious
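The checks behind these percentages could be sketched in Python like this (an illustrative re-implementation, not @jbarth-ubhd's actual gist):

```python
def wordlist_stats(lines):
    """Share of entries containing the long s (ſ) and share of
    all-uppercase entries, as percentages."""
    n = len(lines)
    long_s = sum(1 for w in lines if 'ſ' in w)
    upper = sum(1 for w in lines if len(w) > 1 and w.isupper())
    return {'lines': n,
            'long_s_pct': round(100.0 * long_s / n, 2),
            'upper_pct': round(100.0 * upper / n, 2)}

sample = ['Waſſer', 'Haus', 'XCVII', 'auch']
print(wordlist_stats(sample))  # → {'lines': 4, 'long_s_pct': 25.0, 'upper_pct': 25.0}
```

A historic Fraktur word list should show a substantial long-s share; 0.00 % is a red flag, as noted above.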

@bertsky
Collaborator

bertsky commented Jan 30, 2024

Indeed – something went wrong. Thanks @jbarth-ubhd, I'll investigate!

@bertsky
Collaborator

bertsky commented Jan 30, 2024

Ok, I found the problem. See new release.

346632 lines
16.37 % lines with »ſ«
0.19 % lines all-UPPERCASE
132.80 % lines ambigious

What's with the > 100% BTW?

@jbarth-ubhd
Author

jbarth-ubhd commented Jan 30, 2024

The >100% is because I inspected only every 1/0.003th word (to keep the output compact) and then multiplied the count – I'll have a look at this.

@jbarth-ubhd
Author

jbarth-ubhd commented Jan 30, 2024

Just inspected frak2021_dta50.traineddata.

Ambiguous:

welchẽ   (not NFC) Welchs     weñ      (not NFC) Weñ      (not NFC) wenigen

A lot of spaces after words(?). And some entries are not NFC (the double counting is my bug).

The spaces were not in the frak2021_dta10/100 files I downloaded until Jan 30, 11:55.

@jbarth-ubhd
Author

Now with much nicer output:

welchẽ␣␣␣(not NFC) Welchs␣␣␣␣ weñ␣␣␣␣␣␣(not NFC) Weñ␣␣␣␣␣␣(not NFC) wenigen␣␣␣
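Trailing whitespace like this is easy to screen for; a minimal sketch that also makes the trailing run visible as ␣, in the style of the output above:

```python
def with_trailing_ws(entries):
    """Return entries that have trailing whitespace, with the
    trailing run made visible as '␣'."""
    return [w.rstrip() + '␣' * (len(w) - len(w.rstrip()))
            for w in entries if w != w.rstrip()]

entries = ['welchẽ   ', 'Welchs    ', 'wenigen', 'weñ']
print(with_trailing_ws(entries))  # → ['welchẽ␣␣␣', 'Welchs␣␣␣␣']
```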

@bertsky
Collaborator

bertsky commented Jan 30, 2024

a lot of spaces after words(?).

wow, I should have checked. Thanks again for being thorough @jbarth-ubhd – much appreciated!

see new release

And not NFC (double counting, my bug.)

Do we really want that? (Even if DTA decided not to do it?)

@jbarth-ubhd
Author

Whether we want NFC? I don't know. I inserted the check just because otherwise I wouldn't notice this easily. I can remove it.

@bertsky
Collaborator

bertsky commented Jan 30, 2024

Whether we want NFC? I don't know. I inserted the check just because otherwise I wouldn't notice this easily. I can remove it.

I just checked: tesstrain does NFC on the input GT (via unicodedata.normalize in generate_line_box.py). And calamari-train does so by default. Kraken's ketos train offers it, but it does not seem to be the default.

It is also used in most CER measurement tools.

I feel obliged to comply with this obvious convention in the OCR space.
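The difference is easy to demonstrate with Python's unicodedata module (the same one tesstrain uses): the composed and decomposed spellings render identically but compare unequal codepoint-wise, which is exactly why training and CER tools normalise first.

```python
import unicodedata

# 'wen' + combining tilde, as it may come out of a corpus:
decomposed = 'wen\u0303'
composed = unicodedata.normalize('NFC', decomposed)

print(decomposed == composed)          # → False: renders as 'weñ' either way
print(len(decomposed), len(composed))  # → 4 3
```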

@bertsky
Collaborator

bertsky commented Jan 30, 2024

There we go

@stweil
Collaborator

stweil commented Jan 30, 2024

Tesseract also does NFC when generating lstmf files, but I'd like to change that because I want to be able to train models with decomposed umlauts and other characters with diacritics.

And I already have a Tesseract branch which no longer requires box and lstmf files for the training.

@bertsky
Collaborator

bertsky commented Jan 30, 2024

Tesseract also does NFC when generating lstmf files, but I'd like to change that because I want to be able to train models with decomposed umlauts and other characters with diacritica.

Right, but you can choose the normalization mode via --norm_mode (or NORM_MODE in tesstrain):

  1. Combine graphemes
  2. Split graphemes
  3. Pure Unicode

And it's configured differently for various mother tongues.

So you are saying my fixed NFC in the DTA LM was premature, @stweil?

@stweil
Collaborator

stweil commented Jan 30, 2024

No, my comment was just meant as an information for you.

@jbarth-ubhd
Author

Comparison frak2021 … _dta50:

4160da1e088452fcec11df5a411d9a91 /usr/local/ocrd-models/ocrd-tesserocr-recognize/frak2021_dta50.traineddata

234e8bb819042f615576bd01aa2419fd /usr/local/ocrd-models/ocrd-tesserocr-recognize/frak2021.traineddata

[image: comparison of recognition results]

@jbarth-ubhd
Author

jbarth-ubhd commented Feb 9, 2024

With ..._dta50 some punctuation is missing, but there is almost no word diff ... I expected the dictionary to have a greater impact.

@stweil
Collaborator

stweil commented Mar 1, 2024

So it looks like using a dictionary makes recognition of punctuation worse (unless the dictionary also contains the words with the punctuation)? That's not the kind of impact which is desired.

@bertsky
Collaborator

bertsky commented Mar 1, 2024

With ..._dta50 some punctuation is missing, but there is almost no word diff ... I expected the dictionary to have a greater impact.

Me too. But the averages do go down overall (if just a little) in my experiments.

I did not fiddle with WORD_DAWG_FACTOR yet.

So it looks like using a dictionary makes recognition of punctuation worse (unless the dictionary also contains the words with the punctuation)?

It would appear so. But there may be a general problem with re-integrating the punctuation DAWG. I am also still trying to modify it in a way that covers extra historic punctuation characters. The problem is that Tesseract does not have code to de/serialise it from/to anything other than binary form. (I would have expected at least one of the old automaton text formats like AT&T's. Unclear how these FSTs came to be in the first place. Manually?)

@stweil stweil added the question Further information is requested label May 23, 2024