Problem with table recognition #134

Shanksum · 2020-07-30T11:14:42Z

With tables that have no horizontal lines, the workflow produces a wrong reading order: it recognizes only the columns, not the rows.
See the following image as an example:

The result is as follows:
OCR-D-TXT_catalog46muse_0564.txt
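To make the symptom concrete, here is a minimal Python sketch. It is not OCR-D code: the cell texts, the coordinates and both helper functions are made up for illustration. It contrasts the column-first order you get when only columns are recognized with the row-first order a reader expects from a table.

```python
# Hypothetical table cells: (text, x, y) of each cell's top-left corner.
cells = [
    ("No.", 0, 0), ("Item", 100, 0), ("Price", 300, 0),
    ("1", 0, 40), ("Vase", 100, 42), ("12", 300, 41),
    ("2", 0, 80), ("Coin", 100, 79), ("7", 300, 81),
]

def column_first(cells, tolerance=20):
    """Reading order when only columns are known: each column top to bottom."""
    # Snap x to a coarse grid so that cells of one column share a key.
    ordered = sorted(cells, key=lambda c: (round(c[1] / tolerance), c[2]))
    return [text for text, _, _ in ordered]

def row_first(cells, tolerance=20):
    """Reading order expected for a table: each row left to right."""
    # Snap y to a coarse grid so that slightly uneven baselines share a key.
    ordered = sorted(cells, key=lambda c: (round(c[2] / tolerance), c[1]))
    return [text for text, _, _ in ordered]

print(column_first(cells))
print(row_first(cells))
```

Snapping to a fixed grid is a crude assumption that only works for this toy layout. A real table recognizer has to infer rows from the cell geometry, which is exactly what is missing when there are no horizontal separators.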

This is the workflow used:

ocrd-olena-binarize -I OCR-D-OPT -O OCR-D-BIN -p '{"impl": "sauvola-ms-split"}'
ocrd-cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-DENOISE -p '{"level-of-operation":"page"}'
ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"level-of-operation":"page"}'
ocrd-tesserocr-segment-region -I OCR-D-DESKEW-PAGE -O OCR-D-SEG-REG
ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{"plausibilize":true}'
ocrd-cis-ocropy-binarize -I OCR-D-SEG-REPAIR -O OCR-D-BIN2 -p '{"level-of-operation":"region"}'
ocrd-tesserocr-deskew -I OCR-D-BIN2 -O OCR-D-DESKEW-TEXT
ocrd-tesserocr-segment-line -I OCR-D-DESKEW-TEXT -O OCR-D-SEG-LINE
ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG
ocrd-cis-ocropy-dewarp -I OCR-D-RESEG -O OCR-D-DEWARP-LINE
ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'


bertsky · 2020-08-24T18:24:06Z

That's because there are no good table processors in OCR-D yet. But you'd also have to include the existing ones in your workflow in the first place!

Here's my take on this example:

1. Binarization is hard. The above page features heavy show-through, stains/specks, and handwriting. And since you uploaded a JPEG, I also get heavy compression artifacts around the glyphs. I have not been able to get much use out of ocrd-skimage-normalize or ocrd-skimage-denoise-raw here, and my best shot for binarization is ocrd-olena-binarize with sauvola-ms-split and k set to 0.2 (guessing a dpi of 200):
2. Table detection is currently only available with ocrd-tesserocr-segment-region (with its default find_tables: true). But its underlying segmentation is fragile and does not cope well at all with binarized input. Tesseract (i.e. its usage of Leptonica) wants to see the raw image and binarize it with its own (bad, internal) global Otsu implementation. So running binarization after segmentation is currently the only way to get a table region for that page. But often the workflow needs binarization prior to page segmentation (table detection). Our OCR-D wrapper could of course extract the raw image, regardless of the workflow. But that might degrade quality in other cases (exactly because the internal binarization is so bad). Therefore I started #144 (Segmentation on raw images) to experiment with this behaviour. Note: I also found a bug in Tesseract's separator detection. There are very likely more of those lurking.
3. After table detection you need a processor for table recognition. Although ocrd-cis-ocropy-segment has a level-of-operation=table, I would currently not recommend it. You can use ocrd-tesserocr-segment-table for a slightly better approximation, but don't expect too much! This currently just uses Tesseract's SPARSE_TEXT mode (or SPARSE_TEXT_OSD in #144). Here's what this looks like:

   So: there's a text region for the handwritten "check" on the right, then the table region commences. The cells of that table are not ideal and there is no recursive or consistent structure. Also, many separators go undetected. Again, note that steps 2 and 3 had to be done on the raw image.
4. After segmentation, you might want to do dewarping and recognition. This will use the binarization from step 1 again.
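The remark above about Tesseract's internal global Otsu binarization can be illustrated with a small sketch. This is the textbook Otsu method in plain Python, not Leptonica's or Tesseract's actual code, and the histogram is invented: a little dark ink, a fair amount of mid-gray show-through, and mostly bright paper.

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximizes the between-class variance
    of the two classes 'level <= t' and 'level > t'."""
    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))
    best_t, best_var = 0, -1.0
    n0 = 0    # number of pixels at or below t
    sum0 = 0  # sum of their gray values
    for t, count in enumerate(histogram):
        n0 += count
        sum0 += t * count
        n1 = total - n0
        if n0 == 0 or n1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / n0
        mu1 = (total_sum - sum0) / n1
        var = (n0 / total) * (n1 / total) * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# Invented page statistics (assumption, not measured from the scan):
histogram = [0] * 256
histogram[30] = 50    # ink
histogram[140] = 200  # show-through from the verso
histogram[220] = 750  # paper

t = otsu_threshold(histogram)
foreground = sum(histogram[: t + 1])  # pixels that end up black
print(t, foreground)
```

With these made-up numbers the single global threshold lands at or above the show-through level, so the show-through is binarized as ink together with the real text. That is the kind of failure a locally adaptive method such as Sauvola is meant to avoid.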

jbarth-ubhd · 2021-02-01T09:36:16Z

This scan has different skew angles (at top & bottom); perhaps a 3d deskew could help.

jbarth-ubhd · 2021-02-01T09:39:34Z

text lines aligned (but not vertically aligned):

bertsky · 2021-02-01T12:48:11Z

> This scan has different skew angles (at top & bottom); perhaps a 3d deskew could help.

@jbarth-ubhd, by 3d deskew you mean dewarping? How did you get this result?

Back to the issue: the core problem is still making Tesseract (currently the only table detector in OCR-D) actually detect a table region for that page. As explained above, this only works if input is not binarized (normalized or not).

Now, with your dewarped JPEG, I cannot get a table at all anymore, probably because the corners are clipped to white. But if I apply ocrd-sbb-binarize to the dewarped image, then I get at least a partial table:

In summary, we have to

make Tesseract cope with binarized input (at least as well as with raw input)
wrap a better (more robust, ideally neural) table detection than Tesseract
wrap a better (more adequate w.r.t. cells and order) table recognition than the "tables as pages" paradigm (Tesseract sparse mode or Ocropy recursive XY-cut)

jbarth-ubhd · 2021-02-01T14:10:16Z

> @jbarth-ubhd, by 3d deskew you mean dewarping? How did you get this result?

No, I mean correcting a photo that was not taken orthogonally to the plane of the paper (i.e. perspective distortion). The vertical column separators are not parallel in the scan. Since we had scans with perspective distortion, I wrote a tool to correct it, but without correcting the verticals (I didn't know how to correct those reliably).
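For readers unfamiliar with the term, perspective correction can be sketched as fitting a homography to four point correspondences. The following is a generic, plain-Python illustration and not how blitzDrt works; the corner coordinates and the function names are assumptions chosen for the example.

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography(src, dst):
    """Return 8 coefficients h (with h33 fixed to 1) mapping each src point to its dst point."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve(a, b)

def warp_point(h, x, y):
    """Apply the homography to a single point."""
    w = h[6] * x + h[7] * y + 1
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)

# Hypothetical page corners as photographed (a trapezoid) and as they should be.
src = [(10, 0), (90, 0), (100, 100), (0, 100)]
dst = [(0, 0), (100, 0), (100, 100), (0, 100)]
h = homography(src, dst)

# A point on the slanted left page edge ends up on the vertical line x = 0.
print(warp_point(h, 5, 50))
```

Because a homography maps straight lines to straight lines, column separators that converge in the photo become parallel again after the warp. It cannot fix curvature from a bent page, which is what dewarping addresses.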

bertsky · 2021-02-01T15:08:51Z

> No, I mean correcting a photo that was not taken orthogonally to the plane of the paper (i.e. perspective distortion). The vertical column separators are not parallel in the scan. Since we had scans with perspective distortion, I wrote a tool to correct it, but without correcting the verticals (I didn't know how to correct those reliably).

That sounds interesting. I had that use-case, too. See my report on probing various unperspective and dewarp tools for suitability in OCR-D. Back then you said you were using mzucker's tool. Is that still the case, or did you write your own?

jbarth-ubhd · 2021-02-01T15:49:16Z

this one: https://github.com/jbarth-ubhd/blitzDrt

stweil · 2021-02-01T16:06:14Z

Jochen, great that you have now published that oldie on GitHub. Do you want to add a license file, too?

jbarth-ubhd · 2021-02-03T12:45:51Z

Done: MIT.

bertsky mentioned this issue Sep 11, 2020

Line segmentation in tables OCR-D/ocrd_all#190

Closed
