Problem with table recognition #134

Open
Shanksum opened this issue Jul 30, 2020 · 9 comments
@Shanksum

For tables without horizontal lines, the workflow produces a wrong reading order: it recognizes only the columns, not the rows.
See the following image as an example:
catalog46muse_0564

The result is as follows:
OCR-D-TXT_catalog46muse_0564.txt

This is the workflow used:

ocrd-olena-binarize -I OCR-D-OPT -O OCR-D-BIN -p '{"impl": "sauvola-ms-split"}'
ocrd-cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-DENOISE -p '{"level-of-operation":"page"}'
ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"level-of-operation":"page"}'
ocrd-tesserocr-segment-region -I OCR-D-DESKEW-PAGE -O OCR-D-SEG-REG
ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{"plausibilize":true}'
ocrd-cis-ocropy-binarize -I OCR-D-SEG-REPAIR -O OCR-D-BIN2 -p '{"level-of-operation":"region"}'
ocrd-tesserocr-deskew -I OCR-D-BIN2 -O OCR-D-DESKEW-TEXT
ocrd-tesserocr-segment-line -I OCR-D-DESKEW-TEXT -O OCR-D-SEG-LINE
ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG
ocrd-cis-ocropy-dewarp -I OCR-D-RESEG -O OCR-D-DEWARP-LINE
ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'
@bertsky
Collaborator

bertsky commented Aug 24, 2020

That's because there are no good table processors in OCR-D yet. But you'd also have to include the existing ones in your workflow in the first place!

Here's my take on this example:

  1. Binarization is hard. The above page features heavy show-through, stains/specks, and handwriting. And since you uploaded a JPEG, I also get heavy compression artifacts around the glyphs. I have not been able to get much use out of ocrd-skimage-normalize or ocrd-skimage-denoise-raw here, and my best shot for binarization is ocrd-olena-binarize with sauvola-ms-split and k set to 0.2 (guessing a dpi of 200):
    OCR-D-IMG-DEN_catalog46muse_0564-BIN_sauvola-ms-split
  2. Table detection is currently only available with ocrd-tesserocr-segment-region (with its default find_tables: true). But its underlying segmentation is fragile and does not cope well at all with binarized input. Tesseract (i.e. its usage of Leptonica) wants to see the raw image and binarize with its (bad, internal) global Otsu implementation. So running binarization after segmentation is currently the only way to get a table region for that page. But often the workflow needs binarization prior to page segmentation (table detection). Our OCR-D wrapper could of course extract the raw image, regardless of the workflow. But that might degrade quality in other cases (exactly because the internal binarization is so bad). Therefore I started Segmentation on raw images #144 to experiment with this behaviour. Note: I also found a bug in Tesseract's separator detection. There's very likely more of those lurking.
  3. After table detection you need a processor for table recognition. Although ocrd-cis-ocropy-segment has a level-of-operation=table, I would currently not recommend it. You can use ocrd-tesserocr-segment-table for a slightly better approximation, but don't expect too much! This currently just uses Tesseract's SPARSE_TEXT mode (or SPARSE_TEXT_OSD in Segmentation on raw images #144). Here's what this looks like:
    OCR-D-SEG-TAB_catalog46muse_0564_pageviewer
    So: there's a text region for the handwritten "check" on the right, then the table region commences. The cells of that table are not ideal and there is no recursive or consistent structure. Also, many separators go undetected. Again, note that 2 and 3 had to be done on the raw image.
  4. After segmentation, you might want to do dewarping and recognition. This will use the binarization from step 1 again.
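
The four steps above could be strung together roughly as follows. This is only a sketch, not a tested workflow: the file group names are made up, `k` and `dpi` are assumed to be the ocrd-olena-binarize parameter names for the values mentioned in step 1, and binarization is placed after region and table segmentation because steps 2 and 3 had to run on the raw image. The script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them.
# File group names (OCR-D-IMG, OCR-D-SEG-REG, ...) are hypothetical.
# "k" and "dpi" are assumed parameter names for ocrd-olena-binarize.
table_workflow() {
    cat <<'EOF'
ocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-REG -p '{"find_tables": true}'
ocrd-tesserocr-segment-table -I OCR-D-SEG-REG -O OCR-D-SEG-TAB
ocrd-olena-binarize -I OCR-D-SEG-TAB -O OCR-D-BIN -p '{"impl": "sauvola-ms-split", "k": 0.2, "dpi": 200}'
ocrd-tesserocr-segment-line -I OCR-D-BIN -O OCR-D-SEG-LINE
ocrd-cis-ocropy-dewarp -I OCR-D-SEG-LINE -O OCR-D-DEWARP-LINE
ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'
EOF
}

table_workflow
```

Once the printed commands look right for your workspace, the output could be piped to `sh -e` from inside the workspace directory. Given the caveats in steps 2 and 3, the resulting cells and reading order will likely still need manual checking.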

@jbarth-ubhd

This scan has different skew angles (at top & bottom); perhaps a 3d deskew could help.

@jbarth-ubhd

text lines aligned (but not vertically aligned):
0001

@bertsky
Collaborator

bertsky commented Feb 1, 2021

This scan has different skew angles (at top & bottom); perhaps a 3d deskew could help.

@jbarth-ubhd, by 3d deskew you mean dewarping? How did you get this result?

Back to the issue: the core problem is still making Tesseract (currently the only table detector in OCR-D) actually detect a table region for that page. As explained above, this only works if input is not binarized (normalized or not).
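
To see quickly whether a given segmentation run produced a table region at all, one could count the `TableRegion` elements in the resulting PAGE-XML file. The following is a minimal sketch that assumes plain text matching is good enough (it does not parse the XML); the function name and the sample file are made up for illustration.

```shell
#!/bin/sh
# Count opening TableRegion tags in a PAGE-XML file,
# with or without a namespace prefix such as "pc:".
# Plain text matching, not a real XML parser.
count_table_regions() {
    grep -Eo '<([A-Za-z0-9_]+:)?TableRegion[ >/]' "$1" | wc -l | tr -d ' '
}

# Demo on a tiny hand-written sample (not real segmenter output).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="example.png">
    <pc:TextRegion id="r1"/>
    <pc:TableRegion id="r2">
      <pc:TextRegion id="r2_c1"/>
    </pc:TableRegion>
  </pc:Page>
</pc:PcGts>
EOF
count_table_regions "$sample"   # prints 1
rm -f "$sample"
```

A result of 0 on the page file written by the region segmentation step would mean that no table was detected, so any later table recognition step has nothing to work on.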

Now, with your dewarped JPEG, I cannot get a table at all anymore, probably because of the corners clipped to white. But if I apply ocrd-sbb-binarize to the dewarped image, then I get at least a partial table:
OCR-D-BIN-SBB-DESKEW-SEGREG_catalog46muse_0564_dew_pageviewer

In summary, we have to

  • make Tesseract cope with binarized input (at least as well as with raw)
  • wrap a better (more robust, ideally neural) table detection than Tesseract
  • wrap a better (more adequate w.r.t. cells and order) table recognition than the "tables as pages" paradigm (Tesseract sparse mode or Ocropy recursive XY-cut)

@jbarth-ubhd

jbarth-ubhd commented Feb 1, 2021

@jbarth-ubhd, by 3d deskew you mean dewarping? How did you get this result?

No, I mean correcting a photo not taken orthogonally to the plane (paper) (=perspective distortion). The vertical column separators are not parallel in the scan. Since we had scans with "perspective distortion", I wrote a tool to correct it, but without correcting the verticals (I didn't know how to correct those reliably).

@bertsky
Collaborator

bertsky commented Feb 1, 2021

No, I mean correcting a photo not taken orthogonally to the plane (paper) (=perspective distortion). The vertical column separators are not parallel in the scan. Since we had scans with "perspective distortion", I wrote a tool to correct it, but without correcting the verticals (I didn't know how to correct those reliably).

That sounds interesting. I had that use-case, too. See my report on probing various unperspective and dewarp tools for suitability in OCR-D. Back then you said you were using mzucker's tool. Is that still the case, or did you write your own?

@jbarth-ubhd

this one: https://github.com/jbarth-ubhd/blitzDrt

@stweil
Contributor

stweil commented Feb 1, 2021

Jochen, great that you published that oldy now on GitHub. Do you want to add a license file, too?

@jbarth-ubhd

jbarth-ubhd commented Feb 3, 2021 via email
