reading order: WORD-based vs. top-level based #24

Open
bertsky opened this issue Aug 7, 2024 · 5 comments · May be fixed by #23

Comments

bertsky commented Aug 7, 2024

The current implementation extracts the ReadingOrder from the top-level parents of all WORD blocks (in the order of these word blocks). This seems to be necessary for cases with TABLE results.

However, for LAYOUT_* blocks, the results look much better if the top-level blocks are directly taken as the order – as implemented in #23.

For example, here is how both implementations compare:

| current (WORD-based) | #23 (top-level based) |
| --- | --- |
| [image: sn1991-02-09_pr_0002 ro_word small] | [image: sn1991-02-09_pr_0002 ro_top-level small] |
| [image: Ansiedlung_Korotschin_UZS_Sign_22a_0018 ro_word] | [image: Ansiedlung_Korotschin_UZS_Sign_22a_0018 ro_top-level] |
| [image: nd1969-01-21_3 ro_word small] | [image: nd1969-01-21_3 ro_top-level small] |

The first page is a typical newspaper page (added to the tests in #23) and shows how #23 is better.

The 2nd and 3rd examples are taken from the test suite. The 2nd shows that the current implementation is better, because #23 places the table after all the other regions.

The 3rd example (nd1969 test case) has AWS results which obviously look bad either way, with 2 false tables and many highly overlapping column regions.
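To make the two strategies concrete, here is a minimal sketch over raw Textract block dicts. It is not the code from this repository or from #23. The helper names (`child_ids`, `parent_map`, `word_based_order`, `top_level_based_order`) are hypothetical, and it assumes that each block has at most one parent besides the PAGE block and that the block list is in the order Textract returned it.

```python
def child_ids(block):
    """Return the Ids of all CHILD relationships of a Textract block."""
    return [cid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]]


def parent_map(blocks):
    """Map each child Id to its parent Id, ignoring the PAGE block.

    Assumption: at most one non-PAGE parent per block (later parents win).
    """
    parents = {}
    for block in blocks:
        if block["BlockType"] == "PAGE":
            continue
        for cid in child_ids(block):
            parents[cid] = block["Id"]
    return parents


def word_based_order(blocks):
    """Order top-level blocks by the first WORD that belongs to them."""
    parents = parent_map(blocks)
    order = []
    for block in blocks:
        if block["BlockType"] != "WORD":
            continue
        top = block["Id"]
        while top in parents:  # climb up to the top-level ancestor
            top = parents[top]
        if top not in order:
            order.append(top)
    return order


def top_level_based_order(blocks):
    """Take the top-level LAYOUT_* blocks directly, in response order."""
    parents = parent_map(blocks)
    return [block["Id"]
            for block in blocks
            if block["BlockType"].startswith("LAYOUT_")
            and block["Id"] not in parents]
```

The second function also returns layout blocks that contain no words at all, which the word-based variant can never reach.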

bertsky commented Aug 7, 2024

Also, the 1st example makes it obvious that the non-textual top-level blocks (image regions etc.) should also be part of the extracted reading order (so that, for example, image captions can be post-processed differently).

bertsky commented Aug 9, 2024

Regarding the open question of how to deal with LAYOUT_TABLE, the code currently skips these blocks:

```python
if layout.textract_layout_type == "LAYOUT_TABLE":
    # we cover tables separately
    continue
```

I found these interesting snippets:

https://github.com/aws-samples/amazon-textract-textractor/blob/82ceab8ca8460dcf6efebaf73e27b762dabb93b1/prettyprinter/textractprettyprinter/t_pretty_print_layout.py#L91-L99

https://github.com/aws-samples/amazon-textract-textractor/blob/82ceab8ca8460dcf6efebaf73e27b762dabb93b1/textractor/parsers/response_parser.py#L1090-L1111

Thus, apparently, for every LAYOUT_TABLE block we should expect to find a TABLE block of the same position and size, which in turn will contain the actual table hierarchy, but is itself not "in order". So we do need both. For cases where a TABLE has no matching LAYOUT_TABLE block (or there are no LAYOUT_* results at all), we should keep our current word-order based approach.
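A rough sketch of that pairing and fallback decision, working on raw Textract block dicts, could look as follows. The function names and the tolerance `tol` are hypothetical. Comparing bounding boxes is only one way to establish "same position and size", chosen here for illustration.

```python
def bbox(block):
    """Return (left, top, width, height) of a Textract block."""
    box = block["Geometry"]["BoundingBox"]
    return box["Left"], box["Top"], box["Width"], box["Height"]


def same_box(a, b, tol=0.01):
    """True if both blocks cover (nearly) the same area.

    Assumption: `tol` is an arbitrary tolerance in relative page units.
    """
    return all(abs(x - y) <= tol for x, y in zip(bbox(a), bbox(b)))


def pair_tables(blocks, tol=0.01):
    """Pair each TABLE with a LAYOUT_TABLE of the same position and size.

    Returns ({layout_table_id: table_id}, [ids of unmatched TABLE blocks]).
    """
    layout_tables = [b for b in blocks if b["BlockType"] == "LAYOUT_TABLE"]
    pairs, unmatched = {}, []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        match = next((lt for lt in layout_tables
                      if lt["Id"] not in pairs and same_box(lt, table, tol)),
                     None)
        if match is None:
            unmatched.append(table["Id"])
        else:
            pairs[match["Id"]] = table["Id"]
    return pairs, unmatched


def use_layout_order(blocks, tol=0.01):
    """Decide between the layout-based and the word-based approach.

    Fall back to word order if there are no LAYOUT_* blocks at all,
    or if some TABLE has no matching LAYOUT_TABLE.
    """
    has_layout = any(b["BlockType"].startswith("LAYOUT_") for b in blocks)
    _, unmatched = pair_tables(blocks, tol)
    return has_layout and not unmatched
```

With such a pairing, the LAYOUT_TABLE would supply the position in the reading order and the paired TABLE would supply the cell hierarchy.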

bertsky linked a pull request Aug 10, 2024 that will close this issue

joewiz commented Aug 27, 2024

This is really interesting! Thank you for the great illustrations! (Were they produced by hand, or using a utility? If the latter, is it available?)

bertsky commented Aug 27, 2024

> Were they produced by hand, or using a utility? If the latter, is it available?

The example images were produced with the built-in screenshot facility of the native PAGE-XML view of OCR-D Browser, a Gtk (i.e. Linux) GUI. Currently the best version is probably hnesk/browse-ocrd#64. There is also a Docker version, which runs the Gtk app in the browser (also best built locally via `make docker`); see the instructions in the readme.

For the sake of completeness, here are the results as of the current state of #23 (much better than shown above):

[image: sn1991-02-09_pr_0002 pageview web]

[image: Ansiedlung_Korotschin_UZS_Sign_22a_0018 pageview web]

[image: nd1969-01-21_3 pageview web]

Plus, as a new example, we stumbled upon cases of LINE within FIGURE (which must become TextRegions in ImageRegions) and PARAGRAPH within LIST (which must become a TextRegion in a TextRegion):

[image: sn1991-01-03_0001 pageview web]
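As a sketch of how such nesting could be mapped, the following builds a simple region tree from raw Textract block dicts. It is not the converter's actual code. The output dict shape and the function `build_region` are hypothetical, and it assumes that LAYOUT_FIGURE maps to an ImageRegion, that every other LAYOUT_* block maps to a TextRegion, and that paragraphs inside a list arrive as LAYOUT_TEXT children of a LAYOUT_LIST block.

```python
def child_ids(block):
    """Return the Ids of all CHILD relationships of a Textract block."""
    return [cid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]]


def build_region(block, by_id):
    """Convert a LAYOUT_* block into a nested region dict.

    Assumption: the result is a plain dict for illustration only,
    not a PAGE-XML object from any particular library.
    """
    kind = "ImageRegion" if block["BlockType"] == "LAYOUT_FIGURE" else "TextRegion"
    region = {"id": block["Id"], "type": kind, "regions": [], "lines": []}
    for cid in child_ids(block):
        child = by_id[cid]
        if child["BlockType"] == "LINE":
            if kind == "ImageRegion":
                # LINE within FIGURE: wrap the line in a TextRegion
                # nested inside the ImageRegion
                region["regions"].append({"id": cid + "_region",
                                          "type": "TextRegion",
                                          "regions": [],
                                          "lines": [cid]})
            else:
                region["lines"].append(cid)
        elif child["BlockType"].startswith("LAYOUT_"):
            # e.g. paragraph within LIST: TextRegion nested in TextRegion
            region["regions"].append(build_region(child, by_id))
    return region
```

Here `by_id` is a dict from block Id to block, for example `{b["Id"]: b for b in blocks}`.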

bertsky commented Aug 27, 2024

Note: OCR-D Browser expects a METS-XML representation in OCR-D conventions. If you just need a direct viewer, consider using PRImA PageViewer, which is written in Java and thus platform-independent.
