reading order: WORD-based vs. top-level based #24

bertsky · 2024-08-07T12:46:39Z

The current implementation extracts the ReadingOrder from the top-level parents of all WORD blocks (in the order of these word blocks). This seems to be necessary for cases with TABLE results.

However, for LAYOUT_* blocks, the results look much better if the top-level blocks are directly taken as the order – as implemented in #23.
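To make the difference concrete, here is a minimal sketch of both strategies on raw Textract-style block dictionaries. It is not the code in convert_aws.py: the function names are made up for illustration, and it assumes each block has at most one relevant parent, which real responses do not guarantee (a WORD can be a child of both a LINE and a CELL).

```python
def word_based_order(blocks):
    """Order top-level blocks by the first WORD that falls under each of them."""
    by_id = {b["Id"]: b for b in blocks}
    parent = {}
    for b in blocks:
        for rel in b.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    # Simplification: only the first parent seen is kept.
                    parent.setdefault(child_id, b["Id"])

    def top_level(block_id):
        # Climb until the next parent would be the PAGE (or there is none).
        while block_id in parent and by_id[parent[block_id]]["BlockType"] != "PAGE":
            block_id = parent[block_id]
        return block_id

    order = []
    for b in blocks:
        if b["BlockType"] == "WORD":
            top = top_level(b["Id"])
            if top not in order:
                order.append(top)
    return order


def top_level_based_order(blocks):
    """Take the LAYOUT_* blocks directly, in the order they are listed."""
    return [b["Id"] for b in blocks if b["BlockType"].startswith("LAYOUT_")]


# Toy response: the WORD sequence disagrees with the LAYOUT_* sequence.
blocks = [
    {"Id": "p", "BlockType": "PAGE",
     "Relationships": [{"Type": "CHILD", "Ids": ["L1", "L2"]}]},
    {"Id": "L1", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "L2", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "w1", "BlockType": "WORD"},
    {"Id": "w2", "BlockType": "WORD"},
]

print(word_based_order(blocks))       # ['L2', 'L1']
print(top_level_based_order(blocks))  # ['L1', 'L2']
```

In the toy data the two functions return different orders for the same blocks, which is the situation the comparison in this issue is about.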

For example, here is how both implementations compare:

current (WORD-based)	#23 (top-level based)

The first page is a typical newspaper page (added to the tests in #23) and shows how #23 is better.

The 2nd and 3rd examples are taken from the test suite. The 2nd shows that the current implementation is better, because #23 places the table after all the other regions.

The 3rd example (nd1969 test case) has AWS results which obviously look bad either way, with 2 false tables and many highly overlapping column regions.

bertsky · 2024-08-07T15:21:37Z

Also, the 1st example makes it obvious that the non-textual top-level blocks (image regions etc.) should also be part of the extracted reading order, so that, for example, image captions can be post-processed differently.

bertsky · 2024-08-09T15:07:06Z

So regarding the open question how to deal with LAYOUT_TABLE – currently

textract2page/textract2page/convert_aws.py

Lines 925 to 927 in 55fe416

    if layout.textract_layout_type == "LAYOUT_TABLE":
        # we cover tables separatly
        continue

– I found these interesting snippets:

https://github.com/aws-samples/amazon-textract-textractor/blob/82ceab8ca8460dcf6efebaf73e27b762dabb93b1/prettyprinter/textractprettyprinter/t_pretty_print_layout.py#L91-L99

https://github.com/aws-samples/amazon-textract-textractor/blob/82ceab8ca8460dcf6efebaf73e27b762dabb93b1/textractor/parsers/response_parser.py#L1090-L1111

Thus, apparently, for every LAYOUT_TABLE block we should expect to find a TABLE block of the same position and size, which in turn contains the actual table hierarchy but is itself not "in order". So we do need both. Where a TABLE has no corresponding LAYOUT_TABLE block (or there are no LAYOUT_* results at all), we should keep our current word-order based approach.
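A rough sketch of that pairing and fallback decision, again on raw Textract-style block dictionaries. The function names and the 0.9 overlap threshold are assumptions for illustration, not taken from textractor or from this repository. "Same position and size" is approximated here by intersection over union (IoU) of the bounding boxes.

```python
def _corners(block):
    bb = block["Geometry"]["BoundingBox"]
    return bb["Left"], bb["Top"], bb["Left"] + bb["Width"], bb["Top"] + bb["Height"]


def iou(a, b):
    """Intersection over union of two blocks' bounding boxes."""
    ax0, ay0, ax1, ay1 = _corners(a)
    bx0, by0, bx1, by1 = _corners(b)
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def match_layout_tables(blocks, min_iou=0.9):
    """Pair each TABLE with its best-overlapping LAYOUT_TABLE.

    Returns (matches, unmatched): matches maps LAYOUT_TABLE id -> TABLE id,
    unmatched lists the ids of TABLE blocks without a good enough partner.
    """
    layouts = [b for b in blocks if b["BlockType"] == "LAYOUT_TABLE"]
    tables = [b for b in blocks if b["BlockType"] == "TABLE"]
    matches, unmatched = {}, []
    for table in tables:
        best = max(layouts, key=lambda lay: iou(lay, table), default=None)
        if best is not None and iou(best, table) >= min_iou:
            matches[best["Id"]] = table["Id"]
        else:
            unmatched.append(table["Id"])
    return matches, unmatched


def can_use_layout_order(blocks):
    """True only if LAYOUT_* results exist and every TABLE found a partner."""
    has_layout = any(b["BlockType"].startswith("LAYOUT_") for b in blocks)
    _, unmatched = match_layout_tables(blocks)
    return has_layout and not unmatched
```

When `can_use_layout_order` returns False, the converter would keep the current word-order based approach. The matched LAYOUT_TABLE supplies the position in the reading order, and its TABLE supplies the cell hierarchy.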

joewiz · 2024-08-27T12:34:11Z

This is really interesting! Thank you for the great illustrations! (Were they produced by hand, or using a utility? If the latter, is it available?)

bertsky · 2024-08-27T13:08:59Z

Were they produced by hand, or using a utility? If the latter, is it available?

The example images were produced with the built-in screenshot facility of the native PAGE-XML view of OCR-D Browser, a Gtk (i.e. Linux) GUI. Currently, the best version is probably hnesk/browse-ocrd#64. There is also a Docker version, which runs the Gtk app in the browser (also best built locally via make docker); see the instructions in the readme.

For the sake of completeness, here are the results as of the current state of #23 (much better than shown above):

Plus, as a new example, we stumbled upon cases of LINE within FIGURE (which must become TextRegions in ImageRegions) and PARAGRAPH within LIST (which must become TextRegion in TextRegion):
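The nesting described here could look roughly as follows when built with Python's standard library. This is only a sketch: the `REGION_FOR` mapping and the `(kind, id, children)` tuples are hypothetical stand-ins for the real Textract objects, and the mandatory Coords elements are left out for brevity, so the output is not schema-valid PAGE-XML.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical mapping from the block kinds named above to PAGE region types.
REGION_FOR = {
    "FIGURE": "ImageRegion",
    "LINE": "TextRegion",
    "LIST": "TextRegion",
    "PARAGRAPH": "TextRegion",
}


def add_region(parent, node):
    """Recursively append a region element for node = (kind, id, children)."""
    kind, region_id, children = node
    region = ET.SubElement(parent, f"{{{PAGE_NS}}}{REGION_FOR[kind]}", id=region_id)
    for child in children:
        add_region(region, child)
    return region


page = ET.Element(f"{{{PAGE_NS}}}Page")
add_region(page, ("FIGURE", "r1", [("LINE", "r1_1", [])]))
add_region(page, ("LIST", "r2", [("PARAGRAPH", "r2_1", [])]))
print(ET.tostring(page, encoding="unicode"))
```

The first call yields a TextRegion inside an ImageRegion, the second a TextRegion inside a TextRegion.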

bertsky · 2024-08-27T13:11:26Z

Note: OCR-D Browser expects a METS-XML representation in OCR-D conventions. If you just need a direct viewer, consider using PRImA PageViewer, which is written in Java and thus platform-independent.

bertsky linked a pull request Aug 10, 2024 that will close this issue

Top-level reading order #23

Open
