Relation of METS and PAGE ReadingOrder #40

Open
kba opened this issue May 7, 2018 · 11 comments
@kba
Member

kba commented May 7, 2018

We need to specify how these constructs are related, which one to use, and how to handle contradictions.

kba modified the milestones: v2.0.0, v2.3.0 (Jun 13, 2018)
@kba
Member Author

kba commented Jun 18, 2018

cf. #55

@wrznr
Contributor

wrznr commented Jun 19, 2018

After discussing this issue with @tboenig: Reading order is not represented within METS since it is a page-level datum.

@wrznr
Contributor

wrznr commented Jun 19, 2018

However, we find examples of reading orders represented in METS, e.g., within the DDR-Presseportal:

<mets:div TYPE="article-part" ORDER="1" ID="article6-1">
    <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
        <mets:fptr>
            <mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
        <mets:fptr>
            <mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone3">
        <mets:fptr>
            <mets:area COORDS="186,1290,673,559" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block20" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone4">
        <mets:fptr>
            <mets:area COORDS="189,1864,658,145" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block21" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
</mets:div>
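
To make the implied reading order concrete, here is a minimal Python sketch that walks such a fragment and lists the article zones in document order together with the ALTO block each one points to. It assumes the standard METS namespace URI, which the excerpt above does not declare, and it uses a shortened two-zone copy of the fragment. The function name `article_zones` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the fragment lives in the standard METS namespace.
METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Shortened copy of the fragment above; the xmlns declaration is added
# here only so that the snippet parses on its own.
FRAGMENT = """
<mets:div xmlns:mets="http://www.loc.gov/METS/"
          TYPE="article-part" ORDER="1" ID="article6-1">
  <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
    <mets:fptr>
      <mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
    <mets:fptr>
      <mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
</mets:div>
"""


def article_zones(part):
    """Return (label, ALTO block id) pairs for each zone, in document order."""
    zones = []
    for zone in part.findall("mets:div[@TYPE='article-zone']", METS_NS):
        # Each zone carries two pointers; only the IDREF one names an ALTO block.
        block = zone.find("mets:fptr/mets:area[@BETYPE='IDREF']", METS_NS)
        block_id = block.get("BEGIN") if block is not None else None
        zones.append((zone.get("LABEL"), block_id))
    return zones


zones = article_zones(ET.fromstring(FRAGMENT))
```

In this reading, the order of the `article-zone` children inside the `article-part` is what encodes the reading order; nothing else in the fragment states it explicitly.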

@kba
Member Author

kba commented Jun 19, 2018

How can you represent document structure? <mets:file mimetype="application/tei+xml">...</mets:file>?

kba changed the title from "Relation of METS structMap and PAGE ReadingOrder" to "Relation of METS and PAGE ReadingOrder" (Jun 19, 2018)
@wrznr
Contributor

wrznr commented Sep 17, 2018

@kba Proposal for OCR-D purposes:
<mets:structMap TYPE="LOGICAL" /> is the place to represent document structure (i.e. all structural phenomena which may cross page boundaries).
<pc:ReadingOrder /> is the place to store page-internal reading order.
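
As an illustration of the page-internal half of this proposal, the following Python sketch reads a flat `pc:ReadingOrder` and returns the region IDs in reading order. It is a sketch under assumptions: the sample document uses the 2019-07-15 PAGE namespace, is deliberately not schema-valid (regions have no `Coords`), and only a single top-level `OrderedGroup` of `RegionRefIndexed` entries is handled. The function name `reading_order` and all IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15 namespace; other PAGE versions use other URIs.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up, simplified page: regions are listed out of order on purpose.
PAGE_XML = """
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page1.tif" imageWidth="1000" imageHeight="1500">
    <pc:ReadingOrder>
      <pc:OrderedGroup id="ro1">
        <pc:RegionRefIndexed index="1" regionRef="r_body"/>
        <pc:RegionRefIndexed index="0" regionRef="r_title"/>
      </pc:OrderedGroup>
    </pc:ReadingOrder>
    <pc:TextRegion id="r_body" type="paragraph"/>
    <pc:TextRegion id="r_title" type="heading"/>
  </pc:Page>
</pc:PcGts>
"""


def reading_order(page_root):
    """Return region IDs sorted by the index given in the ReadingOrder."""
    refs = page_root.findall(
        ".//pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", PAGE_NS
    )
    # The index attribute, not the element position, defines the order.
    ordered = sorted(refs, key=lambda ref: int(ref.get("index")))
    return [ref.get("regionRef") for ref in ordered]


order = reading_order(ET.fromstring(PAGE_XML))
```

Nested or unordered groups would need a recursive walk; they are left out here to keep the sketch short.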

@wrznr
Contributor

wrznr commented Oct 4, 2018

@tboenig We should update the guidelines asap.

@wrznr
Contributor

wrznr commented Nov 6, 2018

@tboenig Push.

@cneud
Member

cneud commented May 21, 2019

This is only awaiting the updated guidelines, right?

#80 is closed and I agree fully with #40 (comment).

For the main purposes of OCR-D we should avoid (modifying) the depths of METS/MODS library style structural tagging whenever we can also rely on PAGE ReadingOrder.

A solution for METS/MODS structural enrichment via external information available through our standard fileGrp mechanism is therefore imho the best solution for now.

@kba
Member Author

kba commented Jun 16, 2020

Possibly fixed by #154

@bertsky
Collaborator

bertsky commented Sep 1, 2022

> Possibly fixed by #154

superseded by #207, but unrelated AFAICS

> For the main purposes of OCR-D we should avoid (modifying) the depths of METS/MODS library style structural tagging whenever we can also rely on PAGE ReadingOrder.
>
> A solution for METS/MODS structural enrichment via external information available through our standard fileGrp mechanism is therefore imho the best solution for now.

Page-local reading order and structure are important both on their own and as contributors to document structure.

The latter (i.e. structure across pages like section boundaries and cross-refs/indexes) cannot be adequately represented in fileGrps, though. The only place for that is still the logical structMap IMHO. So far, we have two conventions for its representation:

  • the DFG profile for METS, i.e. mets:div with Strukturdatenset structural types, which are linked to the physical file structure via mets:structLink (i.e. only page-level granularity)
  • the ENMAP profile for METS, i.e. mets:area as exemplified above, allowing for direct references into page segments (either in the form of @COORDS or via idref-typed @BEGIN pointers into ALTO or PAGE segments)
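
The difference in granularity between the two conventions can be sketched in Python as follows. This is only an illustration, not a profile-conformant generator: the helper names `dfg_style` and `enmap_style` and all IDs are hypothetical, and the standard METS and XLink namespace URIs are assumed.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def dfg_style(log_id, div_type, page_ids):
    """Logical div plus smLinks: the links stop at whole physical pages."""
    div = ET.Element(f"{{{METS}}}div", {"ID": log_id, "TYPE": div_type})
    struct_link = ET.Element(f"{{{METS}}}structLink")
    for page_id in page_ids:
        ET.SubElement(struct_link, f"{{{METS}}}smLink", {
            f"{{{XLINK}}}from": log_id,
            f"{{{XLINK}}}to": page_id,
        })
    return div, struct_link


def enmap_style(log_id, div_type, file_id, segment_ids):
    """Logical div with fptr/area children: pointers reach into page segments."""
    div = ET.Element(f"{{{METS}}}div", {"ID": log_id, "TYPE": div_type})
    for segment_id in segment_ids:
        fptr = ET.SubElement(div, f"{{{METS}}}fptr")
        ET.SubElement(fptr, f"{{{METS}}}area", {
            "FILEID": file_id,
            "BETYPE": "IDREF",
            "BEGIN": segment_id,
        })
    return div


# Hypothetical IDs, for illustration only.
dfg_div, dfg_links = dfg_style("LOG_0002", "article", ["PHYS_0003", "PHYS_0004"])
enmap_div = enmap_style("LOG_0002", "article", "alto1", ["block18", "block19"])
print(ET.tostring(dfg_links, encoding="unicode"))
print(ET.tostring(enmap_div, encoding="unicode"))
```

In the first form the article can only say which pages it touches; in the second it names the exact ALTO or PAGE segments, at the cost of a structure that not every downstream consumer understands.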

The second convention is of course more powerful and general, but not as widely used.

In fact, it has been somewhat forgotten even in the context of newspaper digitization: even DDB Zeitungsportal has shied away from adopting it so far, despite listing the recording of article structure as a task in its grant proposal (AP 6 p.10) and in its master planning (Tiefenerschließung Artikelebene, p. 20). The latter document references ENMAP specifically, giving it a certain spin:

"ENMAP is a METS/ALTO profile for newspapers that was developed by the Europeana Newspapers project and that, in particular, contains useful guidance on fine-grained structuring of the formal and content-related components of newspapers. Please note, however, that elaborate fine-grained structuring may add value only in local environments and cannot be reused in supra-regional discovery services (e.g. DDB, Europeana)."

So we can see there is a chicken-and-egg problem here: automatic structural tagging is still hard (although tools for visualizing and detecting article structure are getting better), hence enriched datasets are rare, and therefore training is difficult. Not having everyone commit to the existing, agreed-upon unified representation makes this even more difficult.

But it's not just a matter of simply adopting the ENMAP spec: IMO it is not trivially compatible with the DFG profile.

However this is resolved, I do think it is worth pursuing some form of documentation and specification already, as an enabler for tool developers and data providers.

(For example, we could simply write some OCR-D processor extracting OLR results with headings and reading order into "coarse" document structure in either DFG-profile / mets:structLink or ENMAP / mets:area form already.)
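
A rough Python sketch of the core of such a processor, assuming that layout analysis already yields, per page, the regions in reading order with a type such as "heading" or "paragraph". The rule used here (every heading opens a new section that runs until the next heading, possibly across page boundaries) is a simplifying assumption, and `Section` and `coarse_sections` are hypothetical names, not part of any existing OCR-D API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Section:
    # Region ID of the heading that opened the section, if any.
    heading_id: Optional[str]
    # (page_id, region_id) pairs in reading order; may span several pages.
    segments: List[Tuple[str, str]] = field(default_factory=list)


def coarse_sections(pages):
    """Group regions into sections.

    pages: iterable of (page_id, [(region_id, region_type), ...]),
    with the regions of each page already in reading order.
    """
    sections = []
    current = None
    for page_id, regions in pages:
        for region_id, region_type in regions:
            is_heading = region_type == "heading"
            # Open a new section at every heading, and once at the very
            # start if the document begins without a heading.
            if is_heading or current is None:
                current = Section(region_id if is_heading else None)
                sections.append(current)
            current.segments.append((page_id, region_id))
    return sections


# Made-up input for illustration.
pages = [
    ("PHYS_0001", [("r1", "heading"), ("r2", "paragraph")]),
    ("PHYS_0002", [("r3", "paragraph"), ("r4", "heading"), ("r5", "paragraph")]),
]
sections = coarse_sections(pages)
```

Each resulting section could then be serialized in either form: the distinct page IDs of its segments for the mets:structLink variant, or the full list of segment IDs for the mets:area variant.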
