PDF issues: PDF Linearization data has bad errors #204

ronaldtse · 2023-05-29T13:59:08Z

The PDF Linearization data has serious errors. This data is optional, so you could simply not generate it until the bugs are fixed.

From #201

Intelligent2013 · 2023-06-01T17:15:37Z

To solve this issue, I'm trying to understand how to catch errors in the PDF Linearization data.
I didn't find these issues in the TestGrammar log, so I may have to use another tool.

Intelligent2013 · 2023-06-01T18:32:28Z

Checking the linearization data with the qpdf tool:

"C:\Program Files\qpdf 11.4.0\bin\qpdf.exe" --check-linearization rice-en.final.presentation.pdf
WARNING: rice-en.final.presentation.pdf: error encountered while checking linearization data: overflow reading bit stream: wanted = 32; available = 24
qpdf: operation succeeded with warnings
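
To repeat this check over several generated PDFs, the call can be wrapped in a small script. The following is only a sketch: the helper names (`classify`, `check_linearization`) are made up for illustration, it assumes `qpdf` is on the PATH, and it relies on qpdf's documented exit codes as I understand them (0 = clean, 2 = errors, 3 = warnings only).

```python
import subprocess

# Assumption: qpdf exit codes are 0 (no problems), 2 (errors), 3 (warnings only).
QPDF_STATUS = {0: "ok", 2: "errors", 3: "warnings"}


def classify(returncode):
    """Map a qpdf exit code to a short status label."""
    return QPDF_STATUS.get(returncode, "unknown")


def check_linearization(pdf_path, qpdf="qpdf"):
    """Run `qpdf --check-linearization` and return (status, warning lines)."""
    proc = subprocess.run(
        [qpdf, "--check-linearization", pdf_path],
        capture_output=True,
        text=True,
    )
    # Collect WARNING lines from both streams so nothing is missed.
    output = (proc.stdout + proc.stderr).splitlines()
    warnings = [line for line in output if line.startswith("WARNING:")]
    return classify(proc.returncode), warnings
```

For the run above, "operation succeeded with warnings" should correspond to the `"warnings"` status, with the overflow message as the only warning line.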

Intelligent2013 · 2023-06-08T17:16:30Z

For the PDF rice-en.final.presentation.pdf that contains only two pages (cover and inner page):

WARNING: rice-en.final.presentation.pdf: end of first page section (/E) mismatch: /E = 103552; computed = 132030..132031
WARNING: rice-en.final.presentation.pdf: first page object offset mismatch
WARNING: rice-en.final.presentation.pdf: object count mismatch for page 0: hint table = 47; computed = 49
WARNING: rice-en.final.presentation.pdf: page 1: shared object 9: in computed list but not hint table
WARNING: rice-en.final.presentation.pdf: page 1: shared object 10: in computed list but not hint table
WARNING: rice-en.final.presentation.pdf: page 1: shared object 104: in computed list but not hint table
WARNING: rice-en.final.presentation.pdf: page 1: shared object 105: in computed list but not hint table
...
WARNING: rice-en.final.presentation.pdf: page 1: shared object 146: in computed list but not hint table
WARNING: rice-en.final.presentation.pdf: page 1: shared object 147: in computed list but not hint table
qpdf: operation succeeded with warnings
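
Because most of these lines differ only in the object number, it can help to group them by kind before comparing runs. A minimal sketch, assuming the log format shown above (`WARNING: <file>: <message>`) and that the file name itself contains no `": "`; the function name `summarize_warnings` is hypothetical:

```python
import re
from collections import Counter

PREFIX = "WARNING: "


def summarize_warnings(log_text):
    """Count qpdf warnings by message kind, with all numbers replaced by N."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith(PREFIX):
            continue  # skip the final "qpdf: operation succeeded..." line etc.
        # Split "<file>: <message>" at the first ": " after the prefix.
        _file, sep, message = line[len(PREFIX):].partition(": ")
        if not sep:
            continue
        # Normalize page/object numbers so similar warnings share one key.
        counts[re.sub(r"\d+", "N", message)] += 1
    return counts
```

Applied to the log above, all the "shared object ... in computed list but not hint table" lines collapse into a single key, which makes the handful of distinct problems easier to see.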

Intelligent2013 · 2023-06-08T17:38:25Z

I ran a simple experiment to compare a PDF generated by Adobe Acrobat against the qpdf checking feature:

1. Create a simple document in Word (word_simple.docx).
2. In Word: File -> Save as -> Save as type: PDF with options (word_simple_pdf.pdf).
3. Open the generated PDF in Adobe Acrobat: no linearization (Fast Web View: no). qpdf also returns: word_simple_pdf.pdf is not linearized
4. In Adobe Acrobat: File -> Save as Other -> Optimized PDF ... , then on the tab Clean Up set:

   (word_simple_pdf_linearized.pdf)

5. qpdf returns ("C:\Program Files\qpdf 11.4.0\bin\qpdf.exe" --check-linearization word_simple_pdf_linearized.pdf):

WARNING: word_simple_pdf_linearized.pdf: linearized file contains an uncompressed object after a compressed one in a cross-reference stream
WARNING: word_simple_pdf_linearized.pdf: object count mismatch for page 0: hint table = 5; computed = 4
WARNING: word_simple_pdf_linearized.pdf: page 0 has shared identifier entries
WARNING: word_simple_pdf_linearized.pdf: page 0: shared object 9: in hint table but not computed list
qpdf: operation succeeded with warnings

I.e. qpdf reports warnings even for the PDF generated by Adobe Acrobat.

There are two possible explanations:

- Adobe Acrobat generates an incorrect PDF
- qpdf's check is wrong

Intelligent2013 · 2023-06-08T17:41:00Z

The PDF Linearization data has serious errors. This data is optional, so you could simply not generate it until the bugs are fixed.

@petervwyatt to be on the same page, which tool did you use for checking the PDF Linearization data?

petervwyatt · 2023-06-09T01:19:57Z

The easiest is probably QPDF (https://github.com/qpdf/qpdf/releases) using qpdf --check <file.pdf>.

Given that standards are usually official "documents of record" and that all versions of PDF/A explicitly prohibit Linearized PDF for valid technical reasons, I would strongly recommend not bothering to output it at all. A lot of implementations just ignore it anyway because (a) it is often wrong or out-of-date; (b) what was previously documented vs. what major vendors implemented was different anyway (only corrected in the PDF 2.0 spec); (c) there is no requirement for PDF processors to implement it (i.e. it is entirely optional); and (d) it is a known source of "parser differential" vulnerabilities. With today's super-fast internet speeds (vs. 25 years ago when it was invented!) and a modern efficient PDF (i.e. compressed cross-reference streams and compressed object streams), 95% of PDFs won't benefit.
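
If linearization is switched off in the generator, a quick way to confirm that output files no longer carry it is to look for the linearization parameter dictionary, which the PDF specification requires to sit within the first 1024 bytes of a linearized file. The sketch below is a rough heuristic under that assumption, not a validator; `looks_linearized` is a hypothetical helper name, and `qpdf --check-linearization` remains the more reliable check.

```python
import re

# A linearized file starts with a dictionary containing "/Linearized <version>".
LINEARIZED_RE = re.compile(rb"/Linearized\s*[\d.]")


def looks_linearized(data: bytes) -> bool:
    """Heuristic: is a /Linearized key present in the first 1024 bytes?"""
    return LINEARIZED_RE.search(data[:1024]) is not None


def file_looks_linearized(path: str) -> bool:
    """Read only the head of the file and apply the heuristic."""
    with open(path, "rb") as handle:
        return looks_linearized(handle.read(1024))
```

Only the head of the file is read, so this is cheap enough to run over a whole output directory.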

Intelligent2013 · 2023-06-09T07:42:38Z

@petervwyatt thank you!

Intelligent2013 · 2023-06-12T20:36:10Z

Fixed in https://github.com/metanorma/mn2pdf/tree/v1.73

ronaldtse added the bug Something isn't working label May 29, 2023

ronaldtse assigned Intelligent2013 May 29, 2023

ronaldtse mentioned this issue May 29, 2023

PDF tagging issues discovered by the PDF Association experts #201

Open

8 tasks

Intelligent2013 added a commit that referenced this issue Jun 9, 2023

Apache FOP config updated, linearization turned off, #204

6ae29df

Intelligent2013 closed this as completed Jun 12, 2023

Intelligent2013 mentioned this issue Aug 6, 2023

PDF sizes very large compared to original ISO PDFs metanorma/metanorma-iso#952

Closed
