PDF issues: PDF Linearization data has bad errors #204

Closed
ronaldtse opened this issue May 29, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@ronaldtse
Contributor

PDF Linearization data has bad errors - this is optional data so you could always not generate until the bugs are fixed.

From #201

@Intelligent2013
Contributor

To solve this issue, I'm trying to understand how to catch errors in the PDF linearization data.
I didn't find these issues in the TestGrammar log. Maybe I have to use another tool.

@Intelligent2013
Contributor

Checking the linearization data with the qpdf tool:

"C:\Program Files\qpdf 11.4.0\bin\qpdf.exe" --check-linearization rice-en.final.presentation.pdf
WARNING: rice-en.final.presentation.pdf: error encountered while checking linearization data: overflow reading bit stream: wanted = 32; available = 24
qpdf: operation succeeded with warnings
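
The same check could be scripted so it runs on every generated PDF. The following is only a minimal sketch: the helper names `build_check_command` and `check_linearization` are hypothetical, it assumes `qpdf` is on the PATH (or that the full path to the executable is passed in), and it relies on qpdf's documented exit codes (0 for no problems, 2 for errors, 3 for warnings only).

```python
import subprocess

# Exit codes documented by qpdf: 0 = clean, 2 = errors, 3 = warnings only.
EXIT_STATUS = {0: "ok", 2: "errors", 3: "warnings"}


def build_check_command(pdf_path, qpdf_exe="qpdf"):
    """Return the argument list for a linearization check.

    Passing a list (not a single string) avoids shell quoting problems
    with paths that contain spaces, such as "C:\\Program Files\\...".
    """
    return [qpdf_exe, "--check-linearization", str(pdf_path)]


def check_linearization(pdf_path, qpdf_exe="qpdf"):
    """Run qpdf and return (status, warning_lines). Hypothetical helper."""
    result = subprocess.run(
        build_check_command(pdf_path, qpdf_exe),
        capture_output=True,
        text=True,
    )
    status = EXIT_STATUS.get(result.returncode, "unknown")
    # Diagnostics may appear on either stream, so scan both.
    lines = (result.stdout + "\n" + result.stderr).splitlines()
    warnings = [line.strip() for line in lines if line.startswith("WARNING:")]
    return status, warnings
```

Calling `check_linearization("rice-en.final.presentation.pdf")` would then return the status together with the list of `WARNING:` lines, which a build script could use to fail or merely log.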

@Intelligent2013
Contributor

For the PDF rice-en.final.presentation.pdf that contains only two pages (cover and inner page):

WARNING: rice-en.final.presentation.pdf: end of first page section (/E) mismatch: /E = 103552; computed = 132030..132031
WARNING: rice-en.final.presentation.pdf: first page object offset mismatch
WARNING: rice-en.final.presentation.pdf: object count mismatch for page 0: hint table = 47; computed = 49
WARNING: rice-en.final.presentation.pdf: page 1: shared object 9: in computed list but not hint table
WARNING: rice-en.final.presentation.pdf: page 1: shared object 10: in computed list but not hint table
WARNING: rice-en.final.presentation.pdf: page 1: shared object 104: in computed list but not hint table
WARNING: rice-en.final.presentation.pdf: page 1: shared object 105: in computed list but not hint table
...
WARNING: rice-en.final.presentation.pdf: page 1: shared object 146: in computed list but not hint table
WARNING: rice-en.final.presentation.pdf: page 1: shared object 147: in computed list but not hint table
qpdf: operation succeeded with warnings
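
With dozens of near-identical "shared object" lines, it can help to condense the log before reading it. The sketch below groups warnings of the shape shown above; the function name `summarize` and the regular expressions are assumptions based only on the lines in this comment, not on any stable qpdf output format.

```python
import re
from collections import Counter

# "WARNING: <file>: <message>" as seen in the qpdf output above.
WARNING_RE = re.compile(r"^WARNING: (?P<file>.+?): (?P<message>.+)$")
# "page N: shared object M: <detail>" messages.
SHARED_RE = re.compile(r"^page (?P<page>\d+): shared object (?P<obj>\d+): (?P<detail>.+)$")


def summarize(log_text):
    """Return (counts per warning kind, shared object ids per page)."""
    counts = Counter()
    shared_by_page = {}
    for raw in log_text.splitlines():
        match = WARNING_RE.match(raw.strip())
        if not match:
            continue  # skips "..." and the final "qpdf: operation ..." line
        message = match.group("message")
        shared = SHARED_RE.match(message)
        if shared:
            counts["shared object mismatch"] += 1
            page = int(shared.group("page"))
            shared_by_page.setdefault(page, []).append(int(shared.group("obj")))
        else:
            # Use the text before the first colon as the warning kind.
            counts[message.split(":", 1)[0]] += 1
    return counts, shared_by_page
```

Feeding it the output above would collapse the per-object lines into one count per page, leaving the `/E` mismatch and the object count mismatch easier to spot.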

@Intelligent2013
Contributor

I ran a simple experiment to compare a PDF generated by Adobe Acrobat against qpdf's checking feature:

No linearization (Fast Web View: no)
qpdf also returns: word_simple_pdf.pdf is not linearized

  • In Adobe Acrobat: File -> Save as Other -> Optimized PDF ... , then on the Clean Up tab set:
    [screenshot of the Clean Up tab settings]

word_simple_pdf_linearized.pdf

  • qpdf returns ("C:\Program Files\qpdf 11.4.0\bin\qpdf.exe" --check-linearization word_simple_pdf_linearized.pdf):
WARNING: word_simple_pdf_linearized.pdf: linearized file contains an uncompressed object after a compressed one in a cross-reference stream
WARNING: word_simple_pdf_linearized.pdf: object count mismatch for page 0: hint table = 5; computed = 4
WARNING: word_simple_pdf_linearized.pdf: page 0 has shared identifier entries
WARNING: word_simple_pdf_linearized.pdf: page 0: shared object 9: in hint table but not computed list
qpdf: operation succeeded with warnings

That is, qpdf also reports warnings for the PDF generated by Adobe Acrobat.

There are two possibilities:

  • Adobe Acrobat generates incorrect linearization data
  • qpdf's linearization check is wrong

@Intelligent2013
Contributor

PDF Linearization data has bad errors - this is optional data so you could always not generate until the bugs are fixed

@petervwyatt to be on the same page, which tool did you use to check the PDF linearization data?

@petervwyatt

The easiest is probably QPDF (https://github.com/qpdf/qpdf/releases) using qpdf --check <file.pdf>.

Given that standards are usually official "documents of record" and that all versions of PDF/A explicitly prohibit Linearized PDF for valid technical reasons, I would strongly recommend not bothering to output it at all. A lot of implementations just ignore it anyway because (a) it is often wrong or out-of-date; (b) what was previously documented and what major vendors implemented differed anyway (only corrected in the PDF 2.0 spec); (c) there is no requirement for PDF processors to implement it (i.e. it is entirely optional); and (d) it is a known source of "parser differential" vulnerabilities. With today's super-fast internet speeds (vs 25 years ago when it was invented!) and a modern efficient PDF (i.e. compressed cross-reference streams and compressed object streams), 95% of PDFs won't benefit.
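
If linearization output is switched off as suggested, a quick sanity check is to confirm that the generated file no longer declares itself linearized. The sketch below is a rough heuristic, not a validator: it relies on the PDF specification's rule that the linearization parameter dictionary sits entirely within the first 1024 bytes of the file, and the function names are hypothetical. `qpdf --check` remains the more reliable test.

```python
def looks_linearized(header: bytes) -> bool:
    """Heuristic: is a /Linearized key present in the first 1024 bytes?"""
    return b"/Linearized" in header[:1024]


def file_looks_linearized(path) -> bool:
    """Apply the heuristic to a file on disk (hypothetical helper)."""
    with open(path, "rb") as handle:
        return looks_linearized(handle.read(1024))
```

A file that passes this test only claims to be linearized; as the warnings earlier in this thread show, the hint data behind that claim can still be wrong.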

@Intelligent2013
Contributor

@petervwyatt thank you!
