Update version of pdfminer-six to 20240706 #1166

Open
ValentinaGalataAA opened this issue Jul 8, 2024 · 10 comments
@ValentinaGalataAA

Please update the version of pdfminer-six to 20240706.

@ValentinaGalataAA ValentinaGalataAA added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Jul 8, 2024
@jsvine
Owner

jsvine commented Jul 14, 2024

There seems to be a bug in the latest release — pdfminer/pdfminer.six#1004 — which also happens to be throwing errors in pdfplumber's test suite. I'll keep an eye out for pdfminer.six's next release, which hopefully fixes the bug.

@dhdaines
Contributor

There seems to be a bug in the latest release — pdfminer/pdfminer.six#1004 — which also happens to be throwing errors in pdfplumber's test suite. I'll keep an eye out for pdfminer.six's next release, which hopefully fixes the bug.

I fixed the bug :) pdfminer/pdfminer.six#1027 hopefully it gets released soon!

@jsvine
Owner

jsvine commented Jul 31, 2024

@dhdaines Wonderful, thanks!

@chenxi-briink

chenxi-briink commented Aug 13, 2024

@jsvine would you consider upgrading this dependency before the next release of pdfminer.six?

  • pdfminer.six has a release cycle of about 5-6 months, so it could mean another 5 months until the next release, which is a bit too long imo
  • the current version throws similar errors too, which is what I encountered (please see below)

The project I'm working on uses pdfplumber in production, and when parsing the following PDF
https://www.ge.com/sites/default/files/ge2021_sustainability_report.pdf, it raises TypeError: 'PDFObjRef' object is not iterable

I tested locally that pdfminer.six 20240706 could solve the issue. (I forced pdfplumber 0.10.2 and pdfminer.six 20240706 to coexist in order to verify it. However I couldn't do that in the project code because poetry is used there)

@jsvine
Owner

jsvine commented Aug 18, 2024

Hi @chenxi-briink, can you try upgrading pdfplumber to the latest version, 0.11.3? Using that version, I'm able to parse the PDF you've cited with no problems/errors.

@chenxi-briink

chenxi-briink commented Aug 19, 2024

Hi @jsvine,

Sorry, I mistyped the version number in my previous message.

"I forced pdfplumber 0.10.2 and pdfminer.six 20240706"

should be: I forced pdfplumber 0.11.3 and pdfminer.six 20240706 to coexist.

Yes, that combination works for me.

However, the issue is that pdfplumber's requirements.txt depends on pdfminer.six 20231228, and it is that version that throws this exception:

File ~/foo/bar/.venv/lib/python3.11/site-packages/pdfminer/pdftypes.py:373, in PDFStream.decode(self)
    371     raise PDFNotImplementedError("Unsupported filter: %r" % f)
    372 # apply predictors
--> 373 if params and "Predictor" in params:
    374     pred = int_value(params["Predictor"])
    375     if pred == 1:
    376         # no predictor

TypeError: argument of type 'PDFObjRef' is not iterable

In my production environment, which uses poetry, I couldn't override the stated pdfminer.six version 20231228.
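
One possible reading of the traceback above is that `params` is still an indirect reference rather than the dictionary it points to, so the `in` test fails before any lookup happens. The sketch below reproduces that failure mode in isolation and shows how resolving the reference first avoids it. `FakeObjRef`, `resolve_ref` and `has_predictor` are hypothetical stand-ins written for illustration; they are not pdfminer.six's classes, and this is not necessarily how pdfminer/pdfminer.six#1027 addresses the problem.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect reference: it points at a
    value but is not itself a container, so `in` cannot be applied to it."""

    def __init__(self, target):
        self.target = target

    def resolve(self):
        return self.target


def resolve_ref(obj):
    """Follow indirect references until a concrete value is reached."""
    while isinstance(obj, FakeObjRef):
        obj = obj.resolve()
    return obj


def has_predictor(params):
    # Resolve first, so the membership test runs against the dict
    # the reference points to rather than the reference object.
    params = resolve_ref(params)
    return bool(params) and "Predictor" in params


params = FakeObjRef({"Predictor": 12})

error_message = None
try:
    "Predictor" in params  # mirrors the check on line 373 of the traceback
except TypeError as exc:
    error_message = str(exc)

print(error_message)          # the TypeError raised by the unresolved check
print(has_predictor(params))  # the resolved check succeeds
```

The stand-in only illustrates why a membership test on an unresolved reference raises `TypeError`; the real object model in pdfminer.six is more involved.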

@jsvine
Owner

jsvine commented Aug 19, 2024

Hi @chenxi-briink and thanks for the clarification. That's strange; I'm running the exact same combination and seeing no error. First, I set up this fresh environment:

python -m venv venv
source venv/bin/activate
pip install pdfplumber==0.11.3
pip freeze | grep pdf

... which outputs:

pdfminer.six==20231228
pdfplumber==0.11.3
pypdfium2==4.30.0

Then I ran this:

import pdfplumber

pdf = pdfplumber.open("./ge2021_sustainability_report.pdf")

for page in pdf.pages:
    assert len(pdf.objects)

... which completed without error.
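
Since the comparison above hinges on which versions are actually installed, the same check as `pip freeze | grep pdf` can be made from inside the interpreter that runs the parsing code. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of pdfplumber or pdfminer.six.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["pdfplumber", "pdfminer.six"]).items():
        print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

Running this in both environments makes it easier to confirm that the two sides of the conversation are testing the same combination.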

@chenxi-briink

Hi @jsvine,

Gee, while trying to replicate what you posted, I realised that the file I had turned out to be a modified version of the publicly available one I shared with you. With this modified file, the exception occurs when following the same steps you shared.
(Sorry that I didn't double-check; I didn't expect there to be a modified version.)

I uploaded this file to a publicly accessible GDrive folder. Basically, it's a shortened version of the original GE 2021 Sustainability Report. A PDF viewer can render it without problems.

@jsvine
Owner

jsvine commented Aug 19, 2024

Thanks for providing the updated PDF, @chenxi-briink. Using that one, I can indeed replicate the error.

In this case, however, I don't plan on upgrading the dependency until at least the next pdfminer.six release. Although doing so might fix your situation, it would likely break others (as confirmed by pdfplumber's test suite). @dhdaines's fix in pdfminer/pdfminer.six#1027 handles your PDF well; perhaps you can use his fork in the meantime?

As context: pdfminer.six is a pinned dependency in pdfplumber because changes to that library can break this one. I realize it can cause issues when someone wants to use a different specific version of pdfminer.six, but that tradeoff is preferable to all new installations of pdfplumber breaking.

@chenxi-briink

Hi @jsvine, I totally understand the rationale for not upgrading. Thanks for explaining and for pointing me to @dhdaines's fork; I might find some time to give it a try.
