Spaces in between words #33

Open
MaxDaten opened this issue May 12, 2017 · 4 comments

@MaxDaten

Hi,

Can I send you a PDF file via email that shows the problem above? I have a problem extracting text from a PDF file: on some pages, words are broken up by spaces. For example:

"Der Bri ga di er be ob ach te te das Spek ta kel grim mig wie ein"

instead of:

"Der Brigadier beobachtete das Spektakel grimmig wie ein"

No problems with pdftotext version 3.03.

I'm not allowed to upload the pdf publicly.

Greetings

Jan

@Yuras
Owner

Yuras commented May 12, 2017

Sure, see the email address in my profile. Are you using the stable version from Hackage?
Note that we use heuristics to add spaces, and they are not very reliable. There is a magic constant, 1.8, which should probably be the default glyph width from the font instead. If the text in question is large, you can try increasing the constant. If that fixes the issue for you, then we have to extract the default glyph width from the font and use it instead of the constant.
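For readers unfamiliar with this kind of heuristic, here is a minimal Python sketch of gap-based space insertion. It is not the pdf-toolbox code (the library is written in Haskell). The `Glyph` tuple, the function name `join_glyphs`, and the way the threshold is compared against the gap are all assumptions made for illustration; only the value 1.8 comes from the comment above.

```python
from typing import List, Optional, Tuple

# Hypothetical glyph record: (text, left x, right x) in some text-space unit.
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], min_gap: float = 1.8) -> str:
    """Concatenate glyph texts, inserting a space whenever the horizontal
    gap between two consecutive glyphs exceeds min_gap."""
    parts: List[str] = []
    prev_right: Optional[float] = None
    for text, left, right in glyphs:
        # A gap wider than the threshold is treated as a word boundary.
        if prev_right is not None and left - prev_right > min_gap:
            parts.append(" ")
        parts.append(text)
        prev_right = right
    return "".join(parts)


# Illustrative data only: "Der" set tightly, then a gap of 3 units before "B".
sample = [("D", 0.0, 5.0), ("e", 5.0, 9.0), ("r", 9.0, 12.0), ("B", 15.0, 20.0)]
print(join_glyphs(sample))
```

A fixed threshold like this does not scale with the font size, which is why deriving it from the font's glyph width is suggested as the more robust choice.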

@MaxDaten
Author

Thx, I just sent you an email.

> Are you using the stable version from hackage?

Yes, I'm using pdf-toolbox-document 0.0.7.1.

@Yuras
Owner

Yuras commented May 14, 2017

In fact, the magic constant has nothing to do with this. The file actually contains these spaces, but the next glyph starts at the beginning of the space, overriding it. I sent the details by email.

pdftotext probably has some heuristics to eliminate such fake spaces. Evince (the PDF viewer for the GNOME desktop), for example, does extract them when I select and copy text from the file.

I don't know whether it makes sense to add such heuristics to pdf-toolbox. I'll accept a PR if it is not too complicated.
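One way such a heuristic could look is sketched below in Python. This is a guess at the general idea, not what pdftotext actually does and not pdf-toolbox code. The `Glyph` tuple, the function name `drop_overlapped_spaces`, and the `tolerance` parameter are hypothetical; the rule is simply "drop a space glyph if the next glyph starts at (or before) the space's own start position".

```python
from typing import List, Tuple

# Hypothetical glyph record: (text, left x, right x) in some text-space unit.
Glyph = Tuple[str, float, float]


def drop_overlapped_spaces(glyphs: List[Glyph], tolerance: float = 0.01) -> List[Glyph]:
    """Remove space glyphs that the following glyph is drawn on top of."""
    kept: List[Glyph] = []
    for i, glyph in enumerate(glyphs):
        text, left, _right = glyph
        if text == " " and i + 1 < len(glyphs):
            next_left = glyphs[i + 1][1]
            # The next glyph starts where the space starts, so the space
            # is never visible on the page: treat it as fake.
            if next_left <= left + tolerance:
                continue
        kept.append(glyph)
    return kept


# Illustrative data only: the first space is overdrawn by "g", the second is real.
sample = [
    ("B", 0.0, 5.0), ("r", 5.0, 8.0), ("i", 8.0, 10.0),
    (" ", 10.0, 13.0), ("g", 10.0, 15.0), ("a", 15.0, 19.0),
    (" ", 19.0, 22.0), ("w", 22.0, 28.0),
]
print("".join(text for text, _, _ in drop_overlapped_spaces(sample)))
```

The sketch only handles glyphs on a single line in left-to-right order; a real implementation would also need to account for line breaks, rotated text, and right-to-left scripts.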

@MaxDaten
Author

Thank you for the investigation. It is reasonable to be skeptical about a heuristic for such cases. I have worked with around 100-200 documents, and this specific document is the first with this non-ideal glyph pattern. Currently my time budget is far too stretched to contribute a PR, but we are allocating new resources, so maybe in the near future. For this document I will wrap pdftotext, but otherwise I will stick with this library. Thanks for developing it!
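For anyone who needs the same workaround, wrapping pdftotext could look roughly like the Python sketch below. The function names are hypothetical, and it assumes a `pdftotext` binary is on the `PATH`; passing `-` as the output file makes pdftotext write the extracted text to standard output.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build the command line; "-" sends the extracted text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on one file and return its output as a string.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Keeping the command construction separate from the process call makes it easy to add flags later without touching the error handling.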
