Spaces in between words #33

MaxDaten · 2017-05-12T14:22:39Z

Hi,

Can I send you a PDF file by email that shows the problem above? I have a problem extracting text from a PDF file: on some pages, words are broken up by spaces. For example:

"Der Bri ga di er be ob ach te te das Spek ta kel grim mig wie ein"

instead of:

"Der Brigadier beobachtete das Spektakel grimmig wie ein"

No problems with pdftotext version 3.03.

I'm not allowed to upload the pdf publicly.

Greetings

Jan


Yuras · 2017-05-12T15:17:58Z

Sure, see email in the profile. Are you using the stable version from hackage?
Note that we use heuristics to add spaces, and they are not very reliable. There is a magic constant, 1.8, which should probably be the default glyph width from the font. If the text in question is large, you can try increasing the constant. If that fixes the issue for you, then we have to extract the default glyph width from the font and use it instead of the constant.
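To illustrate the kind of heuristic described above, here is a minimal sketch in Python. It is not pdf-toolbox's actual code: the `Glyph` record, the `join_glyphs` function, and the way the threshold is compared against the horizontal gap are all assumptions made for illustration. Only the value 1.8 comes from this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical glyph record: text plus horizontal position and advance."""
    text: str
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph


def join_glyphs(glyphs: List[Glyph], gap_threshold: float = 1.8) -> str:
    """Concatenate glyph texts, inserting a space wherever the horizontal
    gap between two consecutive glyphs exceeds gap_threshold.

    The fixed threshold is the weak point: it does not scale with font
    size, which is why a per-font default glyph width would work better.
    """
    parts = []
    prev_right = None  # right edge of the previous glyph
    for g in glyphs:
        if prev_right is not None and g.x - prev_right > gap_threshold:
            parts.append(" ")
        parts.append(g.text)
        prev_right = g.x + g.width
    return "".join(parts)
```

With a larger threshold, fewer gaps are treated as word boundaries, which is the effect of increasing the constant for big text.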

MaxDaten · 2017-05-13T00:27:56Z

Thx, I just sent you an email.

> Are you using the stable version from hackage?

Yes, I'm using pdf-toolbox-document 0.0.7.1

Yuras · 2017-05-14T11:07:48Z

Indeed, the magic constant has nothing to do with this. The file actually contains these spaces, but the next glyph starts at the beginning of the space, overriding it. I've sent the details by email.

Probably pdftotext has some heuristics to eliminate such fake spaces. But for example Evince (PDF viewer for GNOME Desktop) extracts them when I select and copy text from the file.

I don't know whether it makes sense to add such heuristics to pdf-toolbox. I'll accept a PR if it is not too complicated.
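For reference, one possible shape of such a heuristic is sketched below in Python. This is only an illustration under assumptions: the `Glyph` record, the `drop_overlapped_spaces` function, and the `tolerance` value are hypothetical, and nothing here reflects how pdftotext or pdf-toolbox actually work. The idea is to drop a space glyph when the next glyph starts at (almost) the same x position, since that glyph is drawn over the space.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical glyph record: text plus horizontal position and advance."""
    text: str
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph


def drop_overlapped_spaces(glyphs: List[Glyph], tolerance: float = 0.1) -> List[Glyph]:
    """Remove space glyphs that the following glyph is drawn on top of.

    A space counts as "fake" when the next non-space glyph starts within
    `tolerance` of the space's own left edge. The sketch assumes all
    glyphs are on one line and ignores vertical position.
    """
    kept = []
    for i, g in enumerate(glyphs):
        nxt = glyphs[i + 1] if i + 1 < len(glyphs) else None
        is_fake_space = (
            g.text == " "
            and nxt is not None
            and nxt.text != " "
            and abs(nxt.x - g.x) <= tolerance
        )
        if not is_fake_space:
            kept.append(g)
    return kept


def glyphs_to_text(glyphs: List[Glyph]) -> str:
    """Join the remaining glyph texts in order."""
    return "".join(g.text for g in glyphs)
```

A real space, where the next glyph starts after the space's advance width, is left in place, so ordinary word boundaries survive.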

MaxDaten · 2017-05-15T11:31:08Z

Thank you for investigating; it is reasonable to be skeptical about a heuristic for such cases. I have worked with around 100-200 documents, and this specific document is the first with this non-ideal glyph pattern. Currently, my time budget is far too overstretched to commit to a PR, but we are allocating new resources, so maybe in the near future. For this document I will wrap pdftotext, but otherwise I will stick with this library. Thanks for developing it!
