Strip formatting #1

Open
bamnet opened this issue Jul 26, 2011 · 1 comment

Comments

@bamnet
Contributor

bamnet commented Jul 26, 2011

When I run the PDF tests I get output that looks like this:

Textractor returns the contents of pdf documents
Failure/Error: Textractor.text_from_path(fixture_path("document.pdf")).should == 'text'
expected: "text",
got: "text\t\r \302\240 \t\r \302\240" (using ==)

My pdftotext version must handle formatting characters differently from yours. Do you think this is something textractor should handle?

In my use case I never care about document formatting. I only want strings separated by spaces, with a limited subset of punctuation (i.e. periods and commas), for use in indexing documents. I don't mind handling this in each application, but I'd be glad to write it into textractor if you think there's value in that.
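
A minimal sketch of the kind of per-application cleanup described above, assuming UTF-8 input. The method name `normalize_for_indexing` is hypothetical and not part of textractor; it only illustrates reducing extracted text to words, periods, and commas separated by single spaces.

```ruby
# Hypothetical helper for illustration; not part of textractor's API.
def normalize_for_indexing(text)
  # Replace every run of characters that are not letters, digits,
  # periods, or commas with a single space. This covers tabs, carriage
  # returns, and the no-break space (U+00A0) seen in the failing spec.
  cleaned = text.gsub(/[^[:alnum:].,]+/, " ")
  # Only plain ASCII spaces can remain at the edges, so String#strip suffices.
  cleaned.strip
end

puts normalize_for_indexing("Hello,\tworld.\n\u00A0Next (line)!")
```

Note that this also drops characters such as hyphens and apostrophes, which may or may not be acceptable depending on how the index tokenizes words.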

@mguterl
Owner

mguterl commented Jul 29, 2011

I'm going to have to think about this some more. It looks like all of the extractors call String#strip to remove trailing whitespace. Those characters appear to be whitespace of some kind, and if that is the case I'm fine with coming up with a replacement for String#strip that removes these characters too.
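
One possible shape for such a replacement, as a sketch rather than what textractor actually does. The bytes `\302\240` in the failing output are the UTF-8 encoding of U+00A0 (no-break space), which `String#strip` does not remove. The method name `unicode_strip` is hypothetical, and the example assumes Ruby 1.9 or later with UTF-8 strings, where `[[:space:]]` also matches Unicode whitespace.

```ruby
# Hypothetical replacement for String#strip; not part of textractor's API.
# Assumes the input is a valid UTF-8 string.
def unicode_strip(text)
  # Remove leading and trailing runs of whitespace, including U+00A0,
  # while leaving whitespace inside the text untouched.
  text.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

raw = "text\t\r \u00A0 \t\r \u00A0"
puts raw.strip.inspect          # the trailing no-break space blocks String#strip
puts unicode_strip(raw).inspect
```

Because only the edges are trimmed, interior formatting is preserved, so this keeps the existing behavior of the extractors apart from the wider definition of whitespace.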
