UnicodeDecodeError while parsing the PDF files. #17

Open
adityardesai opened this issue Apr 23, 2016 · 4 comments

Comments

@adityardesai

adityardesai commented Apr 23, 2016

Hi

I am using the NLTKRest server to parse a few of the PDF files from Polar Trec Data and extract the required NER quantities. For most of the PDF files, however, the REST server returns the following error:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128) // Werkzeug Debugger "

The command used is:
curl -X POST -d "PDF TEXT in STRING" http://localhost:8888/nltk
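
For anyone who wants to control the request encoding explicitly, here is a rough Python 3 equivalent of the curl call. It is only a sketch: `build_request` is a hypothetical helper, the sample text is made up, and the `Content-Type` header is an assumption (curl's `-d` sends `application/x-www-form-urlencoded` by default, and it is not confirmed here which content type the server expects). The request is built but not sent, so the snippet runs without a server.

```python
import urllib.request


def build_request(text, url="http://localhost:8888/nltk"):
    """Build (but do not send) a POST request whose body is UTF-8 bytes."""
    body = text.encode("utf-8")  # explicit encoding, nothing left implicit
    return urllib.request.Request(
        url,
        data=body,
        # Assumed header; adjust to whatever the server actually expects.
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


# Hypothetical sample text containing non-ASCII characters.
req = build_request("Glacier été")
print(req.get_method(), req.full_url, req.data)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

Printing `req.data` makes it easy to check which bytes (including any 0xc3 lead bytes) actually go over the wire.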

The error file is attached as well.
nltkrest.txt
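
For context on the message itself: 0xc3 is the lead byte of a two-byte UTF-8 sequence (for example, "é" is 0xc3 0xa9), so the error indicates that UTF-8 encoded bytes were decoded with the ASCII codec. The sketch below reproduces the same message in Python 3. The sample text is hypothetical (chosen so the non-ASCII byte falls at position 8) and is not taken from the attached file; `try_ascii_decode` is an illustrative helper, not part of NLTKRest.

```python
# Hypothetical sample text; the first non-ASCII character sits at byte offset 8.
raw = "Glacier été".encode("utf-8")


def try_ascii_decode(data):
    """Return the error message raised by an ASCII decode, or None on success."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


message = try_ascii_decode(raw)
print(message)
```

In Python 2, the same decode can happen implicitly when a byte string is combined with a unicode string, which is why the traceback may point at a line that contains no explicit `decode` call.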

@manalishah
Collaborator

manalishah commented Apr 23, 2016

Yes, that's true, @adityardesai.
You might want to use this patch until it's merged: chrismattmann#7
Or you could simply build the 'encoding-issue' branch from source.

@adityardesai
Author

Thanks for letting us know, @manalishah. I tried the given patch, but I am still seeing the same error. Am I missing any steps apart from adding
tokenized = nltk.word_tokenize(content.decode("utf-8")) to server.py? Are there any specific build commands to run?
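
The one-line change quoted above decodes the request body as UTF-8 before tokenizing. A slightly more defensive variant is sketched below, assuming `content` may arrive either as bytes or as already-decoded text. `to_text` is a hypothetical helper, not part of NLTKRest, the snippet is written for Python 3, and it is not a confirmed fix for this issue; its result would be what gets passed to `nltk.word_tokenize`.

```python
def to_text(content, encoding="utf-8"):
    """Return text for bytes or str input without raising on undecodable bytes."""
    if isinstance(content, bytes):
        # errors="replace" substitutes U+FFFD for bad bytes instead of raising.
        return content.decode(encoding, errors="replace")
    # Already text: return unchanged rather than decoding a second time.
    return content


print(to_text(b"Glacier \xc3\xa9t\xc3\xa9"))
print(to_text("already text"))
```

Using `errors="replace"` trades silent character substitution for robustness; if losing characters is unacceptable, the default strict mode with explicit error handling may be the better choice.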

@manalishah
Collaborator

Can you upload one such PDF file that gives you this error, @adityardesai? I can then replicate the issue and try to resolve it.

@adityardesai
Author

adityardesai commented Apr 24, 2016

Sure, @manalishah. Attached is the sample file. I just added tokenized = nltk.word_tokenize(content.decode("utf-8")) to server.py and re-ran the REST server, and I got the same error again.
Sample.pdf
