PDFtoTXT

Extract all the text from a PDF, even when it cannot be copied and pasted manually or comes from an image, and optionally translate it on the fly.

Tested on Python 3.10

Requirements

From a package manager (pacman, apt...)

Tesseract
poppler

From pip

pdf2image
natsort
deep_translator
Inquirer
progressbar

Installation

Clone this repo or download the latest PDFtoTXT.py file from releases

$ git clone https://github.com/sbritorodr/pdf_to_txt.git

Install tesseract using any package manager. (pacman, apt, brew...)

$ sudo pacman -S tesseract poppler

Don't forget to add trained data to Tesseract. Download the traineddata files for the languages you need from https://tesseract-ocr.github.io/tessdoc/Data-Files.html and place them in the tessdata directory indicated in the Tesseract documentation (for example, /usr/local/share/tessdata).
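To confirm that the trained data was picked up, the languages Tesseract can see may be listed with its `--list-langs` option. The following is only a minimal sketch, not part of the project: the helper names `parse_langs` and `installed_langs` are hypothetical, and the parsing assumes the listing starts with a header line beginning with "List of", which may differ between Tesseract versions.

```python
import shutil
import subprocess


def parse_langs(output: str) -> list[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the header line looks like 'List of available languages (3):'.
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs() -> list[str]:
    """Return the detected language codes, or an empty list if tesseract is missing."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Both streams are read because some versions print the list to stderr.
    # Any warning printed alongside the list would show up as an entry too.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print(installed_langs())
```

If the language you need (for example `spa`) is missing from the printed list, the traineddata file is probably in a directory that Tesseract does not read.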

Install all pip requirements. Copy and paste this into your terminal, or use pip3 instead if pip doesn't work:

$ pip install -r requirements.txt
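After installing, a quick check that every dependency can be found may save a failed run later. This is a small sketch rather than anything shipped with the project: the function `missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` are assumed to match the pip package names listed above.

```python
import importlib.util

# Assumption: each pip package is imported under the same lowercase name.
REQUIRED_MODULES = ["pdf2image", "natsort", "deep_translator", "inquirer", "progressbar"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```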

Usage

Place PDFtoTXT.py where your PDFs are (or move your PDFs into the cloned repository folder if you cloned the repo)
Run the script with python3:

$ python3 PDFtoTXT.py

Follow the on-screen instructions. By default, the program picks up any PDF found in the folder, disables translation and merges all the text into ./output_ocr_file.txt
A document cannot be translated if any of its pages has more than 5,000 characters
If your target language is not available, you can add it by editing the script (lines 70 to 75). Check that it works and open a PR if you want to add the option to the main project:

70    questions = [
71        inq.List('lang',
72                message="Select which language you want to use",
73                choices=['spanish', 'english', 'french','italian', 'portuguese', 'german'] # add here your language/s
74            ),
75    ]
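Regarding the 5,000-character limit mentioned above, one possible workaround is to split long page text into smaller pieces before sending it to a translator. The sketch below is not how PDFtoTXT.py handles it; `split_for_translation` is a hypothetical helper, and the default `limit` of 4500 is a conservative assumption chosen to stay below the 5,000 figure.

```python
def split_for_translation(text: str, limit: int = 4500) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring line breaks."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks: list[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut into fixed-size pieces.
        pieces = [line[i:i + limit] for i in range(0, len(line), limit)]
        for piece in pieces:
            # Start a new chunk when adding this piece would exceed the limit.
            if len(current) + len(piece) > limit:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    page_text = "first line\n" * 1000
    parts = split_for_translation(page_text)
    print(len(parts), "chunks, longest has", max(len(p) for p in parts), "characters")
```

Joining the chunks back together reproduces the original text, so each chunk could be translated separately and the results concatenated in order.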

Uninstall

Delete PDFtoTXT.py
Uninstall the pip dependencies, tesseract and poppler

$ pip uninstall -r requirements.txt

$ sudo pacman -Rs tesseract poppler

Remove any tessdata files left inside /usr/local/share/tessdata if the uninstall has not already deleted them.
