Converts any PDF file from one language into your language

sbritorodr/PDFtoTXT

PDFtoTXT

Extracts all the text from a PDF, even when the text cannot be copied and pasted manually or is part of an image, and can translate it on the fly.

Tested on Python 3.10.

Requirements

From a package manager (pacman, apt...)

  • Tesseract
  • Poppler

From pip

  • pdf2image
  • natsort
  • deep_translator
  • Inquirer
  • progressbar
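
The requirements above can be checked before running the script. The following is a minimal sketch, not part of PDFtoTXT itself: it assumes the system packages provide executables named `tesseract` and `pdftoppm` (the Poppler tool that pdf2image calls), and that the pip packages are imported under the module names listed in `MODULES`. The function name `missing_dependencies` is hypothetical.

```python
import importlib.util
import shutil

# Assumed executable names provided by the Tesseract and Poppler packages.
BINARIES = ("tesseract", "pdftoppm")
# Assumed import names of the pip requirements.
MODULES = ("pdf2image", "natsort", "deep_translator", "inquirer", "progressbar")


def missing_dependencies(binaries=BINARIES, modules=MODULES):
    """Return two lists: executables not found on PATH and modules that cannot be imported."""
    missing_bins = [name for name in binaries if shutil.which(name) is None]
    missing_mods = [name for name in modules if importlib.util.find_spec(name) is None]
    return missing_bins, missing_mods


if __name__ == "__main__":
    bins, mods = missing_dependencies()
    if bins:
        print("Install with your package manager:", ", ".join(bins))
    if mods:
        print("Install with pip:", ", ".join(mods))
    if not bins and not mods:
        print("All requirements found.")
```

Running the file directly prints whichever requirements still need to be installed.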

Installation

  1. Clone this repo or download the latest PDFtoTXT.py file from releases:
$ git clone https://github.com/sbritorodr/pdf_to_txt.git
  2. Install Tesseract and Poppler using any package manager (pacman, apt, brew...):
$ sudo pacman -S tesseract poppler
  3. Add trained data to Tesseract. Download the tessdata files for the languages you need from https://tesseract-ocr.github.io/tessdoc/Data-Files.html and place them in the folder named in the Tesseract documentation, which says:

download the language traineddata files required by you and place them in this tessdata directory (/usr/local/share/tessdata).

  4. Install all pip requirements. Copy and paste this into your terminal, using pip3 instead if pip doesn't work:
$ pip install -r requirements.txt
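
To confirm that the trained data from the tessdata step is in place, a small check like the one below may help. It is a sketch under assumptions: it only looks for `<code>.traineddata` files in one directory (the path quoted from the Tesseract documentation by default), and the helper names are hypothetical. Your system may keep tessdata elsewhere; `tesseract --list-langs` reports what Tesseract itself can see.

```python
from pathlib import Path

# Default taken from the Tesseract documentation quoted above; adjust if your
# distribution stores tessdata in a different directory.
DEFAULT_TESSDATA = "/usr/local/share/tessdata"


def installed_languages(tessdata_dir=DEFAULT_TESSDATA):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(path.stem for path in folder.glob("*.traineddata"))


def missing_languages(wanted, tessdata_dir=DEFAULT_TESSDATA):
    """Return the codes from `wanted` that have no trained data in tessdata_dir."""
    available = set(installed_languages(tessdata_dir))
    return [code for code in wanted if code not in available]


if __name__ == "__main__":
    # Tesseract codes for Spanish, English, French, Italian, Portuguese and German.
    wanted = ["spa", "eng", "fra", "ita", "por", "deu"]
    print("Missing trained data:", missing_languages(wanted) or "none")
```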

Usage

  1. Place pdftotxt.py in the folder that contains your PDFs (or move your PDFs into the pdf2txt folder if you cloned the repo).
  2. Run the script with python3:
$ python3 pdftotxt.py
  3. Follow the instructions. By default, the program picks up any PDF in the folder, disables translation and merges all the output into ./output_ocr_file.txt
  4. A document cannot be translated if a page contains more than 5,000 characters.
  5. If your target language is not available, you can add it by editing the script (lines 70 to 75). Check that it works and open a PR if you want to add the option to the main project:
70    questions = [
71        inq.List('lang',
72                message="Select which language you want to use",
73                choices=['spanish', 'english', 'french','italian', 'portuguese', 'german'] # add here your language/s
74            ),
75    ]
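
The 5,000-character limit on translation mentioned above could be worked around by splitting a long page into smaller pieces before translating. The sketch below is not taken from pdftotxt.py; `split_text` and its `max_chars` default of 4500 are assumptions chosen to stay safely under the limit. It prefers to cut at a line break, then at a space, and only cuts mid-word when no whitespace is available.

```python
def split_text(text, max_chars=4500):
    """Split text into consecutive chunks of at most max_chars characters each."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks = []
    start = 0
    while len(text) - start > max_chars:
        end = start + max_chars
        # Prefer the last newline inside the window, then the last space.
        cut = text.rfind("\n", start, end)
        if cut <= start:
            cut = text.rfind(" ", start, end)
        if cut <= start:
            cut = end - 1  # no usable whitespace: hard cut at the window edge
        chunks.append(text[start:cut + 1])
        start = cut + 1
    if start < len(text):
        chunks.append(text[start:])
    return chunks


if __name__ == "__main__":
    page = "word " * 3000  # 15,000 characters, too long for a single request
    pieces = split_text(page)
    print(len(pieces), "chunks, longest is", max(len(p) for p in pieces), "characters")
```

Joining the chunks gives back the original text, so each piece could be translated on its own and the results concatenated in order.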

Uninstall

  1. Delete pdftotxt.py
  2. Uninstall the pip dependencies, Tesseract and Poppler:
$ pip uninstall -r requirements.txt
$ sudo pacman -Rs tesseract poppler
  3. Remove your tessdata files from /usr/local/share/tessdata if the uninstall has not already deleted them.