Python PDF Processing

-- Work in progress --

This repo contains 4 files:

pdfToImages.py: converts PDF files to images
imageOCR.py: performs OCR on a set of images with the Azure Computer Vision API, then stores the results in JSON files
OCRinterpretation.py: interprets the results from the JSON files and generates new JSON files describing the PDF's structure
pdfProcessing.py: runs the whole pipeline described above
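
To give an idea of what the first step of the pipeline involves, here is a minimal sketch of a PDF-to-image conversion with Wand. It is not the code in pdfToImages.py: the function names, the output naming scheme and the default resolution of 200 DPI are assumptions made for illustration.

```python
from pathlib import Path


def page_image_path(pdf_path, page_index, out_dir):
    """Build the output path for one page (hypothetical naming scheme)."""
    stem = Path(pdf_path).stem
    # Pages are numbered from 1 and zero-padded so that files sort correctly.
    return Path(out_dir) / f"{stem}_page_{page_index + 1:03d}.png"


def pdf_to_images(pdf_path, out_dir, resolution=200):
    """Render every page of a PDF to a PNG file and return the written paths."""
    # Imported here so the helper above can be used without ImageMagick installed.
    from wand.image import Image

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    # Wand reads the PDF through ImageMagick, which relies on Ghostscript.
    with Image(filename=str(pdf_path), resolution=resolution) as pdf:
        for index, page in enumerate(pdf.sequence):
            with Image(image=page) as img:
                img.format = "png"
                target = page_image_path(pdf_path, index, out_dir)
                img.save(filename=str(target))
                written.append(target)
    return written
```

Calling pdf_to_images("sample.pdf", "images") would write one PNG per page into the images folder. A higher resolution tends to help OCR but makes the conversion slower.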

Installs

Install Ghostscript on Windows (ImageMagick needs it to read PDF files)
Install ImageMagick on Windows (required by Wand)
Install Wand: pip install wand
Install OpenCV: pip install opencv-python
Install Matplotlib: pip install matplotlib
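
Once the installs above are done, a short script like the following can confirm that everything is visible from Python. This is only a sketch: the executable names (gswin64c, gswin32c or gs for Ghostscript, magick for ImageMagick) are assumptions about a typical installation and may differ on your machine.

```python
import importlib.util
import shutil

# Python packages installed with pip (opencv-python is imported as cv2).
PYTHON_MODULES = ("wand", "cv2", "matplotlib")

# Assumed executable names; adjust them if your installation differs.
EXECUTABLES = {
    "Ghostscript": ("gswin64c", "gswin32c", "gs"),
    "ImageMagick": ("magick",),
}


def check_dependencies():
    """Return a dict mapping each dependency name to True if it was found."""
    report = {}
    for module in PYTHON_MODULES:
        report[module] = importlib.util.find_spec(module) is not None
    for label, names in EXECUTABLES.items():
        # A tool counts as found if any of its candidate names is on the PATH.
        report[label] = any(shutil.which(name) is not None for name in names)
    return report


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A tool reported as MISSING may still be installed but absent from the PATH.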

How to use

Clone or download this repo, then use either the sample files or your own. If you choose to use your own, change the path names accordingly. /!\ If you use your own files, try a small number of pages first: converting PDF pages to images takes a while.

If you wish to use the OCR capabilities, set up a Computer Vision API service on Azure and fill in your subscription key and region in imagesOCR.py and/or pdfProcessing.py.
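
For reference, the sketch below shows roughly how a subscription key and region end up in a Computer Vision request. It assumes the v3.2 Read endpoint and a regional URL of the form https://<region>.api.cognitive.microsoft.com; the scripts in this repo may target a different API version. The function names and the environment variable names AZURE_CV_KEY and AZURE_CV_REGION are hypothetical.

```python
import os
import urllib.request


def credentials_from_env():
    """Read the key and region from hypothetical environment variables."""
    return os.environ["AZURE_CV_KEY"], os.environ["AZURE_CV_REGION"]


def build_read_request(region, subscription_key):
    """Return the URL and headers for a Read API call (v3.2 assumed)."""
    url = f"https://{region}.api.cognitive.microsoft.com/vision/v3.2/read/analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        # The image is sent as raw bytes in the request body.
        "Content-Type": "application/octet-stream",
    }
    return url, headers


def submit_image(image_path, region, subscription_key):
    """Upload one image and return the URL to poll for the OCR result."""
    url, headers = build_read_request(region, subscription_key)
    with open(image_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        # The service accepts the job and reports where the result will appear.
        return response.headers["Operation-Location"]
```

The Read API is asynchronous: the URL returned by submit_image has to be polled with the same subscription key header until the result is ready. Keeping the key in an environment variable rather than in the script avoids committing it by accident.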

Once everything is ready, execute the files and watch the magic happen, replacing filename.py with the script you want to run: python .\filename.py
