Python PDF Processing

-- Work in progress --

This repo contains 4 files:

  1. pdfToImages.py: converts PDF files to images
  2. imageOCR.py: runs OCR on a set of images with the Azure Computer Vision API and stores the results in JSON files
  3. OCRinterpretation.py: interprets the OCR results from those JSON files and generates new JSON files describing the PDF's structure
  4. pdfProcessing.py: runs the whole pipeline described above
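
To illustrate the first step, here is a minimal sketch of rendering a PDF to one PNG per page with Wand. It is not the code from pdfToImages.py: the function names, the output naming scheme, and the 200 dpi resolution are assumptions chosen for the example.

```python
from pathlib import Path


def page_image_path(pdf_path, page_index, out_dir):
    """Build the output path for one page (hypothetical naming scheme)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page{page_index + 1:03d}.png"


def pdf_to_images(pdf_path, out_dir, resolution=200):
    """Render each page of a PDF to a PNG file and return the written paths."""
    # Imported here so the naming helper above works without Wand installed.
    from wand.image import Image

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Wand hands the PDF to ImageMagick, which relies on Ghostscript to rasterize it.
    with Image(filename=str(pdf_path), resolution=resolution) as pdf:
        for index, page in enumerate(pdf.sequence):
            # Copy the single page into its own image so it can be saved separately.
            with Image(image=page) as img:
                img.format = "png"
                target = page_image_path(pdf_path, index, out_dir)
                img.save(filename=str(target))
                written.append(target)
    return written
```

A higher resolution generally helps OCR accuracy but makes the conversion slower and the images larger, so it is worth tuning on a few pages first.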

Installs

  • Install Ghostscript on Windows
  • Install ImageMagick on Windows
  • Install Wand: pip install wand
  • Install OpenCV: pip install opencv-python
  • Install Matplotlib: pip install matplotlib
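
Before running the scripts, it can help to confirm that everything in the list above is actually visible from Python. The following is a small sketch, not part of this repo. The executable names are assumptions: recent Windows installers typically provide `magick` for ImageMagick and `gswin64c` or `gswin32c` for Ghostscript, but your installation may differ.

```python
import importlib.util
import shutil

# Import names of the Python packages listed above (opencv-python is imported as cv2).
PYTHON_MODULES = ["wand", "cv2", "matplotlib"]

# Assumed executable names; adjust them if your installation uses others.
EXECUTABLES = {
    "ImageMagick": ["magick"],
    "Ghostscript": ["gswin64c", "gswin32c", "gs"],
}


def check_dependencies():
    """Return a dict mapping each dependency name to True if it was found."""
    status = {}
    for module in PYTHON_MODULES:
        # find_spec returns None when a top-level module is not installed.
        status[module] = importlib.util.find_spec(module) is not None
    for name, candidates in EXECUTABLES.items():
        # shutil.which looks the executable up on the PATH.
        status[name] = any(shutil.which(exe) is not None for exe in candidates)
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If ImageMagick or Ghostscript is reported as missing even though it is installed, the most likely cause is that its install folder is not on the PATH.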

How to use

Clone or download this repo, then use either the sample files or your own. If you choose to use your own files, change the path names accordingly. /!\ Converting a PDF to images takes a while, so try with a small number of pages first.

If you wish to use the OCR capabilities, set up a Computer Vision API service on Azure and fill in your subscription key and region in imagesOCR.py and/or pdfProcessing.py.
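
As an alternative to writing the key into the scripts, the sketch below reads the subscription key and region from environment variables and builds the request settings from them. It is an illustration only: the function name and the variable names `AZURE_CV_KEY` and `AZURE_CV_REGION` are hypothetical, and the endpoint uses the regional host form, which may not match a resource that has a custom domain.

```python
import os


def build_ocr_config(subscription_key=None, region=None):
    """Return the endpoint URL and request headers for a Computer Vision call."""
    # Hypothetical environment variable names; they are not defined by this repo.
    key = subscription_key or os.environ.get("AZURE_CV_KEY")
    region = region or os.environ.get("AZURE_CV_REGION")
    if not key or not region:
        raise ValueError("Both a subscription key and a region are required")
    return {
        # Regional endpoint form, e.g. https://westeurope.api.cognitive.microsoft.com/
        "endpoint": f"https://{region}.api.cognitive.microsoft.com/",
        "headers": {
            # Header that carries the Azure subscription key.
            "Ocp-Apim-Subscription-Key": key,
            # Image bytes are sent as the raw request body.
            "Content-Type": "application/octet-stream",
        },
    }
```

Keeping the key out of the source files also avoids committing it to the repo by accident.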

Once everything is ready, run the script you want and watch the magic happen: python .\filename.py (replace filename with the script's name)