This project uses pymupdf4llm to convert PDF documents to Markdown while preserving table structures and formatting. The output is organized into batch folders so that each folder stays under a configurable size limit (default: 5MB).
- Processes multiple PDF files in a batch
- Preserves table structures
- Maintains text formatting
- Organizes output into size-limited folders
- Converts complex PDF layouts into LLM-friendly markdown
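
As a rough illustration of the conversion step, the sketch below turns a single PDF into a markdown file using `pymupdf4llm.to_markdown`. The helper name `convert_pdf` and its `to_markdown` parameter are assumptions made for this example and are not taken from `pdf_parser.py`; the real script may be organized differently.

```python
from pathlib import Path


def convert_pdf(pdf_path, output_dir, to_markdown=None):
    """Convert one PDF to a markdown file and return the output path.

    `convert_pdf` and the injectable `to_markdown` argument are hypothetical,
    introduced only for this sketch.
    """
    if to_markdown is None:
        # Imported lazily so the helper can be exercised without the library.
        import pymupdf4llm
        to_markdown = pymupdf4llm.to_markdown

    pdf_path = Path(pdf_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # to_markdown returns the whole document as one markdown string.
    markdown_text = to_markdown(str(pdf_path))
    out_path = output_dir / f"{pdf_path.stem}.md"
    out_path.write_text(markdown_text, encoding="utf-8")
    return out_path
```

Passing a custom `to_markdown` callable makes it possible to try the file handling without a real PDF.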
- Install the required dependencies: `pip install -r requirements.txt`
- Place your PDF files in the `data/input` folder
- Run the parser: `python pdf_parser.py`
- Find the processed markdown files in the `data/output` folder
The script supports several command line arguments:
python pdf_parser.py [--input INPUT_DIR] [--output OUTPUT_DIR] [--max-size MAX_SIZE_MB] [--process-all]
- `--input`: Input folder containing PDF files (default: "data/input")
- `--output`: Output folder for markdown files (default: "data/output")
- `--max-size`: Maximum size in MB for each output folder (default: 5)
- `--process-all`: Process all PDFs, even if they have been processed before
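
The arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual code: the function name `build_parser`, the help strings, and the choice of `float` for `--max-size` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Convert PDFs to markdown in size-limited batches."
    )
    parser.add_argument("--input", default="data/input",
                        help="Input folder containing PDF files")
    parser.add_argument("--output", default="data/output",
                        help="Output folder for markdown files")
    # Assumption: fractional sizes such as 2.5 MB are accepted.
    parser.add_argument("--max-size", type=float, default=5,
                        help="Maximum size in MB for each output folder")
    parser.add_argument("--process-all", action="store_true",
                        help="Process all PDFs, even if already processed")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.output, args.max_size, args.process_all)
```

Note that `argparse` exposes `--max-size` as `args.max_size` and `--process-all` as `args.process_all`.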
By default, the script skips PDF files that have already been processed, that is, PDFs for which a file with the same name already exists in the output directory. This lets you add new PDFs to the input folder and process only the new ones.
To process all PDFs regardless of whether they've been processed before:
python pdf_parser.py --process-all
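
One way the skip behaviour could work is sketched below. It assumes that "same name" means the PDF's file name without its extension, and that converted files may sit inside the batch subfolders; both are assumptions, and `pdfs_to_process` is a hypothetical helper rather than a function from `pdf_parser.py`.

```python
from pathlib import Path


def pdfs_to_process(input_dir, output_dir, process_all=False):
    """Return the PDFs that still need converting (hypothetical helper)."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pdfs = sorted(input_dir.glob("*.pdf"))
    if process_all:
        return pdfs

    # Search recursively so files inside batch_1, batch_2, ... count as done.
    done = set()
    if output_dir.is_dir():
        done = {md.stem for md in output_dir.rglob("*.md")}
    return [pdf for pdf in pdfs if pdf.stem not in done]
```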
You can modify the following parameters in the pdf_parser.py file:
- `input_folder`: Location of input PDF files
- `output_base_folder`: Location for output markdown files
- `max_size_mb`: Maximum size per output folder (default: 5MB)
The output will be organized in multiple folders (batch_1, batch_2, etc.), each containing markdown files converted from PDFs with preserved tables and formatting. Each folder will be under the specified size limit.
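
The batching can be pictured as a simple greedy pass over the converted files. The sketch below is an assumption about how such grouping might be done, not the script's confirmed algorithm; `assign_batches` is a hypothetical name, and the sketch lets a batch reach the limit exactly before a new one is opened.

```python
def assign_batches(file_sizes, max_size_mb=5):
    """Map each (name, size_in_bytes) pair to a batch folder name.

    Hypothetical helper: files are taken in order and a new batch is
    started whenever the next file would push the current one over the limit.
    """
    limit = max_size_mb * 1024 * 1024
    assignments = {}
    batch_index, batch_bytes = 1, 0
    for name, size in file_sizes:
        # Never open a new batch for the first file of an empty batch.
        if batch_bytes > 0 and batch_bytes + size > limit:
            batch_index += 1
            batch_bytes = 0
        assignments[name] = f"batch_{batch_index}"
        batch_bytes += size
    return assignments
```

In this sketch a single file larger than the limit still gets a batch of its own, so that one folder would exceed the limit; how the real script treats such files is not stated here.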
After placing your PDFs in the input folder, run the script:
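python pdf_parser.py
This will process the PDFs (skipping any that were already converted unless --process-all is given) and generate markdown files, organizing them into batches to keep each folder under the size limit (5MB by default).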