sajor2000/pdfmarkdownllm

PDF to Markdown Parser for LLM Ingestion

This project uses pymupdf4llm to convert PDF documents to Markdown while preserving table structures and formatting. The output is organized into batch folders so that each folder stays under a specified size limit (default: 5 MB).

Features

  • Processes multiple PDF files in a batch
  • Preserves table structures
  • Maintains text formatting
  • Organizes output into size-limited folders
  • Converts complex PDF layouts into LLM-friendly markdown
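
As a rough illustration of the conversion step, the sketch below wraps `pymupdf4llm.to_markdown`, which takes a PDF path and returns a Markdown string. The helper name `convert_pdf`, the `converter` parameter, and the `<stem>.md` output naming are assumptions made for this example; they are not taken from pdf_parser.py.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_pdf(pdf_path: Path, out_dir: Path,
                converter: Optional[Callable[[str], str]] = None) -> Path:
    """Convert one PDF to a Markdown file and return the output path.

    Hypothetical helper: the real script may structure this differently.
    """
    if converter is None:
        # Imported lazily so the helper can be tried with a stand-in converter.
        import pymupdf4llm
        converter = pymupdf4llm.to_markdown  # path in, Markdown string out

    markdown_text = converter(str(pdf_path))

    # Assumed naming scheme: report.pdf -> report.md
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{pdf_path.stem}.md"
    out_path.write_text(markdown_text, encoding="utf-8")
    return out_path
```

Passing a custom `converter` makes it possible to try the file handling without a real PDF or the library installed.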

Installation

  1. Install the required dependencies:
    pip install -r requirements.txt
    

Usage

  1. Place your PDF files in the data/input folder
  2. Run the parser:
    python pdf_parser.py
    
  3. Find the processed markdown files in the data/output folder

Command Line Arguments

The script supports several command line arguments:

python pdf_parser.py [--input INPUT_DIR] [--output OUTPUT_DIR] [--max-size MAX_SIZE_MB] [--process-all]
  • --input: Input folder containing PDF files (default: "data/input")
  • --output: Output folder for markdown files (default: "data/output")
  • --max-size: Maximum size in MB for each output folder (default: 5)
  • --process-all: Process all PDFs, even if they have been processed before
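
The options above could be declared with `argparse` roughly as follows. This is a minimal sketch based only on the flags and defaults listed here; the function name `build_parser` and the help strings are assumptions, and the actual script may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdf_parser.py",
        description="Convert PDFs to Markdown in size-limited batches.",
    )
    parser.add_argument("--input", default="data/input",
                        help="Input folder containing PDF files")
    parser.add_argument("--output", default="data/output",
                        help="Output folder for markdown files")
    parser.add_argument("--max-size", type=float, default=5.0,
                        help="Maximum size in MB for each output folder")
    parser.add_argument("--process-all", action="store_true",
                        help="Process all PDFs, even if processed before")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse exposes --max-size as args.max_size and --process-all as args.process_all
    print(args.input, args.output, args.max_size, args.process_all)
```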

Incremental Processing

By default, the script skips any PDF that has already been processed, that is, any PDF for which a file with the same name already exists in the output directory. This lets you add new PDFs to the input folder and process only the new ones.

To process all PDFs regardless of whether they've been processed before:

python pdf_parser.py --process-all
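
One way the skip check could work is sketched below. It assumes each output file is named after the PDF's stem with an `.md` extension and may sit in any batch subfolder; both the helper name `pending_pdfs` and that naming scheme are assumptions for illustration, not confirmed details of pdf_parser.py.

```python
from pathlib import Path
from typing import List


def pending_pdfs(input_dir: Path, output_dir: Path,
                 process_all: bool = False) -> List[Path]:
    """Return the PDFs that still need converting (hypothetical helper)."""
    pdfs = sorted(input_dir.glob("*.pdf"))
    if process_all or not output_dir.is_dir():
        return pdfs

    # Assumption: report.pdf is considered done if some report.md exists
    # anywhere under the output folder (for example in batch_1/).
    done = {md.stem for md in output_dir.rglob("*.md")}
    return [pdf for pdf in pdfs if pdf.stem not in done]
```

With `process_all=True` the check is bypassed, which matches the behavior described for `--process-all`.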

Configuration

You can modify the following parameters in the pdf_parser.py file:

  • input_folder: Location of input PDF files
  • output_base_folder: Location for output markdown files
  • max_size_mb: Maximum size per output folder (default: 5MB)

Output Structure

The output will be organized in multiple folders (batch_1, batch_2, etc.), each containing markdown files converted from PDFs with preserved tables and formatting. Each folder will be under the specified size limit.
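
The batching idea can be sketched as a simple greedy assignment: files are added to the current batch until the next one would push it past the limit, at which point a new batch starts. The function name `assign_batches` and the greedy strategy are assumptions for illustration; the script's actual algorithm is not documented here.

```python
from typing import Dict, List, Tuple


def assign_batches(files: List[Tuple[str, int]],
                   max_size_mb: float = 5) -> Dict[str, List[str]]:
    """Group (name, size_in_bytes) pairs into batch_N folders (illustrative)."""
    limit = int(max_size_mb * 1024 * 1024)
    batches: Dict[str, List[str]] = {}
    batch_no, used = 1, 0

    for name, size in files:
        # Open a new batch if this file would push the current one past the
        # limit. A single file larger than the limit still gets its own batch.
        if used > 0 and used + size > limit:
            batch_no += 1
            used = 0
        batches.setdefault(f"batch_{batch_no}", []).append(name)
        used += size

    return batches


MB = 1024 * 1024
print(assign_batches([("a.md", 3 * MB), ("b.md", 3 * MB), ("c.md", 1 * MB)]))
# → {'batch_1': ['a.md'], 'batch_2': ['b.md', 'c.md']}
```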

Example

After placing your PDFs in the input folder, run the script:

python pdf_parser.py

This processes the PDFs in the input folder, skipping any that were already converted, and writes the generated markdown files into batch folders that each stay under 5 MB.
