PDF to Markdown Parser for LLM Ingestion

This project uses pymupdf4llm to convert PDF documents to Markdown while preserving table structures and formatting. The output is grouped into batch folders so that each folder stays under a configurable size limit (default: 5MB).

Features

Processes multiple PDF files in a batch
Preserves table structures
Maintains text formatting
Organizes output into size-limited folders
Converts complex PDF layouts into LLM-friendly markdown
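
As a rough illustration of the conversion step, here is a minimal sketch built on `pymupdf4llm.to_markdown`, which returns a document's content as a Markdown string. The helper name `convert_pdf` and its `to_markdown` parameter are hypothetical and not taken from pdf_parser.py; the actual script may structure this differently.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_pdf(pdf_path: Path, out_dir: Path,
                to_markdown: Optional[Callable[[str], str]] = None) -> Path:
    """Convert one PDF to a Markdown file in out_dir and return the new path."""
    if to_markdown is None:
        # Imported lazily so the helper can be exercised without the library.
        import pymupdf4llm
        to_markdown = pymupdf4llm.to_markdown

    out_dir.mkdir(parents=True, exist_ok=True)
    md_text = to_markdown(str(pdf_path))  # Markdown text, tables included
    # Assumption: the output file reuses the PDF's base name with a .md suffix.
    md_path = out_dir / f"{pdf_path.stem}.md"
    md_path.write_text(md_text, encoding="utf-8")
    return md_path
```

Passing the converter as a parameter is only a convenience for trying the helper without a real PDF; calling `convert_pdf(Path("data/input/report.pdf"), Path("data/output/batch_1"))` would use pymupdf4llm directly.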

Installation

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Place your PDF files in the data/input folder
Run the parser:
```
python pdf_parser.py
```
Find the processed markdown files in the data/output folder

Command Line Arguments

The script supports several command line arguments:

python pdf_parser.py [--input INPUT_DIR] [--output OUTPUT_DIR] [--max-size MAX_SIZE_MB] [--process-all]

--input: Input folder containing PDF files (default: "data/input")
--output: Output folder for markdown files (default: "data/output")
--max-size: Maximum size in MB for each output folder (default: 5)
--process-all: Process all PDFs, even if they have been processed before
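
The arguments above could be declared with the standard library's `argparse` roughly as follows. This is a sketch under assumptions, not the script's actual code: the function name `build_parser`, the help strings, and the choice of `float` for `--max-size` are all illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command line arguments."""
    parser = argparse.ArgumentParser(
        description="Convert PDFs to Markdown in size-limited batches.")
    parser.add_argument("--input", default="data/input",
                        help="Input folder containing PDF files")
    parser.add_argument("--output", default="data/output",
                        help="Output folder for markdown files")
    # Assumption: fractional sizes such as 2.5 are accepted.
    parser.add_argument("--max-size", type=float, default=5,
                        help="Maximum size in MB for each output folder")
    parser.add_argument("--process-all", action="store_true",
                        help="Process all PDFs, even if already processed")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--max-size", "10", "--process-all"])
```

Note that `argparse` exposes `--max-size` as `args.max_size` and `--process-all` as `args.process_all`.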

Incremental Processing

By default, the script skips PDF files that have already been processed, meaning a file with the same name already exists in the output directory. This lets you add new PDFs to the input folder and process only the new ones.

To process all PDFs regardless of whether they've been processed before:

python pdf_parser.py --process-all
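
The skip check described in this section might look something like the sketch below. It assumes that "same name" means the same base name with a `.md` extension, and that already-converted files may sit inside batch subfolders of the output directory; neither detail is confirmed by this README, and `pending_pdfs` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def pending_pdfs(input_dir: Path, output_dir: Path,
                 process_all: bool = False) -> List[Path]:
    """Return the PDFs in input_dir that still need to be converted."""
    pdfs = sorted(p for p in input_dir.iterdir() if p.suffix.lower() == ".pdf")
    if process_all:
        return pdfs  # --process-all: ignore earlier output entirely

    # Assumption: output lives in batch subfolders, so search recursively.
    done = set()
    if output_dir.exists():
        done = {md.stem for md in output_dir.rglob("*.md")}
    return [p for p in pdfs if p.stem not in done]
```

Matching on the base name alone means two PDFs with the same name in different locations would be treated as one document, which is a limitation of this sketch.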

Configuration

You can modify the following parameters in the pdf_parser.py file:

input_folder: Location of input PDF files
output_base_folder: Location for output markdown files
max_size_mb: Maximum size per output folder (default: 5MB)

Output Structure

The output will be organized in multiple folders (batch_1, batch_2, etc.), each containing markdown files converted from PDFs with preserved tables and formatting. Each folder will be under the specified size limit.
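
To make the batching idea concrete, here is one simple way to assign files to numbered batches: fill the current batch in order and start a new one when the next file would push it past the limit. This greedy strategy, the helper name `assign_batches`, and the convention of 1 MB = 1024 * 1024 bytes are assumptions for illustration; the script's own rule may differ.

```python
from typing import List


def assign_batches(sizes_bytes: List[int], max_size_mb: float = 5) -> List[int]:
    """Return a 1-based batch number for each file size, in input order."""
    limit = int(max_size_mb * 1024 * 1024)  # assumption: binary megabytes
    batches = []
    batch, used = 1, 0
    for size in sizes_bytes:
        # Open a new batch only if the current one already holds something.
        if used > 0 and used + size > limit:
            batch += 1
            used = 0
        batches.append(batch)
        used += size
    return batches


MB = 1024 * 1024
# Files of 2, 2, 2, 6 and 1 MB with the default 5 MB limit.
print(assign_batches([2 * MB, 2 * MB, 2 * MB, 6 * MB, 1 * MB]))  # [1, 1, 2, 3, 4]
```

In this sketch a single file larger than the limit still gets a batch of its own, so that folder would exceed the limit; how the actual script handles oversized files is not stated here.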

Example

After placing your PDFs in the input folder, run the script:

python pdf_parser.py

This converts any PDFs that have not been processed yet and organizes the resulting markdown files into batches that keep each folder under 5MB.
