ByteVeda/paperjam

PyPI · License: MIT · Python 3.12+

Fast document processing powered by Rust. One API. Every document format.

Supported Formats

Format  Read  Write  Extract Text  Extract Tables  Convert
PDF     Yes   Yes    Yes           Yes             Yes
DOCX    Yes   Yes    Yes           Yes             Yes
XLSX    Yes   Yes    Yes           Yes             Yes
PPTX    Yes   Yes    Yes           Yes             Yes
HTML    Yes   Yes    Yes           Yes             Yes
EPUB    Yes   Yes    Yes           -               Yes
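
If you want to check a capability in code before dispatching work, the table above can be transcribed into a small lookup. This is only a sketch: `SUPPORT` and `supports` are hypothetical names for illustration and are not part of the paperjam API.

```python
# Capability matrix transcribed by hand from the Supported Formats table.
# SUPPORT and supports() are illustrative helpers, not paperjam functions.
ALL_OPS = {"read", "write", "extract_text", "extract_tables", "convert"}

SUPPORT = {
    "pdf": set(ALL_OPS),
    "docx": set(ALL_OPS),
    "xlsx": set(ALL_OPS),
    "pptx": set(ALL_OPS),
    "html": set(ALL_OPS),
    # EPUB is the one format without table extraction.
    "epub": ALL_OPS - {"extract_tables"},
}


def supports(fmt: str, operation: str) -> bool:
    """Return True if the table lists `operation` as available for `fmt`."""
    key = fmt.lower().lstrip(".")
    return operation in SUPPORT.get(key, set())
```

For example, `supports("epub", "extract_tables")` is `False`, while `supports(".docx", "convert")` is `True`.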

Installation

pip install paperjam

CLI tool (Rust):

cargo install paperjam-cli

Quick Start

Open any format

import paperjam

doc = paperjam.open("report.pdf")
docx = paperjam.open("document.docx")
xlsx = paperjam.open("data.xlsx")
pptx = paperjam.open("slides.pptx")

Extract text and tables

doc = paperjam.open("report.pdf")

text = doc.pages[0].extract_text()
tables = doc.pages[0].extract_tables()
md = doc.to_markdown(layout_aware=True)
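
The snippet above reads a single page. To collect text from a whole document, one option is a small helper like the following. It assumes `doc.pages` can be iterated, which the README only shows being indexed, and `extract_all_text` is a hypothetical name rather than a paperjam function.

```python
from typing import Any


def extract_all_text(doc: Any, separator: str = "\n\n") -> str:
    """Join the text of every page in `doc`.

    Assumption: `doc.pages` is iterable and each page exposes
    `extract_text()`, as in the Quick Start example.
    """
    return separator.join(page.extract_text() for page in doc.pages)


# Usage, assuming paperjam is installed:
#   import paperjam
#   text = extract_all_text(paperjam.open("report.pdf"))
```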

Convert between formats

paperjam.convert("report.pdf", "report.docx")
paperjam.convert("data.xlsx", "data.pdf")
paperjam.convert("page.html", "page.epub")
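
For more than a handful of files, it may help to separate planning the output paths from running the conversions. The sketch below does that with `pathlib`. `plan_conversions` and `convert_all` are hypothetical helpers, and the converter is passed in as a parameter, so you would supply `paperjam.convert` yourself.

```python
from pathlib import Path
from typing import Callable, Iterable


def plan_conversions(
    sources: Iterable[str], out_dir: str, target_ext: str
) -> list[tuple[Path, Path]]:
    """Pair each source file with an output path in `out_dir`.

    The output keeps the source file's stem and takes `target_ext`
    (given with or without a leading dot) as its extension.
    """
    ext = "." + target_ext.lstrip(".")
    out = Path(out_dir)
    return [(Path(src), out / (Path(src).stem + ext)) for src in sources]


def convert_all(
    pairs: Iterable[tuple[Path, Path]],
    convert: Callable[[str, str], object],
) -> None:
    """Call `convert(source, destination)` once per planned pair."""
    for src, dst in pairs:
        convert(str(src), str(dst))


# Usage, assuming paperjam is installed and "converted/" exists:
#   import paperjam
#   convert_all(plan_conversions(["a.pdf", "b.pdf"], "converted", "docx"),
#               paperjam.convert)
```

Keeping the planning step pure makes it easy to print or review the planned paths before any file is written.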

Run a pipeline

# pipeline.yaml
steps:
  - open: "reports/*.pdf"
  - extract_tables:
      strategy: auto
      output: tables.csv
  - convert:
      format: docx
      output: "converted/"

paperjam pipeline run pipeline.yaml
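
If you generate pipeline files from a script, a plain string template is enough for the structure shown above. This sketch uses only the keys that appear in the example (`open`, `extract_tables`, `convert`). `render_pipeline` is a hypothetical helper, and it does no YAML escaping, so it only suits simple paths.

```python
def render_pipeline(
    pattern: str, tables_csv: str, target_format: str, out_dir: str
) -> str:
    """Render a pipeline.yaml with the same three steps as the example.

    Values are inserted verbatim (no YAML escaping), so keep them simple.
    """
    return (
        "steps:\n"
        f'  - open: "{pattern}"\n'
        "  - extract_tables:\n"
        "      strategy: auto\n"
        f"      output: {tables_csv}\n"
        "  - convert:\n"
        f"      format: {target_format}\n"
        f'      output: "{out_dir}"\n'
    )


# Usage: write the file, then run `paperjam pipeline run pipeline.yaml`.
#   from pathlib import Path
#   Path("pipeline.yaml").write_text(
#       render_pipeline("reports/*.pdf", "tables.csv", "docx", "converted/"))
```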

CLI usage

paperjam extract text report.pdf
paperjam extract tables data.pdf --format csv
paperjam convert report.pdf report.docx
paperjam info document.pdf
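
To call the CLI from a Python script, a thin `subprocess` wrapper is one option. The sketch below assumes the `paperjam` binary is on `PATH` and that commands write their results to standard output, which the README does not state. `paperjam_cmd` and `run_paperjam` are hypothetical names.

```python
import shutil
import subprocess


def paperjam_cmd(*args: str) -> list[str]:
    """Build the argument list for a paperjam CLI call."""
    return ["paperjam", *args]


def run_paperjam(*args: str) -> str:
    """Run the CLI and return whatever it wrote to stdout.

    Raises FileNotFoundError if the binary is not on PATH, and
    subprocess.CalledProcessError if the command exits non-zero.
    """
    if shutil.which("paperjam") is None:
        raise FileNotFoundError(
            "paperjam CLI not found on PATH (try: cargo install paperjam-cli)"
        )
    result = subprocess.run(
        paperjam_cmd(*args), capture_output=True, text=True, check=True
    )
    return result.stdout


# Usage, mirroring the commands above:
#   run_paperjam("extract", "tables", "data.pdf", "--format", "csv")
```

Passing arguments as a list avoids shell quoting problems with file names that contain spaces.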

MCP server

pip install paperjam-mcp

Add to your MCP client configuration (Claude Code, Claude Desktop, Cursor):

{
  "mcpServers": {
    "paperjam": {
      "command": "uvx",
      "args": ["paperjam-mcp", "--working-dir", "."]
    }
  }
}
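
If your client configuration already lists other servers, you can add the paperjam entry without overwriting them. The following is a minimal sketch using only the standard library. `merge_mcp_config` is a hypothetical helper, and the location of the configuration file depends on your client, so it is left out here.

```python
import json


def paperjam_mcp_entry(working_dir: str = ".") -> dict:
    """Return the server entry shown above, with a configurable working dir."""
    return {
        "command": "uvx",
        "args": ["paperjam-mcp", "--working-dir", working_dir],
    }


def merge_mcp_config(existing: dict, working_dir: str = ".") -> dict:
    """Return a copy of `existing` with the paperjam server added.

    Other entries under "mcpServers" are kept; `existing` is not mutated.
    """
    merged = dict(existing)
    servers = dict(merged.get("mcpServers", {}))
    servers["paperjam"] = paperjam_mcp_entry(working_dir)
    merged["mcpServers"] = servers
    return merged


# Starting from an empty config reproduces the JSON snippet above.
print(json.dumps(merge_mcp_config({}), indent=2))
```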

Features

  • Multi-format support -- PDF, DOCX, XLSX, PPTX, HTML, EPUB through one unified API
  • Text extraction -- plain text, positioned lines, spans with font info
  • Table extraction -- lattice and stream strategies with CSV/DataFrame export
  • Format conversion -- convert between any two supported formats
  • Pipeline engine -- define multi-step document workflows in YAML
  • MCP server -- expose document operations as tools for AI agents
  • PDF manipulation -- split, merge, reorder, rotate, delete, insert blank pages
  • Metadata & bookmarks -- read and edit document properties and outline
  • Annotations & watermarks -- add, read, remove annotations; text watermarks
  • Forms -- inspect, fill, create, and modify form fields
  • Security -- encryption (AES-128/256, RC4), sanitization, true content-stream redaction
  • Digital signatures -- sign, verify, and inspect with LTV timestamp support
  • PDF/A & PDF/UA -- validation and conversion, accessibility checks
  • Native async -- powered by Rust and tokio, no Python thread pools
  • CLI tool -- full-featured command-line interface for scripting and automation
  • WASM playground -- try it in the browser at docs.byteveda.org/paperjam

Documentation

Full docs, API reference, and interactive playground at docs.byteveda.org/paperjam.

License

MIT
