Docling, but actually fast enough for real apps.
5–6× faster on GPU • runs locally on your laptop in seconds
Zero AWS required to start. Deploy to AWS only when you need scale.
Stock Docling is great at quality. Terrible at speed. A 93-page 10-K takes 71 seconds on an A10G.
Turbodocling is the same high-quality pipeline, completely re-architected. Same 10-K now finishes in 13 seconds on AWS. On a MacBook Pro M1 Max it processes a 48-page 10-Q in ~24 seconds via MPS. Still optimizing — target is sub-20s.
That's fast enough to drop a PDF into your chat UI and get structured Markdown + elements back inline. No background jobs. No "we'll email you later."
| Document | Pages | Turbodocling | Stock Docling | Speedup |
|---|---|---|---|---|
| Docling paper | 8 | 3.0 s | 8.1 s | 2.7× |
| NVIDIA 10-Q | 48 | 6.5 s | 40.7 s | 6.3× |
| NVIDIA 10-K | 93 | 12.8 s | 71.0 s | 5.5× |
20 concurrent 93-page 10-Ks (1,860 pages): 16.7 pages/sec vs ~1.3 pages/sec stock.
- 48-page 10-Q → ~24 s on M1 Max (MPS)
- Works out of the box on CPU / CUDA / MPS
- No Docker, no cloud, no credentials
```bash
git clone https://github.com/lucasastorian/turbodocling.git
cd turbodocling
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Build the optimized C++ parser
python shared/docling_parse/build.py
pip install -e shared/docling_parse
```

Then run it:

```bash
python -m turbodocling my_document.pdf -o output/
```

One command. Outputs:
- `output/output.md` — clean Markdown
- `output/elements.json` — structured elements with bboxes + image crops
Options:
- `--device auto|cuda|mps|cpu` (default: auto-detect)
- `--workers N` (default: CPU core count)
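For example, forcing MPS on an Apple Silicon machine with eight workers:

```bash
python -m turbodocling my_document.pdf -o output/ --device mps --workers 8
```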
Python API:
```python
from turbodocling.local_runner import run_local

result = run_local("my_document.pdf", output_dir="output/", device="auto")
print(result.wall_time_s, result.md_path, result.elements_path)
```

Deploy the AWS stack:

```bash
cdk deploy
```

Then call the Step Function:
```python
import boto3, json, uuid
import pypdfium2 as pdfium

STEP_FUNCTION_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:..."
BUCKET = "turbodocling-...-documentsbucket-..."

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

# Upload the source PDF where the pipeline expects it, then start the job.
job_id = str(uuid.uuid4())
s3.upload_file("my_document.pdf", BUCKET, f"uploads/user/{job_id}/source.pdf")
total_pages = len(pdfium.PdfDocument("my_document.pdf"))

# Keep the response so you can poll the execution later (see below).
execution = sfn.start_execution(
    stateMachineArn=STEP_FUNCTION_ARN,
    name=f"job-{job_id[:8]}",
    input=json.dumps({
        "job_id": job_id,
        "user_id": "user",
        "total_pages": total_pages,
    }),
)
```

Results land in S3:
- `processed/user/{job_id}/output.md` → clean Markdown
- `processed/user/{job_id}/elements.json` → the good stuff (see below)
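If you'd rather block until the job finishes, one option is to poll the execution and then pull the results from the key layout above. A minimal sketch, continuing from the snippet above (it reuses `sfn`, `s3`, `BUCKET`, `job_id`, and the captured `execution` response); the polling loop is plain boto3, not a Turbodocling API:

```python
import time

# Poll the Step Function until it leaves the RUNNING state.
while True:
    status = sfn.describe_execution(executionArn=execution["executionArn"])["status"]
    if status != "RUNNING":
        break
    time.sleep(2)

if status == "SUCCEEDED":
    # Fetch the outputs from the documented S3 key layout.
    s3.download_file(BUCKET, f"processed/user/{job_id}/output.md", "output.md")
    s3.download_file(BUCKET, f"processed/user/{job_id}/elements.json", "elements.json")
```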
Turbodocling cost per 10,000 pages
| GPU utilization | Cost | vs Textract ($150) | vs Mistral ($30) |
|---|---|---|---|
| 100% (batch/backfill) | $0.28 | 536× cheaper | 107× cheaper |
| 30% (real-time) | $0.74 | 203× cheaper | 41× cheaper |
Spot instances drop the GPU portion another ~70%. One always-on A10G is ~$870/mo on-demand — still orders of magnitude cheaper than API services.
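As a sanity check, the table's numbers fall out of the throughput figures above. A back-of-envelope sketch (the ~$1.19/hr rate is just $870/mo ÷ 730 h; the Lambda share is inferred from the table, not a published breakdown):

```python
# Rough reconstruction of the cost table above. Assumptions: ~$870/mo
# on-demand A10G ≈ $1.19/hr, 16.7 pages/s sustained throughput, and the
# ~$0.08 non-GPU (Lambda) share backed out from the $0.28 total.
gpu_hourly = 870 / 730                     # ≈ $1.19/hr
gpu_hours = 10_000 / 16.7 / 3600           # ≈ 0.166 h of A10G time for 10k pages
gpu_cost = gpu_hours * gpu_hourly          # ≈ $0.20 at 100% utilization
lambda_cost = 0.28 - gpu_cost              # ≈ $0.08 for parse/render Lambdas
realtime = gpu_cost / 0.30 + lambda_cost   # ≈ $0.74 at 30% utilization
print(f"batch ≈ ${gpu_cost + lambda_cost:.2f}, real-time ≈ ${realtime:.2f}")
```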
- Local → single-process, multi-threaded + MPS/CUDA/CPU
- AWS → Step Function → 40 parallel Lambdas (parse + render) → A10G worker (inference only)
Everything CPU-heavy is off the critical path. TableFormer is fully batched + KV-cached. The rest is hundreds of vectorized + zero-copy optimizations.
- **Split CPU and GPU work.** Stock Docling runs everything on the same machine. You pay A10G prices for parsing and image rendering. We moved that to cheap parallel Lambdas.
- **Rewrote TableFormer from scratch.** No more one-table-at-a-time. No more recomputing attention every token. Batched + KV-cached + multi-threaded. Huge win (sketched after this list).
- **Everything else optimized to death.** Vectorized numpy everywhere, spatial indexes instead of O(N²) loops, C++ parser patches, zero-copy data paths, multi-threaded pipeline with backpressure. The boring stuff that adds up.
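To make the TableFormer rewrite concrete, here is the decode pattern in miniature: a toy single-layer PyTorch decoder, batched over `B` table sequences, that reuses cached K/V instead of re-attending over the whole prefix from scratch every token. Everything here (dimensions, layer shapes, greedy decoding) is illustrative; it shows the technique, not Turbodocling's actual TableFormer code.

```python
import torch

D, V, B, STEPS = 64, 100, 8, 16  # hidden dim, vocab, batch of tables, tokens

embed = torch.nn.Embedding(V, D)
wq, wk, wv = (torch.nn.Linear(D, D) for _ in range(3))
head = torch.nn.Linear(D, V)

tokens = torch.zeros(B, 1, dtype=torch.long)  # shared BOS token for the batch
k_cache, v_cache = [], []                     # grows by one entry per step

with torch.no_grad():
    for _ in range(STEPS):
        x = embed(tokens[:, -1:])             # (B, 1, D): only the newest token
        q, k, v = wq(x), wk(x), wv(x)
        k_cache.append(k); v_cache.append(v)  # cached K/V replace recomputing
        K = torch.cat(k_cache, dim=1)         # attention inputs over the full
        Vv = torch.cat(v_cache, dim=1)        # prefix at every single step
        att = torch.softmax(q @ K.transpose(1, 2) / D**0.5, dim=-1)
        out = att @ Vv                        # (B, 1, D)
        nxt = head(out).argmax(dim=-1)        # greedy token for all B tables at once
        tokens = torch.cat([tokens, nxt], dim=1)
```

Because only the newest token is embedded and projected each step, the per-token cost stays O(prefix) instead of O(prefix²), and the whole batch of tables decodes together.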
Here's the shape of `elements.json`:

```json
{
  "pages": [
    {
      "page_no": 1,
      "width": 612,
      "height": 792,
      "elements": [
        {
          "type": "section_header",
          "content": "1. Introduction",
          "bbox": { "l": 72, "t": 100, "r": 400, "b": 120 },
          "level": 1
        },
        {
          "type": "table",
          "content": "| Col1 | Col2 |\n|------|------|",
          "bbox": { "l": 72, "t": 200, "r": 540, "b": 400 }
        },
        {
          "type": "picture",
          "image_base64": "iVBORw0KGgoAAAANSUhEUg...",
          "bbox": { "l": 72, "t": 450, "r": 400, "b": 650 }
        }
      ]
    }
  ]
}
```

Every element includes bounding boxes and (for pictures) ready-to-use base64 crops. Perfect for RAG, multimodal prompts, provenance, or custom filtering.
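For example, a few lines of Python turn the schema above into provenance-tagged table chunks for a RAG index (file path and chunk shape are illustrative):

```python
import json

# Load the elements.json produced by a run (path from the quickstart above).
with open("output/elements.json") as f:
    doc = json.load(f)

# Collect every table as a Markdown chunk with page + bbox provenance.
chunks = [
    {
        "text": el["content"],
        "page": page["page_no"],
        "bbox": el["bbox"],  # l/t/r/b in page coordinates
    }
    for page in doc["pages"]
    for el in page["elements"]
    if el["type"] == "table"
]
print(f"{len(chunks)} table chunks ready for embedding")
```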
- No OCR yet (native text layer only)
- Equations come out as image crops only
- TableFormer occasionally struggles with very weird layouts (fallback reconciliation prevents data loss, but test your docs)
- MPS (Apple Silicon) support is experimental. Layout and table inference both run on MPS and produce correct output on our test corpus, but performance varies by table complexity. Long or structurally dense tables may be slower because the autoregressive decode path is inherently sequential. Set `TURBODOCLING_TABLE_MPS=0` to fall back to CPU for table inference if needed.
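For example:

```bash
TURBODOCLING_TABLE_MPS=0 python -m turbodocling my_document.pdf -o output/
```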
Apache 2.0. Built on the excellent Docling work from IBM Research (MIT licensed).
Local is the default. Try it in 30 seconds. Deploy to AWS only when you need massive throughput. We're still pushing local performance harder — next target: sub-20s on M1/M2/M3 for a 48-page 10-Q.
