feat: add checkpoint/resume for long document processing by ag9920 · Pull Request #227 · VectifyAI/PageIndex

ag9920 · 2026-04-11T05:52:08Z

Summary

Add two-phase checkpoint support to the document processing pipeline,
enabling users to resume from the last completed stage instead of
restarting from scratch when processing is interrupted.

Closes #170

Problem

Processing large documents (100+ pages) through PageIndex is expensive
and time-consuming — the LLM calls in tree_parser and
generate_summaries can take 10-30 minutes and cost significant tokens.
If the process crashes at any point (API rate limit, network timeout,
context overflow), all progress is lost and users must start over.

Issue #170 raised this exact pain point: users need a way to recover
from failures without re-running the entire pipeline.

Solution

Introduce a two-phase checkpoint mechanism that saves intermediate
results after each expensive LLM stage:

Checkpoint File	Saved After	What It Contains
`{doc}_tree.json`	`tree_parser` completes	Raw tree structure from TOC parsing
`{doc}_summary.json`	`generate_summaries` completes	Tree structure with summaries attached

On resume, the pipeline automatically picks the latest available
checkpoint (summary > tree), skipping all completed LLM calls.

For Markdown documents, a single checkpoint ({doc}_md_summary.json)
is saved after summary generation, since tree construction is local
and doesn't require LLM calls.

Usage

CLI:

# First run: parse + save checkpoints
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints

# If interrupted, resume from the latest checkpoint
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints --resume

# Markdown works the same way
python run_pageindex.py --md_path doc.md --checkpoint-dir ./checkpoints --resume

Python API:

from pageindex import PageIndexClient

client = PageIndexClient(workspace="./workspace")

# Enable checkpointing
doc_id = client.index("doc.pdf", checkpoint_dir="./checkpoints")

# Resume after interruption
doc_id = client.index("doc.pdf", checkpoint_dir="./checkpoints", resume=True)

Human-in-the-loop correction:

Users can manually edit checkpoint JSON files (e.g., fix an incorrect
page number identified by the LLM) before resuming, enabling a
human-in-the-loop workflow not previously possible.

Implementation Details

Atomic writes: Checkpoints are written to a .tmp file first,
then os.replace() atomically swaps it into place. This prevents
corruption if the process crashes mid-write.
Backward compatible: When checkpoint_dir is not set (default
None), behavior is identical to before — zero overhead.
Config integration: New checkpoint_dir (default: null) and
resume (default: "no") fields added to config.yaml, validated
by the existing ConfigLoader.
Error handling: --resume without --checkpoint-dir raises a
clear error. Resume with no checkpoint files found raises
FileNotFoundError with the expected file path.

Files Changed

File	Change
`pageindex/page_index.py`	Two-phase checkpoint in `page_index_builder()` + `_save_checkpoint()` helper
`pageindex/page_index_md.py`	Checkpoint after summary generation in `md_to_tree()` + `_save_checkpoint_md()` helper
`pageindex/client.py`	Pass `checkpoint_dir`/`resume` through `PageIndexClient.index()` for both PDF and MD
`pageindex/config.yaml`	Add `checkpoint_dir: null` and `resume: "no"`
`run_pageindex.py`	Add `--checkpoint-dir` and `--resume` CLI args, global validation, pass to both PDF/MD branches

Testing

Verified with a 21-page PDF using Kimi K2.5: both _tree.json and
_summary.json checkpoints are saved correctly
--resume successfully skips LLM calls and loads from checkpoint
--resume without --checkpoint-dir raises clear error
Default behavior (no checkpoint_dir) unchanged
AST syntax check passes for all modified files

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

saccharin98 · 2026-04-20T06:40:11Z

Thanks for the contribution. I agree checkpoint/resume is a valuable direction for PageIndex, but I do not think this PR is the right design to merge.

PageIndex has a complex multi-stage pipeline with many intermediate states, recursive processing, concurrent LLM calls, retries, validation, correction, and
summary generation. Reliable resume support needs to checkpoint inside those expensive loops and track which sub-tasks have completed. Saving only after
tree_parser completes and after all summaries complete is closer to coarse artifact caching than true checkpoint/resume.

A robust design would also need checkpoint metadata and validation: source document hash, model, config/options, pipeline version, completed stage/task IDs,
and compatibility checks before loading. Without that, stale or incompatible checkpoints can be loaded silently and produce incorrect results.

There is also a correctness issue in the current PR: loading a PDF _summary.json checkpoint still falls through to generate_summaries_for_structure(...),
so it does not actually skip the expensive summary calls as described.

I am going to close this PR for now. I would be open to revisiting checkpoint/resume with a more deliberate design proposal, likely starting with a narrow,
well-tested checkpoint mechanism for one expensive loop/stage rather than broad CLI/API flags over the whole pipeline.

feat: add checkpoint/resume for long document processing

93b6ac6

claude bot reviewed Apr 11, 2026

View reviewed changes

saccharin98 closed this Apr 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add checkpoint/resume for long document processing#227

feat: add checkpoint/resume for long document processing#227
ag9920 wants to merge 1 commit intoVectifyAI:mainfrom
ag9920:feat_checkpoint_document

ag9920 commented Apr 11, 2026

Uh oh!

claude bot left a comment

Uh oh!

saccharin98 commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ag9920 commented Apr 11, 2026

Summary

Problem

Solution

Usage

Implementation Details

Files Changed

Testing

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

saccharin98 commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants