
feat: add checkpoint/resume for long document processing#227

Closed
ag9920 wants to merge 1 commit into VectifyAI:main from ag9920:feat_checkpoint_document

Conversation


@ag9920 ag9920 commented Apr 11, 2026

Summary

Add two-phase checkpoint support to the document processing pipeline,
enabling users to resume from the last completed stage instead of
restarting from scratch when processing is interrupted.

Closes #170

Problem

Processing large documents (100+ pages) through PageIndex is expensive
and time-consuming: the LLM calls in tree_parser and
generate_summaries can take 10-30 minutes and consume a substantial token budget.
If the process crashes at any point (API rate limit, network timeout,
context overflow), all progress is lost and users must start over.

Issue #170 raised this exact pain point: users need a way to recover
from failures without re-running the entire pipeline.

Solution

Introduce a two-phase checkpoint mechanism that saves intermediate
results after each expensive LLM stage:

| Checkpoint File | Saved After | What It Contains |
|---|---|---|
| `{doc}_tree.json` | `tree_parser` completes | Raw tree structure from TOC parsing |
| `{doc}_summary.json` | `generate_summaries` completes | Tree structure with summaries attached |

On resume, the pipeline automatically picks the latest available
checkpoint (summary > tree), skipping all completed LLM calls.

For Markdown documents, a single checkpoint ({doc}_md_summary.json)
is saved after summary generation, since tree construction is local
and doesn't require LLM calls.

Usage

CLI:

# First run: parse + save checkpoints
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints

# If interrupted, resume from the latest checkpoint
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints --resume

# Markdown works the same way
python run_pageindex.py --md_path doc.md --checkpoint-dir ./checkpoints --resume

Python API:

from pageindex import PageIndexClient

client = PageIndexClient(workspace="./workspace")

# Enable checkpointing
doc_id = client.index("doc.pdf", checkpoint_dir="./checkpoints")

# Resume after interruption
doc_id = client.index("doc.pdf", checkpoint_dir="./checkpoints", resume=True)

Human-in-the-loop correction:

Users can manually edit checkpoint JSON files (e.g., fix an incorrect
page number identified by the LLM) before resuming, enabling a
human-in-the-loop workflow not previously possible.

Implementation Details

  • Atomic writes: Checkpoints are written to a .tmp file first,
    then os.replace() atomically swaps it into place. This prevents
    corruption if the process crashes mid-write.

  • Backward compatible: When checkpoint_dir is not set (default
    None), behavior is identical to before — zero overhead.

  • Config integration: New checkpoint_dir (default: null) and
    resume (default: "no") fields added to config.yaml, validated
    by the existing ConfigLoader.

  • Error handling: --resume without --checkpoint-dir raises a
    clear error. Resume with no checkpoint files found raises
    FileNotFoundError with the expected file path.

Files Changed

| File | Change |
|---|---|
| `pageindex/page_index.py` | Two-phase checkpoint in `page_index_builder()` + `_save_checkpoint()` helper |
| `pageindex/page_index_md.py` | Checkpoint after summary generation in `md_to_tree()` + `_save_checkpoint_md()` helper |
| `pageindex/client.py` | Pass `checkpoint_dir`/`resume` through `PageIndexClient.index()` for both PDF and MD |
| `pageindex/config.yaml` | Add `checkpoint_dir: null` and `resume: "no"` |
| `run_pageindex.py` | Add `--checkpoint-dir` and `--resume` CLI args, global validation, pass to both PDF/MD branches |

Testing

  • Verified with a 21-page PDF using Kimi K2.5: both _tree.json and
    _summary.json checkpoints are saved correctly
  • --resume successfully skips LLM calls and loads from checkpoint
  • --resume without --checkpoint-dir raises clear error
  • Default behavior (no checkpoint_dir) unchanged
  • AST syntax check passes for all modified files


claude bot commented


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@saccharin98 (Collaborator) commented

Thanks for the contribution. I agree checkpoint/resume is a valuable direction for PageIndex, but I do not think this PR is the right design to merge.

PageIndex has a complex multi-stage pipeline with many intermediate states, recursive processing, concurrent LLM calls, retries, validation, correction, and
summary generation. Reliable resume support needs to checkpoint inside those expensive loops and track which sub-tasks have completed. Saving only after
tree_parser completes and after all summaries complete is closer to coarse artifact caching than true checkpoint/resume.

A robust design would also need checkpoint metadata and validation: source document hash, model, config/options, pipeline version, completed stage/task IDs,
and compatibility checks before loading. Without that, stale or incompatible checkpoints can be loaded silently and produce incorrect results.

There is also a correctness issue in the current PR: loading a PDF _summary.json checkpoint still falls through to generate_summaries_for_structure(...),
so it does not actually skip the expensive summary calls as described.
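Fixing that fallthrough needs a guard along these lines (illustrative names, not the PR's code): the summary stage must be conditioned on which checkpoint was actually loaded.

```python
def build_summaries(tree, resumed_stage, generate_summaries_fn):
    """Only invoke the expensive summary stage when it wasn't already resumed.

    resumed_stage is "summary", "tree", or None depending on the checkpoint
    that was loaded; generate_summaries_fn stands in for the real LLM stage.
    """
    if resumed_stage == "summary":
        return tree  # summaries are already present in the loaded checkpoint
    return generate_summaries_fn(tree)
```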

I am going to close this PR for now. I would be open to revisiting checkpoint/resume with a more deliberate design proposal, likely starting with a narrow,
well-tested checkpoint mechanism for one expensive loop/stage rather than broad CLI/API flags over the whole pipeline.
