feat: add checkpoint/resume for long document processing#227
ag9920 wants to merge 1 commit into VectifyAI:main from
Conversation
Thanks for the contribution. I agree checkpoint/resume is a valuable direction for PageIndex, but I do not think this PR is the right design to merge. PageIndex has a complex multi-stage pipeline with many intermediate states, recursive processing, concurrent LLM calls, retries, validation, correction, and […]. A robust design would also need checkpoint metadata and validation: source document hash, model, config/options, pipeline version, completed stage/task IDs, […]. There is also a correctness issue in the current PR: loading a PDF […]. I am going to close this PR for now. I would be open to revisiting checkpoint/resume with a more deliberate design proposal, likely starting with a narrow, […]
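The resume-safety metadata the review asks for could be sketched as follows. This is a minimal illustration of the idea, not code from PageIndex: the function names, `PIPELINE_VERSION`, and the field names are all assumptions.

```python
import hashlib

PIPELINE_VERSION = "1"  # hypothetical version tag, bumped on incompatible pipeline changes

def checkpoint_metadata(doc_bytes: bytes, model: str, options: dict,
                        completed_stages: list) -> dict:
    """Record enough context to decide, on resume, whether an old
    checkpoint still matches the current run."""
    return {
        "doc_sha256": hashlib.sha256(doc_bytes).hexdigest(),
        "model": model,
        "options": options,
        "pipeline_version": PIPELINE_VERSION,
        "completed_stages": completed_stages,
    }

def is_compatible(meta: dict, doc_bytes: bytes, model: str, options: dict) -> bool:
    """Refuse to resume if the document, model, config, or pipeline changed."""
    return (
        meta.get("doc_sha256") == hashlib.sha256(doc_bytes).hexdigest()
        and meta.get("model") == model
        and meta.get("options") == options
        and meta.get("pipeline_version") == PIPELINE_VERSION
    )
```

On resume, a checkpoint whose metadata fails `is_compatible` would be discarded rather than silently reused against a changed document or config.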
Summary
Add two-phase checkpoint support to the document processing pipeline,
enabling users to resume from the last completed stage instead of
restarting from scratch when processing is interrupted.
Closes #170
Problem
Processing large documents (100+ pages) through PageIndex is expensive
and time-consuming: the LLM calls in `tree_parser` and `generate_summaries` can take 10-30 minutes and cost significant tokens. If the process crashes at any point (API rate limit, network timeout,
context overflow), all progress is lost and users must start over.
Issue #170 raised this exact pain point: users need a way to recover
from failures without re-running the entire pipeline.
Solution
Introduce a two-phase checkpoint mechanism that saves intermediate
results after each expensive LLM stage:
- `{doc}_tree.json`: saved after `tree_parser` completes
- `{doc}_summary.json`: saved after `generate_summaries` completes

On resume, the pipeline automatically picks the latest available
checkpoint (summary > tree), skipping all completed LLM calls.
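The "summary > tree" priority amounts to probing for the later-stage file first. A minimal sketch of that selection, assuming the file-naming scheme above (the function name and return shape are illustrative, not the PR's actual code):

```python
import os

def latest_checkpoint(checkpoint_dir: str, doc: str):
    """Return (stage, path) for the most advanced checkpoint on disk,
    or None if no checkpoint has been saved yet. Later pipeline stages
    take priority over earlier ones."""
    candidates = [
        ("summary", os.path.join(checkpoint_dir, f"{doc}_summary.json")),
        ("tree", os.path.join(checkpoint_dir, f"{doc}_tree.json")),
    ]
    for stage, path in candidates:
        if os.path.exists(path):
            return stage, path
    return None
```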
For Markdown documents, a single checkpoint (`{doc}_md_summary.json`)
is saved after summary generation, since tree construction is local
and doesn't require LLM calls.
Usage
CLI:
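The original CLI example did not survive extraction. An illustrative invocation based on the flags this PR describes; `--pdf_path` and the exact `--resume` value syntax are assumptions:

```bash
# First run: checkpoints are written after each expensive LLM stage
python3 run_pageindex.py --pdf_path docs/report.pdf --checkpoint-dir ./checkpoints

# After a crash: resume from the latest checkpoint instead of restarting
python3 run_pageindex.py --pdf_path docs/report.pdf --checkpoint-dir ./checkpoints --resume yes
```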
Python API:
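The original API example did not survive extraction. A hypothetical sketch of the surface implied by the "Files Changed" list; `PageIndexClient`'s constructor arguments and the exact `index()` parameter names are assumptions:

```python
from pageindex import PageIndexClient  # assumed import path

client = PageIndexClient()  # constructor arguments omitted / unknown

# First run: writes {doc}_tree.json / {doc}_summary.json into checkpoint_dir
tree = client.index("docs/report.pdf", checkpoint_dir="./checkpoints")

# After an interruption: skips completed LLM stages
tree = client.index("docs/report.pdf", checkpoint_dir="./checkpoints", resume="yes")
```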
Human-in-the-loop correction:
Users can manually edit checkpoint JSON files (e.g., fix an incorrect
page number identified by the LLM) before resuming, enabling a
human-in-the-loop workflow not previously possible.
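Such a correction could be scripted as well as done by hand. A sketch of one such edit, assuming a hypothetical tree layout with `title`, `page`, and `nodes` fields (the real checkpoint schema is whatever `tree_parser` emits):

```python
import json

def fix_page_number(path: str, node_title: str, correct_page: int) -> None:
    """Load a checkpoint, correct one node's page number, save it back.
    Field names here are illustrative, not PageIndex's actual schema."""
    with open(path, encoding="utf-8") as f:
        tree = json.load(f)

    def walk(node):
        # Recursively fix every node whose title matches.
        if node.get("title") == node_title:
            node["page"] = correct_page
        for child in node.get("nodes", []):
            walk(child)

    walk(tree)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tree, f, ensure_ascii=False, indent=2)
```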
Implementation Details
Atomic writes: Checkpoints are written to a `.tmp` file first, then
`os.replace()` atomically swaps it into place. This prevents
corruption if the process crashes mid-write.
Backward compatible: When `checkpoint_dir` is not set (default `None`), behavior is identical to before, with zero overhead.
Config integration: New `checkpoint_dir` (default: `null`) and `resume` (default: `"no"`) fields added to `config.yaml`, validated by the existing `ConfigLoader`.
Error handling: `--resume` without `--checkpoint-dir` raises a clear error. Resume with no checkpoint files found raises `FileNotFoundError` with the expected file path.
Files Changed
- `pageindex/page_index.py`: `page_index_builder()` + `_save_checkpoint()` helper
- `pageindex/page_index_md.py`: `md_to_tree()` + `_save_checkpoint_md()` helper
- `pageindex/client.py`: pass `checkpoint_dir`/`resume` through `PageIndexClient.index()` for both PDF and MD
- `pageindex/config.yaml`: `checkpoint_dir: null` and `resume: "no"` defaults
- `run_pageindex.py`: `--checkpoint-dir` and `--resume` CLI args, global validation, pass to both PDF/MD branches

Testing
- `_tree.json` and `_summary.json` checkpoints are saved correctly
- `--resume` successfully skips LLM calls and loads from checkpoint
- `--resume` without `--checkpoint-dir` raises a clear error