[Auto-Recovery] Add checkpoint save/load/resume#1947

Open
yf225 wants to merge 1 commit into yf225/stack/95 from yf225/stack/96

yf225 (Contributor) commented Apr 4, 2026

Stacked PRs:


[Auto-Recovery] Add checkpoint save/load/resume

Add opt-in checkpoint support gated behind HELION_AUTOTUNE_CHECKPOINT_DIR.
When set, the autotuner saves in-progress state each generation and can
resume from a checkpoint on subsequent runs. The checkpoint file is
deleted on successful completion.

Includes pickle serialization support for BaseSearch and PopulationMember,
stable-hash-based checkpoint file naming, atomic writes, and kernel
recompilation on checkpoint load.
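
For readers unfamiliar with the pattern, here is a minimal sketch of opt-in gating, stable-hash file naming, and atomic writes. It is not the code in this PR. The helper names (`checkpoint_path`, `save_checkpoint`, `load_checkpoint`), the `.checkpoint` suffix, the use of SHA-256, and the `kernel_key` string are all assumptions made for illustration. Only the `HELION_AUTOTUNE_CHECKPOINT_DIR` variable comes from the description above.

```python
import hashlib
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional

ENV_VAR = "HELION_AUTOTUNE_CHECKPOINT_DIR"


def checkpoint_path(kernel_key: str) -> Optional[Path]:
    """Return the checkpoint file for a kernel, or None when checkpointing is off."""
    directory = os.environ.get(ENV_VAR)
    if not directory:
        return None  # opt-in: no directory configured, no checkpoints
    # A cryptographic digest is stable across processes, unlike the built-in
    # hash(), which is randomized per interpreter run for strings.
    digest = hashlib.sha256(kernel_key.encode("utf-8")).hexdigest()[:16]
    return Path(directory) / f"{digest}.checkpoint"  # hypothetical naming scheme


def save_checkpoint(path: Path, state: Any) -> None:
    """Write state atomically: readers see the old file or the new one, never a partial one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)  # atomic rename over any existing checkpoint
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path: Path) -> Optional[Any]:
    """Return the saved state, or None if no checkpoint exists yet."""
    if not path.exists():
        return None
    with path.open("rb") as f:
        return pickle.load(f)
```

Under these assumptions, a search loop would call `save_checkpoint` once per generation and remove the file after a successful run. Because `pickle.load` can execute arbitrary code, the checkpoint directory should only contain files the user trusts.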

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Apr 4, 2026
yf225 added a commit that referenced this pull request Apr 4, 2026
stack-info: PR: #1947, branch: yf225/stack/96
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 21:20
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 21:20
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 21:58
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 21:58
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 21:58
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 21:58
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 22:06
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 22:06
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 23:06
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 23:06
yf225 added a commit that referenced this pull request Apr 7, 2026
Comment on lines +198 to +207
```bash
# Enable checkpointing to a directory:
HELION_AUTOTUNE_CHECKPOINT_DIR=/tmp/helion_checkpoints python run_kernel.py

# If interrupted, just re-run with the same directory to resume:
HELION_AUTOTUNE_CHECKPOINT_DIR=/tmp/helion_checkpoints python run_kernel.py
```

Without `HELION_AUTOTUNE_CHECKPOINT_DIR`, no checkpoints are saved (opt-in).
Multiple kernels can safely use the same directory — each kernel writes to a
file named by its unique stable hash.

This resume behavior doesn't seem super clear to me. IMO there should be a different API for resuming versus starting over.

For the crash recovery use case maybe we should also do something more automatic -- like have a parent process that auto-resumes the child.
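
To make the parent-process idea concrete, here is a rough sketch of a supervisor that re-launches the child until it exits cleanly. It is an illustration only, not something proposed in this PR. The function name `run_with_auto_resume` and the `max_restarts` budget are hypothetical. It assumes the child picks up `HELION_AUTOTUNE_CHECKPOINT_DIR` from the inherited environment and resumes on its own when re-run.

```python
import subprocess
import sys
from typing import Sequence


def run_with_auto_resume(cmd: Sequence[str], max_restarts: int = 3) -> int:
    """Re-run cmd until it exits with status 0 or the restart budget is spent.

    Returns the exit status of the last attempt.
    """
    attempts = 0
    while True:
        # The child inherits this process's environment, including any
        # checkpoint directory variable that is already set.
        returncode = subprocess.run(list(cmd)).returncode
        if returncode == 0 or attempts >= max_restarts:
            return returncode
        attempts += 1
        print(
            f"child exited with {returncode}; restart {attempts}/{max_restarts}",
            file=sys.stderr,
        )
```

A caller might use `run_with_auto_resume([sys.executable, "run_kernel.py"])`. The restart cap is there so a kernel that crashes deterministically does not loop forever.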
