feat: rename pretrained models to Ising-Decoder-SurfaceCode-1-{Fast,Accurate} by ivanbasov · Pull Request #51 · NVIDIA/Ising-Decoding

ivanbasov · 2026-04-07T17:52:30Z

Summary

Rename models/PreDecoderModelMemory_r9_v1.0.77.pt → models/Ising-Decoder-SurfaceCode-1-Fast.pt
Rename models/PreDecoderModelMemory_r13_v1.0.86.pt → models/Ising-Decoder-SurfaceCode-1-Accurate.pt
Both files remain Git LFS-tracked via the existing models/*.pt glob — no storage approach change
Add model_checkpoint_file direct-path config option to _load_model in run.py (new named files don't embed epoch numbers, so the old epoch-scanning logic doesn't apply)
Update test_inference_public_model.py to use the new filenames and the direct-path loader
Update README.md and checkpoint_to_safetensors.py docstring examples
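
To make the direct-path option concrete, here is a minimal sketch of how a loader might prefer an explicit model_checkpoint_file over directory scanning. It is not the code in run.py: the function name resolve_checkpoint_path, its parameters, and the scan_dir callback are all hypothetical, and the only detail taken from this PR is that a configured file path bypasses the epoch-scanning logic.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_checkpoint_path(
    model_checkpoint_file: Optional[str],
    model_dir: str,
    scan_dir: Callable[[str], str],
) -> Path:
    """Return the checkpoint to load (hypothetical helper, not the repo's API).

    An explicitly configured file path wins; otherwise defer to a directory scan.
    """
    if model_checkpoint_file:
        path = Path(model_checkpoint_file).expanduser()
        if not path.is_file():
            raise FileNotFoundError(f"Checkpoint file not found: {path}")
        return path
    # No direct path configured: fall back to whatever scanning strategy is supplied.
    return Path(scan_dir(model_dir))
```

Passing the scanner as a callback keeps the sketch self-contained; in the real code the scan is presumably a call to find_best_model.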

Test plan

git lfs pull after checkout — both new .pt files should resolve correctly
Run pytest code/tests/test_inference_public_model.py — model loading should work with new filenames via model_checkpoint_file
Verify find_best_model error message reads "No valid model checkpoint files found" (not old prefix-specific message)
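
One way to check the first item of the test plan is to confirm that the .pt files are real binaries rather than Git LFS pointer stubs. The sketch below relies on the fact that an LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`. The helper names is_lfs_pointer and unresolved_models are illustrative and not part of this repository.

```python
from pathlib import Path
from typing import List

# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: str) -> bool:
    """True if the file is still an LFS pointer stub rather than real content."""
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def unresolved_models(model_dir: str = "models") -> List[str]:
    """List .pt files under model_dir that still look like LFS pointer stubs."""
    return sorted(
        str(p) for p in Path(model_dir).glob("*.pt") if is_lfs_pointer(str(p))
    )
```

An empty list from unresolved_models("models") after `git lfs pull` suggests both renamed checkpoints were materialized.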

🤖 Generated with Claude Code

…fault torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…vent segfault" This reverts commit 7f0f6c8.

…ccurate}

- Rename PreDecoderModelMemory_r9_v1.0.77.pt → Ising-Decoder-SurfaceCode-1-Fast.pt
- Rename PreDecoderModelMemory_r13_v1.0.86.pt → Ising-Decoder-SurfaceCode-1-Accurate.pt
- Models remain Git LFS-tracked via models/*.pt (no storage change)
- Add model_checkpoint_file direct-path option to _load_model so named pretrained files (without epoch numbers) can be loaded without directory scanning
- Update test_inference_public_model.py, README, and checkpoint_to_safetensors.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

bmhowe23 · 2026-04-08T01:31:43Z

I am concerned that these renames may have broken some of the workflows in our README.md. For example, when I run the following, it now fails where it used to work before the rename. @ivanbasov - do you have any suggestions for changes to my workflow?

(.venv) root@9ad5eb711cc6:~/Ising-Decoding# cp models/Ising-Decoder-SurfaceCode-1-Fast.pt outputs/predecoder_model_1/models/
(.venv) root@9ad5eb711cc6:~/Ising-Decoding# CUDA_VISIBLE_DEVICES=3 ONNX_WORKFLOW=2 DISTANCE=13 N_ROUNDS=104 PREDECODER_INFERENCE_NUM_SAMPLES=2048 WORKFLOW=inference EXPERIMENT_NAME=predecoder_model_1 bash code/scripts/local_run.sh
==========================================
Local run
==========================================
workflow.task: inference
config: config_public
GPUS: 1 (CUDA_VISIBLE_DEVICES=3)
output: /home/Ising-Decoding/outputs/predecoder_model_1
logs: /home/Ising-Decoding/logs/predecoder_model_1_20260408_012832
overrides: hydra.run.dir=/home/Ising-Decoding/outputs/predecoder_model_1 distance=13 n_rounds=104
==========================================
Loading model for task: inference
Loading best model from: /home/Ising-Decoding/outputs/predecoder_model_1/models/best_model
best_model/ not found; falling back to: /home/Ising-Decoding/outputs/predecoder_model_1/models
Searching for best model in: /home/Ising-Decoding/outputs/predecoder_model_1/models
Found 0 model files:
Error executing job with overrides: ['workflow.task=inference', '+exp_tag=predecoder_model_1', '++load_checkpoint=True', 'distance=13', 'n_rounds=104']
Traceback (most recent call last):
  File "/home/Ising-Decoding/code/workflows/run.py", line 66, in run
    run_surface(cfg)
  File "/home/Ising-Decoding/code/workflows/run.py", line 81, in run_surface
    model = _load_model(cfg, dist)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Ising-Decoding/code/workflows/run.py", line 262, in _load_model
    model_path = find_best_model(model_dir, rank=dist.rank)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Ising-Decoding/code/workflows/run.py", line 132, in find_best_model
    raise FileNotFoundError(f"No valid model checkpoint files found in {path}")
FileNotFoundError: No valid model checkpoint files found in /home/Ising-Decoding/outputs/predecoder_model_1/models

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

ivanbasov · 2026-04-08T17:19:52Z

Thanks for the report @bmhowe23! This is a regression from the rename — find_best_model had a hard-coded guard requiring filenames to start with PreDecoderModelMemory_, so the new names were silently skipped and the directory appeared empty.

Fix is in #55: when no epoch-numbered PreDecoderModelMemory_* checkpoints are found, find_best_model now falls back to any .pt file in the directory. Your existing workflow (copy the file → run inference) will work without any changes on your end.
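
The fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the code merged in #55: the function name find_best_model_sketch is hypothetical, the rule "last integer before .pt is the epoch" is a guess at the legacy naming scheme, and picking the highest epoch is assumed. Only the prefix guard, the "any .pt file, sorted, last wins" fallback (as worded in the #55 commit message), and the error message come from this thread.

```python
import re
from pathlib import Path

LEGACY_PREFIX = "PreDecoderModelMemory_"
# Assumption: treat the last integer before ".pt" as the epoch number.
EPOCH_RE = re.compile(r"(\d+)\.pt$")


def find_best_model_sketch(model_dir: str) -> str:
    """Pick a checkpoint: legacy epoch-numbered files first, then any .pt file."""
    pt_files = sorted(p for p in Path(model_dir).glob("*.pt") if p.is_file())

    # Legacy path: prefix-guarded, epoch-numbered checkpoints.
    # Assumption: the highest epoch is the one to load.
    legacy = []
    for p in pt_files:
        match = EPOCH_RE.search(p.name)
        if p.name.startswith(LEGACY_PREFIX) and match:
            legacy.append((int(match.group(1)), p))
    if legacy:
        return str(max(legacy, key=lambda item: item[0])[1])

    # Fallback: any .pt file, sorted by name, last one wins.
    if pt_files:
        return str(pt_files[-1])

    raise FileNotFoundError(f"No valid model checkpoint files found in {model_dir}")
```

With only the prefix-guarded branch, a directory holding just Ising-Decoder-SurfaceCode-1-Fast.pt yields no candidates, which matches the "Found 0 model files" log in the report. Note that if both renamed files sit in the same directory, "last wins" under name sorting selects the Fast model, so naming the file explicitly via model_checkpoint_file is the less ambiguous route.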

* fix: find_best_model now accepts named .pt files without epoch numbers

The old code required filenames to start with PreDecoderModelMemory_ and encode an epoch number. After the model rename to Ising-Decoder-SurfaceCode-1-{Fast,Accurate}.pt, copying one of these files into the models dir and running inference via local_run.sh would fail with "Found 0 model files".

Fall back to any .pt file (sorted, last wins) when no epoch-numbered PreDecoderModelMemory_ checkpoints are found in the directory.

Fixes regression reported in #51

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix yapf formatting in find_best_model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ccurate} (#51)

* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* feat: rename pretrained models to Ising-Decoder-SurfaceCode-1-{Fast,Accurate}

- Rename PreDecoderModelMemory_r9_v1.0.77.pt → Ising-Decoder-SurfaceCode-1-Fast.pt
- Rename PreDecoderModelMemory_r13_v1.0.86.pt → Ising-Decoder-SurfaceCode-1-Accurate.pt
- Models remain Git LFS-tracked via models/*.pt (no storage change)
- Add model_checkpoint_file direct-path option to _load_model so named pretrained files (without epoch numbers) can be loaded without directory scanning
- Update test_inference_public_model.py, README, and checkpoint_to_safetensors.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov and others added 4 commits March 30, 2026 11:54

Revert "fix(ci): disable torch.compile in orientation training to pre…

9d3fa08

…vent segfault" This reverts commit 7f0f6c8.

Merge remote-tracking branch 'upstream/main'

838d14f

This was referenced Apr 7, 2026

Rename PreDecoderModelMemory_r9_v1.0.77.pt to Ising-Decoder-SurfaceCo… #48

Closed

Rename PreDecoderModelMemory_r13_v1.0.86.pt to Ising-Decoder-SurfaceC… #47

Closed

ivanbasov requested review from bmhowe23 and tlubowe April 7, 2026 17:53

bmhowe23 approved these changes Apr 7, 2026

ivanbasov merged commit 812b935 into NVIDIA:main Apr 7, 2026
26 checks passed

ivanbasov deleted the worktree-model-renames branch April 7, 2026 21:28

ivanbasov mentioned this pull request Apr 8, 2026

fix: find_best_model accepts named .pt files without epoch numbers #55

Merged

2 tasks

ivanbasov merged 4 commits into NVIDIA:main from ivanbasov:worktree-model-renames
