feat: rename pretrained models to Ising-Decoder-SurfaceCode-1-{Fast,Accurate}#51
Merged
…fault torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vent segfault" This reverts commit 7f0f6c8.
…ccurate}
- Rename PreDecoderModelMemory_r9_v1.0.77.pt → Ising-Decoder-SurfaceCode-1-Fast.pt
- Rename PreDecoderModelMemory_r13_v1.0.86.pt → Ising-Decoder-SurfaceCode-1-Accurate.pt
- Models remain Git LFS-tracked via models/*.pt (no storage change)
- Add model_checkpoint_file direct-path option to _load_model so named pretrained files (without epoch numbers) can be loaded without directory scanning
- Update test_inference_public_model.py, README, and checkpoint_to_safetensors.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This was referenced Apr 7, 2026
bmhowe23
approved these changes
Apr 7, 2026
Collaborator
I am concerned that these renames may have broken some of the workflows in our README.md. For example, when I run the following, it now fails where it used to work before the rename. @ivanbasov - do you have any suggestions for changes to my workflow?
(.venv) root@9ad5eb711cc6:~/Ising-Decoding# cp models/Ising-Decoder-SurfaceCode-1-Fast.pt outputs/predecoder_model_1/models/
(.venv) root@9ad5eb711cc6:~/Ising-Decoding# CUDA_VISIBLE_DEVICES=3 ONNX_WORKFLOW=2 DISTANCE=13 N_ROUNDS=104 PREDECODER_INFERENCE_NUM_SAMPLES=2048 WORKFLOW=inference EXPERIMENT_NAME=predecoder_model_1 bash code/scripts/local_run.sh
==========================================
Local run
==========================================
workflow.task: inference
config: config_public
GPUS: 1 (CUDA_VISIBLE_DEVICES=3)
output: /home/Ising-Decoding/outputs/predecoder_model_1
logs: /home/Ising-Decoding/logs/predecoder_model_1_20260408_012832
overrides: hydra.run.dir=/home/Ising-Decoding/outputs/predecoder_model_1 distance=13 n_rounds=104
==========================================
Loading model for task: inference
Loading best model from: /home/Ising-Decoding/outputs/predecoder_model_1/models/best_model
best_model/ not found; falling back to: /home/Ising-Decoding/outputs/predecoder_model_1/models
Searching for best model in: /home/Ising-Decoding/outputs/predecoder_model_1/models
Found 0 model files:
Error executing job with overrides: ['workflow.task=inference', '+exp_tag=predecoder_model_1', '++load_checkpoint=True', 'distance=13', 'n_rounds=104']
Traceback (most recent call last):
  File "/home/Ising-Decoding/code/workflows/run.py", line 66, in run
    run_surface(cfg)
  File "/home/Ising-Decoding/code/workflows/run.py", line 81, in run_surface
    model = _load_model(cfg, dist)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Ising-Decoding/code/workflows/run.py", line 262, in _load_model
    model_path = find_best_model(model_dir, rank=dist.rank)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Ising-Decoding/code/workflows/run.py", line 132, in find_best_model
    raise FileNotFoundError(f"No valid model checkpoint files found in {path}")
FileNotFoundError: No valid model checkpoint files found in /home/Ising-Decoding/outputs/predecoder_model_1/models
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Member
Author
Thanks for the report @bmhowe23! This is a regression from the rename. The fix is in #55: when no epoch-numbered …
ivanbasov added a commit that referenced this pull request Apr 8, 2026
* fix: find_best_model now accepts named .pt files without epoch numbers
The old code required filenames to start with PreDecoderModelMemory_ and
encode an epoch number. After the model rename to Ising-Decoder-SurfaceCode-1-
{Fast,Accurate}.pt, copying one of these files into the models dir and running
inference via local_run.sh would fail with "Found 0 model files".
Fall back to any .pt file (sorted, last wins) when no epoch-numbered
PreDecoderModelMemory_ checkpoints are found in the directory.
Fixes regression reported in #51
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style: fix yapf formatting in find_best_model
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
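The fallback described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from #55: the function name `pick_checkpoint`, the regex used to read an epoch number, and the choice of the highest epoch as the preferred checkpoint are all assumptions made for the example. Only the `PreDecoderModelMemory_` prefix, the "sorted, last wins" fallback, and the error message text come from this thread.

```python
import re
from pathlib import Path

# Hypothetical pattern: the legacy prefix followed by a trailing integer that
# this sketch treats as the epoch. The real repository may encode it differently.
_EPOCH_RE = re.compile(r"^PreDecoderModelMemory_.*?(\d+)\.pt$")


def pick_checkpoint(model_dir):
    """Return the checkpoint path to load from model_dir (illustrative only)."""
    model_dir = Path(model_dir)
    pt_files = sorted(p for p in model_dir.glob("*.pt") if p.is_file())

    # First preference: epoch-numbered legacy checkpoints, highest epoch wins
    # (an assumption of this sketch).
    numbered = []
    for p in pt_files:
        match = _EPOCH_RE.match(p.name)
        if match:
            numbered.append((int(match.group(1)), p))
    if numbered:
        return max(numbered, key=lambda item: item[0])[1]

    # Fallback from the commit message: any .pt file, sorted, last one wins.
    if pt_files:
        return pt_files[-1]

    raise FileNotFoundError(f"No valid model checkpoint files found in {model_dir}")
```

Under these assumptions, a directory that holds only `Ising-Decoder-SurfaceCode-1-Fast.pt` resolves to that file instead of raising. If both named files were copied into the same directory, name ordering would select the `-Fast` file, so placing a single named checkpoint per directory is the less surprising setup.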
ivanbasov added a commit that referenced this pull request Apr 10, 2026
…ccurate} (#51)
* fix(ci): disable torch.compile in orientation training to prevent segfault
torch.compile=on combined with DataLoader spawn workers during LER validation
causes a segfault (20 leaked semaphores, core dumped). Set
PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"
This reverts commit 7f0f6c8.
* feat: rename pretrained models to Ising-Decoder-SurfaceCode-1-{Fast,Accurate}
- Rename PreDecoderModelMemory_r9_v1.0.77.pt → Ising-Decoder-SurfaceCode-1-Fast.pt
- Rename PreDecoderModelMemory_r13_v1.0.86.pt → Ising-Decoder-SurfaceCode-1-Accurate.pt
- Models remain Git LFS-tracked via models/*.pt (no storage change)
- Add model_checkpoint_file direct-path option to _load_model so named pretrained files (without epoch numbers) can be loaded without directory scanning
- Update test_inference_public_model.py, README, and checkpoint_to_safetensors.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ivanbasov added a commit that referenced this pull request Apr 10, 2026
* fix: find_best_model now accepts named .pt files without epoch numbers
The old code required filenames to start with PreDecoderModelMemory_ and
encode an epoch number. After the model rename to Ising-Decoder-SurfaceCode-1-
{Fast,Accurate}.pt, copying one of these files into the models dir and running
inference via local_run.sh would fail with "Found 0 model files".
Fall back to any .pt file (sorted, last wins) when no epoch-numbered
PreDecoderModelMemory_ checkpoints are found in the directory.
Fixes regression reported in #51
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style: fix yapf formatting in find_best_model
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- Rename models/PreDecoderModelMemory_r9_v1.0.77.pt → models/Ising-Decoder-SurfaceCode-1-Fast.pt
- Rename models/PreDecoderModelMemory_r13_v1.0.86.pt → models/Ising-Decoder-SurfaceCode-1-Accurate.pt
- Models remain Git LFS-tracked via the models/*.pt glob; no storage approach change
- Add model_checkpoint_file direct-path config option to _load_model in run.py (new named files don't embed epoch numbers, so the old epoch-scanning logic doesn't apply)
- Update test_inference_public_model.py to use the new filenames and the direct-path loader
- Update README.md and checkpoint_to_safetensors.py docstring examples
Test plan
- git lfs pull after checkout: both new .pt files should resolve correctly
- pytest code/tests/test_inference_public_model.py: model loading should work with new filenames via model_checkpoint_file
- find_best_model error message reads "No valid model checkpoint files found" (not old prefix-specific message)
🤖 Generated with Claude Code
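One way to picture the model_checkpoint_file option from the summary is that an explicit file path, when set, bypasses directory scanning entirely. The following is a minimal sketch under stated assumptions: a plain mapping stands in for the Hydra config, and `resolve_checkpoint` and the `scan` callback are hypothetical names introduced for illustration. The actual `_load_model` in run.py may be structured differently.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_checkpoint(cfg: Mapping, model_dir: str, scan: Callable[[Path], Path]) -> Path:
    """Pick a checkpoint: explicit file first, directory scan otherwise."""
    # Assumed config key, taken from the option name in the PR summary.
    explicit: Optional[str] = cfg.get("model_checkpoint_file")
    if explicit:
        path = Path(explicit)
        # Fail early with a clear message rather than falling back silently.
        if not path.is_file():
            raise FileNotFoundError(f"model_checkpoint_file does not exist: {path}")
        return path
    # No explicit file configured: defer to a directory-scanning strategy.
    return scan(Path(model_dir))
```

With this shape, a config such as `{"model_checkpoint_file": "models/Ising-Decoder-SurfaceCode-1-Fast.pt"}` would load the named file directly, while an unset key would leave the existing scan behaviour untouched.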