
feat: rename pretrained models to Ising-Decoder-SurfaceCode-1-{Fast,Accurate}#51

Merged
ivanbasov merged 4 commits into NVIDIA:main from ivanbasov:worktree-model-renames
Apr 7, 2026

Conversation

@ivanbasov
Member

Summary

  • Rename models/PreDecoderModelMemory_r9_v1.0.77.pt → models/Ising-Decoder-SurfaceCode-1-Fast.pt
  • Rename models/PreDecoderModelMemory_r13_v1.0.86.pt → models/Ising-Decoder-SurfaceCode-1-Accurate.pt
  • Both files remain Git LFS-tracked via the existing models/*.pt glob — no storage approach change
  • Add model_checkpoint_file direct-path config option to _load_model in run.py (new named files don't embed epoch numbers, so the old epoch-scanning logic doesn't apply)
  • Update test_inference_public_model.py to use the new filenames and the direct-path loader
  • Update README.md and checkpoint_to_safetensors.py docstring examples
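
The direct-path option in the summary above could behave roughly as follows. This is a minimal sketch, not the code in run.py: the helper name `resolve_checkpoint_path`, the plain-mapping config, and the `scan_directory` callback are assumptions made for illustration. Only the `model_checkpoint_file` key comes from this PR.

```python
from pathlib import Path
from typing import Callable, Mapping


def resolve_checkpoint_path(
    cfg: Mapping[str, object],
    model_dir: Path,
    scan_directory: Callable[[Path], Path],
) -> Path:
    """Pick the checkpoint to load (hypothetical helper, not the repo's API)."""
    direct = cfg.get("model_checkpoint_file")
    if direct:
        # A named pretrained file carries no epoch number, so it is used
        # as given instead of being discovered by directory scanning.
        path = Path(str(direct))
        if not path.is_file():
            raise FileNotFoundError(f"model_checkpoint_file does not exist: {path}")
        return path
    # No direct path configured: defer to the existing scanning logic.
    return scan_directory(model_dir)
```

With a config such as `{"model_checkpoint_file": "models/Ising-Decoder-SurfaceCode-1-Fast.pt"}`, the sketch returns that path without touching `scan_directory`; with the key absent, it falls through to the scan.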

Test plan

  • git lfs pull after checkout — both new .pt files should resolve correctly
  • Run pytest code/tests/test_inference_public_model.py — model loading should work with new filenames via model_checkpoint_file
  • Verify find_best_model error message reads "No valid model checkpoint files found" (not old prefix-specific message)

🤖 Generated with Claude Code

ivanbasov and others added 4 commits March 30, 2026 11:54
…fault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
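
As an illustration of the environment-variable switch described in the commit message above, a training script might gate compilation like this. It is a sketch under assumptions: the helper name, the set of "off" strings, and the default of compilation being enabled are hypothetical. Only the variable name `PREDECODER_TORCH_COMPILE` and the value `0` come from the commit.

```python
import os
from typing import Mapping

# Assumed spellings that mean "disabled"; the repository's actual parsing
# rule is not shown in this thread.
_OFF_VALUES = {"0", "false", "off", "no", ""}


def torch_compile_enabled(env: Mapping[str, str] = os.environ, default: bool = True) -> bool:
    """Return True when compilation should be applied (hypothetical helper)."""
    raw = env.get("PREDECODER_TORCH_COMPILE")
    if raw is None:
        return default
    return raw.strip().lower() not in _OFF_VALUES
```

A caller would then wrap the model only when `torch_compile_enabled()` is true, so a CI step that exports `PREDECODER_TORCH_COMPILE=0` skips compilation.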
…ccurate}

- Rename PreDecoderModelMemory_r9_v1.0.77.pt  → Ising-Decoder-SurfaceCode-1-Fast.pt
- Rename PreDecoderModelMemory_r13_v1.0.86.pt → Ising-Decoder-SurfaceCode-1-Accurate.pt
- Models remain Git LFS-tracked via models/*.pt (no storage change)
- Add model_checkpoint_file direct-path option to _load_model so named
  pretrained files (without epoch numbers) can be loaded without directory scanning
- Update test_inference_public_model.py, README, and checkpoint_to_safetensors.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov merged commit 812b935 into NVIDIA:main Apr 7, 2026
26 checks passed
@ivanbasov ivanbasov deleted the worktree-model-renames branch April 7, 2026 21:28
@bmhowe23
Collaborator

bmhowe23 commented Apr 8, 2026

I am concerned that these renames may have broken some of the workflows in our README.md. For example, the following command worked before the rename and now fails. @ivanbasov - do you have any suggestions for changes to my workflow?

(.venv) root@9ad5eb711cc6:~/Ising-Decoding# cp models/Ising-Decoder-SurfaceCode-1-Fast.pt outputs/predecoder_model_1/models/
(.venv) root@9ad5eb711cc6:~/Ising-Decoding# CUDA_VISIBLE_DEVICES=3 ONNX_WORKFLOW=2 DISTANCE=13 N_ROUNDS=104 PREDECODER_INFERENCE_NUM_SAMPLES=2048 WORKFLOW=inference EXPERIMENT_NAME=predecoder_model_1 bash code/scripts/local_run.sh
==========================================
Local run
==========================================
workflow.task: inference
config: config_public
GPUS: 1 (CUDA_VISIBLE_DEVICES=3)
output: /home/Ising-Decoding/outputs/predecoder_model_1
logs: /home/Ising-Decoding/logs/predecoder_model_1_20260408_012832
overrides: hydra.run.dir=/home/Ising-Decoding/outputs/predecoder_model_1 distance=13 n_rounds=104
==========================================
Loading model for task: inference
Loading best model from: /home/Ising-Decoding/outputs/predecoder_model_1/models/best_model
best_model/ not found; falling back to: /home/Ising-Decoding/outputs/predecoder_model_1/models
Searching for best model in: /home/Ising-Decoding/outputs/predecoder_model_1/models
Found 0 model files:
Error executing job with overrides: ['workflow.task=inference', '+exp_tag=predecoder_model_1', '++load_checkpoint=True', 'distance=13', 'n_rounds=104']
Traceback (most recent call last):
  File "/home/Ising-Decoding/code/workflows/run.py", line 66, in run
    run_surface(cfg)
  File "/home/Ising-Decoding/code/workflows/run.py", line 81, in run_surface
    model = _load_model(cfg, dist)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Ising-Decoding/code/workflows/run.py", line 262, in _load_model
    model_path = find_best_model(model_dir, rank=dist.rank)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Ising-Decoding/code/workflows/run.py", line 132, in find_best_model
    raise FileNotFoundError(f"No valid model checkpoint files found in {path}")
FileNotFoundError: No valid model checkpoint files found in /home/Ising-Decoding/outputs/predecoder_model_1/models

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

@ivanbasov
Member Author

Thanks for the report @bmhowe23! This is a regression from the rename — find_best_model had a hard-coded guard requiring filenames to start with PreDecoderModelMemory_, so the new names were silently skipped and the directory appeared empty.

Fix is in #55: when no epoch-numbered PreDecoderModelMemory_* checkpoints are found, find_best_model now falls back to any .pt file in the directory. Your existing workflow (copy the file → run inference) will work without any changes on your end.
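
The fallback described above can be sketched as follows. This is an illustration, not the code merged in #55: the function name, the rule that the epoch is the last integer in the file stem, and the way "best" is chosen among numbered checkpoints are assumptions. The `PreDecoderModelMemory_` prefix, the "sorted, last wins" fallback, and the error message are taken from this thread.

```python
import re
from pathlib import Path

PREFIX = "PreDecoderModelMemory_"


def find_best_model_sketch(model_dir: Path) -> Path:
    """Prefer epoch-numbered checkpoints, else fall back to any .pt file."""
    candidates = sorted(model_dir.glob("*.pt"))

    # Assumption: the epoch is the last integer embedded in the file stem.
    numbered = []
    for path in candidates:
        if path.name.startswith(PREFIX):
            digits = re.findall(r"\d+", path.stem)
            if digits:
                numbered.append((int(digits[-1]), path))
    if numbered:
        # Highest epoch wins among the epoch-numbered checkpoints.
        return max(numbered, key=lambda item: item[0])[1]

    if candidates:
        # No epoch-numbered checkpoints: take any .pt file (sorted, last wins).
        return candidates[-1]

    raise FileNotFoundError(f"No valid model checkpoint files found in {model_dir}")
```

Under this sketch, a directory holding only `Ising-Decoder-SurfaceCode-1-Fast.pt` resolves to that file, while an empty directory still raises the same `FileNotFoundError` seen in the traceback.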

ivanbasov added a commit that referenced this pull request Apr 8, 2026
* fix: find_best_model now accepts named .pt files without epoch numbers

The old code required filenames to start with PreDecoderModelMemory_ and
encode an epoch number. After the model rename to Ising-Decoder-SurfaceCode-1-
{Fast,Accurate}.pt, copying one of these files into the models dir and running
inference via local_run.sh would fail with "Found 0 model files".

Fall back to any .pt file (sorted, last wins) when no epoch-numbered
PreDecoderModelMemory_ checkpoints are found in the directory.

Fixes regression reported in #51

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix yapf formatting in find_best_model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ivanbasov added a commit that referenced this pull request Apr 10, 2026
…ccurate} (#51)

* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* feat: rename pretrained models to Ising-Decoder-SurfaceCode-1-{Fast,Accurate}

- Rename PreDecoderModelMemory_r9_v1.0.77.pt  → Ising-Decoder-SurfaceCode-1-Fast.pt
- Rename PreDecoderModelMemory_r13_v1.0.86.pt → Ising-Decoder-SurfaceCode-1-Accurate.pt
- Models remain Git LFS-tracked via models/*.pt (no storage change)
- Add model_checkpoint_file direct-path option to _load_model so named
  pretrained files (without epoch numbers) can be loaded without directory scanning
- Update test_inference_public_model.py, README, and checkpoint_to_safetensors.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ivanbasov added a commit that referenced this pull request Apr 10, 2026
* fix: find_best_model now accepts named .pt files without epoch numbers

The old code required filenames to start with PreDecoderModelMemory_ and
encode an epoch number. After the model rename to Ising-Decoder-SurfaceCode-1-
{Fast,Accurate}.pt, copying one of these files into the models dir and running
inference via local_run.sh would fail with "Found 0 model files".

Fall back to any .pt file (sorted, last wins) when no epoch-numbered
PreDecoderModelMemory_ checkpoints are found in the directory.

Fixes regression reported in #51

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix yapf formatting in find_best_model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>