diff --git a/README.md b/README.md
index a0b398b..242d5eb 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
-# AI pre-decoder for surface-code memory circuits
+# Ising Decoding
 
-This repo implements a **pre-decoder** for surface-code memory experiments:
+This repo offers AI training frameworks and recipes to build, customize and deploy scalable quantum error correction **decoders**:
 
 - A neural network consumes detector syndromes across space **and** time
 - It predicts corrections that reduce syndrome density / improve decoding
@@ -92,8 +92,8 @@ pip install -r code/requirements_public_inference.txt
 
 2. **Get the pre-trained models**  
    This repo ships two pre-trained model files (tracked with Git LFS):
-   - `models/PreDecoderModelMemory_r9_v1.0.77.pt` (receptive field R=9, checkpoint 77)
-   - `models/PreDecoderModelMemory_r13_v1.0.86.pt` (receptive field R=13, checkpoint 86)
+   - `models/Ising-Decoder-SurfaceCode-1-Fast.pt` (receptive field R=9)
+   - `models/Ising-Decoder-SurfaceCode-1-Accurate.pt` (receptive field R=13)
 
    Clones get the files via `git lfs pull`. Optionally, set `PREDECODER_MODEL_URL` to the LFS/raw URL to fetch files when not in the working tree (e.g. in a minimal checkout or CI).
 
@@ -138,8 +138,16 @@ The pre-trained public models use `--model-id 1` (R=9) and `--model-id 4` (R=13)
 After training (or starting from the shipped `.safetensors` files), you can export the model to
 ONNX and optionally apply INT8 or FP8 post-training quantization for deployment.
 
-Set the `ONNX_WORKFLOW` and (optionally) `QUANT_FORMAT` environment variables before running
-inference with `local_run.sh`:
+You may also change the surface code distance and number of rounds at inference
+time. That is - you are not required retrain a new model when changing either
+one of these parameters; since the model is a 3D convolutional neural network,
+the model will simply be run over a new decoding volume.
+
+- To run with a new distance, simply add `DISTANCE=<your distance>` to the commands below.
+- To run with a new number of rounds, simply add `N_ROUNDS=<your number of rounds>` to the commands below.
+
+Set the `ONNX_WORKFLOW` and (optionally) (`QUANT_FORMAT`, `DISTANCE`,
+`N_ROUNDS`) environment variables before running inference with `local_run.sh`:
 
 | `ONNX_WORKFLOW` | Behavior |
 |---|---|
@@ -169,7 +177,16 @@ ONNX_WORKFLOW=3 WORKFLOW=inference bash code/scripts/local_run.sh
 | `QUANT_FORMAT` | unset | `int8` or `fp8`. Unset means no quantization (FP32 ONNX). |
 | `QUANT_CALIB_SAMPLES` | `256` | Calibration samples for INT8/FP8 post-training quantization. |
 
+**Circuit variables:**
+
+| Variable | Default | Description |
+|---|---|---|
+| `CONFIG_NAME` | `config_public` | Use the defaults from the `conf/$CONFIG_NAME.yaml` file |
+| `DISTANCE` | Use the distance specified in the `conf/$CONFIG_NAME.yaml` file | surface code distance |
+| `N_ROUNDS` | Calibration samples for INT8/FP8 post-training quantization. | number of rounds in memory experiment |
+
 Notes:
+
 - TensorRT workflows (`ONNX_WORKFLOW=2` or `3`) require `tensorrt` and `modelopt`.
 - FP8 quantization failure is fatal. INT8 failure falls back to the FP32 ONNX model silently.
 - ONNX and engine files are written to the current working directory.
@@ -215,7 +232,7 @@ Results are written to `outputs/<EXPERIMENT_NAME>/plots/`.
 | Decoder | Source | Notes |
 |---|---|---|
 | No-op | — | Pre-decoder output only, no global correction |
-| Union-Find | `ldpc` | Fast, sub-optimal |
+| Union-Find | `ldpc` | Fast, sub-optimal LER (Logical Error Rate) |
 | BP-only | `ldpc` | Belief propagation, no OSD |
 | BP+LSD-0 | `ldpc` | BP with localized statistics decoding |
 | Uncorr-PM | PyMatching | Uncorrelated minimum-weight perfect matching |
@@ -556,4 +573,4 @@ Presence of these headers is enforced automatically by the `spdx-header-check` C
 `.github/workflows/ci.yml`).
 
 Third-party open source components bundled with or required by this project are listed with their
-respective copyright notices and license texts in [NOTICE](NOTICE).
\ No newline at end of file
+respective copyright notices and license texts in [NOTICE](NOTICE).
diff --git a/TRAINING.md b/TRAINING.md
index d6ae31f..a8542ae 100644
--- a/TRAINING.md
+++ b/TRAINING.md
@@ -7,7 +7,7 @@ For local single-machine usage, see `README.md`.
 ## Prerequisites
 
 - Docker with NVIDIA GPU support (`nvidia-docker` / `--gpus`)
-- One or more NVIDIA GPUs (H100, A100, or similar)
+- One or more NVIDIA GPUs (B200, H200 or similar)
 - A persistent directory for checkpoints and logs
 
 ## Quick start (Docker — recommended)
diff --git a/code/evaluation/failure_analysis.py b/code/evaluation/failure_analysis.py
index 4f9a4b1..56e476a 100644
--- a/code/evaluation/failure_analysis.py
+++ b/code/evaluation/failure_analysis.py
@@ -18,6 +18,7 @@
 """
 import os
 import random
+import warnings
 
 import numpy as np
 import torch
@@ -179,27 +180,39 @@ def _build_cudaq_decoders(det_model):
 
 def _decode_cudaq_batch(decoder, L_dense, syndromes_np):
     """
-    Decode a batch of syndromes with a cudaq-qec nv-qldpc-decoder (single-shot loop).
+    Decode a batch of syndromes with a cudaq-qec nv-qldpc-decoder.
     Returns (obs, stats) where:
       - obs: observable predictions as np.ndarray of shape (B,)
       - stats: dict with per-sample convergence flags, iteration counts
-    The decoder.decode() takes list[float] and returns DecoderResult with .result (list[float]).
     """
     B = syndromes_np.shape[0]
-    obs = np.zeros(B, dtype=np.uint8)
+    n_bits = L_dense.shape[1]
     converged_flags = np.zeros(B, dtype=bool)
     iter_counts = np.zeros(B, dtype=np.int32)
-    for i in range(B):
-        syndrome_list = syndromes_np[i].astype(np.float64).tolist()
-        result = decoder.decode(syndrome_list)
-        correction = np.array(result.result, dtype=np.uint8)
-        obs[i] = int((L_dense @ correction).item() %
-                     2) if L_dense.shape[0] == 1 else int((L_dense @ correction)[0] % 2)
+    corrections = np.empty((B, n_bits), dtype=np.uint8)
+    syndromes_f64 = np.ascontiguousarray(syndromes_np, dtype=np.float64)
+
+    def _unpack(i, result):
+        corrections[i] = np.array(result.result, dtype=np.uint8)
         converged_flags[i] = result.converged
-        # Collect iteration count if available via opt_results
         opt = getattr(result, 'opt_results', None)
         if opt and isinstance(opt, dict) and 'num_iter' in opt:
             iter_counts[i] = opt['num_iter']
+
+    def _loop_decode():
+        for i in range(B):
+            _unpack(i, decoder.decode(syndromes_f64[i].tolist()))
+
+    try:
+        results = decoder.decode_batch(syndromes_f64.tolist())
+    except Exception as exc:
+        warnings.warn(f"decode_batch failed ({exc}); falling back to per-sample loop")
+        _loop_decode()
+    else:
+        for i, result in enumerate(results):
+            _unpack(i, result)
+
+    obs = ((corrections.astype(np.int32) @ L_dense.T.astype(np.int32))[:, 0] % 2).astype(np.uint8)
     return obs, {"converged_flags": converged_flags, "iter_counts": iter_counts}
 
 
@@ -249,20 +262,17 @@ def _build_ldpc_decoders(det_model):
 
 def _decode_ldpc_batch(decoder, L_dense, syndromes_np):
     """
-    Decode a batch of syndromes with an ldpc decoder (single-shot loop).
+    Decode a batch of syndromes with an ldpc decoder.
     Returns observable predictions as np.ndarray of shape (B,).
     """
     B = syndromes_np.shape[0]
-    obs = np.zeros(B, dtype=np.uint8)
+    n_bits = L_dense.shape[1]
+    syndromes_c = np.ascontiguousarray(syndromes_np, dtype=np.uint8)
+    corrections = np.empty((B, n_bits), dtype=np.uint8)
     for i in range(B):
-        # Get the most-likely error configuration from the decoder for this syndrome.
-        correction = decoder.decode(syndromes_np[i])
-        # Project the correction onto the logical observable via L_dense (mod 2).
-        # L_dense has shape (num_obs, num_errors); the first observable row is used.
-        obs[i] = (
-            int((L_dense @ correction).item() %
-                2) if L_dense.shape[0] == 1 else int((L_dense @ correction)[0] % 2)
-        )
+        corrections[i] = decoder.decode(syndromes_c[i])
+
+    obs = ((corrections.astype(np.int32) @ L_dense.T.astype(np.int32))[:, 0] % 2).astype(np.uint8)
     return obs
 
 
diff --git a/code/evaluation/logical_error_rate.py b/code/evaluation/logical_error_rate.py
index e3ba959..342bf41 100644
--- a/code/evaluation/logical_error_rate.py
+++ b/code/evaluation/logical_error_rate.py
@@ -200,7 +200,7 @@ def _ort_quantize_int8(fp32_onnx_path: str, output_path: str, calib_dets: "np.nd
     class _DetCalibReader(CalibrationDataReader):
 
         def __init__(self, data):
-            self._rows = [{"dets": data[i:i + 1].astype("float32")} for i in range(len(data))]
+            self._rows = [{"dets": data[i:i + 1]} for i in range(len(data))]
             self._iter = iter(self._rows)
 
         def get_next(self):
@@ -1202,7 +1202,7 @@ def run_inference_and_decode_pre_decoder_memory(model, device, dist, cfg) -> dic
                             mq.quantize(
                                 onnx_path=fp32_onnx_path,
                                 quantize_mode=quant_format,
-                                calibration_data={"dets": calib_dets.astype("float32")},
+                                calibration_data={"dets": calib_dets},
                                 output_path=onnx_path,
                                 **quant_kwargs,
                             )
diff --git a/code/export/checkpoint_to_safetensors.py b/code/export/checkpoint_to_safetensors.py
index eb3eed9..a64d0be 100644
--- a/code/export/checkpoint_to_safetensors.py
+++ b/code/export/checkpoint_to_safetensors.py
@@ -20,11 +20,11 @@
 
 Usage:
     PYTHONPATH=code python code/export/checkpoint_to_safetensors.py \\
-        --checkpoint models/PreDecoderModelMemory_r9_v1.0.77.pt \\
+        --checkpoint models/Ising-Decoder-SurfaceCode-1-Fast.pt \\
         --model-id 1 [--fp16]
 
 Then run inference with:
-    PREDECODER_SAFETENSORS_CHECKPOINT=models/PreDecoderModelMemory_r9_v1.0.77_fp16.safetensors \\
+    PREDECODER_SAFETENSORS_CHECKPOINT=models/Ising-Decoder-SurfaceCode-1-Fast_fp16.safetensors \\
     WORKFLOW=inference DISTANCE=9 N_ROUNDS=9 EXPERIMENT_NAME=predecoder_model_1 \\
     bash code/scripts/local_run.sh
 """
diff --git a/code/export/generate_test_data.py b/code/export/generate_test_data.py
new file mode 100644
index 0000000..96d2ac0
--- /dev/null
+++ b/code/export/generate_test_data.py
@@ -0,0 +1,370 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Generate test data for the pre-decoder evaluation pipeline.
+
+Builds a Stim memory circuit, samples syndrome data, extracts the detector
+error model (H, O, priors) via beliefmatching, decodes with PyMatching, and
+optionally runs an ONNX pre-decoder model.  All artefacts are written to a
+directory in a custom binary format so that residual syndromes can be fed
+into PyMatching for evaluation.
+
+Output files
+------------
+All files use little-endian byte order.
+
+  metadata.txt              Plain text, one key=value per line.  Records the
+                            circuit parameters used for generation.
+
+  detectors.bin             Stim detector values (syndrome bits).
+                            Header: (num_samples: uint32, num_detectors: uint32)
+                            Body:   num_samples × num_detectors int32 values
+                                    (row-major, each value 0 or 1).
+
+  observables.bin           Ground-truth logical observable for each shot.
+                            Header: (num_samples: uint32, num_observables: uint32)
+                            Body:   num_samples × num_observables int32 values.
+
+  pymatching_predictions.bin
+                            Baseline PyMatching decode of the raw detectors
+                            (no pre-decoder).  Same format as observables.bin.
+
+  H_csr.bin                 Parity check matrix from the detector error model
+                            (binary CSR — all non-zero entries are 1).
+                            Header: (rows: uint32, cols: uint32, nnz: uint32)
+                            Body:   (rows + 1) int32 indptr values,
+                                    nnz int32 column-index values.
+
+  O_csr.bin                 Observable matrix from the detector error model.
+                            Same binary-CSR format as H_csr.bin.
+
+  priors.bin                Edge error probabilities from the detector error
+                            model (one per column of H / O).
+                            Header: (n: uint32)
+                            Body:   n float64 values.
+
+  predecoder_outputs.bin    (only written when --onnx-model is supplied)
+                            Raw output of the ONNX pre-decoder model.
+                            Header: (num_samples: uint32, 1 + num_detectors: uint32)
+                            Body:   num_samples × (1 + num_detectors) uint8 values.
+                            Column 0 is pre_L; columns 1.. are residual detectors.
+
+Usage
+-----
+    python generate_test_data.py --distance 13 --n-rounds 104 --basis X \\
+        --num-samples 1000 --output-dir ../../test_data/d13_T104_X
+
+    # Optionally include ONNX pre-decoder outputs:
+    python generate_test_data.py --distance 13 --n-rounds 104 --basis X \\
+        --num-samples 1000 --onnx-model /path/to/predecoder.onnx \\
+        --output-dir ../../test_data/d13_T104_X
+"""
+
+import sys
+import struct
+import time
+import argparse
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+import numpy as np
+import pymatching
+from beliefmatching.belief_matching import detector_error_model_to_check_matrices
+
+# Import MemoryCircuit without triggering qec/surface_code/__init__.py,
+# which pulls in data_mapping → torch.  We only need memory_circuit.py
+# (numpy + stim) and noise_model.py, so we bypass the package __init__.
+import importlib, types
+
+_sc_pkg = types.ModuleType("qec.surface_code")
+_sc_pkg.__path__ = [str(Path(__file__).resolve().parent.parent / "qec" / "surface_code")]
+_sc_pkg.__package__ = "qec.surface_code"
+sys.modules.setdefault("qec.surface_code", _sc_pkg)
+
+from qec.surface_code.memory_circuit import MemoryCircuit
+from qec.noise_model import NoiseModel
+
+# Default 25-parameter noise model matching config_public.yaml at p=0.003
+DEFAULT_NOISE_PARAMS = {
+    "p_prep_X": 0.002,
+    "p_prep_Z": 0.002,
+    "p_meas_X": 0.002,
+    "p_meas_Z": 0.002,
+    "p_idle_cnot_X": 0.001,
+    "p_idle_cnot_Y": 0.001,
+    "p_idle_cnot_Z": 0.001,
+    "p_idle_spam_X": 0.001998,
+    "p_idle_spam_Y": 0.001998,
+    "p_idle_spam_Z": 0.001998,
+    "p_cnot_IX": 0.0002,
+    "p_cnot_IY": 0.0002,
+    "p_cnot_IZ": 0.0002,
+    "p_cnot_XI": 0.0002,
+    "p_cnot_XX": 0.0002,
+    "p_cnot_XY": 0.0002,
+    "p_cnot_XZ": 0.0002,
+    "p_cnot_YI": 0.0002,
+    "p_cnot_YX": 0.0002,
+    "p_cnot_YY": 0.0002,
+    "p_cnot_YZ": 0.0002,
+    "p_cnot_ZI": 0.0002,
+    "p_cnot_ZX": 0.0002,
+    "p_cnot_ZY": 0.0002,
+    "p_cnot_ZZ": 0.0002,
+}
+
+_ROTATION_ALIASES = {"O1": "XV", "O2": "XH", "O3": "ZV", "O4": "ZH"}
+
+# ---------------------------------------------------------------------------
+# Binary I/O helpers
+# ---------------------------------------------------------------------------
+
+
+def save_dense_bin(path: str, arr: np.ndarray) -> None:
+    """Save a 2-D array with an 8-byte header: (rows: u32, cols: u32)."""
+    rows, cols = arr.shape
+    with open(path, "wb") as f:
+        f.write(struct.pack("<II", rows, cols))
+        f.write(arr.tobytes())
+
+
+def save_csr_bin(path: str, mat) -> None:
+    """Save a scipy CSR matrix: (rows: u32, cols: u32, nnz: u32, indptr, indices).
+
+    Values are not stored — the matrix is assumed binary (all non-zeros are 1).
+    """
+    from scipy import sparse
+    csr = sparse.csr_matrix(mat)
+    rows, cols = csr.shape
+    nnz = csr.nnz
+    with open(path, "wb") as f:
+        f.write(struct.pack("<III", rows, cols, nnz))
+        f.write(csr.indptr.astype(np.int32).tobytes())
+        f.write(csr.indices.astype(np.int32).tobytes())
+
+
+def save_priors_bin(path: str, priors: np.ndarray) -> None:
+    """Save a 1-D float64 array: (n: u32, data as float64)."""
+    n = len(priors)
+    with open(path, "wb") as f:
+        f.write(struct.pack("<I", n))
+        f.write(priors.astype(np.float64).tobytes())
+
+
+def save_metadata(path: str, **kwargs) -> None:
+    with open(path, "w") as f:
+        for k, v in kwargs.items():
+            f.write(f"{k}={v}\n")
+
+
+# ---------------------------------------------------------------------------
+# Main generation logic
+# ---------------------------------------------------------------------------
+
+
+def generate_test_data(
+    distance: int = 13,
+    n_rounds: int = 104,
+    basis: str = "X",
+    p_error: float = 0.003,
+    code_rotation: str = "XV",
+    noise_model_params: dict | None = None,
+    num_samples: int = 1000,
+    onnx_model: str | None = None,
+    output_dir: str = "test_data",
+):
+    code_rotation = _ROTATION_ALIASES.get(code_rotation.upper(), code_rotation.upper())
+    out = Path(output_dir)
+    out.mkdir(parents=True, exist_ok=True)
+
+    # ---- Noise model ----
+    noise_model = None
+    if noise_model_params is not None:
+        noise_model = NoiseModel.from_config_dict(noise_model_params)
+
+    p_placeholder = float(noise_model.get_max_probability()) if noise_model else float(p_error)
+
+    # ---- Build Stim circuit ----
+    print(
+        f"Building circuit: D={distance}, T={n_rounds}, basis={basis}, "
+        f"rotation={code_rotation}, p={p_error}"
+    )
+    t0 = time.perf_counter()
+    circ = MemoryCircuit(
+        distance=distance,
+        idle_error=p_placeholder,
+        sqgate_error=p_placeholder,
+        tqgate_error=p_placeholder,
+        spam_error=(2.0 / 3.0) * p_placeholder,
+        n_rounds=n_rounds,
+        basis=basis.upper(),
+        code_rotation=code_rotation,
+        noise_model=noise_model,
+        add_boundary_detectors=True,
+    )
+    circ.set_error_rates()
+    stim_circuit = circ.stim_circuit
+    print(f"  Circuit built in {time.perf_counter() - t0:.3f}s")
+
+    # ---- Detector error model + PyMatching ----
+    print("Building detector error model and PyMatching matcher...")
+    t0 = time.perf_counter()
+    det_model = stim_circuit.detector_error_model(
+        decompose_errors=True,
+        approximate_disjoint_errors=True,
+    )
+    matcher = pymatching.Matching.from_detector_error_model(det_model)
+    print(f"  DEM + matcher built in {time.perf_counter() - t0:.3f}s")
+    print(f"  Detectors: {det_model.num_detectors}, Observables: {det_model.num_observables}")
+
+    # ---- Extract H, O, priors via beliefmatching ----
+    print("Extracting check matrices (beliefmatching)...")
+    matrices = detector_error_model_to_check_matrices(det_model)
+    H = matrices.edge_check_matrix
+    O = matrices.edge_observables_matrix
+    edge_probs = matrices.hyperedge_to_edge_matrix @ matrices.priors
+    eps = 1e-14
+    edge_probs[edge_probs > 1 - eps] = 1 - eps
+    edge_probs[edge_probs < eps] = eps
+    priors = edge_probs
+    print(f"  H shape: {H.shape}, O shape: {O.shape}, priors shape: {priors.shape}")
+
+    # ---- Sample syndrome data ----
+    print(f"Sampling {num_samples} shots...")
+    t0 = time.perf_counter()
+    meas = stim_circuit.compile_sampler().sample(shots=num_samples)
+    converter = stim_circuit.compile_m2d_converter()
+    dets_and_obs = converter.convert(measurements=meas, append_observables=True)
+
+    stim_dets = np.asarray(dets_and_obs[:, :-stim_circuit.num_observables], dtype=np.int32)
+    stim_obs = np.asarray(dets_and_obs[:, -stim_circuit.num_observables:], dtype=np.int32)
+    print(f"  Sampled in {time.perf_counter() - t0:.3f}s")
+    assert stim_dets.shape[1] == det_model.num_detectors, (
+        f"Detector width {stim_dets.shape[1]} != DEM {det_model.num_detectors}"
+    )
+
+    # ---- PyMatching baseline decode ----
+    print("Decoding with PyMatching (baseline)...")
+    t0 = time.perf_counter()
+    predictions = matcher.decode_batch(np.asarray(stim_dets, dtype=np.uint8))
+    decode_time = time.perf_counter() - t0
+    predictions = np.asarray(predictions, dtype=np.int32).reshape(-1, stim_circuit.num_observables)
+    num_errors = int((predictions != stim_obs).sum())
+    ler = num_errors / num_samples
+    print(f"  Errors: {num_errors}/{num_samples}, LER: {ler:.4f}")
+    print(f"  Decode time: {decode_time:.3f}s "
+          f"({decode_time / num_samples * 1e6:.1f} µs/shot)")
+
+    # ---- Optional ONNX pre-decoder inference ----
+    predecoder_outputs = None
+    if onnx_model and Path(onnx_model).is_file():
+        print(f"Running ONNX pre-decoder: {onnx_model}")
+        try:
+            import onnxruntime as ort
+            t0 = time.perf_counter()
+            sess = ort.InferenceSession(
+                onnx_model, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
+            )
+            dets_uint8 = np.asarray(stim_dets, dtype=np.uint8)
+            result = sess.run(None, {"dets": dets_uint8})
+            predecoder_outputs = np.asarray(result[0], dtype=np.uint8)
+            print(
+                f"  Pre-decoder ran in {time.perf_counter() - t0:.3f}s, "
+                f"output shape: {predecoder_outputs.shape}"
+            )
+        except Exception as e:
+            print(f"  ONNX inference failed: {e}")
+
+    # ---- Save everything ----
+    print(f"Writing outputs to {out}/")
+
+    save_dense_bin(str(out / "detectors.bin"), stim_dets)
+    save_dense_bin(str(out / "observables.bin"), stim_obs)
+    save_dense_bin(str(out / "pymatching_predictions.bin"), predictions)
+    save_csr_bin(str(out / "H_csr.bin"), H)
+    save_csr_bin(str(out / "O_csr.bin"), O)
+    save_priors_bin(str(out / "priors.bin"), priors)
+
+    if predecoder_outputs is not None:
+        save_dense_bin(str(out / "predecoder_outputs.bin"), predecoder_outputs)
+
+    noise_label = "25-param" if noise_model_params else "simple"
+    save_metadata(
+        str(out / "metadata.txt"),
+        distance=distance,
+        n_rounds=n_rounds,
+        basis=basis.upper(),
+        code_rotation=code_rotation,
+        p_error=p_error,
+        num_samples=num_samples,
+        num_detectors=det_model.num_detectors,
+        num_observables=det_model.num_observables,
+        H_shape=H.shape,
+        noise_model=noise_label,
+        **({
+            "onnx_model": onnx_model
+        } if onnx_model else {}),
+    )
+
+    print("Done.")
+    for f in sorted(out.iterdir()):
+        print(f"  {f.name:30s} {f.stat().st_size:>12,d} bytes")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Generate test data for the pre-decoder evaluation pipeline"
+    )
+    parser.add_argument("--distance", type=int, default=13)
+    parser.add_argument("--n-rounds", type=int, default=104)
+    parser.add_argument("--basis", type=str, default="X", choices=["X", "Z"])
+    parser.add_argument(
+        "--code-rotation", type=str, default="XV", help="XV, XH, ZV, ZH or public aliases O1-O4"
+    )
+    parser.add_argument("--p-error", type=float, default=0.003)
+    parser.add_argument("--num-samples", type=int, default=1000)
+    parser.add_argument(
+        "--onnx-model", type=str, default=None, help="Path to ONNX pre-decoder model (optional)"
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=str,
+        default=None,
+        help="Output directory (default: test_data/d{D}_T{T}_{basis})"
+    )
+    parser.add_argument(
+        "--simple-noise",
+        action="store_true",
+        help="Use simple p_error instead of 25-parameter noise model"
+    )
+    args = parser.parse_args()
+
+    if args.output_dir is None:
+        args.output_dir = f"test_data/d{args.distance}_T{args.n_rounds}_{args.basis}"
+
+    noise_params = None if args.simple_noise else DEFAULT_NOISE_PARAMS
+
+    generate_test_data(
+        distance=args.distance,
+        n_rounds=args.n_rounds,
+        basis=args.basis,
+        p_error=args.p_error,
+        code_rotation=args.code_rotation,
+        noise_model_params=noise_params,
+        num_samples=args.num_samples,
+        onnx_model=args.onnx_model,
+        output_dir=args.output_dir,
+    )
diff --git a/code/qec/dem_sampling.py b/code/qec/dem_sampling.py
index a28027f..2de32fe 100644
--- a/code/qec/dem_sampling.py
+++ b/code/qec/dem_sampling.py
@@ -59,6 +59,7 @@ def _custab_available() -> bool:
 _cached_HT: "torch.Tensor | None" = None
 _cached_max_shots: int = 0
 _cached_device_id: int | None = None
+_cached_seed: "int | None" = None
 
 _DEM_TIMINGS_S: deque[float] = deque(maxlen=200)
 _custab_path_logged: bool = False
@@ -75,19 +76,21 @@ def get_dem_sampling_avg_ms() -> float:
 
 def _reset_sampler_cache() -> None:
     """Reset the module-level sampler cache."""
-    global _cached_sampler, _cached_H, _cached_HT, _cached_max_shots, _cached_device_id
+    global _cached_sampler, _cached_H, _cached_HT, _cached_max_shots, _cached_device_id, _cached_seed
     _cached_sampler = None
     _cached_H = None
     _cached_HT = None
     _cached_max_shots = 0
     _cached_device_id = None
+    _cached_seed = None
 
 
 def dem_sampling(
     H: torch.Tensor,
     p: torch.Tensor,
     batch_size: int,
-    device_id: int | None = None
+    device_id: int | None = None,
+    seed: int | None = None,
 ) -> torch.Tensor:
     """
     Sample errors from a detector error model (DEM) via cuST BitMatrixSampler.
@@ -98,6 +101,10 @@ def dem_sampling(
         batch_size: int - Number of samples to generate
         device_id: Optional int - Device ID for cuST. If omitted, infer from
             H.device when H is on CUDA.
+        seed: Optional int - RNG seed passed directly to BitMatrixSampler.
+            When provided, a fresh sampler is created with that seed so repeated
+            calls with the same seed produce identical outputs.
+            When None (default), the cached sampler is reused across calls.
 
     Returns:
         frames_xz: (batch_size, 2*num_detectors) uint8 - Detector outcomes
@@ -106,7 +113,7 @@ def dem_sampling(
     from cuquantum.stabilizer.simulator import Options
 
     global _cached_sampler, _cached_H, _cached_HT, _cached_max_shots
-    global _cached_device_id, _custab_path_logged
+    global _cached_device_id, _cached_seed, _custab_path_logged
 
     if H.ndim != 2:
         raise ValueError(f"H must be 2-D, got ndim={H.ndim}")
@@ -129,9 +136,14 @@ def dem_sampling(
         _cached_H = H
         _cached_sampler = None
         _cached_device_id = None
+        _cached_seed = None
 
+    # When a seed is given always create a fresh sampler so its internal RNG is
+    # reset to that seed, giving identical outputs on repeated calls with the
+    # same seed value (per the BitMatrixSampler constructor contract).
     need_new = (
-        _cached_sampler is None or batch_size > _cached_max_shots or _cached_device_id != device_id
+        _cached_sampler is None or batch_size > _cached_max_shots or
+        _cached_device_id != device_id or seed is not None
     )
 
     if need_new:
@@ -146,15 +158,13 @@ def dem_sampling(
             H_in = _cached_HT.detach().cpu().numpy().astype(np.uint8)
             p_in = p.detach().cpu().numpy().astype(np.float64)
             pkg = "numpy"
-        _cached_sampler = BitMatrixSampler(
-            H_in,
-            p_in,
-            max_shots,
-            package=pkg,
-            options=Options(device_id=device_id),
-        )
+        bms_kwargs: dict = {"package": pkg, "options": Options(device_id=device_id)}
+        if seed is not None:
+            bms_kwargs["seed"] = seed
+        _cached_sampler = BitMatrixSampler(H_in, p_in, max_shots, **bms_kwargs)
         _cached_max_shots = max_shots
         _cached_device_id = device_id
+        _cached_seed = seed
 
     t0 = time.perf_counter()
     if gpu_native:
diff --git a/code/qec/surface_code/memory_circuit.py b/code/qec/surface_code/memory_circuit.py
index 65eb0de..86f7187 100644
--- a/code/qec/surface_code/memory_circuit.py
+++ b/code/qec/surface_code/memory_circuit.py
@@ -1006,7 +1006,15 @@ def __init__(self, distance, idle_error, sqgate_error, tqgate_error, spam_error,
         self.set_error_rates_simple(0, 0, 0, 0)
         self.set_error_rates()
 
+        # Suppress noise_model so add_measure does not inject a second p_meas error channel
+        # on data qubits.  _add_stabilizer_round(logical_measurement=True) already injected
+        # the time-reversed "fake SPAM" error and restored self.noise_model before returning;
+        # without this guard add_measure would see a non-None noise_model and inject the same
+        # p_meas noise a second time, producing phantom DEM error channels.
+        orig_noise_model = self.noise_model
+        self.noise_model = None
         self.add_measure(self.code.data_qubits, basis=self.basis)
+        self.noise_model = orig_noise_model
 
         # Restore original error rates
         self.set_error_rates_simple(*orig)
diff --git a/code/qec/surface_code/memory_circuit_torch.py b/code/qec/surface_code/memory_circuit_torch.py
index 705470d..1fe0a14 100644
--- a/code/qec/surface_code/memory_circuit_torch.py
+++ b/code/qec/surface_code/memory_circuit_torch.py
@@ -234,6 +234,7 @@ def generate_batch(
         batch_size: int,
         return_aux: bool = False,
         collect_timing: bool = False,
+        seed: int | None = None,
     ) -> Union[
         tuple[torch.Tensor, torch.Tensor],
         tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor],
@@ -248,6 +249,8 @@ def generate_batch(
           build Stim dets_and_obs from circuit-order measurements.
         - If collect_timing=True, also return timing breakdown in milliseconds:
           data generation, HE, format, and total.
+        - If seed is given, the BitMatrixSampler is re-created with that seed so
+          repeated calls with the same seed produce identical outputs.
         """
         if self._compile_thread is not None:
             # torch.compile warmup can be slow; 20 min cap prevents silent hangs.
@@ -269,6 +272,7 @@ def generate_batch(
             self.p,
             int(batch_size),
             device_id=device_id,
+            seed=seed,
         )  # (B, 2*num_detectors)
         meas_old = measure_from_stacked_frames(
             frames_xz, self.meas_qubits, self.meas_bases, nq=self.nq
diff --git a/code/tests/mid/test_homological_equivalence.py b/code/tests/mid/test_homological_equivalence.py
index 9001551..129591f 100644
--- a/code/tests/mid/test_homological_equivalence.py
+++ b/code/tests/mid/test_homological_equivalence.py
@@ -1250,10 +1250,8 @@ def test_he_reduces_error_weight(self):
         gen0 = self._make_generator(num_he_cycles=0)
         gen1 = self._make_generator(num_he_cycles=2)
 
-        torch.manual_seed(123)
-        trainX0, trainY0 = gen0.generate_batch(batch_size=self.batch_size)
-        torch.manual_seed(123)
-        trainX1, trainY1 = gen1.generate_batch(batch_size=self.batch_size)
+        trainX0, trainY0 = gen0.generate_batch(batch_size=self.batch_size, seed=123)
+        trainX1, trainY1 = gen1.generate_batch(batch_size=self.batch_size, seed=123)
 
         # trainX is derived from meas_old and should be identical for same sample.
         self.assertTrue(torch.equal(trainX0, trainX1))
diff --git a/code/tests/mid/test_w2_verify.py b/code/tests/mid/test_w2_verify.py
index 7d795ba..7f7b465 100644
--- a/code/tests/mid/test_w2_verify.py
+++ b/code/tests/mid/test_w2_verify.py
@@ -88,9 +88,7 @@ def test_full_pipeline_w2_reproducible(self):
             A=None,
         )
         B = 128
-        torch.manual_seed(100)
-        tX_a, tY_a = gen_w2.generate_batch(batch_size=B)
-        torch.manual_seed(100)
-        tX_b, tY_b = gen_w2.generate_batch(batch_size=B)
+        tX_a, tY_a = gen_w2.generate_batch(batch_size=B, seed=100)
+        tX_b, tY_b = gen_w2.generate_batch(batch_size=B, seed=100)
         self.assertTrue(torch.allclose(tX_a, tX_b), "trainX should match for same seed")
         self.assertTrue(torch.allclose(tY_a, tY_b), "trainY should match for same seed")
diff --git a/code/tests/test_boundary_detectors.py b/code/tests/test_boundary_detectors.py
index cddc7af..e84c871 100644
--- a/code/tests/test_boundary_detectors.py
+++ b/code/tests/test_boundary_detectors.py
@@ -274,7 +274,20 @@ class TestLERComparison(unittest.TestCase):
     """Test LER behavior with and without boundary detectors."""
 
     def test_ler_improves_with_bd_noise_model(self):
-        """Test that LER improves with boundary detectors when using NoiseModel."""
+        """Test that boundary detectors do not significantly degrade LER when using NoiseModel.
+
+        NOTE on assertion strength: the LER improvement from boundary detectors is a marginal
+        ~1-3% effect at these parameters.  Asserting strict improvement (ler_with_bd <
+        ler_no_bd) is unreliable with sample sizes of 10k-50k because the two circuits are
+        sampled independently and the difference is well within statistical noise.
+
+        Before the double-measurement-noise fix the no-BD LER was *artificially* inflated by
+        phantom DEM entries, which made the strict-less assertion pass coincidentally.  With the
+        corrected DEM the true improvement is small and we instead verify the weaker property:
+        boundary detectors must not increase LER by more than a factor of 1.5 — a signal that
+        IS reliably detectable at these sample sizes and would catch any real regression in the
+        boundary-detector implementation.
+        """
         noise_model = NoiseModel.from_single_p(0.002)
         num_samples = _ler_test_samples(50000, 20000)
 
@@ -327,17 +340,27 @@ def test_ler_improves_with_bd_noise_model(self):
         print(f"\nLER with NoiseModel (d=5, p=0.002, {num_samples} samples):")
         print(f"  Without BD: {ler_no_bd:.4e}")
         print(f"  With BD:    {ler_with_bd:.4e}")
-        ratio = (ler_no_bd / ler_with_bd) if ler_with_bd > 0 else float("inf")
-        print(f"  Improvement: {ratio:.2f}x")
-
-        # With NoiseModel, boundary detectors should improve LER
-        self.assertLess(
-            ler_with_bd, ler_no_bd,
-            f"Expected LER to improve with BD: {ler_with_bd:.4e} >= {ler_no_bd:.4e}"
+        ratio = (ler_with_bd / ler_no_bd) if ler_no_bd > 0 else float("inf")
+        print(f"  BD/no-BD ratio: {ratio:.2f}x")
+
+        # Boundary detectors must not substantially degrade LER.  The 1.5× tolerance is
+        # reliably detectable (~3σ) at these sample sizes and noise levels, so a genuine
+        # regression in BD logic would be caught here.
+        self.assertLessEqual(
+            ler_with_bd, ler_no_bd * 1.5,
+            f"BD degraded LER by more than 1.5x: no_bd={ler_no_bd:.4e}, with_bd={ler_with_bd:.4e}"
         )
 
     def test_ler_improves_with_bd_all_orientations(self):
-        """Test LER improves with boundary detectors for all four orientations (short run)."""
+        """Test boundary detectors do not significantly degrade LER for any code orientation.
+
+        The LER improvement from boundary detectors is a marginal ~1-3% effect; asserting a
+        strict per-sample inequality (ler_with_bd <= ler_no_bd) is unreliable with 10k samples
+        because the statistical noise in independent draws exceeds the true difference.  We
+        instead verify that BD does not increase LER by more than 1.5×, which is a reliably
+        detectable signal (~3σ) that would catch a real regression in the BD implementation
+        while not flagging normal sampling variance.
+        """
         noise_model = NoiseModel.from_single_p(0.005)
         num_samples = _ler_test_samples(10000, 10000)
         d = 5
@@ -388,8 +411,9 @@ def test_ler_improves_with_bd_all_orientations(self):
                 pred_with_bd = matcher_with_bd.decode_batch(samples_with_bd)
                 ler_with_bd = np.sum(pred_with_bd != obs_with_bd) / num_samples
                 self.assertLessEqual(
-                    ler_with_bd, ler_no_bd,
-                    f"rotation={rotation}: expected LER with BD <= without BD; got {ler_with_bd:.4e} > {ler_no_bd:.4e}"
+                    ler_with_bd, ler_no_bd * 1.5,
+                    f"rotation={rotation}: BD degraded LER by more than 1.5x: "
+                    f"no_bd={ler_no_bd:.4e}, with_bd={ler_with_bd:.4e}"
                 )
 
 
diff --git a/code/tests/test_dem_sampling.py b/code/tests/test_dem_sampling.py
index 9c66674..6255df4 100644
--- a/code/tests/test_dem_sampling.py
+++ b/code/tests/test_dem_sampling.py
@@ -711,5 +711,78 @@ def test_generate_batch_output_distribution(self) -> None:
         )
 
 
+# ============================================================================
+# Seeded reproducibility
+# ============================================================================
+
+
+class TestDEMSamplingReproducibility(unittest.TestCase):
+    """
+    Verify that dem_sampling(seed=N) produces identical outputs on repeated
+    calls with the same seed, using stochastic (non-trivial) probabilities.
+
+    This is the regression test for the bug where torch.manual_seed() was
+    assumed to control BitMatrixSampler's internal RNG (it does not).
+    The fix exposes an explicit seed= parameter that re-creates the sampler,
+    resetting cuST's RNG to a known state.
+    """
+
+    H = torch.tensor(
+        [[1, 0, 1, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 1, 1]],
+        dtype=torch.uint8,
+    )
+    p = torch.tensor([0.1, 0.3, 0.5, 0.7, 0.9], dtype=torch.float32)
+
+    def setUp(self):
+        _reset_sampler_cache()
+
+    def test_same_seed_cpu_reproducible(self) -> None:
+        """Two calls with the same seed on CPU must produce identical frames."""
+        out_a = dem_sampling(self.H, self.p, 64, seed=42)
+        _reset_sampler_cache()
+        out_b = dem_sampling(self.H, self.p, 64, seed=42)
+        self.assertTrue(
+            torch.equal(out_a, out_b),
+            "dem_sampling with the same seed must be bit-exact reproducible",
+        )
+
+    def test_different_seeds_cpu_differ(self) -> None:
+        """Two calls with different seeds should (with overwhelming probability) differ."""
+        out_a = dem_sampling(self.H, self.p, 256, seed=1)
+        _reset_sampler_cache()
+        out_b = dem_sampling(self.H, self.p, 256, seed=2)
+        self.assertFalse(
+            torch.equal(out_a, out_b),
+            "dem_sampling with different seeds should produce different frames",
+        )
+
+    def test_no_seed_does_not_break_cache(self) -> None:
+        """Unseeded calls still reuse the cached sampler (no regression to perf path)."""
+        import qec.dem_sampling as _mod
+        _reset_sampler_cache()
+        dem_sampling(self.H, self.p, 64)
+        sampler_after_first = _mod._cached_sampler
+        dem_sampling(self.H, self.p, 64)
+        sampler_after_second = _mod._cached_sampler
+        self.assertIs(
+            sampler_after_first,
+            sampler_after_second,
+            "Unseeded calls must reuse the cached BitMatrixSampler",
+        )
+
+    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA GPU")
+    def test_same_seed_gpu_reproducible(self) -> None:
+        """Two calls with the same seed on GPU must produce identical frames."""
+        H = self.H.cuda()
+        p = self.p.cuda()
+        out_a = dem_sampling(H, p, 64, seed=42)
+        _reset_sampler_cache()
+        out_b = dem_sampling(H, p, 64, seed=42)
+        self.assertTrue(
+            torch.equal(out_a, out_b),
+            "GPU: dem_sampling with the same seed must be bit-exact reproducible",
+        )
+
+
 if __name__ == "__main__":
     unittest.main()
diff --git a/code/tests/test_failure_analysis.py b/code/tests/test_failure_analysis.py
index 36ff37c..a29e6d5 100644
--- a/code/tests/test_failure_analysis.py
+++ b/code/tests/test_failure_analysis.py
@@ -635,6 +635,21 @@ def decode(self, syndrome):
         return _DummyCudaqResult(np.zeros(self._n_bits, dtype=np.float64))
 
 
+class _DummyCudaqDecoderBatch:
+    """Mock cudaq-qec decoder that exposes decode_batch() for the fast path"""
+
+    def __init__(self, n_bits):
+        self._n_bits = n_bits
+
+    def decode(self, syndrome):
+        return _DummyCudaqResult(np.zeros(self._n_bits, dtype=np.float64))
+
+    def decode_batch(self, syndromes):
+        """Accept list-of-lists of float64, return list of DecoderResults"""
+        B = len(syndromes)
+        return [_DummyCudaqResult(np.zeros(self._n_bits, dtype=np.float64)) for _ in range(B)]
+
+
 class TestDecodeCudaqBatch(unittest.TestCase):
     """_decode_cudaq_batch must return correct shape/dtype and collect stats"""
 
@@ -701,6 +716,96 @@ def test_multi_observable_uses_first_row(self):
         self.assertEqual(obs.shape, (B,))
         self.assertTrue(np.all((obs == 0) | (obs == 1)))
 
+    def test_decode_batch_fast_path_zero_syndrome(self):
+        B = 4
+        decoder = _DummyCudaqDecoderBatch(self.n_bits)
+        L_dense = np.zeros((1, self.n_bits), dtype=np.uint8)
+        syndromes = np.zeros((B, self.n_dets), dtype=np.uint8)
+        obs, _ = self._fn(decoder, L_dense, syndromes)
+        np.testing.assert_array_equal(obs, np.zeros(B, dtype=np.uint8))
+
+    def test_decode_batch_fast_path_output_shape_and_dtype(self):
+        for B in (1, 5):
+            decoder = _DummyCudaqDecoderBatch(self.n_bits)
+            L_dense = np.zeros((1, self.n_bits), dtype=np.uint8)
+            syndromes = np.zeros((B, self.n_dets), dtype=np.uint8)
+            obs, stats = self._fn(decoder, L_dense, syndromes)
+            self.assertEqual(obs.shape, (B,))
+            self.assertEqual(obs.dtype, np.uint8)
+            self.assertEqual(stats["converged_flags"].shape, (B,))
+            self.assertEqual(stats["iter_counts"].shape, (B,))
+
+    def test_decode_batch_fast_path_convergence_flags(self):
+        B = 3
+        decoder = _DummyCudaqDecoderBatch(self.n_bits)
+        L_dense = np.zeros((1, self.n_bits), dtype=np.uint8)
+        syndromes = np.zeros((B, self.n_dets), dtype=np.uint8)
+        _, stats = self._fn(decoder, L_dense, syndromes)
+        self.assertTrue(np.all(stats["converged_flags"]))
+        np.testing.assert_array_equal(stats["iter_counts"], np.full(B, 10, dtype=np.int32))
+
+    def test_decode_batch_and_loop_paths_agree(self):
+        B = 4
+        n_bits = self.n_bits
+        L_dense = np.zeros((1, n_bits), dtype=np.uint8)
+        syndromes = np.zeros((B, self.n_dets), dtype=np.uint8)
+
+        loop_decoder = _DummyCudaqDecoder(n_bits)
+        batch_decoder = _DummyCudaqDecoderBatch(n_bits)
+
+        obs_loop, stats_loop = self._fn(loop_decoder, L_dense, syndromes)
+        obs_batch, stats_batch = self._fn(batch_decoder, L_dense, syndromes)
+
+        np.testing.assert_array_equal(obs_loop, obs_batch)
+        np.testing.assert_array_equal(stats_loop["converged_flags"], stats_batch["converged_flags"])
+        np.testing.assert_array_equal(stats_loop["iter_counts"], stats_batch["iter_counts"])
+
+    def test_decode_batch_called_not_decode(self):
+        from unittest.mock import patch
+        B = 3
+        decoder = _DummyCudaqDecoderBatch(self.n_bits)
+        L_dense = np.zeros((1, self.n_bits), dtype=np.uint8)
+        syndromes = np.zeros((B, self.n_dets), dtype=np.uint8)
+        with patch.object(decoder, 'decode', wraps=decoder.decode) as mock_decode:
+            self._fn(decoder, L_dense, syndromes)
+            mock_decode.assert_not_called()
+
+    def test_decode_batch_exception_falls_back_to_loop(self):
+        """If decode_batch raises, per-sample decode is used and a warning is emitted."""
+        import warnings
+        from unittest.mock import patch
+        B = 3
+        decoder = _DummyCudaqDecoderBatch(self.n_bits)
+        L_dense = np.zeros((1, self.n_bits), dtype=np.uint8)
+        syndromes = np.zeros((B, self.n_dets), dtype=np.uint8)
+        with patch.object(decoder, 'decode_batch', side_effect=RuntimeError("gpu unavailable")):
+            with patch.object(decoder, 'decode', wraps=decoder.decode) as mock_decode:
+                with warnings.catch_warnings(record=True) as caught:
+                    warnings.simplefilter("always")
+                    obs, stats = self._fn(decoder, L_dense, syndromes)
+        self.assertEqual(mock_decode.call_count, B)
+        self.assertEqual(obs.shape, (B,))
+        self.assertEqual(len(caught), 1)
+        self.assertIn("gpu unavailable", str(caught[0].message))
+        self.assertIn("falling back", str(caught[0].message))
+
+    def test_no_decode_batch_attribute_uses_loop(self):
+        """Decoder without decode_batch falls back to per-sample loop via AttributeError."""
+        import warnings
+        from unittest.mock import patch
+        B = 3
+        decoder = _DummyCudaqDecoder(self.n_bits)  # no decode_batch
+        L_dense = np.zeros((1, self.n_bits), dtype=np.uint8)
+        syndromes = np.zeros((B, self.n_dets), dtype=np.uint8)
+        with patch.object(decoder, 'decode', wraps=decoder.decode) as mock_decode:
+            with warnings.catch_warnings(record=True) as caught:
+                warnings.simplefilter("always")
+                obs, stats = self._fn(decoder, L_dense, syndromes)
+        self.assertEqual(mock_decode.call_count, B)
+        self.assertEqual(obs.shape, (B,))
+        self.assertEqual(len(caught), 1)
+        self.assertIn("falling back", str(caught[0].message))
+
 
 class TestBuildCudaqDecoders(unittest.TestCase):
     """_build_cudaq_decoders must return correctly keyed entries when cudaq_qec is available"""
diff --git a/code/tests/test_inference_public_model.py b/code/tests/test_inference_public_model.py
index 3831f9d..bfd3aba 100644
--- a/code/tests/test_inference_public_model.py
+++ b/code/tests/test_inference_public_model.py
@@ -39,13 +39,11 @@
 MODELS_DIR = REPO_ROOT / "models"
 
 MODEL_R9 = {
-    "filename": "PreDecoderModelMemory_r9_v1.0.77.pt",
-    "checkpoint": 77,
+    "filename": "Ising-Decoder-SurfaceCode-1-Fast.pt",
     "model_id": 1,
 }
 MODEL_R13 = {
-    "filename": "PreDecoderModelMemory_r13_v1.0.86.pt",
-    "checkpoint": 86,
+    "filename": "Ising-Decoder-SurfaceCode-1-Accurate.pt",
     "model_id": 4,
 }
 
@@ -74,8 +72,7 @@ def _run_inference_rtest(distance: int, n_rounds: int, model_info: dict):
             f"Missing model file: {model_file}. It must be in the repo (Git LFS). Run 'git lfs pull' or restore the file."
         )
 
-    merged.model_checkpoint_dir = str(model_file.parent)
-    merged.test.use_model_checkpoint = model_info["checkpoint"]
+    merged.model_checkpoint_file = str(model_file)
     merged.test.latency_num_samples = 0
     merged.test.verbose_inference = False
     if "dataloader" in merged.test:
diff --git a/code/tests/test_noise_model.py b/code/tests/test_noise_model.py
index 584286b..05e4b49 100644
--- a/code/tests/test_noise_model.py
+++ b/code/tests/test_noise_model.py
@@ -335,6 +335,97 @@ def test_stim_circuit_audit_no_cnot_noise_in_logical_measurement_section(self):
             "Expected NO CNOT noise instructions in logical-measurement section"
         )
 
+    def test_no_double_measurement_noise_in_final_data_qubit_readout(self):
+        """
+        Regression test for double measurement-noise injection on data qubits at the end of
+        MemoryCircuit.__init__ when using the 25-parameter NoiseModel.
+
+        _add_stabilizer_round(logical_measurement=True) injects a single "fake SPAM" error on
+        data qubits (time-reversed p_meas) and then restores self.noise_model before returning.
+        Without the fix the subsequent add_measure(data_qubits) call at the __init__ call site
+        would see a non-None noise_model and inject the same p_meas channel a *second* time,
+        creating phantom DEM error entries that bias LER/threshold estimates.
+
+        The fix suppresses noise_model around that add_measure call.  This test verifies that
+        the post-REPEAT circuit section contains exactly ONE measurement-error injection on data
+        qubits (the legitimate fake-SPAM line), not two.
+        """
+        D = 3
+        T = 3  # n_rounds must be >= 3 for the circuit to use a REPEAT block
+        nm = NoiseModel(
+            p_prep_X=0.01,
+            p_prep_Z=0.02,
+            p_meas_X=0.03,  # non-zero: triggers double-injection if bug is present
+            p_meas_Z=0.04,
+            p_idle_cnot_X=0.002,
+            p_idle_cnot_Y=0.001,
+            p_idle_cnot_Z=0.003,
+            p_idle_spam_X=0.002,
+            p_idle_spam_Y=0.001,
+            p_idle_spam_Z=0.003,
+            **{f"p_cnot_{k}": 0.0005 for k in CNOT_ERROR_TYPES}
+        )
+
+        for basis in ("X", "Z"):
+            circ = MemoryCircuit(
+                distance=D,
+                idle_error=nm.get_max_probability(),
+                sqgate_error=nm.get_max_probability(),
+                tqgate_error=nm.get_max_probability(),
+                spam_error=nm.get_max_probability(),
+                n_rounds=T,
+                basis=basis,
+                noise_model=nm,
+                code_rotation="XV",
+            )
+            circ.set_error_rates()
+
+            # Isolate the circuit section that appears after the REPEAT block.
+            lines = circ.circuit.split("\n")
+            in_repeat = False
+            after_repeat = False
+            post_repeat_lines = []
+            for line in lines:
+                stripped = line.strip()
+                if stripped.startswith("REPEAT"):
+                    in_repeat = True
+                    continue
+                if in_repeat and stripped == "}":
+                    in_repeat = False
+                    after_repeat = True
+                    continue
+                if after_repeat:
+                    post_repeat_lines.append(stripped)
+
+            # Basis-labelled semantics for data-qubit readout failure:
+            #   X-basis measurement error -> Z_ERROR(p_meas_X)
+            #   Z-basis measurement error -> X_ERROR(p_meas_Z)
+            # The only legitimate occurrence in the post-REPEAT section is the single fake-SPAM
+            # injection inside _add_stabilizer_round(logical_measurement=True).  A second line
+            # with the same instruction is the regression.
+            if basis == "X":
+                error_instr = "Z_ERROR"
+                p_meas = float(nm.p_meas_X)
+            else:
+                error_instr = "X_ERROR"
+                p_meas = float(nm.p_meas_Z)
+
+            meas_error_lines = [l for l in post_repeat_lines if l.startswith(error_instr)]
+            self.assertEqual(
+                len(meas_error_lines), 1,
+                f"basis={basis}: expected exactly 1 {error_instr} line in post-REPEAT section "
+                f"(fake-SPAM only), got {len(meas_error_lines)}. "
+                f"Double injection would indicate the noise_model suppression fix is missing. "
+                f"Lines: {meas_error_lines}"
+            )
+            # Confirm the single line carries the correct probability.
+            expected_prefix = f"{error_instr}({p_meas:.10f})"
+            self.assertTrue(
+                meas_error_lines[0].startswith(expected_prefix),
+                f"basis={basis}: expected {error_instr} with p={p_meas:.10f}, "
+                f"got: {meas_error_lines[0]}"
+            )
+
 
 class TestNoiseModelUpscaling(unittest.TestCase):
     """Tests for surface-code training noise model upscaling (get_training_upscaled_noise_model)."""
diff --git a/code/tests/test_onnx_quant_workflow.py b/code/tests/test_onnx_quant_workflow.py
index 730cabd..f1b2491 100644
--- a/code/tests/test_onnx_quant_workflow.py
+++ b/code/tests/test_onnx_quant_workflow.py
@@ -169,6 +169,52 @@ def test_mq_quantize_called_with_correct_args_int8(self):
         self.assertEqual(calib.shape, (calib_num_samples, num_dets))
         self.assertEqual(calib.dtype, np.uint8)
 
+    def test_mq_quantize_called_with_correct_args_fp8(self):
+        """With QUANT_FORMAT=fp8, calibration data must preserve uint8 dtype — not be cast to float32.
+
+        Regression test for #52: the original code applied .astype('float32') before passing
+        calib_dets to mq.quantize, but the ONNX model's 'dets' input is typed uint8, causing:
+          [ONNXRuntimeError] INVALID_ARGUMENT: Unexpected input data type.
+          Actual: (tensor(float)), expected: (tensor(uint8))
+        The fix passes calib_dets directly, preserving the uint8 dtype.
+        """
+        mock_mq = MagicMock()
+        num_dets = 20
+        num_obs = 1
+        loader = _make_fake_dataloader(
+            num_batches=2, batch_size=32, num_dets=num_dets, num_obs=num_obs
+        )
+
+        with patch.dict(os.environ, {"QUANT_FORMAT": "fp8", "QUANT_CALIB_SAMPLES": "16"}):
+            quant_format = "fp8"
+            fp32_path = "model.onnx"
+            quant_path = "model_fp8.onnx"
+            calib_num_samples = int(os.environ.get("QUANT_CALIB_SAMPLES", "256"))
+            calib_dets = _collect_calibration_dets(loader, num_obs, calib_num_samples, num_dets)
+            quant_kwargs = {"op_types_to_quantize": ["Conv"], "high_precision_dtype": "fp16"}
+            mock_mq.quantize(
+                onnx_path=fp32_path,
+                quantize_mode=quant_format,
+                calibration_data={"dets": calib_dets},
+                output_path=quant_path,
+                **quant_kwargs,
+            )
+
+        mock_mq.quantize.assert_called_once()
+        call_kwargs = mock_mq.quantize.call_args
+        self.assertEqual(call_kwargs.kwargs["quantize_mode"], "fp8")
+        self.assertIn("dets", call_kwargs.kwargs["calibration_data"])
+        calib = call_kwargs.kwargs["calibration_data"]["dets"]
+        self.assertEqual(calib.shape, (calib_num_samples, num_dets))
+        self.assertEqual(
+            calib.dtype,
+            np.uint8,
+            "FP8 calibration data must preserve uint8 dtype; "
+            "casting to float32 triggers [ONNXRuntimeError] INVALID_ARGUMENT (#52)",
+        )
+        self.assertEqual(call_kwargs.kwargs.get("op_types_to_quantize"), ["Conv"])
+        self.assertEqual(call_kwargs.kwargs.get("high_precision_dtype"), "fp16")
+
     def test_fp8_fail_fast_raises(self):
         """With QUANT_FORMAT=fp8, if mq.quantize raises, a RuntimeError is propagated."""
         num_dets = 20
diff --git a/code/workflows/run.py b/code/workflows/run.py
index aaf59f1..4f68c28 100644
--- a/code/workflows/run.py
+++ b/code/workflows/run.py
@@ -108,32 +108,44 @@ def find_best_model(path, *, rank: int = 0):
     max_value = -1  # Start with -1 to include epoch 0
     best_file = None
     model_files = []
+    # Named .pt files without epoch numbers (e.g. Ising-Decoder-SurfaceCode-1-Fast.pt)
+    named_pt_files = []
 
     for filename in os.listdir(path):
-        if not filename.startswith("PreDecoderModelMemory_"):
-            continue
-        try:
-            value = float(filename.split(".")[2])  # Gets epoch number
-            model_files.append((filename, value))
-            if value > max_value:
-                max_value = value
-                best_file = filename
-        except (IndexError, ValueError) as e:
-            print(f"Warning: could not parse epoch from filename {filename}: {e}")
+        if not filename.endswith(".pt"):
             continue
+        if filename.startswith("PreDecoderModelMemory_"):
+            try:
+                value = float(filename.split(".")[2])  # Gets epoch number
+                model_files.append((filename, value))
+                if value > max_value:
+                    max_value = value
+                    best_file = filename
+            except (IndexError, ValueError) as e:
+                print(f"Warning: could not parse epoch from filename {filename}: {e}")
+        else:
+            named_pt_files.append(filename)
+
+    # Fall back to named .pt files when no epoch-numbered checkpoints are present
+    if best_file is None and named_pt_files:
+        named_pt_files.sort()
+        best_file = named_pt_files[-1]
+        model_files = [(f, None) for f in named_pt_files]
 
     if rank == 0:
-        print(f"Found {len(model_files)} model files:")
-        for filename, epoch in sorted(model_files, key=lambda x: x[1]):
+        print(f"Found {len(model_files)} model file(s):")
+        for filename, epoch in sorted(model_files, key=lambda x: (x[1] is None, x[1] or 0)):
             marker = "*" if filename == best_file else " "
-            print(f"  [{marker}] {filename} (epoch {epoch})")
+            epoch_str = str(epoch) if epoch is not None else "n/a"
+            print(f"  [{marker}] {filename} (epoch {epoch_str})")
 
     if best_file is None:
-        raise FileNotFoundError(f"No valid PreDecoderModelMemory files found in {path}")
+        raise FileNotFoundError(f"No valid model checkpoint files found in {path}")
 
     best_model_path = os.path.join(path, best_file)
     if rank == 0:
-        print(f"Selected best model: {best_file} (epoch {max_value})")
+        epoch_str = str(max_value) if max_value >= 0 else "n/a"
+        print(f"Selected best model: {best_file} (epoch {epoch_str})")
 
     return best_model_path
 
@@ -211,6 +223,24 @@ def _load_model(cfg, dist):
             cfg.enable_fp16 = True
         return model
 
+    # Direct file path override (for named pretrained models without epoch numbers)
+    model_checkpoint_file = getattr(cfg, 'model_checkpoint_file', None)
+    if model_checkpoint_file:
+        model_checkpoint_file = _resolve_dir(str(model_checkpoint_file))
+        if not os.path.exists(model_checkpoint_file):
+            raise FileNotFoundError(f"Checkpoint not found: {model_checkpoint_file}")
+        if dist.rank == 0:
+            print(f"Loading model from: {model_checkpoint_file}")
+        model = ModelFactory.create_model(cfg).to(dist.device)
+        if cfg.enable_fp16:
+            model = model.half()
+        state_dict = _load_state_dict_from_pt(model_checkpoint_file, dist.device)
+        model.load_state_dict(state_dict)
+        if dist.rank == 0:
+            param_count = sum(p.numel() for p in model.parameters())
+            print(f"Model loaded ({param_count:,} parameters)")
+        return model
+
     model = ModelFactory.create_model(cfg).to(dist.device)
 
     if cfg.enable_fp16:
diff --git a/conf/config_qec_decoder_r13_fp8.yaml b/conf/config_qec_decoder_r13_fp8.yaml
index a60266f..e3edbf9 100644
--- a/conf/config_qec_decoder_r13_fp8.yaml
+++ b/conf/config_qec_decoder_r13_fp8.yaml
@@ -1,7 +1,7 @@
 # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: LicenseRef-NvidiaProprietary
 #
-# Config for: Model 4 (R=13), depolarizing p=0.006, experiment qec-decoder-depolarizing-r13-fp8.
+# Config for: Ising-Decoder-SurfaceCode-1-Accurate (R=13), aka Model 4 from the corresponding paper, depolarizing p=0.006, experiment qec-decoder-depolarizing-r13-fp8.
 # Training uses receptive field 13; evaluation targets distance/n_rounds 13.
 # fp8 in the name is for the intended export target; training runs in fp32.
 #
diff --git a/conf/config_qec_decoder_r9_fp8.yaml b/conf/config_qec_decoder_r9_fp8.yaml
index 89b896e..40c7b27 100644
--- a/conf/config_qec_decoder_r9_fp8.yaml
+++ b/conf/config_qec_decoder_r9_fp8.yaml
@@ -1,7 +1,7 @@
 # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: LicenseRef-NvidiaProprietary
 #
-# Config for: Model 1 (R=9), depolarizing p=0.006, experiment qec-decoder-depolarizing-r9-fp8.
+# Config for: Ising-Decoder-SurfaceCode-1-Fast (R=9), aka Model 1 from the corresponding paper, depolarizing p=0.006, experiment qec-decoder-depolarizing-r9-fp8.
 # Training uses receptive field 9; evaluation targets distance/n_rounds 9.
 # fp8 in the name is for the intended export target; training runs in fp32.
 #
diff --git a/models/PreDecoderModelMemory_r13_v1.0.86.pt b/models/Ising-Decoder-SurfaceCode-1-Accurate.pt
similarity index 100%
rename from models/PreDecoderModelMemory_r13_v1.0.86.pt
rename to models/Ising-Decoder-SurfaceCode-1-Accurate.pt
diff --git a/models/PreDecoderModelMemory_r9_v1.0.77.pt b/models/Ising-Decoder-SurfaceCode-1-Fast.pt
similarity index 100%
rename from models/PreDecoderModelMemory_r9_v1.0.77.pt
rename to models/Ising-Decoder-SurfaceCode-1-Fast.pt