
Feature: Swept reproducibility matrix + ed25519 verify-before-load #92

Open
ryoari wants to merge 4 commits into main from expansion

Conversation

ryoari commented Jun 20, 2026

Contributor

Summary

Extends OpenVerifiableLLM from a single-batch toy into a swept experiment that
produces a models × conditions results matrix of training-reproducibility
verdicts, and hardens checkpoint loading with ed25519 verify-before-deserialize.

Two verdicts are kept deliberately separate (the crux of the experiment):

  • Run-to-run reproducibility on the same hardware (reproducible, first_divergence_step)
  • Agreement with the fp32 reference (vs_fp32_bitwise, vs_fp32_losstol)

The matrix surfaces the planted result: every cell is run-to-run reproducible,
yet bf16 silently disagrees with the fp32 reference — reproducible ≠ correct.
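
To make the distinction concrete, here is a minimal sketch of how the two verdicts could be derived from per-step loss traces. Only the field names (reproducible, first_divergence_step, vs_fp32_bitwise, vs_fp32_losstol) come from this PR; the helper names, the choice to compare final losses, and the `tol` default are assumptions for illustration and may not match src/experiment.py.

```python
import struct
from typing import List, Optional


def first_divergence_step(a: List[float], b: List[float]) -> Optional[int]:
    """Return the first step where two loss traces differ bitwise, else None."""
    if len(a) != len(b):
        raise ValueError("loss traces must have the same length")
    for step, (x, y) in enumerate(zip(a, b)):
        # Compare IEEE-754 bit patterns rather than values, so the check is strict.
        if struct.pack("<d", x) != struct.pack("<d", y):
            return step
    return None


def verdicts(run_a: List[float], run_b: List[float],
             fp32_ref: List[float], tol: float = 1e-3) -> dict:
    """Keep run-to-run reproducibility separate from agreement with fp32."""
    step = first_divergence_step(run_a, run_b)
    return {
        # Verdict 1: do twin runs on the same hardware match each other?
        "reproducible": step is None,
        "first_divergence_step": step,
        # Verdict 2: does this cell match the fp32 reference?
        "vs_fp32_bitwise": first_divergence_step(run_a, fp32_ref) is None,
        "vs_fp32_losstol": abs(run_a[-1] - fp32_ref[-1]) <= tol,  # hypothetical criterion
    }
```

A cell whose twin runs are identical but whose losses differ from the fp32 trace reports reproducible as true and vs_fp32_bitwise as false, which is the "reproducible ≠ correct" case described above.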

What's included

  • src/experiment.py — shared engine: twin runs → reproducible / first_divergence_step, safetensors + Merkle, one JSON record per cell.
  • run_experiment.py / sweep.py / demo.py — single cell / matrix grid / one-command narrative arc.
  • src/signing.py — ed25519 sign_file / verify_file / verified_torch_load (verify signature before deserialization; replaces load-then-SHA256, which executed code before the integrity check).
  • src/config.py, src/model.py, src/dataset.py, src/device.py — scaled config + presets, configurable TinyGPT / MLP / LSTM / CNN, char-level corpora + CIFAR with replay-exact get_batch(), precision (fp32/tf32/bf16) + determinism toggles.
  • src/reproducibility.py — segmented-replay audit (5 scenarios) rewired to the scaled model; broken seal now rejected pre-deserialization.
  • src/plot_divergence.py (T7), src/ddp_repro.py (T9 stretch), tests/, RUNBOOK.md.
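
The verify-before-deserialize ordering in src/signing.py can be illustrated with a standard-library-only sketch. Two substitutions are made so the example runs without extra dependencies, and both are assumptions: HMAC-SHA256 stands in for the ed25519 signature, and `json.loads` stands in for the checkpoint deserializer. The helper names `sign_bytes` and `verified_load` are hypothetical; only `SignatureError` and the detached `.sig` convention are taken from this PR.

```python
import hashlib
import hmac
import json
import os
import tempfile


class SignatureError(Exception):
    """Raised when a detached signature does not match the payload."""


def sign_bytes(key: bytes, payload: bytes) -> bytes:
    # Stand-in for an ed25519 signature: HMAC-SHA256 over the raw bytes.
    return hmac.new(key, payload, hashlib.sha256).digest()


def verified_load(path: str, key: bytes):
    """Check the detached .sig first; only then parse the payload."""
    with open(path, "rb") as f:
        payload = f.read()  # read once, so the verified bytes are the parsed bytes
    with open(path + ".sig", "rb") as f:
        signature = f.read()
    if not hmac.compare_digest(sign_bytes(key, payload), signature):
        raise SignatureError(f"signature mismatch for {path}")
    return json.loads(payload)  # stand-in for the real deserializer


def demo() -> tuple:
    """Sign a file, load it, tamper with it, and confirm the reload is rejected."""
    key = b"demo-key-not-for-real-use"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ckpt.json")
        payload = json.dumps({"step": 10}).encode()
        with open(path, "wb") as f:
            f.write(payload)
        with open(path + ".sig", "wb") as f:
            f.write(sign_bytes(key, payload))
        loaded = verified_load(path, key)
        with open(path, "ab") as f:
            f.write(b" ")  # tamper after signing
        try:
            verified_load(path, key)
            rejected = False
        except SignatureError:
            rejected = True
    return loaded, rejected
```

The point of the ordering is that the parser never sees bytes that failed the check, which is what the load-then-SHA256 flow could not guarantee.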

Verification (local, CPU)

  • python -m unittest tests.test_artifacts tests.test_experiment → 14 pass, 3 GPU tests skip.
  • python sweep.py --quick → 12 cells, all run-to-run PASS, 3 bf16 cells DIFF from fp32 reference.
  • reproducibility.py smoke → CLEAN AUDIT: PASS · BROKEN-SEAL REJECTED BEFORE LOAD: YES.
  • T7 divergence plot pipeline runs end-to-end.

GPU-only cells (TF32 divergence on Ampere, determinism-OFF, cross-GPU) are run on a RunPod pod per RUNBOOK.md.

Addressed Issues:

Fixes #62
Related to #37

Screenshots/Recordings:

image

Additional Notes:

  • TF32 is a no-op on CPU, so it reads SAME in the local matrix; it only diverges
    on Ampere GPUs. Determinism-OFF and cross-GPU divergence are empirical and
    must be confirmed on the pod before being relied on — see CONTEXT.md / RUNBOOK.md.
  • Private signing key (keys/ovl_ed25519.key) is git-ignored; only the public key
    is committed. results/ and artifacts/ are git-ignored.
  • Open question for reviewers (the intended debate): should the verification bar be
    bitwise identity (strong, brittle, hardware-bound) or loss-tolerance
    (portable, but admits silent precision drift)?

AI Usage Disclosure:

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: Claude Code for test case expansion

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

  • New Features
    • Added an end-to-end demo runner, single experiment CLI, and parameter sweep with verdict tables, debate hooks, and optional cross-GPU comparison.
    • Expanded model/dataset coverage with real text corpora (offline fallback) and optional vision dataset support, plus a divergence plot utility and a DDP reproducibility script.
  • Security
    • Added detached ed25519 signature verification before checkpoint loading, along with Merkle-sealed artifact integrity.
  • Documentation
    • Updated README with reproducibility expectations and security upgrade guidance; added a demo-day RUNBOOK.
  • Tests
    • Added security/crypto, Merkle, determinism, CUDA, and DDP test coverage.
  • Chores
    • Updated git LFS tracking, ignore rules, requirements (version ranges + safetensors), and added an offline Shakespeare sample.

coderabbitai Bot commented Jun 20, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ce46b51f-0a60-4f34-8393-a58bcbddb13a

📥 Commits

Reviewing files that changed from the base of the PR and between eb72638 and 2239c33.

📒 Files selected for processing (2)
  • src/device.py
  • src/reproducibility.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/device.py
  • src/reproducibility.py

📝 Walkthrough

This PR replaces the toy TinyDataset/TinyGPT setup with a production-grade reproducibility verification system: ed25519 checkpoint signing, a multi-architecture model zoo, real corpus dataset loaders, parameterized determinism/precision control, a twin-run experiment engine, a signed segmented-replay audit, and CLI tools for sweeping and demoing reproducibility across precision/determinism conditions.

Changes

OpenVerifiableLLM Reproducibility & Security Overhaul

| Layer | File(s) | Summary |
| --- | --- | --- |
| ed25519 signing infrastructure | src/signing.py, requirements.txt, .gitattributes, .gitignore | Introduces SignatureError, keypair management under keys/, sign_file/verify_file operating on raw bytes, verified_torch_load that verifies detached .sig before deserialization, and signed_torch_save. Adds pynacl/safetensors dependencies, Git LFS rule for *.safetensors, and ignores private keys/signatures/generated artifacts. |
| Configuration system and model zoo | src/config.py, src/model.py | Replaces TRAIN_CONFIG with a canonical config plus MODEL_PRESETS per architecture; adds effective_config, model_config, and _coerce for env-var overrides. Expands TinyGPT to a configurable multi-layer decoder and adds MLPLanguageModel, LSTMLanguageModel, TinyCNN, build_model factory, count_params, and is_vision_model. |
| Dataset loaders and device/precision control | src/dataset.py, src/device.py, src/main.py, data/shakespeare_sample.txt | Replaces TinyDataset with CharDataset (Shakespeare/enwik8/wikitext with offline fallback), CIFARDataset, and get_dataset factory. Adds parameterized configure_determinism, apply_precision, autocast_context, and precision_flags to device.py. Updates set_seed in main.py to seed all generators and call the new device APIs. |
| Twin-run experiment engine and Merkle fix | src/experiment.py, src/artifacts.py, src/telemetry.py | Introduces run_one as the primary API: seeds RNGs, runs two identical training passes, computes reproducible/first_divergence_step, saves safetensors + Merkle manifest, and returns a comprehensive JSON-serializable record. Fixes build_merkle_manifest to derive size_bytes/sha256 from stat/compute_sha256 rather than an in-loop accumulator. |
| Signed segmented-replay audit | src/reproducibility.py | Rewrites the audit to verify ed25519 signatures before any deserialization and restore full RNG state from checkpoints. Updates all four audit scenarios (bad_seed, secret_noise, sabotage, broken_seal) to use the new signed-checkpoint format. verify now returns only a telemetry-match boolean; hash mismatch is reported separately. |
| Updated downstream integrations | src/eval.py, src/gpu_reproducibility_test.py, src/global_manifest.py, src/ddp_repro.py | Wires eval.py to safetensors-first loading with verified_torch_load fallback; updates gpu_reproducibility_test.py and global_manifest.py to use model_config/get_dataset. Adds src/ddp_repro.py for NCCL DDP bitwise reproducibility testing under torchrun. |
| CLI sweep, demo, run_experiment, and divergence plot | sweep.py, run_experiment.py, demo.py, src/plot_divergence.py | Adds sweep.py for a full model×condition grid with reference annotation and debate-hook markers; run_experiment.py as a single-cell CLI wrapper; demo.py as an end-to-end demo covering verdict table, debate hook, security demo, and Merkle summary; and src/plot_divergence.py to plot cumulative divergence from JSONL. |
| Tests, docs, and repo metadata | tests/test_experiment.py, README.md, RUNBOOK.md, notebooks/colab_gpu_reproducibility.ipynb, experiments/verifiable_llm_experiment.ipynb | Adds five test suites (signing-before-load, Merkle non-degeneracy, determinism, CUDA failures, DDP scaffold). Adds README reproducibility matrix, security upgrade description, and quickstart. Adds full RUNBOOK.md for GPU pod demo-day procedures. Updates Colab notebook to use version ranges and restructures experiment notebook imports. |

Sequence Diagram(s)

sequenceDiagram
  participant CLI as demo.py / sweep.py
  participant run_one as experiment.run_one
  participant prepare_run
  participant _single_train
  participant signing as signing.sign_file
  participant _merkle_from_model
  participant reproducibility as reproducibility.run_training_segment

  CLI->>run_one: model, dataset, precision, deterministic
  run_one->>prepare_run: seed RNGs, apply_precision, configure_determinism
  prepare_run-->>run_one: configured
  run_one->>_single_train: run A
  _single_train-->>run_one: param_sha256_A, losses_A
  run_one->>_single_train: run B (twin)
  _single_train-->>run_one: param_sha256_B, losses_B
  run_one->>_merkle_from_model: save safetensors + build Merkle
  _merkle_from_model->>signing: sign artifact
  signing-->>_merkle_from_model: .sig written
  _merkle_from_model-->>run_one: merkle_root, chunk_count
  run_one-->>CLI: record {reproducible, first_divergence_step, ...}

  Note over reproducibility: Segmented-replay audit
  reproducibility->>signing: verify_file before torch.load
  alt signature valid
    signing-->>reproducibility: True
    reproducibility->>reproducibility: restore RNG state, replay training
  else tampered
    signing-->>reproducibility: raises SignatureError
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • AOSSIE-Org/OpenVerifiableLLM#89: Shares the same src/config.py/device.py determinism setup, src/eval.py checkpoint loading, and audit/replay logic that this PR substantially refactors.
  • AOSSIE-Org/OpenVerifiableLLM#91: This PR modifies src/artifacts.py's build_merkle_manifest output fields (size_bytes, sha256) introduced in that PR, and extends the downstream safetensors/Merkle checkpoint flows.

Suggested labels

documentation, enhancement, backend, python, configuration, pending-coderabbit-review, size/XL

Suggested reviewers

  • Archit381

Poem

🐇 Hop hop, the weights are signed today,
No pickle traps shall stand in the way!
Twin runs agree—or FAIL is shown,
The Merkle root seals what was grown.
bf16 diverges, tf32 too,
But the rabbit audits every step anew! 🔏

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the two main features: a swept reproducibility matrix for testing multiple conditions and ed25519 signature-based checkpoint verification. |
| Linked Issues check | ✅ Passed | The PR implements the core requirements from #62: validating checkpoint reproducibility across identical training runs with deterministic execution, cryptographic hashing, and comprehensive comparison frameworks. |
| Out of Scope Changes check | ✅ Passed | All code changes directly support the reproducibility validation and checkpoint verification objectives. Additions to notebooks, configuration, dataset handling, and device management are all necessary infrastructure for the main features. |


coderabbitai Bot left a comment

Contributor

Actionable comments posted: 12

🧹 Nitpick comments (1)
src/artifacts.py (1)

85-101: ⚡ Quick win

Compute file hash in the same pass as chunking.

Current two-pass approach can produce mismatched chunks vs top-level sha256 if the file is mutated between reads. A single streaming pass avoids TOCTOU inconsistency and extra I/O.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/artifacts.py` around lines 85 - 101, The file is being read twice - once
in the while loop to compute individual chunk hashes and offsets, and again
separately with compute_sha256(file_path=path) to compute the overall file hash.
This two-pass approach creates a TOCTOU vulnerability where the file could be
modified between reads, resulting in mismatched hashes. Instead, compute the
overall file hash during the same streaming pass as the chunking operation by
accumulating a hash object as you iterate through chunks in the while loop, then
use that final accumulated hash value instead of making the separate
compute_sha256(file_path=path) call.
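
The single-pass idea can be sketched as follows. This is an illustration under assumptions rather than the code in src/artifacts.py: the function name `chunk_and_hash`, the chunk record layout, and the default chunk size are hypothetical, while the `size_bytes` and `sha256` field names follow the manifest fields mentioned in this PR.

```python
import hashlib
import io
from typing import BinaryIO, Dict, List


def chunk_and_hash(stream: BinaryIO, chunk_size: int = 1 << 20) -> Dict[str, object]:
    """Hash each chunk and the whole stream in a single read pass."""
    whole = hashlib.sha256()
    chunks: List[Dict[str, object]] = []
    offset = 0
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        # The same bytes feed both the per-chunk hash and the running file hash,
        # so the two can never describe different versions of the file.
        whole.update(block)
        chunks.append({
            "offset": offset,
            "size": len(block),
            "sha256": hashlib.sha256(block).hexdigest(),
        })
        offset += len(block)
    return {"size_bytes": offset, "sha256": whole.hexdigest(), "chunks": chunks}


if __name__ == "__main__":
    manifest = chunk_and_hash(io.BytesIO(b"abcdefghij"), chunk_size=4)
    print(manifest["size_bytes"], len(manifest["chunks"]))
```

Because the total size is also accumulated in the loop, a separate stat call is not needed for consistency with the hashes.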
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@demo.py`:
- Line 134: The print statement on line 134 uses an f-string prefix (f"...") but
contains no format placeholders or variable interpolations. Remove the f prefix
from the string literal in the print function call to convert it to a regular
string literal, since there are no curly braces or variables that need to be
interpolated. This will resolve the Ruff F541 lint error.

In `@run_experiment.py`:
- Around line 38-40: Add validation logic after the argument parser processes
the --model, --dataset, and --precision arguments to check for valid
model/dataset combinations before training begins. After the p.parse_args()
call, implement an early guard that verifies the selected model and dataset pair
are compatible, and exit with a clear error message if they are not (for
example, reject combinations like cnn with shakespeare). This prevents invalid
pairs from reaching deeper code paths where errors are less actionable.

In `@src/config.py`:
- Around line 81-84: The get_config_hash() function currently only includes
TRAIN_CONFIG in the hash calculation, which means changes to MODEL_PRESETS are
not reflected in the config fingerprint. To fix this, modify the function to
include MODEL_PRESETS along with TRAIN_CONFIG when creating the JSON dump before
hashing. Create a dictionary that contains both TRAIN_CONFIG and MODEL_PRESETS,
then encode and hash this combined data instead of just TRAIN_CONFIG alone to
ensure all configuration changes are captured in the hash.
- Around line 68-77: The model_config function does not normalize the model_name
parameter to lowercase before looking it up in MODEL_PRESETS, while
build_model() performs lowercasing. This inconsistency causes mixed-case inputs
to potentially miss their architecture presets. Normalize model_name to
lowercase before the MODEL_PRESETS.get(model_name, {}) lookup call to ensure
consistent behavior between model_config and build_model functions.

In `@src/dataset.py`:
- Around line 33-36: The dataset ingest pipeline needs security hardening
against tampered archives and malicious pickle files. Add pinned hash values to
the _SOURCES dictionary alongside each URL, then in the extraction and
deserialization code (around lines 165-172), implement integrity verification by
computing the hash of downloaded files and comparing against the pinned values,
validate all tar archive members to prevent path traversal attacks by checking
member paths don't escape the target directory, and fail closed by raising an
exception immediately if any verification fails before proceeding with pickle
deserialization.
- Around line 125-131: The get_batch method in CharDataset has off-by-one errors
that prevent valid edge cases. The calculation of max_start on line 126 should
be self.data.size(0) - block_size (remove the extra -1 subtraction) to allow
sequences when data length equals block_size plus one. Additionally, the
torch.randint call on line 130 needs to use max_start + 1 as the upper bound
(instead of max_start) to include the last valid starting index, since
torch.randint's upper bound is exclusive.

In `@src/device.py`:
- Around line 181-185: The autocast code in the conditional block checking for
mode "bf16" and "fp16" will fail on CPU-only systems when mode="fp16" because
PyTorch 2.10 doesn't support float16 autocast on CPU. After determining
device_type using get_device().type, add an explicit check: if the device_type
is CPU and mode is "fp16", either raise a clear ValueError explaining the
incompatibility or return nullcontext() to disable autocast gracefully. This
guard should be placed before the torch.autocast call to prevent the failure.

In `@src/experiment.py`:
- Around line 142-143: The code crashes with an IndexError when accessing
lossesA[-1] if total_steps is set to 0 through the overrides parameter. After
calling model_config in the run_one function to create the cfg variable, add
validation to check that cfg.total_steps is at least 1, and raise an appropriate
ValueError or similar exception if total_steps is less than 1 to prevent the
crash downstream.

In `@src/plot_divergence.py`:
- Around line 21-24: The divergence_signal function silently truncates
mismatched input lists to the shorter length using min(len(a), len(b)), which
can hide tail divergence and report incorrect results on malformed records.
Replace the truncation logic with an explicit check that raises an exception
when the lengths of parameters a and b differ, implementing a fail-fast approach
to catch data inconsistencies early.

In `@src/signing.py`:
- Around line 65-69: The load_verify_key() function currently falls back to
generate_keypair() when the PUBLIC_KEY_PATH file is missing, which can cause
infinite recursion if the private key exists but the public key doesn't, and
silently replaces verification trust material. Replace the fallback behavior by
raising an exception (such as FileNotFoundError) when PUBLIC_KEY_PATH does not
exist, instead of calling generate_keypair(). This ensures the function fails
closed and does not silently modify security-critical verification material.
- Around line 117-119: The `verified_torch_load()` function has a TOCTOU
vulnerability where `verify_file(path)` checks the file at a specific moment,
but then `torch.load(path, ...)` re-reads the file from disk, creating a window
where an attacker could swap the file contents. To fix this, read the file
contents into memory once, verify the bytes against the signature, and then
deserialize those same in-memory bytes using torch.load with a file-like object
or BytesIO, ensuring verification and deserialization operate on identical data
without any re-reads from disk.

In `@tests/test_experiment.py`:
- Around line 32-36: In the torch import block where HAS_TORCH is set, replace
the broad `except Exception:` clause with `except ImportError:` to catch only
the expected module import failure. This prevents masking unrelated exceptions
that occur during torch's import process, such as syntax errors or missing
dependencies, which should propagate and be caught as real failures rather than
silently setting HAS_TORCH to False and skipping tests.
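
As one illustration of the findings above, the src/config.py fingerprint issue could be addressed along these lines. The dictionary contents below are placeholders, and the optional parameters on `get_config_hash` are an assumption added so the sketch can be exercised; the actual function in the repository may have a different signature.

```python
import hashlib
import json

# Hypothetical stand-ins for the dictionaries defined in src/config.py.
TRAIN_CONFIG = {"total_steps": 100, "lr": 3e-4}
MODEL_PRESETS = {"tinygpt": {"n_layer": 2}, "mlp": {"hidden": 64}}


def get_config_hash(train_config=None, model_presets=None) -> str:
    """SHA-256 over a canonical JSON dump of both config dictionaries."""
    combined = {
        "train_config": TRAIN_CONFIG if train_config is None else train_config,
        "model_presets": MODEL_PRESETS if model_presets is None else model_presets,
    }
    # sort_keys yields the same byte string regardless of dict insertion order.
    blob = json.dumps(combined, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

With both dictionaries in the dump, editing a preset changes the fingerprint even when TRAIN_CONFIG is untouched.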

🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3842be42-f99a-4eea-b7a8-a51d67b7bc06

📥 Commits

Reviewing files that changed from the base of the PR and between 7377277 and 91e4453.

⛔ Files ignored due to path filters (1)
  • keys/ovl_ed25519.pub is excluded by !**/*.pub
📒 Files selected for processing (26)
  • .gitattributes
  • .gitignore
  • README.md
  • RUNBOOK.md
  • data/shakespeare_sample.txt
  • demo.py
  • notebooks/colab_gpu_reproducibility.ipynb
  • requirements.txt
  • run_experiment.py
  • src/artifacts.py
  • src/config.py
  • src/dataset.py
  • src/ddp_repro.py
  • src/device.py
  • src/eval.py
  • src/experiment.py
  • src/global_manifest.py
  • src/gpu_reproducibility_test.py
  • src/main.py
  • src/model.py
  • src/plot_divergence.py
  • src/reproducibility.py
  • src/signing.py
  • src/telemetry.py
  • sweep.py
  • tests/test_experiment.py

coderabbitai Bot commented Jun 20, 2026

Contributor

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 11 file(s) based on 12 unresolved review comments.

Files modified:

  • demo.py
  • experiments/verifiable_llm_experiment.ipynb
  • run_experiment.py
  • src/config.py
  • src/dataset.py
  • src/device.py
  • src/experiment.py
  • src/global_manifest.py
  • src/plot_divergence.py
  • src/signing.py
  • tests/test_experiment.py

Commit: 3230960c7b85096917b6a711b8cec0857e58b206

The changes have been pushed to the expansion branch.

Time taken: 7m 36s

Fixed 11 file(s) based on 12 unresolved review comments.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/dataset.py (1)

168-187: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

CIFAR download lacks hash verification.

Unlike shakespeare and enwik8, the CIFAR-10 download at line 187 doesn't pass an expected_hash, leaving it vulnerable to tampering. For a reproducibility verification system, this inconsistency weakens the integrity guarantees—especially since the downloaded tarball is later unpickled.

🛡️ Suggested fix

Add CIFAR to _SOURCES with its known hash and use it:

 _SOURCES = {
     "shakespeare": {...},
     "enwik8": {...},
+    "cifar10": {
+        "url": "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz",
+        "sha256": "<compute_actual_hash>",
+    },
 }
-                    _download(self.URL, tgz)
+                    _download(
+                        _SOURCES["cifar10"]["url"],
+                        tgz,
+                        expected_hash=_SOURCES["cifar10"]["sha256"],
+                    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dataset.py` around lines 168 - 187, The _download call for CIFAR-10 in
the _load method does not pass an expected_hash parameter, unlike other datasets
in the codebase, which compromises integrity verification. Add CIFAR-10 to the
_SOURCES dictionary with its known hash value (similar to how shakespeare and
enwik8 are configured), then retrieve and pass this hash to the _download
function call when downloading the tarball to ensure consistent hash
verification across all datasets.
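
A minimal sketch of the check that an `expected_hash` argument implies is shown below. The helper names `sha256_of` and `verify_download` are hypothetical and the repository's `_download` may structure this differently; no real CIFAR-10 hash is given here because the review leaves it as a placeholder.

```python
import hashlib


def sha256_of(path: str, block_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path: str, expected_hash: str) -> None:
    """Fail closed: raise before any extraction or unpickling can happen."""
    actual = sha256_of(path)
    if actual != expected_hash.lower():
        raise ValueError(f"hash mismatch for {path}: got {actual}")
```

Calling `verify_download` immediately after the download, and before the tarball is opened, keeps an unverified archive from ever reaching the unpickling step.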
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/dataset.py`:
- Around line 39-42: Replace the placeholder SHA-256 hash in the enwik8 dataset
configuration with the correct actual hash value for the enwik8.zip file.
Additionally, update the URL in the enwik8 dictionary entry from HTTP to HTTPS
to match the security standards used for other sources like shakespeare and to
prevent man-in-the-middle attacks during download verification.
- Around line 148-153: The max_start calculation in the get_batch method is off
by one. Since the target tensor y uses an offset of i+1 (reading from position
i+1 to i+1+block_size), the maximum valid starting index should be data.size(0)
- block_size - 1, not data.size(0) - block_size. Change the max_start
calculation to subtract an additional 1 to account for this offset, ensuring
that both x and y stay within tensor bounds when sampled at the maximum index.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf7e3491-ef9e-48e8-975f-b0216bdd7065

📥 Commits

Reviewing files that changed from the base of the PR and between 91e4453 and 3230960.

📒 Files selected for processing (11)
  • demo.py
  • experiments/verifiable_llm_experiment.ipynb
  • run_experiment.py
  • src/config.py
  • src/dataset.py
  • src/device.py
  • src/experiment.py
  • src/global_manifest.py
  • src/plot_divergence.py
  • src/signing.py
  • tests/test_experiment.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/global_manifest.py
  • run_experiment.py
  • src/plot_divergence.py
  • src/signing.py
  • src/device.py
  • demo.py
  • tests/test_experiment.py
  • src/experiment.py

ryoari and others added 2 commits June 20, 2026 14:03
- CharDataset.get_batch: max_start was len-block_size, letting the target
  slice data[i+1:i+1+block_size] read one past the end (silently truncated,
  producing x/y length mismatch at the boundary). Use len-block_size-1.
- enwik8: replace placeholder SHA-256 with the verified hash of enwik8.zip
  (547994d9...) and switch the URL to HTTPS.
- shakespeare: the pinned hash was also a placeholder (5c2b5e66...), which
  would fail verification and silently fall back to the bundled sample. Pin
  the real tinyshakespeare hash (86c4e6aa...) so the full corpus is used.

Addresses CodeRabbit review comments on PR #92.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
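
The bound in the get_batch fix can be checked with a small pure-Python sketch. It uses lists and `random.Random` instead of tensors so it runs without PyTorch, and the function signature is an assumption rather than the one in src/dataset.py. Note that `random.randint` is inclusive on both ends, whereas `torch.randint` has an exclusive upper bound, so a tensor version would pass `max_start + 1` as the upper limit.

```python
import random
from typing import List, Sequence, Tuple


def get_batch(data: Sequence[int], block_size: int, batch_size: int,
              rng: random.Random) -> Tuple[List[List[int]], List[List[int]]]:
    """Sample (x, y) windows where y is x shifted right by one position."""
    # y reads data[i+1 : i+1+block_size], so the last valid start index is
    # len(data) - block_size - 1.
    max_start = len(data) - block_size - 1
    if max_start < 0:
        raise ValueError("need at least block_size + 1 tokens")
    xs, ys = [], []
    for _ in range(batch_size):
        i = rng.randint(0, max_start)  # inclusive on both ends
        xs.append(list(data[i:i + block_size]))
        ys.append(list(data[i + 1:i + 1 + block_size]))
    return xs, ys
```

With exactly block_size + 1 tokens the only valid start index is 0, and every sampled target window keeps the full block_size length.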
…rash)

reproducibility.py loads checkpoints with map_location=DEVICE, which moves
every tensor (including the saved RNG state) onto the accelerator. But
torch.set_rng_state and torch.cuda.set_rng_state_all both require CPU
ByteTensors, so the segmented-replay audit crashed on GPU with
"RNG state must be a torch.ByteTensor" (it passed on CPU because
map_location=cpu left the state valid).

Move the CPU RNG state (3 audit restore sites) and the accelerator RNG
state tensors (device.restore_accel_rng_state) back to CPU before
restoring. No-op on CPU; fixes the CLEAN AUDIT on CUDA.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

[EXPERIMENT]: Validate checkpoint reproducibility across identical training runs

1 participant