Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR replaces the toy TinyDataset/TinyGPT setup with a production-grade reproducibility verification system: ed25519 checkpoint signing, a multi-architecture model zoo, real corpus dataset loaders, parameterized determinism/precision control, a twin-run experiment engine, a signed segmented-replay audit, and CLI tools for sweeping and demoing reproducibility across precision/determinism conditions. ChangesOpenVerifiableLLM Reproducibility & Security Overhaul
Sequence Diagram(s)sequenceDiagram
participant CLI as demo.py / sweep.py
participant run_one as experiment.run_one
participant prepare_run
participant _single_train
participant signing as signing.sign_file
participant _merkle_from_model
participant reproducibility as reproducibility.run_training_segment
CLI->>run_one: model, dataset, precision, deterministic
run_one->>prepare_run: seed RNGs, apply_precision, configure_determinism
prepare_run-->>run_one: configured
run_one->>_single_train: run A
_single_train-->>run_one: param_sha256_A, losses_A
run_one->>_single_train: run B (twin)
_single_train-->>run_one: param_sha256_B, losses_B
run_one->>_merkle_from_model: save safetensors + build Merkle
_merkle_from_model->>signing: sign artifact
signing-->>_merkle_from_model: .sig written
_merkle_from_model-->>run_one: merkle_root, chunk_count
run_one-->>CLI: record {reproducible, first_divergence_step, ...}
Note over reproducibility: Segmented-replay audit
reproducibility->>signing: verify_file before torch.load
alt signature valid
signing-->>reproducibility: True
reproducibility->>reproducibility: restore RNG state, replay training
else tampered
signing-->>reproducibility: raises SignatureError
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 12
🧹 Nitpick comments (1)
src/artifacts.py (1)
85-101: ⚡ Quick winCompute file hash in the same pass as chunking.
Current two-pass approach can produce mismatched
chunksvs top-levelsha256if the file is mutated between reads. A single streaming pass avoids TOCTOU inconsistency and extra I/O.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/artifacts.py` around lines 85 - 101, The file is being read twice - once in the while loop to compute individual chunk hashes and offsets, and again separately with compute_sha256(file_path=path) to compute the overall file hash. This two-pass approach creates a TOCTOU vulnerability where the file could be modified between reads, resulting in mismatched hashes. Instead, compute the overall file hash during the same streaming pass as the chunking operation by accumulating a hash object as you iterate through chunks in the while loop, then use that final accumulated hash value instead of making the separate compute_sha256(file_path=path) call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@demo.py`:
- Line 134: The print statement on line 134 uses an f-string prefix (f"...") but
contains no format placeholders or variable interpolations. Remove the f prefix
from the string literal in the print function call to convert it to a regular
string literal, since there are no curly braces or variables that need to be
interpolated. This will resolve the Ruff F541 lint error.
In `@run_experiment.py`:
- Around line 38-40: Add validation logic after the argument parser processes
the --model, --dataset, and --precision arguments to check for valid
model/dataset combinations before training begins. After the p.parse_args()
call, implement an early guard that verifies the selected model and dataset pair
are compatible, and exit with a clear error message if they are not (for
example, reject combinations like cnn with shakespeare). This prevents invalid
pairs from reaching deeper code paths where errors are less actionable.
In `@src/config.py`:
- Around line 81-84: The get_config_hash() function currently only includes
TRAIN_CONFIG in the hash calculation, which means changes to MODEL_PRESETS are
not reflected in the config fingerprint. To fix this, modify the function to
include MODEL_PRESETS along with TRAIN_CONFIG when creating the JSON dump before
hashing. Create a dictionary that contains both TRAIN_CONFIG and MODEL_PRESETS,
then encode and hash this combined data instead of just TRAIN_CONFIG alone to
ensure all configuration changes are captured in the hash.
- Around line 68-77: The model_config function does not normalize the model_name
parameter to lowercase before looking it up in MODEL_PRESETS, while
build_model() performs lowercasing. This inconsistency causes mixed-case inputs
to potentially miss their architecture presets. Normalize model_name to
lowercase before the MODEL_PRESETS.get(model_name, {}) lookup call to ensure
consistent behavior between model_config and build_model functions.
In `@src/dataset.py`:
- Around line 33-36: The dataset ingest pipeline needs security hardening
against tampered archives and malicious pickle files. Add pinned hash values to
the _SOURCES dictionary alongside each URL, then in the extraction and
deserialization code (around lines 165-172), implement integrity verification by
computing the hash of downloaded files and comparing against the pinned values,
validate all tar archive members to prevent path traversal attacks by checking
member paths don't escape the target directory, and fail closed by raising an
exception immediately if any verification fails before proceeding with pickle
deserialization.
- Around line 125-131: The get_batch method in CharDataset has off-by-one errors
that prevent valid edge cases. The calculation of max_start on line 126 should
be self.data.size(0) - block_size (remove the extra -1 subtraction) to allow
sequences when data length equals block_size plus one. Additionally, the
torch.randint call on line 130 needs to use max_start + 1 as the upper bound
(instead of max_start) to include the last valid starting index, since
torch.randint's upper bound is exclusive.
In `@src/device.py`:
- Around line 181-185: The autocast code in the conditional block checking for
mode "bf16" and "fp16" will fail on CPU-only systems when mode="fp16" because
PyTorch 2.10 doesn't support float16 autocast on CPU. After determining
device_type using get_device().type, add an explicit check: if the device_type
is CPU and mode is "fp16", either raise a clear ValueError explaining the
incompatibility or return nullcontext() to disable autocast gracefully. This
guard should be placed before the torch.autocast call to prevent the failure.
In `@src/experiment.py`:
- Around line 142-143: The code crashes with an IndexError when accessing
lossesA[-1] if total_steps is set to 0 through the overrides parameter. After
calling model_config in the run_one function to create the cfg variable, add
validation to check that cfg.total_steps is at least 1, and raise an appropriate
ValueError or similar exception if total_steps is less than 1 to prevent the
crash downstream.
In `@src/plot_divergence.py`:
- Around line 21-24: The divergence_signal function silently truncates
mismatched input lists to the shorter length using min(len(a), len(b)), which
can hide tail divergence and report incorrect results on malformed records.
Replace the truncation logic with an explicit check that raises an exception
when the lengths of parameters a and b differ, implementing a fail-fast approach
to catch data inconsistencies early.
In `@src/signing.py`:
- Around line 65-69: The load_verify_key() function currently falls back to
generate_keypair() when the PUBLIC_KEY_PATH file is missing, which can cause
infinite recursion if the private key exists but the public key doesn't, and
silently replaces verification trust material. Replace the fallback behavior by
raising an exception (such as FileNotFoundError) when PUBLIC_KEY_PATH does not
exist, instead of calling generate_keypair(). This ensures the function fails
closed and does not silently modify security-critical verification material.
- Around line 117-119: The `verified_torch_load()` function has a TOCTOU
vulnerability where `verify_file(path)` checks the file at a specific moment,
but then `torch.load(path, ...)` re-reads the file from disk, creating a window
where an attacker could swap the file contents. To fix this, read the file
contents into memory once, verify the bytes against the signature, and then
deserialize those same in-memory bytes using torch.load with a file-like object
or BytesIO, ensuring verification and deserialization operate on identical data
without any re-reads from disk.
In `@tests/test_experiment.py`:
- Around line 32-36: In the torch import block where HAS_TORCH is set, replace
the broad `except Exception:` clause with `except ImportError:` to catch only
the expected module import failure. This prevents masking unrelated exceptions
that occur during torch's import process, such as syntax errors or missing
dependencies, which should propagate and be caught as real failures rather than
silently setting HAS_TORCH to False and skipping tests.
---
Nitpick comments:
In `@src/artifacts.py`:
- Around line 85-101: The file is being read twice - once in the while loop to
compute individual chunk hashes and offsets, and again separately with
compute_sha256(file_path=path) to compute the overall file hash. This two-pass
approach creates a TOCTOU vulnerability where the file could be modified between
reads, resulting in mismatched hashes. Instead, compute the overall file hash
during the same streaming pass as the chunking operation by accumulating a hash
object as you iterate through chunks in the while loop, then use that final
accumulated hash value instead of making the separate
compute_sha256(file_path=path) call.
🪄 Autofix (Beta)
✅ Autofix completed
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3842be42-f99a-4eea-b7a8-a51d67b7bc06
⛔ Files ignored due to path filters (1)
keys/ovl_ed25519.pubis excluded by!**/*.pub
📒 Files selected for processing (26)
.gitattributes.gitignoreREADME.mdRUNBOOK.mddata/shakespeare_sample.txtdemo.pynotebooks/colab_gpu_reproducibility.ipynbrequirements.txtrun_experiment.pysrc/artifacts.pysrc/config.pysrc/dataset.pysrc/ddp_repro.pysrc/device.pysrc/eval.pysrc/experiment.pysrc/global_manifest.pysrc/gpu_reproducibility_test.pysrc/main.pysrc/model.pysrc/plot_divergence.pysrc/reproducibility.pysrc/signing.pysrc/telemetry.pysweep.pytests/test_experiment.py
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. Fixes Applied SuccessfullyFixed 11 file(s) based on 12 unresolved review comments. Files modified:
Commit: The changes have been pushed to the Time taken: |
Fixed 11 file(s) based on 12 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/dataset.py (1)
168-187:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCIFAR download lacks hash verification.
Unlike shakespeare and enwik8, the CIFAR-10 download at line 187 doesn't pass an
expected_hash, leaving it vulnerable to tampering. For a reproducibility verification system, this inconsistency weakens the integrity guarantees—especially since the downloaded tarball is later unpickled.🛡️ Suggested fix
Add CIFAR to
_SOURCESwith its known hash and use it:_SOURCES = { "shakespeare": {...}, "enwik8": {...}, + "cifar10": { + "url": "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz", + "sha256": "<compute_actual_hash>", + }, }- _download(self.URL, tgz) + _download( + _SOURCES["cifar10"]["url"], + tgz, + expected_hash=_SOURCES["cifar10"]["sha256"], + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/dataset.py` around lines 168 - 187, The _download call for CIFAR-10 in the _load method does not pass an expected_hash parameter, unlike other datasets in the codebase, which compromises integrity verification. Add CIFAR-10 to the _SOURCES dictionary with its known hash value (similar to how shakespeare and enwik8 are configured), then retrieve and pass this hash to the _download function call when downloading the tarball to ensure consistent hash verification across all datasets.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/dataset.py`:
- Around line 39-42: Replace the placeholder SHA-256 hash in the enwik8 dataset
configuration with the correct actual hash value for the enwik8.zip file.
Additionally, update the URL in the enwik8 dictionary entry from HTTP to HTTPS
to match the security standards used for other sources like shakespeare and to
prevent man-in-the-middle attacks during download verification.
- Around line 148-153: The max_start calculation in the get_batch method is off
by one. Since the target tensor y uses an offset of i+1 (reading from position
i+1 to i+1+block_size), the maximum valid starting index should be data.size(0)
- block_size - 1, not data.size(0) - block_size. Change the max_start
calculation to subtract an additional 1 to account for this offset, ensuring
that both x and y stay within tensor bounds when sampled at the maximum index.
---
Outside diff comments:
In `@src/dataset.py`:
- Around line 168-187: The _download call for CIFAR-10 in the _load method does
not pass an expected_hash parameter, unlike other datasets in the codebase,
which compromises integrity verification. Add CIFAR-10 to the _SOURCES
dictionary with its known hash value (similar to how shakespeare and enwik8 are
configured), then retrieve and pass this hash to the _download function call
when downloading the tarball to ensure consistent hash verification across all
datasets.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bf7e3491-ef9e-48e8-975f-b0216bdd7065
📒 Files selected for processing (11)
demo.pyexperiments/verifiable_llm_experiment.ipynbrun_experiment.pysrc/config.pysrc/dataset.pysrc/device.pysrc/experiment.pysrc/global_manifest.pysrc/plot_divergence.pysrc/signing.pytests/test_experiment.py
🚧 Files skipped from review as they are similar to previous changes (8)
- src/global_manifest.py
- run_experiment.py
- src/plot_divergence.py
- src/signing.py
- src/device.py
- demo.py
- tests/test_experiment.py
- src/experiment.py
- CharDataset.get_batch: max_start was len-block_size, letting the target slice data[i+1:i+1+block_size] read one past the end (silently truncated, producing x/y length mismatch at the boundary). Use len-block_size-1. - enwik8: replace placeholder SHA-256 with the verified hash of enwik8.zip (547994d9...) and switch the URL to HTTPS. - shakespeare: the pinned hash was also a placeholder (5c2b5e66...), which would fail verification and silently fall back to the bundled sample. Pin the real tinyshakespeare hash (86c4e6aa...) so the full corpus is used. Addresses CodeRabbit review comments on PR #92. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rash) reproducibility.py loads checkpoints with map_location=DEVICE, which moves every tensor (including the saved RNG state) onto the accelerator. But torch.set_rng_state and torch.cuda.set_rng_state_all both require CPU ByteTensors, so the segmented-replay audit crashed on GPU with "RNG state must be a torch.ByteTensor" (it passed on CPU because map_location=cpu left the state valid). Move the CPU RNG state (3 audit restore sites) and the accelerator RNG state tensors (device.restore_accel_rng_state) back to CPU before restoring. No-op on CPU; fixes the CLEAN AUDIT on CUDA. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Extends OpenVerifiableLLM from a single-batch toy into a swept experiment that
produces a models × conditions results matrix of training-reproducibility
verdicts, and hardens checkpoint loading with ed25519 verify-before-deserialize.
Two verdicts are kept deliberately separate (the crux of the experiment):
reproducible,first_divergence_step)vs_fp32_bitwise,vs_fp32_losstol)The matrix surfaces the planted result: every cell is run-to-run reproducible,
yet bf16 silently disagrees with the fp32 reference — reproducible ≠ correct.
What's included
src/experiment.py— shared engine: twin runs →reproducible/first_divergence_step, safetensors + Merkle, one JSON record per cell.run_experiment.py/sweep.py/demo.py— single cell / matrix grid / one-command narrative arc.src/signing.py— ed25519sign_file/verify_file/verified_torch_load(verify signature before deserialization; replaces load-then-SHA256, which executed code before the integrity check).src/config.py,src/model.py,src/dataset.py,src/device.py— scaled config + presets, configurable TinyGPT / MLP / LSTM / CNN, char-level corpora + CIFAR with replay-exactget_batch(), precision (fp32/tf32/bf16) + determinism toggles.src/reproducibility.py— segmented-replay audit (5 scenarios) rewired to the scaled model; broken seal now rejected pre-deserialization.src/plot_divergence.py(T7),src/ddp_repro.py(T9 stretch),tests/,RUNBOOK.md.Verification (local, CPU)
python -m unittest tests.test_artifacts tests.test_experiment→ 14 pass, 3 GPU tests skip.python sweep.py --quick→ 12 cells, all run-to-run PASS, 3 bf16 cells DIFF from fp32 reference.reproducibility.pysmoke → CLEAN AUDIT: PASS · BROKEN-SEAL REJECTED BEFORE LOAD: YES.GPU-only cells (TF32 divergence on Ampere, determinism-OFF, cross-GPU) are run on a RunPod pod per
RUNBOOK.md.Addressed Issues:
Fixes #62
Related to #37
Screenshots/Recordings:
Additional Notes:
SAMEin the local matrix; it only divergeson Ampere GPUs. Determinism-OFF and cross-GPU divergence are empirical and
must be confirmed on the pod before being relied on — see CONTEXT.md / RUNBOOK.md.
keys/ovl_ed25519.key) is git-ignored; only the public keyis committed.
results/andartifacts/are git-ignored.bitwise identity (strong, brittle, hardware-bound) or loss-tolerance
(portable, but admits silent precision drift)?
AI Usage Disclosure:
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.
I have used the following AI models and tools: Claude code for test case expansion
Checklist
Summary by CodeRabbit