Virtual experts #45
Merged
chrishayuk merged 80 commits into main on May 4, 2026
# walk_path_audit — baseline index
Per-path equivalence audit for `WalkFfn` dispatch paths. Each entry
below records a measurement of one (model, vindex variant) pair against
the `WeightFfn` dense matmul reference, with the assertion bounds
locked in from that measurement.
## Methodology
For each `WalkFfn` path a forced-dispatch measurement is taken via a
`MaskedGateIndex` wrapper that hides the `has_*` flags above the target
path in the routing ladder. Three prompts (anchor + factual + code)
are run end-to-end through `predict_with_ffn`, with a per-layer
`DualFfn` capturing the diff between the path's output and the
reference at every (layer, position).
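The forced-dispatch idea can be sketched in Rust as a wrapper that reports a capability as absent whenever it sits above the target rung of the ladder. Everything in this sketch is a hypothetical stand-in: the `Path` enum, the `LADDER` order, the single-method `GateIndex` trait, `FullIndex`, and `MaskedIndex` are illustrative names, and the real `MaskedGateIndex` and its `has_*` flags may be shaped differently.

```rust
/// Hypothetical dispatch paths; names loosely follow the paths this audit mentions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Path {
    Fp4Storage,
    InterleavedQ4k,
    FullMmap,
    Sparse,
}

/// Assumed routing ladder, highest priority first.
/// The real order is not stated in this document.
const LADDER: [Path; 4] = [
    Path::Fp4Storage,
    Path::InterleavedQ4k,
    Path::FullMmap,
    Path::Sparse,
];

/// Position of a path in the ladder (0 = tried first).
fn rung(path: Path) -> usize {
    LADDER
        .iter()
        .position(|&p| p == path)
        .expect("every Path variant is listed in LADDER")
}

/// Stand-in for the index's `has_*` capability flags.
trait GateIndex {
    fn has(&self, path: Path) -> bool;
}

/// Toy index that advertises every capability.
struct FullIndex;

impl GateIndex for FullIndex {
    fn has(&self, _path: Path) -> bool {
        true
    }
}

/// Wrapper that hides every capability above `target` in the ladder.
struct MaskedIndex<'a, G: GateIndex> {
    inner: &'a G,
    target: Path,
}

impl<G: GateIndex> GateIndex for MaskedIndex<'_, G> {
    fn has(&self, path: Path) -> bool {
        // Rungs above the target are reported as absent; the rest pass through.
        rung(path) >= rung(self.target) && self.inner.has(path)
    }
}

/// Router: take the first rung whose capability flag is set.
fn dispatch<G: GateIndex>(index: &G) -> Option<Path> {
    LADDER.iter().copied().find(|&p| index.has(p))
}
```

Because the wrapper implements the same trait as the index it wraps, the router needs no audit-specific branch: with the toy `FullIndex`, `dispatch` picks the top rung unmasked and the target rung once masked.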
Assertion metrics are **cos** and **relative L2** (`L2 / ‖primary‖`),
both magnitude-invariant. Absolute L2 and max-element drift are kept
as diagnostic columns to surface residual-magnitude outliers (e.g. the
L11/code/1 ` fibonacci` spike on Gemma 3 4B) without driving the
verdict. Per-path bounds use a measure-then-tighten rule: the cosine
floor is the measured worst truncated to one fewer decimal place (a
measured 0.999997 gives a floor of 0.99999), and the rel_L2 ceiling is
the measured worst × 4.
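A minimal sketch of the two assertion metrics, assuming each observation is a pair of equal-length `f32` activation slices. The function names and the `primary`/`secondary` argument naming are assumptions for illustration; the harness's own helpers are not shown in this document.

```rust
/// Cosine similarity between two activation vectors, accumulated in f64.
/// Returns NaN if either vector has zero norm.
fn cosine(primary: &[f32], secondary: &[f32]) -> f64 {
    assert_eq!(primary.len(), secondary.len());
    let (mut dot, mut norm_p, mut norm_s) = (0.0f64, 0.0f64, 0.0f64);
    for (&p, &s) in primary.iter().zip(secondary) {
        let (p, s) = (p as f64, s as f64);
        dot += p * s;
        norm_p += p * p;
        norm_s += s * s;
    }
    dot / (norm_p.sqrt() * norm_s.sqrt())
}

/// Relative L2: ‖primary − secondary‖ / ‖primary‖.
/// Returns NaN or infinity if `primary` has zero norm.
fn rel_l2(primary: &[f32], secondary: &[f32]) -> f64 {
    assert_eq!(primary.len(), secondary.len());
    let (mut diff_sq, mut norm_p) = (0.0f64, 0.0f64);
    for (&p, &s) in primary.iter().zip(secondary) {
        let (p, s) = (p as f64, s as f64);
        diff_sq += (p - s) * (p - s);
        norm_p += p * p;
    }
    diff_sq.sqrt() / norm_p.sqrt()
}
```

Scaling both vectors by the same factor leaves both values unchanged, which is the sense in which they are magnitude-invariant; absolute L2 would scale with the factor, which is why it is kept as a diagnostic only.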
Source: `crates/larql-inference/examples/walk_path_audit.rs`.
## Baselines
| date | model | vindex | paths tested | min cos | max rel L2 | n_obs | verdict |
|---|---|---|---|---|---|---|---|
| 2026-05-01 | google/gemma-3-4b-it | gemma3-4b-f16 | sparse, full_mmap, exact | 0.999997 | 1.881e-3 | 1,326 | 3/3 PASS |
### 2026-05-01 — Gemma 3 4B f16 (canonical baseline)
The f32 paths agree at a minimum cos of 0.999997 across 1,326
observations. Three independent code paths land on identical assertion
values, and the dispatch trace was verified at 102/102 layers per path.
The worst rel_L2 (1.881e-3) was observed at L32/paris/0, the BOS
position of the Paris prompt. The top-1 token matches dense on all
three prompts × three paths, and the Paris probability holds to within
1.4e-4 of dense.
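As an illustration of how per-(layer, position) observations could roll up into the table's `min cos`, `max rel L2`, and `n_obs` columns, here is a small sketch. The `Observation` and `Summary` types and the `summarize` function are hypothetical, and the `L{layer}/{prompt}/{position}` label format is inferred from labels such as L32/paris/0 in this document.

```rust
/// One (layer, prompt, position) comparison against the dense reference.
#[derive(Debug, Clone, Copy)]
struct Observation {
    layer: usize,
    prompt: &'static str,
    position: usize,
    cos: f64,
    rel_l2: f64,
}

/// Aggregate columns for one path.
#[derive(Debug)]
struct Summary {
    n_obs: usize,
    min_cos: f64,
    max_rel_l2: f64,
    /// Label of the observation with the worst rel_L2.
    worst_at: String,
}

/// Fold a path's observations into its summary row.
/// Returns `None` when there are no observations.
fn summarize(obs: &[Observation]) -> Option<Summary> {
    // `total_cmp` gives a total order on f64, so NaN cannot panic the comparison.
    let worst = obs.iter().max_by(|a, b| a.rel_l2.total_cmp(&b.rel_l2))?;
    let min_cos = obs.iter().map(|o| o.cos).fold(f64::INFINITY, f64::min);
    Some(Summary {
        n_obs: obs.len(),
        min_cos,
        max_rel_l2: worst.rel_l2,
        worst_at: format!("L{}/{}/{}", worst.layer, worst.prompt, worst.position),
    })
}
```

Tracking the location of the worst observation alongside its value is what makes an outlier attributable to a specific layer and token rather than to the path as a whole.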
Bounds locked: `cos ≥ 0.99999, rel_L2 ≤ 1e-2` for the exact bucket.
The rel_L2 ceiling is intentionally loose pending the Q4K and FP4
baseline measurements; see the inline comment at `BOUND_EXACT` for the
sequencing rule. Target post-matrix tightening: ~7.5e-3 (= measured
worst 1.881e-3 × 4).
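The measure-then-tighten arithmetic can be checked with a short sketch. The `Bound` struct, `tighten`, and `passes` are hypothetical names, and reading "one decimal less precise" as truncation to one fewer decimal place is an assumption that happens to reproduce the 0.99999 floor from the measured 0.999997.

```rust
/// A per-bucket assertion bound (hypothetical shape; `BOUND_EXACT` itself is not shown here).
#[derive(Debug, Clone, Copy)]
struct Bound {
    cos_floor: f64,
    rel_l2_ceiling: f64,
}

/// Derive a bound from a measurement.
/// `cos_decimals` is the number of decimal places the measured cosine was reported with.
fn tighten(measured_min_cos: f64, cos_decimals: i32, measured_max_rel_l2: f64) -> Bound {
    // Truncate the cosine to one fewer decimal place than it was measured at.
    let scale = 10f64.powi(cos_decimals - 1);
    Bound {
        cos_floor: (measured_min_cos * scale).floor() / scale,
        // Ceiling is the measured worst times four.
        rel_l2_ceiling: measured_max_rel_l2 * 4.0,
    }
}

/// A measurement passes when it sits on the right side of both limits.
fn passes(bound: Bound, min_cos: f64, max_rel_l2: f64) -> bool {
    min_cos >= bound.cos_floor && max_rel_l2 <= bound.rel_l2_ceiling
}
```

Fed the baseline values (0.999997 at six decimals, 1.881e-3), `tighten` yields a cosine floor of 0.99999 and a rel_L2 ceiling of 7.524e-3, which is the ~7.5e-3 target and is tighter than the currently locked 1e-2.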
Artifacts: `walk_path_audit_gemma3_4b_f16_baseline.{md,json}`.
## Sequenced follow-ups
Each follow-up is its own measure-bound-commit cycle in a separate PR:
1. `gemma3-4b-q4k-v2.vindex` → measure `interleaved_q4k:dequant`, set
   the quantized rel_L2 bound at measured × 4.
2. `gemma3-4b-fp4a.vindex` → measure `fp4_storage:sparse`, set the fp4
   bound at measured × 4.
3. Single cross-bucket bound-tightening commit once all three
   measurements are in (will tighten the f16 exact rel_L2 from the
   intentionally loose 1e-2 to ~7.5e-3).
---
Commit message (paste into a HEREDOC or your editor):
docs(audits): walk path equivalence index — f16 baseline cos=0.999997
Adds docs/audits/walk_path_audit/INDEX.md documenting the per-path
equivalence audit methodology and recording the canonical Gemma 3 4B
f16 baseline measurement.
Headline finding: the f32 paths (sparse, full_mmap, exact) agree at
cos = 0.999997 across 1,326 observations. Three independent code
paths land on identical assertion values, and the dispatch trace is
verified at 102/102 layers per path. All three pass cos ≥ 0.99999,
rel_L2 ≤ 1e-2
with comfortable margin. Top-1 matches on every prompt × path; Paris
probability holds to within 1.4e-4 of dense. Worst rel_L2 observed at
L32/paris/0.
The harness (walk_path_audit.rs example), the MaskedGateIndex
wrapper, and the per-path baseline artifacts landed in 84aee5a,
bundled with unrelated working-tree work. This commit is a follow-up
to make the audit and its baseline discoverable via `git log` and
repo search.
Searchable terms: walk path equivalence, walk_path_audit, f16
baseline, MaskedGateIndex, cos 0.999997, 1326 observations, dispatch
trace, WeightFfn / WalkFfn parity, rel_L2 1.881e-3, L32/paris/0.
Sequenced follow-ups (separate commits, one per vindex variant):
- gemma3-4b-q4k-v2.vindex → measure interleaved_q4k:dequant
- gemma3-4b-fp4a.vindex → measure fp4_storage:sparse
- then a single cross-bucket bound-tightening commit (will close
the deliberately-loose f16 exact rel_L2 ceiling once Q4K and FP4
measurements have set their own measured-worst-×-4 bounds)