
feat(extract): MXFP4-aware streaming gate_vectors path #40

Open

mikeumus wants to merge 22 commits into chrishayuk:main from Divinci-AI:feat/streaming-extract-mxfp4

Conversation

@mikeumus
Contributor

The streaming extract has its own get_tensor_f32 helper (separate from the loader tensor_to_f32 in #35). It silently returned None for any non-{F32,F16,BF16} dtype, so per-expert MXFP4 gates (I8 packed nibbles + F8_E8M0 scales) were skipped during gate_vectors.bin extraction — every layer recorded "0.0s" and the output file was 0 bytes.

Adds an I8+F8_E8M0 MXFP4 detection path at the top of get_tensor_f32:

  • If the tensor name ends in .weight and dtype is I8
  • And a .scale companion of dtype F8_E8M0 exists with matching rows
  • And the cols ratio implies a sane group size {16, 32, 64, 128}
    → Unpack via crate::format::quant::mxfp4::dequantize_expert and return the resulting f32 Array2 directly.

Result: larql extract --level browse unsloth/DeepSeek-V4-Flash now produces a populated gate_vectors.bin (~1 GB) instead of the previous 0 B file.
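
Roughly, the added fast path looks like the sketch below — `TensorView`, `Dtype`, and the exact `dequantize_expert` argument order are stand-ins for the real streaming-extract types, not verbatim code:

    // Hypothetical sketch of the I8 + F8_E8M0 detection added to get_tensor_f32.
    // Assumes a lookup of `TensorView { dtype, shape, bytes }` keyed by tensor name.
    fn try_mxfp4_weight(
        name: &str,
        tensors: &std::collections::HashMap<String, TensorView>,
    ) -> Option<ndarray::Array2<f32>> {
        let weight = tensors.get(name)?;
        if !name.ends_with(".weight") || weight.dtype != Dtype::I8 {
            return None;
        }
        let scale = tensors.get(&name.replace(".weight", ".scale"))?;
        if scale.dtype != Dtype::F8_E8M0 || scale.shape[0] != weight.shape[0] {
            return None;
        }
        // Two FP4 nibbles per packed I8 byte; the column ratio implies the group size.
        let logical_cols = weight.shape[1] * 2;
        let group = logical_cols / scale.shape[1];
        if ![16, 32, 64, 128].contains(&group) {
            return None;
        }
        Some(crate::format::quant::mxfp4::dequantize_expert(
            &weight.bytes,
            &scale.bytes,
            weight.shape[0],
            logical_cols,
        ))
    }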

Stacks on #35 / #36 / #37 / #38 / #39. With this PR the V4 extract pipeline is end-to-end functional.

🤖 Generated with Claude Code

mikeumus and others added 22 commits April 17, 2026 16:59
Proposes extending LarQL from weight-analysis into analysis+editing via
three new subcommands that implement ROME/MEMIT-family algorithms on top
of the existing larql-inference forward pass and capture hooks.

Based on 9 chapters of experimentation on Gemma 4 (4B and 26B) documented
in Divinci-AI/server notebooks/CHAPTER_15 through CHAPTER_23:

- larql crown: per-edit crown-layer discovery via module ablation
- larql edit: single-fact rank-1 edit with auto-scale calibration
- larql memit: batch fact editing via joint least-squares, grouped by crown

Also defines a patch file format (~55KB per Gemma 4 4B single edit) and
a non-destructive larql apply-patch command. Phased 4-step rollout plan.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements Phase A of RFC-0001 (#2): per-layer MLP ablation scan to find
the layer whose last-position MLP output is load-bearing for a given
(prompt, expected-token) pair.

Changes:
- crates/larql-inference/src/ffn/ablating.rs — new LastPositionAblatingFfn
  that wraps any FfnBackend and zeroes its output at the last-token row for
  one target layer. Thin wrapper, no math changes.
- crates/larql-cli/src/commands/extraction/crown_cmd.rs — new `larql crown`
  subcommand. Tokenises the prompt, runs a baseline forward pass, then
  iterates layers in [start..=end] running predict_with_ffn against the
  ablating backend, reports per-layer Δ in expected-token probability and
  picks the layer whose ablation causes the top prediction to flip with the
  largest suppression magnitude.

Methodology matches Phase 125c of Divinci-AI/server
notebooks/CHAPTER_17_CORONATION.md — on Gemma 4 4B, ablating L27 MLP on
"Capital of France? A:" makes the top prediction flip from " Paris" to
"France" (the country token). The command outputs JSON (optional --json)
so downstream commands (edit, memit) can consume the crown_layer field.
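
The wrapper itself is roughly the sketch below; the real `FfnBackend` trait surface (method name, argument types) may differ:

    // Hypothetical shape of LastPositionAblatingFfn: delegate to the inner
    // backend, then zero the last-token row of its output at one target layer.
    struct LastPositionAblatingFfn<B> {
        inner: B,
        target_layer: usize,
    }

    impl<B: FfnBackend> FfnBackend for LastPositionAblatingFfn<B> {
        fn forward(&self, layer: usize, hidden: &ndarray::Array2<f32>) -> ndarray::Array2<f32> {
            let mut out = self.inner.forward(layer, hidden);
            if layer == self.target_layer {
                let last = out.nrows() - 1;
                out.row_mut(last).fill(0.0); // ablate the last-position MLP output
            }
            out
        }
    }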

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… RFC-0001) (#7)

Implements Phase B of RFC-0001 (#2): single-fact rank-1 editor with
portable patch file format. Builds on Phase A's LastPositionAblatingFfn
(#3) and adds the symmetric LastPositionInjectingFfn for scale search.

### New library module: `larql-inference/src/edit.rs`
- `EditPatch` struct (serializable via serde)
- `compute_rank1(k, d, scale, layer, provenance) -> EditPatch`
- `write_patch(path, &patch)` / `read_patch(path) -> EditPatch` with a
  simple binary format: LQPATCH magic + JSON meta + little-endian f32
  vectors for d and k_norm. ~55 KB for Gemma 4 4B.
- `apply_patch(&mut ModelWeights, &EditPatch)`: installs the rank-1
  outer product into `down_proj.weight` in place, handling both
  `[hidden, intermediate]` and `[intermediate, hidden]` layouts (sketched below).
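
A sketch of the rank-1 install, assuming `d` has length `hidden` and `k_norm` has length `intermediate` (the real `apply_patch` also carries scale and layer metadata from the patch header):

    // W' = W + scale * (d ⊗ k_norm), oriented to match whichever layout the
    // stored down_proj tensor uses.
    fn install_rank1(w: &mut ndarray::Array2<f32>, d: &[f32], k_norm: &[f32], scale: f32) {
        let (rows, cols) = w.dim();
        if rows == d.len() && cols == k_norm.len() {
            // [hidden, intermediate] layout
            for (i, di) in d.iter().enumerate() {
                for (j, kj) in k_norm.iter().enumerate() {
                    w[[i, j]] += scale * di * kj;
                }
            }
        } else if rows == k_norm.len() && cols == d.len() {
            // [intermediate, hidden] layout — transposed outer product
            for (i, ki) in k_norm.iter().enumerate() {
                for (j, dj) in d.iter().enumerate() {
                    w[[i, j]] += scale * ki * dj;
                }
            }
        }
    }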

### New FFN wrapper: `larql-inference/src/ffn/injecting.rs`
- `LastPositionInjectingFfn` — adds a fixed delta vector to the inner
  backend's last-row output at one target layer. Symmetric to the
  ablating wrapper from PR #3. Used for auto-scale search.

### New CLI commands
- `larql edit <model> --src "..." --tgt "..." --new-token " Tokyo" --output f2t.lqpatch`
  Runs Phase A crown discovery (or accepts `--layer`), captures k at the
  crown layer for both prompts, computes d = W_down @ (k_tgt - k_src),
  linearly searches [0.5, 1, 1.5, 2, 2.5, 3, 4] for the minimum scale
  that flips the source's top-1 to --new-token, emits the patch.
- `larql apply-patch <model> --patch f2t.lqpatch --prompt "..."`
  Non-destructively installs one or more patches into the loaded
  weights, optionally runs a test prediction. Supports `--reverse`
  to subtract a patch (verifies reversibility).

### Supporting change
- Added `InferenceModel::weights_mut()` accessor so apply-patch can
  mutate the in-memory weight map without reloading.

Methodology validated in Python across Divinci-AI/server
notebooks/CHAPTER_20_HONEY.md (Phase 140c: France→Tokyo with 11/11
specificity at 0.9% weight perturbation) and CHAPTER_18_THE_EDIT.md
(Phase 130 scale search). The Rust port preserves the same math.
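
The search itself reduces to a few lines; this sketch uses a `flips_top1` closure as a stand-in for the injected forward pass plus top-1 check:

    // d = W_down @ (k_tgt - k_src); return the smallest scale whose injected
    // delta flips the source prompt's top-1 prediction to --new-token.
    fn find_min_scale(
        w_down: &ndarray::Array2<f32>,
        k_src: &ndarray::Array1<f32>,
        k_tgt: &ndarray::Array1<f32>,
        flips_top1: impl Fn(&ndarray::Array1<f32>) -> bool,
    ) -> Option<f32> {
        let d = w_down.dot(&(k_tgt - k_src)); // edit direction in hidden space
        [0.5f32, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
            .into_iter()
            .find(|&scale| flips_top1(&d.mapv(|x| x * scale)))
    }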

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the existing covariance-MEMIT solver (larql_inference::forward::memit::
run_memit) with a CLI, an edits.json file format, and automatic crown-layer
discovery for each edit. Groups edits by crown layer, invokes the joint
least-squares solve, emits one dense `.lqpatch` per affected layer plus a
manifest.json. Phase C of RFC-0001 (#2), stacked on Phase B (#4).

### Extended patch file format (still backward compatible)
- Bumped patch version 1 → 2 with a `kind` field (defaults to "rank_one")
- New `kind = "dense"` variant carries a flat row-major ΔW matrix, needed
  because MEMIT's covariance-projected solve isn't natively a rank-1 outer
  product. Larger on disk (~72 MB per Gemma 4 4B layer) but semantically
  exact — no SVD approximation step.
- `write_patch`, `read_patch`, `apply_patch` all dispatch on kind. Phase B
  rank-1 patches continue to round-trip unchanged.
- New `compute_dense()` helper builds a Dense patch from an Array2<f32>.

### New CLI: `larql memit`
- Reads edits.json (list of {label, src, new_token, layer?} records).
- For each edit: tokenises src, resolves target_token_id, resolves crown
  layer (explicit or auto-scan).
- Calls `run_memit` with Vec<MemitFact>, receives one `MemitResult` per
  affected layer.
- Serialises each layer's ΔW as a Dense patch into the output directory,
  writes a manifest.json enumerating them.
- Prints the apply-patch command to install the batch.

### Usage

    cat > edits.json <<EOF
    [
      {"label":"france-to-tokyo","src":"Capital of France? A:",
       "new_token":" Tokyo","layer":27},
      {"label":"germany-to-rome","src":"Capital of Germany? A:",
       "new_token":" Rome","layer":27}
    ]
    EOF

    larql memit /path/to/gemma4 --edits edits.json --output patches/
    larql apply-patch /path/to/gemma4 \
        -p patches/memit_L27.lqpatch \
        --prompt "Capital of France? A:"

### Known ceiling
Chapter 22 established that single-layer MEMIT with correlated keys (~60%
cosine) lands ~3/5 concurrent targets. For 5+ correlated edits, users can
now distribute across multiple crown layers via `layer` overrides in
edits.json — MEMIT runs once per layer group.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… of RFC-0001) (#9)

Exposes the Phase A-C commands as Python callables so the Chapter 15-23
Colab experiments from Divinci-AI/server become one-liner Rust invocations
from Jupyter — no CLI shell-outs, no JSON parsing.

### New module: crates/larql-python/src/edit_py.rs

Four #[pyfunction] entry points:

- crown(model, prompt, expect, start_layer=None, end_layer=None, top_k=100)
  Returns {crown_layer, crown_delta_prob, top_after_ablation, scan: [...]}.

- edit(model, src, tgt, new_token, output, layer=None, scales=None,
       fixed_scale=None, top_k=100, label=None)
  Writes a rank-1 .lqpatch; returns {layer, scale, output, d_norm}.

- apply_patch(model, patches: list[str], prompt=None, top_k=5, reverse=False)
  Applies patches in-memory; optional prompt returns {predictions: [(tok, prob), ...]}.

- memit(model, edits: list[dict], output_dir, ridge=0.01, target_alpha=1.0,
        top_k=100)
  Batch fact editor wrapping run_memit — writes one dense patch per layer
  into output_dir + manifest.

### Wiring

- Registered in _native pymodule (src/lib.rs) via m.add_function.
- Re-exported from python/larql/__init__.py under the public `larql`
  namespace alongside the existing load_vindex/create_session functions.

### Example

    import larql
    scan = larql.crown("/path/to/gemma4",
                       "Capital of France? A:", " Paris")
    print(scan["crown_layer"])                    # 27 (on Gemma 4 4B)

    larql.edit("/path/to/gemma4",
               src="Capital of France? A:",
               tgt="Capital of Japan? A:",
               new_token=" Tokyo",
               output="france_to_tokyo.lqpatch")

    r = larql.apply_patch("/path/to/gemma4",
                          patches=["france_to_tokyo.lqpatch"],
                          prompt="Capital of France? A:")
    print(r["predictions"][0])                    # ['Tokyo', 0.97]

This closes the RFC-0001 phased rollout: Python scripts can now drive the
mechanistic fact-editing pipeline end-to-end.

Compile-checked with `cargo check --package larql-python`. Runtime import
requires `maturin develop` — standard PyO3 workflow, no Python side of
the package changed structurally.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#10)

Gemma 4's `use_double_wide_mlp=True` widens gate/up/down_proj to 2× base
`intermediate_size` on KV-shared layers. On gemma-4-e2b-it (35 layers,
last 20 shared), layers 15–34 have `intermediate=12288`, layers 0–14
have 6144. Crown-scan defaults to `(3n/5)=21` and lands on a double-wide
layer, so the rank-1 edit hit `intermediate-size mismatch in captured
keys` against the config-wide base size.

Adds `ModelArchitecture::intermediate_size_for_layer(layer) -> usize`
(default = `config.intermediate_size`, mirroring `head_dim_for_layer`).
`Gemma4Arch` overrides by reusing the precomputed `kv_sources` set —
one source of truth for KV-shared-layer membership.
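
Sketch of the trait shape (the `config()` accessor is an approximation of how the arch impls reach ModelConfig):

    pub trait ModelArchitecture {
        fn config(&self) -> &ModelConfig;

        // Default mirrors head_dim_for_layer: one size for every layer.
        fn intermediate_size_for_layer(&self, _layer: usize) -> usize {
            self.config().intermediate_size
        }
    }

    impl ModelArchitecture for Gemma4Arch {
        fn config(&self) -> &ModelConfig {
            &self.config
        }

        fn intermediate_size_for_layer(&self, layer: usize) -> usize {
            // kv_sources is the precomputed KV-shared-layer set — one source of truth.
            if self.config.use_double_wide_mlp && self.kv_sources.contains(&layer) {
                self.config.intermediate_size * 2
            } else {
                self.config.intermediate_size
            }
        }
    }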

Thread the per-layer lookup through:
- `edit_py.rs`: compute `intermediate` after `chosen_layer` is picked.
- `edit_cmd.rs`: same for the CLI path.
- `memit.rs`: `ffn_dim` now per-layer; `run_memit` already solves per
  layer, so covariances remain correctly sized across mixed layers.

Parse `use_double_wide_mlp` in `detect.rs`; add to `ModelConfig`.

Tests (in `detect.rs`):
- `test_detect_gemma4_e2b`: asserts 6144 on L0/L14, 12288 on L15/L21/L34
  — matches the actual HF tensor shapes verified in the Colab repl.
- `test_gemma4_31b_no_double_wide`: 31B lacks the flag → base everywhere.
- `test_non_gemma4_intermediate_default`: Llama returns base for all
  layers via the default trait impl.

The bare `weights.intermediate_size` field is left as "base" for
display / metadata call sites (demos, patch-print, vindex stats).
Patch file-format unchanged: `compute_rank1` / `compute_dense` already
derive `intermediate_size` from the runtime tensor, so new patches for
double-wide layers store 12288 correctly without a version bump.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g fixes (#12)

* feat(models): per-layer intermediate_size for Gemma 4 double-wide MLP

Gemma 4's `use_double_wide_mlp=True` widens gate/up/down_proj to 2× base
`intermediate_size` on KV-shared layers. On gemma-4-e2b-it (35 layers,
last 20 shared), layers 15–34 have `intermediate=12288`, layers 0–14
have 6144. Crown-scan defaults to `(3n/5)=21` and lands on a double-wide
layer, so the rank-1 edit hit `intermediate-size mismatch in captured
keys` against the config-wide base size.

Adds `ModelArchitecture::intermediate_size_for_layer(layer) -> usize`
(default = `config.intermediate_size`, mirroring `head_dim_for_layer`).
`Gemma4Arch` overrides by reusing the precomputed `kv_sources` set —
one source of truth for KV-shared-layer membership.

Thread the per-layer lookup through:
- `edit_py.rs`: compute `intermediate` after `chosen_layer` is picked.
- `edit_cmd.rs`: same for the CLI path.
- `memit.rs`: `ffn_dim` now per-layer; `run_memit` already solves per
  layer, so covariances remain correctly sized across mixed layers.

Parse `use_double_wide_mlp` in `detect.rs`; add to `ModelConfig`.

Tests (in `detect.rs`):
- `test_detect_gemma4_e2b`: asserts 6144 on L0/L14, 12288 on L15/L21/L34
  — matches the actual HF tensor shapes verified in the Colab repl.
- `test_gemma4_31b_no_double_wide`: 31B lacks the flag → base everywhere.
- `test_non_gemma4_intermediate_default`: Llama returns base for all
  layers via the default trait impl.

The bare `weights.intermediate_size` field is left as "base" for
display / metadata call sites (demos, patch-print, vindex stats).
Patch file-format unchanged: `compute_rank1` / `compute_dense` already
derive `intermediate_size` from the runtime tensor, so new patches for
double-wide layers store 12288 correctly without a version bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: write-lock starvation on INFER + patch-revert down/up vector leak

Three fixes for larql-server session management:

1. **Bug 1 — write-lock starvation on INFER**: switched sessions_blocking_write → sessions_blocking_read on the INFER path; made last_accessed AtomicU64 so touch() takes &self.
2. **Bug 2 — rebuild_overrides leak**: added base.down_overrides.clear() + base.up_overrides.clear() before replaying patches on remove.
3. **Bug 3 — blocking_read inside async**: pre-acquire base vindex before entering write lock in apply_patch to avoid tokio panic.

All three gates verified: T2 concurrent PASS, T3 global-leak PASS, T4 throughput PASS (mixed p50 0.94× same-session), T5 revert PASS.
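
Minimal sketch of the Bug 1 change — the session type is reduced to the relevant field:

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::{SystemTime, UNIX_EPOCH};

    struct Session {
        // AtomicU64 lets touch() take &self, so the INFER path can take the
        // sessions read lock instead of the write lock on every request.
        last_accessed: AtomicU64, // seconds since the Unix epoch
        // ...vindex handle, overlay state, etc...
    }

    impl Session {
        fn touch(&self) {
            let now = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap_or_default()
                .as_secs();
            self.last_accessed.store(now, Ordering::Relaxed);
        }
    }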

* ci: add isolation-harness gates + synthetic tiny-vindex testdata

Three gates run on every push/PR (T2=concurrent, T3=global-leak, T5=revert).
Requires HARNESS_REPO_TOKEN secret (fine-grained PAT, Contents:read on
Divinci-AI/larql-isolation-harness).

testdata/tiny-vindex is a reproducible 5 MB synthetic vindex generated by
generate.py (seed=42, 8 layers, hidden=128) — no real model weights needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* working on arch b, unified insert

* working on memit with vindex, and templates

* memit style

* working on latest memit

* working on wasm

* working on wasm

* cleaned up vindex and larql

* fix: Linux support — conditional BLAS and Q4 scalar fallback

- Implement Q4 scalar fallback for non-ARM targets:
  - Move decode_f16() before #if aarch64 (shared by both paths)
  - Replace empty stub functions with correct scalar implementations
  - q4_0_matvec_c and q4_0_vecmat_c now produce correct results on x86_64
  Affects: larql-compute/csrc/q4_dot.c

Tested on Ubuntu 24 (WSL2, x86_64): cargo build --release and
cargo test --workspace pass with 0 failures.
macOS path untested — preserves accelerate via cfg(target_os)
and requires validation on Apple hardware.

* working on bounded compute script

* refactored lql

* improved refactor

* updated executor

* gemma 4

* working on compute

* improved for gemma 4

* test: cherry-pick GGUF shape + Q4 correctness tests from chrishayuk#20

* updated examples

* working through python parity

* working on q4k tidyup

* improving testing and quantization

* improving testing

* gemma 4 support

* improved cli

* autoregressive generation

* kv cache works

* working on shader pipeline

* working shaders

* working on shaders and graph

* moved to full graph

* working through ffn walk performance

* working version

* modularized shaders

* working on decoupling decode

* working on performance

* more performance improvements

* improving performance

* more performance improvements

* working on performance

* working on distributed grid

* working on grid

* improving docs and moe

* working on moe

* improved publish pull

* binary format

* working binary format and performance

* updated vindex server specs for binary

* improved lm_head

* improved prefill

* improved lm head

* gemma 4 vindex

* working on gemma 4 moe

* working on cleanup for merge

* fixed issue with select

* residual stream

* working on benchmarks

---------

Co-authored-by: chrishayuk <chrishayuk@googlemail.com>
Co-authored-by: Remi <remipetiot@hotmail.com>
Co-authored-by: chrishayuk <chrishayuk@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on guard for rebuild_overrides (#14)

README:
  Add a fork notice block with badges (Divinci AI, Hugging Face, Vindex
  Viewer Space, License, Upstream link). Frames this repo as the
  Divinci-AI fork of chrishayuk/larql carrying RFC-0001 mechanistic
  fact-editing, Phase-1 unlearning with the revert-leak fix, Gemma 4
  per-layer intermediate-size, and the CI isolation harness.

Test (overlay_apply):
  Add `rebuild_overrides_clears_base_down_and_up_overrides` —
  permanent regression guard for the Phase-1 unlearning revert path.
  Pre-populates `base.down_overrides` + `base.up_overrides` via
  `set_down_vector` / `set_up_vector` (the COMPILE-WITH-REFINE write
  path), pushes any patch onto the overlay so `remove_patch(0)` triggers
  `rebuild_overrides`, then asserts both base maps are empty after
  revert. If a future refactor drops the two `clear()` calls in
  `rebuild_overrides`, this test turns red — catching the same regression
  Gate 3 catches at the integration level, but in 1 ms instead of 5 s.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cloud Run and Kubernetes inject secrets as env vars, not as CLI args.
When the value lives in `valueFrom: secretKeyRef`, Cloud Run does NOT
substitute it into container `args` via `$(VAR)` expansion — that only
works for inline `value:` envs. As a result there's no ergonomic way to
pass a secret to `--api-key` today, and deployments end up unauthenticated
at the app layer even when a bearer token is provisioned.

Adding `env = "LARQL_API_KEY"` to the clap arg lets `valueFrom: secretKeyRef`
flow directly in:

    env:
      - name: LARQL_API_KEY
        valueFrom:
          secretKeyRef:
            name: larql-s2s-token-staging
            key: latest

The CLI arg still wins when both are set (standard clap precedence).
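
For reference, the change is a single attribute on the clap arg (derive API, with clap's `env` feature enabled; struct and field names here are illustrative):

    use clap::Parser;

    #[derive(Parser)]
    struct ServeArgs {
        /// Bearer token for app-layer auth. Falls back to the LARQL_API_KEY
        /// env var when the flag is absent; the CLI value wins when both are set.
        #[arg(long, env = "LARQL_API_KEY")]
        api_key: Option<String>,
    }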

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bump safetensors crate 0.5 → 0.7 (which adds the F8_E8M0 enum variant
required by Open Compute MX format scales) and add bit-pattern → f32
decoders for the four new dtypes in larql-models/src/loading/safetensors.rs.

This unblocks loading any safetensors file that uses MXFP4 expert weights
(I8 packed nibbles + F8_E8M0 per-32-element scales — used by
deepseek-ai/DeepSeek-V4-* and unsloth/DeepSeek-V4-* among others) or
plain FP8 attention weights (F8_E4M3 / F8_E5M2 — GPT-OSS, etc.).
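
For context on the scale dtype: F8_E8M0 in the OCP MX spec is an exponent-only byte, so its decode is essentially a power of two (a sketch, not the exact function added to safetensors.rs):

    // F8_E8M0: 8-bit biased exponent, no sign or mantissa bits.
    // value = 2^(bits - 127); the all-ones pattern is reserved for NaN.
    fn f8_e8m0_to_f32(bits: u8) -> f32 {
        if bits == 0xFF {
            f32::NAN
        } else {
            2f32.powi(bits as i32 - 127)
        }
    }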

Currently `tensor_to_f32` decodes each tensor in isolation. Proper MXFP4
unpacking (where the I8 packed-nibble weight is paired with its F8_E8M0
scale companion) still needs cross-tensor logic — left as a follow-up
for the FFN tensor loading layer where weight + scale are loaded together.

Also includes:
- bench_cmd.rs: strip metal-only code path so `cargo build --no-default-features`
  works on Linux (metal crate is `cfg(target_os = "macos")`-only).
- compile_cmd/save.rs: fix `safetensors::serialize(&views, &None)` →
  `serialize(&views, None)` for the safetensors 0.7 signature change.

Verified `cargo check -p larql-cli --no-default-features` clean (1 dead-code
warning unrelated to this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndex

larql vindexes can live in either HF model or dataset repos. Historically
the resolver hardcoded `RepoType::Dataset`, which 404s on model-type repos
(e.g. Divinci-AI/kimi-k2-instruct-vindex, deepseek-v4-flash-vindex,
deepseek-v4-pro-vindex — all model repos containing real vindex artifacts).

Patches `resolve_hf_vindex`, `resolve_hf_vindex_with_progress`, and
`download_hf_weights` to:
1. Build the repo handle for both types via a `make_repo` closure.
2. Try Dataset first (preserves legacy default).
3. On any error fetching `index.json`, retry with Model.
4. Use the successful repo type for all subsequent file fetches in the
   same call.

If both fail, return a combined error message naming both attempts so
the user sees the real failure mode.
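
Roughly, the retry looks like this sketch on the hf-hub crate's `Repo` / `RepoType` API (error handling and the progress variant are elided):

    use hf_hub::{api::sync::Api, Repo, RepoType};

    // Try Dataset first (legacy default), then fall back to Model; the caller
    // reuses whichever repo type successfully served index.json for later fetches.
    fn resolve_index(api: &Api, repo_id: &str) -> anyhow::Result<(RepoType, std::path::PathBuf)> {
        let make_repo = |kind| api.repo(Repo::new(repo_id.to_string(), kind));
        match make_repo(RepoType::Dataset).get("index.json") {
            Ok(path) => Ok((RepoType::Dataset, path)),
            Err(dataset_err) => match make_repo(RepoType::Model).get("index.json") {
                Ok(path) => Ok((RepoType::Model, path)),
                Err(model_err) => anyhow::bail!(
                    "index.json not found as dataset ({dataset_err}) or model ({model_err})"
                ),
            },
        }
    }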

Verified end-to-end: previously-failing
  larql pull Divinci-AI/kimi-k2-instruct-vindex
now succeeds and caches at
  ~/.cache/huggingface/hub/models--Divinci-AI--kimi-k2-instruct-vindex/snapshots/...
(the `models--` prefix confirms the fallback selected the right type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DeepSeek-V4 stores expert weights as one (.weight, .scale) pair per
(expert, projection): `layers.X.ffn.experts.E.w1.weight` (I8 packed FP4) +
`layers.X.ffn.experts.E.w1.scale` (F8_E8M0 scales), ditto for w2/w3. This
is distinct from GPT-OSS's fused `experts.gate_up_proj_blocks` layout that
the existing `dequantize_mxfp4_experts` function handles.

Adds `dequantize_per_expert_mxfp4` in `loading/safetensors.rs`:
- Pattern-matches on tensor names ending `.experts.<digit>.w[123].weight`
  with I8 dtype + a companion `.scale` of dtype F8_E8M0.
- Dequantizes via the existing `quant::mxfp4::dequantize_expert` primitive.
- Returns the set of consumed tensor names so the main loading loop skips
  them (avoiding duplicate decoding of the I8 packed bytes).

Wired into the non-PackedMxfp4 (default) branch of `load_model_dir_filtered`
so it runs alongside the regular weight-loading path. No-op when no V4-style
expert tensors are present.
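
The name match itself is simple; this sketch shows the pattern (the dtype checks on the weight and its `.scale` companion happen in addition, as described above):

    // Matches e.g. "layers.3.ffn.experts.17.w1.weight" and returns the name of
    // its companion scale tensor; None for anything that isn't a per-expert weight.
    fn per_expert_mxfp4_scale_name(name: &str) -> Option<String> {
        let stem = name.strip_suffix(".weight")?;
        let (prefix, proj) = stem.rsplit_once('.')?;
        if !matches!(proj, "w1" | "w2" | "w3") {
            return None;
        }
        let (experts_path, expert_idx) = prefix.rsplit_once('.')?;
        if !experts_path.ends_with(".experts") || !expert_idx.chars().all(|c| c.is_ascii_digit()) {
            return None;
        }
        Some(format!("{stem}.scale"))
    }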

Builds on PRs chrishayuk#35 (F8_E8M0/E4M3/E5M2/I8 dtype dispatch — needed for the
header-parse step to succeed before this code runs) and chrishayuk#36 (Dataset →
Model fallback for `larql pull`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`resolve_hf_vindex` previously fetched all VINDEX_CORE_FILES eagerly,
including `gate_vectors.bin` and `embeddings.bin` — multi-GB binaries
that `larql show` never reads. Show would hang for 10+ minutes pulling
unused tensors before printing 5 lines of metadata.

Splits VINDEX_CORE_FILES into:
- VINDEX_METADATA_FILES — small (json, manifests, tokenizer, down_meta).
  Pulled by `resolve_hf_vindex`. Sub-second on cached repos.
- VINDEX_BIN_FILES     — large tensor files (gate_vectors, embeddings).
  Deferred to call sites that actually need them (run, walk).

`resolve_hf_vindex_with_progress` keeps the prior eager behavior — its
caller has explicitly opted into a progress bar, so they accept the
wait. Implementation uses a `vindex_core_files()` helper returning the
union, preserving identical file coverage for that path.
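
Shape of the split, as a sketch (the exact file names in each list are abbreviated, not the full sets):

    // Small metadata pulled eagerly by resolve_hf_vindex — sub-second on cached repos.
    const VINDEX_METADATA_FILES: &[&str] = &[
        "index.json",
        // ...manifests, tokenizer, down_meta...
    ];

    // Multi-GB tensors, deferred to the call sites that actually need them (run, walk).
    const VINDEX_BIN_FILES: &[&str] = &["gate_vectors.bin", "embeddings.bin"];

    // Union helper so resolve_hf_vindex_with_progress keeps its prior eager coverage.
    fn vindex_core_files() -> Vec<&'static str> {
        VINDEX_METADATA_FILES.iter().chain(VINDEX_BIN_FILES).copied().collect()
    }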

Verified: `larql show Divinci-AI/kimi-k2-instruct-vindex` (a 42 GB
gate_vectors vindex) now returns metadata in ~1s instead of hanging
on the bin download.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… w1/w2/w3)

DeepSeek-V4 (DeepSeek-V4-Flash, V4-Pro, etc.) breaks every key convention
that V3 used. The previous `DeepSeekArch` returns key paths like
`model.layers.X.self_attn.q_proj.weight` and `model.embed_tokens.weight`
— neither exists in V4 safetensors. `larql extract` failed with
"missing tensor: embed_tokens.weight" because the loader looked under
the V3 name.

Adds `crates/larql-models/src/architectures/deepseek_v4.rs` with V4 keys:

- No `model.` prefix anywhere; `key_prefixes_to_strip` = [].
- `embed.weight` (not `embed_tokens.weight`).
- `layers.X.attn.{q,k,v,o}_proj.weight` (not `self_attn.*`).
- `layers.X.attn_norm.weight` / `ffn_norm.weight` (not `input_layernorm` /
  `post_attention_layernorm`).
- `layers.X.ffn.experts.E.{w1,w2,w3}.weight` (not `mlp.experts.E.gate_proj`
  etc.) — w1=gate, w3=up, w2=down.
- `layers.X.ffn.gate.weight` for the router.
- MLA tensors at `attn.{wq_a, wq_b, wkv}.weight` (V4 fuses kv into wkv;
  no separate kv_b).

Wired in `detect.rs`: model_type `"deepseek_v4"` → `DeepSeekV4Arch`,
all other `"deepseek*"` continue to `DeepSeekArch`.

Scope: browse-tier extraction (gate vectors + embeddings + down_meta).
HCA / CSA forward-pass support is out of scope for this PR.

Stacks on chrishayuk#35 / chrishayuk#36 / chrishayuk#37 / chrishayuk#38.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The streaming extract has its own get_tensor_f32 helper (separate from the
larql-models::loading::safetensors::tensor_to_f32 patched in chrishayuk#35). It
silently returned None for any non-{F32,F16,BF16} dtype, so per-expert
MXFP4 gates (I8 packed nibbles + F8_E8M0 scale) were skipped during
gate_vectors.bin extraction — every layer recorded "0.0s" and the output
file was 0 bytes.

Adds an I8+F8_E8M0 MXFP4 detection path at the top of get_tensor_f32:
- If the tensor name ends in `.weight` and dtype is I8
- And a `.scale` companion of dtype F8_E8M0 exists with matching rows
- And the cols ratio implies a sane group size {16, 32, 64, 128}
→ Unpack via crate::format::quant::mxfp4::dequantize_expert and return
  the resulting f32 Array2 directly.

This makes `larql extract --level browse` produce a populated
gate_vectors.bin for DeepSeek-V4 family models. Verified locally:
gate_vectors.bin grows from 0 B to ~1 GB with real f32 data.

Stacks on chrishayuk#35-chrishayuk#39.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>