feat(extract): MXFP4-aware streaming gate_vectors path #40
Open
mikeumus wants to merge 22 commits into chrishayuk:main from
Conversation
Proposes extending LarQL from weight-analysis into analysis+editing via three new subcommands that implement ROME/MEMIT-family algorithms on top of the existing larql-inference forward pass and capture hooks. Based on 9 chapters of experimentation on Gemma 4 (4B and 26B) documented in Divinci-AI/server notebooks/CHAPTER_15 through CHAPTER_23:

- larql crown: per-edit crown-layer discovery via module ablation
- larql edit: single-fact rank-1 edit with auto-scale calibration
- larql memit: batch fact editing via joint least-squares, grouped by crown

Also defines a patch file format (~55KB per Gemma 4 4B single edit) and a non-destructive larql apply-patch command. Phased 4-step rollout plan.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements Phase A of RFC-0001 (#2): per-layer MLP ablation scan to find the layer whose last-position MLP output is load-bearing for a given (prompt, expected-token) pair.

Changes:
- crates/larql-inference/src/ffn/ablating.rs — new LastPositionAblatingFfn that wraps any FfnBackend and zeroes its output at the last-token row for one target layer. Thin wrapper, no math changes.
- crates/larql-cli/src/commands/extraction/crown_cmd.rs — new `larql crown` subcommand. Tokenises the prompt, runs a baseline forward pass, then iterates layers in [start..=end] running predict_with_ffn against the ablating backend, reports per-layer Δ in expected-token probability and picks the layer whose ablation causes the top prediction to flip with the largest suppression magnitude.

Methodology matches Phase 125c of Divinci-AI/server notebooks/CHAPTER_17_CORONATION.md — on Gemma 4 4B, ablating L27 MLP on "Capital of France? A:" makes the top prediction flip from " Paris" to "France" (the country token).

The command outputs JSON (optional --json) so downstream commands (edit, memit) can consume the crown_layer field.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
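The ablation wrapper is easier to see in miniature. Below is a hedged sketch of the idea (a wrapper that zeroes the inner backend's last-row output at one target layer), using a simplified stand-in trait rather than the actual larql-inference `FfnBackend` API:

```rust
// Simplified stand-in for the FfnBackend trait; the real trait has more context.
trait FfnBackend {
    /// Returns per-token FFN outputs as rows of a [seq_len][hidden] matrix.
    fn forward(&self, layer: usize, hidden_states: &[Vec<f32>]) -> Vec<Vec<f32>>;
}

struct LastPositionAblatingFfn<B: FfnBackend> {
    inner: B,
    target_layer: usize,
}

impl<B: FfnBackend> FfnBackend for LastPositionAblatingFfn<B> {
    fn forward(&self, layer: usize, hidden_states: &[Vec<f32>]) -> Vec<Vec<f32>> {
        let mut out = self.inner.forward(layer, hidden_states);
        if layer == self.target_layer {
            if let Some(last) = out.last_mut() {
                // Zero the last-position MLP output; every other row is untouched.
                last.iter_mut().for_each(|x| *x = 0.0);
            }
        }
        out
    }
}
```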
… RFC-0001) (#7)

Implements Phase B of RFC-0001 (#2): single-fact rank-1 editor with portable patch file format. Builds on Phase A's LastPositionAblatingFfn (#3) and adds the symmetric LastPositionInjectingFfn for scale search.

### New library module: `larql-inference/src/edit.rs`
- `EditPatch` struct (serializable via serde)
- `compute_rank1(k, d, scale, layer, provenance) -> EditPatch`
- `write_patch(path, &patch)` / `read_patch(path) -> EditPatch` with a simple binary format: LQPATCH magic + JSON meta + little-endian f32 vectors for d and k_norm. ~55 KB for Gemma 4 4B.
- `apply_patch(&mut ModelWeights, &EditPatch)`: installs the rank-1 outer product into `down_proj.weight` in place, handling both `[hidden, intermediate]` and `[intermediate, hidden]` layouts.

### New FFN wrapper: `larql-inference/src/ffn/injecting.rs`
- `LastPositionInjectingFfn` — adds a fixed delta vector to the inner backend's last-row output at one target layer. Symmetric to the ablating wrapper from PR #3. Used for auto-scale search.

### New CLI commands
- `larql edit <model> --src "..." --tgt "..." --new-token " Tokyo" --output f2t.lqpatch`
  Runs Phase A crown discovery (or accepts `--layer`), captures k at the crown layer for both prompts, computes d = W_down @ (k_tgt - k_src), linearly searches [0.5, 1, 1.5, 2, 2.5, 3, 4] for the minimum scale that flips the source's top-1 to --new-token, emits the patch.
- `larql apply-patch <model> --patch f2t.lqpatch --prompt "..."`
  Non-destructively installs one or more patches into the loaded weights, optionally runs a test prediction. Supports `--reverse` to subtract a patch (verifies reversibility).

### Supporting change
- Added `InferenceModel::weights_mut()` accessor so apply-patch can mutate the in-memory weight map without reloading.

Methodology validated in Python across Divinci-AI/server notebooks/CHAPTER_20_HONEY.md (Phase 140c: France→Tokyo with 11/11 specificity at 0.9% weight perturbation) and CHAPTER_18_THE_EDIT.md (Phase 130 scale search). The Rust port preserves the same math.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
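The core of the rank-1 edit is the outer-product update itself. A minimal sketch of that math follows (not the real `apply_patch`, which also handles the `[intermediate, hidden]` layout and patch metadata), assuming a row-major `[hidden, intermediate]` down-projection stored as a flat slice:

```rust
// Install scale * d ⊗ k_norm into a down-projection matrix, in place.
fn apply_rank1(
    w_down: &mut [f32],   // len == hidden * intermediate, row-major [hidden, intermediate]
    hidden: usize,
    intermediate: usize,
    d: &[f32],            // len == hidden (output-space direction)
    k_norm: &[f32],       // len == intermediate (unit-norm key direction)
    scale: f32,
) {
    assert_eq!(w_down.len(), hidden * intermediate);
    assert_eq!(d.len(), hidden);
    assert_eq!(k_norm.len(), intermediate);
    for (i, di) in d.iter().enumerate() {
        for (j, kj) in k_norm.iter().enumerate() {
            // Outer product: row i (output dim), column j (key dim).
            w_down[i * intermediate + j] += scale * *di * *kj;
        }
    }
}
```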
Wraps the existing covariance-MEMIT solver (larql_inference::forward::memit::run_memit) with a CLI, an edits.json file format, and automatic crown-layer discovery for each edit. Groups edits by crown layer, invokes the joint least-squares solve, emits one dense `.lqpatch` per affected layer plus a manifest.json. Phase C of RFC-0001 (#2), stacked on Phase B (#4).

### Extended patch file format (still backward compatible)
- Bumped patch version 1 → 2 with a `kind` field (defaults to "rank_one")
- New `kind = "dense"` variant carries a flat row-major ΔW matrix, needed because MEMIT's covariance-projected solve isn't natively a rank-1 outer product. Larger on disk (~72 MB per Gemma 4 4B layer) but semantically exact — no SVD approximation step.
- `write_patch`, `read_patch`, `apply_patch` all dispatch on kind. Phase B rank-1 patches continue to round-trip unchanged.
- New `compute_dense()` helper builds a Dense patch from an Array2<f32>.

### New CLI: `larql memit`
- Reads edits.json (list of {label, src, new_token, layer?} records).
- For each edit: tokenises src, resolves target_token_id, resolves crown layer (explicit or auto-scan).
- Calls `run_memit` with Vec<MemitFact>, receives one `MemitResult` per affected layer.
- Serialises each layer's ΔW as a Dense patch into the output directory, writes a manifest.json enumerating them.
- Prints the apply-patch command to install the batch.

### Usage
cat > edits.json <<EOF
[
  {"label":"france-to-tokyo","src":"Capital of France? A:", "new_token":" Tokyo","layer":27},
  {"label":"germany-to-rome","src":"Capital of Germany? A:", "new_token":" Rome","layer":27}
]
EOF

larql memit /path/to/gemma4 --edits edits.json --output patches/

larql apply-patch /path/to/gemma4 \
  -p patches/memit_L27.lqpatch \
  --prompt "Capital of France? A:"

### Known ceiling
Chapter 22 established that single-layer MEMIT with correlated keys (~60% cosine) lands ~3/5 concurrent targets. For 5+ correlated edits, users can now distribute across multiple crown layers via `layer` overrides in edits.json — MEMIT runs once per layer group.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
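For readers skimming the edits.json contract, here is a hedged sketch of the record shape and the group-by-crown-layer step described above. Field names follow this commit message; the `run_memit` call itself is elided and a serde derive is assumed:

```rust
use std::collections::BTreeMap;

// Record shape mirroring the {label, src, new_token, layer?} entries in edits.json.
#[derive(serde::Deserialize)]
struct EditRecord {
    label: String,
    src: String,
    new_token: String,
    layer: Option<usize>, // explicit crown layer; None means auto-scan
}

// (resolved_layer, record) pairs in, one Vec per layer out:
// MEMIT then runs one joint least-squares solve per group.
fn group_by_layer(edits: Vec<(usize, EditRecord)>) -> BTreeMap<usize, Vec<EditRecord>> {
    let mut groups: BTreeMap<usize, Vec<EditRecord>> = BTreeMap::new();
    for (layer, rec) in edits {
        groups.entry(layer).or_default().push(rec);
    }
    groups
}
```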
… of RFC-0001) (#9)

Exposes the Phase A-C commands as Python callables so the Chapter 15-23 Colab experiments from Divinci-AI/server become one-liner Rust invocations from Jupyter — no CLI shell-outs, no JSON parsing.

### New module: crates/larql-python/src/edit_py.rs
Four #[pyfunction] entry points:
- crown(model, prompt, expect, start_layer=None, end_layer=None, top_k=100)
  Returns {crown_layer, crown_delta_prob, top_after_ablation, scan: [...]}.
- edit(model, src, tgt, new_token, output, layer=None, scales=None, fixed_scale=None, top_k=100, label=None)
  Writes a rank-1 .lqpatch; returns {layer, scale, output, d_norm}.
- apply_patch(model, patches: list[str], prompt=None, top_k=5, reverse=False)
  Applies patches in-memory; optional prompt returns {predictions: [(tok, prob), ...]}.
- memit(model, edits: list[dict], output_dir, ridge=0.01, target_alpha=1.0, top_k=100)
  Batch fact editor wrapping run_memit — writes one dense patch per layer into output_dir + manifest.

### Wiring
- Registered in _native pymodule (src/lib.rs) via m.add_function.
- Re-exported from python/larql/__init__.py under the public `larql` namespace alongside the existing load_vindex/create_session functions.

### Example
import larql

scan = larql.crown("/path/to/gemma4", "Capital of France? A:", " Paris")
print(scan["crown_layer"])  # 27 (on Gemma 4 4B)

larql.edit("/path/to/gemma4", src="Capital of France? A:", tgt="Capital of Japan? A:", new_token=" Tokyo", output="france_to_tokyo.lqpatch")

r = larql.apply_patch("/path/to/gemma4", patches=["france_to_tokyo.lqpatch"], prompt="Capital of France? A:")
print(r["predictions"][0])  # ['Tokyo', 0.97]

This closes the RFC-0001 phased rollout: Python scripts can now drive the mechanistic fact-editing pipeline end-to-end.

Compile-checked with `cargo check --package larql-python`. Runtime import requires `maturin develop` — standard PyO3 workflow, no Python side of the package changed structurally.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
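As a rough illustration of what one of these entry points looks like on the Rust side, here is a hedged sketch of `crown` written against a PyO3 0.20-era GIL-ref style API. The real `edit_py.rs` signatures and return shapes may differ, and the dict values below are placeholders rather than an actual scan:

```rust
use pyo3::prelude::*;
use pyo3::types::PyDict;

#[pyfunction]
#[pyo3(signature = (model, prompt, expect, start_layer=None, end_layer=None, top_k=100))]
fn crown(
    py: Python<'_>,
    model: String,
    prompt: String,
    expect: String,
    start_layer: Option<usize>,
    end_layer: Option<usize>,
    top_k: usize,
) -> PyResult<PyObject> {
    let _ = (model, prompt, expect, start_layer, end_layer, top_k);
    // Real code would run the ablation scan here; these are placeholder values.
    let out = PyDict::new(py);
    out.set_item("crown_layer", 27)?;
    out.set_item("crown_delta_prob", 0.42)?;
    Ok(out.to_object(py))
}

#[pymodule]
fn _native(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(crown, m)?)?;
    Ok(())
}
```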
…#10)

Gemma 4's `use_double_wide_mlp=True` widens gate/up/down_proj to 2× base `intermediate_size` on KV-shared layers. On gemma-4-e2b-it (35 layers, last 20 shared), layers 15–34 have `intermediate=12288`, layers 0–14 have 6144. Crown-scan defaults to `(3n/5)=21` and lands on a double-wide layer, so the rank-1 edit hit `intermediate-size mismatch in captured keys` against the config-wide base size.

Adds `ModelArchitecture::intermediate_size_for_layer(layer) -> usize` (default = `config.intermediate_size`, mirroring `head_dim_for_layer`). `Gemma4Arch` overrides by reusing the precomputed `kv_sources` set — one source of truth for KV-shared-layer membership.

Thread the per-layer lookup through:
- `edit_py.rs`: compute `intermediate` after `chosen_layer` is picked.
- `edit_cmd.rs`: same for the CLI path.
- `memit.rs`: `ffn_dim` now per-layer; `run_memit` already solves per layer, so covariances remain correctly sized across mixed layers.

Parse `use_double_wide_mlp` in `detect.rs`; add to `ModelConfig`.

Tests (in `detect.rs`):
- `test_detect_gemma4_e2b`: asserts 6144 on L0/L14, 12288 on L15/L21/L34 — matches the actual HF tensor shapes verified in the Colab repl.
- `test_gemma4_31b_no_double_wide`: 31B lacks the flag → base everywhere.
- `test_non_gemma4_intermediate_default`: Llama returns base for all layers via the default trait impl.

The bare `weights.intermediate_size` field is left as "base" for display / metadata call sites (demos, patch-print, vindex stats).

Patch file-format unchanged: `compute_rank1` / `compute_dense` already derive `intermediate_size` from the runtime tensor, so new patches for double-wide layers store 12288 correctly without a version bump.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
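A hedged sketch of the per-layer lookup, with simplified stand-in types. Whether `kv_sources` membership maps to the 2× size or to its complement depends on how that set is defined upstream, so treat the condition below as illustrative:

```rust
use std::collections::HashSet;

struct ModelConfig {
    intermediate_size: usize,
    use_double_wide_mlp: bool,
}

trait ModelArchitecture {
    fn config(&self) -> &ModelConfig;
    /// Default: every layer uses the config-wide base size.
    fn intermediate_size_for_layer(&self, _layer: usize) -> usize {
        self.config().intermediate_size
    }
}

struct Gemma4Arch {
    config: ModelConfig,
    /// Precomputed KV-shared source layers, reused as the single source of
    /// truth for which layers carry the 2x-wide MLP (assumption of this sketch).
    kv_sources: HashSet<usize>,
}

impl ModelArchitecture for Gemma4Arch {
    fn config(&self) -> &ModelConfig {
        &self.config
    }
    fn intermediate_size_for_layer(&self, layer: usize) -> usize {
        if self.config.use_double_wide_mlp && self.kv_sources.contains(&layer) {
            self.config.intermediate_size * 2
        } else {
            self.config.intermediate_size
        }
    }
}
```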
…g fixes (#12)

* feat(models): per-layer intermediate_size for Gemma 4 double-wide MLP

Gemma 4's `use_double_wide_mlp=True` widens gate/up/down_proj to 2× base `intermediate_size` on KV-shared layers. On gemma-4-e2b-it (35 layers, last 20 shared), layers 15–34 have `intermediate=12288`, layers 0–14 have 6144. Crown-scan defaults to `(3n/5)=21` and lands on a double-wide layer, so the rank-1 edit hit `intermediate-size mismatch in captured keys` against the config-wide base size.

Adds `ModelArchitecture::intermediate_size_for_layer(layer) -> usize` (default = `config.intermediate_size`, mirroring `head_dim_for_layer`). `Gemma4Arch` overrides by reusing the precomputed `kv_sources` set — one source of truth for KV-shared-layer membership.

Thread the per-layer lookup through:
- `edit_py.rs`: compute `intermediate` after `chosen_layer` is picked.
- `edit_cmd.rs`: same for the CLI path.
- `memit.rs`: `ffn_dim` now per-layer; `run_memit` already solves per layer, so covariances remain correctly sized across mixed layers.

Parse `use_double_wide_mlp` in `detect.rs`; add to `ModelConfig`.

Tests (in `detect.rs`):
- `test_detect_gemma4_e2b`: asserts 6144 on L0/L14, 12288 on L15/L21/L34 — matches the actual HF tensor shapes verified in the Colab repl.
- `test_gemma4_31b_no_double_wide`: 31B lacks the flag → base everywhere.
- `test_non_gemma4_intermediate_default`: Llama returns base for all layers via the default trait impl.

The bare `weights.intermediate_size` field is left as "base" for display / metadata call sites (demos, patch-print, vindex stats).

Patch file-format unchanged: `compute_rank1` / `compute_dense` already derive `intermediate_size` from the runtime tensor, so new patches for double-wide layers store 12288 correctly without a version bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: write-lock starvation on INFER + patch-revert down/up vector leak

Three fixes for larql-server session management:
1. **Bug 1 — write-lock starvation on INFER**: switched sessions_blocking_write → sessions_blocking_read on the INFER path; made last_accessed AtomicU64 so touch() takes &self.
2. **Bug 2 — rebuild_overrides leak**: added base.down_overrides.clear() + base.up_overrides.clear() before replaying patches on remove.
3. **Bug 3 — blocking_read inside async**: pre-acquire base vindex before entering write lock in apply_patch to avoid tokio panic.

All three gates verified: T2 concurrent PASS, T3 global-leak PASS, T4 throughput PASS (mixed p50 0.94× same-session), T5 revert PASS.

* ci: add isolation-harness gates + synthetic tiny-vindex testdata

Three gates run on every push/PR (T2=concurrent, T3=global-leak, T5=revert). Requires HARNESS_REPO_TOKEN secret (fine-grained PAT, Contents:read on Divinci-AI/larql-isolation-harness). testdata/tiny-vindex is a reproducible 5 MB synthetic vindex generated by generate.py (seed=42, 8 layers, hidden=128) — no real model weights needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
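A hedged sketch of the Bug 1 idea: an `AtomicU64` timestamp so that `touch()` only needs `&self` and the INFER path can hold a read lock on the session map. Stand-in types below, not the larql-server session struct:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

struct Session {
    last_accessed: AtomicU64, // unix seconds
}

impl Session {
    fn touch(&self) {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0);
        // Relaxed is enough: this is a heuristic LRU timestamp, not a synchronisation point.
        self.last_accessed.store(now, Ordering::Relaxed);
    }
}
```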
* working on arch b, unified insert
* working on memit with vindex, and templates
* memit style
* working on latest memit
* working on wasm
* working on wasm
* cleaned up vindex and larql
* fix: Linux support — conditional BLAS and Q4 scalar fallback
  - Implement Q4 scalar fallback for non-ARM targets (see the sketch after this list):
    - Move decode_f16() before #if aarch64 (shared by both paths)
    - Replace empty stub functions with correct scalar implementations
    - q4_0_matvec_c and q4_0_vecmat_c now produce correct results on x86_64
  Affects: larql-compute/csrc/q4_dot.c
  Tested on Ubuntu 24 (WSL2, x86_64): cargo build --release and cargo test --workspace pass with 0 failures. macOS path untested — preserves accelerate via cfg(target_os) and requires validation on Apple hardware.
* working on bounded compute script
* refactored lql
* improved refactor
* updated executor
* gemma 4
* working on compute
* improved for gemma 4
* test: cherry-pick GGUF shape + Q4 correctness tests from chrishayuk#20
* updated examples
* working through python parity
* working on q4k tidyup
* improving testing and quantization
* improving testing
* gemma 4 support
* improved clu
* autoregressive generation
* kv cache works
* working on shader pipeline
* working shaders
* working on shaders and graph
* moved to full graph
* working through ffn walk performance
* working version
* modularized shaders
* working on decoupling decode
* working on performance
* more performance improvements
* improving performance
* more performance improvements
* working on performance
* working on distributed grid
* working on grid
* improving docs and moe
* working on moe
* improved publish pull
* binary format
* working binary format and performance
* updated vindex server specs for binary
* improved lm_head
* improved prefill
* improved lm head
* gemma 4 vindex
* working on gemma 4 moe
* working on cleanup for merge
* fixed issue with select
* residual stream
* working on benchmarks

---------

Co-authored-by: chrishayuk <chrishayuk@googlemail.com>
Co-authored-by: Remi <remipetiot@hotmail.com>
Co-authored-by: chrishayuk <chrishayuk@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
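The Q4 scalar-fallback item above is the one substantive technical change in that list. Here is a hedged sketch of what a scalar Q4_0 block dequant looks like, assuming the common ggml-style layout (per 32-value block: an f16 scale followed by 16 packed-nibble bytes, value = (q − 8) · d); larql-compute's q4_dot.c may lay blocks out differently:

```rust
// Dequantize one Q4_0 block: low nibble of byte j is element j, high nibble is element j+16.
fn dequantize_q4_0_block(scale_f16_bits: u16, qs: &[u8; 16]) -> [f32; 32] {
    let d = f16_to_f32(scale_f16_bits);
    let mut out = [0.0f32; 32];
    for (j, &byte) in qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = lo as f32 * d;
        out[j + 16] = hi as f32 * d;
    }
    out
}

// Minimal f16 -> f32 for normal, subnormal, and zero values; enough for a sketch
// (inf/NaN exponents are not special-cased here).
fn f16_to_f32(bits: u16) -> f32 {
    let negative = (bits >> 15) & 1 == 1;
    let exp = ((bits >> 10) & 0x1F) as i32;
    let frac = (bits & 0x3FF) as f32;
    let mag = if exp == 0 {
        frac / 1024.0 * 2f32.powi(-14)
    } else {
        (1.0 + frac / 1024.0) * 2f32.powi(exp - 15)
    };
    if negative { -mag } else { mag }
}
```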
…on guard for rebuild_overrides (#14)

README: Add a fork notice block with badges (Divinci AI, Hugging Face, Vindex Viewer Space, License, Upstream link). Frames this repo as the Divinci-AI fork of chrishayuk/larql carrying RFC-0001 mechanistic fact-editing, Phase-1 unlearning with the revert-leak fix, Gemma 4 per-layer intermediate-size, and the CI isolation harness.

Test (overlay_apply): Add `rebuild_overrides_clears_base_down_and_up_overrides`, a permanent regression guard for the Phase-1 unlearning revert path. It pre-populates `base.down_overrides` + `base.up_overrides` via `set_down_vector` / `set_up_vector` (the COMPILE-WITH-REFINE write path), pushes any patch onto the overlay so `remove_patch(0)` triggers `rebuild_overrides`, then asserts both base maps are empty after revert. If a future refactor drops the two `clear()` calls in `rebuild_overrides` this test turns red — it catches the same regression Gate 3 catches at the integration level, but in 1 ms instead of 5 s.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
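A hedged sketch of the regression-guard pattern this test follows, with stand-in types (the real test uses the overlay_apply/vindex types and the `set_down_vector` / `set_up_vector` write path):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Base {
    down_overrides: HashMap<usize, Vec<f32>>,
    up_overrides: HashMap<usize, Vec<f32>>,
}

#[derive(Default)]
struct Overlay {
    base: Base,
    patches: Vec<Vec<f32>>,
}

impl Overlay {
    fn rebuild_overrides(&mut self) {
        // The two clear() calls this guard protects: without them, earlier
        // base-level writes leak into the post-revert state.
        self.base.down_overrides.clear();
        self.base.up_overrides.clear();
        // ...then replay the remaining patches (elided in this sketch).
    }
    fn remove_patch(&mut self, idx: usize) {
        self.patches.remove(idx);
        self.rebuild_overrides();
    }
}

#[test]
fn rebuild_overrides_clears_base_down_and_up_overrides() {
    let mut ov = Overlay::default();
    ov.base.down_overrides.insert(0, vec![1.0]);
    ov.base.up_overrides.insert(0, vec![1.0]);
    ov.patches.push(vec![0.0]);
    ov.remove_patch(0);
    assert!(ov.base.down_overrides.is_empty());
    assert!(ov.base.up_overrides.is_empty());
}
```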
Cloud Run and Kubernetes inject secrets as env vars, not as CLI args.
When the value lives in `valueFrom: secretKeyRef`, Cloud Run does NOT
substitute it into container `args` via `$(VAR)` expansion — that only
works for inline `value:` envs. As a result there's no ergonomic way to
pass a secret to `--api-key` today, and deployments end up unauthenticated
at the app layer even when a bearer token is provisioned.
Adding `env = "LARQL_API_KEY"` to the clap arg lets `valueFrom: secretKeyRef`
flow directly in:
env:
  - name: LARQL_API_KEY
    valueFrom:
      secretKeyRef:
        name: larql-s2s-token-staging
        key: latest
The CLI arg still wins when both are set (standard clap precedence).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
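A hedged sketch of the clap change described above, using clap 4's derive API with the "derive" and "env" features enabled; the struct and field names are illustrative, not the real larql-server CLI definition:

```rust
use clap::Parser;

#[derive(Parser)]
struct ServeArgs {
    /// Bearer token for app-layer auth; the CLI flag wins over the env var when both are set.
    #[arg(long, env = "LARQL_API_KEY")]
    api_key: Option<String>,
}

fn main() {
    let args = ServeArgs::parse();
    // With LARQL_API_KEY injected via secretKeyRef, api_key is Some(...) even without --api-key.
    println!("api key set: {}", args.api_key.is_some());
}
```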
Bump safetensors crate 0.5 → 0.7 (which adds the F8_E8M0 enum variant required by Open Compute MX format scales) and add bit-pattern → f32 decoders for the four new dtypes in larql-models/src/loading/safetensors.rs.

This unblocks loading any safetensors file that uses MXFP4 expert weights (I8 packed nibbles + F8_E8M0 per-32-element scales — used by deepseek-ai/DeepSeek-V4-* and unsloth/DeepSeek-V4-* among others) or plain FP8 attention weights (F8_E4M3 / F8_E5M2 — GPT-OSS, etc.).

Currently `tensor_to_f32` decodes each tensor in isolation. Proper MXFP4 unpacking (where the I8 packed-nibble weight is paired with its F8_E8M0 scale companion) still needs cross-tensor logic — left as a follow-up for the FFN tensor loading layer where weight + scale are loaded together.

Also includes:
- bench_cmd.rs: strip metal-only code path so `cargo build --no-default-features` works on Linux (metal crate is `cfg(target_os = "macos")`-only).
- compile_cmd/save.rs: fix `safetensors::serialize(&views, &None)` → `serialize(&views, None)` for the safetensors 0.7 signature change.

Verified `cargo check -p larql-cli --no-default-features` clean (1 dead-code warning unrelated to this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
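For context on what an F8_E8M0 decoder amounts to, a hedged sketch: E8M0 is an exponent-only byte (bias 127, with 0xFF reserved for NaN), so decoding is just a power of two. The real decoders in loading/safetensors.rs may handle edge cases differently:

```rust
// Decode one OCP MX E8M0 scale byte to f32: value = 2^(byte - 127), NaN at 0xFF.
fn f8_e8m0_to_f32(byte: u8) -> f32 {
    if byte == 0xFF {
        f32::NAN
    } else {
        // Exponents below -126 land in the f32 subnormal range, which powi handles.
        2f32.powi(byte as i32 - 127)
    }
}
```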
…ndex

larql vindexes can live in either HF model or dataset repos. Historically the resolver hardcoded `RepoType::Dataset`, which 404s on model-type repos (e.g. Divinci-AI/kimi-k2-instruct-vindex, deepseek-v4-flash-vindex, deepseek-v4-pro-vindex — all model repos containing real vindex artifacts).

Patches `resolve_hf_vindex`, `resolve_hf_vindex_with_progress`, and `download_hf_weights` to:
1. Build the repo handle for both types via a `make_repo` closure.
2. Try Dataset first (preserves legacy default).
3. On any error fetching `index.json`, retry with Model.
4. Use the successful repo type for all subsequent file fetches in the same call.

If both fail, return a combined error message naming both attempts so the user sees the real failure mode.

Verified end-to-end: previously-failing `larql pull Divinci-AI/kimi-k2-instruct-vindex` now succeeds and caches at ~/.cache/huggingface/hub/models--Divinci-AI--kimi-k2-instruct-vindex/snapshots/... (the `models--` prefix confirms the fallback selected the right type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
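The fallback logic is small enough to sketch generically. Below, `fetch` stands in for "resolve index.json from a repo of the given type"; the real code goes through the `make_repo` closure and the HF hub client instead:

```rust
#[derive(Clone, Copy)]
enum RepoType {
    Dataset,
    Model,
}

// Try Dataset first, retry as Model on any error, and report both failures if neither works.
// The successful repo type is returned so later file fetches in the same call can reuse it.
fn resolve_with_fallback<T, E: std::fmt::Display>(
    mut fetch: impl FnMut(RepoType) -> Result<T, E>,
) -> Result<(RepoType, T), String> {
    match fetch(RepoType::Dataset) {
        Ok(v) => Ok((RepoType::Dataset, v)),
        Err(dataset_err) => match fetch(RepoType::Model) {
            Ok(v) => Ok((RepoType::Model, v)),
            Err(model_err) => Err(format!(
                "dataset repo failed ({dataset_err}); model repo failed ({model_err})"
            )),
        },
    }
}
```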
DeepSeek-V4 stores expert weights as one (.weight, .scale) pair per (expert, projection): `layers.X.ffn.experts.E.w1.weight` (I8 packed FP4) + `layers.X.ffn.experts.E.w1.scale` (F8_E8M0 scales), ditto for w2/w3. This is distinct from GPT-OSS's fused `experts.gate_up_proj_blocks` layout that the existing `dequantize_mxfp4_experts` function handles.

Adds `dequantize_per_expert_mxfp4` in `loading/safetensors.rs`:
- Pattern-matches on tensor names ending `.experts.<digit>.w[123].weight` with I8 dtype + a companion `.scale` of dtype F8_E8M0.
- Dequantizes via the existing `quant::mxfp4::dequantize_expert` primitive.
- Returns the set of consumed tensor names so the main loading loop skips them (avoiding duplicate decoding of the I8 packed bytes).

Wired into the non-PackedMxfp4 (default) branch of `load_model_dir_filtered` so it runs alongside the regular weight-loading path. No-op when no V4-style expert tensors are present.

Builds on PRs chrishayuk#35 (F8_E8M0/E4M3/E5M2/I8 dtype dispatch — needed for the header-parse step to succeed before this code runs) and chrishayuk#36 (Dataset → Model fallback for `larql pull`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
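A hedged sketch of just the tensor-name pattern check from the first bullet, written with plain string handling (no regex); the real matcher may differ in detail:

```rust
// Does the name end in ".experts.<digits>.w1|w2|w3.weight"?
fn is_per_expert_mxfp4_weight(name: &str) -> bool {
    let Some(rest) = name.strip_suffix(".weight") else { return false };
    let mut parts = rest.rsplit('.');
    let proj = parts.next().unwrap_or("");
    let expert = parts.next().unwrap_or("");
    let marker = parts.next().unwrap_or("");
    matches!(proj, "w1" | "w2" | "w3")
        && !expert.is_empty()
        && expert.chars().all(|c| c.is_ascii_digit())
        && marker == "experts"
}

// e.g. is_per_expert_mxfp4_weight("layers.3.ffn.experts.17.w2.weight") == true
```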
`resolve_hf_vindex` previously fetched all VINDEX_CORE_FILES eagerly, including `gate_vectors.bin` and `embeddings.bin` — multi-GB binaries that `larql show` never reads. Show would hang for 10+ minutes pulling unused tensors before printing 5 lines of metadata.

Splits VINDEX_CORE_FILES into:
- VINDEX_METADATA_FILES — small (json, manifests, tokenizer, down_meta). Pulled by `resolve_hf_vindex`. Sub-second on cached repos.
- VINDEX_BIN_FILES — large tensor files (gate_vectors, embeddings). Deferred to call sites that actually need them (run, walk).

`resolve_hf_vindex_with_progress` keeps the prior eager behavior — its caller has explicitly opted into a progress bar, so they accept the wait. Implementation uses a `vindex_core_files()` helper returning the union, preserving identical file coverage for that path.

Verified: `larql show Divinci-AI/kimi-k2-instruct-vindex` (a 42 GB gate_vectors vindex) now returns metadata in ~1s instead of hanging on the bin download.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
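A hedged sketch of the split plus the union helper; the concrete file names here are partly assumptions inferred from this commit message ("json, manifests, tokenizer, down_meta"), not the exact constants in the resolver:

```rust
// Small metadata files: pulled eagerly by resolve_hf_vindex (sub-second when cached).
const VINDEX_METADATA_FILES: &[&str] = &[
    "index.json",
    "manifest.json",
    "tokenizer.json",
    "down_meta.bin",
];

// Large tensor files: deferred to call sites that actually need them (run, walk).
const VINDEX_BIN_FILES: &[&str] = &["gate_vectors.bin", "embeddings.bin"];

/// Union of both lists: preserves the previous eager coverage for the
/// progress-bar path, which has explicitly opted into the full download.
fn vindex_core_files() -> Vec<&'static str> {
    VINDEX_METADATA_FILES
        .iter()
        .chain(VINDEX_BIN_FILES)
        .copied()
        .collect()
}
```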
… w1/w2/w3)
DeepSeek-V4 (DeepSeek-V4-Flash, V4-Pro, etc.) breaks every key convention
that V3 used. The previous `DeepSeekArch` returns key paths like
`model.layers.X.self_attn.q_proj.weight` and `model.embed_tokens.weight`
— neither exists in V4 safetensors. `larql extract` failed with
"missing tensor: embed_tokens.weight" because the loader looked under
the V3 name.
Adds `crates/larql-models/src/architectures/deepseek_v4.rs` with V4 keys:
- No `model.` prefix anywhere; `key_prefixes_to_strip` = [].
- `embed.weight` (not `embed_tokens.weight`).
- `layers.X.attn.{q,k,v,o}_proj.weight` (not `self_attn.*`).
- `layers.X.attn_norm.weight` / `ffn_norm.weight` (not `input_layernorm` /
`post_attention_layernorm`).
- `layers.X.ffn.experts.E.{w1,w2,w3}.weight` (not `mlp.experts.E.gate_proj`
etc.) — w1=gate, w3=up, w2=down.
- `layers.X.ffn.gate.weight` for the router.
- MLA tensors at `attn.{wq_a, wq_b, wkv}.weight` (V4 fuses kv into wkv;
no separate kv_b).
Wired in `detect.rs`: model_type `"deepseek_v4"` → `DeepSeekV4Arch`,
all other `"deepseek*"` continue to `DeepSeekArch`.
Scope: browse-tier extraction (gate vectors + embeddings + down_meta).
HCA / CSA forward-pass support is out of scope for this PR.
Stacks on chrishayuk#35 / chrishayuk#36 / chrishayuk#37 / chrishayuk#38.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
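To make the key scheme concrete, here is a hedged sketch of two lookup helpers that produce the V4-style paths listed above; the real `DeepSeekV4Arch` builds these through the architecture trait rather than free functions:

```rust
// No `model.` prefix; attention projections live under `attn`, not `self_attn`.
fn v4_attn_proj_key(layer: usize, proj: char) -> String {
    format!("layers.{layer}.attn.{proj}_proj.weight")
}

// which: "w1" (gate), "w3" (up), "w2" (down)
fn v4_expert_key(layer: usize, expert: usize, which: &str) -> String {
    format!("layers.{layer}.ffn.experts.{expert}.{which}.weight")
}

// e.g. v4_attn_proj_key(0, 'q') == "layers.0.attn.q_proj.weight"
//      v4_expert_key(3, 17, "w2") == "layers.3.ffn.experts.17.w2.weight"
```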
The streaming extract has its own get_tensor_f32 helper (separate from the larql-models::loading::safetensors::tensor_to_f32 patched in chrishayuk#35). It silently returned None for any non-{F32,F16,BF16} dtype, so per-expert MXFP4 gates (I8 packed nibbles + F8_E8M0 scale) were skipped during gate_vectors.bin extraction — every layer recorded "0.0s" and the output file was 0 bytes.

Adds an I8+F8_E8M0 MXFP4 detection path at the top of get_tensor_f32:
- If the tensor name ends in `.weight` and dtype is I8
- And a `.scale` companion of dtype F8_E8M0 exists with matching rows
- And the cols ratio implies a sane group size {16, 32, 64, 128}
→ Unpack via crate::format::quant::mxfp4::dequantize_expert and return the resulting f32 Array2 directly.

This makes `larql extract --level browse` produce a populated gate_vectors.bin for DeepSeek-V4 family models. Verified locally: gate_vectors.bin grows from 0 B to ~1 GB with real f32 data.

Stacks on chrishayuk#35-chrishayuk#39.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
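A hedged sketch of the "sane group size" check from the third bullet: given packed I8 columns and scale columns, infer the MXFP4 group size and accept only the listed values. This assumes each I8 byte packs two 4-bit values, which is how the per-expert weights are described here:

```rust
// Returns Some(group_size) only when the shapes are consistent with MXFP4
// grouping of 16, 32, 64, or 128 logical elements per scale.
fn infer_mxfp4_group_size(packed_cols: usize, scale_cols: usize) -> Option<usize> {
    if scale_cols == 0 {
        return None;
    }
    let logical_cols = packed_cols * 2; // two 4-bit values per packed byte
    if logical_cols % scale_cols != 0 {
        return None;
    }
    let group = logical_cols / scale_cols;
    matches!(group, 16 | 32 | 64 | 128).then_some(group)
}
```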
The streaming extract has its own `get_tensor_f32` helper (separate from the loader `tensor_to_f32` in #35). It silently returned None for any non-{F32,F16,BF16} dtype, so per-expert MXFP4 gates (I8 packed nibbles + F8_E8M0 scales) were skipped during gate_vectors.bin extraction — every layer recorded "0.0s" and the output file was 0 bytes.

Adds an I8+F8_E8M0 MXFP4 detection path at the top of `get_tensor_f32`:
- If the tensor name ends in `.weight` and dtype is I8
- And a `.scale` companion of dtype F8_E8M0 exists with matching rows
- And the cols ratio implies a sane group size {16, 32, 64, 128}
→ Unpack via `crate::format::quant::mxfp4::dequantize_expert` and return the resulting f32 `Array2` directly.

Result: `larql extract --level browse unsloth/DeepSeek-V4-Flash` now produces a populated `gate_vectors.bin` (~1 GB) instead of the previous 0 B file.

Stacks on #35 / #36 / #37 / #38 / #39. With this PR the V4 extract pipeline is end-to-end functional.
🤖 Generated with Claude Code