
feat(extract): MXFP4-aware streaming gate_vectors path #40

Open

mikeumus wants to merge 22 commits into chrishayuk:main from Divinci-AI:feat/streaming-extract-mxfp4

Conversation

@mikeumus
Contributor

The streaming extract has its own get_tensor_f32 helper (separate from the loader tensor_to_f32 in #35). It silently returned None for any non-{F32,F16,BF16} dtype, so per-expert MXFP4 gates (I8 packed nibbles + F8_E8M0 scales) were skipped during gate_vectors.bin extraction — every layer recorded "0.0s" and the output file was 0 bytes.

Adds an I8+F8_E8M0 MXFP4 detection path at the top of get_tensor_f32:

  • If the tensor name ends in .weight and dtype is I8
  • And a .scale companion of dtype F8_E8M0 exists with matching rows
  • And the cols ratio implies a sane group size {16, 32, 64, 128}
    → Unpack via crate::format::quant::mxfp4::dequantize_expert and return the resulting f32 Array2 directly.

Result: larql extract --level browse unsloth/DeepSeek-V4-Flash now produces a populated gate_vectors.bin (~1 GB) instead of the previous 0 B file.
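
Roughly, the added fast path looks like the sketch below — `TensorView`, `Dtype`, and the exact `dequantize_expert` argument order are stand-ins for the real streaming-extract types, not verbatim code:

    // Hypothetical sketch of the I8 + F8_E8M0 detection added to get_tensor_f32.
    // Assumes a lookup of `TensorView { dtype, shape, bytes }` keyed by tensor name.
    fn try_mxfp4_weight(
        name: &str,
        tensors: &std::collections::HashMap<String, TensorView>,
    ) -> Option<ndarray::Array2<f32>> {
        let weight = tensors.get(name)?;
        if !name.ends_with(".weight") || weight.dtype != Dtype::I8 {
            return None;
        }
        let scale = tensors.get(&name.replace(".weight", ".scale"))?;
        if scale.dtype != Dtype::F8_E8M0 || scale.shape[0] != weight.shape[0] {
            return None;
        }
        // Two FP4 nibbles per packed I8 byte; the column ratio implies the group size.
        let logical_cols = weight.shape[1] * 2;
        let group = logical_cols / scale.shape[1];
        if ![16, 32, 64, 128].contains(&group) {
            return None;
        }
        Some(crate::format::quant::mxfp4::dequantize_expert(
            &weight.bytes,
            &scale.bytes,
            weight.shape[0],
            logical_cols,
        ))
    }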

Stacks on #35 / #36 / #37 / #38 / #39. With this PR the V4 extract pipeline is end-to-end functional.

🤖 Generated with Claude Code

mikeumus and others added 22 commits April 17, 2026 16:59
Proposes extending LarQL from weight-analysis into analysis+editing via
three new subcommands that implement ROME/MEMIT-family algorithms on top
of the existing larql-inference forward pass and capture hooks.

Based on 9 chapters of experimentation on Gemma 4 (4B and 26B) documented
in Divinci-AI/server notebooks/CHAPTER_15 through CHAPTER_23:

- larql crown: per-edit crown-layer discovery via module ablation
- larql edit: single-fact rank-1 edit with auto-scale calibration
- larql memit: batch fact editing via joint least-squares, grouped by crown

Also defines a patch file format (~55KB per Gemma 4 4B single edit) and
a non-destructive larql apply-patch command. Phased 4-step rollout plan.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements Phase A of RFC-0001 (#2): per-layer MLP ablation scan to find
the layer whose last-position MLP output is load-bearing for a given
(prompt, expected-token) pair.

Changes:
- crates/larql-inference/src/ffn/ablating.rs — new LastPositionAblatingFfn
  that wraps any FfnBackend and zeroes its output at the last-token row for
  one target layer. Thin wrapper, no math changes.
- crates/larql-cli/src/commands/extraction/crown_cmd.rs — new `larql crown`
  subcommand. Tokenises the prompt, runs a baseline forward pass, then
  iterates layers in [start..=end] running predict_with_ffn against the
  ablating backend, reports per-layer Δ in expected-token probability and
  picks the layer whose ablation causes the top prediction to flip with the
  largest suppression magnitude.

Methodology matches Phase 125c of Divinci-AI/server
notebooks/CHAPTER_17_CORONATION.md — on Gemma 4 4B, ablating L27 MLP on
"Capital of France? A:" makes the top prediction flip from " Paris" to
"France" (the country token). The command outputs JSON (optional --json)
so downstream commands (edit, memit) can consume the crown_layer field.
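
The wrapper itself is roughly the sketch below; the real `FfnBackend` trait surface (method name, argument types) may differ:

    // Hypothetical shape of LastPositionAblatingFfn: delegate to the inner
    // backend, then zero the last-token row of its output at one target layer.
    struct LastPositionAblatingFfn<B> {
        inner: B,
        target_layer: usize,
    }

    impl<B: FfnBackend> FfnBackend for LastPositionAblatingFfn<B> {
        fn forward(&self, layer: usize, hidden: &ndarray::Array2<f32>) -> ndarray::Array2<f32> {
            let mut out = self.inner.forward(layer, hidden);
            if layer == self.target_layer {
                let last = out.nrows() - 1;
                out.row_mut(last).fill(0.0); // ablate the last-position MLP output
            }
            out
        }
    }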

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… RFC-0001) (#7)

Implements Phase B of RFC-0001 (#2): single-fact rank-1 editor with
portable patch file format. Builds on Phase A's LastPositionAblatingFfn
(#3) and adds the symmetric LastPositionInjectingFfn for scale search.

### New library module: `larql-inference/src/edit.rs`
- `EditPatch` struct (serializable via serde)
- `compute_rank1(k, d, scale, layer, provenance) -> EditPatch`
- `write_patch(path, &patch)` / `read_patch(path) -> EditPatch` with a
  simple binary format: LQPATCH magic + JSON meta + little-endian f32
  vectors for d and k_norm. ~55 KB for Gemma 4 4B.
- `apply_patch(&mut ModelWeights, &EditPatch)`: installs the rank-1
  outer product into `down_proj.weight` in place, handling both
  `[hidden, intermediate]` and `[intermediate, hidden]` layouts (sketched below).
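
A sketch of the rank-1 install, assuming `d` has length `hidden` and `k_norm` has length `intermediate` (the real `apply_patch` also carries scale and layer metadata from the patch header):

    // W' = W + scale * (d ⊗ k_norm), oriented to match whichever layout the
    // stored down_proj tensor uses.
    fn install_rank1(w: &mut ndarray::Array2<f32>, d: &[f32], k_norm: &[f32], scale: f32) {
        let (rows, cols) = w.dim();
        if rows == d.len() && cols == k_norm.len() {
            // [hidden, intermediate] layout
            for (i, di) in d.iter().enumerate() {
                for (j, kj) in k_norm.iter().enumerate() {
                    w[[i, j]] += scale * di * kj;
                }
            }
        } else if rows == k_norm.len() && cols == d.len() {
            // [intermediate, hidden] layout — transposed outer product
            for (i, ki) in k_norm.iter().enumerate() {
                for (j, dj) in d.iter().enumerate() {
                    w[[i, j]] += scale * ki * dj;
                }
            }
        }
    }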

### New FFN wrapper: `larql-inference/src/ffn/injecting.rs`
- `LastPositionInjectingFfn` — adds a fixed delta vector to the inner
  backend's last-row output at one target layer. Symmetric to the
  ablating wrapper from PR #3. Used for auto-scale search.

### New CLI commands
- `larql edit <model> --src "..." --tgt "..." --new-token " Tokyo" --output f2t.lqpatch`
  Runs Phase A crown discovery (or accepts `--layer`), captures k at the
  crown layer for both prompts, computes d = W_down @ (k_tgt - k_src),
  linearly searches [0.5, 1, 1.5, 2, 2.5, 3, 4] for the minimum scale
  that flips the source's top-1 to --new-token, emits the patch.
- `larql apply-patch <model> --patch f2t.lqpatch --prompt "..."`
  Non-destructively installs one or more patches into the loaded
  weights, optionally runs a test prediction. Supports `--reverse`
  to subtract a patch (verifies reversibility).

### Supporting change
- Added `InferenceModel::weights_mut()` accessor so apply-patch can
  mutate the in-memory weight map without reloading.

Methodology validated in Python across Divinci-AI/server
notebooks/CHAPTER_20_HONEY.md (Phase 140c: France→Tokyo with 11/11
specificity at 0.9% weight perturbation) and CHAPTER_18_THE_EDIT.md
(Phase 130 scale search). The Rust port preserves the same math.
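
The search itself reduces to a few lines; this sketch uses a `flips_top1` closure as a stand-in for the injected forward pass plus top-1 check:

    // d = W_down @ (k_tgt - k_src); return the smallest scale whose injected
    // delta flips the source prompt's top-1 prediction to --new-token.
    fn find_min_scale(
        w_down: &ndarray::Array2<f32>,
        k_src: &ndarray::Array1<f32>,
        k_tgt: &ndarray::Array1<f32>,
        flips_top1: impl Fn(&ndarray::Array1<f32>) -> bool,
    ) -> Option<f32> {
        let d = w_down.dot(&(k_tgt - k_src)); // edit direction in hidden space
        [0.5f32, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
            .into_iter()
            .find(|&scale| flips_top1(&d.mapv(|x| x * scale)))
    }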

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the existing covariance-MEMIT solver (larql_inference::forward::memit::
run_memit) with a CLI, an edits.json file format, and automatic crown-layer
discovery for each edit. Groups edits by crown layer, invokes the joint
least-squares solve, emits one dense `.lqpatch` per affected layer plus a
manifest.json. Phase C of RFC-0001 (#2), stacked on Phase B (#4).

### Extended patch file format (still backward compatible)
- Bumped patch version 1 → 2 with a `kind` field (defaults to "rank_one")
- New `kind = "dense"` variant carries a flat row-major ΔW matrix, needed
  because MEMIT's covariance-projected solve isn't natively a rank-1 outer
  product. Larger on disk (~72 MB per Gemma 4 4B layer) but semantically
  exact — no SVD approximation step.
- `write_patch`, `read_patch`, `apply_patch` all dispatch on kind. Phase B
  rank-1 patches continue to round-trip unchanged.
- New `compute_dense()` helper builds a Dense patch from an Array2<f32>.

### New CLI: `larql memit`
- Reads edits.json (list of {label, src, new_token, layer?} records).
- For each edit: tokenises src, resolves target_token_id, resolves crown
  layer (explicit or auto-scan).
- Calls `run_memit` with Vec<MemitFact>, receives one `MemitResult` per
  affected layer.
- Serialises each layer's ΔW as a Dense patch into the output directory,
  writes a manifest.json enumerating them.
- Prints the apply-patch command to install the batch.

### Usage

    cat > edits.json <<EOF
    [
      {"label":"france-to-tokyo","src":"Capital of France? A:",
       "new_token":" Tokyo","layer":27},
      {"label":"germany-to-rome","src":"Capital of Germany? A:",
       "new_token":" Rome","layer":27}
    ]
    EOF

    larql memit /path/to/gemma4 --edits edits.json --output patches/
    larql apply-patch /path/to/gemma4 \
        -p patches/memit_L27.lqpatch \
        --prompt "Capital of France? A:"

### Known ceiling
Chapter 22 established that single-layer MEMIT with correlated keys (~60%
cosine) lands ~3/5 concurrent targets. For 5+ correlated edits, users can
now distribute across multiple crown layers via `layer` overrides in
edits.json — MEMIT runs once per layer group.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… of RFC-0001) (#9)

Exposes the Phase A-C commands as Python callables so the Chapter 15-23
Colab experiments from Divinci-AI/server become one-liner Rust invocations
from Jupyter — no CLI shell-outs, no JSON parsing.

### New module: crates/larql-python/src/edit_py.rs

Four #[pyfunction] entry points:

- crown(model, prompt, expect, start_layer=None, end_layer=None, top_k=100)
  Returns {crown_layer, crown_delta_prob, top_after_ablation, scan: [...]}.

- edit(model, src, tgt, new_token, output, layer=None, scales=None,
       fixed_scale=None, top_k=100, label=None)
  Writes a rank-1 .lqpatch; returns {layer, scale, output, d_norm}.

- apply_patch(model, patches: list[str], prompt=None, top_k=5, reverse=False)
  Applies patches in-memory; optional prompt returns {predictions: [(tok, prob), ...]}.

- memit(model, edits: list[dict], output_dir, ridge=0.01, target_alpha=1.0,
        top_k=100)
  Batch fact editor wrapping run_memit — writes one dense patch per layer
  into output_dir + manifest.

### Wiring

- Registered in _native pymodule (src/lib.rs) via m.add_function.
- Re-exported from python/larql/__init__.py under the public `larql`
  namespace alongside the existing load_vindex/create_session functions.

### Example

    import larql
    scan = larql.crown("/path/to/gemma4",
                       "Capital of France? A:", " Paris")
    print(scan["crown_layer"])                    # 27 (on Gemma 4 4B)

    larql.edit("/path/to/gemma4",
               src="Capital of France? A:",
               tgt="Capital of Japan? A:",
               new_token=" Tokyo",
               output="france_to_tokyo.lqpatch")

    r = larql.apply_patch("/path/to/gemma4",
                          patches=["france_to_tokyo.lqpatch"],
                          prompt="Capital of France? A:")
    print(r["predictions"][0])                    # ['Tokyo', 0.97]

This closes the RFC-0001 phased rollout: Python scripts can now drive the
mechanistic fact-editing pipeline end-to-end.

Compile-checked with `cargo check --package larql-python`. Runtime import
requires `maturin develop` — standard PyO3 workflow, no Python side of
the package changed structurally.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#10)

Gemma 4's `use_double_wide_mlp=True` widens gate/up/down_proj to 2× base
`intermediate_size` on KV-shared layers. On gemma-4-e2b-it (35 layers,
last 20 shared), layers 15–34 have `intermediate=12288`, layers 0–14
have 6144. Crown-scan defaults to `(3n/5)=21` and lands on a double-wide
layer, so the rank-1 edit hit `intermediate-size mismatch in captured
keys` against the config-wide base size.

Adds `ModelArchitecture::intermediate_size_for_layer(layer) -> usize`
(default = `config.intermediate_size`, mirroring `head_dim_for_layer`).
`Gemma4Arch` overrides by reusing the precomputed `kv_sources` set —
one source of truth for KV-shared-layer membership.
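
Sketch of the trait shape (the `config()` accessor is an approximation of how the arch impls reach ModelConfig):

    pub trait ModelArchitecture {
        fn config(&self) -> &ModelConfig;

        // Default mirrors head_dim_for_layer: one size for every layer.
        fn intermediate_size_for_layer(&self, _layer: usize) -> usize {
            self.config().intermediate_size
        }
    }

    impl ModelArchitecture for Gemma4Arch {
        fn config(&self) -> &ModelConfig {
            &self.config
        }

        fn intermediate_size_for_layer(&self, layer: usize) -> usize {
            // kv_sources is the precomputed KV-shared-layer set — one source of truth.
            if self.config.use_double_wide_mlp && self.kv_sources.contains(&layer) {
                self.config.intermediate_size * 2
            } else {
                self.config.intermediate_size
            }
        }
    }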

Thread the per-layer lookup through:
- `edit_py.rs`: compute `intermediate` after `chosen_layer` is picked.
- `edit_cmd.rs`: same for the CLI path.
- `memit.rs`: `ffn_dim` now per-layer; `run_memit` already solves per
  layer, so covariances remain correctly sized across mixed layers.

Parse `use_double_wide_mlp` in `detect.rs`; add to `ModelConfig`.

Tests (in `detect.rs`):
- `test_detect_gemma4_e2b`: asserts 6144 on L0/L14, 12288 on L15/L21/L34
  — matches the actual HF tensor shapes verified in the Colab repl.
- `test_gemma4_31b_no_double_wide`: 31B lacks the flag → base everywhere.
- `test_non_gemma4_intermediate_default`: Llama returns base for all
  layers via the default trait impl.

The bare `weights.intermediate_size` field is left as "base" for
display / metadata call sites (demos, patch-print, vindex stats).
Patch file-format unchanged: `compute_rank1` / `compute_dense` already
derive `intermediate_size` from the runtime tensor, so new patches for
double-wide layers store 12288 correctly without a version bump.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g fixes (#12)

* feat(models): per-layer intermediate_size for Gemma 4 double-wide MLP

Gemma 4's `use_double_wide_mlp=True` widens gate/up/down_proj to 2× base
`intermediate_size` on KV-shared layers. On gemma-4-e2b-it (35 layers,
last 20 shared), layers 15–34 have `intermediate=12288`, layers 0–14
have 6144. Crown-scan defaults to `(3n/5)=21` and lands on a double-wide
layer, so the rank-1 edit hit `intermediate-size mismatch in captured
keys` against the config-wide base size.

Adds `ModelArchitecture::intermediate_size_for_layer(layer) -> usize`
(default = `config.intermediate_size`, mirroring `head_dim_for_layer`).
`Gemma4Arch` overrides by reusing the precomputed `kv_sources` set —
one source of truth for KV-shared-layer membership.

Thread the per-layer lookup through:
- `edit_py.rs`: compute `intermediate` after `chosen_layer` is picked.
- `edit_cmd.rs`: same for the CLI path.
- `memit.rs`: `ffn_dim` now per-layer; `run_memit` already solves per
  layer, so covariances remain correctly sized across mixed layers.

Parse `use_double_wide_mlp` in `detect.rs`; add to `ModelConfig`.

Tests (in `detect.rs`):
- `test_detect_gemma4_e2b`: asserts 6144 on L0/L14, 12288 on L15/L21/L34
  — matches the actual HF tensor shapes verified in the Colab repl.
- `test_gemma4_31b_no_double_wide`: 31B lacks the flag → base everywhere.
- `test_non_gemma4_intermediate_default`: Llama returns base for all
  layers via the default trait impl.

The bare `weights.intermediate_size` field is left as "base" for
display / metadata call sites (demos, patch-print, vindex stats).
Patch file-format unchanged: `compute_rank1` / `compute_dense` already
derive `intermediate_size` from the runtime tensor, so new patches for
double-wide layers store 12288 correctly without a version bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: write-lock starvation on INFER + patch-revert down/up vector leak

Three fixes for larql-server session management:

1. **Bug 1 — write-lock starvation on INFER**: switched sessions_blocking_write → sessions_blocking_read on the INFER path; made last_accessed AtomicU64 so touch() takes &self.
2. **Bug 2 — rebuild_overrides leak**: added base.down_overrides.clear() + base.up_overrides.clear() before replaying patches on remove.
3. **Bug 3 — blocking_read inside async**: pre-acquire base vindex before entering write lock in apply_patch to avoid tokio panic.

All three gates verified: T2 concurrent PASS, T3 global-leak PASS, T4 throughput PASS (mixed p50 0.94× same-session), T5 revert PASS.
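
Minimal sketch of the Bug 1 change — the session type is reduced to the relevant field:

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::{SystemTime, UNIX_EPOCH};

    struct Session {
        // AtomicU64 lets touch() take &self, so the INFER path can take the
        // sessions read lock instead of the write lock on every request.
        last_accessed: AtomicU64, // seconds since the Unix epoch
        // ...vindex handle, overlay state, etc...
    }

    impl Session {
        fn touch(&self) {
            let now = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap_or_default()
                .as_secs();
            self.last_accessed.store(now, Ordering::Relaxed);
        }
    }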

* ci: add isolation-harness gates + synthetic tiny-vindex testdata

Three gates run on every push/PR (T2=concurrent, T3=global-leak, T5=revert).
Requires HARNESS_REPO_TOKEN secret (fine-grained PAT, Contents:read on
Divinci-AI/larql-isolation-harness).

testdata/tiny-vindex is a reproducible 5 MB synthetic vindex generated by
generate.py (seed=42, 8 layers, hidden=128) — no real model weights needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* working on arch b, unified insert

* working on memit with vindex, and templates

* memit style

* working on latest memit

* working on wasm

* working on wasm

* cleaned up vindex and larql

* fix: Linux support — conditional BLAS and Q4 scalar fallback

- Implement Q4 scalar fallback for non-ARM targets:
  - Move decode_f16() before #if aarch64 (shared by both paths)
  - Replace empty stub functions with correct scalar implementations
  - q4_0_matvec_c and q4_0_vecmat_c now produce correct results on x86_64
  Affects: larql-compute/csrc/q4_dot.c

Tested on Ubuntu 24 (WSL2, x86_64): cargo build --release and
cargo test --workspace pass with 0 failures.
macOS path untested — preserves accelerate via cfg(target_os)
and requires validation on Apple hardware.

* working on bounded compute script

* refactored lql

* improved refactor

* updated executor

* gemma 4

* working on compute

* improved for gemma 4

* test: cherry-pick GGUF shape + Q4 correctness tests from chrishayuk#20

* updated examples

* working through python parity

* working on q4k tidyup

* improving testing and quantization

* improving testing

* gemma 4 support

* improved cli

* autoregressive generation

* kv cache works

* working on shader pipeline

* working shaders

* working on shaders and graph

* moved to full graph

* working through ffn walk performance

* working version

* modularized shaders

* working on decoupling decode

* working on performance

* more performance improvements

* improving performance

* more performance improvements

* working on performance

* working on distributed grid

* working on grid

* improving docs and moe

* working on moe

* improved publish pull

* binary format

* working binary format and performance

* updated vindex server specs for binary

* improved lm_head

* improved prefill

* improved lm head

* gemma 4 vindex

* working on gemma 4 moe

* working on cleanup for merge

* fixed issue with select

* residual stream

* working on benchmarks

---------

Co-authored-by: chrishayuk <chrishayuk@googlemail.com>
Co-authored-by: Remi <remipetiot@hotmail.com>
Co-authored-by: chrishayuk <chrishayuk@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on guard for rebuild_overrides (#14)

README:
  Add a fork notice block with badges (Divinci AI, Hugging Face, Vindex
  Viewer Space, License, Upstream link). Frames this repo as the
  Divinci-AI fork of chrishayuk/larql carrying RFC-0001 mechanistic
  fact-editing, Phase-1 unlearning with the revert-leak fix, Gemma 4
  per-layer intermediate-size, and the CI isolation harness.

Test (overlay_apply):
  Add `rebuild_overrides_clears_base_down_and_up_overrides` —
  permanent regression guard for the Phase-1 unlearning revert path.
  Pre-populates `base.down_overrides` + `base.up_overrides` via
  `set_down_vector` / `set_up_vector` (the COMPILE-WITH-REFINE write
  path), pushes any patch onto the overlay so `remove_patch(0)` triggers
  `rebuild_overrides`, then asserts both base maps are empty after
  revert. If a future refactor drops the two `clear()` calls in
  `rebuild_overrides`, this test turns red — catching the same regression
  Gate 3 catches at the integration level, but in 1 ms instead of 5 s.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cloud Run and Kubernetes inject secrets as env vars, not as CLI args.
When the value lives in `valueFrom: secretKeyRef`, Cloud Run does NOT
substitute it into container `args` via `$(VAR)` expansion — that only
works for inline `value:` envs. As a result there's no ergonomic way to
pass a secret to `--api-key` today, and deployments end up unauthenticated
at the app layer even when a bearer token is provisioned.

Adding `env = "LARQL_API_KEY"` to the clap arg lets `valueFrom: secretKeyRef`
flow directly in:

    env:
      - name: LARQL_API_KEY
        valueFrom:
          secretKeyRef:
            name: larql-s2s-token-staging
            key: latest

The CLI arg still wins when both are set (standard clap precedence).
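
For reference, the change is a single attribute on the clap arg (derive API, with clap's `env` feature enabled; struct and field names here are illustrative):

    use clap::Parser;

    #[derive(Parser)]
    struct ServeArgs {
        /// Bearer token for app-layer auth. Falls back to the LARQL_API_KEY
        /// env var when the flag is absent; the CLI value wins when both are set.
        #[arg(long, env = "LARQL_API_KEY")]
        api_key: Option<String>,
    }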

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bump safetensors crate 0.5 → 0.7 (which adds the F8_E8M0 enum variant
required by Open Compute MX format scales) and add bit-pattern → f32
decoders for the four new dtypes in larql-models/src/loading/safetensors.rs.

This unblocks loading any safetensors file that uses MXFP4 expert weights
(I8 packed nibbles + F8_E8M0 per-32-element scales — used by
deepseek-ai/DeepSeek-V4-* and unsloth/DeepSeek-V4-* among others) or
plain FP8 attention weights (F8_E4M3 / F8_E5M2 — GPT-OSS, etc.).
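
For context on the scale dtype: F8_E8M0 in the OCP MX spec is an exponent-only byte, so its decode is essentially a power of two (a sketch, not the exact function added to safetensors.rs):

    // F8_E8M0: 8-bit biased exponent, no sign or mantissa bits.
    // value = 2^(bits - 127); the all-ones pattern is reserved for NaN.
    fn f8_e8m0_to_f32(bits: u8) -> f32 {
        if bits == 0xFF {
            f32::NAN
        } else {
            2f32.powi(bits as i32 - 127)
        }
    }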

Currently `tensor_to_f32` decodes each tensor in isolation. Proper MXFP4
unpacking (where the I8 packed-nibble weight is paired with its F8_E8M0
scale companion) still needs cross-tensor logic — left as a follow-up
for the FFN tensor loading layer where weight + scale are loaded together.

Also includes:
- bench_cmd.rs: strip metal-only code path so `cargo build --no-default-features`
  works on Linux (metal crate is `cfg(target_os = "macos")`-only).
- compile_cmd/save.rs: fix `safetensors::serialize(&views, &None)` →
  `serialize(&views, None)` for the safetensors 0.7 signature change.

Verified `cargo check -p larql-cli --no-default-features` clean (1 dead-code
warning unrelated to this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndex

larql vindexes can live in either HF model or dataset repos. Historically
the resolver hardcoded `RepoType::Dataset`, which 404s on model-type repos
(e.g. Divinci-AI/kimi-k2-instruct-vindex, deepseek-v4-flash-vindex,
deepseek-v4-pro-vindex — all model repos containing real vindex artifacts).

Patches `resolve_hf_vindex`, `resolve_hf_vindex_with_progress`, and
`download_hf_weights` to:
1. Build the repo handle for both types via a `make_repo` closure.
2. Try Dataset first (preserves legacy default).
3. On any error fetching `index.json`, retry with Model.
4. Use the successful repo type for all subsequent file fetches in the
   same call.

If both fail, return a combined error message naming both attempts so
the user sees the real failure mode.
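
Roughly, the retry looks like this sketch on the hf-hub crate's `Repo` / `RepoType` API (error handling and the progress variant are elided):

    use hf_hub::{api::sync::Api, Repo, RepoType};

    // Try Dataset first (legacy default), then fall back to Model; the caller
    // reuses whichever repo type successfully served index.json for later fetches.
    fn resolve_index(api: &Api, repo_id: &str) -> anyhow::Result<(RepoType, std::path::PathBuf)> {
        let make_repo = |kind| api.repo(Repo::new(repo_id.to_string(), kind));
        match make_repo(RepoType::Dataset).get("index.json") {
            Ok(path) => Ok((RepoType::Dataset, path)),
            Err(dataset_err) => match make_repo(RepoType::Model).get("index.json") {
                Ok(path) => Ok((RepoType::Model, path)),
                Err(model_err) => anyhow::bail!(
                    "index.json not found as dataset ({dataset_err}) or model ({model_err})"
                ),
            },
        }
    }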

Verified end-to-end: previously-failing
  larql pull Divinci-AI/kimi-k2-instruct-vindex
now succeeds and caches at
  ~/.cache/huggingface/hub/models--Divinci-AI--kimi-k2-instruct-vindex/snapshots/...
(the `models--` prefix confirms the fallback selected the right type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DeepSeek-V4 stores expert weights as one (.weight, .scale) pair per
(expert, projection): `layers.X.ffn.experts.E.w1.weight` (I8 packed FP4) +
`layers.X.ffn.experts.E.w1.scale` (F8_E8M0 scales), ditto for w2/w3. This
is distinct from GPT-OSS's fused `experts.gate_up_proj_blocks` layout that
the existing `dequantize_mxfp4_experts` function handles.

Adds `dequantize_per_expert_mxfp4` in `loading/safetensors.rs`:
- Pattern-matches on tensor names ending `.experts.<digit>.w[123].weight`
  with I8 dtype + a companion `.scale` of dtype F8_E8M0.
- Dequantizes via the existing `quant::mxfp4::dequantize_expert` primitive.
- Returns the set of consumed tensor names so the main loading loop skips
  them (avoiding duplicate decoding of the I8 packed bytes).

Wired into the non-PackedMxfp4 (default) branch of `load_model_dir_filtered`
so it runs alongside the regular weight-loading path. No-op when no V4-style
expert tensors are present.
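
The name match itself is simple; this sketch shows the pattern (the dtype checks on the weight and its `.scale` companion happen in addition, as described above):

    // Matches e.g. "layers.3.ffn.experts.17.w1.weight" and returns the name of
    // its companion scale tensor; None for anything that isn't a per-expert weight.
    fn per_expert_mxfp4_scale_name(name: &str) -> Option<String> {
        let stem = name.strip_suffix(".weight")?;
        let (prefix, proj) = stem.rsplit_once('.')?;
        if !matches!(proj, "w1" | "w2" | "w3") {
            return None;
        }
        let (experts_path, expert_idx) = prefix.rsplit_once('.')?;
        if !experts_path.ends_with(".experts") || !expert_idx.chars().all(|c| c.is_ascii_digit()) {
            return None;
        }
        Some(format!("{stem}.scale"))
    }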

Builds on PRs chrishayuk#35 (F8_E8M0/E4M3/E5M2/I8 dtype dispatch — needed for the
header-parse step to succeed before this code runs) and chrishayuk#36 (Dataset →
Model fallback for `larql pull`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`resolve_hf_vindex` previously fetched all VINDEX_CORE_FILES eagerly,
including `gate_vectors.bin` and `embeddings.bin` — multi-GB binaries
that `larql show` never reads. Show would hang for 10+ minutes pulling
unused tensors before printing 5 lines of metadata.

Splits VINDEX_CORE_FILES into:
- VINDEX_METADATA_FILES — small (json, manifests, tokenizer, down_meta).
  Pulled by `resolve_hf_vindex`. Sub-second on cached repos.
- VINDEX_BIN_FILES     — large tensor files (gate_vectors, embeddings).
  Deferred to call sites that actually need them (run, walk).

`resolve_hf_vindex_with_progress` keeps the prior eager behavior — its
caller has explicitly opted into a progress bar, so they accept the
wait. Implementation uses a `vindex_core_files()` helper returning the
union, preserving identical file coverage for that path.
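
Shape of the split, as a sketch (the exact file names in each list are abbreviated, not the full sets):

    // Small metadata pulled eagerly by resolve_hf_vindex — sub-second on cached repos.
    const VINDEX_METADATA_FILES: &[&str] = &[
        "index.json",
        // ...manifests, tokenizer, down_meta...
    ];

    // Multi-GB tensors, deferred to the call sites that actually need them (run, walk).
    const VINDEX_BIN_FILES: &[&str] = &["gate_vectors.bin", "embeddings.bin"];

    // Union helper so resolve_hf_vindex_with_progress keeps its prior eager coverage.
    fn vindex_core_files() -> Vec<&'static str> {
        VINDEX_METADATA_FILES.iter().chain(VINDEX_BIN_FILES).copied().collect()
    }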

Verified: `larql show Divinci-AI/kimi-k2-instruct-vindex` (a 42 GB
gate_vectors vindex) now returns metadata in ~1s instead of hanging
on the bin download.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… w1/w2/w3)

DeepSeek-V4 (DeepSeek-V4-Flash, V4-Pro, etc.) breaks every key convention
that V3 used. The previous `DeepSeekArch` returns key paths like
`model.layers.X.self_attn.q_proj.weight` and `model.embed_tokens.weight`
— neither exists in V4 safetensors. `larql extract` failed with
"missing tensor: embed_tokens.weight" because the loader looked under
the V3 name.

Adds `crates/larql-models/src/architectures/deepseek_v4.rs` with V4 keys:

- No `model.` prefix anywhere; `key_prefixes_to_strip` = [].
- `embed.weight` (not `embed_tokens.weight`).
- `layers.X.attn.{q,k,v,o}_proj.weight` (not `self_attn.*`).
- `layers.X.attn_norm.weight` / `ffn_norm.weight` (not `input_layernorm` /
  `post_attention_layernorm`).
- `layers.X.ffn.experts.E.{w1,w2,w3}.weight` (not `mlp.experts.E.gate_proj`
  etc.) — w1=gate, w3=up, w2=down.
- `layers.X.ffn.gate.weight` for the router.
- MLA tensors at `attn.{wq_a, wq_b, wkv}.weight` (V4 fuses kv into wkv;
  no separate kv_b).

Wired in `detect.rs`: model_type `"deepseek_v4"` → `DeepSeekV4Arch`,
all other `"deepseek*"` continue to `DeepSeekArch`.

Scope: browse-tier extraction (gate vectors + embeddings + down_meta).
HCA / CSA forward-pass support is out of scope for this PR.

Stacks on chrishayuk#35 / chrishayuk#36 / chrishayuk#37 / chrishayuk#38.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The streaming extract has its own get_tensor_f32 helper (separate from the
larql-models::loading::safetensors::tensor_to_f32 patched in chrishayuk#35). It
silently returned None for any non-{F32,F16,BF16} dtype, so per-expert
MXFP4 gates (I8 packed nibbles + F8_E8M0 scale) were skipped during
gate_vectors.bin extraction — every layer recorded "0.0s" and the output
file was 0 bytes.

Adds an I8+F8_E8M0 MXFP4 detection path at the top of get_tensor_f32:
- If the tensor name ends in `.weight` and dtype is I8
- And a `.scale` companion of dtype F8_E8M0 exists with matching rows
- And the cols ratio implies a sane group size {16, 32, 64, 128}
→ Unpack via crate::format::quant::mxfp4::dequantize_expert and return
  the resulting f32 Array2 directly.

This makes `larql extract --level browse` produce a populated
gate_vectors.bin for DeepSeek-V4 family models. Verified locally:
gate_vectors.bin grows from 0 B to ~1 GB with real f32 data.

Stacks on chrishayuk#35-chrishayuk#39.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>