configs: fix I-Nano/I-Micro NULL output on Qwen3.6 MTP variants (missing nextn.eh_proj override) by bfox55 · Pull Request #9 · localai-org/apex-quant

bfox55 · 2026-05-20T18:10:08Z

Summary

Every *_mtp_nano.txt and *_mtp_micro.txt config in the Qwen3.6 family is missing a per-tensor override for blk.40.nextn.eh_proj, the MTP head's embed→hidden projection (16 MB bf16). Without an explicit override, the tensor falls through to the base type (iq2_xxs for nano, iq1_m for micro). llama-quantize refuses to quantize a tensor to either of those very-low-bit types when it has no imatrix data. llama-imatrix only forward-passes through the trunk and never activates the MTP head, so this tensor has no calibration data: the guard trips, llama-quantize exits before writing the GGUF header, and the output is a NULL-header file of roughly the expected size.

This matches the file signature on the published I-Nano artifacts in mudler/Qwen3.6-...-Distilled-APEX-MTP-GGUF (and the 4.7 sibling repo): all-zero first 32 bytes, file size ~10.88 GB, etags 28b27ae3..., 280e4530..., b945a4f2..., 41f1719f.... If reproducing locally, the relevant llama-quantize error is:

Missing importance matrix for tensor blk.40.nextn.eh_proj.weight in a very low-bit quantization
The result will be garbage, so bailing out
llama_model_quantize: failed to quantize

Fix

One line per affected config:

blk.40.nextn.eh_proj=Q4_K

Q4_K matches the edge-tier precision already used for blk.40.attn_* in the same configs.

The three other MTP-specific tensors (blk.40.nextn.enorm, blk.40.nextn.hnorm, blk.40.nextn.shared_head_norm) are F32 norms — they pass through untouched and need no override.
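A minimal sketch of applying that one-line override across the affected configs, assuming each config is a plain-text file with one `tensor=TYPE` entry per line as shown above. The helper names (`ensure_override`, `patch_configs`) and the choice to append the entry at the end of the file are illustrative assumptions, not necessarily how this PR's commit places it.

```python
from pathlib import Path

OVERRIDE_KEY = "blk.40.nextn.eh_proj"
OVERRIDE_LINE = f"{OVERRIDE_KEY}=Q4_K"


def ensure_override(text: str) -> str:
    """Return config text that contains an eh_proj override.

    Assumes one `tensor=TYPE` entry per line. If an entry for the key
    already exists it is left untouched, so the function is idempotent.
    """
    for line in text.splitlines():
        if line.strip().split("=", 1)[0] == OVERRIDE_KEY:
            return text  # already overridden, nothing to do
    # Make sure the appended entry starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    return text + OVERRIDE_LINE + "\n"


def patch_configs(config_dir: str) -> list[str]:
    """Patch every *_mtp_nano.txt / *_mtp_micro.txt in config_dir.

    The directory layout is hypothetical; returns the names of files changed.
    """
    changed = []
    for pattern in ("*_mtp_nano.txt", "*_mtp_micro.txt"):
        for path in sorted(Path(config_dir).glob(pattern)):
            old = path.read_text()
            new = ensure_override(old)
            if new != old:
                path.write_text(new)
                changed.append(path.name)
    return changed
```

Running `patch_configs` a second time over the same directory should report no changes, which makes it safe to run after regenerating configs.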

Test plan

The patched qwen36_opus_distill_mtp_nano.txt produces a valid 10.88 GB GGUF for Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
First 32 bytes start with GGUF magic (vs all-NULL on broken upload)
Loads and self-speculates under llama-server --draft-mtp (in progress on my machine — will append results)
Re-quantize one of the existing broken HF uploads with the patched config and confirm valid output
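The header check in the test plan can be sketched as below. This assumes only what is stated above: a valid file begins with the ASCII magic `GGUF`, and a broken upload has all-zero leading bytes. The function names and the three result labels are hypothetical.

```python
GGUF_MAGIC = b"GGUF"  # hex 47 47 55 46


def classify_header(head: bytes) -> str:
    """Classify the leading bytes of a file as 'gguf', 'null', or 'unknown'."""
    if head[:4] == GGUF_MAGIC:
        return "gguf"
    if head and not any(head):
        return "null"  # every byte read is zero: the broken-upload signature
    return "unknown"


def check_file(path: str, n: int = 32) -> str:
    """Read the first n bytes of path and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(n))
```

Because only the first 32 bytes are read, the check is cheap enough to run on every build artifact before upload.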

Scope

I've only patched the 10 Qwen3.6-family *_mtp_nano.txt / *_mtp_micro.txt configs because that's the architecture I verified end-to-end. Other MTP-bearing families (if any) likely have the same issue but may use different MTP block indices — happy to extend if you point me at the architecture details.

Generator note

scripts/generate_config.sh doesn't currently handle MTP at all — these _mtp_ configs look like they're produced by an out-of-tree post-process. Whatever produces them should also be updated so regenerated configs don't regress to the broken state. Not in scope for this PR.

…tiers

The MTP head's embed→hidden projection tensor (blk.40.nextn.eh_proj, 16 MB bf16) is not covered by any per-tensor override in the existing *_mtp_nano.txt / *_mtp_micro.txt configs, so it falls through to the base quant type (iq2_xxs for nano, iq1_m for micro). llama-imatrix only forward-passes through the trunk and never activates the MTP head, so nextn.eh_proj has zero calibration data. iq2_xxs and iq1_m guard against very-low-bit quantization without imatrix:

Missing importance matrix for tensor blk.40.nextn.eh_proj.weight in a very low-bit quantization
The result will be garbage, so bailing out

llama-quantize then exits before writing the GGUF header, producing a NULL-header output file of roughly the expected size. This matches the file signature on the published mudler/Qwen3.6-...-APEX-MTP-GGUF I-Nano uploads (etags 28b27ae3..., 280e4530..., b945a4f2..., 41f1719f...).

Fix: set blk.40.nextn.eh_proj=Q4_K, which matches the edge-tier attention precision already used for blk.40.attn_* in the same config. The other three MTP-specific tensors (blk.40.nextn.enorm, hnorm, shared_head_norm) are F32 norms and pass through untouched, so they need no override.

Verified end-to-end on Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled: patched config produces a clean 10.88 GB GGUF with valid header that loads and self-speculates correctly under llama-server --draft-mtp.

Note: generate_config.sh does not currently handle MTP block 40 at all; the *_mtp_*.txt configs appear to be produced by an out-of-tree post-process. Whatever that post-process is should also be updated so regenerated configs don't regress.

Scope: only the 10 *_mtp_nano.txt / *_mtp_micro.txt configs in the Qwen3.6 family are patched here, as that is the architecture I have verified end-to-end. Other MTP-bearing model families likely have the same issue but may use different MTP block indices and should be patched separately.

bfox55 · 2026-05-21T02:39:39Z

Following up — built and benched the patched I-Nano locally, fix confirmed working end-to-end.

Build (with this PR's patched qwen36_opus_distill_mtp_nano.txt):

quant size  = 11134.77 MiB (2.63 BPW)
magic bytes = 47475546 (GGUF)  ← valid header, not NULL
753/753 tensors processed cleanly
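As a rough sanity check on those figures, the reported size and bits-per-weight can be turned back into an implied weight count. This sketch assumes that BPW means total quantized bits divided by total weights and that MiB is 1024² bytes; the function name is hypothetical.

```python
MIB = 1024 ** 2  # bytes per MiB


def implied_params(size_mib: float, bpw: float) -> float:
    """Weight count implied by a quantized size (MiB) and bits per weight."""
    return size_mib * MIB * 8 / bpw


n = implied_params(11134.77, 2.63)
print(f"{n / 1e9:.1f}B weights")
```

The result lands at roughly 35.5 billion weights, which is in line with the 35B in the model name.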

Pre-patch (without the blk.40.nextn.eh_proj override): same NULL-header 00000000 00000000 ... failure as the public I-Nano uploads in your HF repo. With the one-line patch added, the build completes cleanly and loads in llama.cpp.

Inference validation (llama.cpp ad27757, RTX 5060 Ti 16GB, -c 131072 --cache-type-k q8_0 --cache-type-v q8_0 --spec-type draft-mtp --spec-draft-n-max 2):

Decode: 139 t/s short prompts, 106 t/s at 16K-depth (thinking-on)
MTP self-spec acceptance: ~56% on reasoning-mode content, ~90% on structured summary content — head functions correctly across both
Probe-spark behavioral suite: 100% (19/19) with thinking on
Spark-suite (Brett's 34-test agent bench): 91% overall vs reference Qwen3.6-35B-Opus-Distill UD-IQ2_M at 97% — tied or above on finance/email/tool-select/instruction; 67% on multi-step reasoning is where the distill-of-distill loss shows (your I-Compact / I-Balanced should be cleaner here since they have more bits to spend on the trunk).

Pipeline notes from the run (in case useful for the repo README):

F16 → Q8_0 intermediate before llama-imatrix is the key for ≤32 GB RAM hosts; activation stats are PPL-equivalent per your own published table (F16 6.537 vs Q8_0 6.533) so no fidelity cost. Imatrix from Q8_0 + auto-fit (no -ngl) completed in 16 min on partial offload — model load was 6 min from HDD, then 32 chunks × ~22s.
The blk.40.nextn.eh_proj bf16→Q4_K override added by this PR is the only difference from a vanilla llama-quantize --tensor-type-file CONFIG --imatrix IMAT INPUT OUTPUT iq2_xxs invocation. Repro is iq2_xxs base with config + imatrix, nothing exotic.
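For scripting the repro, the invocation quoted above can be assembled as an argument list. This is a sketch only: the argument order is copied from that invocation, while the function names, the default binary name on PATH, and the file names in the usage note are placeholders.

```python
import subprocess


def build_quantize_cmd(config: str, imatrix: str, src: str, dst: str,
                       base_type: str = "iq2_xxs",
                       binary: str = "llama-quantize") -> list[str]:
    """Build the argv for the quantize step described in the notes above."""
    return [binary, "--tensor-type-file", config,
            "--imatrix", imatrix, src, dst, base_type]


def run_quantize(cmd: list[str]) -> int:
    """Run the command and return its exit code without raising on failure."""
    return subprocess.run(cmd, check=False).returncode
```

For example, `build_quantize_cmd("qwen36_opus_distill_mtp_nano.txt", "imatrix.dat", "model-bf16.gguf", "model-inano.gguf")` yields the iq2_xxs-base command; the last three file names are hypothetical. Because a failed run can still leave behind an output file of plausible size, it is worth checking both the exit code and the file header afterwards.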

I-Nano is now running as my production agent (DIY-Nano-as-Spark) at 128K context with no crashes through ~25K-token sessions. Net: this fix is correct, the file is shippable, please consider merging when you have a moment. Happy to share the GGUF or the bench harness if helpful.

bfox55 wants to merge 1 commit into localai-org:main from bfox55:fix/inano-mtp-eh-proj-override
