configs: fix I-Nano/I-Micro NULL output on Qwen3.6 MTP variants (missing nextn.eh_proj override) #9

Open
bfox55 wants to merge 1 commit into localai-org:main from bfox55:fix/inano-mtp-eh-proj-override

bfox55 commented May 20, 2026

Summary

Every *_mtp_nano.txt and *_mtp_micro.txt config in the Qwen3.6 family is missing a per-tensor override for blk.40.nextn.eh_proj, the MTP head's embed→hidden projection (16 MB bf16). Without an explicit override, the tensor falls through to the base type (iq2_xxs for nano, iq1_m for micro). llama-quantize refuses to quantize a tensor to either of these very-low-bit types when it has no imatrix data. llama-imatrix only forward-passes through the trunk and never activates the MTP head, so this tensor has no calibration data. The guard therefore trips, llama-quantize exits before writing the GGUF header, and the output is a NULL-header file of roughly the expected size.

This matches the file signature on the published I-Nano artifacts in mudler/Qwen3.6-...-Distilled-APEX-MTP-GGUF (and the 4.7 sibling repo): all-zero first 32 bytes, file size ~10.88 GB, etags 28b27ae3..., 280e4530..., b945a4f2..., 41f1719f.... If reproducing locally, the relevant llama-quantize error is:

Missing importance matrix for tensor blk.40.nextn.eh_proj.weight in a very low-bit quantization
The result will be garbage, so bailing out
llama_model_quantize: failed to quantize
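
The failure mode above can be modelled roughly as follows. This is a simplified sketch, not llama-quantize's actual logic: it assumes an override applies when its key is a substring of the tensor name, that the caller passes only quantizable weight tensors (norms excluded), and it treats only the two base types named in this PR as guarded. The function names are hypothetical.

```python
# Simplified model of the "very low-bit quantization without imatrix" guard.
# Assumption: only the two base types mentioned in this PR are listed here.
LOW_BIT_TYPES = {"iq2_xxs", "iq1_m"}


def parse_overrides(config_text):
    """Parse 'tensor_pattern=TYPE' lines into a dict, skipping other lines."""
    overrides = {}
    for line in config_text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, qtype = line.split("=", 1)
        overrides[key.strip()] = qtype.strip()
    return overrides


def tensors_at_risk(tensor_names, overrides, base_type, imatrix_names):
    """List tensors that resolve to a guarded type and have no imatrix entry."""
    at_risk = []
    for name in tensor_names:
        # First matching override wins; otherwise fall through to the base type.
        qtype = next((t for key, t in overrides.items() if key in name), base_type)
        if qtype.lower() in LOW_BIT_TYPES and name not in imatrix_names:
            at_risk.append(name)
    return at_risk


if __name__ == "__main__":
    name = "blk.40.nextn.eh_proj.weight"
    # No override and no calibration data: the tensor is flagged.
    print(tensors_at_risk([name], {}, "iq2_xxs", set()))
    # With the override from this PR the tensor no longer resolves to iq2_xxs.
    fixed = parse_overrides("blk.40.nextn.eh_proj=Q4_K\n")
    print(tensors_at_risk([name], fixed, "iq2_xxs", set()))
```

A check along these lines could run before quantization to flag any tensor that would hit the guard, rather than discovering it from a NULL-header output.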

Fix

One line per affected config:

blk.40.nextn.eh_proj=Q4_K

Q4_K matches the edge-tier precision already used for blk.40.attn_* in the same configs.

The three other MTP-specific tensors (blk.40.nextn.enorm, blk.40.nextn.hnorm, blk.40.nextn.shared_head_norm) are F32 norms — they pass through untouched and need no override.
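
For anyone applying the same one-line fix to a local checkout, a minimal sketch of an idempotent patcher follows. It assumes the configs are plain text with one `pattern=TYPE` rule per line and that appending the rule at the end of the file is acceptable. The directory argument and function names are hypothetical, and this is not the script used to produce this PR.

```python
from pathlib import Path

OVERRIDE_KEY = "blk.40.nextn.eh_proj"
OVERRIDE_LINE = f"{OVERRIDE_KEY}=Q4_K"


def ensure_override(text: str) -> str:
    """Return config text with the eh_proj override appended if it is absent."""
    for line in text.splitlines():
        if line.strip().startswith(OVERRIDE_KEY + "="):
            return text  # an override already exists, leave the text unchanged
    if text and not text.endswith("\n"):
        text += "\n"
    return text + OVERRIDE_LINE + "\n"


def patch_configs(config_dir: str) -> list:
    """Patch every *_mtp_nano.txt / *_mtp_micro.txt file directly under config_dir."""
    patched = []
    for pattern in ("*_mtp_nano.txt", "*_mtp_micro.txt"):
        for path in sorted(Path(config_dir).glob(pattern)):
            old = path.read_text()
            new = ensure_override(old)
            if new != old:
                path.write_text(new)
                patched.append(path.name)
    return patched
```

Running `patch_configs` twice on the same directory changes nothing the second time, which keeps the patch safe to re-apply after configs are regenerated.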

Test plan

  • Patched qwen36_opus_distill_mtp_nano.txt produces a valid 10.88 GB GGUF for Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
  • First 32 bytes start with GGUF magic (vs all-NULL on broken upload)
  • Loads and self-speculates under llama-server --draft-mtp (in progress on my machine — will append results)
  • Re-quantize one of the existing broken HF uploads with the patched config and confirm valid output
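
The header check in the test plan can be scripted. The sketch below reads the first 32 bytes of a file and reports whether they start with the GGUF magic (ASCII `GGUF`, hex 47 47 55 46) or are all zero, as on the broken uploads. The function names are hypothetical, and it inspects only the magic, not the rest of the header.

```python
GGUF_MAGIC = b"GGUF"  # bytes 0x47 0x47 0x55 0x46


def classify_header(head: bytes) -> str:
    """Classify the leading bytes of a file as 'gguf', 'null', or 'unknown'."""
    if head[:4] == GGUF_MAGIC:
        return "gguf"
    if len(head) > 0 and head.count(0) == len(head):
        return "null"  # every byte read is zero, matching the broken artifacts
    return "unknown"


def check_file(path: str, n: int = 32) -> str:
    """Read the first n bytes of path and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(n))
```

`check_file` reads only the first 32 bytes, so it is cheap to run even on a ~10.88 GB file.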

Scope

I've only patched the 10 Qwen3.6-family *_mtp_nano.txt / *_mtp_micro.txt configs because that's the architecture I verified end-to-end. Other MTP-bearing families (if any) likely have the same issue but may use different MTP block indices — happy to extend if you point me at the architecture details.

Generator note

scripts/generate_config.sh doesn't currently handle MTP at all — these _mtp_ configs look like they're produced by an out-of-tree post-process. Whatever produces them should also be updated so regenerated configs don't regress to the broken state. Not in scope for this PR.

…tiers

The MTP head's embed→hidden projection tensor (blk.40.nextn.eh_proj,
16 MB bf16) is not covered by any per-tensor override in the existing
*_mtp_nano.txt / *_mtp_micro.txt configs, so it falls through to the
base quant type (iq2_xxs for nano, iq1_m for micro).

llama-imatrix only forward-passes through the trunk and never activates
the MTP head, so nextn.eh_proj has zero calibration data. iq2_xxs and
iq1_m guard against very-low-bit quantization without imatrix:

    Missing importance matrix for tensor blk.40.nextn.eh_proj.weight
    in a very low-bit quantization
    The result will be garbage, so bailing out

llama-quantize then exits before writing the GGUF header, producing
a NULL-header output file of roughly the expected size. This matches
the file signature on the published mudler/Qwen3.6-...-APEX-MTP-GGUF
I-Nano uploads (etags 28b27ae3..., 280e4530..., b945a4f2..., 41f1719f...).

Fix: set blk.40.nextn.eh_proj=Q4_K — matches edge-tier attention
precision already used for blk.40.attn_* in the same config. The other
three MTP-specific tensors (blk.40.nextn.enorm, hnorm, shared_head_norm)
are F32 norms and pass through untouched, so they need no override.

Verified end-to-end on Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled:
patched config produces a clean 10.88 GB GGUF with valid header that
loads and self-speculates correctly under llama-server --draft-mtp.

Note: generate_config.sh does not currently handle MTP block 40 at all
— the *_mtp_*.txt configs appear to be produced by an out-of-tree
post-process. Whatever that post-process is should also be updated so
regenerated configs don't regress.

Scope: only the 10 *_mtp_nano.txt / *_mtp_micro.txt configs in the
Qwen3.6 family are patched here, as that is the architecture I have
verified end-to-end on. Other MTP-bearing model families likely have
the same issue but may use different MTP block indices and should be
patched separately.

bfox55 commented May 21, 2026

Following up — built and benched the patched I-Nano locally, fix confirmed working end-to-end.

Build (with this PR's patched qwen36_opus_distill_mtp_nano.txt):

quant size  = 11134.77 MiB (2.63 BPW)
magic bytes = 47475546 (GGUF)  ← valid header, not NULL
753/753 tensors processed cleanly

Pre-patch (without the blk.40.nextn.eh_proj override), the build shows the same NULL-header failure (00000000 00000000 ...) as the public I-Nano uploads in your HF repo. With the one-line patch added, the build completes cleanly and loads in llama.cpp.

Inference validation (llama.cpp ad27757, RTX 5060 Ti 16GB, -c 131072 --cache-type-k q8_0 --cache-type-v q8_0 --spec-type draft-mtp --spec-draft-n-max 2):

  • Decode: 139 t/s short prompts, 106 t/s at 16K-depth (thinking-on)
  • MTP self-spec acceptance: ~56% on reasoning-mode content, ~90% on structured summary content — head functions correctly across both
  • Probe-spark behavioral suite: 100% (19/19) with thinking on
  • Spark-suite (Brett's 34-test agent bench): 91% overall vs reference Qwen3.6-35B-Opus-Distill UD-IQ2_M at 97% — tied or above on finance/email/tool-select/instruction; 67% on multi-step reasoning is where the distill-of-distill loss shows (your I-Compact / I-Balanced should be cleaner here since they have more bits to spend on the trunk).

Pipeline notes from the run (in case useful for the repo README):

  1. The key for ≤32 GB RAM hosts is an F16 → Q8_0 intermediate before llama-imatrix. Activation stats are PPL-equivalent per your own published table (F16 6.537 vs Q8_0 6.533), so there is no fidelity cost. Imatrix from Q8_0 + auto-fit (no -ngl) completed in 16 min on partial offload: model load was 6 min from HDD, then 32 chunks × ~22s.
  2. The blk.40.nextn.eh_proj bf16→Q4_K override added by this PR is the only difference from a vanilla llama-quantize --tensor-type-file CONFIG --imatrix IMAT INPUT OUTPUT iq2_xxs invocation. Repro is iq2_xxs base with config + imatrix, nothing exotic.
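
For scripting the repro, the invocation in item 2 could be assembled as below. This is a sketch that only builds the argument list in the order quoted above and does not run anything. The imatrix and model file names are placeholders, and the function name is hypothetical.

```python
import shlex


def quantize_argv(config, imatrix, src, dst, base_type="iq2_xxs", binary="llama-quantize"):
    """Build the argv for the invocation quoted above, without running it."""
    return [binary, "--tensor-type-file", config, "--imatrix", imatrix, src, dst, base_type]


if __name__ == "__main__":
    argv = quantize_argv(
        "qwen36_opus_distill_mtp_nano.txt",  # patched config from this PR
        "imatrix.dat",   # placeholder imatrix file name
        "input.gguf",    # placeholder source model
        "output.gguf",   # placeholder destination
    )
    print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would raise on a non-zero exit status, assuming llama-quantize returns one when it bails out, which would surface the failure before a NULL-header file gets uploaded.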

I-Nano is now running as my production agent (DIY-Nano-as-Spark) at 128K context with no crashes through ~25K-token sessions. Net: this fix is correct, the file is shippable, please consider merging when you have a moment. Happy to share the GGUF or the bench harness if helpful.
