configs: fix I-Nano/I-Micro NULL output on Qwen3.6 MTP variants (missing nextn.eh_proj override) #9
…tiers
The MTP head's embed→hidden projection tensor (blk.40.nextn.eh_proj,
16 MB bf16) is not covered by any per-tensor override in the existing
*_mtp_nano.txt / *_mtp_micro.txt configs, so it falls through to the
base quant type (iq2_xxs for nano, iq1_m for micro).
llama-imatrix only forward-passes through the trunk and never activates
the MTP head, so nextn.eh_proj has zero calibration data. iq2_xxs and
iq1_m guard against very-low-bit quantization without imatrix:
Missing importance matrix for tensor blk.40.nextn.eh_proj.weight
in a very low-bit quantization
The result will be garbage, so bailing out
llama-quantize then exits before writing the GGUF header, producing
a NULL-header output file of roughly the expected size. This matches
the file signature on the published mudler/Qwen3.6-...-APEX-MTP-GGUF
I-Nano uploads (etags 28b27ae3..., 280e4530..., b945a4f2..., 41f1719f...).
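The header symptom described above can be checked without loading the model. The following is a minimal sketch, not part of this PR: it assumes only that a valid GGUF file begins with the four ASCII bytes "GGUF" and that the broken uploads begin with zeros. The function name and the 32-byte probe size are illustrative choices.

```python
GGUF_MAGIC = b"GGUF"  # leading four bytes of a valid GGUF file


def classify_gguf_header(path, probe_bytes=32):
    """Return 'ok', 'null', or 'unknown' based on the file's leading bytes.

    'ok'      -> file starts with the GGUF magic
    'null'    -> the first probe_bytes bytes are all zero (the broken signature)
    'unknown' -> anything else, including an empty file
    """
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    if head[:4] == GGUF_MAGIC:
        return "ok"
    if head and not any(head):
        return "null"
    return "unknown"
```

Calling classify_gguf_header on a downloaded I-Nano file would be expected to return 'null' for the broken uploads and 'ok' for a file produced with the patched config.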
Fix: set blk.40.nextn.eh_proj=Q4_K — matches edge-tier attention
precision already used for blk.40.attn_* in the same config. The other
three MTP-specific tensors (blk.40.nextn.enorm, hnorm, shared_head_norm)
are F32 norms and pass through untouched, so they need no override.
Verified end-to-end on Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled:
patched config produces a clean 10.88 GB GGUF with valid header that
loads and self-speculates correctly under llama-server --draft-mtp.
Note: generate_config.sh does not currently handle MTP block 40 at all
— the *_mtp_*.txt configs appear to be produced by an out-of-tree
post-process. Whatever that post-process is should also be updated so
regenerated configs don't regress.
Scope: only the 10 *_mtp_nano.txt / *_mtp_micro.txt configs in the
Qwen3.6 family are patched here, as that is the architecture I have
verified end-to-end on. Other MTP-bearing model families likely have
the same issue but may use different MTP block indices and should be
patched separately.
Following up: I built and benched the patched I-Nano locally, and the fix is confirmed working end-to-end.
Build (with this PR's patched …
Pre-patch (with the missing …
Inference validation (llama.cpp …
Pipeline notes from the run (in case useful for the repo README):
I-Nano is now running as my production agent (DIY-Nano-as-Spark) at 128K context with no crashes through ~25K-token sessions. Net: this fix is correct and the file is shippable; please consider merging when you have a moment. Happy to share the GGUF or the bench harness if helpful.
Summary
Every *_mtp_nano.txt and *_mtp_micro.txt config in the Qwen3.6 family is missing a per-tensor override for blk.40.nextn.eh_proj, the MTP head's embed→hidden projection (16 MB bf16). Without an explicit override, the tensor falls through to the base type (iq2_xxs for nano, iq1_m for micro), both of which guard against very-low-bit quantization with no imatrix data. llama-imatrix only forward-passes through the trunk and never activates the MTP head, so this tensor has no calibration data: the guard trips, llama-quantize exits before writing the GGUF header, and the output is a NULL-header file of roughly the expected size.
This matches the file signature on the published I-Nano artifacts in mudler/Qwen3.6-...-Distilled-APEX-MTP-GGUF (and the 4.7 sibling repo): all-zero first 32 bytes, file size ~10.88 GB, etags 28b27ae3..., 280e4530..., b945a4f2..., 41f1719f.... If reproducing locally, the relevant llama-quantize error is:
Missing importance matrix for tensor blk.40.nextn.eh_proj.weight
in a very low-bit quantization
The result will be garbage, so bailing out
Fix
One line per affected config:
blk.40.nextn.eh_proj=Q4_K
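For anyone applying the same override to a local checkout, the edit could be scripted roughly as below. This is a sketch under assumptions, not the method used for this PR: it assumes each config holds one name=type override per line, that appending the line at the end of the file is acceptable, and that the configs live in a directory you pass in. The function names are hypothetical.

```python
from pathlib import Path

OVERRIDE_KEY = "blk.40.nextn.eh_proj"
OVERRIDE_LINE = f"{OVERRIDE_KEY}=Q4_K"


def add_override(config_path):
    """Append the eh_proj override if it is absent. Returns True if modified."""
    path = Path(config_path)
    text = path.read_text()
    # Assumption: one "name=type" override per line.
    for line in text.splitlines():
        if line.strip().startswith(OVERRIDE_KEY + "="):
            return False  # already patched, leave the file untouched
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + OVERRIDE_LINE + "\n")
    return True


def patch_configs(config_dir):
    """Patch every *_mtp_nano.txt / *_mtp_micro.txt config in config_dir.

    Returns the names of the files that were changed.
    """
    changed = []
    for pattern in ("*_mtp_nano.txt", "*_mtp_micro.txt"):
        for path in sorted(Path(config_dir).glob(pattern)):
            if add_override(path):
                changed.append(path.name)
    return changed
```

The helper is idempotent, so running patch_configs twice on the same directory leaves already-patched files unchanged.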
Q4_K matches the edge-tier precision already used for blk.40.attn_* in the same configs.
The three other MTP-specific tensors (blk.40.nextn.enorm, blk.40.nextn.hnorm, blk.40.nextn.shared_head_norm) are F32 norms; they pass through untouched and need no override.
Test plan
qwen36_opus_distill_mtp_nano.txt
GGUF magic (vs all-NULL on broken upload)
llama-server --draft-mtp (in progress on my machine; will append results)
Scope
I've only patched the 10 Qwen3.6-family *_mtp_nano.txt / *_mtp_micro.txt configs because that's the architecture I verified end-to-end. Other MTP-bearing families (if any) likely have the same issue but may use different MTP block indices; happy to extend if you point me at the architecture details.
Generator note
scripts/generate_config.sh doesn't currently handle MTP at all; these _mtp_ configs look like they're produced by an out-of-tree post-process. Whatever produces them should also be updated so regenerated configs don't regress to the broken state. Not in scope for this PR.
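Until the generator side is addressed, a small read-only check could catch a regression after configs are regenerated. The sketch below is illustrative only and rests on assumptions: that configs use one name=type line per override, that lines starting with # can be treated as comments, and that the configs sit in a single directory. The function names are hypothetical and nothing here exists in the repo.

```python
from pathlib import Path

REQUIRED_KEY = "blk.40.nextn.eh_proj"


def read_overrides(config_path):
    """Parse name=type lines into a dict, skipping blank and '#' lines."""
    overrides = {}
    for raw in Path(config_path).read_text().splitlines():
        line = raw.strip()
        # Assumption: '#' starts a comment; lines without '=' are ignored.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, qtype = line.split("=", 1)
        overrides[name.strip()] = qtype.strip()
    return overrides


def find_unpatched(config_dir):
    """Return sorted names of MTP nano/micro configs lacking the override."""
    missing = []
    for pattern in ("*_mtp_nano.txt", "*_mtp_micro.txt"):
        for path in Path(config_dir).glob(pattern):
            if REQUIRED_KEY not in read_overrides(path):
                missing.append(path.name)
    return sorted(missing)
```

A CI step or pre-commit hook could call find_unpatched on the config directory and fail when the returned list is non-empty.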