WIP: Improve model config processing to support Qwen 35b, 122b and 397b models#1
Open
itanka9 wants to merge 9 commits into
Open
WIP: Improve model config processing to support Qwen 35b, 122b and 397b models#1itanka9 wants to merge 9 commits into
itanka9 wants to merge 9 commits into
Conversation
Update all architecture constants, expert layout, and tooling to support
Qwen3.5-122B-A10B-4bit (48 layers, 256 experts, hidden_size=3072) loaded
from ~/.cache/modelscope.
Changes:
- infer.m: update HIDDEN_DIM, NUM_LAYERS, NUM_EXPERTS, NUM_EXPERTS_PER_TOK,
NUM_FULL_ATTN_LAYERS, NUM_LINEAR_LAYERS, all 4-bit/2-bit expert byte
offsets, and MODEL_PATH_DEFAULT for 122B
- extract_weights.py: update model config and default path for 122B
- repack_experts.py: update COMPONENTS layout, EXPERT_SIZE, NUM_EXPERTS,
NUM_LAYERS, and fix verify loop (was hardcoded to expert index 511)
- generate_expert_index.py: new script — scans safetensors headers and
writes expert_index.json mapping each layer's stacked expert tensors
to their file offsets and strides
- export_vocab.py: new script — exports vocab.bin with proper GPT-2
byte-level BPE decoding so Chinese, Arabic, and all non-ASCII tokens
render correctly in output
- usage.txt: new file — complete step-by-step command reference
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update repack_experts_2bit.py for Qwen3.5-122B-A10B-4bit: - EXPERT_SIZE_4BIT 7,077,888 → 5,308,416 (hidden 4096→3072) - NUM_EXPERTS 512 → 256, NUM_LAYERS 60 → 48 - Recalculate all 4-bit and 2-bit offsets for 3072 hidden dim - EXPERT_SIZE_2BIT 3,932,160 → 2,949,120 - Default path updated to modelscope/mlx-community/122B Add Step 4b to usage.txt covering 2-bit repack commands (single-layer verify, full run) with note that 2-bit breaks JSON/tool calling.
Previously the server always sent SSE (text/event-stream) regardless of
the stream parameter. Now:
- Parse "stream" from the request body (default true)
- stream:true — existing SSE behaviour unchanged
- stream:false — buffer all tokens, send a single application/json
chat.completion object with Content-Length when generation finishes
Token accumulation was already happening for session persistence, so
non-streaming just skips the per-token SSE writes and emits one response.
…hanges: tools injection into system prompt, parse_tool_call (JSON formats), tool_calls response shape, cold-prefill bypass for tool requests, temperature parameter, reasoning_content extraction, and debug logging.
Key changes: build_multiturn_prompt replays full message history into the Qwen3.5 chat template for stateless clients, role:tool result turns, and auto-continuation detection (skips cold prefill when the last assistant message matches g_last_assistant_content).
Update all architecture constants for 35B: hidden=2048, 40 layers, 256 experts, K=8, MOE_INTERMEDIATE=512, LINEAR_NUM_V_HEADS=32. Fix expert byte offsets in infer.m (replace hardcoded 122B values with #defines for 35B layout). Add cpu_dequant_matvec_8bit for MoE routing gate, which mlx-community quantizes at bits=8 rather than bits=4. Update extract_weights.py, generate_expert_index.py, repack_experts.py, and repack_experts_2bit.py with 35B shapes, layer counts, and paths.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR remove hardcoded patch from infer and python scripts and grabs neccessery params from models
config.jsonfile.Now tested on 35b and 122b models. Test in 397b model pending.
Co-authored by opus 4.6