WIP: Improve model config processing to support Qwen 35b, 122b and 397b models by itanka9 · Pull Request #1 · Ma-Dan/flash-moe

itanka9 · 2026-06-22T10:40:13Z

This PR remove hardcoded patch from infer and python scripts and grabs neccessery params from models config.json file.

Now tested on 35b and 122b models. Test in 397b model pending.

Co-authored by opus 4.6

Update all architecture constants, expert layout, and tooling to support Qwen3.5-122B-A10B-4bit (48 layers, 256 experts, hidden_size=3072) loaded from ~/.cache/modelscope. Changes: - infer.m: update HIDDEN_DIM, NUM_LAYERS, NUM_EXPERTS, NUM_EXPERTS_PER_TOK, NUM_FULL_ATTN_LAYERS, NUM_LINEAR_LAYERS, all 4-bit/2-bit expert byte offsets, and MODEL_PATH_DEFAULT for 122B - extract_weights.py: update model config and default path for 122B - repack_experts.py: update COMPONENTS layout, EXPERT_SIZE, NUM_EXPERTS, NUM_LAYERS, and fix verify loop (was hardcoded to expert index 511) - generate_expert_index.py: new script — scans safetensors headers and writes expert_index.json mapping each layer's stacked expert tensors to their file offsets and strides - export_vocab.py: new script — exports vocab.bin with proper GPT-2 byte-level BPE decoding so Chinese, Arabic, and all non-ASCII tokens render correctly in output - usage.txt: new file — complete step-by-step command reference Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Update repack_experts_2bit.py for Qwen3.5-122B-A10B-4bit: - EXPERT_SIZE_4BIT 7,077,888 → 5,308,416 (hidden 4096→3072) - NUM_EXPERTS 512 → 256, NUM_LAYERS 60 → 48 - Recalculate all 4-bit and 2-bit offsets for 3072 hidden dim - EXPERT_SIZE_2BIT 3,932,160 → 2,949,120 - Default path updated to modelscope/mlx-community/122B Add Step 4b to usage.txt covering 2-bit repack commands (single-layer verify, full run) with note that 2-bit breaks JSON/tool calling.

Previously the server always sent SSE (text/event-stream) regardless of the stream parameter. Now: - Parse "stream" from the request body (default true) - stream:true — existing SSE behaviour unchanged - stream:false — buffer all tokens, send a single application/json chat.completion object with Content-Length when generation finishes Token accumulation was already happening for session persistence, so non-streaming just skips the per-token SSE writes and emits one response.

…hanges: tools injection into system prompt, parse_tool_call (JSON formats), tool_calls response shape, cold-prefill bypass for tool requests, temperature parameter, reasoning_content extraction, and debug logging.

Key changes: build_multiturn_prompt replays full message history into the Qwen3.5 chat template for stateless clients, role:tool result turns, and auto-continuation detection (skips cold prefill when the last assistant message matches g_last_assistant_content).

Update all architecture constants for 35B: hidden=2048, 40 layers, 256 experts, K=8, MOE_INTERMEDIATE=512, LINEAR_NUM_V_HEADS=32. Fix expert byte offsets in infer.m (replace hardcoded 122B values with #defines for 35B layout). Add cpu_dequant_matvec_8bit for MoE routing gate, which mlx-community quantizes at bits=8 rather than bits=4. Update extract_weights.py, generate_expert_index.py, repack_experts.py, and repack_experts_2bit.py with 35B shapes, layer counts, and paths.

Dan and others added 9 commits April 12, 2026 12:13

universal

906ae09

getting things to work

8627679

review fixes

5c84e28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Improve model config processing to support Qwen 35b, 122b and 397b models#1

WIP: Improve model config processing to support Qwen 35b, 122b and 397b models#1
itanka9 wants to merge 9 commits into
Ma-Dan:mainfrom
itanka9:qwen-universal

itanka9 commented Jun 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

itanka9 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

itanka9 commented Jun 22, 2026 •

edited

Loading