Published at: https://github.com/transformerlab/exp-north-mini-code-mac

This is the companion repository referenced by the paper's Availability section.
Everything needed to reproduce the headline numbers in the paper and model card. All
numbers trace to the evidence ledger (evidence.md, E1–E33).
The deliverable is the RTN 4-bit MLX quant of CohereLabs/North-Mini-Code-1.0
(equivalent to mlx-community/North-Mini-Code-1.0-4bit). The project's finding: none of
the in-budget quantization routes (cheaper or higher-bit) or cheap distillation
approaches we searched beats RTN 4-bit at ≤24 GB.
- README.md: this file, the full reproduction recipe
- requirements.md: the three environments (mlx-vlm / mlx-lm+port / swebench+Docker)
- swebench_subset.json: pinned agentic subset (n=30, seed 1234; sha256 in §3)
- streaming-gptq/: stream_gptq.py, calib_data.py, eval_humaneval.py, RESULTS.md
- mlx-lm-port/: cohere2_moe.py (the port), validate_parity.py, PARITY.md
- eval-humaneval/: eval_he_mlxvlm.py (mlx-vlm HE harness), analyze.py (paired McNemar)
- swebench/: build_subset.py, gen_searchreplace.py, agent.py, analyze.py, SWEBENCH.md
Model weights are not bundled (they live on the HF Hub, see requirements.md).
| Component | Version / spec |
|---|---|
| Hardware | Apple M4 Pro, 48 GB unified memory (deliverable target ≤24 GB) |
| OS | macOS (Darwin) |
| Convert + eval (compression arm) | mlx-vlm 0.6.2 (base mlx 0.31.x); required: stock mlx-lm lacks cohere2_moe |
| Streaming-GPTQ + parity port | mlx 0.31.2 / mlx_lm 0.31.3 with the cohere2_moe port dropped into mlx_lm/models/ (or upstream PR ml-explore/mlx-lm#1340) |
| Agentic scoring | pip install swebench; official swebench.harness.run_evaluation on x86 Linux + Docker (we used a cloud A10 GPU) |
Generation runs on the Mac (MLX); SWE-Bench test execution must run on x86 Linux + Docker, because the deliverable's platform cannot run the harness locally. The split is clean: predictions are written to a file on the Mac and scored separately, so no live tunnel is needed.
- Base coding eval (HumanEval / MBPP): greedy, temp=0, max_tokens=4096. No sampling, so results are deterministic given weights + harness. The answer is extracted after <|START_TEXT|>, and fenced code blocks are concatenated (helpers kept). North is a reasoning model: the thinking trace must be discarded or the score collapses.
- Agentic eval (SWE-Bench Verified): subset seed = 1234; greedy, temp=0, max_tokens=9000; single-shot SEARCH/REPLACE scaffold with oracle file localization.
- Streaming-GPTQ calibration: MBPP code, 64 samples × 256 tokens, disjoint from HumanEval.
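The answer-extraction rule for the base coding eval can be sketched as follows. This is a minimal illustration, not the logic of eval_he_mlxvlm.py: the function name `extract_answer`, the regex, and the choice to split on the last marker are assumptions made for the example.

```python
import re

START_TEXT = "<|START_TEXT|>"  # marker named in the eval settings
FENCE = "`" * 3  # built programmatically so the example can live inside a fenced block

# One fenced block, with an optional language tag on the opening fence.
CODE_BLOCK = re.compile(FENCE + r"[A-Za-z0-9_+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_answer(completion: str) -> str:
    """Discard the thinking trace, then join every fenced code block in the answer."""
    # Keep only the text after the last marker (assumption: the trace precedes it).
    # If the marker is absent, rpartition leaves the whole completion as the answer.
    _, _, answer = completion.rpartition(START_TEXT)
    blocks = [m.group(1).rstrip() for m in CODE_BLOCK.finditer(answer)]
    # Concatenate all blocks so helpers defined in separate blocks are kept.
    return "\n\n".join(blocks) if blocks else answer.strip()
```

Scoring the text before the marker would feed the thinking trace, including any discarded draft code, to the test runner, which is the failure mode the note above warns about.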
| Dataset | Pin |
|---|---|
| HumanEval (base) | openai_humaneval, 164 problems |
| MBPP (sanitized) | mbpp sanitized, 257 problems |
| SWE-Bench Verified subset (n=30) | swebench_subset.json in this folder: sha256 e5407c97dfccdfe1cf820c9f25579e27a56a2321ddb52642b0ef98abfdbcc661 (30 pinned instance_ids, seed 1234, repo-stratified over 12 repos) |
| Calibration | permissively-licensed code (MBPP splits, disjoint from eval) |
Verify the subset pin: shasum -a 256 swebench_subset.json → the hash above.
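Where `shasum` is unavailable, the same check can be done in Python. This is a small sketch using only the standard library; the helper names `sha256_of` and `matches_pin` are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path

# Hash pinned in the dataset table.
PINNED_SHA256 = "e5407c97dfccdfe1cf820c9f25579e27a56a2321ddb52642b0ef98abfdbcc661"


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path: str, expected: str = PINNED_SHA256) -> bool:
    """True when the file's digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

Calling `matches_pin("swebench_subset.json")` from the repository folder should return True for an unmodified subset file.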
# Option A: reproduce the quant locally (bit-equivalent to the published one):
mlx_vlm.convert --hf-path CohereLabs/North-Mini-Code-1.0 -q --q-bits 4 --q-group-size 64
# Option B: pull the published artifact:
# mlx-community/North-Mini-Code-1.0-4bit
# Quantizes expert/MLP linears only (affine, group_size 64); attention/router/embeddings/norms stay bf16.

Script: eval-humaneval/eval_he_mlxvlm.py (env A / mlx-vlm; --temp 0.0, --max-tokens 4096). Run it against the 4-bit model for HumanEval and for the
MBPP-sanitized variant.
Expected: HumanEval 0.8902 (146/164), MBPP 0.9066 (233/257), pooled 0.9002.
Build it (env A): mlx_vlm.convert --hf-path CohereLabs/North-Mini-Code-1.0 -q --q-bits 4 --q-group-size 64 --quant-predicate mixed_4_8 (→ 5.21 avg bits, 20.5 GB). Eval with
eval-humaneval/eval_he_mlxvlm.py --max-tokens 4096; compute the paired McNemar vs RTN with
eval-humaneval/analyze.py. Expected: 0.9024 (148/164), McNemar p=0.7905 vs RTN, a tie
(+1.2% < the 1.5% margin); all 6 RTN-pass/mixed-fail problems are mixed_4_8 truncations.
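The paired comparison can be checked by hand. The sketch below assumes the exact (binomial) two-sided form of McNemar's test; whether analyze.py uses this form is not stated here, but it reproduces the reported p-value. The discordant counts follow from the numbers above: 6 problems pass under RTN only, and 148 versus 146 passes implies 8 problems pass under mixed_4_8 only.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing distinguishes the two models
    k = min(b, c)
    # Under the null, the smaller count is Binomial(n, 0.5); double the lower tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# b = RTN-pass / mixed-fail, c = mixed-pass / RTN-fail (c inferred from 148 - 146 = 2).
p = mcnemar_exact(6, 8)
print(round(p, 4))  # 0.7905
```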
Scripts: streaming-gptq/{stream_gptq.py, calib_data.py, eval_humaneval.py} (env B / mlx-lm
port). Sequential per-layer GPTQ, per-expert Hessians over routed tokens only → 3.75 GB peak
on the 48 GB Mac (vs hundreds of GB for stock mlx_lm.quant.gptq). Result: HumanEval 0.8841
(145/164), McNemar p=1.000 vs RTN 4-bit (a tie). Parity of the port: mlx-lm-port/validate_parity.py.
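The phrase "per-expert Hessians over routed tokens only" can be illustrated with a toy accumulation. This is a pure-Python sketch under stated assumptions, not the code in stream_gptq.py: the function name and data layout are hypothetical, the Hessian is taken as the plain sum of outer products of input vectors, and scaling, damping, and the quantization step itself are omitted.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def per_expert_hessians(
    activations: Sequence[Sequence[float]],  # one input vector per token
    routes: Sequence[Sequence[int]],         # expert ids chosen for each token
    num_experts: int,
) -> List[Matrix]:
    """Accumulate H_e = sum of x x^T over the tokens routed to expert e."""
    dim = len(activations[0])
    hessians = [[[0.0] * dim for _ in range(dim)] for _ in range(num_experts)]
    for x, experts in zip(activations, routes):
        for e in experts:  # a token only contributes to experts it was routed to
            h = hessians[e]
            for i in range(dim):
                for j in range(dim):
                    h[i][j] += x[i] * x[j]
    return hessians
```

In this sketch each expert's statistics come only from the calibration tokens that reached it, and only one layer's matrices need to be held at a time when layers are processed sequentially.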
Scripts: swebench/{build_subset.py, gen_searchreplace.py, agent.py, analyze.py}.
# Rebuild the exact subset (must reproduce the sha256 above):
python build_subset.py --n 30 --seed 1234 --out subset.json
# Generate predictions on the Mac (greedy, max_tokens 9000) → results/preds.jsonl
# Score on x86 Linux + Docker:
python -m swebench.harness.run_evaluation --predictions_path results/preds.jsonl --dataset_name princeton-nlp/SWE-bench_Verified

Expected: 6/30 = 20.0% resolved (Wilson [9.5%, 37.3%]); 6/10 = 60% conditional on a patch; 21/30 looped to the token cap.
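The Wilson interval quoted for the resolved rate can be recomputed directly. This is a minimal sketch assuming the standard Wilson score interval at z = 1.96 (about 95% coverage); the function name is chosen for the example and is not taken from swebench/analyze.py.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(6, 30)
print(f"{low:.1%}, {high:.1%}")  # 9.5%, 37.3%
```

With n=30 the interval is wide, so the 20.0% figure is best read as a coarse estimate, not a precise rate.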
- Full method, figures, and statistics: the accompanying paper (arXiv preprint, link in the repository description).
- Evidence ledger (every number → source artifact): evidence.md (E1–E33).
- Model card: model-card.md.
- License: Apache-2.0 (base model + Cohere quants). No training/distillation data used in the deliverable → no third-party data-attribution obligation.