Reproducibility package: north-mac

Published at: https://github.com/transformerlab/exp-north-mini-code-mac

This is the companion repository referenced by the paper's Availability section.

Everything needed to reproduce the headline numbers in the paper and model card. All numbers trace to the evidence ledger (evidence.md, E1–E33).

The deliverable is the RTN 4-bit MLX quant of CohereLabs/North-Mini-Code-1.0 (equivalent to mlx-community/North-Mini-Code-1.0-4bit). The project's finding: none of the in-budget quantization routes (cheaper or higher-bit) or cheap distillation approaches we searched beats RTN 4-bit at ≤24 GB.

Layout

README.md            this file: full reproduction recipe
requirements.md      the three environments (mlx-vlm / mlx-lm+port / swebench+Docker)
swebench_subset.json pinned agentic subset (n=30, seed 1234; sha256 in §3)
streaming-gptq/       stream_gptq.py, calib_data.py, eval_humaneval.py, RESULTS.md
mlx-lm-port/          cohere2_moe.py (the port), validate_parity.py, PARITY.md
eval-humaneval/       eval_he_mlxvlm.py (mlx-vlm HE harness), analyze.py (paired McNemar)
swebench/             build_subset.py, gen_searchreplace.py, agent.py, analyze.py, SWEBENCH.md

Model weights are not bundled (they live on the HF Hub, see requirements.md).


1. Environment

| Component | Version / spec |
|---|---|
| Hardware | Apple M4 Pro, 48 GB unified memory (deliverable target ≤24 GB) |
| OS | macOS (Darwin) |
| Convert + eval (compression arm, env A) | mlx-vlm 0.6.2 (base mlx 0.31.x); required: stock mlx-lm lacks cohere2_moe |
| Streaming-GPTQ + parity port (env B) | mlx 0.31.2 / mlx_lm 0.31.3 with the cohere2_moe port dropped into mlx_lm/models/ (or upstream PR ml-explore/mlx-lm#1340) |
| Agentic scoring | pip install swebench; official swebench.harness.run_evaluation on x86 Linux + Docker (we used a cloud A10 GPU) |

Generation runs on the Mac (MLX); SWE-Bench test execution must run on x86 Linux + Docker, because the deliverable's platform cannot run the harness locally. The split is clean: the two stages are decoupled, so no live tunnel between the machines is needed.

2. Determinism: seeds & decoding

  • Base coding eval (HumanEval / MBPP): greedy, temp=0, max_tokens=4096. No sampling → deterministic given weights + harness. The answer is extracted from the text after <|START_TEXT|>, and its fenced code blocks are concatenated (helper functions kept). North is a reasoning model, so the thinking trace must be discarded or the score collapses.
  • Agentic eval (SWE-Bench Verified): subset seed = 1234; greedy, temp=0, max_tokens=9000; single-shot SEARCH/REPLACE scaffold with oracle file localization.
  • Streaming-GPTQ calibration: MBPP code, 64 samples × 256 tokens, disjoint from HumanEval.

3. Data hashes & versions

| Dataset | Pin |
|---|---|
| HumanEval (base) | openai_humaneval, 164 problems |
| MBPP (sanitized) | mbpp sanitized, 257 problems |
| SWE-Bench Verified subset (n=30) | swebench_subset.json in this folder: sha256 e5407c97dfccdfe1cf820c9f25579e27a56a2321ddb52642b0ef98abfdbcc661 (30 pinned instance_ids, seed 1234, repo-stratified over 12 repos) |
| Calibration | permissively-licensed code (MBPP splits, disjoint from eval) |

Verify the subset pin: shasum -a 256 swebench_subset.json → the hash above.
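Where shasum is not available, the same check can be done from Python. This is a small sketch using only the standard library; the helper names are illustrative and not part of this repository.

```python
import hashlib
from pathlib import Path

# Pin copied from the table in this section.
EXPECTED_SHA256 = "e5407c97dfccdfe1cf820c9f25579e27a56a2321ddb52642b0ef98abfdbcc661"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so it never has to fit in memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def subset_matches_pin(path: str = "swebench_subset.json") -> bool:
    """True when the file on disk is byte-identical to the pinned subset."""
    return sha256_of_file(path) == EXPECTED_SHA256
```

Because the hash covers raw bytes, a subset file with the same instance_ids but different whitespace or key order would not match the pin.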

4. Reproduce the headline metrics

4a. Build the deliverable (or pull the published quant)

# Option A: reproduce the quant locally (bit-equivalent to the published one):
mlx_vlm.convert --hf-path CohereLabs/North-Mini-Code-1.0 -q --q-bits 4 --q-group-size 64
# Option B: pull the published artifact:
#   mlx-community/North-Mini-Code-1.0-4bit
# Quantizes expert/MLP linears only (affine, group_size 64); attention/router/embeddings/norms stay bf16.
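For readers unfamiliar with round-to-nearest (RTN), the idea behind affine group quantization can be illustrated in a few lines of plain Python. This is a conceptual sketch only: it is not MLX's kernel, the function names are hypothetical, and details such as how MLX packs codes or stores the scale and bias are not modelled.

```python
def rtn_quantize_group(weights, bits=4):
    """Affine round-to-nearest for one group: returns (codes, scale, bias)."""
    levels = (1 << bits) - 1               # 15 steps between codes 0..15 at 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate weights."""
    return [c * scale + bias for c in codes]


# One group of 64 weights, matching the group size used above.
group = [i / 10 for i in range(-32, 32)]
codes, scale, bias = rtn_quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(w - r) for w, r in zip(group, restored))
```

Each weight is reconstructed to within half a quantization step of its group, with no calibration data involved, which is why RTN is the cheapest baseline to build.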

4b. Base coding (pooled 0.900, the primary metric)

Script: eval-humaneval/eval_he_mlxvlm.py (env A / mlx-vlm; --temp 0.0, --max-tokens 4096). Run against the 4-bit model for HumanEval and the MBPP-sanitized variant.

Expected: HumanEval 0.8902 (146/164), MBPP 0.9066 (233/257), pooled 0.9002.
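The pooled figure is a problem-weighted rate over both benchmarks rather than the mean of the two per-benchmark rates. A short check of the arithmetic, using the counts above (the helper name is illustrative):

```python
from fractions import Fraction


def pooled_pass_rate(results):
    """results: iterable of (passed, total) pairs, one per benchmark."""
    passed = sum(p for p, _ in results)
    total = sum(t for _, t in results)
    return Fraction(passed, total)


humaneval = (146, 164)
mbpp = (233, 257)
pooled = pooled_pass_rate([humaneval, mbpp])  # 379 of 421 problems
print(f"{float(pooled):.4f}")
```

Averaging the two rates directly would give about 0.898, so the 0.9002 figure only reproduces when the problems are pooled.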

4c. mixed_4_8: the higher-bit in-budget config (full HE-164 tie)

Build it (env A): mlx_vlm.convert --hf-path CohereLabs/North-Mini-Code-1.0 -q --q-bits 4 --q-group-size 64 --quant-predicate mixed_4_8 (→ 5.21 avg bits, 20.5 GB). Evaluate with eval-humaneval/eval_he_mlxvlm.py --max-tokens 4096, then compute the paired McNemar test against RTN with eval-humaneval/analyze.py.

Expected: 0.9024 (148/164), McNemar p=0.7905 vs RTN. This is a tie: the +1.2-point gain is below the 1.5-point margin. All 6 RTN-pass/mixed-fail problems are mixed_4_8 truncations.
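The reported p-value is consistent with an exact (binomial) McNemar test on the discordant pairs. The sketch below is not analyze.py; it assumes the exact two-sided form, and it infers the second discordant count: 6 problems pass under RTN only, and since mixed_4_8 solves two more problems overall (148 vs 146), 8 problems must pass under mixed_4_8 only.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing distinguishes the two models
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact_p(6, 8)
print(f"{p_value:.4f}")
```

Only the discordant problems enter the test; problems that both models pass or both fail carry no information about which one is better.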

4d. (Optional) streaming GPTQ: the MoE quantization contribution

Scripts: streaming-gptq/{stream_gptq.py, calib_data.py, eval_humaneval.py} (env B / mlx-lm port). Sequential per-layer GPTQ, per-expert Hessians over routed tokens only → 3.75 GB peak on the 48 GB Mac (vs hundreds of GB for stock mlx_lm.quant.gptq). Result: HumanEval 0.8841 (145/164), McNemar p=1.000 vs RTN 4-bit (a tie). Parity of the port: mlx-lm-port/validate_parity.py.

4e. Agentic (SWE-Bench Verified subset → 20% resolved)

Scripts: swebench/{build_subset.py, gen_searchreplace.py, agent.py, analyze.py}.

# Rebuild the exact subset (must reproduce the sha256 above):
python build_subset.py --n 30 --seed 1234 --out subset.json
# Generate predictions on the Mac (greedy, max_tokens 9000) → results/preds.jsonl
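# Score on x86 Linux + Docker (--run_id is required by the harness; the label below is an arbitrary example):
python -m swebench.harness.run_evaluation --predictions_path results/preds.jsonl --dataset_name princeton-nlp/SWE-bench_Verified --run_id north-mac-rtn4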

Expected: 6/30 = 20.0% resolved (Wilson [9.5%, 37.3%]); 6/10 = 60% conditional on a patch; 21/30 looped to the token cap.
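The Wilson interval quoted above can be recomputed directly from the counts. This sketch assumes a 95% interval with z = 1.96; the function name is illustrative.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(6, 30)
print(f"[{low:.1%}, {high:.1%}]")
```

With only 30 instances the interval spans roughly 28 points, so the 20% figure is best read as a coarse estimate rather than a precise resolved rate.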

5. Provenance

  • Full method, figures, and statistics: the accompanying paper (arXiv preprint, link in the repository description).
  • Evidence ledger (every number → source artifact): evidence.md (E1–E33).
  • Model card: model-card.md.
  • License: Apache-2.0 (base model + Cohere quants). No training/distillation data used in the deliverable → no third-party data-attribution obligation.

About

Reproducibility package for paper "Round-to-Nearest Is Hard to Beat: A 30B MoE Coding Model in 24 GB on Apple Silicon"
