Reproducibility package: north-mac

Published at: https://github.com/transformerlab/exp-north-mini-code-mac. This is the companion repository referenced by the paper's Availability section.

Everything needed to reproduce the headline numbers in the paper and model card. All numbers trace to the evidence ledger (evidence.md, E1–E33).

The deliverable is the RTN 4-bit MLX quant of CohereLabs/North-Mini-Code-1.0 (equivalent to mlx-community/North-Mini-Code-1.0-4bit). The project's finding: none of the in-budget quantization routes we searched (cheaper or higher-bit), nor cheap distillation, beats RTN 4-bit at ≤24 GB.

Layout

README.md             this file: full reproduction recipe
requirements.md       the three environments (mlx-vlm / mlx-lm+port / swebench+Docker)
swebench_subset.json  pinned agentic subset (n=30, seed 1234; sha256 in §3)
streaming-gptq/       stream_gptq.py, calib_data.py, eval_humaneval.py, RESULTS.md
mlx-lm-port/          cohere2_moe.py (the port), validate_parity.py, PARITY.md
eval-humaneval/       eval_he_mlxvlm.py (mlx-vlm HE harness), analyze.py (paired McNemar)
swebench/             build_subset.py, gen_searchreplace.py, agent.py, analyze.py, SWEBENCH.md

Model weights are not bundled; they live on the HF Hub (see requirements.md).

1. Environment

| Component | Version / spec |
|---|---|
| Hardware | Apple M4 Pro, 48 GB unified memory (deliverable target ≤24 GB) |
| OS | macOS (Darwin) |
| Convert + eval (compression arm) | `mlx-vlm` 0.6.2 (base `mlx` 0.31.x); required: stock `mlx-lm` lacks `cohere2_moe` |
| Streaming-GPTQ + parity port | `mlx` 0.31.2 / `mlx_lm` 0.31.3 with the `cohere2_moe` port dropped into `mlx_lm/models/` (or upstream PR ml-explore/mlx-lm#1340) |
| Agentic scoring | `pip install swebench`; official `swebench.harness.run_evaluation` on x86 Linux + Docker (we used a cloud A10 GPU) |

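Generation runs on the Mac (MLX). SWE-Bench test execution must run on x86 Linux + Docker, because the deliverable's platform cannot run the harness locally. The split is clean: predictions are written to a file on the Mac and scored separately on the Linux host, so no live tunnel between the two machines is needed.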

2. Determinism: seeds & decoding

Base coding eval (HumanEval / MBPP): greedy, temp=0, max_tokens=4096. No sampling, so results are deterministic given weights + harness. The answer is extracted after <|START_TEXT|>, and its fenced code blocks are concatenated (helpers kept). North is a reasoning model, so the thinking trace must be discarded or the score collapses.
Agentic eval (SWE-Bench Verified): subset seed = 1234; greedy, temp=0, max_tokens=9000; single-shot SEARCH/REPLACE scaffold with oracle file localization.
Streaming-GPTQ calibration: MBPP code, 64 samples × 256 tokens, disjoint from HumanEval.
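
The answer-extraction rule for the base coding eval can be sketched as follows. This is a minimal illustration, not the code in eval-humaneval/eval_he_mlxvlm.py: the function name extract_answer, the choice of the last marker occurrence, and the fallback to the raw answer text when no fenced block is present are all assumptions.

```python
import re

START_TEXT = "<|START_TEXT|>"
FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fenced block

# One fenced block: opening fence with an optional language tag, body, closing fence.
_BLOCK_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_answer(completion: str) -> str:
    """Drop the thinking trace, then join every fenced code block in the answer."""
    # Keep only the text after the last marker. rpartition returns the whole
    # string as the tail when the marker is absent (assumed acceptable here).
    _, _, answer = completion.rpartition(START_TEXT)
    blocks = _BLOCK_RE.findall(answer)
    if not blocks:
        return answer.strip()  # assumption: fall back to the raw answer text
    # Concatenate all blocks so helpers defined in separate blocks are kept.
    return "\n".join(block.rstrip() for block in blocks) + "\n"
```

Code that appears inside the thinking trace never reaches the test runner in this sketch, which is the failure mode the note above warns about.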

3. Data hashes & versions

| Dataset | Pin |
|---|---|
| HumanEval (base) | `openai_humaneval`, 164 problems |
| MBPP (sanitized) | `mbpp` sanitized, 257 problems |
| SWE-Bench Verified subset (n=30) | `swebench_subset.json` in this folder: sha256 `e5407c97dfccdfe1cf820c9f25579e27a56a2321ddb52642b0ef98abfdbcc661` (30 pinned `instance_ids`, seed 1234, repo-stratified over 12 repos) |
| Calibration | permissively-licensed code (MBPP splits, disjoint from eval) |

Verify the subset pin: running shasum -a 256 swebench_subset.json should print the hash above.
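
Where shasum is not available (for example on the Linux scoring host), the same check can be scripted with the Python standard library. This is a small sketch; the helper names sha256_of and verify_subset are illustrative and not part of the repository.

```python
import hashlib
from pathlib import Path

# Pin copied from the table in this section.
EXPECTED_SHA256 = "e5407c97dfccdfe1cf820c9f25579e27a56a2321ddb52642b0ef98abfdbcc661"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_subset(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's digest matches the pinned value."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    target = Path("swebench_subset.json")
    if target.exists():
        print("OK" if verify_subset(target) else "MISMATCH")
    else:
        print(f"{target} not found; run from the repository root")
```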

4. Reproduce the headline metrics

4a. Build the deliverable (or pull the published quant)

# Option A: reproduce the quant locally (bit-equivalent to the published one):
mlx_vlm.convert --hf-path CohereLabs/North-Mini-Code-1.0 -q --q-bits 4 --q-group-size 64
# Option B: pull the published artifact:
#   mlx-community/North-Mini-Code-1.0-4bit
# Quantizes expert/MLP linears only (affine, group_size 64); attention/router/embeddings/norms stay bf16.
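
To make "RTN, affine, group_size 64" concrete, here is a pure-Python sketch of round-to-nearest affine quantization applied group by group. It illustrates the general technique only: MLX's actual kernels, bit packing, and rounding details may differ, and all function names are hypothetical. The bits-per-weight helper assumes one 16-bit scale and one 16-bit bias are stored per group, and it describes quantized layers only, not the model-wide average.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest affine quantization of one group of floats."""
    levels = (1 << bits) - 1                 # 15 steps for 4-bit codes 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                  # lo plays the role of the bias


def dequantize_group(codes, scale, bias):
    """Map integer codes back to floats: w_hat = code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantize_rtn(weights, bits=4, group_size=64):
    """Quantize a flat list group by group and return the dequantized values."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, bias = quantize_group(group, bits)
        out.extend(dequantize_group(codes, scale, bias))
    return out


def bits_per_weight(bits=4, group_size=64, overhead_bits=32):
    # Assumption: a 16-bit scale plus a 16-bit bias per group.
    return bits + overhead_bits / group_size
```

Under that storage assumption, a 4-bit layer with group size 64 costs 4.5 bits per weight. No calibration data is involved, which is what distinguishes RTN from the GPTQ route in 4d.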

4b. Base coding (pooled 0.900, the primary metric)

Script: eval-humaneval/eval_he_mlxvlm.py (env A / mlx-vlm; --temp 0.0, --max-tokens 4096). Run against the 4-bit model for HumanEval and the MBPP-sanitized variant.

Expected: HumanEval 0.8902 (146/164), MBPP 0.9066 (233/257), pooled 0.9002.
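
The pooled figure is consistent with pooling problems across both benchmarks (379/421) rather than averaging the two pass rates. The short check below reproduces the three expected values from the pass counts above; the dictionary layout and function name are illustrative only.

```python
# (passed, total) per benchmark, taken from the expected results above.
RESULTS = {"HumanEval": (146, 164), "MBPP-sanitized": (233, 257)}


def pooled_counts(results):
    """Sum passes and problem counts across benchmarks (micro-average)."""
    passed = sum(p for p, _ in results.values())
    total = sum(n for _, n in results.values())
    return passed, total


for name, (p, n) in RESULTS.items():
    print(f"{name}: {p}/{n} = {p / n:.4f}")

passed, total = pooled_counts(RESULTS)
print(f"pooled: {passed}/{total} = {passed / total:.4f}")
```

Averaging the two rates instead would give about 0.8984, so a reproduction that lands there has most likely used a different pooling rule rather than a different model.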

4c. mixed_4_8: the higher-bit in-budget config (full HE-164 tie)

Build it (env A): mlx_vlm.convert --hf-path CohereLabs/North-Mini-Code-1.0 -q --q-bits 4 --q-group-size 64 --quant-predicate mixed_4_8 (→ 5.21 avg bits, 20.5 GB). Evaluate with eval-humaneval/eval_he_mlxvlm.py --max-tokens 4096, then compute the paired McNemar test against RTN with eval-humaneval/analyze.py. Expected: 0.9024 (148/164) and McNemar p=0.7905 vs RTN. This counts as a tie: the gain is +1.2 percentage points, below the 1.5% margin, and all 6 problems that RTN passes but mixed_4_8 fails are mixed_4_8 truncations.
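
As a sanity check on the reported p-value, here is a sketch of the exact (binomial) paired McNemar test. Whether analyze.py uses the exact or the chi-square form is not stated here, so treat the choice as an assumption. Only the 6 RTN-pass/mixed-fail count is given above; the 8 in the other direction is inferred from the pass totals (148 − 146 = +2 net).

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the two systems are indistinguishable
    k = min(b, c)
    # Probability of k or fewer "wins" out of n under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


rtn_pass, mixed_pass = 146, 148
rtn_only = 6                                      # stated: RTN passes, mixed_4_8 fails
mixed_only = rtn_only + (mixed_pass - rtn_pass)   # inferred: 8
print(f"p = {mcnemar_exact(rtn_only, mixed_only):.4f}")
```

With these counts the exact test gives p = 0.7905, matching the expected value. Problems that both models pass or both fail do not enter the statistic.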

4d. (Optional) streaming GPTQ: the MoE quantization contribution

Scripts: streaming-gptq/{stream_gptq.py, calib_data.py, eval_humaneval.py} (env B / mlx-lm port). GPTQ runs sequentially, one layer at a time, and accumulates per-expert Hessians over routed tokens only, which keeps peak memory at 3.75 GB on the 48 GB Mac (vs hundreds of GB for stock mlx_lm.quant.gptq). Result: HumanEval 0.8841 (145/164), McNemar p=1.000 vs RTN 4-bit (a tie). Parity of the port: mlx-lm-port/validate_parity.py.
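
The "routed tokens only" idea can be illustrated with a toy accumulator. This is a pure-Python sketch under stated assumptions, not the logic of stream_gptq.py: it uses the standard GPTQ layer-wise Hessian H = 2·Σ x xᵀ (without any normalization or damping), takes routing decisions as a given list of expert indices per token, and uses a hypothetical function name.

```python
def per_expert_hessians(activations, routes, num_experts):
    """Accumulate H_e = 2 * sum(x x^T) over the tokens routed to expert e.

    activations: list of token vectors (lists of floats), all of length d
    routes:      one list of expert indices per token (top-k routing)
    """
    d = len(activations[0])
    hessians = [[[0.0] * d for _ in range(d)] for _ in range(num_experts)]
    counts = [0] * num_experts
    for x, experts in zip(activations, routes):
        for e in experts:                 # a token updates only its own experts
            counts[e] += 1
            h = hessians[e]
            for i in range(d):
                xi2 = 2.0 * x[i]
                row = h[i]
                for j in range(d):
                    row[j] += xi2 * x[j]
    return hessians, counts


tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
routes = [[0], [1], [0, 1]]               # third token goes to both experts
hessians, counts = per_expert_hessians(tokens, routes, num_experts=2)
print(counts, hessians[0], hessians[1])
```

Each expert's statistics reflect only the inputs it actually sees. The counts also show how many calibration tokens reached each expert, which is worth checking for rarely routed experts.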

4e. Agentic (SWE-Bench Verified subset → 20% resolved)

Scripts: swebench/{build_subset.py, gen_searchreplace.py, agent.py, analyze.py}.

# Rebuild the exact subset (must reproduce the sha256 above):
python build_subset.py --n 30 --seed 1234 --out subset.json
# Generate predictions on the Mac (greedy, max_tokens 9000) → results/preds.jsonl
# Score on x86 Linux + Docker:
python -m swebench.harness.run_evaluation --predictions_path results/preds.jsonl --dataset_name princeton-nlp/SWE-bench_Verified

Expected: 6/30 = 20.0% resolved (Wilson [9.5%, 37.3%]); 6/10 = 60% resolved conditional on a patch being produced; 21/30 generations looped until the token cap.
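
The Wilson interval quoted above can be recomputed from the counts alone. The sketch below assumes a 95% confidence level (z ≈ 1.96), which is not stated explicitly here but reproduces the reported bounds; the function name is illustrative.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


lo, hi = wilson_interval(6, 30)
print(f"6/30 = {6 / 30:.1%}, Wilson 95% CI [{lo:.1%}, {hi:.1%}]")
```

The interval is wide because n=30 is small, so a rerun that resolves one or two more or fewer instances is still compatible with the reported result.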

5. Provenance

Full method, figures, and statistics: the accompanying paper (arXiv preprint, link in the repository description).
Evidence ledger (every number → source artifact): evidence.md (E1–E33).
Model card: model-card.md.
License: Apache-2.0 (base model + Cohere quants). No training/distillation data used in the deliverable → no third-party data-attribution obligation.
