Fix C++ deployment parity, add top-p sampling, robust configs by guirguispierre · Pull Request #4 · guirguispierre/Atomic-1Bit

guirguispierre · 2026-04-30T21:07:47Z

Summary

The C++ inference path had latent bugs that quietly broke the project's "bit-exact Python ↔ C++ parity" claim end-to-end. The kernel itself was fine; the format/glue around it wasn't. This PR makes the deployment story actually correct, fixes a couple of trainer/CLI papercuts, and adds top-p / top-k / streaming to the runner.

What changed

Format and parity (showstoppers)

embedded/atomic_lib.h: the loader read 6 ints in the wrong order, skipped the ATOM magic and version, treated magic as vocab_size, and never consumed the per-tensor float32 weight scale written by the exporter. Rewritten to read the full 8-int header, validate magic/version, consume weight scales, and apply them in bit_linear via SubLN-aware dequantization matching the Python BitLinear forward exactly. (The standalone atomic_runner.cpp was already correct; this brings the documented header-only API in line.)
atomic_1bit/utils/thermal.py: add public get_current_temp() so train_instruct.py:356 stops calling a nonexistent method during temp logging; defensively format None when sensors are unavailable.
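The header validation described for atomic_lib.h can be sketched in Python as follows. This is a minimal illustration, not the exporter's actual layout: the PR only states that the header is 8 ints that include the ATOM magic and a version, so the field order, the integer encoding of the magic, and the version number below are all assumptions.

```python
import struct

# Assumed for illustration: the PR says the header is 8 ints that include the
# 'ATOM' magic and a version, but does not spell out the field order.
HEADER_FIELDS = ("magic", "version", "vocab_size", "dim", "depth",
                 "heads", "context_length", "has_gist")
HEADER_STRUCT = struct.Struct("<8i")        # 8 little-endian int32 = 32 bytes
MAGIC = int.from_bytes(b"ATOM", "little")   # hypothetical encoding of the magic
SUPPORTED_VERSION = 1                       # hypothetical version number


def parse_header(buf: bytes) -> dict:
    """Unpack the 8-int header and reject files with a bad magic or version."""
    if len(buf) < HEADER_STRUCT.size:
        raise ValueError("truncated header")
    header = dict(zip(HEADER_FIELDS, HEADER_STRUCT.unpack_from(buf, 0)))
    if header["magic"] != MAGIC:
        raise ValueError(f"bad magic: {header['magic']:#x}")
    if header["version"] != SUPPORTED_VERSION:
        raise ValueError(f"unsupported version: {header['version']}")
    return header
```

Checking the magic first is what turns the old failure mode (magic silently interpreted as vocab_size) into an immediate, explicit error.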

Configs (real UX bugs for users)

atomic_1bit/python/chat.py: drop hardcoded 50257 / 256 / 4 / 4 dims (silently produced garbage for any non-matching checkpoint). Auto-infer vocab / dim / depth / context_length from the checkpoint's state_dict, with CLI overrides; add --device and clean error paths.
atomic_1bit/utils/gen_gist.py: drop hardcoded flagship dims; accept --dim / --depth / --heads / --vocab_size / --context_length CLI args.
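A rough sketch of shape-based config inference with CLI-style overrides is shown below. The state_dict key names (`tok_emb.weight`, `pos_emb.weight`, `layers.N.`) are hypothetical placeholders; the real checkpoint keys used by chat.py are not shown in this PR description.

```python
import re
from types import SimpleNamespace

# Hypothetical key names; the real checkpoint may use different ones.
TOK_EMB_KEY = "tok_emb.weight"   # assumed shape: (vocab_size, dim)
POS_EMB_KEY = "pos_emb.weight"   # assumed shape: (context_length, dim)
LAYER_KEY = re.compile(r"^layers\.(\d+)\.")


def infer_config(state_dict, overrides=None):
    """Derive model dims from tensor shapes, then apply explicit overrides."""
    missing = [k for k in (TOK_EMB_KEY, POS_EMB_KEY) if k not in state_dict]
    if missing:
        raise KeyError(f"checkpoint is missing expected tensors: {missing}")

    vocab_size, dim = state_dict[TOK_EMB_KEY].shape
    context_length = state_dict[POS_EMB_KEY].shape[0]

    # Depth is the number of distinct layer indices found in the key names.
    layer_ids = set()
    for key in state_dict:
        match = LAYER_KEY.match(key)
        if match:
            layer_ids.add(int(match.group(1)))

    config = {
        "vocab_size": vocab_size,
        "dim": dim,
        "depth": len(layer_ids),
        "context_length": context_length,
    }
    # An override of None means "flag not passed": keep the inferred value.
    for name, value in (overrides or {}).items():
        if value is not None:
            config[name] = value
    return config


# Stand-in for a loaded state_dict: only .shape is needed here.
fake = {
    TOK_EMB_KEY: SimpleNamespace(shape=(1000, 64)),
    POS_EMB_KEY: SimpleNamespace(shape=(128, 64)),
    "layers.0.attn.q.weight": SimpleNamespace(shape=(64, 64)),
    "layers.1.attn.q.weight": SimpleNamespace(shape=(64, 64)),
}
print(infer_config(fake, overrides={"context_length": 64, "dim": None}))
```

Failing loudly on missing tensors, rather than falling back to defaults, avoids the silent-garbage behavior the hardcoded dims used to cause.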

Runner feature: top-p + streaming

embedded/atomic_runner.cpp: replace temperature-only sample_logits with temperature + optional top-k + optional top-p (nucleus) sampling. New flags: --top_k, --top_p, --prompt 1,2,3 (multi-token start), --stream / --no-stream, --help. Validates prompt tokens against vocab and rejects unknown args.
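For readers unfamiliar with the combination, temperature + top-k + top-p (nucleus) sampling can be sketched in Python as follows. It mirrors the general technique rather than the exact code in atomic_runner.cpp; the conventions that `top_k=0` and `top_p=1.0` disable the respective filters, and that a non-positive temperature means greedy decoding, are assumptions made for this sketch.

```python
import math
import random


def sample_logits(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token id using temperature, optional top-k, optional top-p."""
    if temperature <= 0:
        # Assumed convention: non-positive temperature means greedy decoding.
        return max(range(len(logits)), key=logits.__getitem__)

    # Numerically stable softmax over temperature-scaled logits.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)

    # Rank (probability, token id) pairs in descending order of probability.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)),
                    key=lambda pair: pair[0], reverse=True)

    # Top-k: keep at most k candidates (0 disables the filter).
    kept = top_k if 0 < top_k < len(ranked) else len(ranked)

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    if 0.0 < top_p < 1.0:
        cumulative = 0.0
        for n, (prob, _) in enumerate(ranked[:kept], start=1):
            cumulative += prob
            if cumulative >= top_p:
                kept = n
                break

    ranked = ranked[:kept]
    # Draw from the surviving candidates, renormalizing by their total mass.
    threshold = rng.random() * sum(prob for prob, _ in ranked)
    running = 0.0
    for prob, token in ranked:
        running += prob
        if threshold < running:
            return token
    return ranked[-1][1]  # guard against floating-point round-off
```

Applying top-k before top-p means the nucleus is computed over the already-truncated candidate list; the opposite order is also seen in practice and gives slightly different distributions.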

Tests

tests/test_export_roundtrip.py: parses the exported binary using the C++ loader's exact expected layout (header, gist, embeddings, per-tensor scales, byte counts). Catches any future drift between exporter and loader.
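The drift-detection idea behind such a test can be illustrated with a strict reader that refuses short reads and trailing bytes. This is a sketch under assumptions, not the contents of tests/test_export_roundtrip.py: the helper names are made up, and the record shape (one float32 scale followed by one signed byte per weight) is inferred from the loader's read sizes.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes or fail; a short read means the layout drifted."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError(f"expected {n} bytes, got {len(data)}")
    return data


def read_scaled_tensor(stream, n_weights):
    """Assumed record: one float32 scale, then n_weights signed bytes."""
    (scale,) = struct.unpack("<f", read_exact(stream, 4))
    weights = struct.unpack(f"<{n_weights}b", read_exact(stream, n_weights))
    return scale, weights


def assert_fully_consumed(stream):
    """Any leftover byte means exporter and parser disagree on the layout."""
    if stream.read(1):
        raise ValueError("trailing bytes after last tensor")


# A well-formed record parses cleanly and leaves nothing behind.
blob = struct.pack("<f", 0.5) + struct.pack("<4b", 1, 0, -1, 1)
stream = io.BytesIO(blob)
print(read_scaled_tensor(stream, 4))
assert_fully_consumed(stream)
```

If an exporter stopped writing the scale, the first four weight bytes would be consumed as the scale and the weight read would come up short, so the mismatch surfaces as an error instead of silently shifted data.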

Docs

docs/USAGE.md, docs/COMMANDS.md: document the new runner flags.

Verification

pytest tests/ → 132 passed (was 129; +3 new roundtrip tests).
cd atomic_1bit/core && make then pytest tests/test_kernel_parity.py → 9 passed on Apple Silicon (NEON SIMD path).
cd embedded && g++ -O3 -std=c++17 atomic_runner.cpp -o runner && ./runner --help → builds clean, prints new flag list.

Test plan

Train any checkpoint and run chat.py --checkpoint weights/<file>.pt without passing dims; verify auto-inference picks the right shape.
Export to .bin and run ./runner --top_p 0.9 --temp 0.8 --prompt 1,42 --steps 30 against a real model.
Run pytest tests/ on CI.

The C++ inference path had latent bugs that broke the project's "bit-exact Python <-> C++ parity" claim end-to-end. The kernel itself matches NumPy, but the export -> load -> inference glue around it did not. This change makes the deployment story actually correct, plus adds top-p / top-k / streaming to the runner.

Format and parity (showstoppers)
- embedded/atomic_lib.h: header read was 6 ints in the wrong order; it skipped the 'ATOM' magic and version, treated magic as vocab_size, and never consumed the per-tensor float32 weight scale written by the exporter. Rewritten to read the full 8-int header, validate magic/version, consume weight scales, and apply them in bit_linear via SubLN-aware dequantization that matches the Python BitLinear forward exactly.
- atomic_1bit/utils/thermal.py: add public get_current_temp() so train_instruct.py:356 stops calling a nonexistent method when temp logging fires; defensively format None when sensors are unavailable.

Configs (correctness for users)
- atomic_1bit/python/chat.py: drop hardcoded 50257/256/4/4 dims; auto-infer vocab/dim/depth/context_length from the checkpoint's state_dict, with CLI overrides; add --device and clean loading/error paths.
- atomic_1bit/utils/gen_gist.py: drop hardcoded flagship dims; accept --dim/--depth/--heads/--vocab_size/--context_length CLI args, derive UNK from --vocab_size.

Runner feature: top-p + streaming
- embedded/atomic_runner.cpp: replace temperature-only sample_logits with temperature + optional top-k + optional top-p (nucleus) sampling. Add --top_k, --top_p, --prompt 1,2,3, --stream/--no-stream, --help; validate prompt tokens against vocab; reject unknown args.

Tests
- tests/test_export_roundtrip.py: parse the exported binary using the C++ loader's exact expected layout (header, gist, embeddings, per-tensor scales, byte counts). Catches future drift between exporter and loader. Builds on existing test_export.py.

Docs
- docs/USAGE.md, docs/COMMANDS.md: document new runner flags.

Verified: 132 pytest tests pass (was 129; +3 roundtrip). CPU kernel parity (NEON path on Apple Silicon) passes against NumPy reference across 9 size configurations.

Copilot

Pull request overview

This PR restores end-to-end Python ↔ C++ deployment parity for the exported .bin model format, improves configuration robustness in Python CLI tools, and extends the embedded runner with top-k/top-p sampling and streaming output.

Changes:

Fix/align the header-only C++ loader (embedded/atomic_lib.h) with the exporter format (magic/version validation, per-tensor scales, SubLN-aware dequant in bit_linear).
Enhance the embedded runner (embedded/atomic_runner.cpp) with top-k/top-p sampling, prompt lists, streaming controls, and --help/unknown-arg validation.
Improve Python UX and correctness via config inference/overrides (chat.py), configurable gist generation (gen_gist.py), and robust temperature logging (thermal.py, train_instruct.py), plus add a binary roundtrip/layout test and update docs.
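The None-safe temperature logging mentioned in the last item can be sketched as below. The function name `format_temp` and the output format are hypothetical; the only behavior taken from the PR is that a temperature accessor may return None when sensors are unavailable and that logging must not crash on it.

```python
from typing import Optional


def format_temp(temp_c: Optional[float]) -> str:
    """Render a temperature for a log line, tolerating missing sensors."""
    if temp_c is None:
        # No sensor reading available: log a placeholder instead of raising.
        return "temp=n/a"
    return f"temp={temp_c:.1f}C"


print(format_temp(None))
print(format_temp(61.5))
```

Applying a float format spec such as `:.1f` directly to None raises a TypeError, which is why the explicit branch is needed.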

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/test_export_roundtrip.py | Adds a strict parser/byte-budget test to catch exporter/loader format drift. |
| embedded/atomic_runner.cpp | Implements temperature + top-k/top-p sampling, prompt parsing, streaming output, and help/arg validation. |
| embedded/atomic_lib.h | Reworks loader + `bit_linear` to match exporter format (magic/version, per-tensor scales, SubLN-aware math). |
| docs/USAGE.md | Documents updated runner usage and new sampling/streaming flags. |
| docs/COMMANDS.md | Updates runner command reference with new flags and semantics. |
| atomic_1bit/utils/thermal.py | Adds a public temperature accessor used by training logging. |
| atomic_1bit/utils/gen_gist.py | Removes hardcoded model dims; adds CLI args for model shape parameters. |
| atomic_1bit/training/train_instruct.py | Makes temperature logging robust when sensors are unavailable (`None`). |
| atomic_1bit/python/chat.py | Removes hardcoded dims; infers config from checkpoint with CLI overrides and device selection. |


  for (int i = 0; i < model.config.depth; ++i) {
    AtomicLayer layer;

-    // LN1
    layer.ln1.resize(dim);
-    f.read((char *)layer.ln1.data(), dim * 4);
-
-    // Attn (dim * dim)
-    int size = dim * dim;
-    layer.q_w.resize(size);
-    f.read((char *)layer.q_w.data(), size);
-    layer.k_w.resize(size);
-    f.read((char *)layer.k_w.data(), size);
-    layer.v_w.resize(size);
-    f.read((char *)layer.v_w.data(), size);
-    layer.o_w.resize(size);
-    f.read((char *)layer.o_w.data(), size);
-
-    // LN2
+    f.read((char *)layer.ln1.data(), dim * sizeof(float));
+
+    int attn_size = dim * dim;
+    read_scale(f, layer.q_s);
+    layer.q_w.resize(attn_size);
+    f.read((char *)layer.q_w.data(), attn_size);
+
+    read_scale(f, layer.k_s);
+    layer.k_w.resize(attn_size);
+    f.read((char *)layer.k_w.data(), attn_size);
+
+    read_scale(f, layer.v_s);
+    layer.v_w.resize(attn_size);
+    f.read((char *)layer.v_w.data(), attn_size);
+
+    read_scale(f, layer.o_s);
+    layer.o_w.resize(attn_size);
+    f.read((char *)layer.o_w.data(), attn_size);
+
    layer.ln2.resize(dim);
-    f.read((char *)layer.ln2.data(), dim * 4);
+    f.read((char *)layer.ln2.data(), dim * sizeof(float));

-    // MLP
-    int size_fc1 = dim * (dim * 4);
-    int size_fc2 = (dim * 4) * dim;
+    int size_fc1 = dim * hidden;
+    int size_fc2 = hidden * dim;
+    read_scale(f, layer.fc1_s);
    layer.fc1_w.resize(size_fc1);
    f.read((char *)layer.fc1_w.data(), size_fc1);
+
+    read_scale(f, layer.fc2_s);
    layer.fc2_w.resize(size_fc2);
    f.read((char *)layer.fc2_w.data(), size_fc2);

    model.layers.push_back(layer);
  }
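The effect of the new read_scale calls on the file layout can be checked with a little arithmetic. The sketch below counts the bytes one layer occupies according to the reads in the hunk above (float32 norms, a 4-byte scale per weight tensor, one byte per weight, hidden = 4 * dim); the function names are made up for illustration, and the 4-byte scale size comes from the PR description's "float32 weight scale".

```python
FLOAT_BYTES = 4      # float32 norm weights and per-tensor scales
NUM_SCALED = 6       # q, k, v, o, fc1, fc2 each carry one scale


def layer_bytes(dim: int) -> int:
    """Bytes per layer as read by the updated loader."""
    hidden = 4 * dim
    norms = 2 * dim * FLOAT_BYTES                    # ln1 + ln2
    attn = 4 * (FLOAT_BYTES + dim * dim)             # scale + weights, x4
    mlp = (FLOAT_BYTES + dim * hidden) + (FLOAT_BYTES + hidden * dim)
    return norms + attn + mlp


def old_layer_bytes(dim: int) -> int:
    """Bytes per layer consumed by the pre-PR loader, which skipped scales."""
    return layer_bytes(dim) - NUM_SCALED * FLOAT_BYTES


dim = 256
print(layer_bytes(dim), old_layer_bytes(dim))
print("drift per layer:", layer_bytes(dim) - old_layer_bytes(dim), "bytes")
```

A 24-byte shortfall per layer is small relative to the layer size, which helps explain why the old loader could read misaligned data without an obvious error.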


 def top_k_sampling(logits, k=50, temperature=1.0):
-    # Apply temp
    logits = logits / temperature
-    # Top K
    top_k_vals, top_k_inds = torch.topk(logits, k)
-    # Softmax
    probs = F.softmax(top_k_vals, dim=-1)
-    # Sample
    idx = torch.multinomial(probs, 1)
    return top_k_inds[0, idx[0]]


+  // Sort descending by probability.
+  std::sort(ranked.begin(), ranked.end(),
+            [](const pair<float, int> &a, const pair<float, int> &b) {
+              return a.first > b.first;
+            });
+
+  int kept = use_topk ? top_k : V;
+


@@ -144,78 +173,85 @@ inline bool load_model(const string &filename, AtomicModel &model) {
  f.read((char *)&model.config.has_gist, 4);

  int dim = model.config.dim;
+  int hidden = 4 * dim;


+  model.head_w.resize((size_t)dim * model.config.vocab_size);
+  f.read((char *)model.head_w.data(), model.head_w.size());

+  if (!f.good() && !f.eof()) {




- atomic_1bit/__init__.py: __version__ = "1.4.0"
- CHANGELOG.md: new file in Keep a Changelog format. Entry for 1.4.0 covers this PR's changes (C++ format/parity fixes, top-p/top-k sampling, chat.py/gen_gist.py CLI overhaul, export-roundtrip test). Backfilled entries for 1.0.0, 1.2.0, 1.3.0 from RELEASE_NOTES.md and the README roadmap.
- README.md: roadmap section now lists v1.4 as current and links to CHANGELOG.md.

Copilot AI review requested due to automatic review settings April 30, 2026 21:07

Copilot started reviewing on behalf of guirguispierre April 30, 2026 21:08

Copilot AI reviewed Apr 30, 2026

guirguispierre wants to merge 2 commits into master from improvements/parity-and-simd
