Skip to content

Fix C++ deployment parity, add top-p sampling, robust configs#4

Open
guirguispierre wants to merge 2 commits into
masterfrom
improvements/parity-and-simd
Open

Fix C++ deployment parity, add top-p sampling, robust configs#4
guirguispierre wants to merge 2 commits into
masterfrom
improvements/parity-and-simd

Conversation

@guirguispierre
Copy link
Copy Markdown
Owner

Summary

The C++ inference path had latent bugs that quietly broke the project's "bit-exact Python ↔ C++ parity" claim end-to-end. The kernel itself was fine; the format/glue around it wasn't. This PR makes the deployment story actually correct, fixes a couple of trainer/CLI papercuts, and adds top-p / top-k / streaming to the runner.

What changed

Format and parity (showstoppers)

  • embedded/atomic_lib.h: the loader read 6 ints in the wrong order, skipped the ATOM magic and version, treated magic as vocab_size, and never consumed the per-tensor float32 weight scale written by the exporter. Rewritten to read the full 8-int header, validate magic/version, consume weight scales, and apply them in bit_linear via SubLN-aware dequantization matching the Python BitLinear forward exactly. (The standalone atomic_runner.cpp was already correct; this brings the documented header-only API in line.)
  • atomic_1bit/utils/thermal.py: add public get_current_temp() so train_instruct.py:356 stops calling a nonexistent method during temp logging; defensively format None when sensors are unavailable.

Configs (real UX bugs for users)

  • atomic_1bit/python/chat.py: drop hardcoded 50257 / 256 / 4 / 4 dims (silently produced garbage for any non-matching checkpoint). Auto-infer vocab / dim / depth / context_length from the checkpoint's state_dict, with CLI overrides; add --device and clean error paths.
  • atomic_1bit/utils/gen_gist.py: drop hardcoded flagship dims; accept --dim / --depth / --heads / --vocab_size / --context_length CLI args.

Runner feature: top-p + streaming

  • embedded/atomic_runner.cpp: replace temperature-only sample_logits with temperature + optional top-k + optional top-p (nucleus) sampling. New flags: --top_k, --top_p, --prompt 1,2,3 (multi-token start), --stream / --no-stream, --help. Validates prompt tokens against vocab and rejects unknown args.

Tests

  • tests/test_export_roundtrip.py: parses the exported binary using the C++ loader's exact expected layout (header, gist, embeddings, per-tensor scales, byte counts). Catches any future drift between exporter and loader.

Docs

  • docs/USAGE.md, docs/COMMANDS.md: document the new runner flags.

Verification

  • pytest tests/132 passed (was 129; +3 new roundtrip tests).
  • cd atomic_1bit/core && make then pytest tests/test_kernel_parity.py9 passed on Apple Silicon (NEON SIMD path).
  • cd embedded && g++ -O3 -std=c++17 atomic_runner.cpp -o runner && ./runner --help → builds clean, prints new flag list.

Test plan

  • Train any checkpoint and run chat.py --checkpoint weights/<file>.pt without passing dims; verify auto-inference picks the right shape.
  • Export to .bin and run ./runner --top_p 0.9 --temp 0.8 --prompt 1,42 --steps 30 against a real model.
  • Run pytest tests/ on CI.

The C++ inference path had latent bugs that broke the project's
"bit-exact Python <-> C++ parity" claim end-to-end. The kernel itself
matches NumPy, but the export -> load -> inference glue around it did
not. This change makes the deployment story actually correct, plus
adds top-p / top-k / streaming to the runner.

Format and parity (showstoppers)
- embedded/atomic_lib.h: header read was 6 ints in the wrong order;
  it skipped the 'ATOM' magic and version, treated magic as
  vocab_size, and never consumed the per-tensor float32 weight scale
  written by the exporter. Rewritten to read the full 8-int header,
  validate magic/version, consume weight scales, and apply them in
  bit_linear via SubLN-aware dequantization that matches the Python
  BitLinear forward exactly.
- atomic_1bit/utils/thermal.py: add public get_current_temp() so
  train_instruct.py:356 stops calling a nonexistent method when temp
  logging fires; defensively format None when sensors are
  unavailable.

Configs (correctness for users)
- atomic_1bit/python/chat.py: drop hardcoded 50257/256/4/4 dims;
  auto-infer vocab/dim/depth/context_length from the checkpoint's
  state_dict, with CLI overrides; add --device and clean
  loading/error paths.
- atomic_1bit/utils/gen_gist.py: drop hardcoded flagship dims;
  accept --dim/--depth/--heads/--vocab_size/--context_length CLI
  args, derive UNK from --vocab_size.

Runner feature: top-p + streaming
- embedded/atomic_runner.cpp: replace temperature-only sample_logits
  with temperature + optional top-k + optional top-p (nucleus)
  sampling. Add --top_k, --top_p, --prompt 1,2,3, --stream/--no-stream,
  --help; validate prompt tokens against vocab; reject unknown args.

Tests
- tests/test_export_roundtrip.py: parse the exported binary using the
  C++ loader's exact expected layout (header, gist, embeddings,
  per-tensor scales, byte counts). Catches future drift between
  exporter and loader. Builds on existing test_export.py.

Docs
- docs/USAGE.md, docs/COMMANDS.md: document new runner flags.

Verified: 132 pytest tests pass (was 129; +3 roundtrip). CPU kernel
parity (NEON path on Apple Silicon) passes against NumPy reference
across 9 size configurations.
Copilot AI review requested due to automatic review settings April 30, 2026 21:07
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR restores end-to-end Python ↔ C++ deployment parity for the exported .bin model format, improves configuration robustness in Python CLI tools, and extends the embedded runner with top-k/top-p sampling and streaming output.

Changes:

  • Fix/align the header-only C++ loader (embedded/atomic_lib.h) with the exporter format (magic/version validation, per-tensor scales, SubLN-aware dequant in bit_linear).
  • Enhance the embedded runner (embedded/atomic_runner.cpp) with top-k/top-p sampling, prompt lists, streaming controls, and --help/unknown-arg validation.
  • Improve Python UX and correctness via config inference/overrides (chat.py), configurable gist generation (gen_gist.py), and robust temperature logging (thermal.py, train_instruct.py), plus add a binary roundtrip/layout test and update docs.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/test_export_roundtrip.py Adds a strict parser/byte-budget test to catch exporter/loader format drift.
embedded/atomic_runner.cpp Implements temperature + top-k/top-p sampling, prompt parsing, streaming output, and help/arg validation.
embedded/atomic_lib.h Reworks loader + bit_linear to match exporter format (magic/version, per-tensor scales, SubLN-aware math).
docs/USAGE.md Documents updated runner usage and new sampling/streaming flags.
docs/COMMANDS.md Updates runner command reference with new flags and semantics.
atomic_1bit/utils/thermal.py Adds a public temperature accessor used by training logging.
atomic_1bit/utils/gen_gist.py Removes hardcoded model dims; adds CLI args for model shape parameters.
atomic_1bit/training/train_instruct.py Makes temperature logging robust when sensors are unavailable (None).
atomic_1bit/python/chat.py Removes hardcoded dims; infers config from checkpoint with CLI overrides and device selection.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread embedded/atomic_lib.h
Comment on lines 190 to 227
for (int i = 0; i < model.config.depth; ++i) {
AtomicLayer layer;

// LN1
layer.ln1.resize(dim);
f.read((char *)layer.ln1.data(), dim * 4);

// Attn (dim * dim)
int size = dim * dim;
layer.q_w.resize(size);
f.read((char *)layer.q_w.data(), size);
layer.k_w.resize(size);
f.read((char *)layer.k_w.data(), size);
layer.v_w.resize(size);
f.read((char *)layer.v_w.data(), size);
layer.o_w.resize(size);
f.read((char *)layer.o_w.data(), size);

// LN2
f.read((char *)layer.ln1.data(), dim * sizeof(float));

int attn_size = dim * dim;
read_scale(f, layer.q_s);
layer.q_w.resize(attn_size);
f.read((char *)layer.q_w.data(), attn_size);

read_scale(f, layer.k_s);
layer.k_w.resize(attn_size);
f.read((char *)layer.k_w.data(), attn_size);

read_scale(f, layer.v_s);
layer.v_w.resize(attn_size);
f.read((char *)layer.v_w.data(), attn_size);

read_scale(f, layer.o_s);
layer.o_w.resize(attn_size);
f.read((char *)layer.o_w.data(), attn_size);

layer.ln2.resize(dim);
f.read((char *)layer.ln2.data(), dim * 4);
f.read((char *)layer.ln2.data(), dim * sizeof(float));

// MLP
int size_fc1 = dim * (dim * 4);
int size_fc2 = (dim * 4) * dim;
int size_fc1 = dim * hidden;
int size_fc2 = hidden * dim;
read_scale(f, layer.fc1_s);
layer.fc1_w.resize(size_fc1);
f.read((char *)layer.fc1_w.data(), size_fc1);

read_scale(f, layer.fc2_s);
layer.fc2_w.resize(size_fc2);
f.read((char *)layer.fc2_w.data(), size_fc2);

model.layers.push_back(layer);
}
Comment on lines 13 to 18
def top_k_sampling(logits, k=50, temperature=1.0):
# Apply temp
logits = logits / temperature
# Top K
top_k_vals, top_k_inds = torch.topk(logits, k)
# Softmax
probs = F.softmax(top_k_vals, dim=-1)
# Sample
idx = torch.multinomial(probs, 1)
return top_k_inds[0, idx[0]]
Comment on lines +540 to +547
// Sort descending by probability.
std::sort(ranked.begin(), ranked.end(),
[](const pair<float, int> &a, const pair<float, int> &b) {
return a.first > b.first;
});

int kept = use_topk ? top_k : V;

Comment thread embedded/atomic_lib.h
Comment on lines 168 to +176
@@ -144,78 +173,85 @@ inline bool load_model(const string &filename, AtomicModel &model) {
f.read((char *)&model.config.has_gist, 4);

int dim = model.config.dim;
int hidden = 4 * dim;
Comment thread embedded/atomic_lib.h
model.head_w.resize((size_t)dim * model.config.vocab_size);
f.read((char *)model.head_w.data(), model.head_w.size());

if (!f.good() && !f.eof()) {
Comment thread embedded/atomic_lib.h
Comment on lines +196 to +203
int attn_size = dim * dim;
read_scale(f, layer.q_s);
layer.q_w.resize(attn_size);
f.read((char *)layer.q_w.data(), attn_size);

read_scale(f, layer.k_s);
layer.k_w.resize(attn_size);
f.read((char *)layer.k_w.data(), attn_size);
- atomic_1bit/__init__.py: __version__ = "1.4.0"
- CHANGELOG.md: new file in Keep a Changelog format. Entry for 1.4.0
  covers this PR's changes (C++ format/parity fixes, top-p/top-k
  sampling, chat.py/gen_gist.py CLI overhaul, export-roundtrip
  test). Backfilled entries for 1.0.0, 1.2.0, 1.3.0 from
  RELEASE_NOTES.md and the README roadmap.
- README.md: roadmap section now lists v1.4 as current and links
  to CHANGELOG.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants