Fix C++ deployment parity, add top-p sampling, robust configs#4
Open
guirguispierre wants to merge 2 commits into
Open
Fix C++ deployment parity, add top-p sampling, robust configs#4guirguispierre wants to merge 2 commits into
guirguispierre wants to merge 2 commits into
Conversation
The C++ inference path had latent bugs that broke the project's "bit-exact Python <-> C++ parity" claim end-to-end. The kernel itself matches NumPy, but the export -> load -> inference glue around it did not. This change makes the deployment story actually correct, plus adds top-p / top-k / streaming to the runner. Format and parity (showstoppers) - embedded/atomic_lib.h: header read was 6 ints in the wrong order; it skipped the 'ATOM' magic and version, treated magic as vocab_size, and never consumed the per-tensor float32 weight scale written by the exporter. Rewritten to read the full 8-int header, validate magic/version, consume weight scales, and apply them in bit_linear via SubLN-aware dequantization that matches the Python BitLinear forward exactly. - atomic_1bit/utils/thermal.py: add public get_current_temp() so train_instruct.py:356 stops calling a nonexistent method when temp logging fires; defensively format None when sensors are unavailable. Configs (correctness for users) - atomic_1bit/python/chat.py: drop hardcoded 50257/256/4/4 dims; auto-infer vocab/dim/depth/context_length from the checkpoint's state_dict, with CLI overrides; add --device and clean loading/error paths. - atomic_1bit/utils/gen_gist.py: drop hardcoded flagship dims; accept --dim/--depth/--heads/--vocab_size/--context_length CLI args, derive UNK from --vocab_size. Runner feature: top-p + streaming - embedded/atomic_runner.cpp: replace temperature-only sample_logits with temperature + optional top-k + optional top-p (nucleus) sampling. Add --top_k, --top_p, --prompt 1,2,3, --stream/--no-stream, --help; validate prompt tokens against vocab; reject unknown args. Tests - tests/test_export_roundtrip.py: parse the exported binary using the C++ loader's exact expected layout (header, gist, embeddings, per-tensor scales, byte counts). Catches future drift between exporter and loader. Builds on existing test_export.py. Docs - docs/USAGE.md, docs/COMMANDS.md: document new runner flags. Verified: 132 pytest tests pass (was 129; +3 roundtrip). CPU kernel parity (NEON path on Apple Silicon) passes against NumPy reference across 9 size configurations.
There was a problem hiding this comment.
Pull request overview
This PR restores end-to-end Python ↔ C++ deployment parity for the exported .bin model format, improves configuration robustness in Python CLI tools, and extends the embedded runner with top-k/top-p sampling and streaming output.
Changes:
- Fix/align the header-only C++ loader (
embedded/atomic_lib.h) with the exporter format (magic/version validation, per-tensor scales, SubLN-aware dequant inbit_linear). - Enhance the embedded runner (
embedded/atomic_runner.cpp) with top-k/top-p sampling, prompt lists, streaming controls, and--help/unknown-arg validation. - Improve Python UX and correctness via config inference/overrides (
chat.py), configurable gist generation (gen_gist.py), and robust temperature logging (thermal.py,train_instruct.py), plus add a binary roundtrip/layout test and update docs.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_export_roundtrip.py | Adds a strict parser/byte-budget test to catch exporter/loader format drift. |
| embedded/atomic_runner.cpp | Implements temperature + top-k/top-p sampling, prompt parsing, streaming output, and help/arg validation. |
| embedded/atomic_lib.h | Reworks loader + bit_linear to match exporter format (magic/version, per-tensor scales, SubLN-aware math). |
| docs/USAGE.md | Documents updated runner usage and new sampling/streaming flags. |
| docs/COMMANDS.md | Updates runner command reference with new flags and semantics. |
| atomic_1bit/utils/thermal.py | Adds a public temperature accessor used by training logging. |
| atomic_1bit/utils/gen_gist.py | Removes hardcoded model dims; adds CLI args for model shape parameters. |
| atomic_1bit/training/train_instruct.py | Makes temperature logging robust when sensors are unavailable (None). |
| atomic_1bit/python/chat.py | Removes hardcoded dims; infers config from checkpoint with CLI overrides and device selection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
190
to
227
| for (int i = 0; i < model.config.depth; ++i) { | ||
| AtomicLayer layer; | ||
|
|
||
| // LN1 | ||
| layer.ln1.resize(dim); | ||
| f.read((char *)layer.ln1.data(), dim * 4); | ||
|
|
||
| // Attn (dim * dim) | ||
| int size = dim * dim; | ||
| layer.q_w.resize(size); | ||
| f.read((char *)layer.q_w.data(), size); | ||
| layer.k_w.resize(size); | ||
| f.read((char *)layer.k_w.data(), size); | ||
| layer.v_w.resize(size); | ||
| f.read((char *)layer.v_w.data(), size); | ||
| layer.o_w.resize(size); | ||
| f.read((char *)layer.o_w.data(), size); | ||
|
|
||
| // LN2 | ||
| f.read((char *)layer.ln1.data(), dim * sizeof(float)); | ||
|
|
||
| int attn_size = dim * dim; | ||
| read_scale(f, layer.q_s); | ||
| layer.q_w.resize(attn_size); | ||
| f.read((char *)layer.q_w.data(), attn_size); | ||
|
|
||
| read_scale(f, layer.k_s); | ||
| layer.k_w.resize(attn_size); | ||
| f.read((char *)layer.k_w.data(), attn_size); | ||
|
|
||
| read_scale(f, layer.v_s); | ||
| layer.v_w.resize(attn_size); | ||
| f.read((char *)layer.v_w.data(), attn_size); | ||
|
|
||
| read_scale(f, layer.o_s); | ||
| layer.o_w.resize(attn_size); | ||
| f.read((char *)layer.o_w.data(), attn_size); | ||
|
|
||
| layer.ln2.resize(dim); | ||
| f.read((char *)layer.ln2.data(), dim * 4); | ||
| f.read((char *)layer.ln2.data(), dim * sizeof(float)); | ||
|
|
||
| // MLP | ||
| int size_fc1 = dim * (dim * 4); | ||
| int size_fc2 = (dim * 4) * dim; | ||
| int size_fc1 = dim * hidden; | ||
| int size_fc2 = hidden * dim; | ||
| read_scale(f, layer.fc1_s); | ||
| layer.fc1_w.resize(size_fc1); | ||
| f.read((char *)layer.fc1_w.data(), size_fc1); | ||
|
|
||
| read_scale(f, layer.fc2_s); | ||
| layer.fc2_w.resize(size_fc2); | ||
| f.read((char *)layer.fc2_w.data(), size_fc2); | ||
|
|
||
| model.layers.push_back(layer); | ||
| } |
Comment on lines
13
to
18
| def top_k_sampling(logits, k=50, temperature=1.0): | ||
| # Apply temp | ||
| logits = logits / temperature | ||
| # Top K | ||
| top_k_vals, top_k_inds = torch.topk(logits, k) | ||
| # Softmax | ||
| probs = F.softmax(top_k_vals, dim=-1) | ||
| # Sample | ||
| idx = torch.multinomial(probs, 1) | ||
| return top_k_inds[0, idx[0]] |
Comment on lines
+540
to
+547
| // Sort descending by probability. | ||
| std::sort(ranked.begin(), ranked.end(), | ||
| [](const pair<float, int> &a, const pair<float, int> &b) { | ||
| return a.first > b.first; | ||
| }); | ||
|
|
||
| int kept = use_topk ? top_k : V; | ||
|
|
Comment on lines
168
to
+176
| @@ -144,78 +173,85 @@ inline bool load_model(const string &filename, AtomicModel &model) { | |||
| f.read((char *)&model.config.has_gist, 4); | |||
|
|
|||
| int dim = model.config.dim; | |||
| int hidden = 4 * dim; | |||
| model.head_w.resize((size_t)dim * model.config.vocab_size); | ||
| f.read((char *)model.head_w.data(), model.head_w.size()); | ||
|
|
||
| if (!f.good() && !f.eof()) { |
Comment on lines
+196
to
+203
| int attn_size = dim * dim; | ||
| read_scale(f, layer.q_s); | ||
| layer.q_w.resize(attn_size); | ||
| f.read((char *)layer.q_w.data(), attn_size); | ||
|
|
||
| read_scale(f, layer.k_s); | ||
| layer.k_w.resize(attn_size); | ||
| f.read((char *)layer.k_w.data(), attn_size); |
- atomic_1bit/__init__.py: __version__ = "1.4.0" - CHANGELOG.md: new file in Keep a Changelog format. Entry for 1.4.0 covers this PR's changes (C++ format/parity fixes, top-p/top-k sampling, chat.py/gen_gist.py CLI overhaul, export-roundtrip test). Backfilled entries for 1.0.0, 1.2.0, 1.3.0 from RELEASE_NOTES.md and the README roadmap. - README.md: roadmap section now lists v1.4 as current and links to CHANGELOG.md.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The C++ inference path had latent bugs that quietly broke the project's "bit-exact Python ↔ C++ parity" claim end-to-end. The kernel itself was fine; the format/glue around it wasn't. This PR makes the deployment story actually correct, fixes a couple of trainer/CLI papercuts, and adds top-p / top-k / streaming to the runner.
What changed
Format and parity (showstoppers)
embedded/atomic_lib.h: the loader read 6 ints in the wrong order, skipped theATOMmagic and version, treatedmagicasvocab_size, and never consumed the per-tensorfloat32weight scale written by the exporter. Rewritten to read the full 8-int header, validate magic/version, consume weight scales, and apply them inbit_linearvia SubLN-aware dequantization matching the PythonBitLinearforward exactly. (The standaloneatomic_runner.cppwas already correct; this brings the documented header-only API in line.)atomic_1bit/utils/thermal.py: add publicget_current_temp()sotrain_instruct.py:356stops calling a nonexistent method during temp logging; defensively formatNonewhen sensors are unavailable.Configs (real UX bugs for users)
atomic_1bit/python/chat.py: drop hardcoded50257 / 256 / 4 / 4dims (silently produced garbage for any non-matching checkpoint). Auto-infervocab / dim / depth / context_lengthfrom the checkpoint'sstate_dict, with CLI overrides; add--deviceand clean error paths.atomic_1bit/utils/gen_gist.py: drop hardcoded flagship dims; accept--dim / --depth / --heads / --vocab_size / --context_lengthCLI args.Runner feature: top-p + streaming
embedded/atomic_runner.cpp: replace temperature-onlysample_logitswith temperature + optional top-k + optional top-p (nucleus) sampling. New flags:--top_k,--top_p,--prompt 1,2,3(multi-token start),--stream/--no-stream,--help. Validates prompt tokens against vocab and rejects unknown args.Tests
tests/test_export_roundtrip.py: parses the exported binary using the C++ loader's exact expected layout (header, gist, embeddings, per-tensor scales, byte counts). Catches any future drift between exporter and loader.Docs
docs/USAGE.md,docs/COMMANDS.md: document the new runner flags.Verification
pytest tests/→ 132 passed (was 129; +3 new roundtrip tests).cd atomic_1bit/core && makethenpytest tests/test_kernel_parity.py→ 9 passed on Apple Silicon (NEON SIMD path).cd embedded && g++ -O3 -std=c++17 atomic_runner.cpp -o runner && ./runner --help→ builds clean, prints new flag list.Test plan
chat.py --checkpoint weights/<file>.ptwithout passing dims; verify auto-inference picks the right shape..binand run./runner --top_p 0.9 --temp 0.8 --prompt 1,42 --steps 30against a real model.pytest tests/on CI.