Harden codebase: error handling, configs, tests, and docs by guirguispierre · Pull Request #3 · guirguispierre/Atomic-1Bit

guirguispierre · 2026-03-29T02:58:13Z

Summary

Replace 14 bare except: blocks with specific exception types across training, eval, benchmark, and utility scripts
Fix Metal backend: resolve metallib path relative to __FILE__ instead of hardcoded CWD-relative path; use zero-copy buffers (newBufferWithBytesNoCopy) for M-series unified memory
Wire YAML configs to all 3 training scripts via --config flag (backward-compatible defaults)
Add checkpoint validation with missing/unexpected key warnings before load_state_dict
Fix thermal monitor to log warnings on sensor errors instead of silently returning 0.0
Fix tokenizer base class to use @abstractmethod instead of manual raise NotImplementedError
Add CUDA kernel documentation with inline comments for kernel launch config, shared memory, and sync points
Add 58 new tests: E2E pipeline (train → export → binary validation), quantization edge cases (NaN, Inf, zeros, extremes), eval/benchmark smoke tests
Add docs: API_REFERENCE.md for the C++ inference API, TROUBLESHOOTING.md for common errors and production deployment

Test plan

Run pytest tests/ to verify all 58 new tests pass
Run python3 atomic_1bit/python/inference.py to verify parity still holds
Build Metal backend (make BACKEND=METAL) and verify metallib path resolves correctly
Run training with --config flag: python3 atomic_1bit/training/train.py --config configs/stories_base.yaml
Verify existing training scripts still work without --config (backward compat)

@AbstractMethod

- Replace 14 bare except: blocks with specific exception types - Fix hardcoded Metal metallib path to resolve relative to __FILE__ - Use zero-copy Metal buffers (newBufferWithBytesNoCopy) for M-series - Wire YAML config files to training scripts via --config flag - Add checkpoint validation with missing/unexpected key warnings - Fix thermal monitor to log warnings instead of silent failure - Use @AbstractMethod in tokenizer base class - Add CUDA kernel inline documentation - Add 58 new tests: E2E pipeline, quantization edge cases, eval smoke - Add docs: API_REFERENCE.md, TROUBLESHOOTING.md

The test referenced get_current_temp() which never existed on ThermalMonitor. Updated to use _get_max_temp() and accept None for unavailable sensors.

Copilot

Pull request overview

This PR hardens the Atomic-1Bit codebase by tightening exception handling, adding YAML-driven training configs and checkpoint validation, improving backend robustness/perf (Metal), and adding substantial test + documentation coverage.

Changes:

Add new pytest coverage (quantization edge cases, eval/benchmark smoke tests, and an end-to-end train→export→binary-validate pipeline).
Introduce --config YAML loading and checkpoint key validation warnings across training scripts.
Improve backend and utility robustness (Metal metallib path + zero-copy buffers, thermal monitor logging, bare except: removal, tokenizer abstractmethod fix), plus add API/troubleshooting docs.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 11 comments.

Show a summary per file

File	Description
tests/test_quantization_edge_cases.py	Adds BitLinear/quantization edge-case tests (NaN/Inf/zeros/extremes, gradients).
tests/test_eval_benchmark.py	Adds import smoke tests and deterministic metric checks for evaluation + benchmark modules.
tests/test_e2e_pipeline.py	Adds slow-marked end-to-end pipeline test (train steps, checkpoint reload, export header checks).
scripts/run_decoder.py	Replaces bare `except:` with specific decode-related exceptions.
scripts/evaluate_quality.py	Replaces bare `except:` with specific decode-related exceptions.
pytest.ini	Defines `slow` marker for pytest selection.
docs/TROUBLESHOOTING.md	Adds extensive troubleshooting + deployment guide covering build/runtime/export issues.
docs/API_REFERENCE.md	Adds C++ inference API reference for `embedded/atomic_lib.h` and binary format notes.
benchmarks/run_suite.py	Replaces bare `except:` blocks with more specific exceptions.
atomic_1bit/utils/thermal.py	Logs sensor read failures and degrades gracefully instead of returning 0.0 silently.
atomic_1bit/training/train.py	Adds YAML config loading + checkpoint key validation before `load_state_dict`.
atomic_1bit/training/train_instruct.py	Adds YAML config loading + checkpoint key validation + safer exception handling around optimizer/scheduler restore.
atomic_1bit/training/train_pocket.py	Adds YAML config loading + checkpoint key validation + safer exception handling around optimizer restore.
atomic_1bit/tokenizers/base.py	Makes `eot_token` an `@abstractmethod` property instead of raising `NotImplementedError`.
atomic_1bit/tests/debug_crash.py	Narrows exception handling from bare `except:` to `Exception`.
atomic_1bit/evaluation/run_eval.py	Replaces bare `except:` with specific decode-related exceptions.
atomic_1bit/core/backends/metal_kernel.mm	Attempts to fix metallib path resolution and switches to zero-copy Metal buffers.
atomic_1bit/core/backends/cuda_kernel.cu	Adds detailed inline documentation for kernel tiling/shared-memory/synchronization/launch config.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-29T03:06:01Z

+if os.path.exists(_config_path):
+    with open(_config_path, "r") as _f:
+        _cfg = yaml.safe_load(_f)
+    print(f"[Config] Loaded from {_config_path}")
+else:
+    print(f"[Config] File not found: {_config_path}, using hardcoded defaults")
+
+_model = _cfg.get("model", {})
+_training = _cfg.get("training", {})


yaml.safe_load() can return None (e.g., empty config file). In that case _cfg.get(...) will raise an AttributeError at import time. Consider defaulting to {} (e.g., yaml.safe_load(_f) or {}) before accessing .get().

Copilot · 2026-03-29T03:06:01Z

+            print(f"   Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}")
+        if unexpected:
+            print(f"   Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}")
+        result = model.load_state_dict(state_dict, strict=False)
+        if result.missing_keys or result.unexpected_keys:
+            print(f"   load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}")


Printing sorted(missing) / sorted(unexpected) can dump hundreds/thousands of state_dict keys for large checkpoints, which can significantly slow startup and spam logs. Consider printing counts plus a small sample (first N keys), and/or rely on the load_state_dict(...).missing_keys/unexpected_keys lists instead.

Suggested change

print(f" Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}")

if unexpected:

print(f" Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}")

result = model.load_state_dict(state_dict, strict=False)

if result.missing_keys or result.unexpected_keys:

print(f" load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}")

sorted_missing = sorted(missing)

sample_missing = sorted_missing[:10]

suffix = " ..." if len(sorted_missing) > len(sample_missing) else ""

print(

f" Warning: {len(sorted_missing)} missing key(s) in checkpoint "

f"(showing up to 10): {sample_missing}{suffix}"

)

if unexpected:

sorted_unexpected = sorted(unexpected)

sample_unexpected = sorted_unexpected[:10]

suffix = " ..." if len(sorted_unexpected) > len(sample_unexpected) else ""

print(

f" Warning: {len(sorted_unexpected)} unexpected key(s) in checkpoint "

f"(showing up to 10): {sample_unexpected}{suffix}"

)

result = model.load_state_dict(state_dict, strict=False)

if result.missing_keys or result.unexpected_keys:

missing_result = list(result.missing_keys)

unexpected_result = list(result.unexpected_keys)

sample_missing_result = missing_result[:10]

sample_unexpected_result = unexpected_result[:10]

missing_suffix = " ..." if len(missing_result) > len(sample_missing_result) else ""

unexpected_suffix = " ..." if len(unexpected_result) > len(sample_unexpected_result) else ""

print(

" load_state_dict result — "

f"missing ({len(missing_result)}, showing up to 10): {sample_missing_result}{missing_suffix}, "

f"unexpected ({len(unexpected_result)}, showing up to 10): {sample_unexpected_result}{unexpected_suffix}"

)

Copilot · 2026-03-29T03:06:02Z

+_cfg = {}
+_config_path = _args.config or os.path.join(
+    os.path.dirname(__file__), "..", "..", "configs", "flagship_12m.yaml"
+)
+if os.path.exists(_config_path):
+    with open(_config_path, "r") as _f:
+        _cfg = yaml.safe_load(_f)
+    print(f"[Config] Loaded from {_config_path}")
+else:
+    print(f"[Config] File not found: {_config_path}, using hardcoded defaults")
+
+_model = _cfg.get("model", {})
+_training = _cfg.get("training", {})
+


yaml.safe_load() can return None (e.g., empty config file). In that case _cfg.get(...) will raise an AttributeError at import time. Consider defaulting to {} (e.g., yaml.safe_load(_f) or {}) before accessing .get().

Copilot · 2026-03-29T03:06:02Z

+            model_keys = set(model.state_dict().keys())
+            ckpt_keys = set(state_dict.keys())
+            missing = model_keys - ckpt_keys
+            unexpected = ckpt_keys - model_keys
+            if missing:
+                print(f"   Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}")
+            if unexpected:
+                print(f"   Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}")
+            result = model.load_state_dict(state_dict, strict=False)
+            if result.missing_keys or result.unexpected_keys:
+                print(f"   load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}")


Printing full sorted(missing) / sorted(unexpected) key lists can be extremely verbose for large models and can slow checkpoint resume due to large console output. Consider logging only counts plus a small sample, and/or rely on the load_state_dict(...).missing_keys/unexpected_keys output.

Copilot · 2026-03-29T03:06:02Z

+_cfg = {}
+_config_path = _args.config or os.path.join(
+    os.path.dirname(__file__), "..", "..", "configs", "pocket_4k.yaml"
+)
+if os.path.exists(_config_path):
+    with open(_config_path, "r") as _f:
+        _cfg = yaml.safe_load(_f)
+    print(f"[Config] Loaded from {_config_path}")
+else:
+    print(f"[Config] File not found: {_config_path}, using hardcoded defaults")
+
+_model = _cfg.get("model", {})
+_training = _cfg.get("training", {})
+


yaml.safe_load() can return None (e.g., empty config file). In that case _cfg.get(...) will raise an AttributeError at import time. Consider defaulting to {} (e.g., yaml.safe_load(_f) or {}) before accessing .get().

Copilot · 2026-03-29T03:06:02Z

+    // Resolve metallib path relative to this source file's directory.
+    // __FILE__ expands to the full path of this .mm file at compile time:
+    //   .../atomic_1bit/core/backends/metal_kernel.mm
+    // Deleting the last path component ("backends") gives us the "core/"
+    // directory, which is where default.metallib is placed by the Makefile.
+    NSString *sourceDir =
+        [[[NSString stringWithUTF8String:__FILE__]
+            stringByDeletingLastPathComponent]   // strip "metal_kernel.mm"
+            stringByDeletingLastPathComponent];  // strip "backends/"
+    NSString *libPath =
+        [sourceDir stringByAppendingPathComponent:@"default.metallib"];
    NSURL *libOnDisk = [NSURL fileURLWithPath:libPath];

    id<MTLLibrary> defaultLibrary = [device newLibraryWithURL:libOnDisk
                                                        error:&error];
    if (!defaultLibrary) {
-      // Fallback: try just "default.metallib" if running from core/
+      // Fallback: try "default.metallib" relative to cwd (e.g. when running
+      // from atomic_1bit/core/ directly during development).
      libOnDisk = [NSURL fileURLWithPath:@"default.metallib"];
      defaultLibrary = [device newLibraryWithURL:libOnDisk error:&error];
    }


Resolving default.metallib via __FILE__ is brittle: __FILE__ points to the source path on the build machine, which often won't exist on the runtime machine/container (even though default.metallib sits next to the built libatomic.so). Consider resolving the metallib path relative to the loaded shared library location (e.g., dladdr on ternary_matmul / MetalContext symbols) or allowing an override via env var.

Copilot · 2026-03-29T03:06:03Z

+Loads a binary model file produced by `atomic_1bit/utils/export_to_cpp.py` into an `AtomicModel` struct.
+
+**Parameters:**
+- `filename` — Path to the `.bin` model file.
+- `model` — Output parameter. The struct is populated in-place.
+
+**Returns:** `true` on success, `false` if the file cannot be opened.
+
+**Notes:**
+- The binary format does not include a magic-number check in this header version. The runner (`atomic_runner.cpp`) does validate the `ATOM` magic bytes and version before using this path.
+- Weights are stored in the following order: config header, optional gist vector, token embeddings, positional embeddings, per-layer weights (ln1, q/k/v/o, ln2, fc1, fc2), final norm, language model head.
+- Float tensors (`ln1`, `ln2`, `ln_f`, embeddings) are stored as raw `float32`. Ternary weight tensors are stored as raw `int8_t` with values in `{-1, 0, 1}`.
+- The caller is responsible for keeping `model` alive for as long as inference runs, since `forward` holds non-owning pointers into the vectors.


load_model is documented as loading binaries produced by atomic_1bit/utils/export_to_cpp.py, but embedded/atomic_lib.h currently reads a legacy format (no magic/version header and no per-weight scale prefixes). As written, users following this doc will generate .bin files that atomic_lib.h cannot parse correctly. Please either update the doc to clearly distinguish the legacy vs current format (and how to export each), or update atomic_lib.h to support the current exporter format.

Copilot · 2026-03-29T03:06:03Z

+  // Zero-copy buffers: wrap the caller's memory directly using
+  // newBufferWithBytesNoCopy with MTLResourceStorageModeShared.
+  // This is safe on M-series (Apple Silicon) unified memory because the CPU
+  // and GPU share the same physical memory — no explicit DMA transfer occurs
+  // and no staging copy is needed. The nil deallocator means Metal will NOT
+  // free the underlying memory when the buffer is released; the caller (Python
+  // / ctypes) owns and manages the lifetime of A, B_transposed, and C.
+  // Note: on Apple Silicon, MTLResourceStorageModeShared does not require
+  // page alignment for correctness, though aligned allocations are faster.
+  id<MTLBuffer> bufferA = [g_metal_context->device
+      newBufferWithBytesNoCopy:(void *)A
+                        length:(M * K) * sizeof(int8_t)
+                       options:MTLResourceStorageModeShared
+                   deallocator:nil];
+  id<MTLBuffer> bufferB = [g_metal_context->device
+      newBufferWithBytesNoCopy:(void *)B_transposed
+                        length:(N * K) * sizeof(int8_t)
+                       options:MTLResourceStorageModeShared
+                   deallocator:nil];
+  // bufferC wraps the output array directly; after GPU execution the results
+  // are already in C — no memcpy needed.
+  id<MTLBuffer> bufferC = [g_metal_context->device
+      newBufferWithBytesNoCopy:(void *)C
+                        length:(M * N) * sizeof(int32_t)
+                       options:MTLResourceStorageModeShared
+                   deallocator:nil];


The switch to newBufferWithBytesNoCopy is likely unsafe with NumPy/ctypes-allocated memory: Metal’s no-copy buffer APIs generally require page-aligned memory in a single VM region, and the backing memory must remain valid for the lifetime of the buffer and any in-flight GPU work. With typical malloc-backed NumPy arrays this can fail or cause undefined behavior, and the code doesn’t check for nil buffers / provide a fallback. Consider keeping newBufferWithBytes (copy) as the safe default, or implement a runtime fallback (if no-copy returns nil) and/or enforce page-aligned allocations on the Python side.

Copilot · 2026-03-29T03:06:03Z

+  cd atomic_1bit/core
+  make BACKEND=METAL
+  ```
+- The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/` and cleaned by `make clean`.


This line says default.metallib is cleaned by make clean, but atomic_1bit/core/Makefile currently removes backends/*.air and backends/*.metallib and does not remove default.metallib in the core directory. Either adjust the docs or update the Makefile clean target to remove default.metallib as well.

Suggested change

- The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/` and cleaned by `make clean`.

- The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/`, but `make clean` only removes `backends/*.air` and `backends/*.metallib`; remove `default.metallib` in `atomic_1bit/core/` manually if needed.

Copilot · 2026-03-29T03:06:03Z

+
+import pytest
+import torch
+import torch.nn as nn


Unused import: import torch.nn as nn is not referenced in this test module. This will be flagged by flake8 (F401) and makes the test harder to keep clean; please remove it.

Suggested change

import torch.nn as nn

Resolved conflicts: - CONTRIBUTING.md: kept master's more detailed version (from PR #3) - README.md: kept launch-prep style with badges; corrected stale test count - debug_crash.py: deleted per PR intent (scratch debug file) Doc fixes: - docs/INSTALL.md: corrected 'make metal/cuda' to 'make BACKEND=METAL/CUDA' (Makefile uses env var) - README.md: removed hard-coded test count (was '67 tests')

Overhauls README, adds INSTALL/USAGE docs, examples/ folder (quickstart and generation demos), package __init__.py, expanded .gitignore, gist.py docstring cleanup. Removes debug_crash.py and v1.3_quality_report.txt scratch files. Resolved against PR #3's CONTRIBUTING.md and docs additions.

Copilot AI review requested due to automatic review settings March 29, 2026 02:58

Copilot started reviewing on behalf of guirguispierre March 29, 2026 02:58 View session

fix: correct thermal test calling nonexistent get_current_temp method

a670c13

The test referenced get_current_temp() which never existed on ThermalMonitor. Updated to use _get_max_temp() and accept None for unavailable sensors.

guirguispierre merged commit 22f1286 into master Mar 29, 2026
2 checks passed

guirguispierre deleted the improvements/codebase-hardening branch March 29, 2026 03:05

Copilot AI reviewed Mar 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harden codebase: error handling, configs, tests, and docs#3

Harden codebase: error handling, configs, tests, and docs#3
guirguispierre merged 2 commits into
masterfrom
improvements/codebase-hardening

guirguispierre commented Mar 29, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Copilot AI Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-            print(f"   Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}")
-        if unexpected:
-            print(f"   Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}")
-        result = model.load_state_dict(state_dict, strict=False)
-        if result.missing_keys or result.unexpected_keys:
-            print(f"   load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}")
+            sorted_missing = sorted(missing)
+            sample_missing = sorted_missing[:10]
+            suffix = " ..." if len(sorted_missing) > len(sample_missing) else ""
+            print(
+                f"   Warning: {len(sorted_missing)} missing key(s) in checkpoint "
+                f"(showing up to 10): {sample_missing}{suffix}"
+            )
+        if unexpected:
+            sorted_unexpected = sorted(unexpected)
+            sample_unexpected = sorted_unexpected[:10]
+            suffix = " ..." if len(sorted_unexpected) > len(sample_unexpected) else ""
+            print(
+                f"   Warning: {len(sorted_unexpected)} unexpected key(s) in checkpoint "
+                f"(showing up to 10): {sample_unexpected}{suffix}"
+            )
+        result = model.load_state_dict(state_dict, strict=False)
+        if result.missing_keys or result.unexpected_keys:
+            missing_result = list(result.missing_keys)
+            unexpected_result = list(result.unexpected_keys)
+            sample_missing_result = missing_result[:10]
+            sample_unexpected_result = unexpected_result[:10]
+            missing_suffix = " ..." if len(missing_result) > len(sample_missing_result) else ""
+            unexpected_suffix = " ..." if len(unexpected_result) > len(sample_unexpected_result) else ""
+            print(
+                "   load_state_dict result — "
+                f"missing ({len(missing_result)}, showing up to 10): {sample_missing_result}{missing_suffix}, "
+                f"unexpected ({len(unexpected_result)}, showing up to 10): {sample_unexpected_result}{unexpected_suffix}"
+            )

	- The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/` and cleaned by `make clean`.
	- The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/`, but `make clean` only removes `backends/.air` and `backends/.metallib`; remove `default.metallib` in `atomic_1bit/core/` manually if needed.

Conversation

guirguispierre commented Mar 29, 2026

Summary

Test plan

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants