Skip to content

Harden codebase: error handling, configs, tests, and docs#3

Merged
guirguispierre merged 2 commits into
masterfrom
improvements/codebase-hardening
Mar 29, 2026
Merged

Harden codebase: error handling, configs, tests, and docs#3
guirguispierre merged 2 commits into
masterfrom
improvements/codebase-hardening

Conversation

@guirguispierre
Copy link
Copy Markdown
Owner

Summary

  • Replace 14 bare except: blocks with specific exception types across training, eval, benchmark, and utility scripts
  • Fix Metal backend: resolve metallib path relative to __FILE__ instead of hardcoded CWD-relative path; use zero-copy buffers (newBufferWithBytesNoCopy) for M-series unified memory
  • Wire YAML configs to all 3 training scripts via --config flag (backward-compatible defaults)
  • Add checkpoint validation with missing/unexpected key warnings before load_state_dict
  • Fix thermal monitor to log warnings on sensor errors instead of silently returning 0.0
  • Fix tokenizer base class to use @abstractmethod instead of manual raise NotImplementedError
  • Add CUDA kernel documentation with inline comments for kernel launch config, shared memory, and sync points
  • Add 58 new tests: E2E pipeline (train → export → binary validation), quantization edge cases (NaN, Inf, zeros, extremes), eval/benchmark smoke tests
  • Add docs: API_REFERENCE.md for the C++ inference API, TROUBLESHOOTING.md for common errors and production deployment

Test plan

  • Run pytest tests/ to verify all 58 new tests pass
  • Run python3 atomic_1bit/python/inference.py to verify parity still holds
  • Build Metal backend (make BACKEND=METAL) and verify metallib path resolves correctly
  • Run training with --config flag: python3 atomic_1bit/training/train.py --config configs/stories_base.yaml
  • Verify existing training scripts still work without --config (backward compat)

- Replace 14 bare except: blocks with specific exception types
- Fix hardcoded Metal metallib path to resolve relative to __FILE__
- Use zero-copy Metal buffers (newBufferWithBytesNoCopy) for M-series
- Wire YAML config files to training scripts via --config flag
- Add checkpoint validation with missing/unexpected key warnings
- Fix thermal monitor to log warnings instead of silent failure
- Use @AbstractMethod in tokenizer base class
- Add CUDA kernel inline documentation
- Add 58 new tests: E2E pipeline, quantization edge cases, eval smoke
- Add docs: API_REFERENCE.md, TROUBLESHOOTING.md
Copilot AI review requested due to automatic review settings March 29, 2026 02:58
The test referenced get_current_temp() which never existed on ThermalMonitor.
Updated to use _get_max_temp() and accept None for unavailable sensors.
@guirguispierre guirguispierre merged commit 22f1286 into master Mar 29, 2026
2 checks passed
@guirguispierre guirguispierre deleted the improvements/codebase-hardening branch March 29, 2026 03:05
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the Atomic-1Bit codebase by tightening exception handling, adding YAML-driven training configs and checkpoint validation, improving backend robustness/perf (Metal), and adding substantial test + documentation coverage.

Changes:

  • Add new pytest coverage (quantization edge cases, eval/benchmark smoke tests, and an end-to-end train→export→binary-validate pipeline).
  • Introduce --config YAML loading and checkpoint key validation warnings across training scripts.
  • Improve backend and utility robustness (Metal metallib path + zero-copy buffers, thermal monitor logging, bare except: removal, tokenizer abstractmethod fix), plus add API/troubleshooting docs.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
tests/test_quantization_edge_cases.py Adds BitLinear/quantization edge-case tests (NaN/Inf/zeros/extremes, gradients).
tests/test_eval_benchmark.py Adds import smoke tests and deterministic metric checks for evaluation + benchmark modules.
tests/test_e2e_pipeline.py Adds slow-marked end-to-end pipeline test (train steps, checkpoint reload, export header checks).
scripts/run_decoder.py Replaces bare except: with specific decode-related exceptions.
scripts/evaluate_quality.py Replaces bare except: with specific decode-related exceptions.
pytest.ini Defines slow marker for pytest selection.
docs/TROUBLESHOOTING.md Adds extensive troubleshooting + deployment guide covering build/runtime/export issues.
docs/API_REFERENCE.md Adds C++ inference API reference for embedded/atomic_lib.h and binary format notes.
benchmarks/run_suite.py Replaces bare except: blocks with more specific exceptions.
atomic_1bit/utils/thermal.py Logs sensor read failures and degrades gracefully instead of returning 0.0 silently.
atomic_1bit/training/train.py Adds YAML config loading + checkpoint key validation before load_state_dict.
atomic_1bit/training/train_instruct.py Adds YAML config loading + checkpoint key validation + safer exception handling around optimizer/scheduler restore.
atomic_1bit/training/train_pocket.py Adds YAML config loading + checkpoint key validation + safer exception handling around optimizer restore.
atomic_1bit/tokenizers/base.py Makes eot_token an @abstractmethod property instead of raising NotImplementedError.
atomic_1bit/tests/debug_crash.py Narrows exception handling from bare except: to Exception.
atomic_1bit/evaluation/run_eval.py Replaces bare except: with specific decode-related exceptions.
atomic_1bit/core/backends/metal_kernel.mm Attempts to fix metallib path resolution and switches to zero-copy Metal buffers.
atomic_1bit/core/backends/cuda_kernel.cu Adds detailed inline documentation for kernel tiling/shared-memory/synchronization/launch config.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +29 to +37
if os.path.exists(_config_path):
with open(_config_path, "r") as _f:
_cfg = yaml.safe_load(_f)
print(f"[Config] Loaded from {_config_path}")
else:
print(f"[Config] File not found: {_config_path}, using hardcoded defaults")

_model = _cfg.get("model", {})
_training = _cfg.get("training", {})
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yaml.safe_load() can return None (e.g., empty config file). In that case _cfg.get(...) will raise an AttributeError at import time. Consider defaulting to {} (e.g., yaml.safe_load(_f) or {}) before accessing .get().

Copilot uses AI. Check for mistakes.
Comment on lines +206 to +211
print(f" Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}")
if unexpected:
print(f" Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}")
result = model.load_state_dict(state_dict, strict=False)
if result.missing_keys or result.unexpected_keys:
print(f" load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}")
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Printing sorted(missing) / sorted(unexpected) can dump hundreds/thousands of state_dict keys for large checkpoints, which can significantly slow startup and spam logs. Consider printing counts plus a small sample (first N keys), and/or rely on the load_state_dict(...).missing_keys/unexpected_keys lists instead.

Suggested change
print(f" Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}")
if unexpected:
print(f" Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}")
result = model.load_state_dict(state_dict, strict=False)
if result.missing_keys or result.unexpected_keys:
print(f" load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}")
sorted_missing = sorted(missing)
sample_missing = sorted_missing[:10]
suffix = " ..." if len(sorted_missing) > len(sample_missing) else ""
print(
f" Warning: {len(sorted_missing)} missing key(s) in checkpoint "
f"(showing up to 10): {sample_missing}{suffix}"
)
if unexpected:
sorted_unexpected = sorted(unexpected)
sample_unexpected = sorted_unexpected[:10]
suffix = " ..." if len(sorted_unexpected) > len(sample_unexpected) else ""
print(
f" Warning: {len(sorted_unexpected)} unexpected key(s) in checkpoint "
f"(showing up to 10): {sample_unexpected}{suffix}"
)
result = model.load_state_dict(state_dict, strict=False)
if result.missing_keys or result.unexpected_keys:
missing_result = list(result.missing_keys)
unexpected_result = list(result.unexpected_keys)
sample_missing_result = missing_result[:10]
sample_unexpected_result = unexpected_result[:10]
missing_suffix = " ..." if len(missing_result) > len(sample_missing_result) else ""
unexpected_suffix = " ..." if len(unexpected_result) > len(sample_unexpected_result) else ""
print(
" load_state_dict result — "
f"missing ({len(missing_result)}, showing up to 10): {sample_missing_result}{missing_suffix}, "
f"unexpected ({len(unexpected_result)}, showing up to 10): {sample_unexpected_result}{unexpected_suffix}"
)

Copilot uses AI. Check for mistakes.
Comment on lines +32 to +45
_cfg = {}
_config_path = _args.config or os.path.join(
os.path.dirname(__file__), "..", "..", "configs", "flagship_12m.yaml"
)
if os.path.exists(_config_path):
with open(_config_path, "r") as _f:
_cfg = yaml.safe_load(_f)
print(f"[Config] Loaded from {_config_path}")
else:
print(f"[Config] File not found: {_config_path}, using hardcoded defaults")

_model = _cfg.get("model", {})
_training = _cfg.get("training", {})

Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yaml.safe_load() can return None (e.g., empty config file). In that case _cfg.get(...) will raise an AttributeError at import time. Consider defaulting to {} (e.g., yaml.safe_load(_f) or {}) before accessing .get().

Copilot uses AI. Check for mistakes.
Comment on lines +272 to +282
model_keys = set(model.state_dict().keys())
ckpt_keys = set(state_dict.keys())
missing = model_keys - ckpt_keys
unexpected = ckpt_keys - model_keys
if missing:
print(f" Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}")
if unexpected:
print(f" Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}")
result = model.load_state_dict(state_dict, strict=False)
if result.missing_keys or result.unexpected_keys:
print(f" load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}")
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Printing full sorted(missing) / sorted(unexpected) key lists can be extremely verbose for large models and can slow checkpoint resume due to large console output. Consider logging only counts plus a small sample, and/or rely on the load_state_dict(...).missing_keys/unexpected_keys output.

Copilot uses AI. Check for mistakes.
Comment on lines +27 to +40
_cfg = {}
_config_path = _args.config or os.path.join(
os.path.dirname(__file__), "..", "..", "configs", "pocket_4k.yaml"
)
if os.path.exists(_config_path):
with open(_config_path, "r") as _f:
_cfg = yaml.safe_load(_f)
print(f"[Config] Loaded from {_config_path}")
else:
print(f"[Config] File not found: {_config_path}, using hardcoded defaults")

_model = _cfg.get("model", {})
_training = _cfg.get("training", {})

Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yaml.safe_load() can return None (e.g., empty config file). In that case _cfg.get(...) will raise an AttributeError at import time. Consider defaulting to {} (e.g., yaml.safe_load(_f) or {}) before accessing .get().

Copilot uses AI. Check for mistakes.
Comment on lines +31 to 51
// Resolve metallib path relative to this source file's directory.
// __FILE__ expands to the full path of this .mm file at compile time:
// .../atomic_1bit/core/backends/metal_kernel.mm
// Deleting the last path component ("backends") gives us the "core/"
// directory, which is where default.metallib is placed by the Makefile.
NSString *sourceDir =
[[[NSString stringWithUTF8String:__FILE__]
stringByDeletingLastPathComponent] // strip "metal_kernel.mm"
stringByDeletingLastPathComponent]; // strip "backends/"
NSString *libPath =
[sourceDir stringByAppendingPathComponent:@"default.metallib"];
NSURL *libOnDisk = [NSURL fileURLWithPath:libPath];

id<MTLLibrary> defaultLibrary = [device newLibraryWithURL:libOnDisk
error:&error];
if (!defaultLibrary) {
// Fallback: try just "default.metallib" if running from core/
// Fallback: try "default.metallib" relative to cwd (e.g. when running
// from atomic_1bit/core/ directly during development).
libOnDisk = [NSURL fileURLWithPath:@"default.metallib"];
defaultLibrary = [device newLibraryWithURL:libOnDisk error:&error];
}
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving default.metallib via __FILE__ is brittle: __FILE__ points to the source path on the build machine, which often won't exist on the runtime machine/container (even though default.metallib sits next to the built libatomic.so). Consider resolving the metallib path relative to the loaded shared library location (e.g., dladdr on ternary_matmul / MetalContext symbols) or allowing an override via env var.

Copilot uses AI. Check for mistakes.
Comment thread docs/API_REFERENCE.md
Comment on lines +72 to +84
Loads a binary model file produced by `atomic_1bit/utils/export_to_cpp.py` into an `AtomicModel` struct.

**Parameters:**
- `filename` — Path to the `.bin` model file.
- `model` — Output parameter. The struct is populated in-place.

**Returns:** `true` on success, `false` if the file cannot be opened.

**Notes:**
- The binary format does not include a magic-number check in this header version. The runner (`atomic_runner.cpp`) does validate the `ATOM` magic bytes and version before using this path.
- Weights are stored in the following order: config header, optional gist vector, token embeddings, positional embeddings, per-layer weights (ln1, q/k/v/o, ln2, fc1, fc2), final norm, language model head.
- Float tensors (`ln1`, `ln2`, `ln_f`, embeddings) are stored as raw `float32`. Ternary weight tensors are stored as raw `int8_t` with values in `{-1, 0, 1}`.
- The caller is responsible for keeping `model` alive for as long as inference runs, since `forward` holds non-owning pointers into the vectors.
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_model is documented as loading binaries produced by atomic_1bit/utils/export_to_cpp.py, but embedded/atomic_lib.h currently reads a legacy format (no magic/version header and no per-weight scale prefixes). As written, users following this doc will generate .bin files that atomic_lib.h cannot parse correctly. Please either update the doc to clearly distinguish the legacy vs current format (and how to export each), or update atomic_lib.h to support the current exporter format.

Copilot uses AI. Check for mistakes.
Comment on lines +87 to +112
// Zero-copy buffers: wrap the caller's memory directly using
// newBufferWithBytesNoCopy with MTLResourceStorageModeShared.
// This is safe on M-series (Apple Silicon) unified memory because the CPU
// and GPU share the same physical memory — no explicit DMA transfer occurs
// and no staging copy is needed. The nil deallocator means Metal will NOT
// free the underlying memory when the buffer is released; the caller (Python
// / ctypes) owns and manages the lifetime of A, B_transposed, and C.
// Note: on Apple Silicon, MTLResourceStorageModeShared does not require
// page alignment for correctness, though aligned allocations are faster.
id<MTLBuffer> bufferA = [g_metal_context->device
newBufferWithBytesNoCopy:(void *)A
length:(M * K) * sizeof(int8_t)
options:MTLResourceStorageModeShared
deallocator:nil];
id<MTLBuffer> bufferB = [g_metal_context->device
newBufferWithBytesNoCopy:(void *)B_transposed
length:(N * K) * sizeof(int8_t)
options:MTLResourceStorageModeShared
deallocator:nil];
// bufferC wraps the output array directly; after GPU execution the results
// are already in C — no memcpy needed.
id<MTLBuffer> bufferC = [g_metal_context->device
newBufferWithBytesNoCopy:(void *)C
length:(M * N) * sizeof(int32_t)
options:MTLResourceStorageModeShared
deallocator:nil];
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The switch to newBufferWithBytesNoCopy is likely unsafe with NumPy/ctypes-allocated memory: Metal’s no-copy buffer APIs generally require page-aligned memory in a single VM region, and the backing memory must remain valid for the lifetime of the buffer and any in-flight GPU work. With typical malloc-backed NumPy arrays this can fail or cause undefined behavior, and the code doesn’t check for nil buffers / provide a fallback. Consider keeping newBufferWithBytes (copy) as the safe default, or implement a runtime fallback (if no-copy returns nil) and/or enforce page-aligned allocations on the Python side.

Copilot uses AI. Check for mistakes.
Comment thread docs/TROUBLESHOOTING.md
cd atomic_1bit/core
make BACKEND=METAL
```
- The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/` and cleaned by `make clean`.
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line says default.metallib is cleaned by make clean, but atomic_1bit/core/Makefile currently removes backends/*.air and backends/*.metallib and does not remove default.metallib in the core directory. Either adjust the docs or update the Makefile clean target to remove default.metallib as well.

Suggested change
- The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/` and cleaned by `make clean`.
- The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/`, but `make clean` only removes `backends/*.air` and `backends/*.metallib`; remove `default.metallib` in `atomic_1bit/core/` manually if needed.

Copilot uses AI. Check for mistakes.

import pytest
import torch
import torch.nn as nn
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import: import torch.nn as nn is not referenced in this test module. This will be flagged by flake8 (F401) and makes the test harder to keep clean; please remove it.

Suggested change
import torch.nn as nn

Copilot uses AI. Check for mistakes.
guirguispierre added a commit that referenced this pull request Apr 28, 2026
Resolved conflicts:
- CONTRIBUTING.md: kept master's more detailed version (from PR #3)
- README.md: kept launch-prep style with badges; corrected stale test count
- debug_crash.py: deleted per PR intent (scratch debug file)

Doc fixes:
- docs/INSTALL.md: corrected 'make metal/cuda' to 'make BACKEND=METAL/CUDA' (Makefile uses env var)
- README.md: removed hard-coded test count (was '67 tests')
guirguispierre added a commit that referenced this pull request Apr 28, 2026
Overhauls README, adds INSTALL/USAGE docs, examples/ folder (quickstart and generation demos), package __init__.py, expanded .gitignore, gist.py docstring cleanup. Removes debug_crash.py and v1.3_quality_report.txt scratch files. Resolved against PR #3's CONTRIBUTING.md and docs additions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants