Harden codebase: error handling, configs, tests, and docs#3
Conversation
- Replace 14 bare except: blocks with specific exception types - Fix hardcoded Metal metallib path to resolve relative to __FILE__ - Use zero-copy Metal buffers (newBufferWithBytesNoCopy) for M-series - Wire YAML config files to training scripts via --config flag - Add checkpoint validation with missing/unexpected key warnings - Fix thermal monitor to log warnings instead of silent failure - Use @AbstractMethod in tokenizer base class - Add CUDA kernel inline documentation - Add 58 new tests: E2E pipeline, quantization edge cases, eval smoke - Add docs: API_REFERENCE.md, TROUBLESHOOTING.md
The test referenced get_current_temp() which never existed on ThermalMonitor. Updated to use _get_max_temp() and accept None for unavailable sensors.
There was a problem hiding this comment.
Pull request overview
This PR hardens the Atomic-1Bit codebase by tightening exception handling, adding YAML-driven training configs and checkpoint validation, improving backend robustness/perf (Metal), and adding substantial test + documentation coverage.
Changes:
- Add new pytest coverage (quantization edge cases, eval/benchmark smoke tests, and an end-to-end train→export→binary-validate pipeline).
- Introduce
--configYAML loading and checkpoint key validation warnings across training scripts. - Improve backend and utility robustness (Metal metallib path + zero-copy buffers, thermal monitor logging, bare
except:removal, tokenizer abstractmethod fix), plus add API/troubleshooting docs.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_quantization_edge_cases.py | Adds BitLinear/quantization edge-case tests (NaN/Inf/zeros/extremes, gradients). |
| tests/test_eval_benchmark.py | Adds import smoke tests and deterministic metric checks for evaluation + benchmark modules. |
| tests/test_e2e_pipeline.py | Adds slow-marked end-to-end pipeline test (train steps, checkpoint reload, export header checks). |
| scripts/run_decoder.py | Replaces bare except: with specific decode-related exceptions. |
| scripts/evaluate_quality.py | Replaces bare except: with specific decode-related exceptions. |
| pytest.ini | Defines slow marker for pytest selection. |
| docs/TROUBLESHOOTING.md | Adds extensive troubleshooting + deployment guide covering build/runtime/export issues. |
| docs/API_REFERENCE.md | Adds C++ inference API reference for embedded/atomic_lib.h and binary format notes. |
| benchmarks/run_suite.py | Replaces bare except: blocks with more specific exceptions. |
| atomic_1bit/utils/thermal.py | Logs sensor read failures and degrades gracefully instead of returning 0.0 silently. |
| atomic_1bit/training/train.py | Adds YAML config loading + checkpoint key validation before load_state_dict. |
| atomic_1bit/training/train_instruct.py | Adds YAML config loading + checkpoint key validation + safer exception handling around optimizer/scheduler restore. |
| atomic_1bit/training/train_pocket.py | Adds YAML config loading + checkpoint key validation + safer exception handling around optimizer restore. |
| atomic_1bit/tokenizers/base.py | Makes eot_token an @abstractmethod property instead of raising NotImplementedError. |
| atomic_1bit/tests/debug_crash.py | Narrows exception handling from bare except: to Exception. |
| atomic_1bit/evaluation/run_eval.py | Replaces bare except: with specific decode-related exceptions. |
| atomic_1bit/core/backends/metal_kernel.mm | Attempts to fix metallib path resolution and switches to zero-copy Metal buffers. |
| atomic_1bit/core/backends/cuda_kernel.cu | Adds detailed inline documentation for kernel tiling/shared-memory/synchronization/launch config. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if os.path.exists(_config_path): | ||
| with open(_config_path, "r") as _f: | ||
| _cfg = yaml.safe_load(_f) | ||
| print(f"[Config] Loaded from {_config_path}") | ||
| else: | ||
| print(f"[Config] File not found: {_config_path}, using hardcoded defaults") | ||
|
|
||
| _model = _cfg.get("model", {}) | ||
| _training = _cfg.get("training", {}) |
There was a problem hiding this comment.
yaml.safe_load() can return None (e.g., empty config file). In that case _cfg.get(...) will raise an AttributeError at import time. Consider defaulting to {} (e.g., yaml.safe_load(_f) or {}) before accessing .get().
| print(f" Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}") | ||
| if unexpected: | ||
| print(f" Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}") | ||
| result = model.load_state_dict(state_dict, strict=False) | ||
| if result.missing_keys or result.unexpected_keys: | ||
| print(f" load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}") |
There was a problem hiding this comment.
Printing sorted(missing) / sorted(unexpected) can dump hundreds/thousands of state_dict keys for large checkpoints, which can significantly slow startup and spam logs. Consider printing counts plus a small sample (first N keys), and/or rely on the load_state_dict(...).missing_keys/unexpected_keys lists instead.
| print(f" Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}") | |
| if unexpected: | |
| print(f" Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}") | |
| result = model.load_state_dict(state_dict, strict=False) | |
| if result.missing_keys or result.unexpected_keys: | |
| print(f" load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}") | |
| sorted_missing = sorted(missing) | |
| sample_missing = sorted_missing[:10] | |
| suffix = " ..." if len(sorted_missing) > len(sample_missing) else "" | |
| print( | |
| f" Warning: {len(sorted_missing)} missing key(s) in checkpoint " | |
| f"(showing up to 10): {sample_missing}{suffix}" | |
| ) | |
| if unexpected: | |
| sorted_unexpected = sorted(unexpected) | |
| sample_unexpected = sorted_unexpected[:10] | |
| suffix = " ..." if len(sorted_unexpected) > len(sample_unexpected) else "" | |
| print( | |
| f" Warning: {len(sorted_unexpected)} unexpected key(s) in checkpoint " | |
| f"(showing up to 10): {sample_unexpected}{suffix}" | |
| ) | |
| result = model.load_state_dict(state_dict, strict=False) | |
| if result.missing_keys or result.unexpected_keys: | |
| missing_result = list(result.missing_keys) | |
| unexpected_result = list(result.unexpected_keys) | |
| sample_missing_result = missing_result[:10] | |
| sample_unexpected_result = unexpected_result[:10] | |
| missing_suffix = " ..." if len(missing_result) > len(sample_missing_result) else "" | |
| unexpected_suffix = " ..." if len(unexpected_result) > len(sample_unexpected_result) else "" | |
| print( | |
| " load_state_dict result — " | |
| f"missing ({len(missing_result)}, showing up to 10): {sample_missing_result}{missing_suffix}, " | |
| f"unexpected ({len(unexpected_result)}, showing up to 10): {sample_unexpected_result}{unexpected_suffix}" | |
| ) |
| _cfg = {} | ||
| _config_path = _args.config or os.path.join( | ||
| os.path.dirname(__file__), "..", "..", "configs", "flagship_12m.yaml" | ||
| ) | ||
| if os.path.exists(_config_path): | ||
| with open(_config_path, "r") as _f: | ||
| _cfg = yaml.safe_load(_f) | ||
| print(f"[Config] Loaded from {_config_path}") | ||
| else: | ||
| print(f"[Config] File not found: {_config_path}, using hardcoded defaults") | ||
|
|
||
| _model = _cfg.get("model", {}) | ||
| _training = _cfg.get("training", {}) | ||
|
|
There was a problem hiding this comment.
yaml.safe_load() can return None (e.g., empty config file). In that case _cfg.get(...) will raise an AttributeError at import time. Consider defaulting to {} (e.g., yaml.safe_load(_f) or {}) before accessing .get().
| model_keys = set(model.state_dict().keys()) | ||
| ckpt_keys = set(state_dict.keys()) | ||
| missing = model_keys - ckpt_keys | ||
| unexpected = ckpt_keys - model_keys | ||
| if missing: | ||
| print(f" Warning: {len(missing)} missing key(s) in checkpoint: {sorted(missing)}") | ||
| if unexpected: | ||
| print(f" Warning: {len(unexpected)} unexpected key(s) in checkpoint: {sorted(unexpected)}") | ||
| result = model.load_state_dict(state_dict, strict=False) | ||
| if result.missing_keys or result.unexpected_keys: | ||
| print(f" load_state_dict result — missing: {result.missing_keys}, unexpected: {result.unexpected_keys}") |
There was a problem hiding this comment.
Printing full sorted(missing) / sorted(unexpected) key lists can be extremely verbose for large models and can slow checkpoint resume due to large console output. Consider logging only counts plus a small sample, and/or rely on the load_state_dict(...).missing_keys/unexpected_keys output.
| _cfg = {} | ||
| _config_path = _args.config or os.path.join( | ||
| os.path.dirname(__file__), "..", "..", "configs", "pocket_4k.yaml" | ||
| ) | ||
| if os.path.exists(_config_path): | ||
| with open(_config_path, "r") as _f: | ||
| _cfg = yaml.safe_load(_f) | ||
| print(f"[Config] Loaded from {_config_path}") | ||
| else: | ||
| print(f"[Config] File not found: {_config_path}, using hardcoded defaults") | ||
|
|
||
| _model = _cfg.get("model", {}) | ||
| _training = _cfg.get("training", {}) | ||
|
|
There was a problem hiding this comment.
yaml.safe_load() can return None (e.g., empty config file). In that case _cfg.get(...) will raise an AttributeError at import time. Consider defaulting to {} (e.g., yaml.safe_load(_f) or {}) before accessing .get().
| // Resolve metallib path relative to this source file's directory. | ||
| // __FILE__ expands to the full path of this .mm file at compile time: | ||
| // .../atomic_1bit/core/backends/metal_kernel.mm | ||
| // Deleting the last path component ("backends") gives us the "core/" | ||
| // directory, which is where default.metallib is placed by the Makefile. | ||
| NSString *sourceDir = | ||
| [[[NSString stringWithUTF8String:__FILE__] | ||
| stringByDeletingLastPathComponent] // strip "metal_kernel.mm" | ||
| stringByDeletingLastPathComponent]; // strip "backends/" | ||
| NSString *libPath = | ||
| [sourceDir stringByAppendingPathComponent:@"default.metallib"]; | ||
| NSURL *libOnDisk = [NSURL fileURLWithPath:libPath]; | ||
|
|
||
| id<MTLLibrary> defaultLibrary = [device newLibraryWithURL:libOnDisk | ||
| error:&error]; | ||
| if (!defaultLibrary) { | ||
| // Fallback: try just "default.metallib" if running from core/ | ||
| // Fallback: try "default.metallib" relative to cwd (e.g. when running | ||
| // from atomic_1bit/core/ directly during development). | ||
| libOnDisk = [NSURL fileURLWithPath:@"default.metallib"]; | ||
| defaultLibrary = [device newLibraryWithURL:libOnDisk error:&error]; | ||
| } |
There was a problem hiding this comment.
Resolving default.metallib via __FILE__ is brittle: __FILE__ points to the source path on the build machine, which often won't exist on the runtime machine/container (even though default.metallib sits next to the built libatomic.so). Consider resolving the metallib path relative to the loaded shared library location (e.g., dladdr on ternary_matmul / MetalContext symbols) or allowing an override via env var.
| Loads a binary model file produced by `atomic_1bit/utils/export_to_cpp.py` into an `AtomicModel` struct. | ||
|
|
||
| **Parameters:** | ||
| - `filename` — Path to the `.bin` model file. | ||
| - `model` — Output parameter. The struct is populated in-place. | ||
|
|
||
| **Returns:** `true` on success, `false` if the file cannot be opened. | ||
|
|
||
| **Notes:** | ||
| - The binary format does not include a magic-number check in this header version. The runner (`atomic_runner.cpp`) does validate the `ATOM` magic bytes and version before using this path. | ||
| - Weights are stored in the following order: config header, optional gist vector, token embeddings, positional embeddings, per-layer weights (ln1, q/k/v/o, ln2, fc1, fc2), final norm, language model head. | ||
| - Float tensors (`ln1`, `ln2`, `ln_f`, embeddings) are stored as raw `float32`. Ternary weight tensors are stored as raw `int8_t` with values in `{-1, 0, 1}`. | ||
| - The caller is responsible for keeping `model` alive for as long as inference runs, since `forward` holds non-owning pointers into the vectors. |
There was a problem hiding this comment.
load_model is documented as loading binaries produced by atomic_1bit/utils/export_to_cpp.py, but embedded/atomic_lib.h currently reads a legacy format (no magic/version header and no per-weight scale prefixes). As written, users following this doc will generate .bin files that atomic_lib.h cannot parse correctly. Please either update the doc to clearly distinguish the legacy vs current format (and how to export each), or update atomic_lib.h to support the current exporter format.
| // Zero-copy buffers: wrap the caller's memory directly using | ||
| // newBufferWithBytesNoCopy with MTLResourceStorageModeShared. | ||
| // This is safe on M-series (Apple Silicon) unified memory because the CPU | ||
| // and GPU share the same physical memory — no explicit DMA transfer occurs | ||
| // and no staging copy is needed. The nil deallocator means Metal will NOT | ||
| // free the underlying memory when the buffer is released; the caller (Python | ||
| // / ctypes) owns and manages the lifetime of A, B_transposed, and C. | ||
| // Note: on Apple Silicon, MTLResourceStorageModeShared does not require | ||
| // page alignment for correctness, though aligned allocations are faster. | ||
| id<MTLBuffer> bufferA = [g_metal_context->device | ||
| newBufferWithBytesNoCopy:(void *)A | ||
| length:(M * K) * sizeof(int8_t) | ||
| options:MTLResourceStorageModeShared | ||
| deallocator:nil]; | ||
| id<MTLBuffer> bufferB = [g_metal_context->device | ||
| newBufferWithBytesNoCopy:(void *)B_transposed | ||
| length:(N * K) * sizeof(int8_t) | ||
| options:MTLResourceStorageModeShared | ||
| deallocator:nil]; | ||
| // bufferC wraps the output array directly; after GPU execution the results | ||
| // are already in C — no memcpy needed. | ||
| id<MTLBuffer> bufferC = [g_metal_context->device | ||
| newBufferWithBytesNoCopy:(void *)C | ||
| length:(M * N) * sizeof(int32_t) | ||
| options:MTLResourceStorageModeShared | ||
| deallocator:nil]; |
There was a problem hiding this comment.
The switch to newBufferWithBytesNoCopy is likely unsafe with NumPy/ctypes-allocated memory: Metal’s no-copy buffer APIs generally require page-aligned memory in a single VM region, and the backing memory must remain valid for the lifetime of the buffer and any in-flight GPU work. With typical malloc-backed NumPy arrays this can fail or cause undefined behavior, and the code doesn’t check for nil buffers / provide a fallback. Consider keeping newBufferWithBytes (copy) as the safe default, or implement a runtime fallback (if no-copy returns nil) and/or enforce page-aligned allocations on the Python side.
| cd atomic_1bit/core | ||
| make BACKEND=METAL | ||
| ``` | ||
| - The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/` and cleaned by `make clean`. |
There was a problem hiding this comment.
This line says default.metallib is cleaned by make clean, but atomic_1bit/core/Makefile currently removes backends/*.air and backends/*.metallib and does not remove default.metallib in the core directory. Either adjust the docs or update the Makefile clean target to remove default.metallib as well.
| - The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/` and cleaned by `make clean`. | |
| - The Metal build compiles `backends/kernels.metal` to `backends/kernels.air` and then links `default.metallib`. Both files are generated in `atomic_1bit/core/`, but `make clean` only removes `backends/*.air` and `backends/*.metallib`; remove `default.metallib` in `atomic_1bit/core/` manually if needed. |
|
|
||
| import pytest | ||
| import torch | ||
| import torch.nn as nn |
There was a problem hiding this comment.
Unused import: import torch.nn as nn is not referenced in this test module. This will be flagged by flake8 (F401) and makes the test harder to keep clean; please remove it.
| import torch.nn as nn |
Resolved conflicts: - CONTRIBUTING.md: kept master's more detailed version (from PR #3) - README.md: kept launch-prep style with badges; corrected stale test count - debug_crash.py: deleted per PR intent (scratch debug file) Doc fixes: - docs/INSTALL.md: corrected 'make metal/cuda' to 'make BACKEND=METAL/CUDA' (Makefile uses env var) - README.md: removed hard-coded test count (was '67 tests')
Overhauls README, adds INSTALL/USAGE docs, examples/ folder (quickstart and generation demos), package __init__.py, expanded .gitignore, gist.py docstring cleanup. Removes debug_crash.py and v1.3_quality_report.txt scratch files. Resolved against PR #3's CONTRIBUTING.md and docs additions.
Summary
except:blocks with specific exception types across training, eval, benchmark, and utility scripts__FILE__instead of hardcoded CWD-relative path; use zero-copy buffers (newBufferWithBytesNoCopy) for M-series unified memory--configflag (backward-compatible defaults)load_state_dict@abstractmethodinstead of manualraise NotImplementedErrorAPI_REFERENCE.mdfor the C++ inference API,TROUBLESHOOTING.mdfor common errors and production deploymentTest plan
pytest tests/to verify all 58 new tests passpython3 atomic_1bit/python/inference.pyto verify parity still holdsmake BACKEND=METAL) and verify metallib path resolves correctly--configflag:python3 atomic_1bit/training/train.py --config configs/stories_base.yaml--config(backward compat)