docs: prepare repository for public launch #1
Overhaul README with badges, value proposition, quick start guide, benchmarks table, architecture section, and requirements. Add CONTRIBUTING.md, docs/INSTALL.md, docs/USAGE.md, and examples/ folder with quickstart and generation demos. Clean up gist.py comments, add package __init__.py, expand .gitignore, remove debug_crash.py and scratch quality report. https://claude.ai/code/session_01RGFx1LfDYnWuZs3ZTyD3mf
Resolved conflicts:
- CONTRIBUTING.md: kept master's more detailed version (from PR #3)
- README.md: kept launch-prep style with badges; corrected stale test count
- debug_crash.py: deleted per PR intent (scratch debug file)
Doc fixes:
- docs/INSTALL.md: corrected 'make metal/cuda' to 'make BACKEND=METAL/CUDA' (Makefile uses env var)
- README.md: removed hard-coded test count (was '67 tests')
Pull request overview
Prepares the repository for a public launch by refreshing top-level documentation and adding onboarding materials (install/usage guides + runnable examples) while cleaning up misc artifacts.
Changes:
- Overhauled `README.md` with launch-ready messaging, benchmarks, quick start, and structure overview.
- Added installation/usage docs and an `examples/` folder with runnable demo scripts.
- Removed/cleaned up debug/report artifacts and small package hygiene updates (`__init__.py`, `.gitignore`, clearer comments).
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| v1.3_quality_report.txt | Removes an outdated quality report artifact. |
| examples/quickstart.py | Adds a no-checkpoint “quickstart” demo showcasing quantization + generation. |
| examples/generate.py | Adds an interactive text generation script (currently has checkpoint/tokenizer compatibility issues). |
| examples/README.md | Documents how to run the example scripts. |
| docs/USAGE.md | Adds a usage guide covering training, inference, export, and evaluation. |
| docs/INSTALL.md | Adds installation instructions and troubleshooting notes. |
| atomic_1bit/tests/debug_crash.py | Removes a manual debug helper script. |
| atomic_1bit/model/gist.py | Replaces ambiguous comments with clearer docstrings for gist encoding. |
| atomic_1bit/__init__.py | Adds a package docstring. |
| README.md | Public-launch focused README overhaul (badges, positioning, quickstart, etc.). |
| .gitignore | Expands ignores for IDE files, logs, and packaging artifacts. |
| **C++ compilation fails with "no member named 'format'"** |
| |
| Your compiler may not support C++17. Update GCC to 7+ or Clang to 5+: |
The troubleshooting note says a "no member named 'format'" error means the compiler may not support C++17, but std::format is a C++20 feature and the repo's C++ build instructions target -std=c++17. Since the codebase doesn't appear to use std::format, this entry is likely misleading—either remove it or clarify the actual failing feature / required standard.
Suggested change:
| - **C++ compilation fails with "no member named 'format'"** |
| - Your compiler may not support C++17. Update GCC to 7+ or Clang to 5+: |
| + **C++ compilation fails with missing standard library features or other C++17 errors** |
| + Ensure your compiler supports C++17 and that you are building with `-std=c++17`. If needed, update GCC to 7+ or Clang to 5+: |
| # Load checkpoint |
| print(f"Loading checkpoint: {args.checkpoint}") |
| checkpoint = torch.load(args.checkpoint, map_location="cpu", weights_only=False) |
torch.load(..., weights_only=False) is incompatible with the repo's stated minimum PyTorch version (>= 1.13.0); weights_only was added in newer PyTorch releases. Drop this argument (or gate it behind a version check) so examples/generate.py works with the documented requirements.
Suggested change:
| - checkpoint = torch.load(args.checkpoint, map_location="cpu", weights_only=False) |
| + checkpoint = torch.load(args.checkpoint, map_location="cpu") |
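If the argument should be kept where it is available, one way to gate it without hard-coding a version number is to inspect the loader's signature. The sketch below is only an illustration under stated assumptions: `supported_kwargs` is a hypothetical helper, and `old_load` / `new_load` are stand-ins that mimic an older and a newer loader signature; they are not PyTorch functions.

```python
import inspect


def supported_kwargs(func, **candidates):
    """Return only the candidate kwargs that `func` names explicitly.

    A catch-all **kwargs parameter is deliberately not treated as support,
    since it may just forward unknown options somewhere that rejects them.
    """
    params = inspect.signature(func).parameters
    explicit = {
        name
        for name, p in params.items()
        if p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    }
    return {k: v for k, v in candidates.items() if k in explicit}


# Stand-in loaders (hypothetical), mimicking an older and a newer signature.
def old_load(path, map_location=None, **pickle_load_args):
    return {"path": path, "extra": pickle_load_args}


def new_load(path, map_location=None, *, weights_only=False, **pickle_load_args):
    return {"path": path, "weights_only": weights_only}


old_kwargs = supported_kwargs(old_load, weights_only=False)  # argument dropped
new_kwargs = supported_kwargs(new_load, weights_only=False)  # argument kept
```

In `examples/generate.py` the call could then be written as `torch.load(args.checkpoint, map_location="cpu", **supported_kwargs(torch.load, weights_only=False))`, assuming `torch.load` exposes `weights_only` as a named parameter in the releases that support it.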
| config = AtomicConfig( |
|     vocab_size=checkpoint.get("vocab_size", 4096), |
|     dim=checkpoint.get("dim", 256), |
|     depth=checkpoint.get("depth", 6), |
|     heads=checkpoint.get("heads", 4), |
|     context_length=checkpoint.get("context_len", 128), |
The checkpoint config is being read from top-level keys like dim, depth, heads, and context_len, but train_instruct.py saves these under checkpoint["config"] (and uses context_length, not context_len). As written, loading an instruct checkpoint will build the wrong AtomicConfig and load_state_dict will fail with shape mismatches. Prefer reading cfg = checkpoint.get("config", checkpoint) and using cfg["context_length"]/etc when present.
Suggested change:
| - config = AtomicConfig( |
| -     vocab_size=checkpoint.get("vocab_size", 4096), |
| -     dim=checkpoint.get("dim", 256), |
| -     depth=checkpoint.get("depth", 6), |
| -     heads=checkpoint.get("heads", 4), |
| -     context_length=checkpoint.get("context_len", 128), |
| + cfg = checkpoint.get("config", checkpoint) |
| + config = AtomicConfig( |
| +     vocab_size=cfg.get("vocab_size", 4096), |
| +     dim=cfg.get("dim", 256), |
| +     depth=cfg.get("depth", 6), |
| +     heads=cfg.get("heads", 4), |
| +     context_length=cfg.get("context_length", cfg.get("context_len", 128)), |
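The fallback logic can be exercised on plain dictionaries, without loading a real checkpoint. This is a minimal sketch: `resolve_config` and `DEFAULTS` are hypothetical names, the default values are the ones used in the snippet under review, and the nested `config` entry is assumed to be a dict.

```python
# Defaults taken from the reviewed snippet; the names here are illustrative.
DEFAULTS = {
    "vocab_size": 4096,
    "dim": 256,
    "depth": 6,
    "heads": 4,
    "context_length": 128,
}


def resolve_config(checkpoint):
    """Collect model hyperparameters from a checkpoint dict.

    Prefers a nested "config" mapping, falls back to top-level keys,
    and accepts the older "context_len" spelling.
    """
    cfg = checkpoint.get("config", checkpoint)
    resolved = {key: cfg.get(key, default) for key, default in DEFAULTS.items()}
    if "context_length" not in cfg and "context_len" in cfg:
        resolved["context_length"] = cfg["context_len"]
    return resolved


# Nested layout, as described for instruct checkpoints.
nested = {"config": {"dim": 512, "depth": 8, "heads": 8, "context_length": 256}}
# Flat layout with the older key name.
flat = {"dim": 128, "context_len": 64}

nested_cfg = resolve_config(nested)
flat_cfg = resolve_config(flat)
```

The result could be passed on as `AtomicConfig(**resolve_config(checkpoint))`, assuming `AtomicConfig` accepts exactly these keyword names, as the snippet above suggests.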
| # Tokenizer |
| enc = tiktoken.get_encoding("gpt2") |
| |
| # Interactive generation |
| print(f"\nGenerating {args.tokens} tokens at temperature {args.temp}") |
| print("Type a prompt and press Enter. Type 'quit' to exit.\n") |
| |
| while True: |
|     try: |
|         prompt = input(">>> ") |
|     except (EOFError, KeyboardInterrupt): |
|         break |
| |
|     if prompt.strip().lower() in ("quit", "exit", "q"): |
|         break |
| |
|     if not prompt.strip(): |
|         continue |
| |
|     tokens = enc.encode(prompt) |
|     input_ids = torch.tensor([tokens], dtype=torch.long) |
This example uses the GPT-2 tokenizer directly (tiktoken.get_encoding("gpt2")), but the default checkpoint path (weights/stories_final.pt) is trained with a pocket vocabulary (typically 4096) and requires mapping GPT-2 IDs into the pocket ID space. As-is, enc.encode(prompt) can produce token IDs > config.vocab_size and will crash the embedding lookup. Use atomic_1bit.tokenizers.PocketTokenizer (with the appropriate weights/vocab_map_*.json) when config.vocab_size is the pocket size, or otherwise ensure the tokenizer matches the training vocab.
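Independently of which tokenizer ends up being used, a cheap range check before the embedding lookup would turn the crash into a readable error. The sketch below assumes token IDs arrive as a plain list of integers; `check_token_ids` is a hypothetical helper, not part of `atomic_1bit`.

```python
def check_token_ids(token_ids, vocab_size):
    """Return token_ids unchanged, or raise if any ID is outside [0, vocab_size)."""
    bad = [t for t in token_ids if not 0 <= t < vocab_size]
    if bad:
        raise ValueError(
            f"{len(bad)} token id(s) outside vocab of size {vocab_size}, "
            f"e.g. {bad[0]}; the tokenizer likely does not match the checkpoint"
        )
    return token_ids


# IDs that fit a 4096-entry vocabulary pass through untouched.
ok_ids = check_token_ids([0, 17, 4095], vocab_size=4096)

# A GPT-2-range ID (that vocabulary has 50257 entries) is rejected.
try:
    check_token_ids([17, 50000], vocab_size=4096)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
```

In the example script this would sit between `enc.encode(prompt)` and the `torch.tensor(...)` call, with `config.vocab_size` as the bound.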
|     # Truncate to context length |
|     if input_ids.shape[1] > config.context_length - args.tokens: |
|         input_ids = input_ids[:, -(config.context_length - args.tokens) :] |
Prompt truncation uses config.context_length - args.tokens without guarding for args.tokens >= config.context_length. In that case the computed window is 0/negative and the slice becomes incorrect, and model.generate(...) will still hit assert seq_len <= context_length. Consider validating args.tokens < config.context_length (or capping max_new_tokens) and truncating the prompt to <= config.context_length accordingly.
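A sketch of the capping approach, working on a plain list of token IDs rather than a tensor. `plan_generation` is a hypothetical helper, and keeping at least one prompt token is a design choice made here, not something the repository prescribes. Note that when the window is exactly 0, the original slice `[-0:]` keeps the whole prompt instead of none of it, so the guard matters even before the window goes negative.

```python
def plan_generation(prompt_ids, context_length, max_new_tokens):
    """Return (truncated_prompt, new_token_budget) that fits in the context.

    At least one prompt token is kept, so the budget is capped at
    context_length - 1.
    """
    if context_length < 2:
        raise ValueError("context_length must be at least 2")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be positive")
    budget = min(max_new_tokens, context_length - 1)
    window = context_length - budget  # always >= 1 after the cap
    return prompt_ids[-window:], budget


long_prompt = list(range(200))

# Normal case: 50 new tokens leave room for the last 78 prompt tokens.
kept, budget = plan_generation(long_prompt, context_length=128, max_new_tokens=50)

# Oversized request: the budget is capped and one prompt token survives.
kept_capped, budget_capped = plan_generation(
    long_prompt, context_length=128, max_new_tokens=500
)
```

With this shape, the prompt length plus the budget never exceeds `context_length`, which is the condition the `assert seq_len <= context_length` inside `model.generate(...)` is described as checking.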