feat: Add Andreas' first test submission #419
Conversation
Pull request overview
Adds a new non-record 16MB “first submission” entry under records/track_non_record_16mb, including the training script used, the captured training log, and the metadata needed to register the result.
Changes:
- Add a standalone training script (`train_gpt.py`) for this submission variant (10-layer model w/ bigram hash, SWA, mixed int5/int6 quantization).
- Add the full training output log (`train.log`) to document the run and reported metrics/sizes.
- Add submission metadata + minimal documentation (`submission.json`, `README.md`) and a local `requirements.txt`.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| records/track_non_record_16mb/first-submission/train_gpt.py | Training + quantization script for the submission artifact. |
| records/track_non_record_16mb/first-submission/train.log | Captured run log (includes code snapshot + metrics + artifact sizes). |
| records/track_non_record_16mb/first-submission/submission.json | Declares the submission’s reported metric(s), size, author, and date. |
| records/track_non_record_16mb/first-submission/requirements.txt | Dependency list for reproducing the run. |
| records/track_non_record_16mb/first-submission/README.md | Minimal explanation of what the submission is and how to set up deps. |
```python
INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple(
    pattern
    for pattern in os.environ.get(
        "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS",
        ",".join(CONTROL_TENSOR_NAME_PATTERNS),
    ).split(",")
    if pattern
)
INT8_KEEP_FLOAT_MAX_NUMEL = 65_536
INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16
INT8_PER_ROW_SCALE_DTYPE = torch.float16
INT8_CLIP_PERCENTILE = 99.99984
INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0


def tensor_nbytes(t: Tensor) -> int:
    return int(t.numel()) * int(t.element_size())
```
There are several quantization-related constants and helpers defined here that are never used in this script (e.g. INT8_KEEP_FLOAT_FP32_NAME_PATTERNS, INT8_KEEP_FLOAT_MAX_NUMEL, INT8_KEEP_FLOAT_STORE_DTYPE, tensor_nbytes). Consider removing unused definitions to reduce file size/complexity, especially since this script is intended to stay approachable for newcomers.
Suggested change:
```diff
-INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple(
-    pattern
-    for pattern in os.environ.get(
-        "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS",
-        ",".join(CONTROL_TENSOR_NAME_PATTERNS),
-    ).split(",")
-    if pattern
-)
-INT8_KEEP_FLOAT_MAX_NUMEL = 65_536
-INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16
 INT8_PER_ROW_SCALE_DTYPE = torch.float16
 INT8_CLIP_PERCENTILE = 99.99984
 INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0
-def tensor_nbytes(t: Tensor) -> int:
-    return int(t.numel()) * int(t.element_size())
```
```python
def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]:
    local_tokens = global_tokens // (self.world_size * grad_accum_steps)
    per_rank_span = local_tokens + 1
    chunk = self.stream.take(per_rank_span * self.world_size)
    start = self.rank * per_rank_span
    local = chunk[start : start + per_rank_span].to(dtype=torch.int64)
    x = local[:-1].reshape(-1, seq_len)
    y = local[1:].reshape(-1, seq_len)
    return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)
```
DistributedTokenLoader.next_batch() reshapes local[:-1] to (-1, seq_len), but local_tokens = global_tokens // (world_size * grad_accum_steps) is not validated to be a multiple of seq_len. If TRAIN_BATCH_TOKENS is set to a value that isn’t divisible by WORLD_SIZE * GRAD_ACCUM_STEPS * TRAIN_SEQ_LEN, this will raise a runtime reshape error. Add an explicit validation (and a clear ValueError) that local_tokens % seq_len == 0 (and ideally that global_tokens is divisible by world_size * grad_accum_steps) before reshaping.
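A minimal sketch of such a guard, written here as a standalone helper (the helper name is hypothetical; the parameter names mirror `next_batch()`):

```python
def validate_batch_shape(global_tokens: int, world_size: int,
                         grad_accum_steps: int, seq_len: int) -> int:
    """Fail fast with a clear message instead of an opaque reshape error.

    Returns the per-rank token count if the configuration is consistent.
    """
    denom = world_size * grad_accum_steps
    if global_tokens % denom != 0:
        raise ValueError(
            f"global_tokens={global_tokens} must be divisible by "
            f"world_size * grad_accum_steps = {denom}"
        )
    local_tokens = global_tokens // denom
    if local_tokens % seq_len != 0:
        raise ValueError(
            f"per-rank tokens ({local_tokens}) must be a multiple of "
            f"seq_len={seq_len}"
        )
    return local_tokens
```

Calling this once at startup (when `TRAIN_BATCH_TOKENS` and friends are read) would surface a misconfiguration immediately rather than mid-run.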
```python
def __init__(self, bigram_vocab_size: int, bigram_dim: int, model_dim: int):
    super().__init__()
    self.bigram_vocab_size = bigram_vocab_size
    self.embed = nn.Embedding(bigram_vocab_size, bigram_dim)
    nn.init.zeros_(self.embed.weight)
    self.proj = CastedLinear(bigram_dim, model_dim, bias=False) if bigram_dim != model_dim else None
    if self.proj is not None:
        nn.init.zeros_(self.proj.weight)
    self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32))

def bigram_hash(self, tokens: Tensor) -> Tensor:
    t = tokens.to(torch.int32)
    mod = self.bigram_vocab_size - 1
    out = torch.empty_like(t)
    out[..., 0] = mod
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out.long()
```
BigramHashEmbedding.bigram_hash() uses mod = self.bigram_vocab_size - 1 and then does % mod. If bigram_vocab_size is set to 1 (or 0, though the module is only constructed for >0), this becomes a modulo-by-zero error. Add input validation in BigramHashEmbedding.__init__ (or when reading hyperparameters) to require bigram_vocab_size >= 2 when enabled.
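A sketch of the guard this comment asks for, as a standalone function (the helper name is hypothetical; in practice the check would sit in `BigramHashEmbedding.__init__` or wherever the hyperparameter is read):

```python
def check_bigram_vocab_size(bigram_vocab_size: int) -> None:
    """Reject sizes for which bigram_hash() would compute `% 0`.

    bigram_hash() reduces modulo (bigram_vocab_size - 1), so any value
    below 2 makes the modulus zero or negative.
    """
    if bigram_vocab_size < 2:
        raise ValueError(
            f"bigram_vocab_size must be >= 2 when enabled "
            f"(got {bigram_vocab_size}); bigram_hash() reduces modulo "
            f"bigram_vocab_size - 1"
        )
```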
```json
{
    "name": "Andreas Test Submission",
    "val_loss": 1.14276,
    "bytes_total": 15900000,
    "blurb": "Test submission based on the first place solution (10L Int5-MLP).",
    "author": "Andreas",
    "github_id": "andreasbis",
    "date": "2026-03-22"
}
```
This submission.json format doesn’t match the other entries in records/track_non_record_16mb, which include track and typically record both val_loss and val_bpb. Also, val_loss here is set to ~1.14, but the accompanying train.log shows val_loss ~1.929 and val_bpb ~1.143 after quant roundtrip, so the metric appears mislabeled. Recommend adding track (e.g. non-record-16mb), adding val_bpb, and setting val_loss/bytes_total to the exact values emitted by the training log.
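A rough sketch of what the aligned metadata could look like. The `val_loss`/`val_bpb` values below are the approximate figures quoted above from `train.log`, not exact numbers, and the field names follow this comment's recommendation; the authoritative schema and values should be taken from the other entries in the track and from the log itself:

```json
{
    "name": "Andreas Test Submission",
    "track": "non-record-16mb",
    "val_loss": 1.929,
    "val_bpb": 1.143,
    "bytes_total": 15900000,
    "blurb": "Test submission based on the first place solution (10L Int5-MLP).",
    "author": "Andreas",
    "github_id": "andreasbis",
    "date": "2026-03-22"
}
```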
```python
# fixes applied
# tuned
```
The trailing # fixes applied / # tuned comments look like leftover scratch notes rather than explanatory documentation. Consider removing them or replacing with a brief, specific note about what was changed/tuned (or link to the relevant result) so the file doesn’t accumulate ambiguous annotations.
Suggested change:
```diff
-# fixes applied
-# tuned
```
No description provided.