
feat: Add Andreas' first test submission#419

Open
AndreasBis wants to merge 1 commit into openai:main from AndreasBis:main

Conversation

@AndreasBis

No description provided.

Copilot AI review requested due to automatic review settings March 22, 2026 12:14
Contributor

Copilot AI left a comment


Pull request overview

Adds a new non-record 16MB “first submission” entry under records/track_non_record_16mb, including the training script used, the captured training log, and the metadata needed to register the result.

Changes:

  • Add a standalone training script (train_gpt.py) for this submission variant (10-layer model with bigram hash, SWA, mixed int5/int6 quantization).
  • Add the full training output log (train.log) to document the run and reported metrics/sizes.
  • Add submission metadata + minimal documentation (submission.json, README.md) and a local requirements.txt.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

File summary:

  • records/track_non_record_16mb/first-submission/train_gpt.py: Training + quantization script for the submission artifact.
  • records/track_non_record_16mb/first-submission/train.log: Captured run log (includes code snapshot + metrics + artifact sizes).
  • records/track_non_record_16mb/first-submission/submission.json: Declares the submission’s reported metric(s), size, author, and date.
  • records/track_non_record_16mb/first-submission/requirements.txt: Dependency list for reproducing the run.
  • records/track_non_record_16mb/first-submission/README.md: Minimal explanation of what the submission is and how to set up deps.


Comment on lines +288 to +304
```python
INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple(
    pattern
    for pattern in os.environ.get(
        "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS",
        ",".join(CONTROL_TENSOR_NAME_PATTERNS),
    ).split(",")
    if pattern
)
INT8_KEEP_FLOAT_MAX_NUMEL = 65_536
INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16
INT8_PER_ROW_SCALE_DTYPE = torch.float16
INT8_CLIP_PERCENTILE = 99.99984
INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0

def tensor_nbytes(t: Tensor) -> int:
    return int(t.numel()) * int(t.element_size())
```

Copilot AI Mar 22, 2026


There are several quantization-related constants and helpers defined here that are never used in this script (e.g. INT8_KEEP_FLOAT_FP32_NAME_PATTERNS, INT8_KEEP_FLOAT_MAX_NUMEL, INT8_KEEP_FLOAT_STORE_DTYPE, tensor_nbytes). Consider removing unused definitions to reduce file size/complexity, especially since this script is intended to stay approachable for newcomers.

Suggested change (drop the unused definitions, keeping only the constants the script actually uses):

```python
INT8_PER_ROW_SCALE_DTYPE = torch.float16
INT8_CLIP_PERCENTILE = 99.99984
INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0
```

Copilot uses AI. Check for mistakes.
Comment on lines +454 to +462
```python
def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]:
    local_tokens = global_tokens // (self.world_size * grad_accum_steps)
    per_rank_span = local_tokens + 1
    chunk = self.stream.take(per_rank_span * self.world_size)
    start = self.rank * per_rank_span
    local = chunk[start : start + per_rank_span].to(dtype=torch.int64)
    x = local[:-1].reshape(-1, seq_len)
    y = local[1:].reshape(-1, seq_len)
    return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)
```

Copilot AI Mar 22, 2026


DistributedTokenLoader.next_batch() reshapes local[:-1] to (-1, seq_len), but local_tokens = global_tokens // (world_size * grad_accum_steps) is not validated to be a multiple of seq_len. If TRAIN_BATCH_TOKENS is set to a value that isn’t divisible by WORLD_SIZE * GRAD_ACCUM_STEPS * TRAIN_SEQ_LEN, this will raise a runtime reshape error. Add an explicit validation (and a clear ValueError) that local_tokens % seq_len == 0 (and ideally that global_tokens is divisible by world_size * grad_accum_steps) before reshaping.

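A minimal sketch of the suggested validation, written as a standalone helper (the name `validate_batch_shape` is illustrative; in the real loader this check would run at the top of `next_batch` using `self.world_size`):

```python
def validate_batch_shape(global_tokens: int, world_size: int,
                         grad_accum_steps: int, seq_len: int) -> int:
    """Return local_tokens, raising a clear error instead of a reshape failure."""
    denom = world_size * grad_accum_steps
    if global_tokens % denom != 0:
        raise ValueError(
            f"global_tokens={global_tokens} must be divisible by "
            f"world_size * grad_accum_steps = {denom}"
        )
    local_tokens = global_tokens // denom
    if local_tokens % seq_len != 0:
        raise ValueError(
            f"local_tokens={local_tokens} must be a multiple of seq_len={seq_len}"
        )
    return local_tokens
```

Raising at the entry point makes the misconfiguration of TRAIN_BATCH_TOKENS obvious, rather than surfacing as an opaque reshape error deep in the loader.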
Comment on lines +589 to +605
```python
def __init__(self, bigram_vocab_size: int, bigram_dim: int, model_dim: int):
    super().__init__()
    self.bigram_vocab_size = bigram_vocab_size
    self.embed = nn.Embedding(bigram_vocab_size, bigram_dim)
    nn.init.zeros_(self.embed.weight)
    self.proj = CastedLinear(bigram_dim, model_dim, bias=False) if bigram_dim != model_dim else None
    if self.proj is not None:
        nn.init.zeros_(self.proj.weight)
    self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32))

def bigram_hash(self, tokens: Tensor) -> Tensor:
    t = tokens.to(torch.int32)
    mod = self.bigram_vocab_size - 1
    out = torch.empty_like(t)
    out[..., 0] = mod
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out.long()
```

Copilot AI Mar 22, 2026


BigramHashEmbedding.bigram_hash() uses mod = self.bigram_vocab_size - 1 and then does % mod. If bigram_vocab_size is set to 1 (or 0, though the module is only constructed for >0), this becomes a modulo-by-zero error. Add input validation in BigramHashEmbedding.__init__ (or when reading hyperparameters) to require bigram_vocab_size >= 2 when enabled.

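A minimal sketch of the suggested guard (the name `check_bigram_vocab_size` is illustrative; in the real script it would run in `BigramHashEmbedding.__init__` or wherever the hyperparameter is read):

```python
def check_bigram_vocab_size(bigram_vocab_size: int) -> None:
    # bigram_hash reduces modulo (bigram_vocab_size - 1), so any value
    # below 2 makes the modulus zero (division error) or meaningless.
    if bigram_vocab_size < 2:
        raise ValueError(
            "bigram_vocab_size must be >= 2 when bigram hashing is enabled, "
            f"got {bigram_vocab_size}"
        )
```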
Comment on lines +1 to +9
```json
{
    "name": "Andreas Test Submission",
    "val_loss": 1.14276,
    "bytes_total": 15900000,
    "blurb": "Test submission based on the first place solution (10L Int5-MLP).",
    "author": "Andreas",
    "github_id": "andreasbis",
    "date": "2026-03-22"
}
```

Copilot AI Mar 22, 2026


This submission.json format doesn’t match the other entries in records/track_non_record_16mb, which include track and typically record both val_loss and val_bpb. Also, val_loss here is set to ~1.14, but the accompanying train.log shows val_loss ~1.929 and val_bpb ~1.143 after quant roundtrip, so the metric appears mislabeled. Recommend adding track (e.g. non-record-16mb), adding val_bpb, and setting val_loss/bytes_total to the exact values emitted by the training log.

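A hedged sketch of what the corrected metadata might look like, using the approximate values the review quotes from train.log (the `track` value and field names follow the other entries the reviewer mentions; the exact `val_loss`, `val_bpb`, and `bytes_total` values should be copied verbatim from the training log, not from this sketch):

```json
{
    "name": "Andreas Test Submission",
    "track": "non-record-16mb",
    "val_loss": 1.929,
    "val_bpb": 1.143,
    "bytes_total": 15900000,
    "blurb": "Test submission based on the first place solution (10L Int5-MLP).",
    "author": "Andreas",
    "github_id": "andreasbis",
    "date": "2026-03-22"
}
```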
Comment on lines +1230 to +1231
```python
# fixes applied
# tuned
```

Copilot AI Mar 22, 2026


The trailing # fixes applied / # tuned comments look like leftover scratch notes rather than explanatory documentation. Consider removing them or replacing with a brief, specific note about what was changed/tuned (or link to the relevant result) so the file doesn’t accumulate ambiguous annotations.

Suggested change (delete both lines):

```python
# fixes applied
# tuned
```
