diff --git a/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/README.md b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/README.md
new file mode 100644
index 0000000000..f0390c88ef
--- /dev/null
+++ b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/README.md
@@ -0,0 +1,183 @@
+# JEPA-on-LM 14-run ablation — non-record submission (2026-05-02)
+
+This is a **non-record submission documenting a comprehensive negative result**:
+JEPA auxiliary objectives do **not** improve `val_bpb` on parameter-golf at
+the 17.06M-param / sp1024 / FineWeb scale. The cleanest recipe ties
+baseline exactly. We submit this to formalize the negative finding so
+future JEPA submitters don't re-run the same grid.
+
+## TL;DR
+
+- **Best JEPA variant** (`jepa-var-zero`, α=0.001, `VAR_WEIGHT=0`):
+  `val_bpb = 1.2311` at step 50K — **exact tie with same-seed baseline**.
+- Same-seed JEPA-vs-baseline gap: **+0.0007 to +0.0009** across two seeds
+  (1337, 42).
+- Cross-seed baseline gap: **0.0022**, larger than the JEPA gap →
+  statistically indistinguishable.
+- λ (the aux-loss weight, `JEPA_ALPHA`) matters by orders of magnitude:
+  λ=0.001 gives parity, λ=0.005 costs ≥ +0.005 BPB, and λ=0.2 (the obvious
+  "JEPA paper" default) costs +0.018 BPB.
+
+## Track
+
+`non-record-unlimited-compute-16mb` — but **the model artifact was not
+quantized for this submission**. We're submitting an ablation finding,
+not a leaderboard candidate. The val_bpb reported is the pre-quant
+running val_bpb at step 50K.
+
+## Setup
+
+All variants share one architectural backbone:
+
+- **Backbone**: `BaselineGPT`, 17,059,912 params
+- **Layers**: `NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2`
+- **Activation**: `relu_sq`
+- **Tied embeddings**: `TIE_EMBEDDINGS=1`
+- **Tokenizer/data**: `sp1024` BPE on FineWeb 10B
+- **Batch**: `TRAIN_BATCH_TOKENS=65536 TRAIN_SEQ_LEN=1024`
+- **Optimizer**: Muon (matrices) + Adam (scalars) — parameter-golf default
+- **Schedule**: linear warmdown (1200 steps), 10-step warmup
+- **Validation**: `VAL_LOSS_EVERY=10000`
+
+JEPA variants add a **single** small predictor MLP (model_dim → 64 →
+model_dim, zero-init on output) totaling **65,536 params (+0.4%)**:
+
+- **JEPA total**: 17,125,448 params
+- All 14 runs use the **same** model dim/layers/heads — only loss weights
+  and JEPA env vars differ. **Param-count clean**.
+
+## What we tested (14-run grid)
+
+Final `val_bpb` at step 50K, sorted ascending. Star (*) = wallclock cap
+hit on slower hardware before step 50K; `step` column shows actual.
+ +| run | seed | config | step | **val_bpb** | Δ vs same-seed baseline | +|---|---|---|---|---|---| +| `baseline-seed42` | 42 | control | 50K | **1.2289** | 0 (own baseline) | +| `tiny-lambda-seed42` | 42 | α=0.001 | 50K | 1.2298 | +0.0009 | +| **`var-zero`** | 1337 | **α=0.001, VAR_WEIGHT=0** | 50K | **1.2311** | **0.0000 ✅ TIE** | +| `baseline-promo` | 1337 | control | 50K | 1.2311 | 0 (own baseline) | +| `tiny-lambda-v3` | 1337 | α=0.001 | 50K | 1.2318 | +0.0007 | +| `half-lambda` | 1337 | α=0.0005 | 50K | 1.2318 | +0.0007 | +| `chunk16` | 1337 | α=0.001, CHUNK=16 | 50K | 1.2318 | +0.0007 | +| `aux+token-tiny` | 1337 | α=β=0.001 | 50K | 1.2361 | +0.0050 | +| `tenth-lambda`* | 1337 | α=0.0001 | 40K | 1.2362 | tied @ 40K | +| `covar-v3` | 1337 | α=0.005, COVAR_WEIGHT=0.05 | 50K | 1.2374 | +0.0063 | +| `token-only-tiny`* | 1337 | β=0.001 | 40K | 1.2408 | +0.0046 (40K) | +| `injection-v2`* | 1337 | α=0.005, INJECTION=1 | 40K | 1.2456 | +0.0094 (40K) | +| `aux-v1` | 1337 | α=0.2 (the "JEPA paper" default) | 50K | 1.2492 | +0.0181 | +| `aux-low-v2`* | 1337 | α=0.005 | 30K | 1.2553 | +0.0060 (30K) | + +(Cross-seed baseline gap = 1.2311 − 1.2289 = **0.0022**, our noise floor.) + +## Component-by-component verdict at the whisper regime (λ=0.001) + +| component active | effect on val_bpb @ 50K | +|---|---| +| Path A MSE alone (VAR_WEIGHT=0) | **0.000** ← exact baseline | +| Path A + VICReg variance reg (VAR_WEIGHT=0.1) | +0.0007 (within seed noise) | +| Path A + V-JEPA off-diag covariance (COVAR=0.05) | +0.0063 | +| Path B (token decoder via tied LM head) alone | +0.0046 | +| Path A + Path B both at whisper | +0.0050 | +| Path A + injection (zero-init latent into hidden) | +0.0094 | +| Higher λ: 0.005 | +0.005 to +0.010 | +| Higher λ: 0.2 | +0.018 (catastrophic, v1 default) | + +## Three findings + +1. **λ matters most, by orders of magnitude.** PR #832 (winner pattern) + used λ=0.001. We confirm parity at that magnitude. Going to λ=0.005 + already costs ≥0.005 BPB. λ=0.2 (a common JEPA paper default) costs + 0.018 BPB. This is the single most consequential knob. + +2. **VICReg variance reg adds small harm at this λ.** With λ already at + the noise floor, the variance hinge `relu(1 - z_std)` injects a tiny + asymmetric force that nudges JEPA away from baseline. Setting + `VAR_WEIGHT=0` recovers exact parity (`var-zero` row above). + +3. **Path B (token-decoder JEPA) hurts even at β=0.001.** The JEPA + token-CE competes with main CE for the tied LM head, so even whisper + magnitudes pull the head in two directions. Path A (hidden-state aux + MSE) is benign at small λ because it doesn't touch the LM head. + +## Reproducibility + +- **Architecture**: `jepa_lm.py` (this directory) — also published in the + `crucible-community-tap` at + [`architectures/jepa_lm/`](https://github.com/eren23/crucible-community-tap/tree/main/architectures/jepa_lm). + Tap commit `bc93273`. +- **Training script**: `train_gpt.py` (this directory) is a thin + compatibility wrapper that delegates to + `src/crucible/training/torch_backend.py` from the + [Crucible](https://github.com/eren23/crucible) ML platform (commit `969cac5`). +- **Compute**: 4× RunPod RTX 4090 (3 dedicated + 1 shared overnight). All + variants ran the `promotion` preset (~2h wallclock, + `MAX_WALLCLOCK_SECONDS=7200`, target `ITERATIONS=100000`, 65,536 + `TRAIN_BATCH_TOKENS`). +- **Total cost**: ~$15 over ~16 GPU-hours. +- **W&B**: project `parameter-golf`, entity `eren23`. Run names match the + table above (e.g. 
https://wandb.ai/eren23/parameter-golf/runs/n22iw31q
+  for `var-zero`).
+- **Full ablation finding** (per-step val_bpb curves CSV, structured
+  finding doc): `crucible-community-tap` at
+  [`findings/parameter-golf-jepa-ablation/`](https://github.com/eren23/crucible-community-tap/tree/main/findings/parameter-golf-jepa-ablation).
+
+### Repro command (var-zero, the baseline-tying recipe)
+
+```bash
+# Install the JEPA tap plugin
+crucible tap add https://github.com/eren23/crucible-community-tap
+crucible tap install jepa_lm --type architectures
+
+# Run var-zero
+MODEL_FAMILY=jepa_lm \
+JEPA_ALPHA=0.001 \
+JEPA_BETA=0 \
+JEPA_VAR_WEIGHT=0 \
+JEPA_COVAR_WEIGHT=0 \
+JEPA_CHUNK=8 \
+JEPA_PREDICTOR_DIM=64 \
+JEPA_INJECTION=0 \
+SEED=1337 \
+NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2 \
+ACTIVATION=relu_sq TIE_EMBEDDINGS=1 \
+TRAIN_BATCH_TOKENS=65536 TRAIN_SEQ_LEN=1024 \
+ITERATIONS=100000 WARMUP_STEPS=10 WARMDOWN_ITERS=1200 \
+VAL_LOSS_EVERY=10000 \
+MAX_WALLCLOCK_SECONDS=7200 \
+PYTHONPATH=src python -m crucible.cli.main run experiment --preset promotion
+```
+
+## Why this is publishable as a non-record submission
+
+- 14 runs at the same N (17.06M / 17.13M with predictor), promotion-tier
+  budget each (~2h wallclock, 50K steps).
+- Two-seed paired baselines (1337, 42) establish a 0.0022 noise floor —
+  roughly **2.5× larger than any JEPA-vs-baseline gap we measured at
+  the cleanest configs**.
+- λ sweep spanning a 2000× range (0.0001, 0.0005, 0.001, 0.005, 0.2).
+- Path ablation (A only / B only / both / injection / covar).
+- Three previously untested configs added: `chunk16`, `var-zero`, `tenth-lambda`.
+
+This is the cleanest negative-result JEPA submission on parameter-golf to
+date. PR #896 was a single-config failure; this is a saturated grid that
+identifies *exactly* which JEPA components hurt and which one is benign.
+
+## Files
+
+- `README.md` — this file
+- `submission.json` — leaderboard metadata (track, val_bpb, ablation JSON)
+- `train.log` — full training stdout for the best-variant `jepa-var-zero` run
+- `jepa_lm.py` — the architecture plugin (also in `crucible-community-tap`)
+- `train_gpt.py` — entry-point shim for the Crucible torch backend
+
+## Next directions (not yet tested)
+
+1. **Span-masking** (PR #1581 approach): replace target tokens with a
+   learned mask in the context-encoder pass. Forces non-trivial
+   prediction. Requires double forward pass — implementation cost is real.
+2. **Phased α ramp**: pure AR (30%) → AR+JEPA ramp (50%) → pure AR
+   cooldown (20%). PR #832's schedule.
+3. **EMA target encoder** (BYOL-style). PR #896 already showed no gain
+   at this scale; deprioritized.
+4. **Different backbone scale**: PR #832 won at 24M / byte-level. Maybe
+   JEPA helps below 17M but hurts above. Untested here.
diff --git a/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/jepa_lm.py b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/jepa_lm.py
new file mode 100644
index 0000000000..8939c5aec9
--- /dev/null
+++ b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/jepa_lm.py
@@ -0,0 +1,322 @@
+"""JEPA-on-LM architecture for parameter-golf non-record / unlimited-compute track.
+
+A standard parameter-golf BaselineGPT backbone (encoder-decoder skip, GQA,
+augmentations, tied embeddings) drives the cross-entropy LM head and val_bpb.
+
+On top of that, JEPA paths share a small predictor MLP:
+
+    Path A — Hidden-state aux JEPA:
+        For each non-final position t, predict the model's own final hidden
+        state at position t + chunk (stop-grad target). Loss = MSE +
+        VICReg variance regularization (+ optional off-diagonal covariance).
+
+    Path B — Token-decoder JEPA:
+        Project the predicted embedding through the tied LM head and apply CE
+        against the actual token at position t + chunk.
+
+    Injection (optional, JEPA_INJECTION=1):
+        Project predicted latent through a zero-init linear and ADD to the
+        hidden stream at chunk-positions before CE compute. JEPA actively
+        contributes a feature, not just a regularizer. Inspired by jfprincz
+        PR #832 (val_bpb 1.1903, beats baseline 1.2244 by 0.034).
+
+Combined loss returned to the trainer:
+
+    total = ce_main + alpha * (mse_aux + var_w * vicreg + covar_w * covar) + beta * ce_jepa
+
+v2 changes vs v1 (informed by parameter-golf community PRs):
+
+    - Defaults dropped 40x (alpha) / 10x (beta): alpha 0.2 -> 0.005,
+      beta 0.05 -> 0.005. Successful JEPA submissions in parameter-golf use
+      lambda ~= 0.001-0.005, not 0.1+. "JEPA contributes ~0.1% of peak
+      gradient signal" (PR #832).
+    - Off-diagonal covariance penalty (V-JEPA style) opt-in via
+      JEPA_COVAR_WEIGHT > 0. Prevents low-rank predictor collapse beyond what
+      pure variance regularization catches (PR #1581 finding).
+    - Predictor injection mode opt-in via JEPA_INJECTION=1. Predicted latents
+      flow into the LM head as features (zero-init), not just as a side loss.
+
+Setting JEPA_ALPHA=0 disables path A, JEPA_BETA=0 disables path B, and
+JEPA_INJECTION=0 disables injection. With all three disabled, the model
+recovers plain BaselineGPT numerics.
+
+Env vars (read in the builder, not via Hyperparameters):
+
+    JEPA_ALPHA          default 0.005  weight for hidden-state aux loss
+    JEPA_BETA           default 0.005  weight for token-decoder loss
+    JEPA_VAR_WEIGHT     default 0.1    VICReg variance-reg weight
+    JEPA_COVAR_WEIGHT   default 0.0    off-diagonal covariance penalty (V-JEPA)
+    JEPA_CHUNK          default 8      positions ahead to predict
+    JEPA_PREDICTOR_DIM  default 64     bottleneck dim of predictor MLP
+    JEPA_INJECTION      default 0      1 = inject predicted latent into hidden stream
+
+The predictor and injection projection are zero-initialized on their output
+layers, so JEPA paths start as a no-op and the trainer sees pure baseline
+gradients at step 0.
+"""
+from __future__ import annotations
+
+import math
+import os
+from typing import Any
+
+import torch
+import torch.nn.functional as F
+from torch import Tensor, nn
+
+from crucible.models.architectures.baseline import BaselineGPT
+from crucible.models.registry import register_model, register_schema
+
+
+def _env_float(name: str, default: float) -> float:
+    val = os.environ.get(name)
+    return default if val is None or val == "" else float(val)
+
+
+def _env_int(name: str, default: int) -> int:
+    val = os.environ.get(name)
+    return default if val is None or val == "" else int(val)
+
+
+def _env_bool(name: str, default: bool) -> bool:
+    val = os.environ.get(name)
+    if val is None or val == "":
+        return default
+    return val.strip().lower() not in ("0", "false", "no", "off")
+
+
+def _covariance_off_diag(z: Tensor) -> Tensor:
+    """V-JEPA-style off-diagonal covariance penalty.
+
+    Decorrelates feature dimensions by penalizing off-diagonal entries of the
+    feature covariance matrix. Sums squared off-diagonals, normalized by D.
+    Input z: [N, D]. Returns scalar.
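+
+    Quick sanity check (illustrative values, not from the submission runs):
+    zero-mean orthogonal columns incur no penalty, duplicated columns do:
+
+        z = torch.tensor([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
+        _covariance_off_diag(z)  # tensor(0.): the two columns are uncorrelated
+        _covariance_off_diag(torch.tensor([[1., 1.], [-1., -1.]]))  # tensor(4.)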
+ """ + z = z.float() + n = max(z.shape[0] - 1, 1) + z = z - z.mean(dim=0, keepdim=True) + cov = (z.T @ z) / n # [D, D] + d = cov.shape[0] + off_diag = cov - torch.diag(torch.diag(cov)) + return (off_diag.pow(2).sum() / d).clamp_min(0.0) + + +class JepaLM(BaselineGPT): + """BaselineGPT backbone + JEPA aux head + optional injection.""" + + def __init__( + self, + *, + jepa_alpha: float = 0.005, + jepa_beta: float = 0.005, + jepa_var_weight: float = 0.1, + jepa_covar_weight: float = 0.0, + jepa_chunk: int = 8, + jepa_predictor_dim: int = 64, + jepa_injection: bool = False, + **base_kwargs: Any, + ) -> None: + super().__init__(**base_kwargs) + if jepa_chunk < 1: + raise ValueError(f"JEPA_CHUNK must be >= 1, got {jepa_chunk}") + if jepa_predictor_dim < 1: + raise ValueError(f"JEPA_PREDICTOR_DIM must be >= 1, got {jepa_predictor_dim}") + self.jepa_alpha = float(jepa_alpha) + self.jepa_beta = float(jepa_beta) + self.jepa_var_weight = float(jepa_var_weight) + self.jepa_covar_weight = float(jepa_covar_weight) + self.jepa_chunk = int(jepa_chunk) + self.jepa_injection = bool(jepa_injection) + d = base_kwargs["model_dim"] + self.jepa_predictor = nn.Sequential( + nn.Linear(d, jepa_predictor_dim, bias=False), + nn.GELU(), + nn.Linear(jepa_predictor_dim, d, bias=False), + ) + # Zero-init the output projection so JEPA contributes nothing at step 0. + nn.init.zeros_(self.jepa_predictor[2].weight) + nn.init.normal_( + self.jepa_predictor[0].weight, + std=1.0 / math.sqrt(d), + ) + # Optional injection projection: predicted latent -> residual stream + # contribution at chunk-aligned positions. Zero-init keeps step-0 + # behavior identical to baseline. + if self.jepa_injection: + self.jepa_inject_proj = nn.Linear(d, d, bias=False) + nn.init.zeros_(self.jepa_inject_proj.weight) + else: + self.jepa_inject_proj = None + + def _maybe_inject(self, h: Tensor, h_pred: Tensor) -> Tensor: + """Add zero-init projected predicted latents into the hidden stream. + + h: [B, T, D] full hidden. h_pred: [B, T-chunk, D] predictions made at + positions 0..T-chunk-1 of what positions chunk..T-1 will look like. + We add the prediction at position t-chunk INTO h[t] for t >= chunk. + Positions 0..chunk-1 receive no injection (no prediction available). + """ + if self.jepa_inject_proj is None: + return h + chunk = self.jepa_chunk + inject = self.jepa_inject_proj(h_pred) # [B, T-chunk, D] + # Pad zero on the left for positions 0..chunk-1 + b, _, d = h.shape + zero_head = torch.zeros(b, chunk, d, dtype=h.dtype, device=h.device) + full_inject = torch.cat([zero_head, inject], dim=1) # [B, T, D] + return h + full_inject + + def _components( + self, + input_ids: Tensor, + target_ids: Tensor, + lora: Any = None, + ) -> dict[str, Tensor]: + """Forward + per-component losses.""" + h = self.hidden(input_ids, lora=lora) + chunk = self.jepa_chunk + seq_len = h.size(1) + do_jepa = (self.jepa_alpha > 0.0 or self.jepa_beta > 0.0 or self.jepa_injection) and seq_len > chunk + + if not do_jepa: + ce_main = self.compute_loss(h, target_ids, lora=lora) + return {"ce_loss": ce_main, "loss": ce_main} + + h_curr = h[:, :-chunk, :] # [B, T-chunk, D] + h_target = h[:, chunk:, :].detach() # stop-grad target + h_pred = self.jepa_predictor(h_curr) # [B, T-chunk, D] + + # Inject BEFORE computing main CE so injection helps the LM head. 
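+        # (_maybe_inject is a no-op when JEPA_INJECTION=0, so in every
+        # non-injection config ce_main below is exactly the baseline CE term.)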
+ h_for_ce = self._maybe_inject(h, h_pred) + ce_main = self.compute_loss(h_for_ce, target_ids, lora=lora) + out: dict[str, Tensor] = {"ce_loss": ce_main} + total = ce_main + + if self.jepa_alpha > 0.0: + # Normalize before MSE so un-RMSNormed magnitudes don't dominate. + h_pred_n = self.final_norm(h_pred) + h_target_n = self.final_norm(h_target) + mse_aux = F.mse_loss(h_pred_n, h_target_n) + # VICReg variance hinge over the predictor's feature dimension. + z_std = torch.sqrt(h_pred_n.float().var(dim=(0, 1)) + 1e-4) + vicreg = torch.relu(1.0 - z_std).mean() + jepa_aux = mse_aux + self.jepa_var_weight * vicreg + # V-JEPA off-diagonal covariance penalty (anti-collapse beyond + # variance reg). Opt-in via JEPA_COVAR_WEIGHT > 0. + if self.jepa_covar_weight > 0.0: + flat = h_pred_n.reshape(-1, h_pred_n.size(-1)) + covar = _covariance_off_diag(flat) + jepa_aux = jepa_aux + self.jepa_covar_weight * covar + out["jepa_covar"] = covar.detach() + total = total + self.jepa_alpha * jepa_aux + out["jepa_mse"] = mse_aux.detach() + out["jepa_vicreg"] = vicreg.detach() + + if self.jepa_beta > 0.0: + # Token-decoder JEPA: decode predicted embedding through tied LM head. + target_chunk_ids = input_ids[:, chunk:] + x = self.final_norm(h_pred) + flat = x.reshape(-1, x.size(-1)) + logits_proj = ( + self.tied_logits(flat) if self.tie_embeddings else self.lm_head(flat) + ) + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + ce_jepa = F.cross_entropy( + logits.float(), + target_chunk_ids.reshape(-1), + reduction="mean", + ignore_index=-100, + ) + total = total + self.jepa_beta * ce_jepa + out["jepa_token_ce"] = ce_jepa.detach() + + out["loss"] = total + return out + + def forward( + self, + input_ids: Tensor, + target_ids: Tensor, + lora: Any = None, + ) -> Tensor: # type: ignore[override] + return self._components(input_ids, target_ids, lora=lora)["loss"] + + def training_step(self, **batch: Any) -> dict[str, Tensor]: + return self._components( + batch["input_ids"], + batch["target_ids"], + lora=batch.get("lora"), + ) + + def validation_step(self, **batch: Any) -> dict[str, Tensor]: + # Validation reports val_bpb based on ce_loss only — JEPA aux is + # training-time regularization. With injection enabled the predicted + # latent IS part of the LM head input, so we keep that path live; + # variance/MSE losses are skipped. 
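+        # With JEPA_INJECTION=0 (every run in this ablation except
+        # injection-v2) the branch below is skipped, so validation scores
+        # the backbone exactly as plain BaselineGPT would.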
+ h = self.hidden(batch["input_ids"], lora=batch.get("lora")) + if self.jepa_injection and h.size(1) > self.jepa_chunk: + h_curr = h[:, :-self.jepa_chunk, :] + h_pred = self.jepa_predictor(h_curr) + h = self._maybe_inject(h, h_pred) + ce = self.compute_loss(h, batch["target_ids"], lora=batch.get("lora")) + return {"loss": ce, "ce_loss": ce} + + +def _build_jepa_lm(args: Any) -> JepaLM: + base_kwargs = dict( + vocab_size=args.vocab_size, + num_layers=args.num_layers, + model_dim=args.model_dim, + num_heads=args.num_heads, + num_kv_heads=args.num_kv_heads, + mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, + tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, + rope_base=args.rope_base, + qk_gain_init=args.qk_gain_init, + attention_variant=args.attention_variant, + residual_variant=args.residual_variant, + embed_bottleneck_dim=getattr(args, "embed_bottleneck_dim", 0), + use_smear_gate=getattr(args, "smear_gate", False), + use_bigram_hash=getattr(args, "bigram_hash", False), + bigram_hash_buckets=getattr(args, "bigram_hash_buckets", 2048), + bigram_hash_embed_dim=getattr(args, "bigram_hash_embed_dim", 128), + ortho_init=getattr(args, "ortho_init", False), + spectral_embed_init=getattr(args, "spectral_embed_init", False), + use_conv_block=getattr(args, "conv_block", False), + conv_kernel=getattr(args, "conv_kernel", 3), + multiscale_window=getattr(args, "multiscale_window", 0), + token_merge_layer=getattr(args, "token_merge_layer", 0), + token_merge_threshold=getattr(args, "token_merge_threshold", 0.9), + block_pattern=getattr(args, "block_pattern", ""), + use_trigram_hash=getattr(args, "trigram_hash", False), + trigram_hash_buckets=getattr(args, "trigram_hash_buckets", 4096), + activation=getattr(args, "activation", "relu_sq"), + use_moe=getattr(args, "use_moe", False), + moe_num_experts=getattr(args, "moe_num_experts", 4), + moe_top_k=getattr(args, "moe_top_k", 2), + ) + return JepaLM( + jepa_alpha=_env_float("JEPA_ALPHA", 0.005), + jepa_beta=_env_float("JEPA_BETA", 0.005), + jepa_var_weight=_env_float("JEPA_VAR_WEIGHT", 0.1), + jepa_covar_weight=_env_float("JEPA_COVAR_WEIGHT", 0.0), + jepa_chunk=_env_int("JEPA_CHUNK", 8), + jepa_predictor_dim=_env_int("JEPA_PREDICTOR_DIM", 64), + jepa_injection=_env_bool("JEPA_INJECTION", False), + **base_kwargs, + ) + + +register_model("jepa_lm", _build_jepa_lm) +register_schema("jepa_lm", { + # Inherits all baseline knobs (MODEL_DIM, NUM_LAYERS, ...) — those are + # honored via the BaselineGPT constructor. Schema below documents the + # JEPA-specific env vars introduced by this plugin. 
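+    # Example (the baseline-tying recipe from the README): JEPA_ALPHA=0.001,
+    # JEPA_BETA=0, JEPA_VAR_WEIGHT=0, JEPA_COVAR_WEIGHT=0, JEPA_INJECTION=0.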
+ "JEPA_ALPHA": {"type": "float", "default": 0.005, "description": "Weight for hidden-state aux JEPA loss (MSE + VICReg + covar)"}, + "JEPA_BETA": {"type": "float", "default": 0.005, "description": "Weight for token-decoder JEPA cross-entropy loss"}, + "JEPA_VAR_WEIGHT": {"type": "float", "default": 0.1, "description": "VICReg variance-regularization weight"}, + "JEPA_COVAR_WEIGHT": {"type": "float", "default": 0.0, "description": "V-JEPA off-diagonal covariance penalty (0 = off)"}, + "JEPA_CHUNK": {"type": "int", "default": 8, "description": "Lookahead distance (positions) for JEPA prediction"}, + "JEPA_PREDICTOR_DIM": {"type": "int", "default": 64, "description": "Bottleneck dim of the JEPA predictor MLP"}, + "JEPA_INJECTION": {"type": "bool", "default": False, "description": "Inject predicted latent (zero-init) into hidden stream"}, +}) diff --git a/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/submission.json b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/submission.json new file mode 100644 index 0000000000..e0d8b9f5bf --- /dev/null +++ b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/submission.json @@ -0,0 +1,67 @@ +{ + "author": "Eren Akbulut", + "github_id": "eren23", + "name": "JEPA-on-LM 14-run Ablation (negative result, baseline-tying recipe)", + "blurb": "Non-record ablation of JEPA auxiliary objectives on a 17.06M-param BaselineGPT (9x512, KV4, MLP_MULT=2, sp1024, FineWeb 10B, promotion preset, 50K steps). 14 configs spanning lambda in [0.0001, 0.2], V-JEPA covariance, VICReg variance reg, predictor injection, chunk depth, two seeds. Cleanest recipe (Path-A MSE only with alpha=0.001, VAR_WEIGHT=0) ties baseline exactly at val_bpb=1.2311. Same-seed JEPA-vs-baseline gap of +0.0007 to +0.0009 is below the cross-seed 0.0022 noise floor. 
Quant pipeline NOT run for this submission - reporting pre-quant val_bpb at step 50K to document the ablation finding.", + "date": "2026-05-02T09:36:00Z", + "track": "non-record-unlimited-compute-16mb", + "val_loss": 2.0786, + "val_bpb": 1.2311, + "pre_quant_val_loss": 2.0786, + "pre_quant_val_bpb": 1.2311, + "step_stop": 50000, + "wallclock_seconds": 6281.365, + "bytes_total": null, + "bytes_model_int8_zlib": null, + "bytes_code": 14347, + "extra": { + "submission_kind": "ablation-finding", + "best_variant": "jepa-var-zero", + "best_variant_config": { + "MODEL_FAMILY": "jepa_lm", + "JEPA_ALPHA": 0.001, + "JEPA_BETA": 0.0, + "JEPA_VAR_WEIGHT": 0.0, + "JEPA_COVAR_WEIGHT": 0.0, + "JEPA_CHUNK": 8, + "JEPA_PREDICTOR_DIM": 64, + "JEPA_INJECTION": 0, + "SEED": 1337 + }, + "baseline_at_same_seed": { + "name": "baseline-promo", + "seed": 1337, + "val_bpb": 1.2311, + "val_loss": 2.0786 + }, + "baseline_at_alt_seed": { + "name": "baseline-seed42", + "seed": 42, + "val_bpb": 1.2289, + "val_loss": 2.0723 + }, + "param_count_baseline": 17059912, + "param_count_jepa": 17125448, + "predictor_overhead_pct": 0.4, + "n_runs": 14, + "lambda_sweep": [0.0001, 0.0005, 0.001, 0.005, 0.2], + "ablation_table": [ + {"run": "baseline-seed42", "seed": 42, "config": "control", "step": 50000, "val_bpb": 1.2289}, + {"run": "tiny-lambda-seed42", "seed": 42, "config": "alpha=0.001", "step": 50000, "val_bpb": 1.2298}, + {"run": "var-zero", "seed": 1337, "config": "alpha=0.001 VAR_WEIGHT=0", "step": 50000, "val_bpb": 1.2311}, + {"run": "baseline-promo", "seed": 1337, "config": "control", "step": 50000, "val_bpb": 1.2311}, + {"run": "tiny-lambda-v3", "seed": 1337, "config": "alpha=0.001", "step": 50000, "val_bpb": 1.2318}, + {"run": "half-lambda", "seed": 1337, "config": "alpha=0.0005", "step": 50000, "val_bpb": 1.2318}, + {"run": "chunk16", "seed": 1337, "config": "alpha=0.001 JEPA_CHUNK=16", "step": 50000, "val_bpb": 1.2318}, + {"run": "aux+token-tiny", "seed": 1337, "config": "alpha=0.001 beta=0.001", "step": 50000, "val_bpb": 1.2361}, + {"run": "tenth-lambda", "seed": 1337, "config": "alpha=0.0001", "step": 40000, "val_bpb": 1.2362}, + {"run": "covar-v3", "seed": 1337, "config": "alpha=0.005 COVAR_WEIGHT=0.05", "step": 50000, "val_bpb": 1.2374}, + {"run": "token-only-tiny", "seed": 1337, "config": "beta=0.001", "step": 40000, "val_bpb": 1.2408}, + {"run": "injection-v2", "seed": 1337, "config": "alpha=0.005 INJECTION=1", "step": 40000, "val_bpb": 1.2456}, + {"run": "aux-v1", "seed": 1337, "config": "alpha=0.2 (v1 default - too high)", "step": 50000, "val_bpb": 1.2492}, + {"run": "aux-low-v2", "seed": 1337, "config": "alpha=0.005", "step": 30000, "val_bpb": 1.2553} + ], + "tap_finding_url": "https://github.com/eren23/crucible-community-tap/tree/main/findings/parameter-golf-jepa-ablation", + "tap_architecture_url": "https://github.com/eren23/crucible-community-tap/tree/main/architectures/jepa_lm" + } +} diff --git a/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/train.log b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/train.log new file mode 100644 index 0000000000..904beb371c --- /dev/null +++ b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/train.log @@ -0,0 +1,1419 @@ +"""PyTorch training backend — main training loop entry point. + +Invoke directly (``python torch_backend.py``) or via the crucible runner / MCP tools. 
+""" +from __future__ import annotations + +import sys +from pathlib import Path + +# Self-bootstrap: ensure src/ is on path when invoked directly +_src = str(Path(__file__).resolve().parent.parent.parent) +if _src not in sys.path: + sys.path.insert(0, _src) + +import copy +import io +import math +import os +import random +import signal +import subprocess +import time + +import numpy as np +import sentencepiece as spm +import torch +import torch.distributed as dist +from torch import Tensor, nn +from torch.nn.parallel import DistributedDataParallel as DDP + +# Crucible training modules (siblings) +from crucible.training.hyperparams import Hyperparameters +from crucible.training.muon import zeropower_via_newtonschulz5 +from crucible.training.data_loader import DistributedTokenLoader +from crucible.training.validation import validate_model, build_sentencepiece_luts, load_validation_tokens +from crucible.training.quantization import ( + CONTROL_TENSOR_NAME_PATTERNS, + quantize_state_dict, + dequantize_state_dict, + compress_blob, + decompress_blob, + fake_int6_quant, +) +from crucible.training.ttt_eval import ttt_lora_evaluate + +# Crucible model layer +from crucible.models.registry import build_model +from crucible.models.components.linear import CastedLinear + +# Crucible runner utilities +from crucible.runner.tracker import RunTracker +from crucible.runner.wandb_logger import WandbLogger +from crucible.core.fingerprint import code_fingerprint +from crucible.core.io import collect_public_attrs + +try: + import zstandard as zstd +except ImportError: + zstd = None + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + + +def restore_low_dim_params_to_fp32(module: nn.Module) -> None: + # Keep small/control parameters in fp32 even when the model body runs in bf16. + with torch.no_grad(): + for name, param in module.named_parameters(): + if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32: + param.data = param.data.float() + + +# --------------------------------------------------------------------------- +# Main training loop +# --------------------------------------------------------------------------- + + +_zeropower_compiled = False + + +def main() -> None: + global _zeropower_compiled + import crucible.training.muon as _muon_mod + + code = Path(__file__).read_text(encoding="utf-8") + args = Hyperparameters() + if not _zeropower_compiled: + _muon_mod.zeropower_via_newtonschulz5 = torch.compile(zeropower_via_newtonschulz5) + _zeropower_compiled = True + + # ----------------------------- + # DISTRIBUTED + CUDA SETUP + # ----------------------------- + + # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK automatically. Trust them. 
+ rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + distributed = world_size > 1 + + if world_size <= 0: + raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + grad_accum_steps_env = os.environ.get("GRAD_ACCUM_STEPS") + if grad_accum_steps_env is not None: + grad_accum_steps = int(grad_accum_steps_env) + if grad_accum_steps <= 0: + raise ValueError(f"GRAD_ACCUM_STEPS must be positive, got {grad_accum_steps}") + elif world_size == 1: + grad_accum_steps = 1 + else: + if 8 % world_size != 0: + raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral") + grad_accum_steps = 8 // world_size + grad_scale = 1.0 / grad_accum_steps + if not torch.cuda.is_available(): + raise RuntimeError("CUDA is required") + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + master_process = rank == 0 + _graceful_shutdown = False + def _handle_signal(signum, frame): + nonlocal _graceful_shutdown + _graceful_shutdown = True + signal.signal(signal.SIGTERM, _handle_signal) + signal.signal(signal.SIGINT, _handle_signal) + tracker: RunTracker | None = None + wandb: WandbLogger | None = None + + # Fast math knobs + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp + + enable_cudnn_sdp(False) + enable_flash_sdp(True) + enable_mem_efficient_sdp(False) + enable_math_sdp(args.multiscale_window > 0 or bool(args.block_pattern)) + + logfile = None + if master_process: + os.makedirs("logs", exist_ok=True) + logfile = f"logs/{args.run_id}.txt" + config = collect_public_attrs(args) + fp = code_fingerprint(Path(__file__).resolve().parent.parent.parent.parent) + config["code_fingerprint"] = fp["fingerprint"] + config["code_files"] = fp["files"] + run_tags = ["torch", args.model_family] + if args.attention_variant != "standard": + run_tags.append(f"attn:{args.attention_variant}") + if args.residual_variant != "standard": + run_tags.append(f"resid:{args.residual_variant}") + if args.embed_bottleneck_dim > 0: + run_tags.append("factorized_embed") + if args.gpu_count > 1: + run_tags.append(f"gpu:{args.gpu_count}") + run_preset = os.environ.get("RUN_PRESET", "").strip() + if run_preset: + run_tags.append(run_preset) + tracker = RunTracker(args.run_id, out_dir="logs", project_root=Path(__file__).resolve().parent.parent.parent.parent) + tracker.write_manifest( + backend="torch", + script_path=Path(__file__), + config=config, + tags=run_tags, + extra={ + "trainer": "torch_backend", + "run_preset": run_preset or None, + "parent_run_id": args.parent_run_id or None, + "gpu_count": args.gpu_count, + }, + ) + tracker.update(state="starting", phase="starting", backend="torch", config=config) + wandb = WandbLogger.create( + run_id=args.run_id, + config=config, + backend="torch", + tracker=tracker, + job_type=run_preset or None, + tags=run_tags, + ) + print(logfile) + + def log0(msg: str, console: bool = True) -> None: + if not master_process: + return + if console: + print(msg) + if logfile is not None: + with open(logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + + log0(code, console=False) + log0("=" * 100, console=False) + log0(f"Running Python {sys.version}", console=False) + log0(f"Running PyTorch {torch.__version__}", 
console=False) + log0( + subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout, + console=False, + ) + log0("=" * 100, console=False) + + # ----------------------------- + # TOKENIZER + VALIDATION METRIC SETUP + # ----------------------------- + + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + + if not args.tokenizer_path.endswith(".model"): + raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}") + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + if int(sp.vocab_size()) != args.vocab_size: + raise ValueError( + f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}" + ) + dataset_dir = Path(args.data_path).resolve() + actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin"))) + val_tokens = load_validation_tokens(args.val_files, args.train_seq_len) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts( + sp, args.vocab_size, device + ) + log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}") + log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}") + log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}") + + # ----------------------------- + # MODEL + OPTIMIZER SETUP + # ----------------------------- + + base_model = build_model(args).to(device).bfloat16() + for module in base_model.modules(): + if isinstance(module, CastedLinear): + module.float() + restore_low_dim_params_to_fp32(base_model) + + # Int6 QAT: register forward pre-hooks that fake-quantize weight matrices. + _qat_hooks: list = [] + if args.int6_qat: + def _make_qat_hook(module: nn.Module): + def hook(mod, inputs): + mod.weight.data = fake_int6_quant(mod.weight.data) + return hook + for module in base_model.modules(): + if isinstance(module, CastedLinear): + _qat_hooks.append(module.register_forward_pre_hook(_make_qat_hook(module))) + + # Discover and build callbacks BEFORE torch.compile so that on_model_ready + # can register forward hooks that will be visible to the compiled graph. + from crucible.core.plugin_discovery import discover_all_plugins + from crucible.training.callbacks import CALLBACK_REGISTRY, build_callbacks + _proj_root = Path(__file__).resolve().parent.parent.parent.parent + discover_all_plugins( + {"callbacks": CALLBACK_REGISTRY}, + project_root=_proj_root, + ) + _callbacks_str = os.environ.get("CALLBACKS", "") + _callbacks = build_callbacks(_callbacks_str) if _callbacks_str else [] + if _callbacks: + log0(f"callbacks: {[type(cb).__name__ for cb in _callbacks]}") + + # on_model_ready: let callbacks register forward hooks BEFORE compile. + _cb_state_early = {"model": base_model, "total_steps": args.iterations} + for _cb in _callbacks: + _cb.on_model_ready(_cb_state_early) + + # torch.compile gives meaningful throughput on long runs but takes 30-60s + # to warm up and uses fullgraph=True (any graph break is fatal). Set + # TORCH_COMPILE=0 to skip — useful for smoke iteration on plugins with + # compile-incompatible ops (e.g. .item() in metric stashes) and for + # variants whose compiled-graph time exceeds the smoke wallclock budget. 
+ if os.environ.get("TORCH_COMPILE", "1") != "0": + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + else: + log0("torch.compile disabled by TORCH_COMPILE=0") + compiled_model = base_model + model: nn.Module = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False) if distributed else compiled_model + + # Optimizer split: + # - token embedding (Adam) uses EMBED_LR + # - untied lm_head (Adam) uses HEAD_LR + # - matrix params in transformer blocks use MATRIX_LR via Muon + # - vectors/scalars use SCALAR_LR via Adam + token_param_names = base_model.token_parameter_names() + head_param_names = {"lm_head.weight"} if base_model.lm_head is not None else set() + token_params: list[Tensor] = [] + head_params: list[Tensor] = [] + matrix_params: list[Tensor] = [] + scalar_params: list[Tensor] = [] + for name, p in base_model.named_parameters(): + if not p.requires_grad: + continue + if name in token_param_names: + token_params.append(p) + elif name in head_param_names: + head_params.append(p) + elif p.ndim == 2 and not any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS): + matrix_params.append(p) + else: + scalar_params.append(p) + token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr + + # Discover custom optimizer plugins (callbacks already discovered before torch.compile). + from crucible.training.optimizers import OPTIMIZER_REGISTRY, build_optimizer + discover_all_plugins( + {"optimizers": OPTIMIZER_REGISTRY}, + project_root=_proj_root, + ) + + # Pluggable per-group optimizers — env vars override defaults. + _embed_opt = os.environ.get("EMBED_OPTIMIZER", "adam") + _matrix_opt = os.environ.get("MATRIX_OPTIMIZER", "muon") + _scalar_opt = os.environ.get("SCALAR_OPTIMIZER", "adamw") + _head_opt = os.environ.get("HEAD_OPTIMIZER", "adam") + + # Adam-family kwargs — only forwarded when using adam/adamw to avoid + # TypeError on optimizers that don't accept betas/eps/fused. 
+ _ADAM_FAMILY = {"adam", "adamw"} + _adam_kw = dict(betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True) + + optimizer_tok = build_optimizer( + _embed_opt, + [{"params": token_params, "lr": token_lr, "base_lr": token_lr}], + **(_adam_kw if _embed_opt in _ADAM_FAMILY else {}), + ) + optimizer_muon = build_optimizer( + _matrix_opt, + matrix_params, + lr=args.matrix_lr, + momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, + weight_decay=args.muon_weight_decay, + ) + for group in optimizer_muon.param_groups: + group["base_lr"] = args.matrix_lr + optimizer_scalar = build_optimizer( + _scalar_opt, + [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}], + weight_decay=args.adam_weight_decay, + **(_adam_kw if _scalar_opt in _ADAM_FAMILY else {}), + ) + optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar] + if head_params: + optimizer_head = build_optimizer( + _head_opt, + [{"params": head_params, "lr": args.head_lr, "base_lr": args.head_lr}], + **(_adam_kw if _head_opt in _ADAM_FAMILY else {}), + ) + optimizers.insert(1, optimizer_head) + + n_params = sum(p.numel() for p in base_model.parameters()) + log0(f"model_params:{n_params}") + log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}") + log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False") + log0( + f"model_family:{args.model_family} attention_variant:{args.attention_variant} " + f"residual_variant:{args.residual_variant} embed_bottleneck_dim:{args.embed_bottleneck_dim}" + ) + log0( + f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads} " + f"share_blocks:{args.share_blocks} recurrence_steps:{args.recurrence_steps} state_dim:{args.state_dim}" + ) + log0( + f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} " + f"head_lr:{args.head_lr if head_params else 0.0} " + f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}" + ) + log0( + f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} " + f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} " + f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}" + ) + log0( + f"lr_schedule:{args.lr_schedule} lr_decay_iters:{args.lr_decay_iters} " + f"min_lr_scale:{args.min_lr_scale:.4f}" + ) + log0(f"train_shard_limit:{args.train_shard_limit}") + log0(f"seed:{args.seed}") + + # ----------------------------- + # DATA LOADER & MODEL WARMUP + # ----------------------------- + + train_loader = DistributedTokenLoader( + args.train_files, + rank, + world_size, + device, + shard_limit=args.train_shard_limit, + ) + + # Epoch-based training: resolve EPOCHS to iterations from dataset size + if args.epochs > 0: + from crucible.training.data_loader import count_shard_tokens + total_tokens = count_shard_tokens(args.train_files, shard_limit=args.train_shard_limit) + if total_tokens > 0: + iterations = int(args.epochs * total_tokens / args.train_batch_tokens) + log0(f"epoch_mode:epochs={args.epochs} total_tokens={total_tokens:,} " + f"tokens_per_step={args.train_batch_tokens} iterations={iterations}") + args.iterations = iterations + + def zero_grad_all() -> None: + for opt in optimizers: + opt.zero_grad(set_to_none=True) + + max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None + _VAL_SAFETY_MS = 30_000.0 # reserve 30s for final validation + serialization + if max_wallclock_ms is not None: + max_wallclock_ms = max(max_wallclock_ms - _VAL_SAFETY_MS, 0.0) + + def lr_mul(step: 
int, elapsed_ms: float) -> float: + if args.lr_schedule == "cosine": + warmup_steps = max(args.warmup_steps, 0) + if warmup_steps > 0 and step < warmup_steps: + return max(step, 1) / warmup_steps + decay_iters = args.lr_decay_iters if args.lr_decay_iters > 0 else args.iterations + if decay_iters <= warmup_steps: + return args.min_lr_scale + if step >= decay_iters: + return args.min_lr_scale + progress = (step - warmup_steps) / max(decay_iters - warmup_steps, 1) + cosine = 0.5 * (1.0 + math.cos(math.pi * progress)) + return args.min_lr_scale + (1.0 - args.min_lr_scale) * cosine + if args.lr_schedule != "linear_warmdown": + raise ValueError(f"Unsupported LR_SCHEDULE={args.lr_schedule!r}") + if args.warmdown_iters <= 0: + return 1.0 + if max_wallclock_ms is None: + warmdown_start = max(args.iterations - args.warmdown_iters, 0) + return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0 + step_ms = elapsed_ms / max(step, 1) + warmdown_ms = args.warmdown_iters * step_ms + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0 + + # Warmup primes the compiled forward/backward/optimizer paths, then we restore the + # initial weights/optimizer state so measured training starts from the true init. + if args.warmup_steps > 0: + initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} + initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] + model.train() + for warmup_step in range(args.warmup_steps): + zero_grad_all() + for micro_step in range(grad_accum_steps): + if distributed: + model.require_backward_grad_sync = micro_step == grad_accum_steps - 1 + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + warmup_loss = model(x, y) + (warmup_loss * grad_scale).backward() + for opt in optimizers: + opt.step() + zero_grad_all() + if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps: + log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}") + if tracker is not None: + tracker.heartbeat("warming_up", warmup_step=warmup_step + 1, warmup_total=args.warmup_steps) + base_model.load_state_dict(initial_model_state, strict=True) + for opt, state in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + zero_grad_all() + if distributed: + model.require_backward_grad_sync = True + train_loader = DistributedTokenLoader( + args.train_files, + rank, + world_size, + device, + shard_limit=args.train_shard_limit, + ) + + # ----------------------------- + # MAIN TRAINING LOOP + # ----------------------------- + + _cb_state = {"model": base_model, "total_steps": args.iterations, "optimizers": optimizers} + for _cb in _callbacks: + _cb.on_train_begin(_cb_state) + + training_time_ms = 0.0 + stop_after_step: int | None = None + swa_state: dict[str, Tensor] | None = None + swa_count = 0 + torch.cuda.synchronize() + t0 = time.perf_counter() + + step = 0 + while True: + last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step) + + should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + val_loss, val_bpb = 
validate_model( + args, + model, + rank, + world_size, + device, + grad_accum_steps, + val_tokens, + base_bytes_lut, + has_leading_space_lut, + is_boundary_token_lut, + ) + log0( + f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} " + f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms" + ) + if tracker is not None: + tracker.heartbeat( + "validating", + step=step, + total_steps=args.iterations, + latest_val_loss=val_loss, + latest_val_bpb=val_bpb, + train_time_ms=training_time_ms, + ) + _val_metrics = {"val_loss": val_loss, "val_bpb": val_bpb} + for _cb in _callbacks: + _cb.on_validation_end(step, _val_metrics, _cb_state) + if wandb is not None: + _wandb_val = { + "run/phase": "validating", + "metrics/val_loss": val_loss, + "metrics/val_bpb": val_bpb, + "timing/train_time_ms": training_time_ms, + "timing/step_avg_ms": training_time_ms / max(step, 1), + } + for _mk, _mv in _val_metrics.items(): + if _mk not in ("val_loss", "val_bpb") and isinstance(_mv, (int, float)): + _wandb_val[f"compression/{_mk}"] = _mv + wandb.log(_wandb_val, step=step) + torch.cuda.synchronize() + t0 = time.perf_counter() + + if last_step: + if stop_after_step is not None and step < args.iterations: + log0( + f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms " + f"step:{step}/{args.iterations}" + ) + break + + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + scale = lr_mul(step, elapsed_ms) + step_t0 = time.perf_counter() + for _cb in _callbacks: + _cb.on_step_begin(step, _cb_state) + zero_grad_all() + train_loss = torch.zeros((), device=device) + for micro_step in range(grad_accum_steps): + if distributed: + model.require_backward_grad_sync = micro_step == grad_accum_steps - 1 + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + loss = model(x, y) + train_loss += loss.detach() + (loss * grad_scale).backward() + train_loss /= grad_accum_steps + if torch.isnan(train_loss) or torch.isinf(train_loss): + log0(f"FATAL: train_loss is {train_loss.item()} at step {step}. Halting.") + if tracker is not None: + tracker.finalize("failed", phase="nan_detected", step=step) + if wandb is not None: + wandb.finish(1) + break + + for _cb in _callbacks: + _cb.on_after_backward(step, _cb_state) + + frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0 + muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum + for group in optimizer_muon.param_groups: + group["momentum"] = muon_momentum + + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * scale + + if args.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) + for opt in optimizers: + opt.step() + zero_grad_all() + + # SWA: accumulate fp32 weight average during warmdown phase. 
+ if args.swa_interval > 0 and scale < 1.0 and (step + 1) % args.swa_interval == 0: + if swa_state is None: + swa_state = {n: p.data.float().clone() for n, p in base_model.named_parameters()} + else: + for n, p in base_model.named_parameters(): + swa_state[n].add_(p.data.float()) + swa_count += 1 + + step_ms = 1000.0 * (time.perf_counter() - step_t0) + step += 1 + + _step_metrics = {"train_loss": float(train_loss.item())} + for _cb in _callbacks: + _cb.on_step_end(step, _step_metrics, _cb_state) + + approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + tok_s = args.train_batch_tokens / max(step_ms / 1000.0, 1e-9) + should_log_train = ( + args.train_log_every > 0 + and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None) + ) + if should_log_train: + log0( + f"step:{step}/{args.iterations} train_loss:{train_loss.item():.4f} " + f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms tok_s:{tok_s:.0f}" + ) + if tracker is not None: + tracker.heartbeat( + "training", + step=step, + total_steps=args.iterations, + latest_train_loss=float(train_loss.item()), + train_time_ms=approx_training_time_ms, + step_avg_ms=approx_training_time_ms / step, + tok_s=tok_s, + ) + if wandb is not None: + _wandb_payload = { + "run/phase": "training", + "metrics/train_loss": float(train_loss.item()), + "timing/train_time_ms": approx_training_time_ms, + "timing/step_avg_ms": approx_training_time_ms / step, + "timing/tok_s": tok_s, + } + # Forward any extra metrics injected by callbacks + for _mk, _mv in _step_metrics.items(): + if _mk != "train_loss" and isinstance(_mv, (int, float)): + _wandb_payload[f"compression/{_mk}"] = _mv + wandb.log( + _wandb_payload, + step=step, + ) + + # Needed to sync whether we've reached the wallclock cap. + reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms + if distributed and max_wallclock_ms is not None: + reached_cap_tensor = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX) + reached_cap = bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + if stop_after_step is None and _graceful_shutdown: + log0("graceful_shutdown: signal received, stopping after this step") + stop_after_step = step + + for _cb in _callbacks: + _cb.on_train_end(_cb_state) + + log0( + f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB " + f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB" + ) + if tracker is not None: + tracker.heartbeat( + "serializing", + peak_memory_allocated_mib=torch.cuda.max_memory_allocated() // 1024 // 1024, + peak_memory_reserved_mib=torch.cuda.max_memory_reserved() // 1024 // 1024, + ) + + # Remove QAT hooks before serialization. + for h in _qat_hooks: + h.remove() + + # Apply SWA averaged weights if collected. + if swa_state is not None and swa_count > 0: + log0(f"swa: applying averaged weights from {swa_count} snapshots") + with torch.no_grad(): + for n, p in base_model.named_parameters(): + p.data.copy_((swa_state[n] / swa_count).to(dtype=p.dtype)) + del swa_state + + # ----------------------------- + # SERIALIZATION + ROUNDTRIP VALIDATION + # ----------------------------- + # Save the raw state (useful for debugging/loading in PyTorch directly), then always produce + # the compressed int8+zlib artifact and validate the round-tripped weights. 
+ + if master_process: + torch.save(base_model.state_dict(), "final_model.pt") + model_bytes = os.path.getsize("final_model.pt") + code_bytes = len(code.encode("utf-8")) + log0(f"Serialized model: {model_bytes} bytes") + log0(f"Code size: {code_bytes} bytes") + log0(f"Total submission size: {model_bytes + code_bytes} bytes") + + compress_mode = "zstd" if (args.quant_mode in ("int6", "int5_int6") and zstd is not None) else "zlib" + quant_obj, quant_stats = quantize_state_dict(base_model.state_dict(), mode=args.quant_mode) + quant_buf = io.BytesIO() + torch.save(quant_obj, quant_buf) + quant_raw = quant_buf.getvalue() + quant_blob = compress_blob(quant_raw, mode=compress_mode) + quant_raw_bytes = len(quant_raw) + artifact_name = f"final_model.{args.quant_mode}.ptz" + if master_process: + with open(artifact_name, "wb") as f: + f.write(quant_blob) + quant_file_bytes = os.path.getsize(artifact_name) + code_bytes = len(code.encode("utf-8")) + ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1) + log0( + f"Serialized model {args.quant_mode}+{compress_mode}: {quant_file_bytes} bytes " + f"(payload:{quant_stats['int8_payload_bytes']} raw_torch:{quant_raw_bytes} payload_ratio:{ratio:.2f}x)" + ) + log0(f"Total submission size {args.quant_mode}+{compress_mode}: {quant_file_bytes + code_bytes} bytes") + if tracker is not None: + tracker.heartbeat( + "serializing", + final_model_path=str(Path(artifact_name).resolve()), + model_bytes=quant_file_bytes, + ) + + if distributed: + dist.barrier() + with open(artifact_name, "rb") as f: + quant_blob_disk = f.read() + quant_state = torch.load(io.BytesIO(decompress_blob(quant_blob_disk)), map_location="cpu") + base_model.load_state_dict(dequantize_state_dict(quant_state), strict=True) + torch.cuda.synchronize() + t_qeval = time.perf_counter() + q_val_loss, q_val_bpb = validate_model( + args, + model, + rank, + world_size, + device, + grad_accum_steps, + val_tokens, + base_bytes_lut, + has_leading_space_lut, + is_boundary_token_lut, + ) + torch.cuda.synchronize() + q_eval_ms = 1000.0 * (time.perf_counter() - t_qeval) + log0( + f"final_{args.quant_mode}_{compress_mode}_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} " + f"eval_time:{q_eval_ms:.0f}ms" + ) + log0(f"final_{args.quant_mode}_{compress_mode}_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}") + if wandb is not None: + wandb.log( + { + "run/phase": "final", + "metrics/final_val_loss": q_val_loss, + "metrics/final_val_bpb": q_val_bpb, + "artifacts/model_bytes": quant_file_bytes if master_process else None, + "timing/final_eval_ms": q_eval_ms, + }, + step=step, + ) + # LoRA test-time training evaluation (optional, env-var gated). 
+ ttt_val_loss, ttt_val_bpb = None, None + if args.ttt_enabled: + torch._dynamo.reset() + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_val_loss, ttt_val_bpb = ttt_lora_evaluate( + args, base_model, rank, world_size, device, + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + ) + torch.cuda.synchronize() + log0( + f"final_{args.quant_mode}_ttt_lora val_loss:{ttt_val_loss:.4f} val_bpb:{ttt_val_bpb:.4f} " + f"eval_time:{1000.0 * (time.perf_counter() - t_ttt):.0f}ms" + ) + if wandb is not None: + wandb.log({"metrics/ttt_val_loss": ttt_val_loss, "metrics/ttt_val_bpb": ttt_val_bpb}, step=step) + if wandb is not None: + wandb.update_summary( + { + "final_val_loss": q_val_loss, + "final_val_bpb": q_val_bpb, + "ttt_val_bpb": ttt_val_bpb, + "model_bytes": quant_file_bytes if master_process else None, + "backend": "torch", + } + ) + wandb.finish(0) + + if tracker is not None: + result_dict: dict = { + "val_loss": q_val_loss, + "val_bpb": q_val_bpb, + "steps_completed": step, + "train_time_ms": training_time_ms, + } + if ttt_val_bpb is not None: + result_dict["ttt_val_loss"] = ttt_val_loss + result_dict["ttt_val_bpb"] = ttt_val_bpb + tracker.finalize( + "completed", + phase="completed", + result=result_dict, + model_bytes=quant_file_bytes if master_process else None, + ) + + if distributed: + dist.destroy_process_group() + + +if __name__ == "__main__": + main() + +==================================================================================================== +Running Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] +Running PyTorch 2.8.0+cu128 +Sat May 2 03:25:29 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.20 Driver Version: 580.126.20 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
| +|=========================================+========================+======================| +| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off | +| 0% 31C P2 54W / 450W | 396MiB / 24564MiB | 0% Default | +| | | N/A | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| 0 N/A N/A 10483 C /usr/bin/python3.12 386MiB | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:80 +val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021845 +model_params:17125448 +world_size:1 grad_accum_steps:1 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +model_family:jepa_lm attention_variant:standard residual_variant:standard embed_bottleneck_dim:0 +attention_mode:gqa num_heads:8 num_kv_heads:4 share_blocks:1 recurrence_steps:0 state_dim:256 +tie_embeddings:True embed_lr:0.03 head_lr:0.0 matrix_lr:0.02 scalar_lr:0.02 +train_batch_tokens:65536 train_seq_len:1024 iterations:100000 warmup_steps:10 max_wallclock_seconds:7200.000 +lr_schedule:linear_warmdown lr_decay_iters:0 min_lr_scale:0.1000 +train_shard_limit:0 +seed:1337 +warmup_step:1/10 +warmup_step:2/10 +warmup_step:3/10 +warmup_step:4/10 +warmup_step:5/10 +warmup_step:6/10 +warmup_step:7/10 +warmup_step:8/10 +warmup_step:9/10 +warmup_step:10/10 +step:0/100000 val_loss:6.9377 val_bpb:4.1089 train_time:0ms step_avg:0.01ms +step:1/100000 train_loss:6.9361 train_time:164ms step_avg:163.73ms tok_s:400364 +step:2/100000 train_loss:11.1738 train_time:317ms step_avg:158.54ms tok_s:522357 +step:3/100000 train_loss:9.0585 train_time:466ms step_avg:155.24ms tok_s:522548 +step:4/100000 train_loss:7.3698 train_time:613ms step_avg:153.13ms tok_s:524835 +step:5/100000 train_loss:6.8143 train_time:757ms step_avg:151.33ms tok_s:527224 +step:6/100000 train_loss:6.5800 train_time:916ms step_avg:152.66ms tok_s:527294 +step:7/100000 train_loss:6.4854 train_time:1067ms step_avg:152.40ms tok_s:524233 +step:8/100000 train_loss:6.3871 train_time:1215ms step_avg:151.89ms tok_s:528720 +step:9/100000 train_loss:6.3180 train_time:1362ms step_avg:151.28ms tok_s:527377 +step:10/100000 train_loss:6.0322 train_time:1561ms step_avg:156.10ms tok_s:529641 +step:100/100000 train_loss:3.6765 train_time:12810ms step_avg:128.10ms tok_s:525302 +step:200/100000 train_loss:3.2265 train_time:25307ms step_avg:126.54ms tok_s:527785 +step:300/100000 train_loss:2.8766 train_time:37829ms step_avg:126.10ms tok_s:522688 +step:400/100000 train_loss:2.8134 train_time:50365ms step_avg:125.91ms tok_s:523967 +step:500/100000 train_loss:2.6850 train_time:62920ms step_avg:125.84ms tok_s:523432 +step:600/100000 train_loss:2.7277 train_time:75462ms step_avg:125.77ms tok_s:522579 +step:700/100000 train_loss:2.6413 train_time:87986ms step_avg:125.69ms tok_s:528506 +step:800/100000 train_loss:2.6164 train_time:100522ms step_avg:125.65ms tok_s:522272 +step:900/100000 train_loss:2.5965 train_time:113048ms step_avg:125.61ms tok_s:524041 +step:1000/100000 
train_loss:2.6580 train_time:125561ms step_avg:125.56ms tok_s:527181 +step:1100/100000 train_loss:2.5816 train_time:138112ms step_avg:125.56ms tok_s:526319 +step:1200/100000 train_loss:2.6573 train_time:150627ms step_avg:125.52ms tok_s:525273 +step:1300/100000 train_loss:2.5931 train_time:163164ms step_avg:125.51ms tok_s:523083 +step:1400/100000 train_loss:2.3222 train_time:175681ms step_avg:125.49ms tok_s:525110 +step:1500/100000 train_loss:2.5595 train_time:188186ms step_avg:125.46ms tok_s:525377 +step:1600/100000 train_loss:2.5641 train_time:201283ms step_avg:125.80ms tok_s:523061 +step:1700/100000 train_loss:2.6228 train_time:213822ms step_avg:125.78ms tok_s:527564 +step:1800/100000 train_loss:2.5953 train_time:226370ms step_avg:125.76ms tok_s:523318 +step:1900/100000 train_loss:2.5410 train_time:238875ms step_avg:125.72ms tok_s:522161 +step:2000/100000 train_loss:2.5317 train_time:251391ms step_avg:125.70ms tok_s:526247 +step:2100/100000 train_loss:2.5167 train_time:263890ms step_avg:125.66ms tok_s:524936 +step:2200/100000 train_loss:2.5021 train_time:276443ms step_avg:125.66ms tok_s:523174 +step:2300/100000 train_loss:2.3480 train_time:288998ms step_avg:125.65ms tok_s:521589 +step:2400/100000 train_loss:2.4269 train_time:301518ms step_avg:125.63ms tok_s:527032 +step:2500/100000 train_loss:2.4897 train_time:314027ms step_avg:125.61ms tok_s:526347 +step:2600/100000 train_loss:2.4433 train_time:326529ms step_avg:125.59ms tok_s:524202 +step:2700/100000 train_loss:2.5643 train_time:339064ms step_avg:125.58ms tok_s:522481 +step:2800/100000 train_loss:2.3903 train_time:351624ms step_avg:125.58ms tok_s:525427 +step:2900/100000 train_loss:2.3613 train_time:364140ms step_avg:125.57ms tok_s:527629 +step:3000/100000 train_loss:2.3994 train_time:376684ms step_avg:125.56ms tok_s:524244 +step:3100/100000 train_loss:2.3514 train_time:389793ms step_avg:125.74ms tok_s:523922 +step:3200/100000 train_loss:2.3177 train_time:402324ms step_avg:125.73ms tok_s:523024 +step:3300/100000 train_loss:2.4355 train_time:414869ms step_avg:125.72ms tok_s:523342 +step:3400/100000 train_loss:2.3348 train_time:427389ms step_avg:125.70ms tok_s:525061 +step:3500/100000 train_loss:2.3271 train_time:439887ms step_avg:125.68ms tok_s:524138 +step:3600/100000 train_loss:2.4534 train_time:452389ms step_avg:125.66ms tok_s:530093 +step:3700/100000 train_loss:2.3536 train_time:464943ms step_avg:125.66ms tok_s:512715 +step:3800/100000 train_loss:2.3132 train_time:477462ms step_avg:125.65ms tok_s:520223 +step:3900/100000 train_loss:2.3907 train_time:490006ms step_avg:125.64ms tok_s:524884 +step:4000/100000 train_loss:2.3518 train_time:502512ms step_avg:125.63ms tok_s:522776 +step:4100/100000 train_loss:2.2648 train_time:515024ms step_avg:125.62ms tok_s:524324 +step:4200/100000 train_loss:2.3637 train_time:527574ms step_avg:125.61ms tok_s:523940 +step:4300/100000 train_loss:2.1848 train_time:540077ms step_avg:125.60ms tok_s:526252 +step:4400/100000 train_loss:2.2881 train_time:552627ms step_avg:125.60ms tok_s:523285 +step:4500/100000 train_loss:2.3612 train_time:565132ms step_avg:125.58ms tok_s:524830 +step:4600/100000 train_loss:2.2713 train_time:578287ms step_avg:125.71ms tok_s:523606 +step:4700/100000 train_loss:2.2834 train_time:590832ms step_avg:125.71ms tok_s:525115 +step:4800/100000 train_loss:2.3915 train_time:603341ms step_avg:125.70ms tok_s:525181 +step:4900/100000 train_loss:2.3887 train_time:615873ms step_avg:125.69ms tok_s:523007 +step:5000/100000 train_loss:2.3579 train_time:628403ms step_avg:125.68ms tok_s:524241 
+step:5100/100000 train_loss:2.4660 train_time:640933ms step_avg:125.67ms tok_s:522396 +step:5200/100000 train_loss:2.1997 train_time:653451ms step_avg:125.66ms tok_s:528234 +step:5300/100000 train_loss:2.3536 train_time:665961ms step_avg:125.65ms tok_s:529160 +step:5400/100000 train_loss:2.1927 train_time:678479ms step_avg:125.64ms tok_s:522873 +step:5500/100000 train_loss:2.2227 train_time:691024ms step_avg:125.64ms tok_s:523792 +step:5600/100000 train_loss:2.5474 train_time:703584ms step_avg:125.64ms tok_s:524505 +step:5700/100000 train_loss:2.2973 train_time:716096ms step_avg:125.63ms tok_s:526946 +step:5800/100000 train_loss:2.5089 train_time:728602ms step_avg:125.62ms tok_s:524127 +step:5900/100000 train_loss:2.2875 train_time:741106ms step_avg:125.61ms tok_s:526404 +step:6000/100000 train_loss:2.2035 train_time:753609ms step_avg:125.60ms tok_s:523536 +step:6100/100000 train_loss:2.1872 train_time:766181ms step_avg:125.60ms tok_s:520314 +step:6200/100000 train_loss:2.3437 train_time:779519ms step_avg:125.73ms tok_s:527974 +step:6300/100000 train_loss:2.2991 train_time:792022ms step_avg:125.72ms tok_s:526431 +step:6400/100000 train_loss:2.3337 train_time:804523ms step_avg:125.71ms tok_s:524053 +step:6500/100000 train_loss:2.1305 train_time:817026ms step_avg:125.70ms tok_s:521030 +step:6600/100000 train_loss:2.3687 train_time:829597ms step_avg:125.70ms tok_s:523774 +step:6700/100000 train_loss:2.2444 train_time:842110ms step_avg:125.69ms tok_s:524154 +step:6800/100000 train_loss:2.3002 train_time:854615ms step_avg:125.68ms tok_s:524487 +step:6900/100000 train_loss:2.2461 train_time:867131ms step_avg:125.67ms tok_s:525257 +step:7000/100000 train_loss:2.2826 train_time:879655ms step_avg:125.66ms tok_s:523017 +step:7100/100000 train_loss:2.2340 train_time:892196ms step_avg:125.66ms tok_s:522102 +step:7200/100000 train_loss:2.3161 train_time:904733ms step_avg:125.66ms tok_s:524787 +step:7300/100000 train_loss:2.1389 train_time:917233ms step_avg:125.65ms tok_s:523713 +step:7400/100000 train_loss:2.3208 train_time:929718ms step_avg:125.64ms tok_s:527473 +step:7500/100000 train_loss:2.2854 train_time:942247ms step_avg:125.63ms tok_s:522538 +step:7600/100000 train_loss:2.3131 train_time:954756ms step_avg:125.63ms tok_s:523550 +step:7700/100000 train_loss:2.2194 train_time:967939ms step_avg:125.71ms tok_s:522959 +step:7800/100000 train_loss:2.2640 train_time:980488ms step_avg:125.70ms tok_s:524703 +step:7900/100000 train_loss:2.2994 train_time:992988ms step_avg:125.69ms tok_s:524601 +step:8000/100000 train_loss:2.3006 train_time:1005531ms step_avg:125.69ms tok_s:528117 +step:8100/100000 train_loss:2.2975 train_time:1018028ms step_avg:125.68ms tok_s:525128 +step:8200/100000 train_loss:2.5491 train_time:1030541ms step_avg:125.68ms tok_s:522892 +step:8300/100000 train_loss:2.3190 train_time:1043085ms step_avg:125.67ms tok_s:524813 +step:8400/100000 train_loss:2.2438 train_time:1055593ms step_avg:125.67ms tok_s:527089 +step:8500/100000 train_loss:2.2290 train_time:1068140ms step_avg:125.66ms tok_s:523922 +step:8600/100000 train_loss:2.2815 train_time:1080645ms step_avg:125.66ms tok_s:524455 +step:8700/100000 train_loss:2.1982 train_time:1093142ms step_avg:125.65ms tok_s:526198 +step:8800/100000 train_loss:2.1955 train_time:1105675ms step_avg:125.64ms tok_s:523095 +step:8900/100000 train_loss:2.4433 train_time:1118177ms step_avg:125.64ms tok_s:522884 +step:9000/100000 train_loss:2.3575 train_time:1130699ms step_avg:125.63ms tok_s:525154 +step:9100/100000 train_loss:2.3532 train_time:1143195ms 
step_avg:125.63ms tok_s:523293 +step:9200/100000 train_loss:2.1776 train_time:1156384ms step_avg:125.69ms tok_s:523036 +step:9300/100000 train_loss:2.4188 train_time:1168890ms step_avg:125.69ms tok_s:523748 +step:9400/100000 train_loss:2.2155 train_time:1181458ms step_avg:125.69ms tok_s:520661 +step:9500/100000 train_loss:2.2165 train_time:1193998ms step_avg:125.68ms tok_s:523555 +step:9600/100000 train_loss:2.2918 train_time:1206489ms step_avg:125.68ms tok_s:526183 +step:9700/100000 train_loss:2.1936 train_time:1218982ms step_avg:125.67ms tok_s:526829 +step:9800/100000 train_loss:2.3028 train_time:1231489ms step_avg:125.66ms tok_s:523310 +step:9900/100000 train_loss:2.2756 train_time:1244041ms step_avg:125.66ms tok_s:522615 +step:10000/100000 train_loss:2.1569 train_time:1256556ms step_avg:125.66ms tok_s:523940 +step:10000/100000 val_loss:2.1986 val_bpb:1.3021 train_time:1256596ms step_avg:125.66ms +step:10100/100000 train_loss:2.3963 train_time:1269132ms step_avg:125.66ms tok_s:522256 +step:10200/100000 train_loss:2.0496 train_time:1281653ms step_avg:125.65ms tok_s:524152 +step:10300/100000 train_loss:2.2632 train_time:1294158ms step_avg:125.65ms tok_s:522137 +step:10400/100000 train_loss:2.1685 train_time:1306662ms step_avg:125.64ms tok_s:527295 +step:10500/100000 train_loss:2.3905 train_time:1319189ms step_avg:125.64ms tok_s:522280 +step:10600/100000 train_loss:2.2276 train_time:1331718ms step_avg:125.63ms tok_s:523472 +step:10700/100000 train_loss:2.2248 train_time:1344806ms step_avg:125.68ms tok_s:525389 +step:10800/100000 train_loss:2.2280 train_time:1357307ms step_avg:125.68ms tok_s:523085 +step:10900/100000 train_loss:2.3513 train_time:1369804ms step_avg:125.67ms tok_s:527149 +step:11000/100000 train_loss:2.1986 train_time:1382342ms step_avg:125.67ms tok_s:525681 +step:11100/100000 train_loss:2.1969 train_time:1394853ms step_avg:125.66ms tok_s:521867 +step:11200/100000 train_loss:2.2875 train_time:1407397ms step_avg:125.66ms tok_s:528013 +step:11300/100000 train_loss:2.1404 train_time:1419899ms step_avg:125.65ms tok_s:524510 +step:11400/100000 train_loss:2.4164 train_time:1432406ms step_avg:125.65ms tok_s:529247 +step:11500/100000 train_loss:2.2536 train_time:1444938ms step_avg:125.65ms tok_s:526901 +step:11600/100000 train_loss:2.3052 train_time:1457431ms step_avg:125.64ms tok_s:525296 +step:11700/100000 train_loss:2.2103 train_time:1469947ms step_avg:125.64ms tok_s:523094 +step:11800/100000 train_loss:2.0250 train_time:1482475ms step_avg:125.63ms tok_s:524365 +step:11900/100000 train_loss:2.3461 train_time:1494983ms step_avg:125.63ms tok_s:522259 +step:12000/100000 train_loss:2.2272 train_time:1507531ms step_avg:125.63ms tok_s:523924 +step:12100/100000 train_loss:2.2611 train_time:1520029ms step_avg:125.62ms tok_s:526883 +step:12200/100000 train_loss:2.1620 train_time:1532524ms step_avg:125.62ms tok_s:524073 +step:12300/100000 train_loss:2.1250 train_time:1545740ms step_avg:125.67ms tok_s:528455 +step:12400/100000 train_loss:3.6906 train_time:1558262ms step_avg:125.67ms tok_s:522742 +step:12500/100000 train_loss:2.1965 train_time:1570776ms step_avg:125.66ms tok_s:524103 +step:12600/100000 train_loss:2.2111 train_time:1583293ms step_avg:125.66ms tok_s:520004 +step:12700/100000 train_loss:2.2526 train_time:1595818ms step_avg:125.65ms tok_s:528300 +step:12800/100000 train_loss:2.2599 train_time:1608331ms step_avg:125.65ms tok_s:523164 +step:12900/100000 train_loss:2.1978 train_time:1620881ms step_avg:125.65ms tok_s:525161 +step:13000/100000 train_loss:2.1837 train_time:1633387ms 
step_avg:125.65ms tok_s:527899 +step:13100/100000 train_loss:2.1569 train_time:1645898ms step_avg:125.64ms tok_s:528981 +step:13200/100000 train_loss:2.2340 train_time:1658399ms step_avg:125.64ms tok_s:528453 +step:13300/100000 train_loss:2.2284 train_time:1670915ms step_avg:125.63ms tok_s:526800 +step:13400/100000 train_loss:2.2540 train_time:1683482ms step_avg:125.63ms tok_s:521264 +step:13500/100000 train_loss:2.2400 train_time:1696004ms step_avg:125.63ms tok_s:528847 +step:13600/100000 train_loss:2.1654 train_time:1708519ms step_avg:125.63ms tok_s:521752 +step:13700/100000 train_loss:2.2087 train_time:1721024ms step_avg:125.62ms tok_s:525442 +step:13800/100000 train_loss:2.1979 train_time:1734284ms step_avg:125.67ms tok_s:521517 +step:13900/100000 train_loss:2.1792 train_time:1746854ms step_avg:125.67ms tok_s:523506 +step:14000/100000 train_loss:2.1768 train_time:1759412ms step_avg:125.67ms tok_s:526857 +step:14100/100000 train_loss:2.2320 train_time:1771914ms step_avg:125.67ms tok_s:528548 +step:14200/100000 train_loss:2.1702 train_time:1784428ms step_avg:125.66ms tok_s:524579 +step:14300/100000 train_loss:2.1193 train_time:1796950ms step_avg:125.66ms tok_s:520447 +step:14400/100000 train_loss:2.2341 train_time:1809462ms step_avg:125.66ms tok_s:526850 +step:14500/100000 train_loss:2.2549 train_time:1822012ms step_avg:125.66ms tok_s:524045 +step:14600/100000 train_loss:2.2070 train_time:1834523ms step_avg:125.65ms tok_s:526466 +step:14700/100000 train_loss:2.1121 train_time:1847085ms step_avg:125.65ms tok_s:520921 +step:14800/100000 train_loss:2.2255 train_time:1859620ms step_avg:125.65ms tok_s:522047 +step:14900/100000 train_loss:2.2193 train_time:1872131ms step_avg:125.65ms tok_s:525244 +step:15000/100000 train_loss:2.2545 train_time:1884645ms step_avg:125.64ms tok_s:524082 +step:15100/100000 train_loss:2.3004 train_time:1897187ms step_avg:125.64ms tok_s:526920 +step:15200/100000 train_loss:2.2010 train_time:1909708ms step_avg:125.64ms tok_s:525503 +step:15300/100000 train_loss:2.2508 train_time:1922883ms step_avg:125.68ms tok_s:526383 +step:15400/100000 train_loss:2.1389 train_time:1935402ms step_avg:125.68ms tok_s:523920 +step:15500/100000 train_loss:2.2349 train_time:1947927ms step_avg:125.67ms tok_s:524163 +step:15600/100000 train_loss:2.2824 train_time:1960462ms step_avg:125.67ms tok_s:522243 +step:15700/100000 train_loss:2.2045 train_time:1972962ms step_avg:125.67ms tok_s:528231 +step:15800/100000 train_loss:2.2350 train_time:1985496ms step_avg:125.66ms tok_s:528006 +step:15900/100000 train_loss:2.2555 train_time:1997995ms step_avg:125.66ms tok_s:525850 +step:16000/100000 train_loss:2.2688 train_time:2010493ms step_avg:125.66ms tok_s:526111 +step:16100/100000 train_loss:2.2268 train_time:2023009ms step_avg:125.65ms tok_s:520830 +step:16200/100000 train_loss:2.1300 train_time:2035550ms step_avg:125.65ms tok_s:522823 +step:16300/100000 train_loss:2.1876 train_time:2048090ms step_avg:125.65ms tok_s:526107 +step:16400/100000 train_loss:2.2774 train_time:2060609ms step_avg:125.65ms tok_s:523475 +step:16500/100000 train_loss:2.2129 train_time:2073109ms step_avg:125.64ms tok_s:525700 +step:16600/100000 train_loss:2.1743 train_time:2085630ms step_avg:125.64ms tok_s:527692 +step:16700/100000 train_loss:2.3332 train_time:2098193ms step_avg:125.64ms tok_s:521615 +step:16800/100000 train_loss:2.3673 train_time:2111543ms step_avg:125.69ms tok_s:527045 +step:16900/100000 train_loss:2.2187 train_time:2124048ms step_avg:125.68ms tok_s:527633 +step:17000/100000 train_loss:2.2591 
train_time:2136559ms step_avg:125.68ms tok_s:525546 +step:17100/100000 train_loss:2.1840 train_time:2149053ms step_avg:125.68ms tok_s:523897 +step:17200/100000 train_loss:2.1606 train_time:2161599ms step_avg:125.67ms tok_s:522619 +step:17300/100000 train_loss:2.1174 train_time:2174151ms step_avg:125.67ms tok_s:524298 +step:17400/100000 train_loss:2.1762 train_time:2186657ms step_avg:125.67ms tok_s:528140 +step:17500/100000 train_loss:2.2548 train_time:2199143ms step_avg:125.67ms tok_s:524153 +step:17600/100000 train_loss:2.1257 train_time:2211654ms step_avg:125.66ms tok_s:526702 +step:17700/100000 train_loss:2.1261 train_time:2224189ms step_avg:125.66ms tok_s:527869 +step:17800/100000 train_loss:2.2662 train_time:2236730ms step_avg:125.66ms tok_s:523155 +step:17900/100000 train_loss:2.1066 train_time:2249229ms step_avg:125.66ms tok_s:524967 +step:18000/100000 train_loss:2.1481 train_time:2261728ms step_avg:125.65ms tok_s:524396 +step:18100/100000 train_loss:2.0774 train_time:2274251ms step_avg:125.65ms tok_s:512500 +step:18200/100000 train_loss:2.1728 train_time:2286795ms step_avg:125.65ms tok_s:527397 +step:18300/100000 train_loss:2.1655 train_time:2299294ms step_avg:125.64ms tok_s:521631 +step:18400/100000 train_loss:2.4325 train_time:2312599ms step_avg:125.68ms tok_s:527381 +step:18500/100000 train_loss:2.2338 train_time:2325100ms step_avg:125.68ms tok_s:524924 +step:18600/100000 train_loss:2.2692 train_time:2337632ms step_avg:125.68ms tok_s:523646 +step:18700/100000 train_loss:2.1679 train_time:2350154ms step_avg:125.68ms tok_s:524200 +step:18800/100000 train_loss:2.1822 train_time:2362662ms step_avg:125.67ms tok_s:522834 +step:18900/100000 train_loss:2.3894 train_time:2375196ms step_avg:125.67ms tok_s:525623 +step:19000/100000 train_loss:2.1276 train_time:2387721ms step_avg:125.67ms tok_s:529835 +step:19100/100000 train_loss:2.2356 train_time:2400259ms step_avg:125.67ms tok_s:523196 +step:19200/100000 train_loss:2.1757 train_time:2412768ms step_avg:125.66ms tok_s:527253 +step:19300/100000 train_loss:2.1564 train_time:2425268ms step_avg:125.66ms tok_s:527227 +step:19400/100000 train_loss:2.1803 train_time:2437775ms step_avg:125.66ms tok_s:518174 +step:19500/100000 train_loss:2.1752 train_time:2450332ms step_avg:125.66ms tok_s:526473 +step:19600/100000 train_loss:2.2425 train_time:2462875ms step_avg:125.66ms tok_s:524381 +step:19700/100000 train_loss:2.1603 train_time:2475385ms step_avg:125.65ms tok_s:523303 +step:19800/100000 train_loss:2.2841 train_time:2487903ms step_avg:125.65ms tok_s:525073 +step:19900/100000 train_loss:2.2407 train_time:2501130ms step_avg:125.68ms tok_s:522288 +step:20000/100000 train_loss:2.1985 train_time:2513671ms step_avg:125.68ms tok_s:521513 +step:20000/100000 val_loss:2.1346 val_bpb:1.2642 train_time:2513690ms step_avg:125.68ms +step:20100/100000 train_loss:2.2852 train_time:2526123ms step_avg:125.68ms tok_s:525148 +step:20200/100000 train_loss:2.1426 train_time:2538671ms step_avg:125.68ms tok_s:528024 +step:20300/100000 train_loss:2.3058 train_time:2551132ms step_avg:125.67ms tok_s:529511 +step:20400/100000 train_loss:2.0100 train_time:2563609ms step_avg:125.67ms tok_s:529108 +step:20500/100000 train_loss:2.2288 train_time:2576073ms step_avg:125.66ms tok_s:527210 +step:20600/100000 train_loss:2.2684 train_time:2588540ms step_avg:125.66ms tok_s:524424 +step:20700/100000 train_loss:2.0716 train_time:2601083ms step_avg:125.66ms tok_s:523372 +step:20800/100000 train_loss:2.2623 train_time:2613608ms step_avg:125.65ms tok_s:524876 +step:20900/100000 
train_loss:2.1987 train_time:2626076ms step_avg:125.65ms tok_s:527874 +step:21000/100000 train_loss:2.2087 train_time:2638546ms step_avg:125.65ms tok_s:528563 +step:21100/100000 train_loss:2.0819 train_time:2651016ms step_avg:125.64ms tok_s:526717 +step:21200/100000 train_loss:2.1995 train_time:2663585ms step_avg:125.64ms tok_s:528013 +step:21300/100000 train_loss:2.2800 train_time:2676118ms step_avg:125.64ms tok_s:523284 +step:21400/100000 train_loss:2.2261 train_time:2689239ms step_avg:125.67ms tok_s:527772 +step:21500/100000 train_loss:2.2266 train_time:2701714ms step_avg:125.66ms tok_s:522734 +step:21600/100000 train_loss:2.1617 train_time:2714211ms step_avg:125.66ms tok_s:522943 +step:21700/100000 train_loss:2.1506 train_time:2726708ms step_avg:125.65ms tok_s:528449 +step:21800/100000 train_loss:2.1850 train_time:2739202ms step_avg:125.65ms tok_s:522940 +step:21900/100000 train_loss:2.1879 train_time:2751721ms step_avg:125.65ms tok_s:526761 +step:22000/100000 train_loss:2.3050 train_time:2764183ms step_avg:125.64ms tok_s:529809 +step:22100/100000 train_loss:2.0736 train_time:2776701ms step_avg:125.64ms tok_s:521986 +step:22200/100000 train_loss:2.1545 train_time:2789187ms step_avg:125.64ms tok_s:520520 +step:22300/100000 train_loss:2.3120 train_time:2801673ms step_avg:125.64ms tok_s:529159 +step:22400/100000 train_loss:2.1781 train_time:2814201ms step_avg:125.63ms tok_s:517507 +step:22500/100000 train_loss:2.2108 train_time:2826678ms step_avg:125.63ms tok_s:522732 +step:22600/100000 train_loss:2.1377 train_time:2839194ms step_avg:125.63ms tok_s:524892 +step:22700/100000 train_loss:2.1947 train_time:2851668ms step_avg:125.62ms tok_s:522707 +step:22800/100000 train_loss:2.1891 train_time:2864126ms step_avg:125.62ms tok_s:524170 +step:22900/100000 train_loss:2.2526 train_time:2877137ms step_avg:125.64ms tok_s:523462 +step:23000/100000 train_loss:2.2820 train_time:2889654ms step_avg:125.64ms tok_s:526895 +step:23100/100000 train_loss:1.9986 train_time:2902168ms step_avg:125.63ms tok_s:527683 +step:23200/100000 train_loss:2.2509 train_time:2914622ms step_avg:125.63ms tok_s:527032 +step:23300/100000 train_loss:2.2028 train_time:2927086ms step_avg:125.63ms tok_s:524855 +step:23400/100000 train_loss:2.1877 train_time:2939543ms step_avg:125.62ms tok_s:525065 +step:23500/100000 train_loss:2.1994 train_time:2952076ms step_avg:125.62ms tok_s:520717 +step:23600/100000 train_loss:2.2186 train_time:2964578ms step_avg:125.62ms tok_s:528198 +step:23700/100000 train_loss:2.3031 train_time:2977052ms step_avg:125.61ms tok_s:528524 +step:23800/100000 train_loss:2.1060 train_time:2989520ms step_avg:125.61ms tok_s:526393 +step:23900/100000 train_loss:2.2596 train_time:3001986ms step_avg:125.61ms tok_s:528379 +step:24000/100000 train_loss:2.1284 train_time:3014475ms step_avg:125.60ms tok_s:518753 +step:24100/100000 train_loss:2.1409 train_time:3026996ms step_avg:125.60ms tok_s:530063 +step:24200/100000 train_loss:2.1217 train_time:3039473ms step_avg:125.60ms tok_s:523852 +step:24300/100000 train_loss:2.3113 train_time:3051942ms step_avg:125.59ms tok_s:523133 +step:24400/100000 train_loss:2.1856 train_time:3064411ms step_avg:125.59ms tok_s:527067 +step:24500/100000 train_loss:2.2913 train_time:3077579ms step_avg:125.62ms tok_s:523122 +step:24600/100000 train_loss:2.1284 train_time:3090107ms step_avg:125.61ms tok_s:524186 +step:24700/100000 train_loss:2.1561 train_time:3102599ms step_avg:125.61ms tok_s:527919 +step:24800/100000 train_loss:2.2011 train_time:3115077ms step_avg:125.61ms tok_s:529688 
+step:24900/100000 train_loss:2.0527 train_time:3127547ms step_avg:125.60ms tok_s:524700 +step:25000/100000 train_loss:2.1740 train_time:3140067ms step_avg:125.60ms tok_s:527265 +step:25100/100000 train_loss:2.1606 train_time:3152510ms step_avg:125.60ms tok_s:522409 +step:25200/100000 train_loss:2.1442 train_time:3165050ms step_avg:125.60ms tok_s:523823 +step:25300/100000 train_loss:2.1660 train_time:3177506ms step_avg:125.59ms tok_s:527446 +step:25400/100000 train_loss:2.2144 train_time:3190034ms step_avg:125.59ms tok_s:526786 +step:25500/100000 train_loss:2.1140 train_time:3202586ms step_avg:125.59ms tok_s:522482 +step:25600/100000 train_loss:2.1369 train_time:3215116ms step_avg:125.59ms tok_s:522639 +step:25700/100000 train_loss:2.2215 train_time:3227647ms step_avg:125.59ms tok_s:523344 +step:25800/100000 train_loss:2.1468 train_time:3240158ms step_avg:125.59ms tok_s:521801 +step:25900/100000 train_loss:2.0214 train_time:3252650ms step_avg:125.58ms tok_s:522689 +step:26000/100000 train_loss:2.0689 train_time:3265791ms step_avg:125.61ms tok_s:525841 +step:26100/100000 train_loss:2.2879 train_time:3278302ms step_avg:125.61ms tok_s:527909 +step:26200/100000 train_loss:2.0245 train_time:3290804ms step_avg:125.60ms tok_s:528242 +step:26300/100000 train_loss:2.0645 train_time:3303356ms step_avg:125.60ms tok_s:524871 +step:26400/100000 train_loss:2.1269 train_time:3315875ms step_avg:125.60ms tok_s:522107 +step:26500/100000 train_loss:2.1372 train_time:3328381ms step_avg:125.60ms tok_s:525670 +step:26600/100000 train_loss:2.2011 train_time:3340888ms step_avg:125.60ms tok_s:528015 +step:26700/100000 train_loss:2.1744 train_time:3353393ms step_avg:125.60ms tok_s:524023 +step:26800/100000 train_loss:2.1620 train_time:3365928ms step_avg:125.59ms tok_s:526471 +step:26900/100000 train_loss:2.5756 train_time:3378468ms step_avg:125.59ms tok_s:529342 +step:27000/100000 train_loss:2.1502 train_time:3390954ms step_avg:125.59ms tok_s:524996 +step:27100/100000 train_loss:2.0927 train_time:3403466ms step_avg:125.59ms tok_s:519026 +step:27200/100000 train_loss:2.1427 train_time:3415956ms step_avg:125.59ms tok_s:524119 +step:27300/100000 train_loss:2.1516 train_time:3428470ms step_avg:125.58ms tok_s:526022 +step:27400/100000 train_loss:2.0763 train_time:3441045ms step_avg:125.59ms tok_s:523756 +step:27500/100000 train_loss:2.2741 train_time:3454289ms step_avg:125.61ms tok_s:528936 +step:27600/100000 train_loss:2.0483 train_time:3466810ms step_avg:125.61ms tok_s:522526 +step:27700/100000 train_loss:2.1684 train_time:3479317ms step_avg:125.61ms tok_s:528342 +step:27800/100000 train_loss:2.1817 train_time:3491836ms step_avg:125.61ms tok_s:522020 +step:27900/100000 train_loss:2.1758 train_time:3504389ms step_avg:125.61ms tok_s:524869 +step:28000/100000 train_loss:2.2678 train_time:3516913ms step_avg:125.60ms tok_s:527894 +step:28100/100000 train_loss:2.1860 train_time:3529423ms step_avg:125.60ms tok_s:524276 +step:28200/100000 train_loss:2.7327 train_time:3541933ms step_avg:125.60ms tok_s:527738 +step:28300/100000 train_loss:2.2143 train_time:3554456ms step_avg:125.60ms tok_s:519273 +step:28400/100000 train_loss:2.1410 train_time:3566962ms step_avg:125.60ms tok_s:528320 +step:28500/100000 train_loss:2.1368 train_time:3579514ms step_avg:125.60ms tok_s:522025 +step:28600/100000 train_loss:2.2738 train_time:3592024ms step_avg:125.60ms tok_s:524466 +step:28700/100000 train_loss:2.2086 train_time:3604519ms step_avg:125.59ms tok_s:526513 +step:28800/100000 train_loss:2.1193 train_time:3617051ms step_avg:125.59ms 
tok_s:521944 +step:28900/100000 train_loss:2.2082 train_time:3629566ms step_avg:125.59ms tok_s:524934 +step:29000/100000 train_loss:2.1535 train_time:3642788ms step_avg:125.61ms tok_s:524422 +step:29100/100000 train_loss:2.2248 train_time:3655324ms step_avg:125.61ms tok_s:525862 +step:29200/100000 train_loss:2.1532 train_time:3667824ms step_avg:125.61ms tok_s:525656 +step:29300/100000 train_loss:2.1833 train_time:3680357ms step_avg:125.61ms tok_s:526138 +step:29400/100000 train_loss:2.1080 train_time:3692855ms step_avg:125.61ms tok_s:525347 +step:29500/100000 train_loss:2.3803 train_time:3705365ms step_avg:125.61ms tok_s:521364 +step:29600/100000 train_loss:2.1420 train_time:3717891ms step_avg:125.60ms tok_s:527954 +step:29700/100000 train_loss:2.2362 train_time:3730396ms step_avg:125.60ms tok_s:524270 +step:29800/100000 train_loss:2.2077 train_time:3742936ms step_avg:125.60ms tok_s:525846 +step:29900/100000 train_loss:2.1809 train_time:3755434ms step_avg:125.60ms tok_s:522186 +step:30000/100000 train_loss:2.0976 train_time:3767951ms step_avg:125.60ms tok_s:526844 +step:30000/100000 val_loss:2.1086 val_bpb:1.2488 train_time:3767972ms step_avg:125.60ms +step:30100/100000 train_loss:2.2273 train_time:3780456ms step_avg:125.60ms tok_s:516588 +step:30200/100000 train_loss:2.2191 train_time:3792963ms step_avg:125.59ms tok_s:524938 +step:30300/100000 train_loss:2.2119 train_time:3805504ms step_avg:125.59ms tok_s:522554 +step:30400/100000 train_loss:2.2722 train_time:3818029ms step_avg:125.59ms tok_s:526700 +step:30500/100000 train_loss:2.2094 train_time:3830521ms step_avg:125.59ms tok_s:525258 +step:30600/100000 train_loss:2.1271 train_time:3843648ms step_avg:125.61ms tok_s:524351 +step:30700/100000 train_loss:2.1823 train_time:3856145ms step_avg:125.61ms tok_s:524986 +step:30800/100000 train_loss:2.1851 train_time:3868674ms step_avg:125.61ms tok_s:520566 +step:30900/100000 train_loss:2.1986 train_time:3881222ms step_avg:125.61ms tok_s:523068 +step:31000/100000 train_loss:2.2415 train_time:3893739ms step_avg:125.60ms tok_s:523716 +step:31100/100000 train_loss:2.0873 train_time:3906255ms step_avg:125.60ms tok_s:525952 +step:31200/100000 train_loss:2.1808 train_time:3918768ms step_avg:125.60ms tok_s:526671 +step:31300/100000 train_loss:2.2884 train_time:3931277ms step_avg:125.60ms tok_s:522619 +step:31400/100000 train_loss:2.1790 train_time:3943831ms step_avg:125.60ms tok_s:523173 +step:31500/100000 train_loss:2.2261 train_time:3956347ms step_avg:125.60ms tok_s:523864 +step:31600/100000 train_loss:2.1827 train_time:3968843ms step_avg:125.60ms tok_s:521977 +step:31700/100000 train_loss:2.1109 train_time:3981375ms step_avg:125.60ms tok_s:522707 +step:31800/100000 train_loss:2.0036 train_time:3993887ms step_avg:125.59ms tok_s:522444 +step:31900/100000 train_loss:2.2046 train_time:4006401ms step_avg:125.59ms tok_s:524056 +step:32000/100000 train_loss:2.1322 train_time:4018930ms step_avg:125.59ms tok_s:529152 +step:32100/100000 train_loss:2.2328 train_time:4032164ms step_avg:125.61ms tok_s:527458 +step:32200/100000 train_loss:2.2151 train_time:4044661ms step_avg:125.61ms tok_s:529837 +step:32300/100000 train_loss:2.2864 train_time:4057196ms step_avg:125.61ms tok_s:530193 +step:32400/100000 train_loss:2.0870 train_time:4069702ms step_avg:125.61ms tok_s:523703 +step:32500/100000 train_loss:2.0381 train_time:4082242ms step_avg:125.61ms tok_s:523393 +step:32600/100000 train_loss:2.2028 train_time:4094747ms step_avg:125.61ms tok_s:525539 +step:32700/100000 train_loss:2.1459 train_time:4107242ms 
step_avg:125.60ms tok_s:523718 +step:32800/100000 train_loss:2.2167 train_time:4119782ms step_avg:125.60ms tok_s:527375 +step:32900/100000 train_loss:2.1830 train_time:4132289ms step_avg:125.60ms tok_s:526817 +step:33000/100000 train_loss:2.1611 train_time:4144784ms step_avg:125.60ms tok_s:520953 +step:33100/100000 train_loss:2.2363 train_time:4157319ms step_avg:125.60ms tok_s:523230 +step:33200/100000 train_loss:2.0904 train_time:4169831ms step_avg:125.60ms tok_s:521705 +step:33300/100000 train_loss:2.2439 train_time:4182371ms step_avg:125.60ms tok_s:528964 +step:33400/100000 train_loss:2.1925 train_time:4194866ms step_avg:125.59ms tok_s:523689 +step:33500/100000 train_loss:2.1685 train_time:4207368ms step_avg:125.59ms tok_s:526188 +step:33600/100000 train_loss:2.2121 train_time:4220465ms step_avg:125.61ms tok_s:525317 +step:33700/100000 train_loss:2.1323 train_time:4233009ms step_avg:125.61ms tok_s:523453 +step:33800/100000 train_loss:2.0677 train_time:4245510ms step_avg:125.61ms tok_s:522060 +step:33900/100000 train_loss:2.0852 train_time:4258014ms step_avg:125.61ms tok_s:527182 +step:34000/100000 train_loss:2.1896 train_time:4270505ms step_avg:125.60ms tok_s:528409 +step:34100/100000 train_loss:2.1376 train_time:4283008ms step_avg:125.60ms tok_s:522824 +step:34200/100000 train_loss:2.0540 train_time:4295616ms step_avg:125.60ms tok_s:524276 +step:34300/100000 train_loss:2.1875 train_time:4308097ms step_avg:125.60ms tok_s:529167 +step:34400/100000 train_loss:2.1553 train_time:4320599ms step_avg:125.60ms tok_s:526466 +step:34500/100000 train_loss:2.2116 train_time:4333097ms step_avg:125.60ms tok_s:523681 +step:34600/100000 train_loss:2.2359 train_time:4345596ms step_avg:125.60ms tok_s:529063 +step:34700/100000 train_loss:2.1578 train_time:4358151ms step_avg:125.60ms tok_s:525088 +step:34800/100000 train_loss:2.1242 train_time:4370668ms step_avg:125.59ms tok_s:525183 +step:34900/100000 train_loss:2.1083 train_time:4383185ms step_avg:125.59ms tok_s:523533 +step:35000/100000 train_loss:2.1489 train_time:4395686ms step_avg:125.59ms tok_s:525821 +step:35100/100000 train_loss:2.1296 train_time:4408893ms step_avg:125.61ms tok_s:522145 +step:35200/100000 train_loss:2.3878 train_time:4421424ms step_avg:125.61ms tok_s:528204 +step:35300/100000 train_loss:2.1100 train_time:4433976ms step_avg:125.61ms tok_s:524221 +step:35400/100000 train_loss:2.2310 train_time:4446467ms step_avg:125.61ms tok_s:529062 +step:35500/100000 train_loss:2.0483 train_time:4458961ms step_avg:125.60ms tok_s:526588 +step:35600/100000 train_loss:2.1781 train_time:4471486ms step_avg:125.60ms tok_s:520356 +step:35700/100000 train_loss:2.0104 train_time:4484012ms step_avg:125.60ms tok_s:527375 +step:35800/100000 train_loss:2.2101 train_time:4496542ms step_avg:125.60ms tok_s:525676 +step:35900/100000 train_loss:2.1754 train_time:4509069ms step_avg:125.60ms tok_s:521902 +step:36000/100000 train_loss:2.2056 train_time:4521579ms step_avg:125.60ms tok_s:525350 +step:36100/100000 train_loss:2.1562 train_time:4534110ms step_avg:125.60ms tok_s:524439 +step:36200/100000 train_loss:2.4406 train_time:4546608ms step_avg:125.60ms tok_s:523951 +step:36300/100000 train_loss:1.9619 train_time:4559111ms step_avg:125.60ms tok_s:525419 +step:36400/100000 train_loss:2.1774 train_time:4571646ms step_avg:125.59ms tok_s:525492 +step:36500/100000 train_loss:2.2209 train_time:4584154ms step_avg:125.59ms tok_s:525891 +step:36600/100000 train_loss:2.1949 train_time:4596689ms step_avg:125.59ms tok_s:522085 +step:36700/100000 train_loss:2.1786 
train_time:4609894ms step_avg:125.61ms tok_s:517137 +step:36800/100000 train_loss:2.0205 train_time:4622388ms step_avg:125.61ms tok_s:524541 +step:36900/100000 train_loss:2.1504 train_time:4634923ms step_avg:125.61ms tok_s:523470 +step:37000/100000 train_loss:2.1476 train_time:4647461ms step_avg:125.61ms tok_s:528458 +step:37100/100000 train_loss:2.1871 train_time:4659996ms step_avg:125.61ms tok_s:524188 +step:37200/100000 train_loss:2.1224 train_time:4672492ms step_avg:125.60ms tok_s:524837 +step:37300/100000 train_loss:2.1541 train_time:4685013ms step_avg:125.60ms tok_s:525711 +step:37400/100000 train_loss:2.1387 train_time:4697514ms step_avg:125.60ms tok_s:524224 +step:37500/100000 train_loss:2.1377 train_time:4710088ms step_avg:125.60ms tok_s:524657 +step:37600/100000 train_loss:2.1899 train_time:4722617ms step_avg:125.60ms tok_s:526520 +step:37700/100000 train_loss:2.1343 train_time:4735122ms step_avg:125.60ms tok_s:524568 +step:37800/100000 train_loss:2.2577 train_time:4747642ms step_avg:125.60ms tok_s:527891 +step:37900/100000 train_loss:2.2717 train_time:4760158ms step_avg:125.60ms tok_s:524051 +step:38000/100000 train_loss:2.1097 train_time:4772694ms step_avg:125.60ms tok_s:522392 +step:38100/100000 train_loss:2.1934 train_time:4785217ms step_avg:125.60ms tok_s:527891 +step:38200/100000 train_loss:2.2169 train_time:4798445ms step_avg:125.61ms tok_s:527835 +step:38300/100000 train_loss:2.1461 train_time:4810975ms step_avg:125.61ms tok_s:513749 +step:38400/100000 train_loss:2.0586 train_time:4823480ms step_avg:125.61ms tok_s:526013 +step:38500/100000 train_loss:2.0669 train_time:4836028ms step_avg:125.61ms tok_s:526551 +step:38600/100000 train_loss:2.2197 train_time:4848596ms step_avg:125.61ms tok_s:524249 +step:38700/100000 train_loss:2.1365 train_time:4861101ms step_avg:125.61ms tok_s:527120 +step:38800/100000 train_loss:2.2404 train_time:4873619ms step_avg:125.61ms tok_s:525265 +step:38900/100000 train_loss:2.1169 train_time:4886130ms step_avg:125.61ms tok_s:525772 +step:39000/100000 train_loss:2.0871 train_time:4898684ms step_avg:125.61ms tok_s:524178 +step:39100/100000 train_loss:2.1214 train_time:4911210ms step_avg:125.61ms tok_s:523570 +step:39200/100000 train_loss:2.0810 train_time:4923735ms step_avg:125.61ms tok_s:525465 +step:39300/100000 train_loss:2.1604 train_time:4936247ms step_avg:125.60ms tok_s:522098 +step:39400/100000 train_loss:2.1391 train_time:4948751ms step_avg:125.60ms tok_s:523062 +step:39500/100000 train_loss:2.2653 train_time:4961266ms step_avg:125.60ms tok_s:525667 +step:39600/100000 train_loss:2.3532 train_time:4973769ms step_avg:125.60ms tok_s:521775 +step:39700/100000 train_loss:2.1473 train_time:4987136ms step_avg:125.62ms tok_s:522445 +step:39800/100000 train_loss:2.1760 train_time:4999651ms step_avg:125.62ms tok_s:529546 +step:39900/100000 train_loss:2.1537 train_time:5012166ms step_avg:125.62ms tok_s:522177 +step:40000/100000 train_loss:2.2008 train_time:5024689ms step_avg:125.62ms tok_s:523064 +step:40000/100000 val_loss:2.0875 val_bpb:1.2363 train_time:5024722ms step_avg:125.62ms +step:40100/100000 train_loss:2.1034 train_time:5037247ms step_avg:125.62ms tok_s:522837 +step:40200/100000 train_loss:2.1623 train_time:5049758ms step_avg:125.62ms tok_s:523307 +step:40300/100000 train_loss:2.1545 train_time:5062279ms step_avg:125.61ms tok_s:521956 +step:40400/100000 train_loss:2.0666 train_time:5074819ms step_avg:125.61ms tok_s:521602 +step:40500/100000 train_loss:2.1174 train_time:5087350ms step_avg:125.61ms tok_s:522338 +step:40600/100000 
train_loss:2.0854 train_time:5099927ms step_avg:125.61ms tok_s:524496 +step:40700/100000 train_loss:2.1831 train_time:5112423ms step_avg:125.61ms tok_s:526457 +step:40800/100000 train_loss:2.1008 train_time:5124955ms step_avg:125.61ms tok_s:527659 +step:40900/100000 train_loss:2.2627 train_time:5137469ms step_avg:125.61ms tok_s:521954 +step:41000/100000 train_loss:2.1341 train_time:5150039ms step_avg:125.61ms tok_s:523275 +step:41100/100000 train_loss:2.1196 train_time:5162549ms step_avg:125.61ms tok_s:526820 +step:41200/100000 train_loss:2.2816 train_time:5175782ms step_avg:125.63ms tok_s:525796 +step:41300/100000 train_loss:2.0747 train_time:5188293ms step_avg:125.62ms tok_s:524658 +step:41400/100000 train_loss:2.0836 train_time:5200800ms step_avg:125.62ms tok_s:523258 +step:41500/100000 train_loss:2.2346 train_time:5213359ms step_avg:125.62ms tok_s:523991 +step:41600/100000 train_loss:2.1674 train_time:5225877ms step_avg:125.62ms tok_s:528105 +step:41700/100000 train_loss:2.2472 train_time:5238384ms step_avg:125.62ms tok_s:524365 +step:41800/100000 train_loss:2.1314 train_time:5250882ms step_avg:125.62ms tok_s:526911 +step:41900/100000 train_loss:2.3944 train_time:5263389ms step_avg:125.62ms tok_s:528054 +step:42000/100000 train_loss:2.1553 train_time:5275943ms step_avg:125.62ms tok_s:523727 +step:42100/100000 train_loss:2.1728 train_time:5288490ms step_avg:125.62ms tok_s:522449 +step:42200/100000 train_loss:2.1994 train_time:5301004ms step_avg:125.62ms tok_s:527909 +step:42300/100000 train_loss:2.2001 train_time:5313519ms step_avg:125.62ms tok_s:522643 +step:42400/100000 train_loss:2.1379 train_time:5326057ms step_avg:125.61ms tok_s:521466 +step:42500/100000 train_loss:2.1549 train_time:5338591ms step_avg:125.61ms tok_s:524695 +step:42600/100000 train_loss:2.1463 train_time:5351133ms step_avg:125.61ms tok_s:520497 +step:42700/100000 train_loss:2.2067 train_time:5363663ms step_avg:125.61ms tok_s:525266 +step:42800/100000 train_loss:1.9043 train_time:5376918ms step_avg:125.63ms tok_s:524329 +step:42900/100000 train_loss:3.2506 train_time:5389454ms step_avg:125.63ms tok_s:523108 +step:43000/100000 train_loss:2.0216 train_time:5401975ms step_avg:125.63ms tok_s:523976 +step:43100/100000 train_loss:2.0616 train_time:5414490ms step_avg:125.63ms tok_s:522184 +step:43200/100000 train_loss:2.1106 train_time:5427033ms step_avg:125.63ms tok_s:526813 +step:43300/100000 train_loss:2.0591 train_time:5439537ms step_avg:125.62ms tok_s:523972 +step:43400/100000 train_loss:2.1607 train_time:5452086ms step_avg:125.62ms tok_s:524211 +step:43500/100000 train_loss:2.0575 train_time:5464598ms step_avg:125.62ms tok_s:527246 +step:43600/100000 train_loss:2.3296 train_time:5477113ms step_avg:125.62ms tok_s:527724 +step:43700/100000 train_loss:2.1536 train_time:5489668ms step_avg:125.62ms tok_s:521055 +step:43800/100000 train_loss:1.9377 train_time:5502182ms step_avg:125.62ms tok_s:521967 +step:43900/100000 train_loss:2.2077 train_time:5514731ms step_avg:125.62ms tok_s:523371 +step:44000/100000 train_loss:2.1409 train_time:5527237ms step_avg:125.62ms tok_s:526839 +step:44100/100000 train_loss:2.2051 train_time:5539757ms step_avg:125.62ms tok_s:527638 +step:44200/100000 train_loss:2.1198 train_time:5552277ms step_avg:125.62ms tok_s:524000 +step:44300/100000 train_loss:1.9822 train_time:5565609ms step_avg:125.63ms tok_s:521164 +step:44400/100000 train_loss:2.3156 train_time:5578164ms step_avg:125.63ms tok_s:527386 +step:44500/100000 train_loss:2.0383 train_time:5590692ms step_avg:125.63ms tok_s:525283 
+step:44600/100000 train_loss:2.1604 train_time:5603196ms step_avg:125.63ms tok_s:521657 +step:44700/100000 train_loss:2.2099 train_time:5615712ms step_avg:125.63ms tok_s:523877 +step:44800/100000 train_loss:2.0513 train_time:5628279ms step_avg:125.63ms tok_s:522133 +step:44900/100000 train_loss:2.0510 train_time:5640821ms step_avg:125.63ms tok_s:524990 +step:45000/100000 train_loss:2.0200 train_time:5653355ms step_avg:125.63ms tok_s:527526 +step:45100/100000 train_loss:2.1761 train_time:5665868ms step_avg:125.63ms tok_s:528829 +step:45200/100000 train_loss:2.0640 train_time:5678377ms step_avg:125.63ms tok_s:526197 +step:45300/100000 train_loss:2.1302 train_time:5690913ms step_avg:125.63ms tok_s:522633 +step:45400/100000 train_loss:2.1667 train_time:5703476ms step_avg:125.63ms tok_s:522670 +step:45500/100000 train_loss:2.2097 train_time:5715987ms step_avg:125.63ms tok_s:522019 +step:45600/100000 train_loss:2.3897 train_time:5728522ms step_avg:125.63ms tok_s:523973 +step:45700/100000 train_loss:1.9981 train_time:5741049ms step_avg:125.62ms tok_s:526128 +step:45800/100000 train_loss:2.0946 train_time:5754239ms step_avg:125.64ms tok_s:526679 +step:45900/100000 train_loss:2.1099 train_time:5766780ms step_avg:125.64ms tok_s:522827 +step:46000/100000 train_loss:2.0827 train_time:5779308ms step_avg:125.64ms tok_s:523358 +step:46100/100000 train_loss:2.3855 train_time:5791825ms step_avg:125.64ms tok_s:527606 +step:46200/100000 train_loss:2.1534 train_time:5804346ms step_avg:125.64ms tok_s:525588 +step:46300/100000 train_loss:2.0582 train_time:5816886ms step_avg:125.63ms tok_s:510977 +step:46400/100000 train_loss:2.0286 train_time:5829384ms step_avg:125.63ms tok_s:525876 +step:46500/100000 train_loss:2.2393 train_time:5841946ms step_avg:125.63ms tok_s:526504 +step:46600/100000 train_loss:2.0532 train_time:5854449ms step_avg:125.63ms tok_s:525532 +step:46700/100000 train_loss:2.1869 train_time:5866978ms step_avg:125.63ms tok_s:522968 +step:46800/100000 train_loss:2.1441 train_time:5879504ms step_avg:125.63ms tok_s:520809 +step:46900/100000 train_loss:2.0181 train_time:5892035ms step_avg:125.63ms tok_s:527115 +step:47000/100000 train_loss:2.1819 train_time:5904572ms step_avg:125.63ms tok_s:522999 +step:47100/100000 train_loss:2.0942 train_time:5917093ms step_avg:125.63ms tok_s:524108 +step:47200/100000 train_loss:2.0815 train_time:5929621ms step_avg:125.63ms tok_s:524050 +step:47300/100000 train_loss:2.1108 train_time:5942156ms step_avg:125.63ms tok_s:526740 +step:47400/100000 train_loss:2.1333 train_time:5955510ms step_avg:125.64ms tok_s:525289 +step:47500/100000 train_loss:2.0688 train_time:5968035ms step_avg:125.64ms tok_s:521723 +step:47600/100000 train_loss:2.0643 train_time:5980588ms step_avg:125.64ms tok_s:526173 +step:47700/100000 train_loss:2.1641 train_time:5993138ms step_avg:125.64ms tok_s:523941 +step:47800/100000 train_loss:2.0643 train_time:6005638ms step_avg:125.64ms tok_s:522247 +step:47900/100000 train_loss:2.1857 train_time:6018144ms step_avg:125.64ms tok_s:528336 +step:48000/100000 train_loss:2.1370 train_time:6030649ms step_avg:125.64ms tok_s:523653 +step:48100/100000 train_loss:2.2579 train_time:6043193ms step_avg:125.64ms tok_s:523185 +step:48200/100000 train_loss:2.1138 train_time:6055746ms step_avg:125.64ms tok_s:525171 +step:48300/100000 train_loss:2.2028 train_time:6068250ms step_avg:125.64ms tok_s:526541 +step:48400/100000 train_loss:1.9908 train_time:6080773ms step_avg:125.64ms tok_s:527363 +step:48500/100000 train_loss:2.4648 train_time:6093273ms step_avg:125.63ms 
tok_s:527981 +step:48600/100000 train_loss:2.2082 train_time:6105776ms step_avg:125.63ms tok_s:522222 +step:48700/100000 train_loss:2.1807 train_time:6118327ms step_avg:125.63ms tok_s:526809 +step:48800/100000 train_loss:2.5954 train_time:6130783ms step_avg:125.63ms tok_s:526849 +step:48900/100000 train_loss:2.1255 train_time:6143879ms step_avg:125.64ms tok_s:525759 +step:49000/100000 train_loss:2.3894 train_time:6156343ms step_avg:125.64ms tok_s:525742 +step:49100/100000 train_loss:2.2673 train_time:6168837ms step_avg:125.64ms tok_s:522275 +step:49200/100000 train_loss:2.1066 train_time:6181367ms step_avg:125.64ms tok_s:524503 +step:49300/100000 train_loss:2.1779 train_time:6193857ms step_avg:125.64ms tok_s:523379 +step:49400/100000 train_loss:2.1406 train_time:6206321ms step_avg:125.63ms tok_s:526069 +step:49500/100000 train_loss:2.1461 train_time:6218805ms step_avg:125.63ms tok_s:528848 +step:49600/100000 train_loss:2.1215 train_time:6231326ms step_avg:125.63ms tok_s:521531 +step:49700/100000 train_loss:2.1108 train_time:6243812ms step_avg:125.63ms tok_s:526220 +step:49800/100000 train_loss:2.0952 train_time:6256373ms step_avg:125.63ms tok_s:524905 +step:49900/100000 train_loss:2.3691 train_time:6268854ms step_avg:125.63ms tok_s:526552 +step:50000/100000 train_loss:2.0843 train_time:6281339ms step_avg:125.63ms tok_s:525697 +step:50000/100000 val_loss:2.0786 val_bpb:1.2311 train_time:6281365ms step_avg:125.63ms +step:50100/100000 train_loss:2.1589 train_time:6293805ms step_avg:125.62ms tok_s:528085 +step:50200/100000 train_loss:2.0704 train_time:6306310ms step_avg:125.62ms tok_s:522280 +step:50300/100000 train_loss:2.1433 train_time:6318797ms step_avg:125.62ms tok_s:525869 +step:50400/100000 train_loss:2.1942 train_time:6331883ms step_avg:125.63ms tok_s:523687 +step:50500/100000 train_loss:2.1465 train_time:6344431ms step_avg:125.63ms tok_s:523846 +step:50600/100000 train_loss:2.2128 train_time:6356967ms step_avg:125.63ms tok_s:524110 +step:50700/100000 train_loss:2.1317 train_time:6369511ms step_avg:125.63ms tok_s:524511 +step:50800/100000 train_loss:2.1453 train_time:6382013ms step_avg:125.63ms tok_s:525729 +step:50900/100000 train_loss:2.3119 train_time:6394533ms step_avg:125.63ms tok_s:523536 +step:51000/100000 train_loss:2.3085 train_time:6407064ms step_avg:125.63ms tok_s:524265 +step:51100/100000 train_loss:2.0818 train_time:6419600ms step_avg:125.63ms tok_s:524950 +step:51200/100000 train_loss:2.3628 train_time:6432145ms step_avg:125.63ms tok_s:525544 +step:51300/100000 train_loss:2.0174 train_time:6444672ms step_avg:125.63ms tok_s:523357 +step:51400/100000 train_loss:2.3946 train_time:6457191ms step_avg:125.63ms tok_s:498776 +step:51500/100000 train_loss:2.1833 train_time:6469745ms step_avg:125.63ms tok_s:523148 +step:51600/100000 train_loss:2.0762 train_time:6482289ms step_avg:125.63ms tok_s:521742 +step:51700/100000 train_loss:2.3023 train_time:6494836ms step_avg:125.63ms tok_s:522858 +step:51800/100000 train_loss:2.0697 train_time:6507353ms step_avg:125.62ms tok_s:524170 +step:51900/100000 train_loss:1.9340 train_time:6520410ms step_avg:125.63ms tok_s:526551 +step:52000/100000 train_loss:2.1224 train_time:6532904ms step_avg:125.63ms tok_s:528325 +step:52100/100000 train_loss:2.1727 train_time:6545409ms step_avg:125.63ms tok_s:520868 +step:52200/100000 train_loss:2.1571 train_time:6557956ms step_avg:125.63ms tok_s:515131 +step:52300/100000 train_loss:2.1107 train_time:6570456ms step_avg:125.63ms tok_s:524814 +step:52400/100000 train_loss:1.9621 train_time:6582945ms 
step_avg:125.63ms tok_s:525192 +step:52500/100000 train_loss:1.9908 train_time:6595440ms step_avg:125.63ms tok_s:525375 +step:52600/100000 train_loss:2.1009 train_time:6607974ms step_avg:125.63ms tok_s:524181 +step:52700/100000 train_loss:2.1119 train_time:6620496ms step_avg:125.63ms tok_s:523874 +step:52800/100000 train_loss:2.1664 train_time:6633013ms step_avg:125.63ms tok_s:527788 +step:52900/100000 train_loss:2.0532 train_time:6645516ms step_avg:125.62ms tok_s:528676 +step:53000/100000 train_loss:2.1365 train_time:6658013ms step_avg:125.62ms tok_s:523781 +step:53100/100000 train_loss:2.0674 train_time:6670535ms step_avg:125.62ms tok_s:524661 +step:53200/100000 train_loss:2.1221 train_time:6683042ms step_avg:125.62ms tok_s:524002 +step:53300/100000 train_loss:2.0715 train_time:6695594ms step_avg:125.62ms tok_s:525855 +step:53400/100000 train_loss:2.1359 train_time:6708087ms step_avg:125.62ms tok_s:522099 +step:53500/100000 train_loss:2.1182 train_time:6721131ms step_avg:125.63ms tok_s:524566 +step:53600/100000 train_loss:2.1735 train_time:6733661ms step_avg:125.63ms tok_s:529891 +step:53700/100000 train_loss:2.1066 train_time:6746146ms step_avg:125.63ms tok_s:525381 +step:53800/100000 train_loss:2.2506 train_time:6758678ms step_avg:125.63ms tok_s:524813 +step:53900/100000 train_loss:2.1350 train_time:6771181ms step_avg:125.62ms tok_s:524682 diff --git a/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/train_gpt.py b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/train_gpt.py new file mode 100644 index 0000000000..5be36c0dae --- /dev/null +++ b/records/track_non_record_16mb/2026-05-02_JEPA_Ablation_14run_NegativeResult/train_gpt.py @@ -0,0 +1,14 @@ +"""Compatibility wrapper -- delegates to src/crucible/training/torch_backend.py + +The training loop has been extracted into the crucible.training module. +This file remains for backward compatibility so that existing scripts, +fleet configs, and documentation that reference ``train_gpt.py`` keep working. +""" +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent / "src")) +from crucible.training.torch_backend import main + +if __name__ == "__main__": + main()
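
For anyone reprocessing the run log above: the periodic validation records follow a fixed `step:N/TOTAL val_loss:X val_bpb:Y ...` layout, while per-step lines carry `train_loss:` instead. Below is a minimal sketch (not part of the submitted artifacts) for extracting the `val_bpb` series from a log in that format so curves from different variants can be compared; the filename is hypothetical, and `search()` is used so the leading `+` diff markers are tolerated.

```python
import re
from pathlib import Path

# Validation records look like:
#   step:50000/100000 val_loss:2.0786 val_bpb:1.2311 train_time:6281365ms step_avg:125.63ms
# Per-step training lines carry train_loss: instead, so the pattern skips them.
VAL_RE = re.compile(r"step:(\d+)/\d+ val_loss:([\d.]+) val_bpb:([\d.]+)")

def val_checkpoints(log_path: str) -> dict[int, float]:
    """Return {step: val_bpb} for every validation record in the log."""
    series: dict[int, float] = {}
    for line in Path(log_path).read_text().splitlines():
        m = VAL_RE.search(line)  # search(), not match(): tolerates a leading '+' diff marker
        if m:
            series[int(m.group(1))] = float(m.group(3))
    return series

if __name__ == "__main__":
    # Hypothetical filename; against the log above this would yield
    # {0: 4.1089, 10000: 1.3021, 20000: 1.2642, 30000: 1.2488, 40000: 1.2363, 50000: 1.2311}
    print(val_checkpoints("var-zero_seed1337.log"))
```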