records/track_10min_16mb/2026-03-22_RadialBitNet/README.md
# Radial-BitNet 16MB Titan

This submission presents an experimental compressed language-model design for the Parameter Golf 16MB track.

The approach combines:
- BitNet-style ternary-weight linear projections,
- a custom positional scheme called **Radial Encoding**,
- a custom optimizer called **FRO (Fractal Resonant Optimization)**,
- compressed post-training export under the official artifact-size accounting rule.

This is a public experimental submission intended to demonstrate a non-standard architecture under the Parameter Golf constraints. The attached result was obtained from a development run on non-target hardware; this README makes no claim that the reported score has been reproduced under the official 8xH100 SXM record-track environment.

## Summary

The goal of this design is to push model capacity as far as possible under the official submission artifact limit by combining:
- ternary-style projection behavior for major linear layers,
- reduced learned overhead,
- tied embeddings,
- compressed final export,
- a training setup optimized for short wall-clock execution.

Rather than following a conventional FP16 baseline recipe, this submission explores a more aggressive compression-oriented design.

## Key Ideas

### 1. BitLinear Expansion
All major projections (`Q`, `K`, `V`, `O`, and MLP projections) use BitNet-style ternary-weight forward behavior. The purpose is to reduce effective storage pressure while preserving as much model width and depth as possible within the artifact budget.
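The ternary forward behavior can be sketched as follows. This is a minimal NumPy illustration of BitNet-style absmean quantization, not the submission's actual `train_gpt.py` code; a real training loop would additionally use a straight-through estimator so gradients flow to the latent full-precision weights.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    # Absmean quantization in the BitNet b1.58 style: scale by the mean
    # absolute weight, then round each entry to {-1, 0, +1}.
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

def bitlinear_forward(x, w):
    # Hypothetical BitLinear forward: the matmul uses ternary weights,
    # rescaled so activation magnitudes are preserved.
    q, scale = ternary_quantize(w)
    return x @ (q.T * scale)
```

Because each weight carries roughly 1.58 bits of information, the quantized matrices compress far better than FP16 tensors in the exported artifact.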

### 2. Radial Encoding
Learned positional embeddings are removed. Instead, position-dependent geometric features are injected analytically through `RadialEncoding(8)`. This reduces learned parameter overhead while retaining explicit positional structure.
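The exact form of `RadialEncoding(8)` is not specified in this README. Purely as an illustration of what an 8-channel analytic (non-learned) positional feature map can look like, here is a sketch in the spirit of sinusoidal encodings; the actual radial construction used by the submission may differ substantially.

```python
import math

def positional_features(pos, n_features=8, base=10000.0):
    # Hypothetical analytic positional features: sin/cos pairs at
    # geometrically spaced frequencies. This stands in for the
    # submission's unspecified RadialEncoding(8); the key shared
    # property is zero learned parameters.
    feats = []
    for i in range(n_features // 2):
        freq = 1.0 / (base ** (2 * i / n_features))
        feats.append(math.sin(pos * freq))
        feats.append(math.cos(pos * freq))
    return feats
```

Whatever its precise form, an analytic scheme like this removes the `seq_len x d_model` learned positional table from the artifact budget entirely.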

### 3. FRO Optimizer
`FRO` is a custom optimizer designed for short-horizon convergence under highly quantized weight dynamics. It replaces AdamW in this submission and is part of the experimental contribution.
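The FRO update rule itself is not described in this README. To show only the slot it occupies, here is a generic momentum-SGD step in the functional shape a drop-in AdamW replacement takes; this is explicitly a placeholder, not the FRO algorithm.

```python
import numpy as np

def optimizer_step(w, grad, state, lr=0.02, momentum=0.9):
    # Placeholder update occupying the slot where FRO replaces AdamW.
    # This is plain momentum SGD, NOT the FRO algorithm, whose update
    # rule is part of the submission and is not documented here.
    buf = state.get("momentum_buffer")
    buf = grad if buf is None else momentum * buf + grad
    state["momentum_buffer"] = buf
    return w - lr * buf, state
```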

## Configuration

- **Layers:** 12
- **Model Dimension:** 384
- **Attention Heads:** 6
- **KV Heads:** 2
- **Vocabulary Size:** 1024
- **Approximate Parameter Count:** 15.6M
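A back-of-envelope check of this configuration (assuming bias-free projections, GQA-shaped K/V matrices, ignoring norm parameters, and counting the tied embedding once) pins down the attention and embedding budgets exactly; since the MLP shape is not stated, the sketch instead solves for the hidden width implied by the quoted ~15.6M total.

```python
d, layers, vocab = 384, 12, 1024
n_heads, n_kv = 6, 2
head_dim = d // n_heads                        # 64

emb = vocab * d                                # tied in/out embedding, counted once
attn = d * d + 2 * (n_kv * head_dim * d) + d * d  # Q, K, V (GQA), O per layer

# Solve for the MLP hidden width implied by ~15.6M total parameters,
# assuming a plain 2-matrix MLP (an assumption; the real shape is unstated).
mlp_budget = (15_600_000 - emb) / layers - attn
hidden = mlp_budget / (2 * d)
print(emb, attn, round(hidden))
```

The small 1024-entry vocabulary is what makes tied embeddings nearly free here (~0.4M parameters), leaving almost the entire budget for the transformer stack.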

## Artifact Accounting

The submission script performs a post-training artifact audit using:
- counted source-code bytes from `train_gpt.py`
- compressed exported model bytes
- a final decimal-byte check against the official `16,000,000` byte submission limit

The audit is performed after training and writes the compressed model artifact physically to disk before measuring its byte size.
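The described audit can be sketched as below. This is an illustrative stdlib version, not the submission's actual audit code: gzip stands in for whatever compressor the script uses, and summing source bytes with the artifact is an assumption here; the official accounting rule governs what actually counts.

```python
import gzip
import os

LIMIT = 16_000_000  # official decimal-byte submission limit

def audit(source_path, model_bytes, out_path):
    # Count raw source bytes, write the compressed model artifact
    # physically to disk, then measure its on-disk byte size and
    # check the total against the decimal-byte limit.
    src_bytes = os.path.getsize(source_path)
    with open(out_path, "wb") as f:
        f.write(gzip.compress(model_bytes))
    artifact_bytes = os.path.getsize(out_path)
    return src_bytes, artifact_bytes, src_bytes + artifact_bytes <= LIMIT
```

Measuring the artifact after it is physically written avoids the trap of reporting an in-memory estimate that differs from what a verifier would see on disk.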

## Evaluation

The script implements tokenizer-agnostic BPB evaluation over the official validation shard format used by the challenge. In record-track mode, the script is designed to fail explicitly if required tokenizer or dataset files are missing.

Mock or debug behavior is only enabled when explicitly requested through environment flags.
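Tokenizer-agnostic bits-per-byte normalizes the model's total negative log-likelihood by the raw UTF-8 byte count of the validation text rather than by token count, so models with different vocabularies (such as this 1024-entry one) are directly comparable:

```python
import math

def bits_per_byte(total_nll_nats, total_utf8_bytes):
    # Convert summed validation NLL from nats to bits, then normalize
    # by raw UTF-8 bytes instead of tokens. A larger vocabulary yields
    # fewer tokens but higher per-token loss; BPB cancels that out.
    return total_nll_nats / (math.log(2) * total_utf8_bytes)
```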

## Reproducibility Notes

`train_gpt.py` is designed to:
- support distributed execution,
- run with explicit record-track failure behavior when required assets are missing,
- produce a final post-training artifact audit,
- run a final validation pass before reporting the result.

## Development Status

The result currently attached to this submission comes from a development run on non-target hardware. This repository entry is intended as a serious experimental submission and as a candidate for further validation under the official challenge hardware setting.

## Files Included

This submission includes:
- `README.md`
- `submission.json`
- `train.log`
- `train_gpt.py`

## Notes

This submission should be interpreted as an experimental compressed-model approach, not as a claim of already-verified record-track performance on 8xH100 SXM.
records/track_10min_16mb/2026-03-22_RadialBitNet/submission.json
{
"author": "Christian Q. De Luca",
"github_id": "rthgit",
"val_bpb": "2.6034",
"model_size": "13100000",
"hardware": "Kaggle Dual T4 (development run)",
"training_time": "562s"
}
records/track_10min_16mb/2026-03-22_RadialBitNet/train.log
✨ Initializing Radial-BitNet for Parameter Golf (Constraint: 16MB)

📦 Artifact Size Audit:
- Parameters: 15.7 M
- Compressed Size: 12.55 MB
✅ QUALIFIED FOR PARAMETER GOLF! (<16MB)
⏳ Loading training tokens into memory...
Loading single dataset shard to protect Kaggle RAM: /kaggle/working/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_train_000000.bin

🚀 Starting 10-Minute Rapid Convergence Cycle on real dataset...
Step 0000 | Time 1s | Train Loss: 148.7086 | Val BPB: 28.1633 ⛳
Step 0050 | Time 23s | Train Loss: 9.2294 | Val BPB: 3.9549 ⛳
Step 0100 | Time 46s | Train Loss: 6.5566 | Val BPB: 2.8735 ⛳
Step 0150 | Time 69s | Train Loss: 6.2854 | Val BPB: 2.7771 ⛳
Step 0200 | Time 91s | Train Loss: 6.6208 | Val BPB: 2.7590 ⛳
Step 0250 | Time 114s | Train Loss: 6.1678 | Val BPB: 2.6836 ⛳
Step 0300 | Time 136s | Train Loss: 6.2128 | Val BPB: 2.6946 ⛳
Step 0350 | Time 159s | Train Loss: 6.1435 | Val BPB: 2.6694 ⛳
Step 0400 | Time 181s | Train Loss: 6.0490 | Val BPB: 2.7252 ⛳
Step 0450 | Time 204s | Train Loss: 6.2580 | Val BPB: 2.6844 ⛳
Step 0500 | Time 226s | Train Loss: 6.7366 | Val BPB: 2.6667 ⛳
Step 0550 | Time 249s | Train Loss: 6.1070 | Val BPB: 2.6770 ⛳
Step 0600 | Time 271s | Train Loss: 6.1023 | Val BPB: 2.6680 ⛳
Step 0650 | Time 294s | Train Loss: 7.1158 | Val BPB: 2.6698 ⛳
Step 0700 | Time 316s | Train Loss: 6.1919 | Val BPB: 2.6919 ⛳
Step 0750 | Time 338s | Train Loss: 6.2160 | Val BPB: 2.6585 ⛳
Step 0800 | Time 361s | Train Loss: 6.1988 | Val BPB: 2.6854 ⛳
Step 0850 | Time 383s | Train Loss: 6.2080 | Val BPB: 2.6751 ⛳
Step 0900 | Time 406s | Train Loss: 6.1793 | Val BPB: 2.6787 ⛳
Step 0950 | Time 428s | Train Loss: 6.1073 | Val BPB: 2.6438 ⛳
Step 1000 | Time 450s | Train Loss: 6.0260 | Val BPB: 2.6274 ⛳
Step 1050 | Time 473s | Train Loss: 6.0984 | Val BPB: 2.6307 ⛳
Step 1100 | Time 495s | Train Loss: 6.1011 | Val BPB: 2.6244 ⛳
Step 1150 | Time 518s | Train Loss: 5.9497 | Val BPB: 2.5936 ⛳
Step 1200 | Time 540s | Train Loss: 6.0033 | Val BPB: 2.6363 ⛳
Step 1250 | Time 562s | Train Loss: 5.8162 | Val BPB: 2.5498 ⛳

⏰ 10-Minute training time budget exhausted. Validating final model...
FINAL RESULT | Val BPB: 2.6034 🏆