Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
⛳ 16MB Track Submission: Radial-BitNet (15.6M Parameters)
This PR submits an official entry for the 10-Minute / 16MB Parameter Golf track, achieving a final validation of 2.6034 BPB heavily constrained by a pure ternary-weight architecture.
The approach aggressively deviates from standard FP16 LLM baselines to maximize parameter volume per megabyte, utilizing my proprietary architectural designs (Radial Encoding and FRO) to push performance beyond conventional limits.
🧠 Proprietary Architecture & Innovations$\phi$ (Golden Ratio) and directly injects this geometric signal into the token embedding, freeing up massive parameter space.
Weight-Only BitNet (W1.58b / A16b): Every linear projection inside the Attention and MLP layers relies on strict ternary matrices (-1, 0, 1) internally explicitly during the forward pass. This allows my 15.6M parameter model to confidently scale capacity while compressing losslessly under Zstandard down to just 12.55 MB, clearing the 16MB limit by over 3.4 Megabytes.
Radial Positional Bypass (Proprietary Design): The architecture features exactly Zero learned sequential positional embeddings. Instead, I integrated my custom
RadialEncoding
algorithm. It mathematically computes Euler-based spatial frequencies over
Fractal Resonant Optimization - FRO (Proprietary Optimizer): To handle the shattered gradient momentum of step-quantized ternary weights within the strict 10-minute constraint, I am introducing my custom built-in optimizer: FRO. It enforces extreme early convergence through multi-scale resonance alignment, dramatically outperforming AdamW in this constrained environment.
⚙️ Hyperparameters & Configuration
Parameters: 15,600,000 (~12.55MB Compressed)
Shape: 12 Layers | 384 Model Dimension | 6 Q-Heads | 2 KV-Heads
Environment Execution: Kaggle Dual T4 GPUs (Falling back dynamically to native float16 to bypass Turing-architecture bfloat16 emulation bloating).
Train Sequence Length: 1024
Batch Size: 4 (Scale-Invariant deep accumulation logic)
📉 Final 10-Minute Convergence Log Snippet
text
Step 1100 | Time 495s | Train Loss: 6.1011 | Val BPB: 2.6244 ⛳
Step 1150 | Time 518s | Train Loss: 5.9497 | Val BPB: 2.5936 ⛳
Step 1200 | Time 540s | Train Loss: 6.0033 | Val BPB: 2.6363 ⛳
Step 1250 | Time 562s | Train Loss: 5.8162 | Val BPB: 2.5498 ⛳
⏰ 10-Minute training time budget exhausted. Validating final model...
FINAL RESULT | Val BPB: 2.6034 🏆