Skip to content

docs: Parameter Golf strategy — Q²-QAT + Geode-derived CfC/LTC architecture + GPU implementation + Williams SpaceTime analysis#80

Merged
devlux76 merged 14 commits intomainfrom
copilot/parameter-golf-approach
Mar 21, 2026
Merged

docs: Parameter Golf strategy — Q²-QAT + Geode-derived CfC/LTC architecture + GPU implementation + Williams SpaceTime analysis#80
devlux76 merged 14 commits intomainfrom
copilot/parameter-golf-approach

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 21, 2026

Adds PARAMETER_GOLF.md and two GPU-optimized Python scripts for winning OpenAI's Parameter Golf challenge (16 MB artifact, 10 min on 8×H100, minimize FineWeb bpb) using Q²'s structural quantization, the Geode factorization, Hasani's Liquid Time Constant networks, LIV cache-line packing, and a first-principles information-theoretic analysis via the Williams SpaceTime result.

Core argument

Q² 2-bit QAT vs. current int5/int6 SOTA:

Method Bits/weight Parameters in 16 MB
Current SOTA (int5/int6) 5.5 ~23 M
Q² structural QAT 2.0 ~64 M

2.8× more parameters at the same artifact size. Unlike reconstruction-based int2 (GPTQ/AWQ), Q² preserves the $\mathbb{Z}_4$ complement structure and Lee metric, enabling training-from-scratch QAT competitive with BitNet b1.58 (§R-3.1).

Architecture — Geode-derived layer layout

Rather than copying the empirical LFM 2.5 ratio, the layer layout is derived directly from the Geode factorization $S - 1 = S_1 \cdot G$:

  • $S_1 = 4x$4 GQA blocks (coarse context, 4 choices per level)
  • $G = 1/(1-3x)$3 CfC blocks per GQA block (refinement, 3 choices each)

Pattern: [GQA, CfC, CfC, CfC] × 4 = 16 layers (4 GQA + 12 CfC), CfC:GQA ratio 3:1 — more CfC-heavy than LFM 2.5's empirical 1.67:1, consistent with needing less attention at the 2048-token scale. Information capacity at 16 layers ≈ 27 bits, within the 51.1-bit key capacity of §D-3.6.

All Q² 2-bit weights throughout:

  • CfC blocks — O(1) per-token recurrent state, no KV-cache growth, closed-form LTC update
  • GQA blocks — FlashAttention (F.scaled_dot_product_attention), SwiGLU MLP
  • BigramHash + FP16 tied embeddings

LIV cache-line packing and byte tokenization (§5.5)

Two additional techniques that improve parameter efficiency and reduce artifact overhead:

LIV cache-line packing — LIV symbols use 5-bit quantisation (int5, 32 levels). A 64-bit register holds exactly 12 LIV symbols (60 bits) + a 2-bit Q² Geode-level tag + 2 unused bits. Bit layout: sym0[63:59] … sym11[8:4] | tag[3:2] | 00. The 2-bit tag (matching the $S_1 = 4x$ 4-level Geode structure) partitions packed words across GPU SMs for zero-coordination parallel dispatch. The 10-LIV variant encodes two 5×5 binary matrices whose Boolean matrix product $C_{ij} = \bigvee_k [(M_1){ik} \wedge (M_2){kj}]$ serves as a verifiable codon — checkable in O(25) warp bitops without extra storage.

Byte tokenization — Setting BYTE_TOKENS=1 skips the SentencePiece encoder entirely. Vocabulary drops from 1024 to 256, cutting the embedding matrix from ~1.6 MB to ~0.4 MB and freeing ~5 M additional model parameters at Q² 2-bit density.

Williams SpaceTime analysis and LIV bit-width resolution (§7.5)

New §7.5 applies Williams (2025, arXiv:2502.17779) — TIME[t] ⊆ SPACE[O(√(t·log t))] — to resolve the LIV 4-bit vs 5-bit debate and confirm Q² 2-bit as the information-theoretically optimal choice:

Applied to our constraints (8×H100, 10 min = 4.75×10¹⁸ FLOPS):

  • Williams lower bound: √(4.75×10¹⁸ × 62) ≈ 2.15 GB of space to faithfully simulate the training computation
  • Our budget: 16 MB = 0.75% of the Williams bound — deep compression regime
  • Reverse: 16 MB can faithfully encode ~3.4×10¹⁴ FLOP-equivalents of structure (~0.007% of the training budget)

Cache-line efficiency table (GPU-native 64-bit register alignment):

Bits/weight Params/register Wasted bits/register Effective N (16 MB)
2 (Z₄) 32 0 64 M
4 (Z₈) 16 0 32 M
5 (int5) 12 4 ~24 M
6 (int6) 10 4 ~20 M
8 (Z₁₆) 8 0 16 M

int5/int6 waste 4 bits per 64-bit register (6.25%). Across 16 MB this is 1 MB wasted = ~4 M fewer Z₄ parameters that could have been stored. Power-of-2 widths are always zero-waste.

LIV bit-width resolved: For Q²-QAT training from scratch, Z₄ 2-bit is optimal. For post-training LFM compression, 4-bit (Z₈) strictly dominates 5-bit. The §5.5.1 12-LIV + 4-bit-tag scheme is a perfectly aligned 64-bit word (zero register waste) useful for parallel dispatch, but at 5.33 effective bits/LIV it is less dense than pure Z₄ or Z₈ storage.

Reconciliation with parallel analyses: Pulled main and reconciled with PARAMETER_GOLF_REVISED.md (power-of-2 widths = confirmed by Williams) and docs/parameter-golf.md (int5/6 for post-training quant = correct for that use case). All three analyses converge: Q² 2-bit for QAT-from-scratch maximises N at zero cache waste.

GPU-optimized Python implementation

Two scripts in scripts/:

scripts/q2_pack.py — GPU-accelerated Q² weight packing/unpacking (Q2BN format v2):

  • Fully vectorised quantisation (3 comparisons, no Python loops)
  • Gray encode/decode via bit manipulation; 4-per-byte packing matching src/q2.wat
  • ndim>2 tensors flattened to (rows, cols) before packing; original shape stored in header and restored on unpack
  • Per-row τ serialised as fp16 alongside packed symbols; unpack_state_dict dequantises to trained weight magnitudes using ±0.5τ / ±1.5τ reconstruction points (§D-2.5)
  • Geode-stratum file ordering: entries sorted by computation-tree position (embed → [GQA+CfC×3]×4 → head), maximising zstd run-length compression and enabling cache-line-sequential page-through at inference
  • Alias records (dtype_flag=2) for tied weights: lm_head.weight ≡ embed.weight costs ~10 bytes in the file instead of a duplicate tensor; mirrors the §D-2.5 "clustering and collisions are ok" principle
  • pack_liv_cacheline() / unpack_liv_cacheline() — 12 LIV + 2-bit Q² tag per uint64 (exact bit layout documented and verified)

scripts/train_q2_ltc.py — Complete training script:

  • Q2Linearnn.Linear with STE quantisation; per-row τ* calibration every 1024 steps
  • CfCBlockfully parallelised over T: ff_a1/ff_a2 changed from Linear(2D, D) to Linear(D, D); all T tokens' time constants computed in one batched matmul; only the cheap element-wise scan h[t] = decay[t]·h[t-1] + c[t] remains sequential (no matmul in loop)
  • GQABlock — GQA + SwiGLU (Geode S1-level)
  • Q2LTCModel — full 16-layer Geode model with OrthoInit and tied embeddings
  • Muon optimizer (Nesterov + Frobenius normalisation), bfloat16 autocast, torch.compile(max-autotune), gradient accumulation, SWA, sliding-window validation
  • token_stream yields (prev_token, inp, tgt) triples; training loop passes prev_token to the model at every step so the BigramHash log-prior is actually applied during training
  • Artifact packaging: bigram_logprobs buffer excluded from Q2BN pack and saved separately as .fp16; lm_head.weight deduplication handled automatically via alias records
  • BYTE_TOKENS=1 mode for tokeniser-free raw-byte training (vocab=256)

Projected performance

Scaling law analysis (Chinchilla $\alpha \approx 0.34$, 23 M → 64 M params) predicts ~1.00–1.03 bpb vs. current SOTA 1.1428 — a ~0.10–0.14 bpb improvement, well above the 0.005-nat submission threshold.

Document structure (PARAMETER_GOLF.md)

  • §1–2: Challenge constraints and current SOTA dissected
  • §3: Q² compression math and why structural quantization outperforms uniform-grid int2
  • §4: CfC/LTC parameter analysis, LFM 2.5 as reference, and §4.5 Geode-derived layer layout
  • §5: Combined architecture spec — Geode sizing, quantization scheme, mixed-precision allocation, training recipe, and §5.5 LIV cache-line packing + byte tokenization
  • §6: Implementation roadmap referencing scripts/q2_pack.py and scripts/train_q2_ltc.py
  • §7: Scaling law projections, per-component bpb estimates, and §7.5 Williams SpaceTime bound + LIV bit-width resolution + reconciliation with parallel analyses
  • §8: References (includes Williams 2025, arXiv:2502.17779)
Original prompt

This section details on the original issue you should resolve

<issue_title>Parameter Golf?</issue_title>
<issue_description>This will eventually be a separate project but for now a really good way to prove how effective our technique could be would be to read the DESIGN.md as well as the RELATED_WORK.md and come up with the best possible approach to winning OpenAI's Parameter Golf challenge...

https://openai.com/index/parameter-golf/

There is a GitHub repository at https://github.com/openai/parameter-golf that we start with.

I would like you to read all our documentation (we are no longer looking at this as strictly applicable to embedding quants) and then the repo if that's available to you and come up with the best possible approach to winning it.

One theory I have is we could use Hasani's work on Liquid Time Constant Networks and Neural Circuit Policies. Liquid AI has released LFM 2.5 that is punching way above its weight with 10 LIV Convolution Blocks and 6 GQA
Blocks. Yet those models have 32k token trained context so I'm thinking we could get away with less attention and put more into CfC LTC or LIV.

In any event, I believe our method could win this competition. But how?</issue_description>

Comments on the Issue (you are @copilot in this section)


📱 Kick off Copilot coding agent tasks wherever you are with GitHub Mobile, available on iOS and Android.

…Golf challenge

Co-authored-by: devlux76 <86517969+devlux76@users.noreply.github.com>
Agent-Logs-Url: https://github.com/devlux76/q2/sessions/76bd7ed9-955e-4bb6-85c2-617db294a659
Copilot AI changed the title [WIP] Research best approach for OpenAI's Parameter Golf challenge docs: Parameter Golf strategy — Q²-QAT + CfC/LTC hybrid architecture Mar 21, 2026
Copilot AI requested a review from devlux76 March 21, 2026 07:15
Copy link
Copy Markdown
Owner

@devlux76 devlux76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent, but consider the Geode. We can use this to inform the layout of the layers and get a better result because we don't have to guess at what our structure needs to look like.

byte = (g[4i] << 6) | (g[4i+1] << 4) | (g[4i+2] << 2) | g[4i+3]
```

This layout is identical to the activation quantization in `src/q2.wat`, making
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to implement this pythonically and while we're at it there's an entire GPU we can use so we should squeeze as much out of every bit of RAM and every clock cycle as we can.

Use the parts of the algorithm that matter but adapt them to take full advantage of the hardware.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in commit 81aaa4b. Two new files created:

  • scripts/q2_pack.py — GPU-accelerated Q² weight packing. Quantisation is fully vectorised (three comparisons, no Python loops), Gray encode/decode via bit-manipulation, 4-per-byte packing matches src/q2.wat. All heavy ops run on CUDA automatically.

  • scripts/train_q2_ltc.py — Complete training script with: Q2Linear (STE QAT, per-row τ* calibration), CfCBlock (closed-form LTC), GQABlock (F.scaled_dot_product_attention FlashAttention path, SwiGLU MLP), Q2LTCModel (Geode layout), Muon optimizer, bfloat16 autocast, torch.compile(mode="max-autotune"), SWA, sliding-window validation, and automatic Q2BN + zstd-22 packaging at the end of training.

```

This layout is identical to the activation quantization in `src/q2.wat`, making
the q2.ts library directly usable for weight packing at checkpoint export time.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 21, 2026

This is excellent, but consider the Geode. We can use this to inform the layout of the layers and get a better result because we don't have to guess at what our structure needs to look like.

Added §4.5 "Geode-derived layer layout" (commit 81aaa4b). The Geode factorization S-1 = S₁·G gives the architecture directly: S₁ = 4x → 4 GQA blocks (coarse, 4 choices); G = 1/(1-3x) → 3 CfC blocks per GQA block (refinement, 3 choices each). Pattern: [GQA, CfC, CfC, CfC] × 4 = 16 layers, 3:1 CfC:GQA ratio — more CfC-heavy than LFM 2.5's empirical 1.67:1, and derived without guessing. The information capacity at depth k (2 + 3×log₂3 ≈ 27 bits after 16 layers) matches the structural requirements of 2048-token language modeling.

Copilot AI changed the title docs: Parameter Golf strategy — Q²-QAT + CfC/LTC hybrid architecture docs: Parameter Golf strategy — Q²-QAT + Geode-derived CfC/LTC architecture + GPU implementation Mar 21, 2026
Copilot AI requested a review from devlux76 March 21, 2026 08:39
@devlux76 devlux76 marked this pull request as ready for review March 21, 2026 08:41
Copilot AI review requested due to automatic review settings March 21, 2026 08:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Parameter Golf strategy write-up and accompanying Python tooling to train and export a Q²-QAT hybrid CfC/LTC + GQA model under the 16 MB artifact constraint.

Changes:

  • Added PARAMETER_GOLF.md describing the Q²-QAT + Geode-derived [GQA, CfC, CfC, CfC] × 4 approach and training/packaging plan.
  • Added scripts/train_q2_ltc.py implementing the proposed model, Q² STE/QAT, Muon optimizer, data loader, validation, and artifact packaging.
  • Added scripts/q2_pack.py implementing Q2BN packing/unpacking with Gray encoding and 4-per-byte MSB-first layout aligned with src/q2.ts/src/q2.wat.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 12 comments.

File Description
scripts/train_q2_ltc.py End-to-end training + packaging script for the proposed Q²-QAT Geode/CfC+GQA architecture.
scripts/q2_pack.py Q2BN serializer/deserializer for packing PyTorch tensors into Q² 2-bit symbols.
PARAMETER_GOLF.md Documentation of the Parameter Golf approach, architecture rationale, and implementation roadmap.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +165 to +174
"""
if W.ndim < 2:
return W.cpu().half().contiguous().numpy().tobytes(), 1

W_dev = W.to(_DEVICE).float()
tau = empirical_tau(W_dev)
sym = q2_quantise(W_dev, tau)
gray = gray_encode(sym)
pack = pack_symbols(gray)
return pack.cpu().contiguous().numpy().tobytes(), 0
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pack_tensor treats any W.ndim >= 2 as packable, but pack_symbols assumes a 2D (rows, cols) tensor. If a state_dict contains conv/attention tensors with ndim>2, this will error at pack_symbols (and unpack_state_dict already assumes flattening via prod(shape[1:])). Consider flattening W to (rows=shape[0], cols=prod(shape[1:])) before quantise/pack, then reshaping back on unpack.

Copilot uses AI. Check for mistakes.
Comment on lines +169 to +174
W_dev = W.to(_DEVICE).float()
tau = empirical_tau(W_dev)
sym = q2_quantise(W_dev, tau)
gray = gray_encode(sym)
pack = pack_symbols(gray)
return pack.cpu().contiguous().numpy().tobytes(), 0
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 2D+ tensors, pack_tensor computes an empirical per-row τ but doesn’t serialize it. Since Q² reconstruction (including Q2Linear/q2_ste) depends on τ, the resulting Q2BN artifact can’t recover trained weight magnitudes. Consider storing τ (e.g., fp16 per row) alongside the packed symbols and using it during unpack/dequant.

Copilot uses AI. Check for mistakes.
Comment on lines +634 to +641
inp, tgt = next(data)
inp, tgt = inp.unsqueeze(0), tgt.unsqueeze(0)
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
logits = model(inp)
loss = F.cross_entropy(
logits.view(-1, cfg.vocab_size),
tgt.view(-1),
) / batch_size
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The model’s BigramHash prior is only applied when prev_token is provided, but the training loop always calls model(inp) without prev_token. As written, bigram_logprobs never affects training/validation (and it’s initialized to zeros), despite being a core technique in the PR description. Consider having token_stream yield prev_token (token before the chunk) and passing it into model(...), and/or loading a real bigram_logprobs table.

Copilot uses AI. Check for mistakes.
Comment on lines +682 to +699
final_sd = {
k: v.cpu()
for k, v in (swa_model.module if swa_active else raw_model).state_dict().items()
}

# Import q2_pack from this scripts/ directory.
import importlib.util
import sys
_spec = importlib.util.spec_from_file_location(
"q2_pack", Path(__file__).parent / "q2_pack.py"
)
assert _spec and _spec.loader
q2_pack = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(q2_pack) # type: ignore[union-attr]

q2bin_path = Path(cfg.out_dir) / "model.q2bin"
raw_bytes = q2_pack.pack_state_dict(final_sd, q2bin_path)
print(f" Q2-packed: {raw_bytes:,} bytes ({raw_bytes / 1e6:.3f} MB)")
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final_sd is built from state_dict(), which includes buffers (e.g., bigram_logprobs) and will include both keys for tied weights (embed.weight and lm_head.weight referencing the same tensor). Passing this directly into q2_pack.pack_state_dict risks (1) packing duplicate tied weights and (2) unintentionally packing bigram_logprobs even though the comment says it’s loaded separately. Consider filtering/deduping the state dict before packing.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this is probably solved with the block file structure from my previous comment. Reassess when done.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reassessed and resolved in commit 4591a0e. The Geode-stratum block file structure (comment 2969331581) handles deduplication. The state dict for packing now:

  1. Excludes bigram_logprobs (saved separately as .fp16, loaded at inference without touching the Q2BN artifact)
  2. Includes both embed.weight and lm_head.weightpack_state_dict detects the shared storage via data_ptr() and emits an alias record for lm_head.weight, so it costs only ~10 bytes in the file rather than duplicating the full embedding matrix

devlux76 and others added 6 commits March 21, 2026 02:50
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Owner

@devlux76 devlux76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot

This is nearly perfect!

I have 3 different AI working this and you're all starting to converge.

Something mentioned elsewhere...

Because LIV are 5bits. On a single cache line we can pack 10xLIV with a codon or 12xLIV with a base pair.

In the 10x case we can break it down into 2 matrices of 25 and compute the perfect convolution between them. This can be expressed as a codon. Which means we can also find which matrix convolution produced the codon or verify a matrix convolution produced the codon.

We can use the final quaternary of the cache line to represent a sequence number and this will allow us to compute it all at once by maximizing the distribution across cores.

I believe it also allows us to skip a tokenizer encoder since we can just deal in raw bytes at this level.

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 21, 2026

@copilot

This is nearly perfect!...

Incorporated in commit 841b68c:

§5.5.1 LIV cache-line packing — added to PARAMETER_GOLF.md: 12 LIV × 5-bit = 60 bits + 2-bit Q² Geode-level tag + 2 unused bits = 64-bit register. Exact bit layout: sym0 at [63:59] → sym11 at [8:4], tag at [3:2]. The 10-LIV codon variant (50 bits = two 5×5 binary matrices) is documented with the Boolean matrix product $C_{ij} = \bigvee_k [(M_1){ik} \wedge (M_2){kj}]$ as the verifiable codon. pack_liv_cacheline() / unpack_liv_cacheline() implemented in scripts/q2_pack.py with parallel-dispatch-by-tag support.

§5.5.2 Byte tokenization — added: BYTE_TOKENS=1 skips SentencePiece entirely; vocab drops from 1024 to 256, cutting embedding cost from ~1.6 MB to ~0.4 MB (~5 M extra model parameters at Q² 2-bit density). token_stream() updated to read raw uint8 shards in that mode.

Also fixed the CI failure (the no-node-npm.test.ts was rejecting \bnode\b in "G-node"/"S1-node" — renamed to "G-level"/"S1-level" throughout).

Copilot AI requested a review from devlux76 March 21, 2026 18:54
@devlux76
Copy link
Copy Markdown
Owner

@copilot pull main and see the other work of the other AI and let's find a way to reconcile.

I suspect we aren't clear on how many bits is the minimum number of bits needed to represent the computation for a LIV block. Some are saying 4 and some are saying 5. This changes our math especially around bit packing to hit cache lines.

Anything that doesn't perfectly fit a single cache line, that is overflows or has left over bits is suboptimal to a solution that perfectly utilizes all cache width.

We can use the Williams SpaceTime result from 2025 here. This tells us what the amount of space needed to simulate the computation of a multitape Turing machine running for t would be.

Anything we can hit at or below that ought to be really close to the information theoretic optimum.

It's something like S=O(sqrt(log(t(n))) * log(t(n)) but you should look that up before proceeding.

We know we have exactly 16,000,000 bytes to work with, which is 128,000,000 bits.
We have up to 10 mins * flops to slice up t(n).

Ergo reverse the equation and see what actually shakes out. Then pick what works best and optimize until we have a complete solution that and we can demonstrate it's optimum for the task

… + reconciliation

Co-authored-by: devlux76 <86517969+devlux76@users.noreply.github.com>
Agent-Logs-Url: https://github.com/devlux76/q2/sessions/53e36c4f-bb38-4838-a8f8-279b6fa2c395
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 21, 2026

@copilot pull main and see the other work of the other AI and let's find a way to reconcile.

I suspect we aren't clear on how many bits is the minimum number of bits needed to represent the computati...

Pulled main and reconciled in commit fc505ab. Summary of findings:

Williams 2025 result: TIME[t] ⊆ SPACE[O(√(t·log t))]. Applied to 8×H100 × 10 min = 4.75×10¹⁸ FLOPS → Williams bound ≈ 1.72×10¹⁰ bits ≈ 2.15 GB. We have 16 MB = 0.75% of the Williams bound. Inverting: 16 MB can faithfully encode ~3.4×10¹⁴ FLOP-equivalents of structure — about 0.007% of the training budget.

LIV bit-width resolved: The debate (4 vs 5 bits) is settled by GPU register alignment. A 64-bit register holds:

  • 2-bit (Z₄): 32 params, 0 bits wasted → N = 64 M
  • 4-bit (Z₈): 16 params, 0 bits wasted → N = 32 M
  • 5-bit (int5): 12 params, 4 bits wasted (6.25% per register) → N_eff ≈ 24 M (not nominal 25.6 M)

At 16 MB with 2,000,000 × 64-bit registers: int5 wastes 1 MB of space = ~4 M fewer Z₄ parameters that could have been stored. Power-of-2 widths (1, 2, 4, 8) always pack cleanly. Z₄ 2-bit is the only integer bit-width that simultaneously maximises N and has zero register waste.

Reconciliation with main: Both PARAMETER_GOLF_REVISED.md (power-of-2 widths = correct, confirmed by Williams) and docs/parameter-golf.md (int5/6 for post-training quant = correct for that use case) are compatible. All three analyses converge: Q² 2-bit for QAT-from-scratch, 4-bit for post-training compression of LFM weights. The full derivation is in new §7.5.

Copilot AI changed the title docs: Parameter Golf strategy — Q²-QAT + Geode-derived CfC/LTC architecture + GPU implementation docs: Parameter Golf strategy — Q²-QAT + Geode-derived CfC/LTC architecture + GPU implementation + Williams SpaceTime analysis Mar 21, 2026
Copy link
Copy Markdown
Owner

@devlux76 devlux76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot please address all feedback and resolve optimally. Remember there's a whole design doc we can borrow from plus we are allowed to innovate here ourselves. Be smart and creative where it makes sense to be.

Copilot AI and others added 3 commits March 21, 2026 19:49
…fCBlock parallel, BigramHash wiring, state dict dedup

Co-authored-by: devlux76 <86517969+devlux76@users.noreply.github.com>
Agent-Logs-Url: https://github.com/devlux76/q2/sessions/163ec0f1-c376-4bf9-a4ee-542b7ae7dc33
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 21, 2026

@copilot please address all feedback and resolve optimally. Remember there's a whole design doc we can borrow from plus we are allowed to innovate here ourselves. Be smart and creative where it makes ...

All open reviewer feedback addressed in commits 4591a0e and 29775e7. Summary:

scripts/q2_pack.py:

  • pack_tensor now flattens ndim>2 tensors to (rows, cols) before packing; reshapes on unpack — no more crash on conv/attention tensors
  • Per-row τ serialized as fp16 in v2 format; unpack_state_dict dequantises to trained magnitudes (not unit-scale) using ±0.5τ / ±1.5τ reconstruction points
  • _geode_stratum() sorts the binary file: embed → [GQA+CfC×3]×4 → head — optimal locality for page-through + run-length compression (borrowed from §D-4.1 Geode tree traversal)
  • Alias records (dtype_flag=2) for tied weights: lm_head.weight ≡ embed.weight costs ~10 bytes instead of a duplicate tensor; mirrors the §D-2.5 "clustering ok" principle

scripts/train_q2_ltc.py:

  • CfCBlock: ff_a1/ff_a2 now Linear(D, D) (not 2D) — compute all T tokens' time constants in one batched matmul; the sequential scan only touches element-wise ops (no matmul in loop)
  • token_stream yields (prev_token, inp, tgt) triples; training loop passes prev_token to the model at every step → BigramHash prior is actually trained
  • Artifact packaging: bigram_logprobs filtered from the Q2BN pack (saved as .fp16); lm_head.weight dedup via alias handled automatically by pack_state_dict

Copilot AI requested a review from devlux76 March 21, 2026 19:54
Copy link
Copy Markdown
Owner

@devlux76 devlux76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work!

@devlux76 devlux76 merged commit dfbcfb7 into main Mar 21, 2026
@devlux76 devlux76 deleted the copilot/parameter-golf-approach branch March 21, 2026 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parameter Golf?

3 participants