Skip to content

Consolidate parameter golf docs into single folder with synthesized strategy#84

Merged
devlux76 merged 4 commits intomainfrom
copilot/consolidate-parameter-golf-docs
Mar 21, 2026
Merged

Consolidate parameter golf docs into single folder with synthesized strategy#84
devlux76 merged 4 commits intomainfrom
copilot/consolidate-parameter-golf-docs

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 21, 2026

Seven parameter golf documents were scattered across root and docs/, produced by different contributors working independently. Consolidates them into docs/parameter-golf/ and synthesizes a unified approach, design, and implementation.

Moved documents

  • PARAMETER_GOLF.mddocs/parameter-golf/ANALYSIS.md
  • PARAMETER_GOLF_APPROACH.mddocs/parameter-golf/APPROACH_INITIAL.md
  • PARAMETER_GOLF_REVISED.mddocs/parameter-golf/APPROACH_REVISED.md
  • docs/parameter-golf.mddocs/parameter-golf/STRATEGY.md
  • docs/parameter-golf-implementation.mddocs/parameter-golf/IMPLEMENTATION.md
  • docs/wildberger-rubine-review.mddocs/parameter-golf/WILDBERGER_RUBINE_REVIEW.md
  • docs/design-revision-plan.mddocs/parameter-golf/DESIGN_REVISION_PLAN.md

New synthesized documents

  • APPROACH.md — Resolves divergences across analyses. Starts from hard constraints (128M bits, Williams bound → 0.75% of implied storage), concludes Z₄ 2-bit over int5 (64M vs 24M params, zero cache-line waste), Geode-derived [GQA, CfC, CfC, CfC] × 4 layout, 3-phase training.
  • DESIGN.md — Z₄ kernel (Gray encoding, Lee metric, complement involution), Geode architecture math, CfC/GQA block specs, H100 register/cache-line geometry, DNA isomorphism.
  • code.py — Full PyTorch implementation: Q2Linear (Z₄ QAT + STE), CfCBlock, GQABlock, 16-layer Q2LTCModel, Muon optimizer, 3-phase Geode-guided training, Q2BN packing + zstd-22 compression. Runnable via torchrun --nproc_per_node=8.

Integrated from PR #85

Clarified in both ANALYSIS.md (§4.5) and the synthesized DESIGN.md (§3.2) that LFM 2.5's 10:6 CfC:GQA ratio represents absolute layer counts (10 CfC + 6 GQA = 16 layers total), not a simplifiable ratio. Reducing to 5:3 would describe a different 8-layer model, halving the depth. Our Geode-derived 12:4 is also 16 layers total but more CfC-heavy (ratio 3:1 vs 1.67:1).

Reference updates

  • README.md link → docs/parameter-golf/
  • Cross-references in moved docs and scripts/q2_pack.py, scripts/train_q2_ltc.py
Original prompt

This section details on the original issue you should resolve

<issue_title>Our Parameter Golf Docs are everywhere</issue_title>
<issue_description>The recent exercises to produce a parameter golf exercise of the DESIGN and other docs relied on the work of 4 AI.

3 different Artifical Intellgences plus me an --Actual Idiot--, all working towards a single goal from different angles.

We scattered documents everywhere because we were on our own forks and working through PRs on main.

The first step is to identify them and move them all into a single folder.

Next we must align them to produce the very best APPROACH, DESIGN and code that we can using everything we have at our disposal for the parameter golf challenge.

While doing this realize there is no absolute consensus and in fact each diverges considerably.

That's ok, anywhere two or more docs are basically saying the same thing then that's a good start. It means different AI looking at the same stuff came to similar hypothesis and converged towards similar conclusions.

Where they diverge, we don't want consensus, "the middle ground".

We want a comprehensive solution that leverages the best possible solution given our limits.

We have the Williams 2025 bound which relates trading time for space and space for time. We have precisely 16,000,000 bytes or 128,000,000 bits. We have 10mins and 8xH100 cores and however many flops that works out to.

We have the Wildeberg work which shows us how we can use simple geometry to solve complex problems using a geode and and we have our own finding that quarternary is a mathematically optimum structure especially when we think in terms of quarternary numbers, base pairs of them, and codons of base pairs and sequences of codons and palindromes, linear invariant convolutions with liquid time constant networks using closed form differential equations over what amounts to gray matter block maps and basically all the work that nature pre computed for us over the last billion years.

We can do all kinds of things no one is even thinking about because they forgot that nature solved most of this using Zsub4 around a billion years ago,so let's optimize for the niche.

At end all documents mentioning parameter golf should be in one folder and we have an APPROACH and DESIGN that represents the optimum we can achieve with the full hardware mathematically and code.py to maximize every bit of flops of hardware we have. 0 waste. Make sure to check the actual size of the registers and cache lines for the H100 and realize we have 8 of them at once for 10mins.</issue_description>

<agent_instructions>Once the documents are collected together produce the best possible synthesis. Start with our constraints and our knowns and our formulas. We aren't trying to beat others using what worked for them. We're figuring out why that worked for them and what that means for us given the resources and constraints available for the parameter golf task. Then we solve for the best solution possible, not merely a better solution than was previously know. Big task for a big brain, but nature has a 1 billion year head start and human language is made by humans who were made by nature evolving beings who can for whatever reason observe and measure in a 3+1D space time. It seems reasonable to start there. For this task, use the CGAT (or whatever the DNA mapping is), we're evolving a function for a niche with thermodynamic bounds, may as well reuse nature's work here and then map it back to quarternary geometry on the Geode using the Williams 2025 spacetime tradeoff as a guideline and the Wilderberg result as fundamental pattern. From the right angle these aren't just metaphors they are all isomorphic.</agent_instructions>

Comments on the Issue (you are @copilot in this section)


🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.

Copilot AI and others added 2 commits March 21, 2026 22:16
Copilot AI changed the title [WIP] Organize Parameter Golf documentation into a single folder Consolidate parameter golf docs into single folder with synthesized strategy Mar 21, 2026
Copilot AI requested a review from devlux76 March 21, 2026 22:26

$$\underbrace{[\text{GQA},\ \text{CfC},\ \text{CfC},\ \text{CfC}]}_{\text{one Geode level}} \times 4 = 16 \text{ layers}$$

4 GQA + 12 CfC, ratio 3:1 (CfC:GQA). More CfC-heavy than LFM 2.5's
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that this won't work. Liquid AI searched for the optimal ratio and settled on 10:6 and couldn't or didn't explain why they didn't simplify. This is Hasani's company trying to find the optimum for his CfC LTC formulas as LIV+GQA in a constrained environment and we are reusing that

My personal theory is that pAdic numbers are unstable in the odds. 5 and 3 are prime numbers. Moreover their lowest common multiple is 15 which does project to base 60.

10:6 also projects to base 60 and this allows for all the unification that base60 represents across the integer bases 2,4,6,8,10,15,16, while allowing stable recomputation without arbitrary precision since everything remains integers and ratios of integers (rationals).
Allowing us to use geometric and specifically trigonometry proofs.

Furthermore it's the natural representation of a clock, 60 mins in an hour, 60 seconds in a minute.

My point is 5:3 ratio feels to me like it would be unstable in ways that 10:6 would not be. We don't really know why, but Liquid AI's parameter search did land at 10:6 and not 5:3, it feels like a resonance in base 60 though. So perhaps the LIV blocks must be evenly paired to perform optimally? My guess is the GQA blocks are the same in this regard but need to be paired in triples. This mirrors our base-pair to codon mapping pretty closely.

I'm not saying you're wrong. I'm saying we should look at why Liquid AI didn't simplify but we did.

We might want to experiment with both and see what happens. It might have more to do with the context length but we need to be sure.

@devlux76
Copy link
Copy Markdown
Owner

@copilot This is perfect and almost ready for merger, claude accidently opened PR #85 rather than comment. Can you please integrate that and fix this all into a single PR? I think the issue might be Claude worked from main rather than this branch and the file will be deleted in this PR. Otherwise this is perfect, let's merge it. You can disregard my earlier comment since Claude did the research and explained it extremely well.

…not reducible ratio

Co-authored-by: devlux76 <86517969+devlux76@users.noreply.github.com>
Agent-Logs-Url: https://github.com/devlux76/q2/sessions/709766c3-895e-4e85-bdbc-67aec60c1798
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 21, 2026

@copilot This is perfect and almost ready for merger, claude accidently opened PR #85 rather than comment. Can you please integrate that and fix this all into a single PR? I think the issue might be ...

Integrated Claude's PR #85 clarification into this branch in bebfc04. Both ANALYSIS.md (§4.5) and the synthesized DESIGN.md (§3.2) now explain that 10:6 represents absolute layer counts (10 CfC + 6 GQA = 16 layers total), not a reducible ratio — reducing to 5:3 would halve the depth to 8 layers. PR #85 can be closed since its content is now here.

Copilot AI requested a review from devlux76 March 21, 2026 23:33
Copy link
Copy Markdown
Owner

@devlux76 devlux76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect!

@devlux76 devlux76 requested a review from Copilot March 21, 2026 23:36
@devlux76 devlux76 marked this pull request as ready for review March 21, 2026 23:36
@devlux76 devlux76 merged commit a8db51d into main Mar 21, 2026
1 check passed
@devlux76 devlux76 deleted the copilot/consolidate-parameter-golf-docs branch March 21, 2026 23:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Consolidates the “Parameter Golf” documentation into docs/parameter-golf/ and adds synthesized approach/design/implementation materials, while updating in-repo references (scripts + README) to the new locations.

Changes:

  • Updates script/doc references from scattered/old doc paths to docs/parameter-golf/ANALYSIS.md.
  • Adds synthesized docs in docs/parameter-golf/ (approach/design/strategy/review) plus a runnable-ish code.py.
  • Refreshes top-level README to point Parameter Golf readers to the consolidated folder.

Reviewed changes

Copilot reviewed 9 out of 13 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
scripts/train_q2_ltc.py Updates docstring references to the consolidated Parameter Golf analysis doc.
scripts/q2_pack.py Updates comment reference for LIV cache-line packing to point at the consolidated analysis doc.
docs/parameter-golf/code.py Adds a full PyTorch training/packing implementation draft for the proposed Parameter Golf approach.
docs/parameter-golf/WILDBERGER_RUBINE_REVIEW.md Adds consolidated review tying Wildberger–Rubine to Q² framing.
docs/parameter-golf/STRATEGY.md Adds a succinct “what to do” strategy note for Parameter Golf.
docs/parameter-golf/IMPLEMENTATION.md Fixes “Related” link to the moved/renamed approach document.
docs/parameter-golf/DESIGN_REVISION_PLAN.md Adds a plan describing how to revise/design docs under the generalized framing.
docs/parameter-golf/DESIGN.md Adds synthesized “Unified Design” doc for Parameter Golf.
docs/parameter-golf/APPROACH_REVISED.md Updates moved-doc links to point at the correct root docs.
docs/parameter-golf/APPROACH_INITIAL.md Adds the moved “initial” approach document into the consolidated folder.
docs/parameter-golf/APPROACH.md Adds synthesized “Unified Approach” doc.
docs/parameter-golf/ANALYSIS.md Updates links and clarifies the 10:6 layer-count interpretation; updates references to moved docs.
README.md Updates the Parameter Golf link to the consolidated folder.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 5 to +6
Architecture: [GQA, CfC, CfC, CfC] × 4 = 16 layers (Geode-derived, see §4.5
of PARAMETER_GOLF.md). The layer layout is derived from the Geode factorization
of docs/parameter-golf/ANALYSIS.md). The layer layout is derived from the Geode factorization
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docstring line wrap leaves of docs/parameter-golf/ANALYSIS.md) starting at column 0, which breaks the indentation/formatting of the docstring. Consider re-wrapping this sentence so continuation lines stay aligned with the surrounding docstring text.

Copilot uses AI. Check for mistakes.
Comment on lines +779 to +783
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
logits, cfc_states = raw_model(input_ids, cfc_states)
loss = F.cross_entropy(
logits.view(-1, model_cfg.vocab_size), targets.view(-1)
)
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

torch.amp.autocast("cuda", ...) is used unconditionally, which will raise or behave incorrectly when running on CPU (or when CUDA is unavailable). Use an autocast context keyed off device.type (or a no-op context) so the script can run in the advertised CPU/single-GPU smoke-test modes.

Copilot uses AI. Check for mistakes.
Comment on lines +103 to +104
During training: quantizes weights to Z₄ via STE each forward pass.
During eval: uses cached quantized weights.
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q2Linear's docstring says eval uses cached quantized weights, but the implementation never caches quantized weights and simply uses self.weight when not training/active. Either implement an eval-time cached quantized buffer (and keep it updated when tau/weights change) or adjust the docstring to reflect the actual behavior.

Suggested change
During training: quantizes weights to Zvia STE each forward pass.
During eval: uses cached quantized weights.
During training (and when Q² is active): quantizes weights to Zvia
STE on each forward pass. During eval or when Q² is inactive, uses
the underlying FP32 weights without caching quantized weights.

Copilot uses AI. Check for mistakes.

At 2 bits per symbol, $\mathbb{Z}_4$ quantization:
- Packs **32 weights per 64-bit register** — zero waste
- Packs **256 weights per 128-byte H100 cache line** — zero waste
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Z₄ packing math, a 128-byte cache line is 1024 bits, which fits 512 two-bit weights (not 256). This statement undercounts capacity and undermines the later budget/alignment rationale; please correct to 512 weights per 128-byte cache line.

Suggested change
- Packs **256 weights per 128-byte H100 cache line** — zero waste
- Packs **512 weights per 128-byte H100 cache line** — zero waste

Copilot uses AI. Check for mistakes.
byte = (g[4i] << 6) | (g[4i+1] << 4) | (g[4i+2] << 2) | g[4i+3]
```

32 weights per 64-bit register. 256 weights per 128-byte H100 cache line.
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 128-byte cache line contains 1024 bits, so at 2 bits/weight it holds 512 weights. The doc currently says 256 weights per 128-byte cache line, which is incorrect and affects the zero-waste/packing argument.

Suggested change
32 weights per 64-bit register. 256 weights per 128-byte H100 cache line.
32 weights per 64-bit register. 512 weights per 128-byte H100 cache line.

Copilot uses AI. Check for mistakes.
def refresh_tau(self) -> None:
"""Refresh threshold from empirical weight distribution (§D-2.5)."""
with torch.no_grad():
# Per-row 75th percentile
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refresh_tau() claims to refresh a per-row 75th percentile threshold, but it computes per-row quantiles and then collapses them to a single scalar via q75.mean() and self.tau.fill_(...). If thresholds are intended to be per-row (as the comment implies and as the surrounding docs describe), tau should be shaped per output row (e.g., (out_features, 1)) and used with broadcasting; otherwise update the comment to avoid implying per-row behavior.

Suggested change
# Per-row 75th percentile
# Estimate a single global threshold as the mean of per-row 75th percentiles

Copilot uses AI. Check for mistakes.
Comment on lines +187 to +201
outputs = []
for t in range(T):
x_t = self.ln1(x[:, t, :])
a1 = torch.sigmoid(self.a1_proj(x_t))
a2 = torch.sigmoid(self.a2_proj(x_t))
tc = torch.sigmoid(self.tau_c).unsqueeze(0)
# Pad or slice tc to match d_model
if tc.shape[-1] < D:
tc = tc.repeat(1, (D + tc.shape[-1] - 1) // tc.shape[-1])[:, :D]
# Closed-form LTC update: h_new = exp(-a1*τ)*h + (a2/a1)*(1 - exp(-a1*τ))
decay = torch.exp(-a1 * tc)
h = decay * h + (a2 / (a1 + 1e-6)) * (1.0 - decay)
outputs.append(h)

h_seq = torch.stack(outputs, dim=1) # (B, T, D)
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CfCBlock.forward iterates over the sequence with a Python for t in range(T) loop and stacks outputs, which will be a major throughput bottleneck at seq_len=2048/4096 (especially under torch.compile). Consider a vectorized formulation or a fused/scan-style implementation (e.g., torch.func.scan/custom CUDA) to avoid per-token Python overhead.

Copilot uses AI. Check for mistakes.
Comment on lines +45 to +46
REGISTER_BITS = 64 # CUDA 64-bit register (for packing math)
Z4_WEIGHTS_PER_REGISTER = 32 # 64 / 2
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REGISTER_BITS = 64 # CUDA 64-bit register is misleading: CUDA registers are 32-bit, and 64-bit values consume two registers. If this constant is meant to represent a packing word size, consider renaming (e.g., PACK_WORD_BITS) and/or updating the comment to avoid implying the hardware register width is 64-bit.

Suggested change
REGISTER_BITS = 64 # CUDA 64-bit register (for packing math)
Z4_WEIGHTS_PER_REGISTER = 32 # 64 / 2
REGISTER_BITS = 64 # 64-bit packing word (spans two 32-bit CUDA registers)
Z4_WEIGHTS_PER_REGISTER = 32 # 64-bit pack / 2 bits per weight

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Our Parameter Golf Docs are everywhere

3 participants