
Product keys in FF#163

Open
wojsza05 wants to merge 8 commits into main from bgw/pk

Conversation


@wojsza05 (Contributor) commented Feb 4, 2026

No description provided.

seq_len,
rope_base,
rope_scale_freqs: bool,
causal=True,
Member

We do not add default values here; move it to defaults.yaml.

Contributor Author

I don't see a file with default values for class arguments. Should I add the argument to the 17 configuration files where the RoPEAttention class is used?

Contributor Author

However, I've removed this argument completely. I added it earlier because I thought it might be useful to someone in the future, but we don't use this class at the moment, so I won't bother.

@crewtool (Member) left a comment

As in the inline comment above.

)

# Weighted sum of retrieved values
out_heads = (values_selected * attn_weights.unsqueeze(-1)).sum(dim=2)
Contributor

You can replace this with a single matrix multiplication. That way no intermediate array is created and memory usage is reduced.
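For instance, a minimal sketch of this replacement, with made-up shapes (the real tensors come from the module in the diff), checking the batched matmul against the broadcasted product:

```python
import torch

# Illustrative shapes only; names mirror the snippet above.
B, H, K, D = 2, 4, 8, 16
attn_weights = torch.randn(B, H, K).softmax(dim=-1)  # (B, H, K)
values_selected = torch.randn(B, H, K, D)            # (B, H, K, D)

# Original: the elementwise product materializes a (B, H, K, D) intermediate.
out_ref = (values_selected * attn_weights.unsqueeze(-1)).sum(dim=2)

# Batched matmul: (B, H, 1, K) @ (B, H, K, D) -> (B, H, 1, D),
# reducing over K without the broadcasted product tensor.
out_heads = torch.matmul(attn_weights.unsqueeze(2), values_selected).squeeze(2)

assert torch.allclose(out_heads, out_ref, atol=1e-5)
```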

Comment on lines +404 to +408
out_heads = (values_selected * attn_weights.unsqueeze(-1)).sum(dim=2)

# 6. Aggregation
# Sum outputs across all heads
output = out_heads.sum(dim=1) # (BS, d_model)
Contributor

I believe you can fuse these operations as well, reducing memory usage even further.

Something like this should work:
output = torch.einsum('b h k, b h k d -> b d', attn_weights, values_selected)
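A quick sanity check that the fused einsum matches the two-step version from the diff, using illustrative shapes (B, H, K, D are made up for the example):

```python
import torch

# Illustrative shapes; the real tensors come from the attention module.
B, H, K, D = 2, 4, 8, 16
attn_weights = torch.randn(B, H, K).softmax(dim=-1)  # (B, H, K)
values_selected = torch.randn(B, H, K, D)            # (B, H, K, D)

# Two-step version from the diff: weighted sum over K, then sum over heads.
out_heads = (values_selected * attn_weights.unsqueeze(-1)).sum(dim=2)
output_ref = out_heads.sum(dim=1)  # (B, D)

# Fused einsum: contracts over h and k in one call, without
# materializing the (B, H, K, D) elementwise product.
output = torch.einsum('b h k, b h k d -> b d', attn_weights, values_selected)

assert torch.allclose(output, output_ref, atol=1e-5)
```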

Contributor

I think that torch.nn.functional.embedding_bag could be used here as well – that way the whole values matrix won't be instantiated.

# Flatten indices and weights to (Batch * Heads, K_neighbors)
# We treat (Batch * Seq * Heads) as the "bag" dimension
input_flat = memory_indices.view(-1, self.k)
weights_flat = attn_weights.view(-1, self.k)

# Fused Lookup + Weighted Sum
# Output shape: (BS * Seq * Heads, D)
# This avoids creating the (BS, H, K, D) tensor entirely
out_flat = F.embedding_bag(
    input_flat, 
    self.values.weight, 
    per_sample_weights=weights_flat, 
    mode='sum'
)

# Reshape and sum over heads
out_flat = out_flat.view(bs, seq_len, self.n_heads, d_model)
output = out_flat.sum(dim=2)  # (BS, Seq, D)
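A self-contained check, with toy sizes (all shapes here are illustrative), that the embedding_bag version matches an explicit gather-and-weighted-sum:

```python
import torch
import torch.nn.functional as F

# Toy sizes; names mirror the snippet above but values are made up.
bs, seq_len, n_heads, k, n_vals, d_model = 2, 3, 4, 5, 32, 8
values = torch.nn.Embedding(n_vals, d_model)
memory_indices = torch.randint(0, n_vals, (bs, seq_len, n_heads, k))
attn_weights = torch.randn(bs, seq_len, n_heads, k).softmax(dim=-1)

# Fused lookup + weighted sum; each row of the flattened indices is a "bag",
# so the (bs, seq, heads, k, d_model) gathered tensor is never built.
out_flat = F.embedding_bag(
    memory_indices.view(-1, k),
    values.weight,
    per_sample_weights=attn_weights.view(-1, k),
    mode='sum',
)
output = out_flat.view(bs, seq_len, n_heads, d_model).sum(dim=2)

# Reference: explicit gather, then weighted sum over k and sum over heads.
gathered = values(memory_indices)  # (bs, seq, heads, k, d_model)
ref = (gathered * attn_weights.unsqueeze(-1)).sum(dim=3).sum(dim=2)
assert torch.allclose(output, ref, atol=1e-5)
```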

@mtboro (Contributor) left a comment

Overall the PR looks good to me. Take a look at the suggestions that would improve memory usage, and resolve the comments left by @crewtool.


# Calculate similarity between full Q and the reconstructed candidates
# q needs unsqueeze to broadcast: (B, H, S, 1, D) @ (B, H, S, K*K, D).T
# TODO


Is this TODO done? If so, remove this comment line.

Contributor

I think I've addressed this on this branch: https://github.com/llm-random/nano/tree/bgw/pk_attn_update

The PR will follow after the current one is merged.

Contributor Author

This is a suggestion from a previous PR that something could be done more optimally. It's not done yet.

Comment on lines +333 to +334
self.c1 = nn.Parameter(torch.randn(n_heads, n_sub_keys, query_dim // 2))
self.c2 = nn.Parameter(torch.randn(n_heads, n_sub_keys, query_dim // 2))
Contributor

I think that initializing with std=d_model**-0.5 might slightly improve convergence.
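For reference, a minimal sketch of the suggested scaled initialization (sizes are made up for illustration; only the std = d_model**-0.5 scaling follows the comment):

```python
import torch
import torch.nn as nn

# Illustrative sizes; the real ones come from the module config.
n_heads, n_sub_keys, query_dim, d_model = 4, 16, 32, 64
std = d_model ** -0.5

# Scale the unit-variance randn draw down to the suggested std.
c1 = nn.Parameter(torch.randn(n_heads, n_sub_keys, query_dim // 2) * std)
# Equivalent in-place variant using Tensor.normal_.
c2 = nn.Parameter(torch.empty(n_heads, n_sub_keys, query_dim // 2).normal_(std=std))
```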

4 participants