Systematic throughput benchmarking to identify MLA-o advantages #8

Current experiments don't show throughput benefits for MLA-o. We need systematic benchmarking to find the model scales where computational benefits emerge.

**Current Status:** No speed difference observed even at sequence length 1024

**Tasks:**
- [ ] Create isolated attention layer benchmarking script
- [ ] Test various configurations:
  - [ ] Number of heads (current: 8, try: 12, 16, 24)
  - [ ] Head sizes (current: 32, try: 64, 128)
  - [ ] Sequence lengths (128, 512, 1024, 2048, 4096)
  - [ ] Hidden dimensions
- [ ] Benchmark throughput:
  - [ ] Training: run for a fixed number of steps (e.g., 1K).
  - [ ] Inference: random weights are sufficient.
- [ ] Document the crossover points where MLA-o becomes faster
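As a starting point for the tasks above, here is a minimal sketch of the sweep and timing harness, using only the Python standard library. The head counts, head sizes, and sequence lengths come from the task list. Everything else is an assumption made for illustration: the function names (`config_grid`, `time_call`, `tokens_per_second`, `crossover_points`), deriving the hidden dimension as `heads * head_size`, and the warmup and repeat counts. The attention layers themselves are not shown; `fn` stands in for one forward (or forward plus backward) call of whichever layer is being measured.

```python
import itertools
import statistics
import time

# Grid values are taken from the task list. Deriving the hidden dimension as
# heads * head_size is an assumption, not something the issue specifies.
NUM_HEADS = [8, 12, 16, 24]
HEAD_SIZES = [32, 64, 128]
SEQ_LENS = [128, 512, 1024, 2048, 4096]


def config_grid():
    """Yield one dict per (heads, head_size, seq_len) combination."""
    for heads, head_size, seq_len in itertools.product(
        NUM_HEADS, HEAD_SIZES, SEQ_LENS
    ):
        yield {
            "heads": heads,
            "head_size": head_size,
            "seq_len": seq_len,
            "hidden": heads * head_size,
        }


def time_call(fn, warmup=3, repeats=10):
    """Return the median wall-clock seconds of fn() after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def tokens_per_second(seq_len, batch_size, seconds):
    """Convert one timed step into a throughput figure."""
    return seq_len * batch_size / seconds


def crossover_points(results):
    """Return the configs where the MLA-o timing beats the baseline timing.

    results is a list of (config, baseline_seconds, mla_o_seconds) tuples.
    """
    return [cfg for cfg, base, mla in results if mla < base]
```

If the layers run on a GPU, `fn` would need to wait for the device to finish before returning (in PyTorch, for example, by calling `torch.cuda.synchronize()`), otherwise the timer mostly measures kernel launch overhead. The median is used instead of the mean so that a single slow iteration does not skew the result.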

**Hypothesis:** The speedup depends on head count and head size, so it may only appear at values larger than the current 8 heads of size 32.
