Current experiments don't show throughput benefits for MLA-o. We need systematic benchmarking to find the model scales where computational benefits emerge.
Current Status: No speed difference observed even at sequence length 1024
Tasks:
Hypothesis: Whether performance benefits appear depends on head count and head size.
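
One way to test the hypothesis is a grid sweep over head count, head size, and sequence length that times a baseline attention layer against the MLA-o variant. The sketch below is only a harness. `make_baseline` and `make_candidate` are hypothetical factories, not existing functions in this codebase, and must be supplied by whoever runs the benchmark. The grid values in the usage note are placeholders.

```python
import itertools
import statistics
import time


def time_call(fn, warmup=2, repeats=5):
    """Return the median wall-clock seconds of fn() over `repeats` runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def sweep(make_baseline, make_candidate, head_counts, head_dims, seq_lens, **timing):
    """Time both variants on every (n_heads, head_dim, seq_len) combination.

    make_baseline / make_candidate are hypothetical factories: each takes
    (n_heads, head_dim, seq_len) and returns a zero-argument callable that
    runs one forward pass.
    """
    rows = []
    for n_heads, head_dim, seq_len in itertools.product(head_counts, head_dims, seq_lens):
        base_s = time_call(make_baseline(n_heads, head_dim, seq_len), **timing)
        cand_s = time_call(make_candidate(n_heads, head_dim, seq_len), **timing)
        rows.append({
            "n_heads": n_heads,
            "head_dim": head_dim,
            "seq_len": seq_len,
            "baseline_s": base_s,
            "candidate_s": cand_s,
            # speedup > 1 means the candidate is faster than the baseline
            "speedup": base_s / cand_s if cand_s > 0 else float("inf"),
        })
    return rows
```

A call such as `sweep(make_baseline, make_candidate, [8, 16, 32], [64, 128], [1024, 4096])` returns one row per configuration, which can then be scanned for the point where `speedup` exceeds 1. When timing on a GPU, each callable should synchronize the device before returning (for example with `torch.cuda.synchronize()` in PyTorch). Otherwise asynchronous kernel launches make the measurements unreliable.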