Investigate the relative impact of Query latent space vs Output latent space decomposition.
Decomposing the query space is an already-established technique in DeepSeek-V3.
Let's say we find that decomposing the Query space:
- Doesn't improve throughput.
- Hurts performance on the benchmark task.
(We might expect to see this given that DeepSeek-V2-Lite doesn't use a query latent space!)
If so, that would be helpful context for our results.
In particular, it would be interesting to see the relative impact of the two decompositions. Does one seem more beneficial than the other?
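To make the throughput question concrete, here is a minimal parameter-count sketch. It assumes each decomposition replaces a dense `d_in x d_out` projection with a down-projection into a rank-`r` latent space followed by an up-projection. The function names and the 2048/512 sizes are hypothetical and are not taken from our configs.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in one dense projection W of shape (d_out, d_in), bias ignored."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights when W is factored as W_up @ W_down through a rank-`rank` latent space."""
    down = rank * d_in  # W_down: (rank, d_in) projects the input into the latent space
    up = d_out * rank   # W_up: (d_out, rank) projects the latent back out
    return down + up


def breakeven_rank(d_in: int, d_out: int) -> float:
    """Rank at which factored and dense forms have equal weight counts; smaller ranks use fewer."""
    return d_in * d_out / (d_in + d_out)


if __name__ == "__main__":
    d_model, r = 2048, 512  # hypothetical sizes, for illustration only
    dense = dense_params(d_model, d_model)
    factored = low_rank_params(d_model, d_model, r)
    # Per-token matmul FLOPs are roughly 2x the weight count in both cases.
    print(dense, factored, factored / dense, breakeven_rank(d_model, d_model))
```

Fewer weights and FLOPs per token do not by themselves guarantee higher throughput: the factored form runs two smaller matmuls instead of one, so measured speed can still come out flat, which is one way the "doesn't improve throughput" outcome could arise.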
Approach
I think comparing "Query only", "Output only", and standard MHA might give the most direct comparison.
Tasks:
* [ ] Some variants to compare (we already have some of these):
    - No decompositions (standard MHA)
    - KV only
    - Query only
    - Output only
    - Query and KV (standard MLA)
    - Query and Output
    - KV and Output
    - Query, KV, and Output (MLA-o)
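The eight variants above are every subset of {Query, KV, Output} decompositions. A minimal sketch for enumerating that grid, assuming each variant can be driven by three boolean switches; the flag names (`decompose_query`, `decompose_kv`, `decompose_output`) and helper functions are hypothetical and would need to be mapped onto whatever our model config actually uses.

```python
from itertools import combinations

# The three projections that can each be decomposed through a latent space.
COMPONENTS = ("Query", "KV", "Output")

# Named baselines from the variant list.
NAMED = {
    frozenset(): "standard MHA",
    frozenset({"Query", "KV"}): "standard MLA",
    frozenset({"Query", "KV", "Output"}): "MLA-o",
}

# The three-way comparison suggested as the most direct test.
DIRECT_COMPARISON = [(), ("Query",), ("Output",)]


def ablation_grid():
    """Yield every subset of COMPONENTS, from no decompositions up to all three."""
    for k in range(len(COMPONENTS) + 1):
        yield from combinations(COMPONENTS, k)


def describe(subset):
    """Human-readable label matching the wording of the variant list."""
    if not subset:
        label = "No decompositions"
    elif len(subset) == 1:
        label = f"{subset[0]} only"
    else:
        label = " + ".join(subset)
    name = NAMED.get(frozenset(subset))
    return f"{label} ({name})" if name else label


def to_flags(subset):
    """Hypothetical config flags; rename to match the real model config."""
    return {f"decompose_{c.lower()}": c in subset for c in COMPONENTS}


if __name__ == "__main__":
    for subset in ablation_grid():
        print(describe(subset), to_flags(subset))
```

Generating the grid from the three switches keeps the variant names and the configs from drifting apart as runs are added.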