The LLM prompt for the first round (abridged; collapsed sections elided):

You are an expert GPU kernel autotuner for Helion/Triton kernels. Use the provided Configuration Space and Default Configuration as the source of truth for:

General heuristics: …

Output contract: …
User Prompt

Kernel Source Code

```python
# Imports inferred from the kernel body (not shown in the captured prompt).
import math

import torch

import helion.language as hl


def attention(
    q_in: torch.Tensor,
    k_in: torch.Tensor,
    v_in: torch.Tensor,
) -> torch.Tensor:
    """
    Computes scaled dot-product attention.

    Implements the attention mechanism:
        Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

    Args:
        q_in: Query tensor of shape [..., seq_len_q, head_dim]
        k_in: Key tensor of shape [..., seq_len_k, head_dim]
        v_in: Value tensor of shape [..., seq_len_k, head_dim]

    Returns:
        Output tensor of shape [..., seq_len_q, head_dim]
    """
    m_dim = q_in.size(-2)
    n_dim = k_in.size(-2)
    assert n_dim == v_in.size(-2)
    head_dim = hl.specialize(q_in.size(-1))
    assert head_dim == k_in.size(-1) == v_in.size(-1)
    q_view = q_in.reshape([-1, m_dim, head_dim])
    v_view = v_in.reshape([-1, n_dim, head_dim])
    k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
    out = torch.empty_like(q_view)
    sm_scale = 1.0 / math.sqrt(head_dim)
    qk_scale = sm_scale * 1.44269504  # log2(e) = 1/ln(2)
    for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
        m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
        l_i = torch.full_like(m_i, 1.0)
        acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32)
        q = q_view[tile_b, tile_m, :]
        for tile_n in hl.tile(v_view.size(1)):
            k = k_view[tile_b, :, tile_n]
            qk = torch.bmm(q, k)
            m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale)
            qk = qk * qk_scale - m_ij[:, :, None]
            p = torch.exp2(qk)
            l_ij = torch.sum(p, -1)
            alpha = torch.exp2(m_i - m_ij)
            l_i = l_i * alpha + l_ij
            acc = acc * alpha[:, :, None]
            v = v_view[tile_b, tile_n, :]
            p = p.to(v.dtype)
            acc = torch.baddbmm(acc, p, v)
            m_i = m_ij
        m_i += torch.log2(l_i)
        acc = acc / l_i[:, :, None]
        out[tile_b, tile_m, :] = acc.to(out.dtype)
    return out.view(q_in.size())
```

Input Tensors: …

GPU Hardware: …

Configuration Space: … power_of_2(min=16, max=1024, default=16)] …

Default Configuration: {"block_sizes":[1,16,16],"indexing":["pointer","pointer","pointer","pointer"],"l2_groupings": …

Search Strategy: … reuse, accumulation, memory-vs-compute, and scheduling traits from the code itself and target hardware, …

Task: Suggest the first batch of configs. Include both near-default and exploratory candidates. Return minified …
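As an aside for readers unfamiliar with the exp2-based online softmax used in the kernel above, the same update rule can be checked against a naive softmax in plain NumPy. This is a hedged sketch for illustration only: the function names, shapes, and block size are arbitrary, and it ignores the batch dimension and tiling details of the real kernel.

```python
import numpy as np

def attention_reference(q, k, v):
    # Naive scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = q.shape[-1]
    s = (q @ k.T) / np.sqrt(d)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def attention_online(q, k, v, block=4):
    # Streaming version mirroring the kernel's inner loop: keep a running
    # row max m_i and row sum l_i, and rescale the accumulator by
    # alpha = exp2(m_i - m_ij) whenever the max grows. qk_scale folds
    # 1/sqrt(d) and log2(e) together so exp2 can replace exp.
    d = q.shape[-1]
    qk_scale = (1.0 / np.sqrt(d)) * 1.44269504
    m_i = np.full(q.shape[0], -np.inf)
    l_i = np.ones(q.shape[0])  # starts at 1.0; zeroed by alpha = 0 on the first block
    acc = np.zeros_like(q, dtype=np.float64)
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        qk = q @ kb.T
        m_ij = np.maximum(m_i, qk.max(axis=-1) * qk_scale)
        p = np.exp2(qk * qk_scale - m_ij[:, None])
        alpha = np.exp2(m_i - m_ij)
        l_i = l_i * alpha + p.sum(axis=-1)
        acc = acc * alpha[:, None] + p @ vb
        m_i = m_ij
    return acc / l_i[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(7, 8))
assert np.allclose(attention_online(q, k, v, block=3), attention_reference(q, k, v))
```

The streaming form is what makes the kernel's accumulator updates legal: each new key/value block only rescales previous partial results rather than forcing a full softmax recomputation.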
The LLM response (abridged):

Round 0 Response: { … }
The next round of the LLM prompt (abridged; collapsed sections elided):

Round 1 Refinement Prompt

Search State

Anchor Configs: … "persistent_blocked"} …

Results (best first): … "tensor_descriptor"], "num_warps": 8} …

Top Config Patterns: …

Failed Config Patterns: … ["", "last", "last"], "loop_orders": [[1, 0]], "num_stages": 8, "num_warps": 16, "range_flattens": [null, false], …

Next Step: … instead of rewriting every field.

Task: Suggest up to 15 NEW UNIQUE configs around the anchors above. Avoid the failed patterns above and favor targeted edits with …
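The Round 1 prompt asks for "targeted edits" around anchor configs while avoiding failed patterns. A minimal sketch of what such neighborhood generation could look like is below; the field names mirror keys visible in the prompts ("block_sizes", "num_warps", "num_stages"), but the mutation values and the `neighbors` helper are illustrative assumptions, not code from this PR or the real Helion search space.

```python
import json

def neighbors(anchor, mutations, failed):
    # Mutate one field of the anchor at a time, skipping the anchor itself,
    # any config matching a failed config, and duplicates.
    out = []
    for key, values in mutations.items():
        for val in values:
            cand = dict(anchor, **{key: val})
            if cand != anchor and cand not in failed and cand not in out:
                out.append(cand)
    return out

anchor = {"block_sizes": [1, 16, 16], "num_warps": 4, "num_stages": 3}
mutations = {"num_warps": [2, 4, 8], "num_stages": [2, 3, 4]}
failed = [{"block_sizes": [1, 16, 16], "num_warps": 8, "num_stages": 3}]
candidates = neighbors(anchor, mutations, failed)
print(json.dumps(candidates, separators=(",", ":")))  # minified, as the prompt requests
```

Single-field edits keep each candidate attributable to one change, which makes the next round's "top/failed pattern" summaries more informative than wholesale rewrites would.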
stack-info: PR: #2003, branch: choijon5/stack/3
Stacked PRs:
[Autotuner] Adding LLM-guided search
Prompting an LLM to get the configs:
Ln denotes n rounds of LLM prompting.
Ln_spd is the speedup of the best config after n LLM rounds vs. LFBO full autotuning.
Ln_t1/LF is the ratio of the wall-clock time to finish the nth LLM round to the LFBO full-autotuning time.
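To make the two metrics concrete, here is a small sketch of how they could be computed; the function and parameter names are my own reading of the abbreviations above, not code from this PR.

```python
def round_metrics(llm_best_ms, llm_search_s, lfbo_best_ms, lfbo_search_s):
    # Ln_spd: latency of LFBO's best config over latency of the LLM round's
    # best config (1.0 means on-par perf; >1.0 means the LLM config is faster).
    spd = lfbo_best_ms / llm_best_ms
    # Ln_t1/LF: fraction of LFBO's wall-clock search time the LLM rounds used.
    time_frac = llm_search_s / lfbo_search_s
    return spd, time_frac

# e.g. a one-shot round landing within ~5% of LFBO perf using 5% of its time
spd, frac = round_metrics(llm_best_ms=1.05, llm_search_s=60.0,
                          lfbo_best_ms=1.00, lfbo_search_s=1200.0)
```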
For simple kernels (matmul, layer_norm), the LLM is able to one-shot a config with on-par perf vs. LFBO full autotuning, at a tiny fraction (~10s, 5-8%) of the LFBO full-autotuning time.
For more complex kernels (attention), the LLM does not reach on-par perf after 3 rounds, even though perf improves each round; 3 LLM rounds still take only 14% of the LFBO full-autotuning time.
There are cases (cross_entropy) where perf does not improve with more LLM rounds.
The LLM prompts (and their responses) are shown in the comments below, although the prompts are still changing.