[draft] use flash_attention from cuda-free #12
Draft
sijiac wants to merge 1 commit into triton-lang:main from
Conversation
Collaborator
can you run with benchmarking turned on and see the difference? would be curious to see the attention-specific latency here :) in that case you don't need to specify
By switching to the attention kernel from the cuda-free repo, the Triton attention path now works well for the kernels repo.

The missing piece in the kernels repo's attention kernel is that it doesn't support the decoding case, where the length of Q and the length of K differ within the same batch.

python3 -m main llama_chat_completion --profile=False --benchmark=False --ckpt_dir="/home/sijiac/models/Meta-Llama-3-8B-Instruct/" --tokenizer_path="/home/sijiac/models/Meta-Llama-3-8B-Instruct/tokenizer.model" --use_triton=True
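For context, here is a minimal PyTorch sketch (not the cuda-free Triton kernel itself, just an assumed naive reference) of what the decoding case means: during decoding, Q carries only the newly generated token while K/V carry the full cached context, so q_len != k_len within the same batch.

```python
# Naive attention reference (illustration only, not the cuda-free flash_attention kernel).
import math
import torch

def naive_attention(q, k, v):
    # q: [batch, heads, q_len, head_dim]; k, v: [batch, heads, k_len, head_dim]
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # [batch, heads, q_len, k_len]
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)                          # [batch, heads, q_len, head_dim]

batch, heads, head_dim = 2, 8, 64

# Prefill: q_len == k_len (this is the case the kernels repo already handles).
q = torch.randn(batch, heads, 128, head_dim)
k = torch.randn(batch, heads, 128, head_dim)
v = torch.randn(batch, heads, 128, head_dim)
print(naive_attention(q, k, v).shape)      # torch.Size([2, 8, 128, 64])

# Decoding: a single new query token attends over the cached context, so q_len != k_len.
q_dec = torch.randn(batch, heads, 1, head_dim)
print(naive_attention(q_dec, k, v).shape)  # torch.Size([2, 8, 1, 64])
```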