[pull] master from ggml-org:master #854

Merged
pull[bot] merged 5 commits into LongLeCE:master from ggml-org:master on Feb 6, 2026

Conversation


@pull pull bot commented Feb 6, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

ggerganov and others added 5 commits February 6, 2026 07:55
This commit addresses the TODO in llama-sampling.h to rename that header
and the implementation to llama-sampler.
* metal : skip loading all-zero mask
* cont : minor
* vulkan: make FA mask/softcap enables spec constants (see the sketch after this list)
* don't specialize for sinks
* bump timeout a little bit
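
For context on the spec-constants change: turning the mask/softcap enables into Vulkan specialization constants (rather than runtime values) lets the driver compile a pipeline variant with the disabled branches eliminated. Below is a minimal sketch of the host-side plumbing; the constant IDs, struct, and names (HAS_MASK, HAS_SOFTCAP) are hypothetical, not the actual llama.cpp identifiers.

```cpp
// Sketch only: passing "has mask" / "has softcap" as Vulkan specialization
// constants so the shader compiler can dead-code-eliminate disabled paths.
// Constant IDs, struct, and names are hypothetical, not llama.cpp's own.
#include <vulkan/vulkan.h>
#include <cstddef>
#include <cstdint>

struct FaSpecData {
    uint32_t has_mask;     // shader side: layout (constant_id = 0) const bool HAS_MASK
    uint32_t has_softcap;  // shader side: layout (constant_id = 1) const bool HAS_SOFTCAP
};

// The returned struct points at `data` (and at static map entries), so the
// caller must keep `data` alive until the pipeline is created.
VkSpecializationInfo make_fa_spec_info(const FaSpecData & data) {
    static const VkSpecializationMapEntry entries[2] = {
        { 0, offsetof(FaSpecData, has_mask),    sizeof(uint32_t) },
        { 1, offsetof(FaSpecData, has_softcap), sizeof(uint32_t) },
    };
    VkSpecializationInfo info = {};
    info.mapEntryCount = 2;
    info.pMapEntries   = entries;
    info.dataSize      = sizeof(FaSpecData);
    info.pData         = &data;
    return info;
}
```

The result would be plugged into VkPipelineShaderStageCreateInfo::pSpecializationInfo, one pipeline variant per enable combination; the trade-off is more pipelines to compile in exchange for cheaper shaders at run time.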
…19376)

The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change does the same for Vulkan. This helps particularly with large head sizes, which are very register-limited.

I tried this for the coopmat1 path and it slowed down a bit. I didn't try it for the scalar path.

I applied the softmax bias that the CUDA backend uses to avoid overflow, although I was not able to reproduce the original bug without it.
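
To make the overflow concern concrete: fp16 has a maximum finite value of 65504, and exp(x) already exceeds that for x above roughly 11.09, so an fp16 softmax accumulator needs its logits shifted before exponentiation. Here is a minimal sketch of the standard online-softmax recurrence, using plain float in place of fp16; the logits and the bias value are illustrative, not the actual CUDA/Vulkan kernel code.

```cpp
// Sketch only: the running-max rescaling that keeps every exp() term <= 1,
// so an fp16 accumulator cannot overflow. `bias` stands in for the extra
// negative offset the commit borrows from the CUDA backend (illustrative).
#include <cmath>
#include <cstdio>

int main() {
    const float scores[] = { 3.0f, 8.0f, 15.0f, 12.0f }; // raw QK^T logits
    const float bias = -1.0f;                            // hypothetical value

    float m = -INFINITY; // running maximum of the logits seen so far
    float d = 0.0f;      // running denominator: sum of exp(s_i - m + bias)
    for (float s : scores) {
        const float m_new = std::fmax(m, s);
        d = d * std::exp(m - m_new)      // rescale old terms to the new max
          + std::exp(s - m_new + bias);  // new term, always <= exp(bias) < 1
        m = m_new;
    }
    // Every exponentiated term is bounded by exp(bias) < 1, well inside the
    // fp16 range of +/-65504, even though exp(15.0) alone would overflow it.
    std::printf("max = %f, denom = %f\n", m, d);
    return 0;
}
```

Since the same exp(bias) factor multiplies both the weighted-V numerator and the denominator d, it cancels in the final division, which is why applying the bias is safe.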
@pull pull bot locked and limited conversation to collaborators Feb 6, 2026
@pull pull bot added the ⤵️ pull label Feb 6, 2026
@pull pull bot merged commit 1946e46 into LongLeCE:master Feb 6, 2026
