
vulkan: add allreduce function with cross-device CPU proxy and fix Tensor Parallel crash [EXPERIMENTAL]#25051

Draft
pwilkin wants to merge 7 commits into ggml-org:master from pwilkin:vulkan-tp-p2p


Conversation

@pwilkin

@pwilkin pwilkin commented Jun 26, 2026

Member

Overview

I've heard from @0cc4m that Vulkan maintainers really like large, LLM assisted PRs, so here's one that should make them happy 😁

This fixes crashes in the pipeline and introduces a Vulkan F16 AllReduce.

Additional information

Benchmark results on my box - would love someone with non-NVIDIA hardware to test this:

Vulkan tensor-parallel benchmark v2 (rebased+cleaned build) — RTX 3080 + RTX 5060 Ti

pp512 / tg128 (tok/s). llama-bench -fa 1, default ubatch. 19 LLMs, 2-16GB. Each split-mode in its own run.
Cases: Vk-layer / CUDA-layer / Vk-tensor crashfix (butterfly, GGML_VK_COMM_OFF) / Vk-tensor F16 (this work) / CUDA-tensor (NCCL).

Depth 0

| model | Vk-layer | CUDA-layer | Vk-tns crashfix | Vk-tns F16 | CUDA-tensor |
| --- | ---: | ---: | ---: | ---: | ---: |
| Apriel-1.6-15b | 1843/44 | 2152/52 | 400/33 | 1262/55 | 1258/68 |
| Bielik-11B | 2483/33 | 2695/39 | 461/30 | 1567/47 | 1524/55 |
| Devstral-24B | 610/23 | 700/27 | -/- | -/- | -/- |
| Falcon-H1R-7B | 2528/44 | 2946/52 | -/- | -/- | -/- |
| GLM-4.6V-Flash | 2143/39 | 3559/48 | -/- | -/- | -/- |
| LFM2-8B-A1B | 5062/259 | 9023/317 | -/- | -/- | -/- |
| Llama-3.2-3B | 4444/104 | 9173/123 | 1378/62 | 4047/115 | 3919/142 |
| Ministral-3-14B | 1776/32 | 2025/38 | 460/30 | 1378/42 | 1373/49 |
| North-Mini-Code | 2253/107 | 3380/142 | -/- | -/- | -/- |
| Qwen3-4B-2507 | 3911/127 | 6312/144 | 1232/51 | 3231/77 | 3375/146 |
| Qwen3.5-27B | 397/20 | 522/26 | 283/21 | 413/29 | -/- |
| Qwen3.5-35B-A3B | 1916/82 | 2708/101 | 1061/37 | 2290/68 | -/- |
| Qwen3.5-9B | 2943/42 | 3323/51 | 706/44 | 2205/61 | 2180/73 |
| Qwen3.6-27B-smol | 527/21 | 595/26 | -/- | -/- | -/- |
| gemma-3-12b | 2131/26 | 2372/31 | -/- | -/- | -/- |
| gemma-4-E2B | 5313/62 | 6122/82 | -/- | -/- | -/- |
| gemma-4-E4B | 2700/67 | 4363/88 | 928/40 | 2392/70 | 2536/94 |
| gpt-oss-20b | 837/102 | 5374/149 | 1337/66 | 3335/120 | 3878/178 |
| granite-4.0-h-tiny | 5210/147 | 5908/171 | -/- | -/- | -/- |

Depth 4096

| model | Vk-layer | CUDA-layer | Vk-tns crashfix | Vk-tns F16 | CUDA-tensor |
| --- | ---: | ---: | ---: | ---: | ---: |
| Apriel-1.6-15b | 1625/39 | 1924/48 | 388/33 | 1190/50 | 1191/64 |
| Bielik-11B | 2103/30 | 2330/37 | 432/30 | 1448/44 | 1428/51 |
| Devstral-24B | 980/21 | 1210/27 | -/- | -/- | -/- |
| Falcon-H1R-7B | 2350/40 | 2758/51 | -/- | -/- | -/- |
| GLM-4.6V-Flash | 2302/38 | 3081/46 | -/- | -/- | -/- |
| LFM2-8B-A1B | 6962/224 | 8538/311 | -/- | -/- | -/- |
| Llama-3.2-3B | 6060/85 | 7033/109 | 1106/58 | 3654/97 | 3570/132 |
| Ministral-3-14B | 1597/30 | 1847/36 | 437/29 | 1315/40 | 1312/47 |
| North-Mini-Code | 2250/82 | 2431/120 | -/- | -/- | -/- |
| Qwen3-4B-2507 | 3995/91 | 5006/122 | 935/49 | 2863/90 | 3036/135 |
| Qwen3.5-27B | 759/23 | 975/26 | 271/21 | 688/30 | -/- |
| Qwen3.5-35B-A3B | 2298/76 | 2608/99 | 915/37 | 2160/65 | -/- |
| Qwen3.5-9B | 2850/40 | 3255/51 | 693/42 | 2119/59 | 2111/72 |
| Qwen3.6-27B-smol | 729/20 | 981/26 | -/- | -/- | -/- |
| gemma-3-12b | 2040/24 | 2317/30 | -/- | -/- | -/- |
| gemma-4-E2B | 3824/54 | 5574/81 | -/- | -/- | -/- |
| gemma-4-E4B | 2790/55 | 4056/85 | 795/38 | 2229/66 | 2425/92 |
| gpt-oss-20b | 2754/101 | 5057/142 | 1259/61 | 3173/107 | 3632/172 |
| granite-4.0-h-tiny | 5173/138 | 5999/172 | -/- | -/- | -/- |

Depth 40000

| model | Vk-layer | CUDA-layer | Vk-tns crashfix | Vk-tns F16 | CUDA-tensor |
| --- | ---: | ---: | ---: | ---: | ---: |
| Apriel-1.6-15b | 703/23 | 703/26 | 285/25 | -/- | -/- |
| Bielik-11B | 591/19 | 730/22 | -/- | -/- | -/- |
| Devstral-24B | 356/16 | -/- | -/- | -/- | -/- |
| Falcon-H1R-7B | 790/31 | 1636/43 | -/- | -/- | -/- |
| GLM-4.6V-Flash | 948/33 | 1307/41 | -/- | -/- | -/- |
| LFM2-8B-A1B | 3044/124 | 5475/236 | -/- | -/- | -/- |
| Llama-3.2-3B | 1691/44 | 1500/57 | 738/47 | 1661/67 | 2087/80 |
| Ministral-3-14B | 724/20 | 770/24 | -/- | -/- | -/- |
| North-Mini-Code | 1227/70 | 1981/97 | -/- | -/- | -/- |
| Qwen3-4B-2507 | 1085/42 | 1146/47 | 680/37 | 1215/56 | 1723/69 |
| Qwen3.5-27B | 567/19 | 733/24 | -/- | -/- | -/- |
| Qwen3.5-35B-A3B | 1365/62 | 1964/86 | -/- | -/- | -/- |
| Qwen3.5-9B | 1752/36 | 2460/45 | 631/39 | 1765/53 | 1794/65 |
| Qwen3.6-27B-smol | 531/18 | 746/23 | -/- | -/- | -/- |
| gemma-3-12b | 1382/20 | 1663/26 | -/- | -/- | -/- |
| gemma-4-E2B | 1742/53 | 3114/74 | -/- | -/- | -/- |
| gemma-4-E4B | 1472/50 | 2605/72 | 744/37 | 1386/60 | 1952/81 |
| gpt-oss-20b | 2081/72 | 2317/112 | 1092/60 | 2281/90 | 2625/143 |
| granite-4.0-h-tiny | 3896/113 | 5026/152 | -/- | -/- | -/- |

Requirements

@pwilkin pwilkin requested review from a team and JohannesGaessler as code owners June 26, 2026 12:37
@pwilkin pwilkin requested review from 0cc4m and jeffbolznv June 26, 2026 12:38
@github-actions github-actions Bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 26, 2026
@pwilkin pwilkin marked this pull request as draft June 26, 2026 14:18
pwilkin added 3 commits June 26, 2026 22:59
…cache views

When a per-device allocation exceeds the backend's max buffer size (e.g. a large
KV cache), ggml-alloc returns a multi_buffer wrapping several real buffers.
Compute-graph views inherited that multi_buffer as their backend buffer, so a
backend that casts tensor->buffer->context to its own buffer-context type (the
Vulkan backend does, e.g. in ggml_vk_tensors_overlap) dereferenced garbage and
crashed with -sm tensor (issue ggml-org#22197).

A view aliases its source's storage, so it must reference the source's real
sub-buffer: set t_ij->buffer = t_ij->view_src->buffer. This is the correct ggml
invariant and a no-op in the single-buffer case.
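
The invariant in this commit message can be illustrated with a toy model. This is a minimal Python sketch, not ggml code: `Buffer`, `Tensor`, and `fix_view_buffers` are hypothetical names chosen for illustration, and only the rule "a view takes the buffer of its `view_src`" comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Buffer:
    name: str
    is_multi: bool = False                     # True for a wrapper around real buffers
    children: Optional[List["Buffer"]] = None  # the wrapped real buffers, if any


@dataclass
class Tensor:
    name: str
    buffer: Buffer
    view_src: Optional["Tensor"] = None        # set when this tensor aliases another


def fix_view_buffers(tensors):
    """Point every view at the buffer of the tensor whose storage it aliases."""
    for t in tensors:
        if t.view_src is not None:
            t.buffer = t.view_src.buffer
    return tensors


real_a = Buffer("sub0")
real_b = Buffer("sub1")
multi = Buffer("multi", is_multi=True, children=[real_a, real_b])

kv = Tensor("kv_cache", buffer=real_b)                # source lives in a real sub-buffer
view = Tensor("kv_view", buffer=multi, view_src=kv)   # view inherited the wrapper

fix_view_buffers([kv, view])
print(view.buffer.name)
```

In the single-buffer case the view and its source already share one buffer, so the assignment changes nothing, which matches the "no-op" remark above.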

Assisted-by: Claude Opus 4.8

Implements the backend-agnostic comm hook (ggml_backend_comm_init /
_allreduce_tensor / _free, discovered by the meta backend via
get_proc_address) for the Vulkan backend, so tensor-parallel inference no
longer falls back to the meta backend's CPU-barriered butterfly AllReduce.
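
For readers unfamiliar with the term, a butterfly AllReduce can be sketched as follows. This is a generic recursive-doubling pattern in plain Python, written under the assumption of a power-of-two rank count; it is not the meta backend's code, which may differ in detail.

```python
def butterfly_allreduce(values):
    """Sum equally sized vectors across ranks using pairwise exchange stages."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch needs a power-of-two number of ranks")
    current = [list(v) for v in values]
    step = 1
    while step < n:
        # Each rank exchanges with the partner whose index differs in one bit,
        # then both hold the pairwise sum. Building a new list per stage plays
        # the role of the barrier between stages.
        nxt = []
        for rank in range(n):
            partner = rank ^ step
            nxt.append([a + b for a, b in zip(current[rank], current[partner])])
        current = nxt
        step *= 2
    return current


result = butterfly_allreduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
print(result[0])
```

Every stage has to finish on all ranks before the next one starts, which is where a CPU barrier per stage becomes expensive.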

Consumer GPUs have no P2P here, so the reduce stages through host memory, but
everything is ordered on the GPU via exported timeline semaphores (no CPU
barriers between layers). Each slice is split into chunks: the dedicated
transfer queue streams this device's slice out to shared host memory while the
compute queue pulls each peer chunk back as soon as it lands, so the two PCIe
directions overlap (full-duplex). Partials are cast to F16 before the host
transfer to halve the bytes on the bandwidth-bound link and added straight into
the fp32 result via the mixed-type add pipeline. Large prefill activations use
this pipeline; small (decode) tensors take a single-shot path where the fixed
per-call overhead dominates.
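
The F16 staging idea can be sketched in a few lines. This is a simplified Python illustration, not the Vulkan implementation: it ignores chunking and queues, uses Python floats in place of fp32, and assumes each device keeps its own partial at full precision while only peer partials pass through F16. The function names are hypothetical.

```python
import struct


def to_f16_bytes(values):
    """Pack floats as little-endian IEEE half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def from_f16_bytes(raw):
    """Unpack little-endian half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(raw) // 2}e", raw))


def allreduce_f16_staged(partials):
    """Stage every partial as F16 in 'host memory', then accumulate per device."""
    staged = [to_f16_bytes(p) for p in partials]
    results = []
    for rank, own in enumerate(partials):
        acc = list(own)                       # own partial stays at full precision
        for peer, raw in enumerate(staged):
            if peer == rank:
                continue
            for i, v in enumerate(from_f16_bytes(raw)):
                acc[i] += v                   # mixed-type add: F16 input, wide result
        results.append(acc)
    return staged, results


partials = [[0.1, 1.5, -2.25], [0.2, 0.5, 4.0]]
staged, results = allreduce_f16_staged(partials)
exact = [a + b for a, b in zip(*partials)]
print(len(staged[0]), struct.calcsize("<3f"))
```

The staged payload is half the size of the fp32 equivalent, at the cost of a small rounding error on each peer contribution.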

Roughly 2.5-3x the butterfly fallback; at long context it overtakes -sm layer
and is competitive with CUDA/NCCL on prefill. GGML_VK_COMM_OFF disables the
custom comm (falls back to butterfly); GGML_VK_COMM_FP32 forces fp32 staging.

Assisted-by: Claude Opus 4.8
…duce

The GPU-side cross-device ordering imports each peer's OPAQUE_FD timeline
semaphore, but OPAQUE_FD payloads are driver-private, so the import only works
when all devices share a driver (e.g. two NVIDIA GPUs). On mixed drivers or
vendors it is out of spec.

Add a portable fallback: a helper thread polls each peer's progress/upload
timeline and host-signals a local timeline that the consumer's download is
parked on (core timeline semaphores plus host signal/wait, no imported handle).
Both the chunked pipeline (prefill) and the single-shot (decode) paths are
bridged, so proxy mode no longer drops decode to the meta-backend butterfly.
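
The proxy fallback can be modelled on the host with ordinary threads. The sketch below is a Python analogy only: `Timeline` is a hypothetical stand-in for a timeline semaphore (a monotonically increasing counter with host signal and wait), and the polling interval is an arbitrary assumption.

```python
import threading


class Timeline:
    """Monotonic counter with host-side signal and wait."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def signal(self, value):
        with self._cond:
            if value > self._value:           # timeline values never go backwards
                self._value = value
                self._cond.notify_all()

    def wait(self, value, timeout=None):
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= value, timeout)

    def value(self):
        with self._cond:
            return self._value


def proxy(peer, local, last_value, poll_interval=0.001):
    """Forward the peer's progress to a local timeline until last_value is seen."""
    seen = 0
    while seen < last_value:
        if peer.wait(seen + 1, timeout=poll_interval):
            seen = peer.value()
            local.signal(seen)


peer_upload = Timeline()    # signalled by the producing device
local_ready = Timeline()    # the consumer's download is parked on this one
n_chunks = 4
downloaded = []


def consumer():
    for chunk in range(1, n_chunks + 1):
        local_ready.wait(chunk)
        downloaded.append(chunk)


threads = [
    threading.Thread(target=proxy, args=(peer_upload, local_ready, n_chunks)),
    threading.Thread(target=consumer),
]
for t in threads:
    t.start()
for chunk in range(1, n_chunks + 1):
    peer_upload.signal(chunk)
for t in threads:
    t.join()
print(downloaded)
```

The consumer never touches the peer's timeline directly, which mirrors the point of the fallback: no handle has to be imported across drivers.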

A capability gate (vkGetPhysicalDeviceExternalSemaphoreProperties plus a
driverUUID match) selects the proxy deterministically on unsupported configs;
GGML_VK_COMM_PROXY forces it, and the import try/catch stays as a safety net.
Measured within ~4% of the native-import path on decode and on par for
prefill, with byte-identical output.
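
The selection logic described above might look roughly like this. It is a hedged Python sketch: `DeviceInfo`, `select_comm_path`, and the returned labels are hypothetical, and treating the environment variables as "set means enabled" is an assumption about how they are parsed. Only the variable names and the driverUUID/export-capability criteria come from the commit messages.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceInfo:
    driver_uuid: bytes              # as reported per physical device
    can_export_timeline_fd: bool    # result of the external-semaphore capability query


def select_comm_path(devices, env=None):
    """Return 'butterfly', 'proxy', or 'native' for a set of devices."""
    env = os.environ if env is None else env
    if "GGML_VK_COMM_OFF" in env:
        return "butterfly"          # custom comm disabled, meta-backend fallback
    if "GGML_VK_COMM_PROXY" in env:
        return "proxy"              # forced CPU proxy
    same_driver = len({d.driver_uuid for d in devices}) == 1
    all_export = all(d.can_export_timeline_fd for d in devices)
    return "native" if same_driver and all_export else "proxy"


nvidia_pair = [DeviceInfo(b"drv-a", True), DeviceInfo(b"drv-a", True)]
mixed_pair = [DeviceInfo(b"drv-a", True), DeviceInfo(b"drv-b", True)]
print(select_comm_path(nvidia_pair, env={}), select_comm_path(mixed_pair, env={}))
```

Deciding up front keeps the choice deterministic; the try/catch around the import then only has to cover cases the gate could not predict.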

Assisted-by: Claude Opus 4.8
@pwilkin

pwilkin commented Jun 26, 2026

Member Author

Did a few more tests with an A16 and a 4090:

4090 + A16

-sm layer

3.15.215.713 I slot print_timing: id  3 | task 0 | prompt eval time =   35434.16 ms / 28872 tokens (    1.23 ms per token,   814.81 tokens per second)
3.15.215.720 I slot print_timing: id  3 | task 0 |        eval time =  140477.32 ms /  1481 tokens (   94.85 ms per token,    10.54 tokens per second)

-sm tensor

3.25.475.691 I slot print_timing: id  3 | task 0 | prompt eval time =   60539.25 ms / 28872 tokens (    2.10 ms per token,   476.91 tokens per second)
3.25.475.696 I slot print_timing: id  3 | task 0 |        eval time =  126102.68 ms /  1504 tokens (   83.84 ms per token,    11.93 tokens per second)

2xA16

-sm layer

4.14.866.003 I slot print_timing: id  3 | task 0 | prompt eval time =   52667.78 ms / 28872 tokens (    1.82 ms per token,   548.19 tokens per second)
4.14.866.009 I slot print_timing: id  3 | task 0 |        eval time =  179656.12 ms /  1233 tokens (  145.71 ms per token,     6.86 tokens per second)

-sm tensor

3.49.116.708 I slot print_timing: id  3 | task 0 | prompt eval time =   57678.27 ms / 28872 tokens (    2.00 ms per token,   500.57 tokens per second)
3.49.116.712 I slot print_timing: id  3 | task 0 |        eval time =  155333.79 ms /  1800 tokens (   86.30 ms per token,    11.59 tokens per second)

So as you can see, the 4090 is held back by the A16 and the boost from tensor parallel there is really small (TG 10.54 -> 11.93 t/s, with PP dropping to about 59% of -sm layer), but on 2xA16 the TG boost is about 1.7x (6.86 -> 11.59 t/s) while the PP loss is small (about 91% of the -sm layer figure).
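
The ratios can be checked directly from the tokens-per-second figures in the logs above. A small Python sketch; the dictionary layout is just for illustration.

```python
# (prompt eval t/s, eval t/s) copied from the print_timing lines above
runs = {
    "4090 + A16": {"layer": (814.81, 10.54), "tensor": (476.91, 11.93)},
    "2xA16": {"layer": (548.19, 6.86), "tensor": (500.57, 11.59)},
}

summary = {}
for name, r in runs.items():
    pp_ratio = r["tensor"][0] / r["layer"][0]   # prefill: tensor relative to layer
    tg_ratio = r["tensor"][1] / r["layer"][1]   # decode: tensor relative to layer
    summary[name] = (round(pp_ratio, 2), round(tg_ratio, 2))
    print(f"{name}: PP x{pp_ratio:.2f}, TG x{tg_ratio:.2f}")
```

Note that the runs generated different numbers of tokens, so these are rough comparisons rather than controlled measurements.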

@characharm

Contributor

AMD Radeon RX 9070 XT & AMD Radeon AI PRO R9700

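| model | size | params | backend | ngl | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | --: | ---: | ---: |
| qwen35 27B Q5_K - Medium | 18.94 GiB | 27.32 B | Vulkan | -1 | tensor | 1 | pp512 @ d60000 | 219.29 ± 1.65 |
| qwen35 27B Q5_K - Medium | 18.94 GiB | 27.32 B | Vulkan | -1 | tensor | 1 | tg128 @ d60000 | 10.68 ± 0.16 |

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | ---: | ---: |
| qwen35 27B Q5_K - Medium | 18.94 GiB | 27.32 B | Vulkan | -1 | 1 | pp512 @ d60000 | 493.22 ± 2.01 |
| qwen35 27B Q5_K - Medium | 18.94 GiB | 27.32 B | Vulkan | -1 | 1 | tg128 @ d60000 | 21.79 ± 0.09 |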

@wizardeur


AMD RX7900XTX (2, 4, 8 GPUs)

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | --: | ---: | ---: |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | layer | 1 | pp512 | 809.00 ± 2.70 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | layer | 1 | tg128 | 34.50 ± 0.07 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | tensor | 1 | pp512 | 1283.40 ± 7.10 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | tensor | 1 | tg128 | 41.57 ± 0.37 |

ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | --: | ---: | ---: |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | layer | 1 | pp512 | 762.29 ± 5.09 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | layer | 1 | tg128 | 29.66 ± 0.03 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | tensor | 1 | pp512 | 305.86 ± 1.00 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | tensor | 1 | tg128 | 14.68 ± 0.90 |

ggml_vulkan: Found 8 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 4 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 5 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 6 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 7 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | --: | ---: | ---: |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | layer | 1 | pp512 | 697.64 ± 3.23 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | layer | 1 | tg128 | 11.89 ± 0.31 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | tensor | 1 | pp512 | 95.61 ± 0.10 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | Vulkan | -1 | tensor | 1 | tg128 | 3.87 ± 1.57 |

And ROCm for comparison:

ggml_cuda_init: found 2 ROCm devices (Total VRAM: 49120 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 1: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB

| model | size | params | backend | ngl | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | --: | ---: | ---: |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | layer | 1 | pp512 | 913.54 ± 26.89 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | layer | 1 | tg128 | 26.34 ± 0.03 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | tensor | 1 | pp512 | 1535.47 ± 2.33 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | tensor | 1 | tg128 | 45.35 ± 0.27 |

ggml_cuda_init: found 4 ROCm devices (Total VRAM: 98240 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 1: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 2: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 3: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB

| model | size | params | backend | ngl | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | --: | ---: | ---: |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | layer | 1 | pp512 | 877.60 ± 31.85 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | layer | 1 | tg128 | 23.58 ± 0.02 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | tensor | 1 | pp512 | 2085.89 ± 13.42 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | tensor | 1 | tg128 | 55.79 ± 0.41 |

ggml_cuda_init: found 8 ROCm devices (Total VRAM: 196480 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 1: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 2: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 3: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 4: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 5: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 6: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 7: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB

| model | size | params | backend | ngl | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | --: | ---: | ---: |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | layer | 1 | pp512 | 811.63 ± 55.16 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | layer | 1 | tg128 | 21.48 ± 0.03 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | tensor | 1 | pp512 | 2163.08 ± 23.29 |
| qwen35 27B Q4_K - Medium | 15.48 GiB | 26.90 B | ROCm | -1 | tensor | 1 | tg128 | 37.54 ± 2.63 |

@netrunnereve netrunnereve Jun 27, 2026

Collaborator


Your way of dealing with the crash seems to be similar to the one proposed here (also AI generated, lol) that basically uses the view src to get around the multi buffer issue. I explain this a bit more in #22197 but I think this only works if the tensor is a view tensor. If it's not a view tensor then we don't have any way of knowing which buffer in the multi buffer the tensor is stored in.

Member Author


Yup, valid point. It didn't surface in the small models I've tested, but it probably will in any bigger one. Paging @0cc4m here: do you have any qualms about just adding a tensor-to-sub-buffer map in multi_buffer?
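
One possible shape for such a map, as a purely illustrative Python sketch: `MultiBuffer`, `record`, and `resolve` are hypothetical names and do not correspond to any existing ggml API. The assumption is that the allocator knows which sub-buffer it placed a tensor in and can record that at allocation time.

```python
class MultiBuffer:
    """Wrapper around several real buffers that remembers where tensors live."""

    def __init__(self, sub_buffers):
        self.sub_buffers = list(sub_buffers)
        self._owner = {}                      # tensor id -> index into sub_buffers

    def record(self, tensor_id, sub_index):
        """Called at allocation time, when the placement is known."""
        if not 0 <= sub_index < len(self.sub_buffers):
            raise IndexError(f"no sub-buffer {sub_index}")
        self._owner[tensor_id] = sub_index

    def resolve(self, tensor_id):
        """Return the real sub-buffer for a tensor; raises KeyError if unknown."""
        return self.sub_buffers[self._owner[tensor_id]]


mb = MultiBuffer(["sub0", "sub1"])
mb.record("k_l0", 1)
mb.record("v_l0", 0)
print(mb.resolve("k_l0"))
```

Unlike the view_src approach, a lookup like this would also cover tensors that are not views, at the cost of extra bookkeeping in the allocator.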

@digitalscream


OK, bit of oddness on my R9700s compared with the 7900XTX above - prefill takes a significant hit compared with -sm row:

ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1201) (radv) | uma: 0 | fp16: dot2 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon Graphics (RADV GFX1201) (radv) | uma: 0 | fp16: dot2 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch |     sm |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -----: | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | Vulkan     |  99 |     1024 |    row |   1 |          pp2048 |       1580.58 ± 5.18 |
| qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | Vulkan     |  99 |     1024 |    row |   1 |          pp4096 |       1556.35 ± 2.32 |
| qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | Vulkan     |  99 |     1024 |    row |   1 |           tg128 |         29.91 ± 0.03 |
| qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | Vulkan     |  99 |     1024 | tensor |   1 |          pp2048 |       1125.71 ± 2.82 |
| qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | Vulkan     |  99 |     1024 | tensor |   1 |          pp4096 |       1110.65 ± 1.31 |
| qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | Vulkan     |  99 |     1024 | tensor |   1 |           tg128 |         36.84 ± 0.38 |

Also, even more weirdness - when run as a server, the first request goes through at 36 t/s, but later requests fall off to about 2 t/s:

0.21.379.838 I srv  proxy_reques: proxying request to model nondraft_tensor_Qwen3.6-27B-Q4_0.gguf on port 47447
[47447] 0.08.608.257 I srv    operator(): Chat format: peg-native
[47447] 0.08.608.379 I slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
[47447] 0.08.608.381 I srv  get_availabl: updating prompt cache
[47447] 0.08.608.385 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
[47447] 0.08.608.388 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 16384.000 MiB, 256000 tokens, 17179869184 est)
[47447] 0.08.608.389 I srv  get_availabl: prompt cache update took 0.01 ms
[47447] 0.08.608.423 I slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
[47447] 0.08.608.425 I slot process_sing: id  0 | task -1 | saving idle slot to prompt cache
[47447] 0.08.608.426 I slot prompt_clear: id  0 | task -1 | clearing prompt with 0 tokens
[47447] 0.08.608.578 I slot process_sing: id  1 | task -1 | saving idle slot to prompt cache
[47447] 0.08.608.579 I slot prompt_clear: id  1 | task -1 | clearing prompt with 0 tokens
[47447] 0.08.608.723 I slot process_sing: id  2 | task -1 | saving idle slot to prompt cache
[47447] 0.08.608.725 I slot prompt_clear: id  2 | task -1 | clearing prompt with 0 tokens
[47447] 0.08.609.083 I srv  stream_sessi: stream_session_attach_pipe: conv_id= (empty=1)
[47447] 0.08.842.162 I slot create_check: id  3 | task 0 | created context checkpoint 1 of 32 (pos_min = 14, pos_max = 14, n_tokens = 15, size = 149.626 MiB)
[47447] 0.11.928.914 I slot print_timing: id  3 | task 0 | n_decoded =    108, tg =  35.85 t/s, tg_3s =  35.85 t/s
[47447] 0.14.931.024 I slot print_timing: id  3 | task 0 | n_decoded =    217, tg =  36.08 t/s, tg_3s =  36.31 t/s
[47447] 0.17.940.243 I slot print_timing: id  3 | task 0 | n_decoded =    325, tg =  36.02 t/s, tg_3s =  35.89 t/s
[47447] 0.20.946.915 I slot print_timing: id  3 | task 0 | n_decoded =    434, tg =  36.08 t/s, tg_3s =  36.25 t/s
[47447] 0.23.948.536 I slot print_timing: id  3 | task 0 | n_decoded =    542, tg =  36.06 t/s, tg_3s =  35.98 t/s
[47447] 0.26.957.662 I slot print_timing: id  3 | task 0 | n_decoded =    652, tg =  36.14 t/s, tg_3s =  36.56 t/s
[47447] 0.29.983.478 I slot print_timing: id  3 | task 0 | n_decoded =    761, tg =  36.12 t/s, tg_3s =  36.02 t/s
[47447] 0.33.010.837 I slot print_timing: id  3 | task 0 | n_decoded =    870, tg =  36.11 t/s, tg_3s =  36.00 t/s
[47447] 0.36.017.438 I slot print_timing: id  3 | task 0 | n_decoded =    979, tg =  36.12 t/s, tg_3s =  36.25 t/s
[47447] 0.39.044.366 I slot print_timing: id  3 | task 0 | n_decoded =   1087, tg =  36.08 t/s, tg_3s =  35.68 t/s
[47447] 0.40.202.434 I slot print_timing: id  3 | task 0 | prompt eval time =     307.57 ms /    19 tokens (   16.19 ms per token,    61.78 tokens per second)
[47447] 0.40.202.437 I slot print_timing: id  3 | task 0 |        eval time =   31285.98 ms /  1129 tokens (   27.71 ms per token,    36.09 tokens per second)
[47447] 0.40.202.438 I slot print_timing: id  3 | task 0 |       total time =   31593.55 ms /  1148 tokens
[47447] 0.40.202.441 I slot print_timing: id  3 | task 0 |    graphs reused =       1124
[47447] 0.40.202.474 I slot      release: id  3 | task 0 | stop processing: n_tokens = 1147, truncated = 0
[47447] 0.40.202.481 I srv  update_slots: all slots are idle
0.53.000.387 I srv  proxy_reques: proxying request to model nondraft_tensor_Qwen3.6-27B-Q4_0.gguf on port 47447
[47447] 0.40.231.788 I srv    operator(): Chat format: peg-native
[47447] 0.40.231.898 I slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = -1
[47447] 0.40.231.900 I srv  get_availabl: updating prompt cache
[47447] 0.40.231.901 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
[47447] 0.40.231.903 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 16384.000 MiB, 256000 tokens, 17179869184 est)
[47447] 0.40.231.905 I srv  get_availabl: prompt cache update took 0.01 ms
[47447] 0.40.231.944 I slot launch_slot_: id  2 | task 1131 | processing task, is_child = 0
[47447] 0.40.231.947 I slot process_sing: id  0 | task -1 | saving idle slot to prompt cache
[47447] 0.40.231.947 I slot prompt_clear: id  0 | task -1 | clearing prompt with 0 tokens
[47447] 0.40.232.096 I slot process_sing: id  1 | task -1 | saving idle slot to prompt cache
[47447] 0.40.232.098 I slot prompt_clear: id  1 | task -1 | clearing prompt with 0 tokens
[47447] 0.40.232.250 I slot process_sing: id  3 | task -1 | saving idle slot to prompt cache
[47447] 0.40.232.564 W srv   prompt_save:  - saving prompt with length 1147, total state size = 187.732 MiB (draft: 0.000 MiB)
[47447] 0.40.465.224 I srv        update:  - cache state: 1 prompts, 337.358 MiB (limits: 16384.000 MiB, 256000 tokens, 256000 est)
[47447] 0.40.465.227 I srv        update:    - prompt 0x614a78ac7710:    1147 tokens, checkpoints:  1,   337.358 MiB
[47447] 0.40.465.227 I slot prompt_clear: id  3 | task -1 | clearing prompt with 1147 tokens
[47447] 0.41.257.685 I slot create_check: id  2 | task 1131 | created context checkpoint 1 of 32 (pos_min = 423, pos_max = 423, n_tokens = 424, size = 149.626 MiB)
0.56.797.744 I srv  proxy_reques: proxying request to model nondraft_tensor_Qwen3.6-27B-Q4_0.gguf on port 47447
[47447] 0.44.039.953 I srv    operator(): Chat format: peg-native
[47447] 0.44.387.907 I slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
[47447] 0.44.387.909 I srv  get_availabl: updating prompt cache
[47447] 0.44.387.911 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
[47447] 0.44.387.914 I srv        update:  - cache state: 1 prompts, 337.358 MiB (limits: 16384.000 MiB, 256000 tokens, 256000 est)
[47447] 0.44.387.914 I srv        update:    - prompt 0x614a78ac7710:    1147 tokens, checkpoints:  1,   337.358 MiB
[47447] 0.44.387.915 I srv  get_availabl: prompt cache update took 0.01 ms
[47447] 0.44.387.955 I slot launch_slot_: id  1 | task 1139 | processing task, is_child = 0
[47447] 0.44.387.957 I slot process_sing: id  0 | task -1 | saving idle slot to prompt cache
[47447] 0.44.387.957 I slot prompt_clear: id  0 | task -1 | clearing prompt with 0 tokens
[47447] 0.44.388.080 I slot process_sing: id  3 | task -1 | saving idle slot to prompt cache
[47447] 0.44.388.081 I slot prompt_clear: id  3 | task -1 | clearing prompt with 0 tokens
[47447] 0.44.388.213 I srv  stream_sessi: stream_session_attach_pipe: conv_id= (empty=1)
[47447] 0.45.918.819 I slot create_check: id  1 | task 1139 | created context checkpoint 1 of 32 (pos_min = 132, pos_max = 132, n_tokens = 133, size = 149.626 MiB)
[47447] 0.48.003.905 I slot create_check: id  1 | task 1139 | created context checkpoint 2 of 32 (pos_min = 1144, pos_max = 1144, n_tokens = 1145, size = 149.626 MiB)
[47447] 0.48.888.392 I slot print_timing: id  1 | task 1139 | prompt processing, n_tokens =   1157, progress = 1.00, t =   4.50 s / 257.10 tokens per second
[47447] 0.51.100.466 I slot print_timing: id  2 | task 1131 | prompt eval time =    2231.06 ms /  1452 tokens (    1.54 ms per token,   650.81 tokens per second)
[47447] 0.51.100.469 I slot print_timing: id  2 | task 1131 |        eval time =    8403.82 ms /    12 tokens (  700.32 ms per token,     1.43 tokens per second)
[47447] 0.51.100.470 I slot print_timing: id  2 | task 1131 |       total time =   10634.87 ms /  1464 tokens
[47447] 0.51.100.470 I slot print_timing: id  2 | task 1131 |    graphs reused =       1129
[47447] 0.51.100.552 I slot      release: id  2 | task 1131 | stop processing: n_tokens = 1463, truncated = 0
[47447] 0.51.100.639 I srv  stream_sessi: stream_session_attach_pipe: conv_id= (empty=1)
[47447] 1.30.922.239 I slot print_timing: id  1 | task 1139 | n_decoded =    100, tg =   2.43 t/s, tg_3s =   2.43 t/s
[47447] 1.34.268.373 I slot print_timing: id  1 | task 1139 | n_decoded =    108, tg =   2.43 t/s, tg_3s =   2.39 t/s
[47447] 1.37.595.106 I slot print_timing: id  1 | task 1139 | n_decoded =    116, tg =   2.43 t/s, tg_3s =   2.40 t/s
[47447] 1.40.910.950 I slot print_timing: id  1 | task 1139 | n_decoded =    124, tg =   2.43 t/s, tg_3s =   2.41 t/s
[47447] 1.44.222.279 I slot print_timing: id  1 | task 1139 | n_decoded =    132, tg =   2.43 t/s, tg_3s =   2.42 t/s
1.59.519.602 E srv    operator(): http client error: Connection handling canceled
[47447] 1.47.152.943 W srv          stop: cancel task, id_task = 1139
[47447] 1.47.568.409 I slot print_timing: id  1 | task 1139 | n_decoded =    140, tg =   2.42 t/s, tg_3s =   2.39 t/s
[47447] 1.47.568.417 I slot      release: id  1 | task 1139 | stop processing: n_tokens = 1300, truncated = 0
[47447] 1.47.568.422 I srv  update_slots: all slots are idle

@cattivik66

GPU: 2× W7800 48 GB, separate PCIe 4.0 x16 root complexes (no P2P, no XGMI)
OS: CachyOS, kernel 7.0.12, Mesa 26.1.2, RADV driver
Vulkan: 1.4.348, KHR_coopmat detected and active
Model: Qwen3.5-122B-A10B, UD-Q4_K_XL, 73 GiB
llama.cpp: 9d5d882d8 + this PR

Synthetic benchmark (pp4096 / tg128, ~4K context)

Without this PR:

| Mode | pp4096 | tg128 |
| --- | ---: | ---: |
| row | 1229 | 48.4 |
| layer | 1195 | 48.4 |
| tensor | 1164 | 29.3 |

With this PR applied:

| Mode | pp4096 | tg128 |
| --- | ---: | ---: |
| row | 1230 | 48.5 |
| layer | 1193 | 48.4 |
| tensor | 1450 (+25%) | 43.6 (+49%) |

At short context the PR delivers -sm tensor pp +25 %, tg +49 %. It becomes the fastest pp mode (1450 vs row's 1230) and tg is only 10 % behind row.

Real-world big prompt benchmark (88K prompt, 164K context)

I asked the LLM to summarize a book whose full text was sent in the prompt.

Without this PR:

| Mode | pp (t/s) | tg (t/s) | wall (s) |
| --- | ---: | ---: | ---: |
| row | 669 | 37 | 172 |
| layer | 689 | 37 | 162 |
| tensor | segfault | - | - |

With this PR applied:

| Mode | pp (t/s) | tg (t/s) | wall (s) |
| --- | ---: | ---: | ---: |
| row | 691 | 37 | 164 |
| layer | 693 | 37 | 162 |
| tensor | 677 | 9 | 255 |

The tensor-mode segfault (fixed by this PR)

Without this PR, -sm tensor segfaults at any realistic context (-c 167936 and -c 204800 both crash). The log shows:

llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort

followed by SIGSEGV during model load. The single-buffer synthetic test at -p 4096 worked because the KV cache was small enough. The PR's multi-buffer KV-cache fix is required for -sm tensor to function at usable c>

The tensor decode regression at long context

With the PR applied, tensor-mode tg drops from 43.6 (short ctx) to 9 (long ctx) — 4.8× worse. Row/layer stay at 37. This appears to be the -sm tensor AllReduce decode path with high fixed per-call overhead at batch=1, mult>

For reference, -sm layer (1 sync point at the pipeline boundary) and -sm row (concatenation-only syncs) don't have this problem.

@marksverdhei

This comment was marked as off-topic.

@maxious

maxious commented Jun 27, 2026


2x Intel Arc Pro B60 (Battlemage, BMG G21)

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) Pro B60 Graphics (BMG G21) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Intel(R) Arc(tm) Pro B60 Graphics (BMG G21) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

llama-bench -fa 1 -ngl -1 -p 512 -n 128 -r 3, Vulkan only. Build at PR head a448deb85. Both B60s share the same Mesa ANV driver (Mesa 26.1.3, driverUUID match), so the native cross-device timeline-semaphore import path is used (not the CPU-proxy fallback).

| model | size | params | backend | ngl | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | --: | ---: | ---: |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | -1 | layer | 1 | pp512 | 6100.11 ± 18.65 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | -1 | layer | 1 | tg128 | 187.42 ± 8.38 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | -1 | tensor | 1 | pp512 | 8626.08 ± 108.51 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | -1 | tensor | 1 | tg128 | 115.82 ± 7.53 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | -1 | layer | 1 | pp512 | 1375.81 ± 124.48 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | -1 | layer | 1 | tg128 | 50.19 ± 11.48 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | -1 | tensor | 1 | pp512 | 2158.55 ± 2.04 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | -1 | tensor | 1 | tg128 | 41.42 ± 10.05 |
| qwen35 9B Q8_0 | 9.10 GiB | 9.20 B | Vulkan | -1 | layer | 1 | pp512 | 1090.84 ± 1.86 |
| qwen35 9B Q8_0 | 9.10 GiB | 9.20 B | Vulkan | -1 | layer | 1 | tg128 | 34.85 ± 2.10 |
| qwen35 9B Q8_0 | 9.10 GiB | 9.20 B | Vulkan | -1 | tensor | 1 | pp512 | 1609.03 ± 30.27 |
| qwen35 9B Q8_0 | 9.10 GiB | 9.20 B | Vulkan | -1 | tensor | 1 | tg128 | 38.28 ± 1.08 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | -1 | layer | 1 | pp512 | 1305.84 ± 37.30 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | -1 | layer | 1 | tg128 | 43.37 ± 11.86 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | -1 | tensor | 1 | pp512 | 2013.57 ± 48.94 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | -1 | tensor | 1 | tg128 | 53.73 ± 5.34 |
| qwen35moe 35B.A3B Q3_K - Medium | 12.79 GiB | 35.51 B | Vulkan | -1 | layer | 1 | pp512 | 771.10 ± 12.43 |
| qwen35moe 35B.A3B Q3_K - Medium | 12.79 GiB | 35.51 B | Vulkan | -1 | layer | 1 | tg128 | 44.54 ± 7.42 |
| qwen35moe 35B.A3B Q3_K - Medium | 12.79 GiB | 35.51 B | Vulkan | -1 | tensor | 1 | pp512 | 1111.37 ± 10.68 |
| qwen35moe 35B.A3B Q3_K - Medium | 12.79 GiB | 35.51 B | Vulkan | -1 | tensor | 1 | tg128 | 22.77 ± 7.64 |
| qwen35 27B Q8_0 | 33.31 GiB | 27.32 B | Vulkan | -1 | layer | 1 | pp512 | 321.05 ± 11.59 |
| qwen35 27B Q8_0 | 33.31 GiB | 27.32 B | Vulkan | -1 | layer | 1 | tg128 | 9.32 ± 0.28 |
| qwen35 27B Q8_0 | 33.31 GiB | 27.32 B | Vulkan | -1 | tensor | 1 | pp512 | 517.89 ± 0.90 |
| qwen35 27B Q8_0 | 33.31 GiB | 27.32 B | Vulkan | -1 | tensor | 1 | tg128 | 13.89 ± 0.03 |

Observations

  • Prefill (pp512): -sm tensor is ~1.4-1.6x faster than -sm layer across every model, including the dense 27B Q8_0 that exceeds a single B60's 24 GiB. The pipelined chunked AllReduce does its job on prefill activations, consistent with the NVIDIA results in the PR description.
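  • Decode (tg128): Roughly matches the PR author's characterisation. Among the smaller models, tensor mode ranges from slightly faster (9B: 34.9 -> 38.3 t/s) to clearly slower (8B: 50.2 -> 41.4 t/s, 1B: 187 -> 116 t/s, 35B-A3B MoE at Q3_K: 44.5 -> 22.8 t/s), presumably because per-call overhead on small decode tensors dominates the single-shot path. For gpt-oss-20B and the dense 27B Q8_0, tensor mode wins on TG too: gpt-oss 43 -> 54 t/s, and 27B Q8_0 9.3 -> 13.9 t/s - the "balanced GPU pair, TG boost" regime the PR description calls out. Of these two, only the 27B Q8_0 (33.31 GiB) actually exceeds a single 24 GiB card; gpt-oss-20B is 11.27 GiB. Several tg128 rows also have wide error bars (up to about ±12 t/s), so the smaller differences may be within noise.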
  • The 27B Q8_0 result is in the same ballpark as the mixed AMD pair reported above (10.68 tg with tensor on 27B Q5_K), so Intel/Vulkan lands competitively in that workload class.
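
To make the ratios behind these observations easy to recheck, here is a small sketch that recomputes tensor/layer throughput from the llama-bench rows above. The numbers are copied from the table; the dictionary layout and the `tensor_over_layer` helper are illustrative names only, not part of llama-bench or this PR. The pp512 dictionary covers the 8B and larger rows.

```python
# (layer, tensor) mean throughput in t/s, copied from the llama-bench rows above.
PP512 = {
    "llama 8B Q4_0": (1375.81, 2158.55),
    "qwen35 9B Q8_0": (1090.84, 1609.03),
    "gpt-oss 20B Q8_0": (1305.84, 2013.57),
    "qwen35moe 35B.A3B Q3_K": (771.10, 1111.37),
    "qwen35 27B Q8_0": (321.05, 517.89),
}
TG128 = {
    "llama 1B Q8_0": (187.42, 115.82),
    "llama 8B Q4_0": (50.19, 41.42),
    "qwen35 9B Q8_0": (34.85, 38.28),
    "gpt-oss 20B Q8_0": (43.37, 53.73),
    "qwen35moe 35B.A3B Q3_K": (44.54, 22.77),
    "qwen35 27B Q8_0": (9.32, 13.89),
}


def tensor_over_layer(table):
    """Return the tensor/layer throughput ratio per model (>1 means tensor wins)."""
    return {model: tensor / layer for model, (layer, tensor) in table.items()}


if __name__ == "__main__":
    for test, table in (("pp512", PP512), ("tg128", TG128)):
        for model, ratio in tensor_over_layer(table).items():
            print(f"{test:6s} {model:24s} {ratio:.2f}x")
```

The ratios use the reported means only and ignore the ± columns, so they should be read together with the error bars in the table.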

@pwilkin

pwilkin commented Jun 27, 2026

Member Author

@marksverdhei it's a prototype, don't worry :) when it's been decently tested and cleaned up I'll deslopify it :)

0cc4m changed the title from "vulkan: make TP viable" to "vulkan: add allreduce function and fix Tensor Parallel crash" Jun 27, 2026
pwilkin changed the title from "vulkan: add allreduce function and fix Tensor Parallel crash" to "vulkan: add allreduce function with cross-device CPU proxy and fix Tensor Parallel crash" Jun 27, 2026
pwilkin changed the title from "vulkan: add allreduce function with cross-device CPU proxy and fix Tensor Parallel crash" to "vulkan: add allreduce function with cross-device CPU proxy and fix Tensor Parallel crash [EXPERIMENTAL]" Jun 27, 2026
@pwilkin

pwilkin commented Jun 27, 2026

Member Author

So the damn clanker decided to only implement the proper allreduce for 2 GPUs, because why bother ;) Sorry to everyone with 4+ GPUs who posted their results, could you please retest with the new commit? (I've now tested up to 8 GPUs for correctness.)
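
For anyone retesting with more than 2 devices, a host-side reference can be handy for sanity-checking what a sum all-reduce should produce on every device. The sketch below simulates a generic ring all-reduce (reduce-scatter followed by all-gather) in plain Python. It is an assumption-laden illustration, not the algorithm used in this PR, and `ring_allreduce` is a hypothetical name.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over equal-length per-device buffers.

    Returns one list per device; after the call every list holds the
    element-wise sum of all inputs. Devices are arranged in a ring and
    rank r always sends to rank (r + 1) % n.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "buffers must match in length"
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched

    # Chunk c covers indices bounds[c]:bounds[c + 1]; chunk sizes may be uneven.
    bounds = [(c * length) // n for c in range(n + 1)]

    def chunk(rank, c):
        return data[rank][bounds[c]:bounds[c + 1]]  # slicing copies the values

    # Phase 1, reduce-scatter: after n - 1 steps rank r holds the complete
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all payloads first so the sends in a step act simultaneously.
        sends = [((r - step) % n, chunk(r, (r - step) % n)) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for k, value in enumerate(payload):
                data[dst][bounds[c] + k] += value

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, chunk(r, (r + 1 - step) % n)) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            data[dst][bounds[c]:bounds[c + 1]] = payload

    return data
```

Comparing each device's output against a plain element-wise sum for 2 through 8 simulated devices, including buffer lengths that do not divide evenly, is one way to exercise the odd device counts discussed in this thread.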

@AbdullahMPrograms

AbdullahMPrograms commented Jun 27, 2026

Sharing some benchmark results with 2/4/5x Radeon PRO W7900s:
##Vulkan Stock:
./LLM/llama.cpp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/bartowski/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_L.gguf -ngl -1 -fa 1 -sm layer

model size params backend ngl fa test t/s
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B Vulkan -1 1 pp512 786.58 ± 2.38
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B Vulkan -1 1 tg128 33.94 ± 0.03

##Vulkan PR TP (2 GPU):
./LLM/llama.cpp-vulkantp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/bartowski/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_L.gguf -ngl -1 -fa 1 -sm tensor -dev Vulkan0/Vulkan1

model size params backend ngl sm fa dev test t/s
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B Vulkan -1 tensor 1 Vulkan0/Vulkan1 pp512 836.11 ± 92.66
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B Vulkan -1 tensor 1 Vulkan0/Vulkan1 tg128 29.77 ± 0.41

##Vulkan PR TP (4 GPU):
./LLM/llama.cpp-vulkantp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/bartowski/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_L.gguf -ngl -1 -fa 1 -sm tensor -dev Vulkan0/Vulkan1/Vulkan2/Vulkan3

model size params backend ngl sm fa dev test t/s
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B Vulkan -1 tensor 1 Vulkan0/Vulkan1/Vulkan2/Vulkan3 pp512 413.89 ± 56.70
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B Vulkan -1 tensor 1 Vulkan0/Vulkan1/Vulkan2/Vulkan3 tg128 27.29 ± 0.62

##Vulkan PR TP (5 GPU):
./LLM/llama.cpp-vulkantp/vulkan/bin/llama-bench -m /home/ultimis/LLM/Models/bartowski/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_L.gguf -ngl -1 -fa 1 -sm tensor -dev Vulkan0/Vulkan1/Vulkan2/Vulkan3/Vulkan4

model size params backend ngl sm fa dev test t/s
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B Vulkan -1 tensor 1 Vulkan0/Vulkan1/Vulkan2/Vulkan3/Vulkan4 pp512 341.89 ± 55.39
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B Vulkan -1 tensor 1 Vulkan0/Vulkan1/Vulkan2/Vulkan3/Vulkan4 tg128 23.23 ± 0.62

#For comparison with ROCm:
##ROCm Stock:
./LLM/llama.cpp/rocm/bin/llama-bench -m /home/ultimis/LLM/Models/bartowski/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_L.gguf -ngl -1 -fa 1 -sm layer

model size params backend ngl fa test t/s
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B ROCm -1 1 pp512 843.56 ± 3.07
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B ROCm -1 1 tg128 28.52 ± 0.04

build: 050ee92 (9821)
##ROCm TP RCCL (2 GPU):
./LLM/llama.cpp/rocm-rccl/bin/llama-bench -m /home/ultimis/LLM/Models/bartowski/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_L.gguf -ngl -1 -fa 1 -sm tensor -dev ROCm0/ROCm1

model size params backend ngl sm fa dev test t/s
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B ROCm -1 tensor 1 ROCm0/ROCm1 pp512 1429.22 ± 10.58
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B ROCm -1 tensor 1 ROCm0/ROCm1 tg128 37.50 ± 0.47

##ROCm TP RCCL (4 GPU):
./LLM/llama.cpp/rocm-rccl/bin/llama-bench -m /home/ultimis/LLM/Models/bartowski/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_L.gguf -ngl -1 -fa 1 -sm tensor -dev ROCm0/ROCm1/ROCm2/ROCm3

model size params backend ngl sm fa dev test t/s
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B ROCm -1 tensor 1 ROCm0/ROCm1/ROCm2/ROCm3 pp512 1992.92 ± 6.26
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B ROCm -1 tensor 1 ROCm0/ROCm1/ROCm2/ROCm3 tg128 45.54 ± 1.95

##ROCm TP RCCL (5 GPU):
./LLM/llama.cpp/rocm-rccl/bin/llama-bench -m /home/ultimis/LLM/Models/bartowski/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_L.gguf -ngl -1 -fa 1 -sm tensor -dev ROCm0/ROCm1/ROCm2/ROCm3/ROCm4

model size params backend ngl sm fa dev test t/s
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B ROCm -1 tensor 1 ROCm0/ROCm1/ROCm2/ROCm3/ROCm4 pp512 2033.37 ± 13.56
qwen35 27B Q4_K - Medium 17.62 GiB 27.32 B ROCm -1 tensor 1 ROCm0/ROCm1/ROCm2/ROCm3/ROCm4 tg128 48.57 ± 2.89

@digitalscream

Could someone else check multi-turn conversations? As noted in my comment above, I'm finding that the first prompt works as expected, but any follow up tanks down to 2t/s.

@pwilkin

pwilkin commented Jun 27, 2026

Member Author

@AbdullahMPrograms is this after the multiGPU fix already?

@netrunnereve I risked a proper fix to the real buffer mapping issue.

@AbdullahMPrograms

@AbdullahMPrograms is this after the multiGPU fix already?

@netrunnereve I risked a proper fix to the real buffer mapping issue.

@pwilkin This is with commit: 362fdd2

git log --oneline -n 5
362fdd2b7 (HEAD -> pr-25051) vulkan: generalize -sm tensor AllReduce to >2 devices
a448deb85 fix Windows build
68c289bd7 vulkan: portable CPU-proxy fallback for cross-driver -sm tensor AllReduce
5dc559cdd vulkan: GPU-pipelined multi-GPU AllReduce for -sm tensor
325cceab0 ggml-backend-meta: fix SPLIT_MODE_TENSOR segfault on multi_buffer KV-cache views

@AbdullahMPrograms

Could someone else check multi-turn conversations? As noted in my comment above, I'm finding that the first prompt works as expected, but any follow up tanks down to 2t/s.

I can reproduce this behavior as well, first prompt:

0.58.303.871 I slot print_timing: id  3 | task 0 | prompt eval time =     606.59 ms /    17 tokens (   35.68 ms per token,    28.03 tokens per second)
0.58.303.874 I slot print_timing: id  3 | task 0 |        eval time =   34812.30 ms /  1061 tokens (   32.81 ms per token,    30.48 tokens per second)
0.58.303.896 I slot print_timing: id  3 | task 0 |       total time =   35418.89 ms /  1078 tokens

follow up prompt:

2.09.854.324 I slot print_timing: id  2 | task 1063 | prompt eval time =    4341.21 ms /  1090 tokens (    3.98 ms per token,   251.08 tokens per second)
2.09.854.328 I slot print_timing: id  2 | task 1063 |        eval time =   63484.68 ms /   214 tokens (  296.66 ms per token,     3.37 tokens per second)
2.09.854.329 I slot print_timing: id  2 | task 1063 |       total time =   67825.90 ms /  1304 tokens
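
To compare turns without reading the numbers by eye, the `print_timing` lines can be parsed with a short script. This is a minimal sketch that assumes the log format is exactly as pasted above; the regular expression and the `parse_timings` helper are illustrative and not part of llama-server.

```python
import re

# Matches the "prompt eval time" and "eval time" lines; "total time" is skipped
# because it carries no per-token rate.
TIMING = re.compile(
    r"task\s+(?P<task>\d+)\s*\|\s*(?P<phase>prompt eval|eval) time\s*=\s*"
    r"(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens\s*"
    r"\(\s*[\d.]+ ms per token,\s*(?P<tps>[\d.]+) tokens per second\)"
)

LOG = """\
0.58.303.871 I slot print_timing: id  3 | task 0 | prompt eval time =     606.59 ms /    17 tokens (   35.68 ms per token,    28.03 tokens per second)
0.58.303.874 I slot print_timing: id  3 | task 0 |        eval time =   34812.30 ms /  1061 tokens (   32.81 ms per token,    30.48 tokens per second)
0.58.303.896 I slot print_timing: id  3 | task 0 |       total time =   35418.89 ms /  1078 tokens
2.09.854.324 I slot print_timing: id  2 | task 1063 | prompt eval time =    4341.21 ms /  1090 tokens (    3.98 ms per token,   251.08 tokens per second)
2.09.854.328 I slot print_timing: id  2 | task 1063 |        eval time =   63484.68 ms /   214 tokens (  296.66 ms per token,     3.37 tokens per second)
2.09.854.329 I slot print_timing: id  2 | task 1063 |       total time =   67825.90 ms /  1304 tokens
"""


def parse_timings(log_text):
    """Extract one record per prompt-eval/eval line found in the log text."""
    return [
        {
            "task": int(m["task"]),
            "phase": m["phase"],
            "tokens": int(m["tokens"]),
            "tps": float(m["tps"]),
        }
        for m in TIMING.finditer(log_text)
    ]


if __name__ == "__main__":
    decode = [row for row in parse_timings(LOG) if row["phase"] == "eval"]
    first, follow_up = decode[0]["tps"], decode[1]["tps"]
    print(f"decode: {first} -> {follow_up} t/s ({first / follow_up:.1f}x slower)")
```

On the two turns above, decode falls from 30.48 to 3.37 t/s while prompt processing speeds up, which suggests the regression is specific to token generation on the follow-up turn.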

@digitalscream

digitalscream commented Jun 27, 2026

Could someone else check multi-turn conversations? As noted in my comment above, I'm finding that the first prompt works as expected, but any follow up tanks down to 2t/s.

I can reproduce this behavior as well, first prompt:

0.58.303.871 I slot print_timing: id  3 | task 0 | prompt eval time =     606.59 ms /    17 tokens (   35.68 ms per token,    28.03 tokens per second)
0.58.303.874 I slot print_timing: id  3 | task 0 |        eval time =   34812.30 ms /  1061 tokens (   32.81 ms per token,    30.48 tokens per second)
0.58.303.896 I slot print_timing: id  3 | task 0 |       total time =   35418.89 ms /  1078 tokens

follow up prompt:

2.09.854.324 I slot print_timing: id  2 | task 1063 | prompt eval time =    4341.21 ms /  1090 tokens (    3.98 ms per token,   251.08 tokens per second)
2.09.854.328 I slot print_timing: id  2 | task 1063 |        eval time =   63484.68 ms /   214 tokens (  296.66 ms per token,     3.37 tokens per second)
2.09.854.329 I slot print_timing: id  2 | task 1063 |       total time =   67825.90 ms /  1304 tokens

Thank you! At least I know I'm not holding it wrong ;)

(EDIT: ...or we both are)

@AbdullahMPrograms

I've also encountered some gibberish in llama-server with TP on 5 GPUs:
[image]
With 2-4 GPUs there is no gibberish. Launch command:

./LLM/llama.cpp-vulkantp/vulkan/bin/llama-server -m /home/ultimis/LLM/Models/bartowski/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_L.gguf -c 32768 -ngl 999 -fa on -fit off -sm tensor --host 0.0.0.0 --port 7001 -dev Vulkan0,Vulkan1,Vulkan2,Vulkan3,Vulkan4

above on commit e578ca2

@wizardeur

It's faster now on TP=4 and TP=8 (RX7900XTX). Shorter table this time:

| Backend | GPUs | sm | pp512 Run 1 | pp512 Run 2 | tg128 Run 1 | tg128 Run 2 |
|---|---|---|---|---|---|---|
| Vulkan | 2 | layer | 809.00 | 794.62 | 34.50 | 33.40 |
| Vulkan | 2 | tensor | 1283.40 | 1286.12 | 41.57 | 41.68 |
| Vulkan | 4 | layer | 762.29 | 757.42 | 29.66 | 29.58 |
| Vulkan | 4 | tensor | 305.86 | 981.02 | 14.68 | 37.50 |
| Vulkan | 8 | layer | 697.64 | 701.80 | 11.89 | 12.34 |
| Vulkan | 8 | tensor | 95.61 | 492.38 | 3.87 | 21.10 |
| ROCm | 2 | layer | 913.54 | 915.04 | 26.34 | 26.13 |
| ROCm | 2 | tensor | 1535.47 | 1532.96 | 45.35 | 45.42 |
| ROCm | 4 | layer | 877.60 | 879.33 | 23.58 | 23.63 |
| ROCm | 4 | tensor | 2085.89 | 2070.25 | 55.79 | 55.40 |
| ROCm | 8 | layer | 811.63 | 807.94 | 21.48 | 21.55 |
| ROCm | 8 | tensor | 2163.08 | 2165.96 | 37.54 | 36.99 |

And for a multiturn conversation, I don't see any significant degradation.
1st message:

2.14.960.380 I slot print_timing: id  3 | task 0 | prompt eval time =    4516.41 ms /  4618 tokens (    0.98 ms per token,  1022.49 tokens per second)
2.14.960.383 I slot print_timing: id  3 | task 0 |        eval time =   76061.21 ms /  1390 tokens (   54.72 ms per token,    18.27 tokens per second)

2nd message:

8.48.047.849 I slot print_timing: id  3 | task 1395 | prompt eval time =    2080.61 ms /  1926 tokens (    1.08 ms per token,   925.69 tokens per second)
8.48.047.852 I slot print_timing: id  3 | task 1395 |        eval time =  125606.23 ms /  2336 tokens (   53.77 ms per token,    18.60 tokens per second)

3rd message:

23.52.991.932 I slot print_timing: id  3 | task 3735 | prompt eval time =    3239.29 ms /  3366 tokens (    0.96 ms per token,  1039.12 tokens per second)
23.52.991.936 I slot print_timing: id  3 | task 3735 |        eval time =   57636.44 ms /  1098 tokens (   52.49 ms per token,    19.05 tokens per second)

@pwilkin

pwilkin commented Jun 27, 2026

Member Author

Looking into the multiGPU degradation, as for the slowdown, can one of you possibly capture a perf profile?

@digitalscream

Looking into the multiGPU degradation, as for the slowdown, can one of you possibly capture a perf profile?

Sure - not done that before, though. Totally revealing my lack of knowledge here, but can you point me to documentation for doing that?

@pwilkin

pwilkin commented Jun 28, 2026

Member Author

Looking into the multiGPU degradation, as for the slowdown, can one of you possibly capture a perf profile?

Sure - not done that before, though. Totally revealing my lack of knowledge here, but can you point me to documentation for doing that?

Sure: https://perfwiki.github.io/main/tutorial/

It's actually pretty easy: you run perf record <command>, then upload the perf results somewhere, e.g. to Hugging Face.

Labels

ggml (changes relating to the ggml tensor library for machine learning), Vulkan (issues specific to the Vulkan backend)

9 participants