ggml : fix tensor-parallel + -ncmoe crash on MoE models by liminfei-amd · Pull Request #25028 · ggml-org/llama.cpp

liminfei-amd · 2026-06-26T03:11:08Z

Overview

Tensor parallelism (-sm tensor) combined with -ncmoe (CPU-offloaded MoE experts) aborts during warm-up on MoE models with:

ggml/src/ggml-backend-meta.cpp: GGML_ASSERT(ggml_is_contiguous(tensor)) failed

(#24886; also reported on CUDA with Qwen3.5-122B and on ROCm with GLM, so it is not hardware- or model-size-specific.)

The failing tensor is the MoE router output (ffn_moe_topk). It is mirrored (GGML_BACKEND_SPLIT_AXIS_MIRRORED — replicated across backends, because routing must be identical on every device) and it happens to be a non-contiguous view. ggml_backend_meta_buffer_get_tensor / ..._set_tensor assert contiguity before consulting the split state, so a mirrored non-contiguous tensor trips the assert — even though the GGML_BACKEND_SPLIT_AXIS_MIRRORED branch just below already handles it (read from / write to the simple buffers).

The fix moves the split-state lookup above the assert and allows the mirrored case in both get_tensor and set_tensor:

GGML_ASSERT(ggml_is_contiguous(tensor) || split_state.axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED);
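
The effect of the relaxed condition can be sketched with a small stand-alone model. This is only an illustration under stated assumptions: `mock_tensor`, `split_axis`, `is_contiguous`, `old_check` and `new_check` are hypothetical stand-ins invented here, not the real ggml types or functions, and the checks return a bool instead of aborting the way GGML_ASSERT does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for illustration only; these are NOT the ggml definitions.
enum split_axis { SPLIT_AXIS_0 = 0, SPLIT_AXIS_1 = 1, SPLIT_AXIS_MIRRORED = 10 };

struct mock_tensor {
    int64_t    ne[2]; // number of elements per dimension
    size_t     nb[2]; // stride in bytes per dimension
    split_axis axis;  // how the tensor is distributed across backends
};

// Dense 2D float layout: element stride is sizeof(float), row stride is one full row.
bool is_contiguous(const mock_tensor & t) {
    return t.nb[0] == sizeof(float) && t.nb[1] == t.nb[0] * (size_t) t.ne[0];
}

// Old order: contiguity is checked before the split state is consulted.
bool old_check(const mock_tensor & t) {
    return is_contiguous(t);
}

// New order: the split state is looked up first (modelled here as a field read),
// and a mirrored tensor is allowed through even when it is a non-contiguous view.
bool new_check(const mock_tensor & t) {
    const split_axis axis = t.axis;
    return is_contiguous(t) || axis == SPLIT_AXIS_MIRRORED;
}
```

In this model, a mirrored view whose row stride is larger than one dense row fails `old_check` and passes `new_check`, while a non-contiguous tensor split along a regular axis is still rejected by both. That mirrors the intent described above: only the mirrored case is relaxed.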

Diagnosis credit to the reporter @nathanmp, who identified the mirrored-axis condition.

Additional information

Reproduced and verified on 2x gfx1100 (RX 7900 GRE), HIP + RCCL, with Qwen3.5-35B-A3B (a small MoE that exercises the same path):

Before: llama-cli -sm tensor --device ROCm0,ROCm1 -ncmoe 24 ... -> abort at warm-up (ggml_is_contiguous assert), exactly as reported.
After: loads and generates correctly -- "The capital of France is Paris. Located in the north-central part of...".
Regression: -sm tensor without -ncmoe already worked and is unchanged (still generates coherently). The change only relaxes the assert for the mirrored case, which the code path immediately below already handles.

The asynchronous variants (*_tensor_async) are intentionally left untouched: they have no mirrored switch case and were not on the crash path (generation completes without them).

Requirements

I have read and followed the contributing guidelines
AI usage disclosure: YES -- an AI coding assistant helped reproduce the crash, instrument and identify the failing tensor/axis, and draft the change; I reviewed and verified every line and ran the reproduction and regression on real hardware.

Fixes #24886

@nathanmp

Tensor parallelism (-sm tensor) combined with -ncmoe (CPU-offloaded MoE experts) aborts during warm-up on MoE models with GGML_ASSERT(ggml_is_contiguous(tensor)) in ggml-backend-meta.cpp.

The failing tensor is the MoE router output (ffn_moe_topk): it is mirrored (GGML_BACKEND_SPLIT_AXIS_MIRRORED, replicated across backends since routing must be identical) and happens to be a non-contiguous view. ggml_backend_meta_buffer_{get,set}_tensor asserted contiguity before consulting the split state, so a mirrored non-contiguous tensor tripped the assert even though the GGML_BACKEND_SPLIT_AXIS_MIRRORED case right below already handles it.

Move the split-state lookup above the assert and allow the mirrored case in both get_tensor and set_tensor.

Diagnosis credit to the reporter (@nathanmp).

Fixes ggml-org#24886

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>

cattivik66 · 2026-06-26T15:27:12Z

Tested on: 2x AMD Radeon Pro W7800 48GB (gfx1100), ROCm 7.2.4, HIP backend
Model: Qwen3.5-122B-A10B (UD-Q4_K_XL, MoE, 73 GiB)

Results after applying this PR (d56ab38):

-sm tensor : works, no crash. pp4096 ≈ 947 t/s, tg128 ≈ 27.6 t/s
-sm tensor + -ncmoe 0 : now works, no crash. Fix confirmed effective.

Comparison on same hardware:
-sm layer : pp4096 ≈ 1366 t/s, tg128 ≈ 43.4 t/s (fastest)
-sm tensor: pp4096 ≈ 947 t/s, tg128 ≈ 27.6 t/s (works after PR)

PR successfully fixes the issue. Layer-split is still best for throughput on this GPU, but tensor parallelism now works as a fallback.

liminfei-amd requested a review from JohannesGaessler as a code owner June 26, 2026 03:11

liminfei-amd mentioned this pull request Jun 26, 2026

Eval bug: Tensor parallelism crashes when combined with -ncmoe with Qwen 3.5 397B. #24886

Open

github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) Jun 26, 2026

liminfei-amd wants to merge 1 commit into ggml-org:master from liminfei-amd:amd-rocm/24886-meta-mirrored-tensor
