ggml : fix tensor-parallel + -ncmoe crash on MoE models #25028

Open
liminfei-amd wants to merge 1 commit into ggml-org:master from liminfei-amd:amd-rocm/24886-meta-mirrored-tensor

Conversation

@liminfei-amd

Contributor

Overview

Tensor parallelism (-sm tensor) combined with -ncmoe (CPU-offloaded MoE experts) aborts during warm-up on MoE models with:

ggml/src/ggml-backend-meta.cpp: GGML_ASSERT(ggml_is_contiguous(tensor)) failed

(#24886; also reported on CUDA with Qwen3.5-122B and on ROCm with GLM, so it is not hardware- or model-size-specific.)

The failing tensor is the MoE router output (ffn_moe_topk). It is mirrored (GGML_BACKEND_SPLIT_AXIS_MIRRORED — replicated across backends, because routing must be identical on every device) and it happens to be a non-contiguous view. ggml_backend_meta_buffer_get_tensor / ..._set_tensor assert contiguity before consulting the split state, so a mirrored non-contiguous tensor trips the assert — even though the GGML_BACKEND_SPLIT_AXIS_MIRRORED branch just below already handles it (read from / write to the simple buffers).

The fix moves the split-state lookup above the assert and allows the mirrored case in both get_tensor and set_tensor:

GGML_ASSERT(ggml_is_contiguous(tensor) || split_state.axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED);
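To illustrate why the ordering matters, here is a minimal self-contained sketch. It does not use the real ggml types: `ToyTensor`, `SplitAxis`, and every function below are hypothetical stand-ins, and the view is assumed to select a few leading columns of each row, which is one common way a view ends up non-contiguous. The old check rejects such a view regardless of the split axis, while the relaxed check accepts it only when the tensor is mirrored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for illustration only; these are NOT the ggml types.
enum class SplitAxis { Axis0, Axis1, Mirrored };

struct ToyTensor {
    std::array<int64_t, 2> ne; // elements per dimension: {columns, rows}
    std::array<size_t, 2>  nb; // byte stride per dimension
    size_t elem_size;          // bytes per element
};

// Simplified 2-D contiguity test: elements are densely packed and each row
// starts right after the previous one ends.
bool toy_is_contiguous(const ToyTensor & t) {
    return t.nb[0] == t.elem_size &&
           t.nb[1] == t.nb[0] * static_cast<size_t>(t.ne[0]);
}

// View of the first k columns of every row. The parent's row stride is kept,
// so the view has gaps between rows whenever k < t.ne[0].
ToyTensor toy_view_first_cols(const ToyTensor & t, int64_t k) {
    return ToyTensor{{k, t.ne[1]}, {t.nb[0], t.nb[1]}, t.elem_size};
}

// Old ordering: contiguity is demanded before the split axis is consulted.
bool old_check_passes(const ToyTensor & t, SplitAxis /*axis*/) {
    return toy_is_contiguous(t);
}

// Relaxed check: a mirrored tensor may be non-contiguous.
bool new_check_passes(const ToyTensor & t, SplitAxis axis) {
    return toy_is_contiguous(t) || axis == SplitAxis::Mirrored;
}

// Toy accessor modelling the relaxed assert: the split axis is known first,
// then the condition is checked.
void toy_get_tensor(const ToyTensor & t, SplitAxis axis) {
    assert(new_check_passes(t, axis));
    // A real implementation would now copy data out of the backend buffers.
}
```

With a 4-row, 8-column matrix of 4-byte elements, `toy_view_first_cols(full, 2)` keeps the 32-byte row stride but has only 8 bytes of data per row, so it fails the old check and passes the new one only for `SplitAxis::Mirrored`.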

Diagnosis credit to the reporter @nathanmp, who identified the mirrored-axis condition.

Additional information

Reproduced and verified on 2x gfx1100 (RX 7900 GRE), HIP + RCCL, with Qwen3.5-35B-A3B (a small MoE that exercises the same path):

  • Before: llama-cli -sm tensor --device ROCm0,ROCm1 -ncmoe 24 ... -> abort at warm-up (ggml_is_contiguous assert), exactly as reported.
  • After: loads and generates correctly -- "The capital of France is Paris. Located in the north-central part of...".
  • Regression: -sm tensor without -ncmoe already worked and is unchanged (still generates coherently). The change only relaxes the assert for the mirrored case, which the code path immediately below already handles.

The asynchronous variants (*_tensor_async) are intentionally left untouched: they have no mirrored switch case and were not on the crash path (generation completes without them).

Requirements

  • I have read and followed the contributing guidelines
  • AI usage disclosure: YES -- an AI coding assistant helped reproduce the crash, instrument and identify the failing tensor/axis, and draft the change; I reviewed and verified every line and ran the reproduction and regression on real hardware.

Fixes #24886

Tensor parallelism (-sm tensor) combined with -ncmoe (CPU-offloaded MoE
experts) aborts during warm-up on MoE models with
GGML_ASSERT(ggml_is_contiguous(tensor)) in ggml-backend-meta.cpp.

The failing tensor is the MoE router output (ffn_moe_topk): it is mirrored
(GGML_BACKEND_SPLIT_AXIS_MIRRORED, replicated across backends since routing
must be identical) and happens to be a non-contiguous view.
ggml_backend_meta_buffer_{get,set}_tensor asserted contiguity before
consulting the split state, so a mirrored non-contiguous tensor tripped the
assert even though the GGML_BACKEND_SPLIT_AXIS_MIRRORED case right below
already handles it.

Move the split-state lookup above the assert and allow the mirrored case in
both get_tensor and set_tensor.

Diagnosis credit to the reporter (@nathanmp).

Fixes ggml-org#24886

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 26, 2026
@cattivik66


Tested on: 2x AMD Radeon Pro W7800 48GB (gfx1100), ROCm 7.2.4, HIP backend
Model: Qwen3.5-122B-A10B (UD-Q4_K_XL, MoE, 73 GiB)

Results after applying this PR (d56ab38):

-sm tensor : works, no crash. pp4096 ≈ 947 t/s, tg128 ≈ 27.6 t/s
-sm tensor + -ncmoe 0 : now works, no crash. Fix confirmed effective.

Comparison on same hardware:
-sm layer : pp4096 ≈ 1366 t/s, tg128 ≈ 43.4 t/s (fastest)
-sm tensor: pp4096 ≈ 947 t/s, tg128 ≈ 27.6 t/s (works after PR)

PR successfully fixes the issue. Layer-split is still best for throughput on this GPU, but tensor parallelism now works as a fallback.


Development

Successfully merging this pull request may close these issues.

Eval bug: Tensor parallelism crashes when combined with -ncmoe with Qwen 3.5 397B.
