ggml : fix tensor-parallel + -ncmoe crash on MoE models#25028
Open
liminfei-amd wants to merge 1 commit into
Open
ggml : fix tensor-parallel + -ncmoe crash on MoE models#25028liminfei-amd wants to merge 1 commit into
liminfei-amd wants to merge 1 commit into
Conversation
Tensor parallelism (-sm tensor) combined with -ncmoe (CPU-offloaded MoE
experts) aborts during warm-up on MoE models with
GGML_ASSERT(ggml_is_contiguous(tensor)) in ggml-backend-meta.cpp.
The failing tensor is the MoE router output (ffn_moe_topk): it is mirrored
(GGML_BACKEND_SPLIT_AXIS_MIRRORED, replicated across backends since routing
must be identical) and happens to be a non-contiguous view.
ggml_backend_meta_buffer_{get,set}_tensor asserted contiguity before
consulting the split state, so a mirrored non-contiguous tensor tripped the
assert even though the GGML_BACKEND_SPLIT_AXIS_MIRRORED case right below
already handles it.
Move the split-state lookup above the assert and allow the mirrored case in
both get_tensor and set_tensor.
Diagnosis credit to the reporter (@nathanmp).
Fixes ggml-org#24886
Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
|
Tested on: 2x AMD Radeon Pro W7800 48GB (gfx1100), ROCm 7.2.4, HIP backend Results after applying this PR (d56ab38): -sn tensor : works, no crash. pp4096 ≈ 947 t/s, tg128 ≈ 27.6 t/s Comparison on same hardware: PR successfully fixes the issue. Layer-split is still best for throughput on this GPU, but tensor parallelism now works as a fallback. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Tensor parallelism (
-sm tensor) combined with-ncmoe(CPU-offloaded MoE experts) aborts during warm-up on MoE models with:(#24886; also reported on CUDA with Qwen3.5-122B and on ROCm with GLM, so it is not hardware- or model-size-specific.)
The failing tensor is the MoE router output (
ffn_moe_topk). It is mirrored (GGML_BACKEND_SPLIT_AXIS_MIRRORED— replicated across backends, because routing must be identical on every device) and it happens to be a non-contiguous view.ggml_backend_meta_buffer_get_tensor/..._set_tensorassert contiguity before consulting the split state, so a mirrored non-contiguous tensor trips the assert — even though theGGML_BACKEND_SPLIT_AXIS_MIRROREDbranch just below already handles it (read from / write to the simple buffers).The fix moves the split-state lookup above the assert and allows the mirrored case in both
get_tensorandset_tensor:Diagnosis credit to the reporter @nathanmp, who identified the mirrored-axis condition.
Additional information
Reproduced and verified on 2x gfx1100 (RX 7900 GRE), HIP + RCCL, with Qwen3.5-35B-A3B (a small MoE that exercises the same path):
llama-cli -sm tensor --device ROCm0,ROCm1 -ncmoe 24 ...-> abort at warm-up (ggml_is_contiguousassert), exactly as reported.-sm tensorwithout-ncmoealready worked and is unchanged (still generates coherently). The change only relaxes the assert for the mirrored case, which the code path immediately below already handles.The asynchronous variants (
*_tensor_async) are intentionally left untouched: they have no mirrored switch case and were not on the crash path (generation completes without them).Requirements
Fixes #24886