ggml-cpu: replace cyclic chunk distribution with atomic work-stealing #25048
Open
dnislno wants to merge 1 commit into
Replaces the per-thread cyclic assignment (current_chunk = ith) with a shared atomic counter (atomic_fetch_add) in both ggml_compute_forward_mul_mat and ggml_compute_forward_mul_mat_id. This prevents thread starvation on hybrid CPU architectures (e.g. Intel Alder Lake with P-cores + E-cores), where faster P-cores finish their share and sit idle at ggml_barrier while slower E-cores lag behind.
The old pattern also had an if (nth >= nchunks) break guard that was necessary for the cyclic scheme but is subsumed by the while-condition in the work-stealing approach.
Benchmarked on i3-1215U (2P+HT + 4E = 8 logical) with Qwen3.5-2B-Q4_K_M:
- pp512: t=8 improved 8% (69.47 -> 75.24 t/s)
- tg128: t=8 improved 14% (9.07 -> 10.32 t/s)
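To make the claim-in-the-while-condition idea concrete, here is a minimal sketch of a shared-counter loop using C11 atomics and POSIX threads. It is not the PR's diff: work_pool, worker, run_pool and MAX_CHUNKS are hypothetical names chosen for illustration, and "processing" a chunk is reduced to counting how often it was claimed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_CHUNKS 1024

/* Hypothetical shared state; the names are illustrative, not ggml's. */
typedef struct {
    atomic_int next_chunk;        /* next unclaimed chunk index */
    int        nchunks;           /* total number of chunks */
    atomic_int hits[MAX_CHUNKS];  /* how often each chunk was processed */
} work_pool;

static void *worker(void *arg) {
    work_pool *pool = arg;
    int chunk;
    /* Claim and bounds-check in the loop condition: a thread that arrives
       after all chunks are taken receives an index >= nchunks and leaves
       without doing any work, so no separate guard is needed. */
    while ((chunk = atomic_fetch_add_explicit(&pool->next_chunk, 1,
                                              memory_order_relaxed)) < pool->nchunks) {
        atomic_fetch_add_explicit(&pool->hits[chunk], 1, memory_order_relaxed);
    }
    return NULL;
}

/* Runs nth workers over nchunks chunks. Returns 1 if every chunk was
   processed exactly once, 0 otherwise. */
int run_pool(int nth, int nchunks) {
    assert(nth > 0 && nth <= 64 && nchunks >= 0 && nchunks <= MAX_CHUNKS);

    work_pool pool;
    pool.nchunks = nchunks;
    atomic_init(&pool.next_chunk, 0);  /* all threads compete from chunk 0 */
    for (int i = 0; i < MAX_CHUNKS; i++) {
        atomic_init(&pool.hits[i], 0);
    }

    pthread_t threads[64];
    for (int t = 0; t < nth; t++) {
        int rc = pthread_create(&threads[t], NULL, worker, &pool);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < nth; t++) {
        pthread_join(threads[t], NULL);
    }

    for (int i = 0; i < nchunks; i++) {
        if (atomic_load(&pool.hits[i]) != 1) {
            return 0;
        }
    }
    return 1;
}
```

Calling run_pool(8, 3) exercises the case with more threads than chunks: the surplus threads fail the loop condition on their first claim. On most toolchains this needs to be built with -pthread.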
Hi @dnislno, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Overview
Replace the per-thread cyclic chunk assignment (current_chunk = ith) with a shared atomic work-stealing counter (atomic_fetch_add) in both ggml_compute_forward_mul_mat and ggml_compute_forward_mul_mat_id in ggml/src/ggml-cpu/ggml-cpu.c.
In the cyclic scheme, thread i starts at chunk i and advances by nth: with nth = 8, thread 0 gets chunks 0, 8, 16, ...; thread 1 gets 1, 9, 17, ... On hybrid CPUs (e.g. Intel Alder Lake P-cores + E-cores), a slow E-core still working through its share keeps the other threads waiting at ggml_barrier: the fast P-cores have finished their cyclic share but cannot steal the remaining work because each chunk is pre-assigned.
With work-stealing, all threads atomically claim the next available chunk, starting from 0. Fast cores automatically take more chunks; slow cores take fewer. The barrier wait is minimized.
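The difference between the two schemes can be illustrated with a small deterministic cost model. The sketch below is an assumption-laden toy, not a measurement: makespan_cyclic and makespan_shared are hypothetical helpers, every chunk is assumed to cost the same, and thread t is assumed to need cost[t] time units per chunk.

```c
#include <assert.h>

/* Static cyclic assignment: thread t processes chunks t, t+nth, t+2*nth, ...
   Returns the time at which the slowest thread finishes. */
int makespan_cyclic(const int *cost, int nth, int nchunks) {
    int worst = 0;
    for (int t = 0; t < nth; t++) {
        int count = 0;
        for (int c = t; c < nchunks; c += nth) {
            count++;
        }
        int finish = count * cost[t];
        if (finish > worst) {
            worst = finish;
        }
    }
    return worst;
}

/* Shared counter: whichever thread becomes free first claims the next chunk.
   Simulated sequentially by always handing the next chunk to the thread
   with the earliest current finish time (ties go to the lowest index). */
int makespan_shared(const int *cost, int nth, int nchunks) {
    assert(nth > 0 && nth <= 64);
    int busy_until[64] = {0};
    for (int c = 0; c < nchunks; c++) {
        int next = 0;
        for (int t = 1; t < nth; t++) {
            if (busy_until[t] < busy_until[next]) {
                next = t;
            }
        }
        busy_until[next] += cost[next];
    }
    int worst = 0;
    for (int t = 0; t < nth; t++) {
        if (busy_until[t] > worst) {
            worst = busy_until[t];
        }
    }
    return worst;
}
```

With a made-up cost array of {1, 1, 1, 1, 2, 2, 2, 2} (four fast threads, four threads half as fast) and 64 chunks, the cyclic model finishes at 16 time units because each slow thread must process 8 chunks, while the shared-counter model finishes at 11. The cost ratio is purely illustrative and does not describe any particular CPU.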
The old code also had an if (nth >= nchunks) break guard inside the while body. This was necessary for the cyclic scheme (a thread that started beyond the last chunk had no work), but with work-stealing the while-condition itself handles this case, so the guard is removed.
Related issues/PRs
--threads -1 double counts via std::thread::hardware_concurrency() due to hyper-threading #19110: recent threadpool refactoring
No standalone issue was filed for this specific bug.
Source locations
File: ggml/src/ggml-cpu/ggml-cpu.c
Site 1 - ggml_compute_forward_mul_mat reset (line 1351):
Before:
After:
Site 2 - ggml_compute_forward_mul_mat loop (line 1410):
Before:
After:
Site 3 - ggml_compute_forward_mul_mat_id reset (line 1625):
Before:
After:
*current_chunk_ctr = 0;
Site 4 - ggml_compute_forward_mul_mat_id loop (line 1663):
Before:
After:
Local testing
Tested locally on Windows 11 with MSYS2 UCRT64 (GCC 16.1.0, cmake 4.3.4, ninja 1.13.2). CPU: Intel i3-1215U (2P+HT + 4E = 8 logical threads). Model: Qwen3.5-2B-Q4_K_M (1.23 GiB).
Before fix (llama-bench with -p 512 -n 128 -r 3):
Combined throughput at t=8: 29.79 t/s. Throughput regresses as threads increase.
After fix (same benchmark):
Combined throughput at t=8: 33.33 t/s (+11.9%).
The fix restores scaling: prompt processing (pp512) now improves with more threads (61 -> 71 -> 75 t/s). Text generation (tg128) remains bandwidth-bound as expected, but the penalty at t=8 is reduced (9.07 -> 10.32 t/s, +13.8%).
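The combined-throughput figures appear consistent with dividing the total token count (512 + 128) by the total time implied by the pp512 and tg128 rates at t=8. The sketch below checks that arithmetic; combined_tps and percent_gain are hypothetical helper names, and the formula is an inference from the reported numbers rather than something stated in the PR.

```c
#include <assert.h>

/* Total tokens divided by total time, where each phase's time is
   tokens / (tokens per second). */
double combined_tps(double pp_tokens, double pp_tps,
                    double tg_tokens, double tg_tps) {
    assert(pp_tps > 0.0 && tg_tps > 0.0);
    double seconds = pp_tokens / pp_tps + tg_tokens / tg_tps;
    return (pp_tokens + tg_tokens) / seconds;
}

/* Relative improvement of `after` over `before`, in percent. */
double percent_gain(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

With the t=8 rates quoted above, combined_tps(512, 69.47, 128, 9.07) comes out at about 29.79 t/s and combined_tps(512, 75.24, 128, 10.32) at about 33.32 t/s, matching the reported values to within rounding of the inputs. percent_gain gives roughly 8.3% for pp512 and 13.8% for tg128.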
Changes summary
- Changed mul_mat and mul_mat_id to use atomic_fetch_add in the while-condition instead of a per-thread cyclic start
- Reset the shared counter to 0 (instead of nth) so all threads compete from the first chunk
- Removed the if (nth >= nchunks) break guard
Tests added
No new tests. The existing benchmark suite (llama-bench, llama-perplexity) validates correctness: neural network outputs are deterministic under the same seed regardless of chunk ordering (chunks are independent, and the atomic counter only decides which thread processes each one).
Requirements
current_chunk variable), running A/B benchmarks to measure the fix impact, and drafting the PR description. The fix itself is human-authored and understood: replace cyclic indexing with atomic work-stealing, a standard parallel computing pattern.