ggml-cpu: replace cyclic chunk distribution with atomic work-stealing by dnislno · Pull Request #25048 · ggml-org/llama.cpp

dnislno · 2026-06-26T12:03:46Z

Overview

Replace the per-thread cyclic chunk assignment (current_chunk = ith) with a shared atomic work-stealing counter (atomic_fetch_add) in both ggml_compute_forward_mul_mat and ggml_compute_forward_mul_mat_id in ggml/src/ggml-cpu/ggml-cpu.c.

In the cyclic scheme, thread i starts at chunk i and advances by nth: thread 0 gets chunks 0,8,16...; thread 1 gets 1,9,17... On hybrid CPUs (e.g. Intel Alder Lake P-cores + E-cores), a slow E-core assigned to dense chunks makes other threads wait at ggml_barrier while idle — the fast P-cores have finished their cyclic share but cannot steal remaining work because each chunk is pre-assigned.

With work-stealing, all threads atomically claim the next available chunk from 0. Fast cores automatically take more chunks; slow cores take fewer. The barrier wait is minimized.

The old code also had a redundant if (nth >= nchunks) break guard inside the while body. This was necessary for the cyclic scheme (a thread that started beyond the last chunk had no work), but with work-stealing the while-condition itself handles this correctly — removed.

Related issues/PRs

Performance e-core bug(?) - only 50% CPU utilization when using all threads - (Win11, Intel 13900k) #842 — original threadpool performance issue
ggml : spread compute across threads in chunks #1507 — thread synchronization overhead
ggml : use dynamic thread scheduling for matrix multiplication #6915 — CPU affinity and thread scheduling
ggml: Implement yield barrier using futex for improved thread scheduling efficiency #13079 — hybrid CPU architecture performance
Disable NUMA-specific chunking for high-core-count HPC systems #16882 — ggml_barrier spin-wait contention
--threads -1 double counts via std::thread::hardware_concurrency() due to hyper-threading #19110 — recent threadpool refactoring

No standalone issue was filed for this specific bug.

Source locations

File: ggml/src/ggml-cpu/ggml-cpu.c

Site 1 - ggml_compute_forward_mul_mat reset (line 1351):

Before:

// Every thread starts at ith, so the first unprocessed chunk is nth.  This save a bit of coordination right at the start.
atomic_store_explicit(&params->threadpool->current_chunk, nth, memory_order_relaxed);

After:

atomic_store_explicit(&params->threadpool->current_chunk, 0, memory_order_relaxed);

Site 2 - ggml_compute_forward_mul_mat loop (line 1410):

Before:

// The first chunk comes from our thread_id, the rest will get auto-assigned.
int current_chunk = ith;

while (current_chunk < nchunk0 * nchunk1) {
    ...
    if (nth >= nchunk0 * nchunk1) {
        break;
    }
    current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed);
}

After:

int current_chunk;

while ((current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed)) < nchunk0 * nchunk1) {
    ...
}

Site 3 - ggml_compute_forward_mul_mat_id reset (line 1625):

Before:

*current_chunk_ctr = nth;

After:

*current_chunk_ctr = 0;

Site 4 - ggml_compute_forward_mul_mat_id loop (line 1663):

Before:

int current_chunk = ith;

while (current_chunk < nchunk0 * nchunk1) {
    ...
    if (nth >= nchunk0 * nchunk1) {
        break;
    }
    current_chunk = atomic_fetch_add_explicit(current_chunk_ctr, 1, memory_order_relaxed);
}

After:

int current_chunk;

while ((current_chunk = atomic_fetch_add_explicit(current_chunk_ctr, 1, memory_order_relaxed)) < nchunk0 * nchunk1) {
    ...
}

Local testing

Tested locally on Windows 11 with MSYS2 UCRT64 (GCC 16.1.0, cmake 4.3.4, ninja 1.13.2). CPU: Intel i3-1215U (2P+HT + 4E = 8 logical threads). Model: Qwen3.5-2B-Q4_K_M (1.23 GiB).

Before fix — llama-bench with -p 512 -n 128 -r 3:

| threads | pp512 (t/s) | tg128 (t/s) |
|---------|------------|------------|
| 4       | 68.15      | 13.01      |
| 6       | 70.69      | 11.56      |
| 8       | 69.47      |  9.07      |

Combined throughput at t=8: 29.79 t/s — throughput regresses as threads increase.

After fix — same benchmark:

| threads | pp512 (t/s) | tg128 (t/s) |
|---------|------------|------------|
| 4       | 61.44      | 12.27      |
| 6       | 70.67      | 12.55      |
| 8       | 75.24      | 10.32      |

Combined throughput at t=8: 33.33 t/s (+11.9%).

The fix restores scaling: prompt processing (pp512) now improves with more threads (61 -> 71 -> 75 t/s). Text generation (tg128) remains bandwidth-bound as expected, but the penalty at t=8 is reduced (9.07 -> 10.32 t/s, +13.8%).

Changes summary

ggml/src/ggml-cpu/ggml-cpu.c | 27 ++++++---------------------
 1 file changed, 6 insertions(+), 21 deletions(-)

Unify both mul_mat and mul_mat_id to use atomic_fetch_add in the while-condition instead of per-thread cyclic start
Reset chunk counter to 0 (not nth) so all threads compete from the first chunk
Remove redundant if (nth >= nchunks) break guard
Remove stale comments about the cyclic scheme

Tests added

No new tests. The existing benchmark suite (llama-bench, llama-perplexity) validates correctness: neural network outputs are deterministic under the same seed regardless of chunk ordering (chunks are independent and accumulated via atomics).

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:
- Used opencode as assistive tool for codebase navigation (locating the cyclic distribution sites and current_chunk variable), running A/B benchmarks to measure the fix impact, and drafting the PR description. The fix itself is human-authored and understood: replace cyclic indexing with atomic work-stealing, a standard parallel computing pattern.

Replaces the per-thread cyclic assignment (current_chunk = ith) with a shared atomic counter (atomic_fetch_add) in both ggml_compute_forward_mul_mat and ggml_compute_forward_mul_mat_id. This prevents thread starvation on hybrid CPU architectures (e.g. Intel Alder Lake with P-cores + E-cores) where slower E-cores can lag behind while faster P-cores sit idle waiting at ggml_barrier. The old pattern also had a redundant if (nth >= nchunks) break guard that was necessary for the cyclic scheme but is subsumed by the while-condition in the work-stealing approach. Benchmarked on i3-1215U (2P+HT + 4E = 8 logical) with Qwen3.5-2B-Q4_K_M: - pp512: t=8 improved 8% (69.47 -> 75.24 t/s) - tg128: t=8 improved 14% (9.07 -> 10.32 t/s)

ggml-gh-bot · 2026-06-26T12:08:07Z

Hi @dnislno, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

dnislno requested a review from ggerganov as a code owner June 26, 2026 12:03

github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ggml-cpu: replace cyclic chunk distribution with atomic work-stealing#25048

ggml-cpu: replace cyclic chunk distribution with atomic work-stealing#25048
dnislno wants to merge 1 commit into
ggml-org:masterfrom
dnislno:fix/work-stealing

dnislno commented Jun 26, 2026 •

edited

Loading

Uh oh!

ggml-gh-bot Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

dnislno commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Related issues/PRs

Source locations

Local testing

Changes summary

Tests added

Requirements

Uh oh!

ggml-gh-bot Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dnislno commented Jun 26, 2026 •

edited

Loading