Skip to content

ggml-cpu: replace cyclic chunk distribution with atomic work-stealing#25048

Open
dnislno wants to merge 1 commit into
ggml-org:masterfrom
dnislno:fix/work-stealing
Open

ggml-cpu: replace cyclic chunk distribution with atomic work-stealing#25048
dnislno wants to merge 1 commit into
ggml-org:masterfrom
dnislno:fix/work-stealing

Conversation

@dnislno

@dnislno dnislno commented Jun 26, 2026

Copy link
Copy Markdown

Overview

Replace the per-thread cyclic chunk assignment (current_chunk = ith) with a shared atomic work-stealing counter (atomic_fetch_add) in both ggml_compute_forward_mul_mat and ggml_compute_forward_mul_mat_id in ggml/src/ggml-cpu/ggml-cpu.c.

In the cyclic scheme, thread i starts at chunk i and advances by nth: thread 0 gets chunks 0,8,16...; thread 1 gets 1,9,17... On hybrid CPUs (e.g. Intel Alder Lake P-cores + E-cores), a slow E-core assigned to dense chunks makes other threads wait at ggml_barrier while idle — the fast P-cores have finished their cyclic share but cannot steal remaining work because each chunk is pre-assigned.

With work-stealing, all threads atomically claim the next available chunk from 0. Fast cores automatically take more chunks; slow cores take fewer. The barrier wait is minimized.

The old code also had a redundant if (nth >= nchunks) break guard inside the while body. This was necessary for the cyclic scheme (a thread that started beyond the last chunk had no work), but with work-stealing the while-condition itself handles this correctly — removed.

Related issues/PRs

No standalone issue was filed for this specific bug.

Source locations

File: ggml/src/ggml-cpu/ggml-cpu.c

Site 1 - ggml_compute_forward_mul_mat reset (line 1351):

Before:

// Every thread starts at ith, so the first unprocessed chunk is nth.  This save a bit of coordination right at the start.
atomic_store_explicit(&params->threadpool->current_chunk, nth, memory_order_relaxed);

After:

atomic_store_explicit(&params->threadpool->current_chunk, 0, memory_order_relaxed);

Site 2 - ggml_compute_forward_mul_mat loop (line 1410):

Before:

// The first chunk comes from our thread_id, the rest will get auto-assigned.
int current_chunk = ith;

while (current_chunk < nchunk0 * nchunk1) {
    ...
    if (nth >= nchunk0 * nchunk1) {
        break;
    }
    current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed);
}

After:

int current_chunk;

while ((current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed)) < nchunk0 * nchunk1) {
    ...
}

Site 3 - ggml_compute_forward_mul_mat_id reset (line 1625):

Before:

*current_chunk_ctr = nth;

After:

*current_chunk_ctr = 0;

Site 4 - ggml_compute_forward_mul_mat_id loop (line 1663):

Before:

int current_chunk = ith;

while (current_chunk < nchunk0 * nchunk1) {
    ...
    if (nth >= nchunk0 * nchunk1) {
        break;
    }
    current_chunk = atomic_fetch_add_explicit(current_chunk_ctr, 1, memory_order_relaxed);
}

After:

int current_chunk;

while ((current_chunk = atomic_fetch_add_explicit(current_chunk_ctr, 1, memory_order_relaxed)) < nchunk0 * nchunk1) {
    ...
}

Local testing

Tested locally on Windows 11 with MSYS2 UCRT64 (GCC 16.1.0, cmake 4.3.4, ninja 1.13.2). CPU: Intel i3-1215U (2P+HT + 4E = 8 logical threads). Model: Qwen3.5-2B-Q4_K_M (1.23 GiB).

Before fix — llama-bench with -p 512 -n 128 -r 3:

| threads | pp512 (t/s) | tg128 (t/s) |
|---------|------------|------------|
| 4       | 68.15      | 13.01      |
| 6       | 70.69      | 11.56      |
| 8       | 69.47      |  9.07      |

Combined throughput at t=8: 29.79 t/s — throughput regresses as threads increase.

After fix — same benchmark:

| threads | pp512 (t/s) | tg128 (t/s) |
|---------|------------|------------|
| 4       | 61.44      | 12.27      |
| 6       | 70.67      | 12.55      |
| 8       | 75.24      | 10.32      |

Combined throughput at t=8: 33.33 t/s (+11.9%).

The fix restores scaling: prompt processing (pp512) now improves with more threads (61 -> 71 -> 75 t/s). Text generation (tg128) remains bandwidth-bound as expected, but the penalty at t=8 is reduced (9.07 -> 10.32 t/s, +13.8%).

Changes summary

ggml/src/ggml-cpu/ggml-cpu.c | 27 ++++++---------------------
 1 file changed, 6 insertions(+), 21 deletions(-)
  • Unify both mul_mat and mul_mat_id to use atomic_fetch_add in the while-condition instead of per-thread cyclic start
  • Reset chunk counter to 0 (not nth) so all threads compete from the first chunk
  • Remove redundant if (nth >= nchunks) break guard
  • Remove stale comments about the cyclic scheme

Tests added

No new tests. The existing benchmark suite (llama-bench, llama-perplexity) validates correctness: neural network outputs are deterministic under the same seed regardless of chunk ordering (chunks are independent and accumulated via atomics).

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    • Used opencode as assistive tool for codebase navigation (locating the cyclic distribution sites and current_chunk variable), running A/B benchmarks to measure the fix impact, and drafting the PR description. The fix itself is human-authored and understood: replace cyclic indexing with atomic work-stealing, a standard parallel computing pattern.

Replaces the per-thread cyclic assignment (current_chunk = ith) with
a shared atomic counter (atomic_fetch_add) in both ggml_compute_forward_mul_mat
and ggml_compute_forward_mul_mat_id.

This prevents thread starvation on hybrid CPU architectures (e.g. Intel
Alder Lake with P-cores + E-cores) where slower E-cores can lag behind
while faster P-cores sit idle waiting at ggml_barrier.

The old pattern also had a redundant if (nth >= nchunks) break guard
that was necessary for the cyclic scheme but is subsumed by the
while-condition in the work-stealing approach.

Benchmarked on i3-1215U (2P+HT + 4E = 8 logical) with Qwen3.5-2B-Q4_K_M:
- pp512: t=8 improved 8% (69.47 -> 75.24 t/s)
- tg128: t=8 improved 14% (9.07 -> 10.32 t/s)
@dnislno dnislno requested a review from ggerganov as a code owner June 26, 2026 12:03
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 26, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown

Hi @dnislno, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant