
Remove compile bottlenecks from ZImage pipeline#13461

Open
hitchhiker3010 wants to merge 3 commits into huggingface:main from hitchhiker3010:main

Conversation

@hitchhiker3010

What does this PR do?

Fixes performance issues identified by profiling ZImagePipeline with torch.profiler as part of #13401.

Profiled ZImagePipeline (using Tongyi-MAI/Z-Image-Turbo) in both eager and torch.compile modes following the profiling guide. The Chrome traces revealed two device-to-host (DtoH) synchronization points that break asynchronous GPU execution and prevent torch.compile from yielding its full speedup.
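For reference, a minimal sketch of the kind of profiling setup that produces these Chrome traces; the helper name `trace_call`, the output filename, and the commented pipeline call are illustrative, not code from this PR:

```python
import os
import torch
from torch.profiler import profile, ProfilerActivity

def trace_call(fn, out_path="zimage_trace.json"):
    """Run `fn` under torch.profiler and export a Chrome trace.

    The exported JSON can be opened in chrome://tracing or Perfetto to
    inspect cudaStreamSynchronize blocks like the ones discussed below.
    """
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)  # capture GPU-side events too
    with profile(activities=activities) as prof:
        fn()
    prof.export_chrome_trace(out_path)
    return prof

# With the pipeline this would be something like:
# trace_call(lambda: pipe(prompt, num_inference_steps=4, guidance_scale=0.0))
prof = trace_call(lambda: torch.randn(64, 64) @ torch.randn(64, 64))
```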

Pipeline denoising loop: `timestep[0].item()` DtoH sync

  1. Inside the denoising loop, `timestep[0].item()` triggers a GPU→CPU sync every step to read `t_norm` for the CFG-truncation logic. Since the full timestep schedule is known before the loop begins, we precompute all `t_norm` values into a plain Python list before entering the loop and index into it with `i`.
  2. This also lets us call `scheduler.set_begin_index(0)` upfront to avoid the DtoH sync in `_init_step_index` (same pattern as "Avoid DtoH sync from access of nonzero() item in scheduler", #11696).
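The change can be sketched as follows; the schedule below is a stand-in, and the variable names are illustrative rather than the pipeline's exact identifiers:

```python
import torch

# Stand-in for the scheduler's normalized timestep schedule (known up front).
timesteps = torch.linspace(1.0, 0.0, 4)

# Before: one DtoH sync per step inside the loop, e.g.
#   for t in timesteps:
#       t_norm = t.item()  # GPU -> CPU read stalls the stream every iteration

# After: a single DtoH transfer for the whole schedule, before the loop.
t_norms = timesteps.tolist()  # plain Python floats
# scheduler.set_begin_index(0)  # also skips the nonzero() sync in _init_step_index
for i, t in enumerate(timesteps):
    t_norm = t_norms[i]  # no sync inside the loop
```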

Profiling ZImagePipeline

- GPU: L4
- num_inference_steps: 4
- guidance_scale: 0.0 (guidance should be 0 for the Turbo models)

Before
The first `scheduler_step` took 657.8 µs.
Number of `cudaStreamSynchronize` blocks: 19

After
The first `scheduler_step` took 15.49 µs after this fix.
Number of `cudaStreamSynchronize` blocks: 13
Part of #13401 .

Before submitting

Who can review?

@sayakpaul @dg845

@github-actions github-actions bot added pipelines size/S PR with diff < 50 LOC labels Apr 13, 2026
@sayakpaul sayakpaul added the performance Anything related to performance improvements, profiling and benchmarking label Apr 14, 2026
@sayakpaul
Member

Thanks for your PR! Can we eliminate all the cudaStreamSynchronize calls?

…former

Boolean mask indexing (tensor[mask] = val) implicitly calls nonzero(),
which triggers a DtoH sync that stalls the CPU while the GPU queue drains.
Replacing it with torch.where eliminates these syncs from the transformer's
pad-token assignment.

Profiling (4-step turbo, fix_2 vs fix_1):
- Eager: nonzero CPU time drops from ~2091 ms to <1 ms; index_put eliminated
- Compile: nonzero CPU time drops from ~3057 ms to <1 ms; index_put eliminated
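A minimal sketch of the rewrite described above; `hidden_states`, `pad_mask`, and `pad_value` are illustrative names, not the transformer's actual identifiers:

```python
import torch

hidden_states = torch.randn(2, 5, 8)  # (batch, seq_len, dim)
pad_mask = torch.tensor([[0, 0, 1, 1, 1],
                         [0, 1, 1, 1, 1]], dtype=torch.bool)
pad_value = 0.0

# Before: boolean-mask assignment calls nonzero() under the hood -> DtoH sync
#   hidden_states[pad_mask] = pad_value

# After: torch.where builds the result with no data-dependent shapes, so no
# sync is needed and the op traces cleanly under torch.compile.
hidden_states = torch.where(
    pad_mask.unsqueeze(-1),                      # broadcast over the feature dim
    torch.full_like(hidden_states, pad_value),   # value for padded positions
    hidden_states,                               # unchanged elsewhere
)
```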
@github-actions github-actions bot added models size/S PR with diff < 50 LOC and removed size/S PR with diff < 50 LOC labels Apr 14, 2026
@hitchhiker3010
Author

hitchhiker3010 commented Apr 14, 2026

Here are some comparison stats between commit_1 and commit_2

| Metric | commit_1 eager | commit_2 eager | commit_1 compile | commit_2 compile |
|---|---|---|---|---|
| `nonzero` calls | 28 | 4 | 28 | 4 |
| `nonzero` CPU time | 2091 ms | 0.72 ms | 3057 ms | 0.49 ms |
| `index_put` calls | 20 | 0 | 36 | 0 |
| `index_put` total | 4183 ms | 0 ms | 9172 ms | 0 ms |
| `cudaStreamSynchronize` calls | 13 | 5 | 13 | 5 |
| `cudaStreamSynchronize` total | 2089 ms | 0.47 ms | 3055 ms | 0.32 ms |

@hitchhiker3010
Author

All the trace files can be accessed here.

The `cudaStreamSynchronize` traces from the denoising phase are eliminated now. The remaining 5 `cudaStreamSynchronize` calls seem to come from the text-encoding phase; should we fix them too?

cc: @sayakpaul
