Update qd8 gemm config to enable c4 microkernels for wasmsdot #9846

Merged
copybara-service[bot] merged 1 commit into google:master from yolanda15:qd8_update_sdot
Apr 1, 2026

@yolanda15
Contributor

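This enables the 4x16c4 wasmsdot microkernel in the qd8 gemm config in place of the default 2x8c8 one. It improves qd8 performance both with relaxed SIMD alone and with revectorization (revec) enabled.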
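
As a rough illustration of what the "c4" in the kernel name refers to, here is a plain-Python emulation of a 4-way int8 dot-accumulate, modeled on instructions such as WebAssembly relaxed SIMD's `i32x4.relaxed_dot_i8x16_i7x16_add_s`. It is a sketch under assumptions, not XNNPACK code: the helper names `dot4_accumulate` and `gemm_row_c4` are hypothetical, and the packing shown (4 consecutive K values per output column) is an assumed reading of the naming convention.

```python
def dot4_accumulate(acc, a, b):
    """Emulate a 4-way int8 dot-accumulate over 16-byte vectors.

    Each of the 4 int32 lanes adds the sum of 4 adjacent int8 products.
    """
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    return [
        acc[lane] + sum(a[4 * lane + j] * b[4 * lane + j] for j in range(4))
        for lane in range(4)
    ]


def gemm_row_c4(a_row, b_cols):
    """Compute one output row for 4 columns, with K consumed in groups of 4.

    a_row:  K int8 activations (K must be a multiple of 4).
    b_cols: 4 weight columns, each holding K int8 values.
    """
    k = len(a_row)
    assert k % 4 == 0 and len(b_cols) == 4
    acc = [0, 0, 0, 0]
    for k0 in range(0, k, 4):
        # Broadcast the same 4 activations to every lane.
        a_vec = a_row[k0:k0 + 4] * 4
        # Assumed c4 packing: 4 consecutive K values per output column.
        b_vec = [w for col in b_cols for w in col[k0:k0 + 4]]
        acc = dot4_accumulate(acc, a_vec, b_vec)
    return acc


# Check against a straightforward reference dot product.
a_row = [1, -2, 3, -4, 5, -6, 7, -8]
b_cols = [[c + i for i in range(8)] for c in range(4)]
reference = [sum(x * w for x, w in zip(a_row, col)) for col in b_cols]
assert gemm_row_c4(a_row, b_cols) == reference
```

With this grouping each int32 lane already corresponds to one output column, so the accumulators need no horizontal reduction at the end of the K loop.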

Tested on Intel(R) Core(TM) Ultra 7 258V using d8:

--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------------
Default (2x8c8_wasmsdot):
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time      48247 us        48251 us          146
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time      92239 us        92244 us           77
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time       62248 us        62252 us          113
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       21342 us        21345 us          322

Default (2x8c8_wasmsdot) with revec:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time      46664 us        46670 us          150
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time      91125 us        91129 us           78
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time       61155 us        61159 us          117
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       20374 us        20376 us          341

4x16c4_wasmsdot_u2:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time      39707 us        39711 us          179
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time      75175 us        75179 us           96
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time       50763 us        50770 us          139
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       18312 us        18314 us          378

4x16c4_wasmsdot_u2 with revec:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time      26667 us        26671 us          260
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time      50717 us        50721 us          100
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time       33998 us        34003 us          204
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       11892 us        11895 us          575
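
To make the comparison easier to read, the real-time figures above can be turned into speedups over the default kernel. The snippet below is only a convenience sketch. The times are copied from the tables, while the dictionary keys and the `speedups` helper are illustrative names rather than part of the benchmark tooling.

```python
# Real time in microseconds, copied from the benchmark tables above.
# Order: D:1536/F:12288, D:2048/F:16384, D:2304/F:9216, D:1152/F:6912.
real_time_us = {
    "2x8c8_wasmsdot": [48247, 92239, 62248, 21342],
    "2x8c8_wasmsdot + revec": [46664, 91125, 61155, 20374],
    "4x16c4_wasmsdot_u2": [39707, 75175, 50763, 18312],
    "4x16c4_wasmsdot_u2 + revec": [26667, 50717, 33998, 11892],
}


def speedups(baseline, candidate):
    """Return baseline/candidate time ratios, rounded to two decimals."""
    return [round(b / c, 2) for b, c in zip(baseline, candidate)]


baseline = real_time_us["2x8c8_wasmsdot"]
for name, times in real_time_us.items():
    print(f"{name:28s} {speedups(baseline, times)}")
```

On these numbers the c4 kernel alone is roughly 1.2x faster than the default, and about 1.8x faster once revec is enabled, whereas revec on the default kernel changes little.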

copybara-service bot pushed a commit that referenced this pull request Apr 1, 2026
--
f3976a3 by Yolanda Chen <yolanda.chen@intel.com>:

Update qd8 gemm config to enable c4 microkernels for wasmsdot

FUTURE_COPYBARA_INTEGRATE_REVIEW=#9846 from yolanda15:qd8_update_sdot f3976a3
PiperOrigin-RevId: 892956888
copybara-service bot merged commit cbd072c into google:master on Apr 1, 2026
21 checks passed