
Add specialized bitSplit decode fast paths for fp16/fp32/fp64 with unitBench benchmarks (#506)

Closed

mmandina wants to merge 1 commit into facebook:dev from mmandina:export-D96359402

Conversation

Contributor

@mmandina mmandina commented Mar 12, 2026

Summary:

Add specialized decode functions for fp16, fp32, and fp64 IEEE floating-point
formats in bitSplit, mirroring the existing encode specializations. Each format
gets a dedicated decoder that reassembles {mantissa, exponent, sign} streams
without the overhead of the generic switch-based loop.

- `decodeFp16`: bitWidths {10, 5, 1}, srcEltWidths {2, 1, 1}, dstEltWidth=2
- `decodeFp32`: bitWidths {23, 8, 1}, srcEltWidths {4, 1, 1}, dstEltWidth=4
- `decodeFp64`: bitWidths {52, 11, 1}, srcEltWidths {8, 2, 1}, dstEltWidth=8

Pattern matchers and dispatch added to `ZL_bitSplitDecode`.

Kernel benchmarks added for fp16 and fp64 decode (fp32 decode already existed).

Updated the unitbench skill to recommend `buck build/run @//mode/opt` for benchmarking.

Benchmark results (10MB, buck @//mode/opt):

| Format | Encode (MB/s) | Decode Generic (MB/s) | Decode Specialized (MB/s) | Decode Speedup |
|--------|---------------|-----------------------|---------------------------|----------------|
| bf16   | 23,289        | —                     | 49,423                    | —              |
| fp16   | 20,042        | 1,555                 | 54,102                    | ~35x           |
| fp32   | 25,872        | 2,093                 | 48,320                    | ~23x           |
| fp64   | 24,014        | 3,756                 | 47,330                    | ~13x           |

Reviewed By: terrelln

Differential Revision: D96359402

@meta-cla meta-cla Bot added the cla signed label Mar 12, 2026
@mmandina mmandina force-pushed the export-D96359402 branch 2 times, most recently from 2995f97 to 7140539 on March 18, 2026 at 15:46

meta-codesync Bot commented Mar 18, 2026

@mmandina has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96359402.

@meta-codesync meta-codesync Bot changed the title Add specialized bitSplit decode fast paths for fp16/fp32/fp64 with unitBench benchmarks Add specialized bitSplit decode fast paths for fp16/fp32/fp64 with unitBench benchmarks (#506) Mar 27, 2026
mmandina added a commit to mmandina/openzl that referenced this pull request Mar 27, 2026
…itBench benchmarks (facebook#506)

mmandina added a commit to mmandina/openzl that referenced this pull request Mar 27, 2026
…itBench benchmarks (facebook#506)

mmandina added a commit to mmandina/openzl that referenced this pull request Mar 27, 2026
…itBench benchmarks (facebook#506)

mmandina added a commit to mmandina/openzl that referenced this pull request Mar 27, 2026
…itBench benchmarks (facebook#506)


meta-codesync Bot commented Mar 30, 2026

This pull request has been merged in d40da0a.
