Add specialized bitSplit decode fast paths for fp16/fp32/fp64 with unitBench benchmarks (#506)
Closed
mmandina wants to merge 1 commit into facebook:dev from
2995f97 to
7140539
Compare
7140539 to
12d00e1
Compare
12d00e1 to
ba6e630
Compare
mmandina
added a commit
to mmandina/openzl
that referenced
this pull request
Mar 27, 2026
…itBench benchmarks (facebook#506)

Summary:
Add specialized decode functions for fp16, fp32, and fp64 IEEE floating-point formats in bitSplit, mirroring the existing encode specializations. Each format gets a dedicated decoder that reassembles {mantissa, exponent, sign} streams without the overhead of the generic switch-based loop.

- `decodeFp16`: bitWidths {10, 5, 1}, srcEltWidths {2, 1, 1}, dstEltWidth=2
- `decodeFp32`: bitWidths {23, 8, 1}, srcEltWidths {4, 1, 1}, dstEltWidth=4
- `decodeFp64`: bitWidths {52, 11, 1}, srcEltWidths {8, 2, 1}, dstEltWidth=8

Pattern matchers and dispatch added to `ZL_bitSplitDecode`.

Kernel benchmarks added for fp16 and fp64 decode (fp32 decode already existed).
Updated unitbench skill to recommend `buck build/run @//mode/opt` for benchmarking.

Benchmark results (10MB, buck @//mode/opt):

| Format | Encode (MB/s) | Decode Generic (MB/s) | Decode Specialized (MB/s) | Decode Speedup |
|--------|---------------|-----------------------|---------------------------|----------------|
| bf16   | 23,289        | n/a                   | 49,423                    | n/a            |
| fp16   | 20,042        | 1,555                 | 54,102                    | ~35x           |
| fp32   | 25,872        | 2,093                 | 48,320                    | ~23x           |
| fp64   | 24,014        | 3,756                 | 47,330                    | ~13x           |

Reviewed By: terrelln

Differential Revision: D96359402
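To make the reassembly step concrete, here is a minimal sketch of what a specialized fp32 decoder could look like. It is not the code from this PR: the function name `sketch_decode_fp32`, the parameter names, and the assumptions that each stream stores one element per value in host byte order and that the streams cover bits from least to most significant are all illustrative. Only the bit widths {23, 8, 1} and element widths {4, 1, 1} come from the summary above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the OpenZL implementation.
 * Rebuilds fp32 bit patterns from three separate streams:
 *   mantissa: 4-byte elements, low 23 bits used
 *   exponent: 1-byte elements, all 8 bits used
 *   sign:     1-byte elements, low bit used
 * Assumes the streams cover bits from least to most significant. */
static void sketch_decode_fp32(
        uint32_t* dst,
        const uint32_t* mantissa,
        const uint8_t* exponent,
        const uint8_t* sign,
        size_t nbElts)
{
    for (size_t i = 0; i < nbElts; ++i) {
        uint32_t const m = mantissa[i] & 0x7FFFFFu;         /* bits 0..22  */
        uint32_t const e = (uint32_t)exponent[i] << 23;     /* bits 23..30 */
        uint32_t const s = (uint32_t)(sign[i] & 1u) << 31;  /* bit 31      */
        dst[i] = m | e | s;
    }
}
```

Because the shifts and masks are compile-time constants and the loop body has no branches, a compiler has a good chance of vectorizing it, which is one plausible source of the gap between the generic and specialized decode columns.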
ba6e630 to
75f2991
Compare
75f2991 to
4850913
Compare
4850913 to
6ecb65f
Compare
6ecb65f to
6193adf
Compare
This pull request has been merged in d40da0a.
Summary:
Add specialized decode functions for fp16, fp32, and fp64 IEEE floating-point
formats in bitSplit, mirroring the existing encode specializations. Each format
gets a dedicated decoder that reassembles {mantissa, exponent, sign} streams
without the overhead of the generic switch-based loop.

- `decodeFp16`: bitWidths {10, 5, 1}, srcEltWidths {2, 1, 1}, dstEltWidth=2
- `decodeFp32`: bitWidths {23, 8, 1}, srcEltWidths {4, 1, 1}, dstEltWidth=4
- `decodeFp64`: bitWidths {52, 11, 1}, srcEltWidths {8, 2, 1}, dstEltWidth=8

Pattern matchers and dispatch added to `ZL_bitSplitDecode`.

Kernel benchmarks added for fp16 and fp64 decode (fp32 decode already existed).
Updated unitbench skill to recommend `buck build/run @//mode/opt` for benchmarking.

Benchmark results (10MB, buck @//mode/opt):

| Format | Encode (MB/s) | Decode Generic (MB/s) | Decode Specialized (MB/s) | Decode Speedup |
|--------|---------------|-----------------------|---------------------------|----------------|
| bf16   | 23,289        | n/a                   | 49,423                    | n/a            |
| fp16   | 20,042        | 1,555                 | 54,102                    | ~35x           |
| fp32   | 25,872        | 2,093                 | 48,320                    | ~23x           |
| fp64   | 24,014        | 3,756                 | 47,330                    | ~13x           |

Reviewed By: terrelln
Differential Revision: D96359402
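
The pattern-matching step mentioned in the summary can be pictured roughly as follows. This is a hedged sketch only: `SketchKernel`, `sketch_matches`, and `sketch_select` are hypothetical names, and the real dispatch inside `ZL_bitSplitDecode` may be organized differently. The widths being compared are the ones listed in the summary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel identifiers; not part of the OpenZL API. */
typedef enum {
    SKETCH_GENERIC,
    SKETCH_FP16,
    SKETCH_FP32,
    SKETCH_FP64
} SketchKernel;

/* Returns 1 when the three-stream configuration equals the wanted one. */
static int sketch_matches(
        const uint8_t* bitWidths,
        const size_t* srcEltWidths,
        size_t nbStreams,
        size_t dstEltWidth,
        const uint8_t wantBits[3],
        const size_t wantSrc[3],
        size_t wantDst)
{
    if (nbStreams != 3 || dstEltWidth != wantDst) {
        return 0;
    }
    for (size_t i = 0; i < 3; ++i) {
        if (bitWidths[i] != wantBits[i] || srcEltWidths[i] != wantSrc[i]) {
            return 0;
        }
    }
    return 1;
}

/* Picks a specialized kernel when the configuration is a known IEEE
 * layout, otherwise falls back to the generic decoder. */
static SketchKernel sketch_select(
        const uint8_t* bitWidths,
        const size_t* srcEltWidths,
        size_t nbStreams,
        size_t dstEltWidth)
{
    static const uint8_t fp16Bits[3] = { 10, 5, 1 };
    static const size_t fp16Src[3]   = { 2, 1, 1 };
    static const uint8_t fp32Bits[3] = { 23, 8, 1 };
    static const size_t fp32Src[3]   = { 4, 1, 1 };
    static const uint8_t fp64Bits[3] = { 52, 11, 1 };
    static const size_t fp64Src[3]   = { 8, 2, 1 };

    if (sketch_matches(bitWidths, srcEltWidths, nbStreams, dstEltWidth,
                       fp16Bits, fp16Src, 2)) {
        return SKETCH_FP16;
    }
    if (sketch_matches(bitWidths, srcEltWidths, nbStreams, dstEltWidth,
                       fp32Bits, fp32Src, 4)) {
        return SKETCH_FP32;
    }
    if (sketch_matches(bitWidths, srcEltWidths, nbStreams, dstEltWidth,
                       fp64Bits, fp64Src, 8)) {
        return SKETCH_FP64;
    }
    return SKETCH_GENERIC;
}
```

Running the match once per call keeps the per-element loop free of configuration checks, and any layout that does not match exactly still goes through the generic path.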