
Add specialized bitSplit decode fast paths for fp16/fp32/fp64 with unitBench benchmarks (#506)

Closed

mmandina wants to merge 1 commit into facebook:dev from mmandina:export-D96359402

Conversation

Contributor

@mmandina mmandina commented Mar 12, 2026

Summary:

Add specialized decode functions for fp16, fp32, and fp64 IEEE floating-point
formats in bitSplit, mirroring the existing encode specializations. Each format
gets a dedicated decoder that reassembles {mantissa, exponent, sign} streams
without the overhead of the generic switch-based loop.

- `decodeFp16`: bitWidths {10, 5, 1}, srcEltWidths {2, 1, 1}, dstEltWidth=2
- `decodeFp32`: bitWidths {23, 8, 1}, srcEltWidths {4, 1, 1}, dstEltWidth=4
- `decodeFp64`: bitWidths {52, 11, 1}, srcEltWidths {8, 2, 1}, dstEltWidth=8

Pattern matchers and dispatch added to `ZL_bitSplitDecode`.

Kernel benchmarks added for fp16 and fp64 decode (fp32 decode already existed).

Updated the unitbench skill to recommend `buck build/run @//mode/opt` for benchmarking.

Benchmark results (10MB, buck @//mode/opt):

| Format | Encode (MB/s) | Decode Generic (MB/s) | Decode Specialized (MB/s) | Decode Speedup |
|--------|---------------|-----------------------|---------------------------|----------------|
| bf16   | 23,289        | —                     | 49,423                    | —              |
| fp16   | 20,042        | 1,555                 | 54,102                    | ~35x           |
| fp32   | 25,872        | 2,093                 | 48,320                    | ~23x           |
| fp64   | 24,014        | 3,756                 | 47,330                    | ~13x           |

Reviewed By: terrelln

Differential Revision: D96359402

@meta-cla meta-cla Bot added the cla signed label Mar 12, 2026
@mmandina mmandina force-pushed the export-D96359402 branch 2 times, most recently from 2995f97 to 7140539 on March 18, 2026 at 15:46

meta-codesync Bot commented Mar 18, 2026

@mmandina has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96359402.

@meta-codesync meta-codesync Bot changed the title Add specialized bitSplit decode fast paths for fp16/fp32/fp64 with unitBench benchmarks Add specialized bitSplit decode fast paths for fp16/fp32/fp64 with unitBench benchmarks (#506) Mar 27, 2026
mmandina added a commit to mmandina/openzl that referenced this pull request Mar 27, 2026
…itBench benchmarks (facebook#506)

mmandina added a commit to mmandina/openzl that referenced this pull request Mar 27, 2026
…itBench benchmarks (facebook#506)

mmandina added a commit to mmandina/openzl that referenced this pull request Mar 27, 2026
…itBench benchmarks (facebook#506)

mmandina added a commit to mmandina/openzl that referenced this pull request Mar 27, 2026
…itBench benchmarks (facebook#506)


meta-codesync Bot commented Mar 30, 2026

This pull request has been merged in d40da0a.
