Use zstd::bulk API in IPC and Parquet with context reuse for compression and decompression#9400
Conversation
Switch the Parquet and IPC zstd codecs from the streaming API (zstd::Encoder/Decoder) to the bulk API (zstd::bulk::Compressor/Decompressor) with reusable contexts. This avoids the overhead of reinitializing zstd contexts on every compress/decompress call, yielding a ~8-11% speedup on benchmarks.

Parquet: store Compressor and Decompressor in ZSTDCodec, reused across calls.
IPC: add DecompressionContext (mirroring the existing CompressionContext) with a reusable bulk Decompressor, threaded through RecordBatchDecoder.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
run benchmarks ipc_reader
🤖 Hi @Dandandan, thanks for the request (#9400 (comment)).
Please choose one or more of the available benchmarks. You can also set environment variables on subsequent lines. Unsupported benchmarks: ipc_reader.
compressor: zstd::bulk::Compressor<'static>,
decompressor: zstd::bulk::Decompressor<'static>,
Would it be possible to do even better than this and initialize a thread-local compressor?
Theoretically perhaps, but for a lot of pages it should already be reused many times.
Yeah, just slapping a thread_local! on it would not be handy, as we don't know when it's good to clear it.
Perhaps an API could be designed to reuse contexts/allocations like this across multiple Parquet reader instances on the same thread, though I don't think the gain would be large.
I will add this to the available benchmarks.
run benchmark ipc_reader
🤖: Benchmark completed.
Those show the difference (others should be unchanged).
alamb left a comment
Thanks @Dandandan and @thinkharderdev -- this is a nice find
compressor: zstd::bulk::Compressor<'static>,
decompressor: zstd::bulk::Decompressor<'static>,
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Removed the new function implementation for CompressionContext.

Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?