
[TASK-431] Byte-sized cap instead of hard one for ArrowWriter #443

Merged
leekeiabstraction merged 5 commits into apache:main from fresh-borzoni:byte-sized-cap
Apr 13, 2026
Conversation

@fresh-borzoni
Contributor

Summary

closes #431

Arrow log batches were previously capped at a hard 256-record limit regardless of actual data size, meaning a batch of 256 tiny ints (~1KB) was treated the same as 256 large rows (~10MB). This replaces that fixed cap with byte-size-based fullness matching Java's ArrowWriter, so batches now fill up to the configured writer_batch_size (default 2MB).

The fullness check uses a threshold-based optimization to avoid computing sizes on every append: it estimates how many records should fit, skips checks until that count is reached, then recalculates. Size estimation reads buffer lengths directly from the Arrow builders (O(num_columns), zero allocation), with a pre-computed IPC overhead constant that captures metadata and body framing for the schema.
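A minimal sketch of the threshold mechanism described above — all names, constants, and arithmetic here are illustrative, not the PR's actual code:

```rust
/// Byte-size-based batch fullness with a threshold optimization:
/// instead of summing builder buffer sizes on every append, estimate
/// how many more records fit and skip the size check until then.
/// Names and constants are illustrative.
struct FullnessTracker {
    batch_size_limit: usize, // configured writer_batch_size, e.g. 2 MiB
    ipc_overhead: usize,     // pre-computed IPC metadata/framing for the schema
    record_count: usize,
    next_check_at: usize,    // skip size estimation until this record count
}

impl FullnessTracker {
    fn new(batch_size_limit: usize, ipc_overhead: usize) -> Self {
        Self { batch_size_limit, ipc_overhead, record_count: 0, next_check_at: 1 }
    }

    /// Called once per appended record; `buffered_bytes` stands in for
    /// reading buffer lengths off the Arrow builders (O(num_columns)).
    fn is_full(&mut self, buffered_bytes: usize) -> bool {
        self.record_count += 1;
        if self.record_count < self.next_check_at {
            return false; // cheap path: no size computation at all
        }
        let estimated = self.ipc_overhead + buffered_bytes;
        if estimated >= self.batch_size_limit {
            return true;
        }
        // Estimate how many more average-sized records fit, and don't
        // re-check until that many have been appended.
        let avg = (estimated / self.record_count).max(1);
        let headroom = (self.batch_size_limit - estimated) / avg;
        self.next_check_at = self.record_count + headroom.max(1);
        false
    }
}

/// Fill a batch with fixed-size records until the tracker reports full;
/// returns how many records fit.
fn fill_batch(limit: usize, overhead: usize, record_bytes: usize) -> usize {
    let mut tracker = FullnessTracker::new(limit, overhead);
    let mut bytes = 0;
    let mut records = 0;
    loop {
        bytes += record_bytes;
        records += 1;
        if tracker.is_full(bytes) {
            return records;
        }
    }
}

fn main() {
    // With a 2 MiB limit and 1 KiB records the batch fills to
    // limit / record_bytes = 2048 records, instead of a hard 256,
    // while performing only a handful of actual size checks.
    println!("records per batch: {}", fill_batch(2 * 1024 * 1024, 256, 1024));
}
```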

Compression is accounted for via an adaptive ArrowCompressionRatioEstimator shared across batches for the same table. It starts at 1.0 (assume no compression) and adjusts asymmetrically after each batch is serialized: quick to increase when compression worsens, slow to decrease when it improves. This matches Java's conservative approach to avoid underestimating batch sizes.
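The asymmetric update can be sketched as follows; the step constants are assumptions for illustration, not the estimator's actual values:

```rust
/// Adaptive compression-ratio estimator shared across batches of a table.
/// Starts at 1.0 (assume no compression) and adjusts asymmetrically:
/// quick to increase when compression worsens, slow to decrease when it
/// improves. Step constants are illustrative, not the PR's exact values.
struct CompressionRatioEstimator {
    ratio: f64,
}

impl CompressionRatioEstimator {
    const DETERIORATE_STEP: f64 = 0.05; // big jump upward
    const IMPROVE_STEP: f64 = 0.005;    // small creep downward

    fn new() -> Self {
        Self { ratio: 1.0 }
    }

    fn ratio(&self) -> f64 {
        self.ratio
    }

    /// Feed the observed compressed/uncompressed ratio after each batch
    /// is serialized.
    fn update(&mut self, observed: f64) {
        if observed > self.ratio {
            // Compression got worse: move up fast so the next batch's
            // size is not underestimated.
            self.ratio = (self.ratio + Self::DETERIORATE_STEP).min(1.0);
        } else {
            // Compression improved: move down slowly, never below what
            // was actually observed.
            self.ratio = (self.ratio - Self::IMPROVE_STEP).max(observed);
        }
    }

    /// Estimated serialized size for `uncompressed` buffered bytes.
    fn estimate(&self, uncompressed: usize) -> usize {
        (uncompressed as f64 * self.ratio) as usize
    }
}

fn main() {
    let mut est = CompressionRatioEstimator::new();
    for _ in 0..10 {
        est.update(0.5); // ten batches compressing to half their size
    }
    // After ten good batches the ratio has only crept down a little.
    println!("ratio after improvement: {:.3}", est.ratio());
    est.update(0.99); // one badly-compressing batch snaps it back up
    println!("ratio after deterioration: {:.3}", est.ratio());
}
```

The conservative bias means a run of well-compressing batches barely lowers the size estimate, while a single poorly-compressing batch immediately raises it, keeping batches from blowing past the configured limit.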

Also aligns VARIABLE_WIDTH_AVG_BYTES (64 -> 8) and INITIAL_ROW_CAPACITY (256 -> 1024) with Java Arrow defaults.

Writer pooling (ArrowWriterPool) and DynamicWriteBatchSizeEstimator are out of scope -> TODOs added.


@fresh-borzoni
Contributor Author

Created issues for TODOs: #444 #445

Contributor

@charlesdong1991 charlesdong1991 left a comment


Very nice PR! Left some minor comments/questions.

Comment thread crates/fluss/src/record/arrow.rs Outdated
Comment thread crates/fluss/src/record/arrow.rs Outdated
Comment thread crates/fluss/src/record/arrow.rs
Comment thread crates/fluss/src/client/write/batch.rs Outdated
Comment thread crates/fluss/src/record/arrow.rs Outdated
Comment thread crates/fluss/src/record/arrow.rs
@fresh-borzoni
Contributor Author

@charlesdong1991 Ty for the review, addressed comments
PTAL 🙏

Comment thread crates/fluss/src/record/arrow.rs
@fresh-borzoni
Contributor Author

@leekeiabstraction TY for the review, answered your comment
PTAL 🙏

@fresh-borzoni
Contributor Author

@luoyuxia @leekeiabstraction @charlesdong1991 PTAL 🙏
I'd like to continue with write path optimisations if possible

Contributor

@leekeiabstraction leekeiabstraction left a comment


LGTM, approved. Note there's a rebase conflict that needs to be addressed before merge.

@fresh-borzoni
Contributor Author

fresh-borzoni commented Apr 13, 2026

@leekeiabstraction Ty, rebased

Contributor

@leekeiabstraction leekeiabstraction left a comment


LGTM, made some small changes to remove unused functions so the build passes.

@leekeiabstraction leekeiabstraction merged commit bf08166 into apache:main Apr 13, 2026
9 checks passed


Development

Successfully merging this pull request may close these issues.

Switch DEFAULT_MAX_RECORD hard cap to byte-size

3 participants