Is your feature request related to a problem or challenge?
Some callers already have the selected rows as a BooleanBuffer. Today they still need to call RowSelection::from_filters, which converts that bitmap into a Vec<RowSelector>.
That can use much more memory. For example, with 35% isolated single-row hits, the selector form is about 11.2 bytes per input row. The bitmap is 1 bit per input row. On 500M rows, that is about 5.6 GB for selectors versus 62.5 MB for the bitmap.
The rough calculation is:
RowSelector = 16 bytes
selectors = 500M rows * 35% hits * 2 selectors per isolated hit
= 350M selectors
memory = 350M * 16 bytes = 5.6 GB
bitmap = 500M bits / 8 = 62.5 MB
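The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not code from the parquet crate: the constant `ROW_SELECTOR_BYTES` takes the 16-byte figure from the calculation above, and the helper names `selector_bytes` and `bitmap_bytes` are made up for illustration.

```rust
/// Size of one selector in bytes, taken from the estimate above (assumed, not measured here).
const ROW_SELECTOR_BYTES: u64 = 16;

/// Selector memory when every hit is isolated: each hit needs one
/// "select 1 row" selector plus one "skip" selector for the gap next to it.
fn selector_bytes(isolated_hits: u64) -> u64 {
    isolated_hits * 2 * ROW_SELECTOR_BYTES
}

/// Bitmap memory: one bit per input row, rounded up to whole bytes.
fn bitmap_bytes(rows: u64) -> u64 {
    (rows + 7) / 8
}

fn main() {
    let rows: u64 = 500_000_000;
    let hits = rows * 35 / 100; // 35% isolated single-row hits

    let selectors = selector_bytes(hits);
    let bitmap = bitmap_bytes(rows);

    println!("selectors: {} bytes ({:.1} bytes per input row)", selectors, selectors as f64 / rows as f64);
    println!("bitmap:    {} bytes", bitmap);
}
```

The gap grows with the number of runs rather than the number of rows, so densely scattered hits are the worst case for the selector form.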
The reader may then choose the Mask strategy and convert the selectors back into a bitmap. In that case, the caller's bitmap makes a pointless round trip: bitmap to selectors, then selectors back to a bitmap.
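To make the round trip concrete, here is a small self-contained sketch. It uses a plain `Vec<bool>` and a simplified `Selector` struct as stand-ins; `mask_to_selectors` and `selectors_to_mask` are hypothetical helpers written for this illustration and are not the parquet crate's implementation.

```rust
/// Simplified stand-in for a run-length selector: select or skip `row_count` rows.
#[derive(Debug, Clone, PartialEq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Run-length encode a row mask (true = selected) into select/skip runs.
fn mask_to_selectors(mask: &[bool]) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for &selected in mask {
        match out.last_mut() {
            // Same kind of run as the previous row: extend it.
            Some(last) if last.skip == !selected => last.row_count += 1,
            // Otherwise start a new run.
            _ => out.push(Selector { row_count: 1, skip: !selected }),
        }
    }
    out
}

/// Expand the runs back into a row mask.
fn selectors_to_mask(selectors: &[Selector]) -> Vec<bool> {
    let mut mask = Vec::new();
    for s in selectors {
        mask.extend(std::iter::repeat(!s.skip).take(s.row_count));
    }
    mask
}

fn main() {
    let mask = vec![false, true, false, false, true, false, true, true];
    let selectors = mask_to_selectors(&mask);
    let rebuilt = selectors_to_mask(&selectors);

    // The round trip reproduces the input, so the intermediate selectors add no information.
    assert_eq!(rebuilt, mask);
    println!("{} rows -> {} selectors -> {} rows", mask.len(), selectors.len(), rebuilt.len());
}
```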
Describe the solution you'd like
A first-class mask backing on RowSelection:
let selection = RowSelection::from_boolean_buffer(buf);
let selection: RowSelection = buf.into();
When execution uses the Mask strategy, the reader can pass that buffer directly to the mask cursor instead of rebuilding a bitmap from selectors. Existing from_filters / from_consecutive_ranges users are unchanged: selector-backed paths remain available and are still better for long clustered runs and page-index-style selections.
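One possible shape for such a dual backing is sketched below. Everything here is hypothetical and for illustration only: the `Selection` enum, its variants, and the `into_mask` method are invented names, and `Vec<bool>` stands in for BooleanBuffer so the example needs no dependencies. It is not a proposal for the exact internal layout.

```rust
/// Hypothetical run-length selector: select or skip `row_count` rows.
#[derive(Debug, Clone, PartialEq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Hypothetical selection with two backings (not the parquet crate's actual type).
#[derive(Debug, Clone, PartialEq)]
enum Selection {
    /// Run-length form, as produced by the existing constructors.
    Selectors(Vec<Selector>),
    /// Row-level mask supplied directly by the caller.
    Mask(Vec<bool>),
}

impl From<Vec<bool>> for Selection {
    fn from(mask: Vec<bool>) -> Self {
        // No run-length encoding happens here; the mask is stored as-is.
        Selection::Mask(mask)
    }
}

impl Selection {
    /// Number of selected rows, regardless of backing.
    fn selected_rows(&self) -> usize {
        match self {
            Selection::Selectors(s) => s.iter().filter(|x| !x.skip).map(|x| x.row_count).sum(),
            Selection::Mask(m) => m.iter().filter(|&&b| b).count(),
        }
    }

    /// Mask for a mask-based cursor: moved through untouched when the
    /// selection is already a mask, expanded from runs otherwise.
    fn into_mask(self) -> Vec<bool> {
        match self {
            Selection::Mask(m) => m,
            Selection::Selectors(s) => s
                .iter()
                .flat_map(|x| std::iter::repeat(!x.skip).take(x.row_count))
                .collect(),
        }
    }
}

fn main() {
    let from_index: Selection = vec![true, false, true].into();
    println!("selected rows: {}", from_index.selected_rows());
    println!("mask: {:?}", from_index.into_mask());
}
```

In this sketch the selector-backed variant keeps its current behavior, and only callers that construct the selection from a mask take the direct path.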
This is scoped to callers that already have a row-level bitmap (for example, one that comes from an external index). It should not be presented as a broad DataFusion / TPC-DS / ClickBench speedup by itself, because common DataFusion SQL paths do not currently produce bitmap-backed RowSelections.
Describe alternatives you've considered
Doing the conversion downstream doesn't help: the producer still has to call from_filters and pay the run-length encoding cost.
Additional context
No response