Allow RowSelection to be backed by a BooleanBuffer to reduce memory usage #10140

Description

@haohuaijin

Is your feature request related to a problem or challenge?

Some callers already have the selected rows as a BooleanBuffer. Today they still need to call RowSelection::from_filters, which converts that bitmap into a Vec<RowSelector>.

That can use much more memory. For example, with 35% isolated single-row hits, the selector form is about 11.2 bytes per input row. The bitmap is 1 bit per input row. On 500M rows, that is about 5.6 GB for selectors versus 62.5 MB for the bitmap.

The rough calculation is:

RowSelector = 16 bytes
selectors   = 500M rows * 35% hits * 2 selectors per isolated hit
            = 350M selectors
memory      = 350M * 16 bytes = 5.6 GB

bitmap      = 500M bits / 8 = 62.5 MB
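
The same arithmetic as a small Rust sketch, in case it helps to plug in other row counts or hit rates. The function names are made up for illustration, the 16-byte selector size is taken from the calculation above, and the "two selectors per hit" rule only holds when every hit is isolated.

```rust
/// Size of one selector in bytes, as assumed in the calculation above.
const SELECTOR_BYTES: u64 = 16;

/// Estimated bytes for the selector form when every hit is isolated:
/// each hit needs one "select" selector plus one "skip" selector.
/// `hit_percent` is a whole-number percentage (35 means 35%).
fn selector_bytes(rows: u64, hit_percent: u64) -> u64 {
    let hits = rows * hit_percent / 100;
    hits * 2 * SELECTOR_BYTES
}

/// Bytes for a bitmap with one bit per input row, rounded up to whole bytes.
fn bitmap_bytes(rows: u64) -> u64 {
    (rows + 7) / 8
}
```

For 500M rows at 35% hits, `selector_bytes` gives 5,600,000,000 bytes and `bitmap_bytes` gives 62,500,000 bytes, matching the figures above.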

The reader may then choose the Mask strategy and convert the selectors back into a bitmap. In that case the caller's bitmap makes a full round trip: bitmap to selectors, then selectors back to a bitmap.
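
To make the round trip concrete, here is a simplified model of it. `Selector`, `mask_to_selectors`, and `selectors_to_mask` are stand-ins written for this issue, not the parquet crate's types or functions, and a `Vec<bool>` stands in for the BooleanBuffer.

```rust
/// Simplified stand-in for a run-length selector: skip or select `row_count` rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Run-length encode a mask into selectors (what the caller pays for today).
fn mask_to_selectors(mask: &[bool]) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        // Extend the current run if it has the same kind, otherwise start a new one.
        if let Some(last) = out.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        out.push(Selector { row_count: 1, skip });
    }
    out
}

/// Expand selectors back into a mask (what a mask-based read path needs).
fn selectors_to_mask(selectors: &[Selector]) -> Vec<bool> {
    let mut mask = Vec::new();
    for s in selectors {
        mask.extend(std::iter::repeat(!s.skip).take(s.row_count));
    }
    mask
}
```

Feeding a mask through `mask_to_selectors` and then `selectors_to_mask` returns the original mask, so the intermediate selector vector is pure overhead whenever the reader ends up on the mask path.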

Describe the solution you'd like

A first-class mask backing on RowSelection:

let selection = RowSelection::from_boolean_buffer(buf);
let selection: RowSelection = buf.into();

When execution uses the Mask strategy, the reader can pass that buffer directly to the mask cursor instead of rebuilding a bitmap from selectors.

Existing from_filters / from_consecutive_ranges users are unchanged. Selector-backed paths remain available and are still better for long clustered runs and page-index-style selections.
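
One possible shape for this, as a rough sketch only. Everything here is hypothetical: `Selection`, `Run`, and `to_mask` are illustration names, `Vec<bool>` stands in for BooleanBuffer, and none of this is the parquet crate's current or planned API.

```rust
use std::borrow::Cow;

/// Hypothetical run-length selector: skip or select `row_count` rows.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Hypothetical selection with two backings.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Selection {
    /// Existing behaviour: run-length selectors.
    Selectors(Vec<Run>),
    /// Proposed: keep the caller's bitmap as-is.
    Mask(Vec<bool>),
}

/// Mirrors the proposed `buf.into()` conversion: no re-encoding happens here.
impl From<Vec<bool>> for Selection {
    fn from(mask: Vec<bool>) -> Self {
        Selection::Mask(mask)
    }
}

impl Selection {
    /// Mask for a mask-based cursor. A mask-backed selection is borrowed
    /// without copying; a selector-backed one is expanded on demand.
    fn to_mask(&self) -> Cow<'_, [bool]> {
        match self {
            Selection::Mask(mask) => Cow::Borrowed(mask.as_slice()),
            Selection::Selectors(runs) => {
                let mut mask = Vec::new();
                for run in runs {
                    mask.extend(std::iter::repeat(!run.skip).take(run.row_count));
                }
                Cow::Owned(mask)
            }
        }
    }
}
```

The point of the two variants is that the mask-backed case hands the cursor a borrowed bitmap, while selector-backed selections keep working and only pay for expansion if the mask path is actually chosen.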

This is scoped to callers that already have a row-level bitmap (for example, one that comes from an external index). It should not be presented as a broad DataFusion / TPC-DS / ClickBench speedup by itself, because common DataFusion SQL paths today generally do not produce bitmap-backed RowSelections.

Describe alternatives you've considered

Doing the conversion downstream doesn't help: the producer still has to call from_filters and pay the RLE encoding cost.

Additional context

No response

Labels: enhancement