support storing, filtering and retrieving non-key columns#444

Open
nyh wants to merge 6 commits into scylladb:master from nyh:alternator-filter

nyh commented May 9, 2026

Contributor

This series completes the filtering-column feature in the vector store:

  1. Enabling the vector-store to store - in memory - additional non-key columns, not just the key columns and vectors that we store today. This should support both CQL columns and Alternator attributes.
  2. Allowing ANN queries to pre-filter on these stored non-key column values, not just on the key columns.
  3. Optionally retrieving some or all of these stored columns with the search results.

Alternator vector search will use all three features. CQL's vector search can't yet use the third feature, but in the future it could: If the user's vector search only SELECTs columns which are stored in the vector index, then the entire request can be satisfied from the vector index (using the third feature) and CQL doesn't need to perform additional slow base-table queries.

The vector store already had the data-model bones for this feature: "Table" stored typed per-column values, indexes.rs tracked which columns an index declared as filterable, and the routing logic could prefer an
index whose filtering columns covered all restrictions in a query. What was missing was the actual plumbing: fetching the column values from ScyllaDB during backfill, extracting them from CDC change rows, forwarding
them through DbEmbedding, and storing them into "Table". Without that plumbing the filtering-column slots always contained the sentinel "not-present" value, so every restriction on a non-primary-key column
produced no results.

Patch 1 — support storing and filtering on Alternator scalar attributes

Wires up the full pipeline, motivated by Alternator's Projection=INCLUDE for vector indexes (NonKeyAttributes used as KeyConditionExpression pre-filters).

For Alternator, all non-key attributes live in a single :attrs map<blob,blob> column with no per-attribute type schema. The two common scalar encodings (S = 0x00 tag + UTF-8 bytes, N = 0x03 tag + CQL decimal) are stored verbatim as CqlValue::Blob so the type tag is preserved for comparison. cql_cmp gains two cross-type arms
(Blob, Text) and (Blob, Decimal) that decode the tag at comparison time, and httproutes.rs maps JSON string/number filter values to Text/Decimal for the HTTP API side.

The CQL path was already partially in place (the scan query and CDC consumer already selected the typed columns directly), but the values were never forwarded into Table::add(). That gap is also closed here as a side-effect.

Patch 2 — validator: test ANN filtering on a CQL non-primary-key column

Adds an end-to-end validator test that exercises the newly complete CQL path. A table with pk INT PRIMARY KEY, category INT, v VECTOR<FLOAT,3> gets a vector index with category declared as a filtering column. Six rows with alternating category 0/1 are inserted; a WHERE category = 0 ANN query must return exactly three rows. Verified to pass against a real ScyllaDB node.
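The shape of that test can be illustrated with a brute-force stand-in for the index. This is only a sketch under stated assumptions: `Row`, `filtered_nearest` and `sample_rows` are hypothetical names, and the real validator goes through CQL and the vector store rather than an in-memory scan.

```rust
/// One indexed row: primary key, a filtering column, and a 3-dimensional vector.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone)]
struct Row {
    pk: i32,
    category: i32,
    v: [f32; 3],
}

/// Squared Euclidean distance between two 3-dimensional vectors.
fn dist2(a: &[f32; 3], b: &[f32; 3]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force stand-in for a pre-filtered ANN query: keep only rows whose
/// `category` matches, then return up to `limit` primary keys, nearest first.
fn filtered_nearest(rows: &[Row], query: &[f32; 3], category: i32, limit: usize) -> Vec<i32> {
    let mut hits: Vec<(f32, i32)> = rows
        .iter()
        .filter(|r| r.category == category)
        .map(|r| (dist2(&r.v, query), r.pk))
        .collect();
    hits.sort_by(|a, b| a.0.total_cmp(&b.0));
    hits.into_iter().take(limit).map(|(_, pk)| pk).collect()
}

/// Six rows with alternating category 0/1, mirroring the data in the test.
fn sample_rows() -> Vec<Row> {
    (0..6)
        .map(|pk| Row { pk, category: pk % 2, v: [pk as f32, 0.0, 0.0] })
        .collect()
}
```

Calling `filtered_nearest(&sample_rows(), &[0.0, 0.0, 0.0], 0, 10)` keeps only the three rows with category 0, which is the property the validator test asserts.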

Patch 3 - ann: optionally return stored column values alongside primary keys

ann() can now optionally return some of the stored columns together with the item keys, which are always returned. Alternator uses this so that a search needing only the stored columns can complete without the slow extra step of a base-table lookup per result. CQL doesn't use it yet, but in the future it could when a SELECT only asks for the keys and filtering columns.

Patch 4 - cdc: fix incorrect vector deletion on Alternator partial updates

Fixes a pre-existing bug where we incorrectly assumed that if an Alternator attribute - such as the vector column itself - was missing from an update, it had been deleted (and when the vector column is deleted, the entire item is removed from the index). That assumption was wrong: CDC gives only deltas, so a given change might touch some unrelated attribute and not change the vector column at all. We need to correctly distinguish between elements of ":attrs" that were really deleted in the operation and elements that are missing from the delta simply because they were not changed. Luckily (it's not luck, obviously), CDC gives us exactly this information, which we forgot to use.
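A minimal sketch of that distinction, assuming a simplified view of a CDC delta row. `AttrsDelta`, `Change` and `classify` are hypothetical names for illustration and are not the types used in `db_cdc.rs`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// What a single CDC delta row says about one Alternator attribute.
#[derive(Debug, PartialEq)]
enum Change<T> {
    /// The delta does not mention the attribute: keep the stored value.
    Unchanged,
    /// The attribute was explicitly deleted by this operation.
    Deleted,
    /// The attribute was written with a new value.
    Set(T),
}

/// Simplified, hypothetical view of the parts of a delta row that matter here.
#[derive(Default)]
struct AttrsDelta {
    /// Attributes written by this operation (the new-values side of `:attrs`).
    written: BTreeMap<String, Vec<u8>>,
    /// Map keys listed as deleted elements of `:attrs`.
    deleted_elements: BTreeSet<String>,
    /// True if the whole `:attrs` collection was tombstoned (e.g. by PutItem).
    whole_collection_deleted: bool,
    /// True for a row or partition delete.
    row_deleted: bool,
}

/// Decide what happened to `attr`; absence from the delta alone means "unchanged".
fn classify<'a>(attr: &str, delta: &'a AttrsDelta) -> Change<&'a [u8]> {
    if delta.row_deleted {
        return Change::Deleted;
    }
    if let Some(value) = delta.written.get(attr) {
        return Change::Set(value.as_slice());
    }
    if delta.whole_collection_deleted || delta.deleted_elements.contains(attr) {
        return Change::Deleted;
    }
    Change::Unchanged
}
```

With this shape, an update that only removes attribute `x` classifies the vector attribute as `Unchanged`, so the item stays in the index.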

Patch 5 - fix a bug in try_to_json() which could lead to a crash if an unsupported CQL type was used as a filtering column.
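The idea behind that fix can be sketched as follows. This is an illustration under assumptions, not the real code: `Value` is a tiny stand-in for `CqlValue`, the error type is a plain `String`, and `json_string` is a hypothetical helper.

```rust
use std::net::IpAddr;

/// A tiny stand-in for the stored column value type (not the real `CqlValue`).
#[derive(Debug, Clone)]
enum Value {
    Int(i32),
    Text(String),
    Inet(IpAddr),
    /// Stand-in for any variant the converter does not support.
    Duration(i64),
}

/// Quote and escape a string as a JSON string literal.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Convert a value to JSON text, returning an error (never panicking)
/// for variants that have no conversion.
fn try_to_json(value: &Value) -> Result<String, String> {
    match value {
        Value::Int(i) => Ok(i.to_string()),
        Value::Text(s) => Ok(json_string(s)),
        // Inet is rendered in its standard textual form.
        Value::Inet(ip) => Ok(json_string(&ip.to_string())),
        other => Err(format!("unsupported value for JSON conversion: {:?}", other)),
    }
}
```

The point of the final arm is that an unexpected variant becomes an ordinary error for the caller to report, instead of taking the process down.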

nyh commented May 9, 2026

Contributor Author

Notes to reviewers:

  1. The first patch comes from an earlier pull request that Alternator needs - it will be removed in a rebase when it goes in from that other pull request. (DONE)
  2. This patch to the vector store (or something resembling it) is needed for allowing Alternator to "project" additional non-key columns to the index (this is analogous to the CQL "filtering columns" index-schema feature). The Alternator feature is in "Alternator vector search: Implement projected columns (for filtering and selecting)" (scylladb#29959).

nyh commented May 9, 2026

Contributor Author

While writing this patch, I discovered that the vector store is missing two very important features regarding "projected columns" (a.k.a. filtering columns):

  1. Although the code can store a list of these columns, as far as I can tell there was no code to actually use these columns in a filter. I think the code I added now for Alternator filtering is the first implementation in the vector store?
  2. Additionally, there doesn't seem to be any way to ask the vector store to return the stored projected column values for matching items. This is important for Alternator, which wants to be able to fetch these projected columns efficiently - without the slow extra step of reading all the matching items from the base table.

This patch is currently implementing 1, but not yet 2. I - or somebody else - will need to implement 2 as well, to completely support what Alternator needs. Right now the Alternator feature (scylladb/scylladb#29776) has an xfailing test because of this missing feature. (DONE BELOW)

Copilot AI left a comment

Pull request overview

Adds end-to-end support for associating scalar attribute values with each indexed embedding (especially for Alternator :attrs) and using them as pre-filter restrictions in ANN queries, by wiring extraction (scan + CDC), storage, routing metadata, and HTTP filter parsing/comparison semantics.

Changes:

  • Extend the embedding ingestion pipeline (range scan + CDC) to extract/store filtering column values (including Alternator :attrs['col'] scalars) alongside vectors.
  • Add Alternator-aware comparison semantics by preserving type-tagged blobs and supporting cross-type comparisons (Blob vs Text/Decimal) during filter evaluation.
  • Extend HTTP ANN filtering to accept restrictions on filtering columns (not just primary keys) and compute similarity scores from distances (plus a regression test).

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| crates/vector-store/tests/integration/usearch.rs | Adds regression test ensuring ANN similarity scores are decreasing and correctly derived from Euclidean distance. |
| crates/vector-store/tests/integration/db_basic.rs | Updates test scan embedding fixtures to include the new column_values field. |
| crates/vector-store/src/vector.rs | Defines Alternator type tags and adds helpers to extract scalar Alternator attribute blobs and scalar maps from :attrs. |
| crates/vector-store/src/table.rs | Stores per-row filtering column values and extends cql_cmp with Alternator blob-vs-text/decimal comparisons. |
| crates/vector-store/src/similarity.rs | Adjusts SimilarityScore conversions to be derived from Distance. |
| crates/vector-store/src/monitor_items.rs | Updates test fixtures to include the new column_values field. |
| crates/vector-store/src/lib.rs | Extends DbEmbedding with column_values: BTreeMap<ColumnName, CqlValue>. |
| crates/vector-store/src/indexes.rs | Propagates the selected index’s filtering columns through routing results. |
| crates/vector-store/src/httproutes.rs | Accepts filtering-column restrictions in ANN requests and maps Alternator attribute filter JSON to typed CqlValues; computes similarity scores from Distance. |
| crates/vector-store/src/db.rs | Extends target-option parsing to support fc and skips vector-type validation for Alternator keyspaces. |
| crates/vector-store/src/db_index.rs | Extends range scan SELECT to include filtering columns and returns them in DbEmbedding::column_values; injects Alternator filtering columns as Blob types. |
| crates/vector-store/src/db_index_backend.rs | Extends range scan query generation to include filtering columns for both CQL and Alternator backends. |
| crates/vector-store/src/db_cdc.rs | Extracts filtering column values from CDC rows (Alternator from :attrs, CQL from typed columns) into DbEmbedding::column_values. |

Comment thread crates/vector-store/src/httproutes.rs
Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/httproutes.rs

nyh commented May 10, 2026

Contributor Author

I'll push a new version fixing fmt and clippy errors. This is so sad for me to see these things. "fmt" didn't like the way the AI nicely formatted some lines (I have no idea what rule was even violated, it looked perfectly nice), and "clippy" didn't like a nested if and insisted that it must be rewritten with "&&". What nonsense :-(

@nyh nyh force-pushed the alternator-filter branch from 31f3b52 to 06329e5 on May 10, 2026 08:59

nyh commented May 10, 2026

Contributor Author

Clippy still doesn't accept this patch, and reports issues in code NOT included in this patch.
Despite my better judgment I'll include changes for those unrelated style issues :-(

@nyh nyh force-pushed the alternator-filter branch 2 times, most recently from 5ad183d to 28b97dc on May 10, 2026 09:12

nyh commented May 10, 2026

Contributor Author

Pushed a new version to fix a pre-existing test that started failing. The new paragraph from the commit message:

Two existing integration tests in `routing.rs` that previously expected
`BAD_REQUEST` for filter-column restrictions (with a TODO to update them
once end-to-end support landed) are updated to expect `OK` now that this
commit provides that support.

@nyh nyh force-pushed the alternator-filter branch from 28b97dc to 584bddb on May 10, 2026 10:20
@nyh nyh requested a review from Copilot May 10, 2026 10:21

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Comment thread crates/vector-store/src/table.rs
Comment thread crates/vector-store/src/table.rs
Comment thread crates/vector-store/src/vector.rs
Comment thread crates/vector-store/src/db_cdc.rs Outdated
@nyh nyh changed the title from "support storing and filtering on Alternator scalar attributes" to "support storing and filtering on non-key columns" on May 10, 2026
@nyh nyh force-pushed the alternator-filter branch from 7857ba4 to ef2349e on May 10, 2026 10:45
@nyh nyh requested review from QuerthDP and ewienik May 10, 2026 10:46

nyh commented May 10, 2026

Contributor Author

@ewienik @QuerthDP I called you to review this PR. It started as adding the features I needed for saving non-key filtering columns in Alternator (known as "projected columns" there), but then I realized - this is exactly what is also needed for CQL to allow filtering on non-key columns. And indeed, my PR also fixes CQL :-) I wonder what you think about that - you actually had a feature in CQL that was almost working, but it needed the small changes here to fully work.

Still missing is the feature to return the projected columns in ANN requests, along with the primary key returned. I need this feature in Alternator, so I'll add this next.

nyh commented May 10, 2026

Contributor Author

Still missing is the feature to return the projected columns in ANN requests, along with the primary key returned. I need this feature in Alternator, so I'll add this next.

Another AI session, and this missing feature is also done, as the third patch. ann() now has the ability to return some of the stored columns - if requested - together with the item keys, which are always returned. Alternator uses this to allow a search that only needs the stored columns to complete without the slow extra step of a base-table lookup per result.

@nyh nyh force-pushed the alternator-filter branch from 0a03676 to d5477f6 on May 10, 2026 13:34
@nyh nyh requested a review from Copilot May 10, 2026 13:34

Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Comment thread crates/vector-store/src/index/opensearch.rs
Comment thread crates/vector-store/src/httproutes.rs Outdated
Comment thread crates/httpapi/src/lib.rs Outdated
@nyh nyh marked this pull request as draft May 10, 2026 14:23
@nyh nyh changed the title from "support storing and filtering on non-key columns" to "support storing, filtering and retrieving non-key columns" on May 10, 2026
@nyh nyh marked this pull request as ready for review May 10, 2026 15:49
@nyh nyh force-pushed the alternator-filter branch from 8afd558 to b101e4a on May 11, 2026 09:06
@nyh nyh requested a review from Copilot May 11, 2026 09:06

Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

crates/vector-store/src/httproutes.rs:652

  • The handler only checks primary_keys.len() == distances.len(), but it doesn’t validate that column_values_per_row.len() matches the number of returned rows. If a backend returns a mismatched column_values_per_row (e.g., empty while keys/distances are non-empty), the response will contain misaligned/short column_values arrays. Consider adding a length check (or enforcing that backends return one entry per row whenever return_columns is non-empty) and returning 500 on mismatch.
        Ok((primary_keys, distances, column_values_per_row)) => {
            if primary_keys.len() != distances.len() {
                let msg = format!(
                    "wrong size of an ann response: number of primary_keys = {}, number of distances = {}",
                    primary_keys.len(),
                    distances.len()
                );
                debug!("post_index_ann: {msg}");
                (StatusCode::INTERNAL_SERVER_ERROR, msg).into_response()
            } else {
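
A minimal sketch of the extra check suggested in this comment, assuming the handler only needs the three lengths and whether `return_columns` was non-empty. `check_ann_lengths` is a hypothetical helper, not code from this pull request; the caller would map an `Err` to a 500 response.

```rust
/// Check that the parallel result arrays of an ANN response line up.
/// Returns a human-readable message describing the first mismatch found.
fn check_ann_lengths(
    primary_keys: usize,
    distances: usize,
    column_value_rows: usize,
    return_columns_requested: bool,
) -> Result<(), String> {
    if primary_keys != distances {
        return Err(format!(
            "wrong size of an ann response: number of primary_keys = {}, number of distances = {}",
            primary_keys, distances
        ));
    }
    // Only enforce one column-values entry per row when columns were requested.
    if return_columns_requested && column_value_rows != primary_keys {
        return Err(format!(
            "wrong size of an ann response: number of primary_keys = {}, number of column value rows = {}",
            primary_keys, column_value_rows
        ));
    }
    Ok(())
}
```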

Comment thread crates/vector-store/src/index/opensearch.rs
Comment thread crates/vector-store/src/httproutes.rs Outdated
Comment thread crates/httpapi/src/lib.rs Outdated
Comment thread crates/vector-store/src/db_cdc.rs Outdated
@nyh nyh force-pushed the alternator-filter branch 2 times, most recently from 74da2ac to 5c6600c on May 11, 2026 12:17
@nyh nyh requested review from knowack1 and smoczy123 May 11, 2026 12:25
Comment thread crates/vector-store/src/db.rs Outdated
Comment thread crates/vector-store/src/db.rs Outdated
Comment thread crates/vector-store/src/db_cdc.rs Outdated
Comment thread crates/vector-store/src/db_index.rs Outdated

#[test]
fn extract_scalars_non_map_returns_empty() {
let result = extract_alternator_scalars(&CqlValue::Int(42), &["col".into()]);

Collaborator

Looks like this code does not compile in this commit.

Contributor Author

Neither I nor my AI understood what you meant by this comment. The code does seem to compile. Maybe you mean if I try to compile this commit alone without the other commits this line will fail?

/// category values (0 and 1). Querying with `WHERE category = 0` must return
/// only the rows whose category is 0.
#[framed]
async fn ann_filter_by_non_pk_filtering_column(actors: TestActors) {

Collaborator

Should we have the same test for alternator API?


info!("finished");
}

Collaborator

It seems that this and the previous commit close VECTOR-419. This should be mentioned in the cover letter.
Preferably, I would split this PR into two:

  • PR 1: Commits 1–2 // Support for filtering on non-PK columns
  • PR 2: Commits 3–5 // Extended HTTP API

Comment thread api/openapi.json
"properties": {
"column_values": {
"type": "object",
"description": "Per-column stored values for the filtering columns requested in\n`return_columns`. Each entry maps a column name to a Vec of\n`Option<Value>` — one entry per returned nearest neighbour, in the\nsame order as `similarity_scores`. `None` means the value was not\npresent for that row (e.g. the attribute did not exist when the row\nwas indexed). Absent when `return_columns` was empty.",

@knowack1 knowack1 May 13, 2026

Collaborator

"Vec of Option<Value>" - this looks like some rust specifics, while this is description of HTTP API.

Comment thread api/openapi.json
"properties": {
"column_values": {
"type": "object",
"description": "Per-column stored values for the filtering columns requested in\n`return_columns`. Each entry maps a column name to a Vec of\n`Option<Value>` — one entry per returned nearest neighbour, in the\nsame order as `similarity_scores`. `None` means the value was not\npresent for that row (e.g. the attribute did not exist when the row\nwas indexed). Absent when `return_columns` was empty.",

Collaborator

What is "None" in json?

.await
.expect("failed to delete vector column");

// Wait for the index count to drop to 1.

Collaborator

This wait is not needed while we have later:
// ANN query should only return pk=2.


let index = create_index(CreateIndexQuery::new(&session, &clients, &table, "v")).await;

let status = wait_for_index(client, &index).await;

Collaborator

Instead, we can wait until the SELECT returns two vectors. Prefer the CQL API over the Vector Store HTTP API in validator tests to keep them more E2E-oriented.

Comment thread crates/vector-store/src/lib.rs Outdated
DbEmbedding {
primary_key,
embedding,
embedding: Some(embedding),

@knowack1 knowack1 May 13, 2026

Collaborator

Maybe it's worth considering making this struct more generic and keeping only column_values - putting the embedding in column_values as well. This could potentially also be useful for the upcoming full-text search changes.

ewienik commented May 14, 2026

Collaborator

@ewienik @QuerthDP I called you to review this PR. It started as adding the features I needed for saving non-key filtering columns in Alternator (known as "projected columns" there), but then I realized - this is exactly what is also needed for CQL to allow filtering on non-key columns. And indeed, my PR also fixes CQL :-) I wonder what you think about that - you actually had a feature in CQL that was almost working, but it needed the small changes here to fully work.

Still missing is the feature to return the projected columns in ANN requests, along with the primary key returned. I need this feature in Alternator, so I'll add this next.

We have a ticket for creating local indexes using non-primary-key columns and for filtering on non-primary-key columns: VECTOR-561. Currently you can filter by all columns from the primary key. We need to design a solution for full-scan and CDC updates of non-primary-key columns, take care of memory usage (all data is stored in RAM), and use timestamps to decide whether to apply or discard incoming data. I think the solution for using non-primary-key columns should be in separate PRs.

nyh commented May 19, 2026

Contributor Author

Pushed a new version rebased on the recent master (there have been a lot of changes to the existing code and the rebase was complicated, but the AI just did it automatically, no sweat :-)).
I have not yet addressed the review comments. I'll do it now.

@nyh nyh force-pushed the alternator-filter branch from 11ba254 to 663e968 on May 19, 2026 15:43

nyh commented May 19, 2026

Contributor Author

CI continues to fail on random network problems:

Error: Unable to download artifact(s): Failed to ListArtifacts: Received non-retryable error: Failed request: (403) Forbidden: Error from intermediary with HTTP status code 403 "Forbidden"

Whoever is responsible for these CI jobs, please fix it. I have no idea who to even ask.

nyh added 6 commits May 19, 2026 23:42
Add the ability to associate scalar attribute values with each indexed
embedding and use them as pre-filter restrictions in ANN queries.  This
is the vector-store side of Alternator's Projection=INCLUDE support for
vector indexes, where NonKeyAttributes are projected into the index and
can be used in KeyConditionExpression pre-filters.

The vector-store already had the data-model infrastructure for filtering
columns: `Table` tracked typed `Column` storage, `indexes.rs` tracked
which columns were available for filtering, and routing could prefer an
index whose filtering columns covered all the query restrictions.
However, the actual pipeline for fetching, storing, and comparing those
column values was not yet implemented.  This commit wires that pipeline
up, specifically for Alternator tables.

**What this commit adds**

`DbEmbedding` gains a `column_values: BTreeMap<ColumnName, CqlValue>`
field.  The backfill range-scan query is extended to select the relevant
`:attrs[col]` subscripts alongside the vector, and the CDC consumer is
extended to extract them from change rows.  `Table::add()` stores the
values via the existing `insert_cqlvalue` path, applying the same
timestamp guard as for vectors (only update when the incoming timestamp
is strictly newer than the stored one) and logging type-conversion errors
via `warn!` rather than silently dropping them.  `httproutes.rs` is
extended to accept restrictions on filtering columns in addition to
primary-key columns; the Alternator-specific Blob encoding path in the
filter-value converter is gated on `is_alternator` so that genuine CQL
blob columns keep the normal hex-string conversion.
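
The timestamp guard described above can be sketched like this. It is a simplified illustration: `ColumnStore` and `Cell` are hypothetical types, values are plain strings, and the real `Table::add()` stores typed column values.

```rust
use std::collections::BTreeMap;

/// A stored cell: the write timestamp and the value (None = tombstone).
#[derive(Debug, Clone, PartialEq)]
struct Cell {
    timestamp: i64,
    value: Option<String>,
}

/// Hypothetical per-column storage keyed by (row id, column name).
#[derive(Default)]
struct ColumnStore {
    cells: BTreeMap<(u64, String), Cell>,
}

impl ColumnStore {
    /// Apply a write only if it is strictly newer than what is stored.
    /// Returns true when the stored cell was changed.
    fn apply(&mut self, row: u64, column: &str, timestamp: i64, value: Option<String>) -> bool {
        let key = (row, column.to_string());
        let is_newer = self
            .cells
            .get(&key)
            .map_or(true, |cell| timestamp > cell.timestamp);
        if is_newer {
            self.cells.insert(key, Cell { timestamp, value });
        }
        is_newer
    }

    /// Current value of a column for a row, if present and not tombstoned.
    fn get(&self, row: u64, column: &str) -> Option<&str> {
        self.cells
            .get(&(row, column.to_string()))
            .and_then(|cell| cell.value.as_deref())
    }
}
```

Because the comparison is strict, replaying the same change twice, or receiving an older change late, leaves the stored value untouched.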

**Why Alternator is different from CQL**

For CQL tables, non-key columns that are declared as filtering columns
are ordinary typed CQL columns: they are fetched directly by the scan
query and arrive as `CqlValue` values whose type already matches the
column schema.  No special encoding or decoding is needed.

For Alternator tables there is no per-attribute schema.  All non-key
attributes are stored together in a single `:attrs map<blob,blob>` column,
with each value encoded as a 1-byte type tag followed by a type-specific
binary payload.  This commit handles the two common scalar encodings:

- S (tag 0x00): raw UTF-8 string bytes.
- N (tag 0x03): CQL decimal (4-byte big-endian scale + big-endian signed
  varint).
- NOT_SUPPORTED_YET (tag 0x04): JSON-wrapped {"S":...} or {"N":...}.

The values are stored verbatim as `CqlValue::Blob` (raw bytes, including
the type-tag byte) so that the comparator can later distinguish attribute
types.  Without the tag, the S-type string "1" and the N-type number 1
would be indistinguishable, causing the string "1" to incorrectly match
a numeric filter `< 5`, and numbers like 10 to sort before 5
lexicographically even though 10 > 5 numerically.

`cql_cmp` is extended with two cross-type cases that decode the blob at
comparison time:

- `(Blob, Text)`: blob must be S-type; the UTF-8 payload is compared
  lexicographically.  N-type blobs return None (type mismatch).
- `(Blob, Decimal)`: blob must be N-type; the CQL decimal payload is
  decoded into a `BigDecimal` and compared numerically.  S-type blobs
  return None (type mismatch).
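
A rough sketch of these two cases, using the tag values and payload layouts listed earlier in this message. It makes simplifying assumptions: the unscaled decimal value is limited to what fits in an `i128` (the real code uses `BigDecimal`), and `FilterValue`, `decode_decimal`, `cmp_decimal` and `cmp_blob` are hypothetical names.

```rust
use std::cmp::Ordering;

const TAG_S: u8 = 0x00;
const TAG_N: u8 = 0x03;

/// A filter literal: a string, or a decimal `unscaled * 10^(-scale)`.
enum FilterValue {
    Text(String),
    Decimal { unscaled: i128, scale: i32 },
}

/// Decode a CQL decimal payload: 4-byte big-endian scale followed by a
/// big-endian two's-complement integer. None if malformed or wider than i128.
fn decode_decimal(payload: &[u8]) -> Option<(i128, i32)> {
    if payload.len() < 5 || payload.len() > 4 + 16 {
        return None;
    }
    let scale = i32::from_be_bytes([payload[0], payload[1], payload[2], payload[3]]);
    let digits = &payload[4..];
    // Sign-extend the integer bytes to 16 bytes.
    let fill: u8 = if digits[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - digits.len()..].copy_from_slice(digits);
    Some((i128::from_be_bytes(buf), scale))
}

/// Compare two decimals numerically by bringing both to the larger scale.
fn cmp_decimal(a: (i128, i32), b: (i128, i32)) -> Option<Ordering> {
    let scale = a.1.max(b.1);
    let lift = |(unscaled, s): (i128, i32)| -> Option<i128> {
        let exp = scale.checked_sub(s)? as u32; // non-negative: scale is the max
        unscaled.checked_mul(10i128.checked_pow(exp)?)
    };
    Some(lift(a)?.cmp(&lift(b)?))
}

/// Cross-type comparison of a tagged blob against a filter literal.
/// A type mismatch (or an undecodable blob) yields None.
fn cmp_blob(blob: &[u8], filter: &FilterValue) -> Option<Ordering> {
    let (&tag, payload) = blob.split_first()?;
    match (tag, filter) {
        (TAG_S, FilterValue::Text(t)) => {
            Some(std::str::from_utf8(payload).ok()?.cmp(t.as_str()))
        }
        (TAG_N, FilterValue::Decimal { unscaled, scale }) => {
            cmp_decimal(decode_decimal(payload)?, (*unscaled, *scale))
        }
        _ => None,
    }
}
```

Returning `None` on a type mismatch is what keeps the S-type string "1" from matching a numeric filter, and decoding before comparing is what makes 10 sort after 5.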

`httproutes.rs` maps JSON string filter values to `CqlValue::Text` (for
S-type comparisons) and JSON number filter values to `CqlValue::Decimal`
(for N-type comparisons) when the column type is `Blob`.

`from_target_option()` is extended to read a `fc` (filtering columns)
list from the index target-option JSON, to default to a Global index type
when no `pk` field is present, and to skip column-type validation for
Alternator keyspaces (where the vector lives in `:attrs`, not a native
CQL column).

Two existing integration tests in `routing.rs` that previously expected
`BAD_REQUEST` for filter-column restrictions (with a TODO to update them
once end-to-end support landed) are updated to expect `OK` now that this
commit provides that support.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

The previous commit ("support storing and filtering on Alternator scalar
attributes") wired up the full filtering-column pipeline in `Table::add()`:
it introduced the `column_values` field on `DbEmbedding` and taught
`Table::add()` to store those values into the per-column `TValue` storage.

Although the commit was motivated by Alternator support, it also completed
the CQL path.  The backfill range-scan query and the CDC consumer already
selected and extracted non-primary-key filtering columns for CQL tables
(directly by column name, with no special encoding), but the values were
never forwarded to `Table::add()` and therefore never stored.  After that
commit the pipeline is complete end-to-end for CQL as well.

Add a validator test, `ann_filter_by_non_pk_filtering_column`, that
exercises this path:

- A table with `pk INT PRIMARY KEY, category INT, v VECTOR<FLOAT, 3>` is
  created with a vector index that declares `category` as a filtering
  column.
- Six rows are inserted with alternating `category` values (0 and 1).
- An ANN query with `WHERE category = 0` must return exactly the three
  rows whose category is 0.

The test was verified to pass end-to-end against a real ScyllaDB node.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Add support for the caller to request that the ANN response includes the
stored values of named filtering columns, in addition to the usual primary
keys and distances.

New API fields
--------------
* `PostIndexAnnRequest.return_columns` — an optional list of column names
  whose stored values should be returned.  Defaults to empty (no change in
  behaviour for existing callers).
* `PostIndexAnnResponse.column_values` — a map from column name to a
  per-result `Vec<Option<Value>>`, aligned with `similarity_scores`.  A
  `None` entry means the attribute was absent when the row was indexed.
  Omitted from the response when `return_columns` was empty.
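
The response shape described above can be sketched as a transposition from per-row values to per-column arrays. This is an illustration only: values are plain strings here, and `columns_by_name` is a hypothetical helper rather than the code in `httproutes.rs`.

```rust
use std::collections::BTreeMap;

/// Turn per-row stored values into the per-column shape of `column_values`:
/// column name -> one optional value per returned row, in result order.
/// Returns None when no columns were requested (the field is then omitted).
fn columns_by_name(
    rows: &[BTreeMap<String, String>],
    return_columns: &[String],
) -> Option<BTreeMap<String, Vec<Option<String>>>> {
    if return_columns.is_empty() {
        return None;
    }
    let mut out = BTreeMap::new();
    for column in return_columns {
        // A row that has no stored value for this column contributes None,
        // which keeps every array aligned with the result rows.
        let values: Vec<Option<String>> =
            rows.iter().map(|row| row.get(column).cloned()).collect();
        out.insert(column.clone(), values);
    }
    Some(out)
}
```

Every per-column array has exactly one entry per result row, so position `i` in each array describes the same neighbour as position `i` in `similarity_scores`.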

Why this matters
----------------
The primary motivation is efficient vector-search queries in Alternator
(ScyllaDB's DynamoDB-compatible API), although in the future this could be
used in CQL as well. Alternator makes it possible to "project" some non-key
attributes into the index, to allow them to be used in pre-filtering.
If the user can ask to return *only* these projected attributes or a
subset of them (Select = ALL_PROJECTED_ATTRIBUTES or SPECIFIC_ATTRIBUTES),
the query can be fulfilled entirely by the vector store - without needing
to read every item from the base table. We just need the vector store to
return the projected columns' values - which it already stores to allow
filtering.

Implementation
--------------
* `TableSearch::column_values_for()` — new trait method (and `Table`
  implementation) that retrieves stored `CqlValue`s for a set of column
  names given a primary ID.
* `usearch.rs` — threads `return_columns: Arc<[ColumnName]>` through both
  the `Ann` and `FilteredAnn` dispatch paths and their underlying free
  functions; calls `column_values_for()` for each result row and collects
  the values.
* `actor.rs` — extends `AnnR` from a 2-tuple to a 3-tuple
  `(Vec<PrimaryKey>, Vec<Distance>, Vec<BTreeMap<ColumnName, CqlValue>>)`
  and adds `return_columns` to the `Ann` and `FilteredAnn` actor messages.
* `httproutes.rs` — extracts `return_columns` from the request, passes it
  to `ann()`/`filtered_ann()`, converts the returned `CqlValue`s to JSON
  via `try_to_json()`, and populates `column_values` in the response.
* `opensearch.rs` — threads `return_columns` through to `ann()` and calls
  `column_values_for()` for each result row, same as usearch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Problem
-------
When an Alternator UpdateItem that only removes an attribute (e.g.
`REMOVE x`) was processed via CDC, the indexed item was silently removed
from the vector index entirely, even when x is not the vector attribute.

Root cause: the Alternator `:attrs` attribute map is stored as a single
CQL map column.  CDC delivers *delta* rows, not post-images: a delta row
for `REMOVE x` contains the map keys that were deleted in
`cdc$deleted_elements_:attrs`, while the new-values side of `:attrs` is
absent (NULL).  The old code treated a NULL `:attrs` in the CDC row as
"vector deleted" and emitted `embedding: None`, which caused the table
layer to remove the item from the index.

Why this was unnoticed before
-----------------------------
The bug has been present since Alternator support was added.  However,
none of the existing tests exercised the combination of (1) an item
that has a vector, (2) an UpdateItem REMOVE on a *non-vector* attribute,
and (3) a subsequent ANN query expecting the item to still be findable.
Only once projected/filtering columns were introduced did we write tests
that naturally trigger exactly this pattern (removing a filtering column
while expecting the vector to remain indexed), exposing the pre-existing
bug.

Fix
---
1. **Tri-state `embedding` field in `DbEmbedding`.**
   Changed `embedding: Option<Vector>` to `embedding: Option<Option<Vector>>`,
   where:
   - `None`          = vector unchanged; skip the index update
   - `Some(None)`    = vector was deleted; remove from index
   - `Some(Some(v))` = vector was set to `v`; add/update in index

2. **Correct CDC delta interpretation in `db_cdc.rs`.**
   The Alternator branch now reads:
   - `cdc$deleted_elements_:attrs` (keys explicitly deleted this event)
   - `cdc$deleted_:attrs` (whole collection tombstoned, e.g. PutItem)
   - `row.operation` (to detect RowDelete / PartitionDelete)

   Embedding action is only `Some(None)` (delete) when one of the above
   deletion signals applies to the vector attribute.  A NULL `:attrs` in
   a partial-update event now maps to `None` (skip), meaning "the vector
   was not touched by this operation".

3. **Correct filtering-column tombstoning.**
   `extract_alternator_scalars` no longer automatically tombstones absent
   filtering columns (absence in a delta = unchanged, not deleted).  The
   CDC handler now explicitly tombstones only columns that appear in
   `cdc$deleted_elements_:attrs`, or all columns when `whole_col_deleted`
   is true and the column was not set in the new delta.

4. **Table layer `add()` handles the tri-state.**
   The `add()` method branches on `Some(None)` -> remove, `Some(Some(v))`
   -> upsert, and `None` -> no-op, and all call sites and tests were
   updated accordingly.
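
As a rough sketch of the tri-state handling in items 1 and 4, assuming a much-reduced `DbEmbedding` and a hypothetical `Index` type backed by a `HashMap` (the real table layer and its `add()` signature are more involved):

```rust
use std::collections::HashMap;

type Vector = Vec<f32>;

/// Reduced stand-in for `DbEmbedding`; only the fields needed here.
struct DbEmbedding {
    key: u64,
    /// None = unchanged, Some(None) = deleted, Some(Some(v)) = set to v.
    embedding: Option<Option<Vector>>,
}

/// Hypothetical in-memory index keyed by a simplified primary key.
#[derive(Default)]
struct Index {
    vectors: HashMap<u64, Vector>,
}

impl Index {
    /// Branch on the tri-state: no-op, remove, or upsert.
    fn add(&mut self, item: DbEmbedding) {
        match item.embedding {
            // Vector untouched by this event: keep whatever is indexed.
            None => {}
            // Vector explicitly deleted: drop the item from the index.
            Some(None) => {
                self.vectors.remove(&item.key);
            }
            // Vector set to a new value: insert or overwrite.
            Some(Some(v)) => {
                self.vectors.insert(item.key, v);
            }
        }
    }
}
```

With this shape, an event that only touches a filtering column arrives with `embedding: None` and leaves the indexed vector alone, which is the behavior the old two-state `Option<Vector>` could not express.
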
Previously try_to_json ended with `_ => unimplemented!()`, which causes
the server process to abort with a panic if any stored column value
happens to be a CqlValue variant not explicitly handled (e.g. Inet).

This was harmless when try_to_json was only called for primary-key
columns, because primary keys are restricted to a small set of types in
practice.  With the new return_columns plumbing, try_to_json is now also
applied to arbitrary filtering-column values returned by the index.
Filtering columns can be of any NativeType supported by Table::new,
including Inet, so the panic path is now reachable from a normal query.

Fix by:
- Adding explicit handling for CqlValue::Inet (formatted as its standard
  string representation, e.g. "192.168.1.1" or "::1").
- Replacing the wildcard arm with a bail!() so any currently-unhandled
  variant returns a regular error to the caller instead of aborting the
  server.

Test coverage: added assertions for IPv4, IPv6, and the error path
(Counter, a variant that is not stored in Table but proves the graceful
fallback works) to the existing try_to_json_conversion test.
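
The shape of the fix can be sketched as below. This is an illustration under assumptions: `Value` and `Json` are small stand-in enums rather than the driver's `CqlValue` and a real JSON type, and the error is a plain `String` where the patch uses `bail!()`.

```rust
use std::net::IpAddr;

/// Stand-in for a few `CqlValue` variants; not the driver's real enum.
#[derive(Debug)]
enum Value {
    Int(i32),
    Text(String),
    Inet(IpAddr),
    Counter(i64),
}

/// Stand-in for a JSON value, kept minimal for the example.
#[derive(Debug, PartialEq)]
enum Json {
    Number(i64),
    String(String),
}

/// Convert a value to JSON; unhandled variants yield an error, not a panic.
fn try_to_json(value: &Value) -> Result<Json, String> {
    match value {
        Value::Int(i) => Ok(Json::Number(i64::from(*i))),
        Value::Text(s) => Ok(Json::String(s.clone())),
        // Inet uses the address's standard textual form.
        Value::Inet(addr) => Ok(Json::String(addr.to_string())),
        // Previously `unimplemented!()`; now a regular error for the caller.
        other => Err(format!("unsupported value: {other:?}")),
    }
}
```

A caller can then map the `Err` to an HTTP error response instead of losing the whole server process.
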
The engine's main loop syncs each index's status from node_state into
the engine's internal cache every second (hardcoded CHECK_INTERVAL).
This cache is what the get_index_status HTTP endpoint reads, which is
what Scylla calls to determine whether a vector index is ready.

Because this interval was hardcoded to 1s, a newly-built index would
continue to be reported as CREATING for up to a second after the
prefill scan finished and node_state marked it as Serving. This caused
tests that wait for IndexStatus==ACTIVE to take at least ~1 second even
when the actual indexing completed in milliseconds.

Expose the interval as VECTOR_STORE_INDEX_STATUS_UPDATE_INTERVAL
(default: 1s, matching the previous hardcoded value) so test
environments can set it to a lower value (e.g. 100ms) to reduce test
latency.
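
A minimal sketch of reading such a setting, assuming the value is written as an integer followed by `ms` or `s` (the accepted syntax is an assumption here, and `parse_interval` and `index_status_update_interval` are hypothetical helper names, not functions from the patch):

```rust
use std::time::Duration;

/// Default matching the previously hardcoded 1s interval.
const DEFAULT_INTERVAL: Duration = Duration::from_secs(1);

/// Parse values such as "100ms" or "1s"; fall back to the default otherwise.
/// The accepted syntax is an assumption made for this example.
fn parse_interval(raw: Option<&str>) -> Duration {
    let Some(raw) = raw.map(str::trim) else {
        return DEFAULT_INTERVAL;
    };
    // Check "ms" before "s", since "100ms" also ends with 's'.
    let parsed = if let Some(ms) = raw.strip_suffix("ms") {
        ms.trim().parse::<u64>().ok().map(Duration::from_millis)
    } else if let Some(secs) = raw.strip_suffix('s') {
        secs.trim().parse::<u64>().ok().map(Duration::from_secs)
    } else {
        None
    };
    parsed.unwrap_or(DEFAULT_INTERVAL)
}

/// Read the interval from the environment variable named in this patch.
fn index_status_update_interval() -> Duration {
    let raw = std::env::var("VECTOR_STORE_INDEX_STATUS_UPDATE_INTERVAL").ok();
    parse_interval(raw.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback behavior easy to test without mutating process-wide state.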

For example, before this patch, Alternator's vector test suite (in
the Scylla core repository) took 57 seconds; with this patch, it is
down to 28 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
@nyh

nyh commented May 19, 2026

Contributor Author

I've addressed the comments in the first commit, but still need to address the rest and probably split this PR into two as the reviewer asked. I'll continue doing this tomorrow. Even with the AI, it takes a lot of wall-clock time to do such patch reorganizations (I don't even want to think what the Copilot charges for this kind of work will be starting June 1st...).

@nyh nyh force-pushed the alternator-filter branch from 663e968 to 9c57758 Compare May 19, 2026 20:46
@knowack1

Collaborator

> I've addressed the comments in the first commit, but still need to address the rest and probably split this PR into two as the reviewer asked. I'll continue doing this tomorrow. Even with the AI, it takes a lot of wall-clock time to do such patch reorganizations (I don't even want to think what the Copilot charges for this kind of work will be starting June 1st...).

History reorganization is indeed very time-consuming, both for an AI and for a human. I often doubt whether it is worth spending this time just to keep a perfect history of changes. I do not think the investment in a perfect history is ever returned by later, more seamless investigations, code reviews, or easier backporting - but this is only my feeling, not backed by metrics.

However, in this specific case, this patch actually covers two features that could be delivered one by one:

  • Support for filtering columns (the missing feature).
  • Alternator support for provisioning (filtering columns).

Still, I am fine with keeping this in a single PR; it just needs to be referenced accordingly.

@ewienik

ewienik commented May 21, 2026

Collaborator

> However, in this specific case, this patch actually covers two features that could be delivered one by one:
>
> * Support for filtering columns (the missing feature).
> * Alternator support for provisioning (filtering columns).
>
> Still, I am fine with keeping this in a single PR; it just needs to be referenced accordingly.

Support for filtering columns depends on #449 and VECTOR-561. I thought that #449 could be done quite fast, but it seems there are more problems, similar to those of supporting vector search local indexes based on non-primary-key columns (a problem similar to MV). Both PRs try to extend DbEmbedding with a map or vector of values from columns, which seems to be a challenge. Also, VECTOR-561 is going to extend this struct in a similar way. I think the best way to tackle this is to refactor vector-store under the VECTOR-561 issue to allow filtering on custom columns and, in the next steps, allow multi-column targets and local indexes on non-primary-key columns.
