support storing, filtering and retrieving non-key columns by nyh · Pull Request #444 · scylladb/vector-store

nyh · 2026-05-09T19:53:48Z

This series completes the filtering-column feature in the vector store:

1. Enabling the vector store to keep, in memory, additional non-key columns, not just the key columns and vectors that we store today. This should support both CQL columns and Alternator attributes.
2. Allowing ANN queries to pre-filter on these stored non-key column values, not just on the key columns.
3. Optionally retrieving some or all of these stored columns with the search results.

Alternator vector search will use all three features. CQL's vector search can't yet use the third feature, but in the future it could: If the user's vector search only SELECTs columns which are stored in the vector index, then the entire request can be satisfied from the vector index (using the third feature) and CQL doesn't need to perform additional slow base-table queries.

The vector store already had the data-model bones for this feature: "Table" stored typed per-column values, indexes.rs tracked which columns an index declared as filterable, and the routing logic could prefer an index whose filtering columns covered all restrictions in a query. What was missing was the actual plumbing: fetching the column values from ScyllaDB during backfill, extracting them from CDC change rows, forwarding them through DbEmbedding, and storing them into "Table". Without that plumbing the filtering-column slots always contained the sentinel "not-present" value, so every restriction on a non-primary-key column produced no results.

Patch 1 — support storing and filtering on Alternator scalar attributes

Wires up the full pipeline, motivated by Alternator's Projection=INCLUDE for vector indexes (NonKeyAttributes used as KeyConditionExpression pre-filters).

For Alternator, all non-key attributes live in a single :attrs map<blob,blob> column with no per-attribute type schema. The two common scalar encodings (S = 0x00 tag + UTF-8 bytes, N = 0x03 tag + CQL decimal) are stored verbatim as CqlValue::Blob so the type tag is preserved for comparison. cql_cmp gains two cross-type arms, (Blob, Text) and (Blob, Decimal), that decode the tag at comparison time, and httproutes.rs maps JSON string/number filter values to Text/Decimal for the HTTP API side.

The CQL path was already partially in place (the scan query and CDC consumer already selected the typed columns directly), but the values were never forwarded into Table::add(). That gap is also closed here as a side-effect.

Patch 2 — validator: test ANN filtering on a CQL non-primary-key column

Adds an end-to-end validator test that exercises the newly complete CQL path. A table with pk INT PRIMARY KEY, category INT, v VECTOR<FLOAT,3> gets a vector index with category declared as a filtering column. Six rows with alternating category 0/1 are inserted; a WHERE category = 0 ANN query must return exactly three rows. Verified to pass against a real ScyllaDB node.

Patch 3 - ann: optionally return stored column values alongside primary keys

ann() is given the ability to optionally return some of the stored columns together with the item keys, which are always returned. Alternator uses this to allow a search that only needs the stored columns to complete without the slow extra step of a base-table lookup per result. CQL doesn't use it yet, but in the future it could, when a SELECT only asks for the keys and filtering columns.

Patch 4 - cdc: fix incorrect vector deletion on Alternator partial updates

Fix a pre-existing bug where we incorrectly assumed that if an Alternator attribute - such as the vector column itself - was missing in an update, it meant it was deleted (and when the vector column is deleted, the entire item is deleted). That assumption was not true: CDC gives only deltas, so a certain change might be to some unrelated attribute and not change the vector column at all. We need to correctly distinguish between elements of ":attrs" that were really deleted in the operation and elements which are missing from the delta simply because they were not changed. Luckily (it's not luck, obviously), CDC gives us exactly this information, which we forgot to use.

Patch 5 - fix a bug in try_to_json() which could lead to a crash if an unsupported CQL type was used as a filtering column.

nyh · 2026-05-09T19:55:59Z

Notes to reviewers:

The first patch comes from an earlier pull request that Alternator needs - it will be removed in a rebase when it goes in from that other pull request. (DONE)
This patch to the vector store (or something resembling it) is needed for allowing Alternator to "project" additional non-key columns to the index (this is analogous to the CQL "filtering columns" index-schema feature). The Alternator feature is in Alternator vector search: Implement projected columns (for filtering and selecting) scylladb#29959.

nyh · 2026-05-09T20:20:10Z

While writing this patch, I discovered that the vector store is missing two very important features regarding "projected columns" (a.k.a. filtering columns):

1. Although the code can store a list of these columns, as far as I can tell there was no code to actually use these columns in a filter. I think the code I added now for Alternator filtering is the first implementation in the vector store?
2. Additionally, there doesn't seem to be any way to ask the vector store to return the stored projected column values for matching columns. This is important for Alternator, which wants to be able to fetch these projected columns efficiently, without the slow extra step of reading all the matching items from the base table.

This patch is currently implementing 1, but not yet 2. I - or somebody else - will need to implement 2 as well, to completely support what Alternator needs. Right now the Alternator feature (scylladb/scylladb#29776) has an xfailing test because of this missing feature. (DONE BELOW)

Copilot

Pull request overview

Adds end-to-end support for associating scalar attribute values with each indexed embedding (especially for Alternator :attrs) and using them as pre-filter restrictions in ANN queries, by wiring extraction (scan + CDC), storage, routing metadata, and HTTP filter parsing/comparison semantics.

Changes:

- Extend the embedding ingestion pipeline (range scan + CDC) to extract/store filtering column values (including Alternator :attrs['col'] scalars) alongside vectors.
- Add Alternator-aware comparison semantics by preserving type-tagged blobs and supporting cross-type comparisons (Blob vs Text/Decimal) during filter evaluation.
- Extend HTTP ANN filtering to accept restrictions on filtering columns (not just primary keys) and compute similarity scores from distances (plus a regression test).

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| crates/vector-store/tests/integration/usearch.rs | Adds regression test ensuring ANN similarity scores are decreasing and correctly derived from Euclidean distance. |
| crates/vector-store/tests/integration/db_basic.rs | Updates test scan embedding fixtures to include the new `column_values` field. |
| crates/vector-store/src/vector.rs | Defines Alternator type tags and adds helpers to extract scalar Alternator attribute blobs and scalar maps from `:attrs`. |
| crates/vector-store/src/table.rs | Stores per-row filtering column values and extends `cql_cmp` with Alternator blob-vs-text/decimal comparisons. |
| crates/vector-store/src/similarity.rs | Adjusts `SimilarityScore` conversions to be derived from `Distance`. |
| crates/vector-store/src/monitor_items.rs | Updates test fixtures to include the new `column_values` field. |
| crates/vector-store/src/lib.rs | Extends `DbEmbedding` with `column_values: BTreeMap<ColumnName, CqlValue>`. |
| crates/vector-store/src/indexes.rs | Propagates the selected index’s filtering columns through routing results. |
| crates/vector-store/src/httproutes.rs | Accepts filtering-column restrictions in ANN requests and maps Alternator attribute filter JSON to typed `CqlValue`s; computes similarity scores from `Distance`. |
| crates/vector-store/src/db.rs | Extends target-option parsing to support `fc` and skips vector-type validation for Alternator keyspaces. |
| crates/vector-store/src/db_index.rs | Extends range scan SELECT to include filtering columns and returns them in `DbEmbedding::column_values`; injects Alternator filtering columns as `Blob` types. |
| crates/vector-store/src/db_index_backend.rs | Extends range scan query generation to include filtering columns for both CQL and Alternator backends. |
| crates/vector-store/src/db_cdc.rs | Extracts filtering column values from CDC rows (Alternator from `:attrs`, CQL from typed columns) into `DbEmbedding::column_values`. |

nyh · 2026-05-10T08:58:32Z

I'll push a new version fixing fmt and clippy errors. This is so sad for me to see these things. "fmt" didn't like the way the AI nicely formatted some lines (I have no idea what rule was even violated, it looked perfectly nice), and "clippy" didn't like a nested if and insisted that it must be rewritten with "&&". What nonsense :-(

nyh · 2026-05-10T09:02:14Z

Clippy still doesn't accept this patch, and reports issues in code NOT included in this patch.
Despite my better judgement I'll include changes for those unrelated style issues :-(

nyh · 2026-05-10T09:13:29Z

Pushed a new version to fix a pre-existing test that started failing. The new paragraph from the commit message:

Two existing integration tests in `routing.rs` that previously expected
`BAD_REQUEST` for filter-column restrictions (with a TODO to update them
once end-to-end support landed) are updated to expect `OK` now that this
commit provides that support.

nyh · 2026-05-10T10:49:28Z

@ewienik @QuerthDP I called you to review this PR. It started as adding the features I needed for saving non-key filtering columns in Alternator (known as "projected columns" there), but then I realized - this is exactly what is also needed for CQL to allow filtering on non-key columns. And indeed, my PR also fixes CQL :-) I wonder what you think about that - you actually had a feature in CQL that was almost working, but it needed the small changes here to fully work.

Still missing is the feature to return the projected columns in ANN requests, along with the primary key returned. I need this feature in Alternator, so I'll add this next.

nyh · 2026-05-10T13:11:15Z

Still missing is the feature to return the projected columns in ANN requests, along with the primary key returned. I need this feature in Alternator, so I'll add this next.

Another AI session, and this missing feature is also done, as the last (third) patch. ann() now has the ability to return some of the stored columns - if requested - together with the item keys, which are always returned. Alternator uses this to allow a search that only needs the stored columns to complete without the slow extra step of a base-table lookup per result.

Copilot

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

crates/vector-store/src/httproutes.rs:652

The handler only checks primary_keys.len() == distances.len(), but it doesn’t validate that column_values_per_row.len() matches the number of returned rows. If a backend returns a mismatched column_values_per_row (e.g., empty while keys/distances are non-empty), the response will contain misaligned/short column_values arrays. Consider adding a length check (or enforcing that backends return one entry per row whenever return_columns is non-empty) and returning 500 on mismatch.

        Ok((primary_keys, distances, column_values_per_row)) => {
            if primary_keys.len() != distances.len() {
                let msg = format!(
                    "wrong size of an ann response: number of primary_keys = {}, number of distances = {}",
                    primary_keys.len(),
                    distances.len()
                );
                debug!("post_index_ann: {msg}");
                (StatusCode::INTERNAL_SERVER_ERROR, msg).into_response()
            } else {
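
The extra length check suggested in this comment could look roughly like the sketch below. It is only an illustration: `check_ann_lengths` and its parameters are hypothetical names, not code from this PR, and the policy of checking the column rows only when columns were requested is an assumption.

```rust
/// Hypothetical helper (not from the PR): verifies that the per-row result
/// vectors of an ANN response are aligned before they are zipped together.
///
/// `keys`, `distances` and `column_rows` are the lengths of the three
/// vectors; `columns_requested` says whether `return_columns` was non-empty.
fn check_ann_lengths(
    keys: usize,
    distances: usize,
    column_rows: usize,
    columns_requested: bool,
) -> Result<(), String> {
    if keys != distances {
        return Err(format!(
            "wrong size of an ann response: number of primary_keys = {keys}, number of distances = {distances}"
        ));
    }
    // Assumption: one column-value entry per returned row is required only
    // when the caller actually asked for columns.
    if columns_requested && column_rows != keys {
        return Err(format!(
            "wrong size of an ann response: number of primary_keys = {keys}, number of column value rows = {column_rows}"
        ));
    }
    Ok(())
}
```

A handler could map the `Err` case to a 500 response, in the same way the existing keys/distances mismatch is handled.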

knowack1 · 2026-05-13T08:42:30Z

+
+    #[test]
+    fn extract_scalars_non_map_returns_empty() {
+        let result = extract_alternator_scalars(&CqlValue::Int(42), &["col".into()]);


Looks like this code does not compile in this commit.

Neither I nor my AI understood what you meant by this comment. The code does seem to compile. Maybe you mean if I try to compile this commit alone without the other commits this line will fail?

knowack1 · 2026-05-13T09:13:54Z

+/// category values (0 and 1).  Querying with `WHERE category = 0` must return
+/// only the rows whose category is 0.
+#[framed]
+async fn ann_filter_by_non_pk_filtering_column(actors: TestActors) {


Should we have the same test for alternator API?

knowack1 · 2026-05-13T09:20:45Z


    info!("finished");
 }
+


It seems that this and the previous commit close VECTOR-419. This should be mentioned in the cover letter.
Preferably, I would split this PR into two:

PR 1: Commits 1–2 // Support for filtering on non-PK columns

PR 2: Commits 3–5 // Extended HTTP API

knowack1 · 2026-05-13T09:26:26Z

        "properties": {
+          "column_values": {
+            "type": "object",
+            "description": "Per-column stored values for the filtering columns requested in\n`return_columns`.  Each entry maps a column name to a Vec of\n`Option<Value>` — one entry per returned nearest neighbour, in the\nsame order as `similarity_scores`.  `None` means the value was not\npresent for that row (e.g. the attribute did not exist when the row\nwas indexed).  Absent when `return_columns` was empty.",


"Vec of Option<Value>" - this looks like some rust specifics, while this is description of HTTP API.

knowack1 · 2026-05-13T09:27:15Z

        "properties": {
+          "column_values": {
+            "type": "object",
+            "description": "Per-column stored values for the filtering columns requested in\n`return_columns`.  Each entry maps a column name to a Vec of\n`Option<Value>` — one entry per returned nearest neighbour, in the\nsame order as `similarity_scores`.  `None` means the value was not\npresent for that row (e.g. the attribute did not exist when the row\nwas indexed).  Absent when `return_columns` was empty.",


What is "None" in json?

knowack1 · 2026-05-13T09:39:54Z

+        .await
+        .expect("failed to delete vector column");
+
+    // Wait for the index count to drop to 1.


This wait is not needed while we have later:
// ANN query should only return pk=2.

knowack1 · 2026-05-13T09:41:49Z

+
+    let index = create_index(CreateIndexQuery::new(&session, &clients, &table, "v")).await;
+
+    let status = wait_for_index(client, &index).await;


Instead, we can wait until the SELECT returns two vectors. Prefer the CQL API over the Vector Store HTTP API in validator tests to keep them more E2E-oriented.

knowack1 · 2026-05-13T11:15:17Z

                        DbEmbedding {
                            primary_key,
-                            embedding,
+                            embedding: Some(embedding),


Maybe it's worth considering making this struct more generic and keeping only column_values, with the embedding also stored in column_values. This could potentially be useful for the upcoming full-text search changes as well.

ewienik · 2026-05-14T10:23:10Z

@ewienik @QuerthDP I called you to review this PR. It started as adding the features I needed for saving non-key filtering columns in Alternator (known as "projected columns" there), but then I realized - this is exactly what is also needed for CQL to allow filtering on non-key columns. And indeed, my PR also fixes CQL :-) I wonder what you think about that - you actually had a feature in CQL that was almost working, but it needed the small changes here to fully work.

Still missing is the feature to return the projected columns in ANN requests, along with the primary key returned. I need this feature in Alternator, so I'll add this next.

We have a ticket for creating local indexes using non-primary-key columns and filtering columns from non-primary-key columns: VECTOR-561. Currently you can filter by all columns from the primary key. We need to design a solution for the full scan and CDC updates of non-primary-key columns, take care of memory usage (all data is stored in RAM), and use timestamps to modify or discard incoming data. I think the solution for using non-primary-key columns should be in separate PRs.

nyh · 2026-05-19T15:37:10Z

Pushed a new version rebased on the recent master (there have been a lot of changes to the existing code and the rebase was complicated, but the AI just did it automatically, no sweat :-)).
I have not yet addressed the review comments. I'll do it now.

nyh · 2026-05-19T19:29:50Z

CI continues to fail on random network problems:

Error: Unable to download artifact(s): Failed to ListArtifacts: Received non-retryable error: Failed request: (403) Forbidden: Error from intermediary with HTTP status code 403 "Forbidden"

Whoever is responsible for these CI jobs, please fix it. I have no idea who to even ask.

Add the ability to associate scalar attribute values with each indexed embedding and use them as pre-filter restrictions in ANN queries. This is the vector-store side of Alternator's Projection=INCLUDE support for vector indexes, where NonKeyAttributes are projected into the index and can be used in KeyConditionExpression pre-filters.

The vector-store already had the data-model infrastructure for filtering columns: `Table` tracked typed `Column` storage, `indexes.rs` tracked which columns were available for filtering, and routing could prefer an index whose filtering columns covered all the query restrictions. However, the actual pipeline for fetching, storing, and comparing those column values was not yet implemented. This commit wires that pipeline up, specifically for Alternator tables.

**What this commit adds**

`DbEmbedding` gains a `column_values: BTreeMap<ColumnName, CqlValue>` field. The backfill range-scan query is extended to select the relevant `:attrs[col]` subscripts alongside the vector, and the CDC consumer is extended to extract them from change rows.

`Table::add()` stores the values via the existing `insert_cqlvalue` path, applying the same timestamp guard as for vectors (only update when the incoming timestamp is strictly newer than the stored one) and logging type-conversion errors via `warn!` rather than silently dropping them.

`httproutes.rs` is extended to accept restrictions on filtering columns in addition to primary-key columns; the Alternator-specific Blob encoding path in the filter-value converter is gated on `is_alternator` so that genuine CQL blob columns keep the normal hex-string conversion.

**Why Alternator is different from CQL**

For CQL tables, non-key columns that are declared as filtering columns are ordinary typed CQL columns: they are fetched directly by the scan query and arrive as `CqlValue` values whose type already matches the column schema. No special encoding or decoding is needed.

For Alternator tables there is no per-attribute schema. All non-key attributes are stored together in a single `:attrs map<blob,blob>` column, with each value encoded as a 1-byte type tag followed by a type-specific binary payload. This commit handles the two common scalar encodings:

- S (tag 0x00): raw UTF-8 string bytes.
- N (tag 0x03): CQL decimal (4-byte big-endian scale + big-endian signed varint).
- NOT_SUPPORTED_YET (tag 0x04): JSON-wrapped {"S":...} or {"N":...}.

The values are stored verbatim as `CqlValue::Blob` (raw bytes, including the type-tag byte) so that the comparator can later distinguish attribute types. Without the tag, the S-type string "1" and the N-type number 1 would be indistinguishable, causing the string "1" to incorrectly match a numeric filter `< 5`, and numbers like 10 to sort before 5 lexicographically even though 10 > 5 numerically.

`cql_cmp` is extended with two cross-type cases that decode the blob at comparison time:

- `(Blob, Text)`: blob must be S-type; the UTF-8 payload is compared lexicographically. N-type blobs return None (type mismatch).
- `(Blob, Decimal)`: blob must be N-type; the CQL decimal payload is decoded into a `BigDecimal` and compared numerically. S-type blobs return None (type mismatch).

`httproutes.rs` maps JSON string filter values to `CqlValue::Text` (for S-type comparisons) and JSON number filter values to `CqlValue::Decimal` (for N-type comparisons) when the column type is `Blob`.

`from_target_option()` is extended to read a `fc` (filtering columns) list from the index target-option JSON, to default to a Global index type when no `pk` field is present, and to skip column-type validation for Alternator keyspaces (where the vector lives in `:attrs`, not a native CQL column).

Two existing integration tests in `routing.rs` that previously expected `BAD_REQUEST` for filter-column restrictions (with a TODO to update them once end-to-end support landed) are updated to expect `OK` now that this commit provides that support.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
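
The tag-aware comparison described in this commit message can be sketched as follows. This is a simplified illustration and not the PR's `cql_cmp`: `FilterValue`, `cmp_tagged_blob` and the helpers are hypothetical names, and the decimal is held in an `i128` here instead of a `BigDecimal`, so very large numbers simply yield `None`.

```rust
use std::cmp::Ordering;

// Type tags as listed in the commit message.
const TAG_S: u8 = 0x00;
const TAG_N: u8 = 0x03;

/// Hypothetical stand-in for the filter side (CqlValue::Text / CqlValue::Decimal).
enum FilterValue<'a> {
    Text(&'a str),
    /// Represents unscaled * 10^(-scale).
    Decimal { unscaled: i128, scale: i32 },
}

/// Decodes a big-endian two's-complement integer of 1 to 16 bytes.
fn decode_be_signed(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Sign-extend into a 16-byte buffer.
    let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}

/// Compares a * 10^(-sa) with b * 10^(-sb) by rescaling both to the larger scale.
fn cmp_decimal(a: i128, sa: i32, b: i128, sb: i32) -> Option<Ordering> {
    let scale = sa.max(sb);
    let pow_a = 10i128.checked_pow(u32::try_from(scale.checked_sub(sa)?).ok()?)?;
    let pow_b = 10i128.checked_pow(u32::try_from(scale.checked_sub(sb)?).ok()?)?;
    Some(a.checked_mul(pow_a)?.cmp(&b.checked_mul(pow_b)?))
}

/// Compares a type-tagged Alternator blob with a typed filter value.
/// Returns None on a type mismatch or an undecodable payload.
fn cmp_tagged_blob(blob: &[u8], filter: &FilterValue) -> Option<Ordering> {
    let (&tag, payload) = blob.split_first()?;
    match (tag, filter) {
        (TAG_S, FilterValue::Text(t)) => {
            let s = std::str::from_utf8(payload).ok()?;
            Some(s.cmp(*t))
        }
        (TAG_N, FilterValue::Decimal { unscaled, scale }) => {
            // Payload: 4-byte big-endian scale, then the unscaled integer.
            if payload.len() < 5 {
                return None;
            }
            let blob_scale =
                i32::from_be_bytes([payload[0], payload[1], payload[2], payload[3]]);
            let blob_unscaled = decode_be_signed(&payload[4..])?;
            cmp_decimal(blob_unscaled, blob_scale, *unscaled, *scale)
        }
        _ => None,
    }
}
```

Keeping the tag in the stored blob is what lets the `_ => None` arm reject the string "1" when the filter is numeric, rather than comparing raw bytes.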

The previous commit ("support storing and filtering on Alternator scalar attributes") wired up the full filtering-column pipeline in `Table::add()`: it introduced the `column_values` field on `DbEmbedding` and taught `Table::add()` to store those values into the per-column `TValue` storage.

Although the commit was motivated by Alternator support, it also completed the CQL path. The backfill range-scan query and the CDC consumer already selected and extracted non-primary-key filtering columns for CQL tables (directly by column name, with no special encoding), but the values were never forwarded to `Table::add()` and therefore never stored. After that commit the pipeline is complete end-to-end for CQL as well.

Add a validator test, `ann_filter_by_non_pk_filtering_column`, that exercises this path:

- A table with `pk INT PRIMARY KEY, category INT, v VECTOR<FLOAT, 3>` is created with a vector index that declares `category` as a filtering column.
- Six rows are inserted with alternating `category` values (0 and 1).
- An ANN query with `WHERE category = 0` must return exactly the three rows whose category is 0.

The test was verified to pass end-to-end against a real ScyllaDB node.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
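
The behaviour this validator test expects can be illustrated with a brute-force stand-in for the index. This is only a sketch of pre-filtering semantics, not the validator test or the vector store's search code: `Row`, `filtered_nearest` and `six_rows` are hypothetical, and an exact scan replaces the real ANN index.

```rust
/// One indexed row: primary key, filtering column, and a 3-dimensional vector.
struct Row {
    pk: i32,
    category: i32,
    v: [f32; 3],
}

fn squared_distance(a: &[f32; 3], b: &[f32; 3]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Pre-filter on `category`, then return up to `limit` primary keys ordered
/// by increasing distance to `query`. An exact scan is used for simplicity.
fn filtered_nearest(rows: &[Row], query: &[f32; 3], category: i32, limit: usize) -> Vec<i32> {
    let mut hits: Vec<(f32, i32)> = rows
        .iter()
        .filter(|r| r.category == category)
        .map(|r| (squared_distance(&r.v, query), r.pk))
        .collect();
    hits.sort_by(|a, b| a.0.total_cmp(&b.0));
    hits.into_iter().take(limit).map(|(_, pk)| pk).collect()
}

/// Six rows with alternating category values 0 and 1, as in the test description.
fn six_rows() -> Vec<Row> {
    (0..6)
        .map(|pk| Row { pk, category: pk % 2, v: [pk as f32, 0.0, 0.0] })
        .collect()
}
```

With these six rows, a query restricted to `category = 0` can only ever see the rows with pk 0, 2 and 4, regardless of how large `limit` is.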

Add support for the caller to request that the ANN response includes the stored values of named filtering columns, in addition to the usual primary keys and distances.

New API fields
--------------
* `PostIndexAnnRequest.return_columns`: an optional list of column names whose stored values should be returned. Defaults to empty (no change in behaviour for existing callers).
* `PostIndexAnnResponse.column_values`: a map from column name to a per-result `Vec<Option<Value>>`, aligned with `similarity_scores`. A `None` entry means the attribute was absent when the row was indexed. Omitted from the response when `return_columns` was empty.

Why this matters
----------------
The primary motivation is efficient vector-search queries in Alternator (ScyllaDB's DynamoDB-compatible API) - but in the future can be used in CQL as well: Alternator makes it possible to "project" some non-key attributes into the index, to allow them to be used in pre-filtering. If the user can ask to return *only* these projected attributes or a subset of them (Select = ALL_PROJECTED_ATTRIBUTES or SPECIFIC_ATTRIBUTES), the query can be fulfilled entirely by the vector store - without needing to read every item from the base table. We just need the vector store to return the projected columns' values - which it already stores to allow filtering.

Implementation
--------------
* `TableSearch::column_values_for()`: new trait method (and `Table` implementation) that retrieves stored `CqlValue`s for a set of column names given a primary ID.
* `usearch.rs`: threads `return_columns: Arc<[ColumnName]>` through both the `Ann` and `FilteredAnn` dispatch paths and their underlying free functions; calls `column_values_for()` for each result row and collects the values.
* `actor.rs`: extends `AnnR` from a 2-tuple to a 3-tuple `(Vec<PrimaryKey>, Vec<Distance>, Vec<BTreeMap<ColumnName, CqlValue>>)` and adds `return_columns` to the `Ann` and `FilteredAnn` actor messages.
* `httproutes.rs`: extracts `return_columns` from the request, passes it to `ann()`/`filtered_ann()`, converts the returned `CqlValue`s to JSON via `try_to_json()`, and populates `column_values` in the response.
* `opensearch.rs`: threads `return_columns` through to `ann()` and calls `column_values_for()` for each result row, same as usearch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
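
The step from the per-row maps returned by `ann()` to the per-column `column_values` of the response amounts to a transposition. The following is a minimal sketch under stated assumptions: `to_column_values` is a hypothetical name, column names are plain `String`s, and `Value` is a string stand-in for the JSON value type.

```rust
use std::collections::BTreeMap;

/// Stand-in for the JSON value type used in the real response (assumption).
type Value = String;

/// Transposes per-row column maps into one vector per requested column.
/// Each vector has exactly one entry per row, in row order; a row without
/// the column contributes `None`.
fn to_column_values(
    return_columns: &[&str],
    rows: &[BTreeMap<String, Value>],
) -> BTreeMap<String, Vec<Option<Value>>> {
    return_columns
        .iter()
        .map(|&col| {
            let values: Vec<Option<Value>> =
                rows.iter().map(|row| row.get(col).cloned()).collect();
            (col.to_string(), values)
        })
        .collect()
}
```

Because every output vector is built by iterating over the same `rows` slice, it stays aligned with the result order as long as `rows` itself has one entry per returned neighbour.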

Problem
-------
When an Alternator UpdateItem that only removes an attribute (e.g. `REMOVE x`) was processed via CDC, the indexed item was silently removed from the vector index entirely, even when x is not the vector attribute.

Root cause: the Alternator `:attrs` attribute map is stored as a single CQL map column. CDC delivers *delta* rows, not post-images: a delta row for `REMOVE x` contains the map keys that were deleted in `cdc$deleted_elements_:attrs`, while the new-values side of `:attrs` is absent (NULL). The old code treated a NULL `:attrs` in the CDC row as "vector deleted" and emitted `embedding: None`, which caused the table layer to remove the item from the index.

Why this was unnoticed before
-----------------------------
The bug has been present since Alternator support was added. However, none of the existing tests exercised the combination of (1) an item that has a vector, (2) an UpdateItem REMOVE on a *non-vector* attribute, and (3) a subsequent ANN query expecting the item to still be findable. Only once projected/filtering columns were introduced did we write tests that naturally trigger exactly this pattern (removing a filtering column while expecting the vector to remain indexed), exposing the pre-existing bug.

Fix
---
1. **Tri-state `embedding` field in `DbEmbedding`.** Changed `embedding: Option<Vector>` to `embedding: Option<Option<Vector>>`, where:
   - `None` = vector unchanged; skip the index update
   - `Some(None)` = vector was deleted; remove from index
   - `Some(Some(v))` = vector was set to `v`; add/update in index

2. **Correct CDC delta interpretation in `db_cdc.rs`.** The Alternator branch now reads:
   - `cdc$deleted_elements_:attrs` (keys explicitly deleted this event)
   - `cdc$deleted_:attrs` (whole collection tombstoned, e.g. PutItem)
   - `row.operation` (to detect RowDelete / PartitionDelete)

   Embedding action is only `Some(None)` (delete) when one of the above deletion signals applies to the vector attribute. A NULL `:attrs` in a partial-update event now maps to `None` (skip), meaning "the vector was not touched by this operation".

3. **Correct filtering-column tombstoning.** `extract_alternator_scalars` no longer automatically tombstones absent filtering columns (absence in a delta = unchanged, not deleted). The CDC handler now explicitly tombstones only columns that appear in `cdc$deleted_elements_:attrs`, or all columns when `whole_col_deleted` is true and the column was not set in the new delta.

4. **Table layer `add()` handles the tri-state.** The `add()` method branches on `Some(None)` -> remove, `Some(Some(v))` -> upsert, and `None` -> no-op, and all call sites and tests were updated accordingly.
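
The tri-state logic above can be sketched in a few lines. This is an illustration only: the toy `HashMap` index, `embedding_action` and `apply_embedding` are hypothetical, and the three boolean flags stand in for the deletion signals that the real code reads from the CDC row.

```rust
use std::collections::HashMap;

type Vector = Vec<f32>;

/// Hypothetical interpretation of one CDC delta for the vector attribute.
/// `new_vector` is the value set by this event, if any; the flags say whether
/// the vector key, the whole `:attrs` collection, or the row was deleted.
fn embedding_action(
    new_vector: Option<Vector>,
    vector_key_deleted: bool,
    whole_attrs_deleted: bool,
    row_deleted: bool,
) -> Option<Option<Vector>> {
    match new_vector {
        // The event sets the vector: add or update it in the index.
        Some(v) => Some(Some(v)),
        // No new value, but a deletion signal applies: remove from the index.
        None if vector_key_deleted || whole_attrs_deleted || row_deleted => Some(None),
        // The vector was not touched by this event: skip the index update.
        None => None,
    }
}

/// Applies the tri-state to a toy in-memory index keyed by primary key.
fn apply_embedding(
    index: &mut HashMap<i32, Vector>,
    pk: i32,
    embedding: Option<Option<Vector>>,
) {
    match embedding {
        None => {}
        Some(None) => {
            index.remove(&pk);
        }
        Some(Some(v)) => {
            index.insert(pk, v);
        }
    }
}
```

An update that only removes an unrelated attribute yields `None` here, so the item stays in the index, which is exactly the case the old code got wrong.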

Previously try_to_json ended with `_ => unimplemented!()`, which causes the server process to abort with a panic if any stored column value happens to be a CqlValue variant not explicitly handled (e.g. Inet). This was harmless when try_to_json was only called for primary-key columns, because primary keys are restricted to a small set of types in practice.

With the new return_columns plumbing, try_to_json is now also applied to arbitrary filtering-column values returned by the index. Filtering columns can be of any NativeType supported by Table::new, including Inet, so the panic path is now reachable from a normal query.

Fix by:
- Adding explicit handling for CqlValue::Inet (formatted as its standard string representation, e.g. "192.168.1.1" or "::1").
- Replacing the wildcard arm with a bail!() so any currently-unhandled variant returns a regular error to the caller instead of aborting the server.

Test coverage: added assertions for IPv4, IPv6, and the error path (Counter, a variant that is not stored in Table but proves the graceful fallback works) to the existing try_to_json_conversion test.
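
The shape of this fix can be shown with a reduced example. The sketch below is not the real `try_to_json`: the `CqlValue` and `Json` enums are small hypothetical stand-ins, and a plain `Result<_, String>` replaces the `bail!()` error path.

```rust
use std::net::IpAddr;

/// Reduced stand-in for CqlValue with only the variants needed here.
#[derive(Debug)]
enum CqlValue {
    Int(i32),
    Text(String),
    Inet(IpAddr),
    Counter(i64),
}

/// Reduced stand-in for a JSON value.
#[derive(Debug, PartialEq)]
enum Json {
    Number(i64),
    String(String),
}

/// Converts a stored value to JSON, returning an error for unsupported
/// variants instead of panicking.
fn try_to_json(value: &CqlValue) -> Result<Json, String> {
    match value {
        CqlValue::Int(i) => Ok(Json::Number(i64::from(*i))),
        CqlValue::Text(s) => Ok(Json::String(s.clone())),
        // Inet is rendered in its standard textual form.
        CqlValue::Inet(addr) => Ok(Json::String(addr.to_string())),
        // Previously this arm was `_ => unimplemented!()`, which panics.
        other => Err(format!("unsupported CQL value for JSON conversion: {other:?}")),
    }
}
```

Returning an `Err` lets the HTTP layer answer with an error response for that one request, while the server process keeps running.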

The engine's main loop syncs each index's status from node_state into the engine's internal cache every second (hardcoded CHECK_INTERVAL). This cache is what the get_index_status HTTP endpoint reads, which is what Scylla calls to determine whether a vector index is ready.

Because this interval was hardcoded to 1s, a newly-built index would continue to be reported as CREATING for up to a second after the prefill scan finished and node_state marked it as Serving. This caused tests that wait for IndexStatus==ACTIVE to take at least ~1 second even when the actual indexing completed in milliseconds.

Expose the interval as VECTOR_STORE_INDEX_STATUS_UPDATE_INTERVAL (default: 1s, matching the previous hardcoded value) so test environments can set it to a lower value (e.g. 100ms) to reduce test latency. For example, before this patch Alternator's vector test suite (in the Scylla core repository) took 57 seconds; with this patch, it is down to 28 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
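
Reading such an interval from the environment with a fallback could look like the sketch below. Only the variable name and the 1s default come from the commit message; the helper names and the accepted syntax (an integer followed by `ms` or `s`) are assumptions, and the real code may parse durations differently.

```rust
use std::time::Duration;

/// Default matching the previously hardcoded interval.
const DEFAULT_INTERVAL: Duration = Duration::from_secs(1);

/// Parses values such as "100ms" or "1s". The accepted syntax is an
/// assumption made for this sketch.
fn parse_interval(text: &str) -> Option<Duration> {
    let text = text.trim();
    // Check "ms" first, because "100ms" also ends with 's'.
    if let Some(ms) = text.strip_suffix("ms") {
        ms.trim().parse::<u64>().ok().map(Duration::from_millis)
    } else if let Some(secs) = text.strip_suffix('s') {
        secs.trim().parse::<u64>().ok().map(Duration::from_secs)
    } else {
        None
    }
}

/// Returns the configured interval, or the default when the variable is
/// unset or cannot be parsed.
fn index_status_update_interval() -> Duration {
    std::env::var("VECTOR_STORE_INDEX_STATUS_UPDATE_INTERVAL")
        .ok()
        .and_then(|v| parse_interval(&v))
        .unwrap_or(DEFAULT_INTERVAL)
}
```

Falling back to the default on a malformed value keeps the previous behaviour for existing deployments that do not set the variable.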

nyh · 2026-05-19T20:45:24Z

I've addressed the comments in the first commit, but still need to address the rest and probably to split this PR into two as the reviewer asked. I'll continue doing this tomorrow. Even with the AI it takes a lot of wall-clock time to do such patch reorganizations (I don't even want to think what the Copilot charges for this kind of work will be starting June 1st...).

knowack1 · 2026-05-21T09:05:47Z

I've addressed the comments in the first commit, but still need to address the rest and probably to split this PR into two as the reviewer asked. I'll continue doing this tomorrow. Even with the AI it takes a lot of wall-clock time to do such patch reorganizations (I don't even want to think what the Copilot charges for this kind of work will be starting June 1st...).

History reorganization is indeed very time-consuming, both for an AI and for a human. I often doubt whether it is worth spending this time just to keep a perfect history of changes. I do not think the investment in a perfect history is ever returned by later, more seamless investigations, code reviews, or easier backporting - but this is only my feeling, not backed by metrics.

However, in this specific case, this patch actually covers two features that could be delivered one by one:

* Support for filtering columns (the missing feature).
* Alternator support for provisioning (filtering columns).

Still, I am fine with keeping this in a single PR; it just needs to be referenced accordingly.

ewienik · 2026-05-21T09:45:58Z

However, in this specific case, this patch actually covers two features that could be delivered one by one:
* Support for filtering columns (the missing feature).

* Alternator support for provisioning (filtering columns).
Still, I am fine with keeping this in a single PR just need to be referenced accordingly.

Support for filtering columns depends on #449 and VECTOR-561. I thought that #449 could be done quite fast, but it seems there are more problems, similar to those of supporting vector search local indexes based on non-primary-key columns (a problem similar to MV). Both PRs try to extend DbEmbedding with a map or vector of values from columns, which seems to be a challenge. VECTOR-561 is also going to extend this struct in a similar way. I think the best way to deal with this is to refactor vector-store as part of the VECTOR-561 issue to allow filtering on custom columns, and in the next steps allow multi-column targets and local indexes on non-primary-key columns.

github-actions Bot added P2 area/vector labels May 9, 2026

nyh mentioned this pull request May 9, 2026

alternator: add even more vector search features scylladb/scylladb#29776

Open

nyh requested a review from Copilot May 10, 2026 08:52

Copilot started reviewing on behalf of nyh May 10, 2026 08:53

Copilot AI reviewed May 10, 2026

Comment thread crates/vector-store/src/httproutes.rs

Comment thread crates/vector-store/src/table.rs Outdated

Comment thread crates/vector-store/src/table.rs Outdated

Comment thread crates/vector-store/src/httproutes.rs

nyh force-pushed the alternator-filter branch from 31f3b52 to 06329e5 Compare May 10, 2026 08:59

nyh force-pushed the alternator-filter branch 2 times, most recently from 5ad183d to 28b97dc Compare May 10, 2026 09:12

nyh force-pushed the alternator-filter branch from 28b97dc to 584bddb Compare May 10, 2026 10:20

nyh requested a review from Copilot May 10, 2026 10:21

Copilot started reviewing on behalf of nyh May 10, 2026 10:21

Copilot AI reviewed May 10, 2026

Comment thread crates/vector-store/src/table.rs

Comment thread crates/vector-store/src/table.rs

Comment thread crates/vector-store/src/vector.rs

Comment thread crates/vector-store/src/db_cdc.rs Outdated

nyh changed the title ~~support storing and filtering on Alternator scalar attributes~~ support storing and filtering on non-key columns May 10, 2026

nyh force-pushed the alternator-filter branch from 7857ba4 to ef2349e Compare May 10, 2026 10:45

nyh requested review from QuerthDP and ewienik May 10, 2026 10:46

nyh force-pushed the alternator-filter branch from 0a03676 to d5477f6 Compare May 10, 2026 13:34

nyh requested a review from Copilot May 10, 2026 13:34

Copilot started reviewing on behalf of nyh May 10, 2026 13:34

Copilot AI reviewed May 10, 2026

Comment thread crates/vector-store/src/index/opensearch.rs

Comment thread crates/vector-store/src/httproutes.rs Outdated

Comment thread crates/httpapi/src/lib.rs Outdated

nyh marked this pull request as draft May 10, 2026 14:23

nyh changed the title ~~support storing and filtering on non-key columns~~ support storing, filtering and retrieving non-key columns May 10, 2026

nyh marked this pull request as ready for review May 10, 2026 15:49

nyh force-pushed the alternator-filter branch from 8afd558 to b101e4a Compare May 11, 2026 09:06

nyh requested a review from Copilot May 11, 2026 09:06

Copilot started reviewing on behalf of nyh May 11, 2026 09:07

Copilot AI reviewed May 11, 2026

Comment thread crates/vector-store/src/index/opensearch.rs

Comment thread crates/vector-store/src/httproutes.rs Outdated

Comment thread crates/httpapi/src/lib.rs Outdated

Comment thread crates/vector-store/src/db_cdc.rs Outdated

nyh force-pushed the alternator-filter branch 2 times, most recently from 74da2ac to 5c6600c Compare May 11, 2026 12:17

nyh requested review from knowack1 and smoczy123 May 11, 2026 12:25

nyh mentioned this pull request May 11, 2026

validator: add filtering tests that were removed from test.py #435

Merged

knowack1 reviewed May 13, 2026

nyh mentioned this pull request May 19, 2026

Alternator vector search: Implement projected columns (for filtering and selecting) scylladb/scylladb#29959

Draft

nyh force-pushed the alternator-filter branch from 5c6600c to 11ba254 Compare May 19, 2026 15:34

nyh force-pushed the alternator-filter branch from 11ba254 to 663e968 Compare May 19, 2026 15:43

nyh added 6 commits May 19, 2026 23:42

nyh force-pushed the alternator-filter branch from 663e968 to 9c57758 Compare May 19, 2026 20:46


		let index = create_index(CreateIndexQuery::new(&session, &clients, &table, "v")).await;

		let status = wait_for_index(client, &index).await;
