Skip to content

Add text embedding search with hybrid mode support#12

Merged
listlessbird merged 13 commits into
mainfrom
feat/hybrid-search
Feb 24, 2026
Merged

Add text embedding search with hybrid mode support#12
listlessbird merged 13 commits into
mainfrom
feat/hybrid-search

Conversation

@listlessbird
Copy link
Copy Markdown
Owner

Summary

  • Build a separate text FAISS index alongside the image index during rebuild
  • Add mode query parameter to search (image/text/hybrid), defaulting to hybrid
  • Merge results in hybrid mode using Reciprocal Rank Fusion (score = 1/(rank+k), k=60)
  • Fix text index artifacts being skipped when image index was already cached

Closes #9

Build a separate text FAISS index alongside the image index during
rebuild, using the existing caption/OCR text embeddings stored in S3.
The text index is cached on disk but only loaded into memory lazily
on first use, keeping the default startup fast.

Search now supports a `mode` query parameter (image/text/hybrid).
In hybrid mode (the default), both indexes are queried independently
with the same text embedding and results are merged using Reciprocal
Rank Fusion (score = 1/(rank+k), k=60).

Closes #9
Text index files were only downloaded inside the `if not index_file.exists()`
guard, so nodes that already had the image index cached from before text
indexes existed would never fetch text_index.faiss. This caused
has_text_index() to report false and hybrid mode to silently degrade
to image-only.

Move text artifact fetching to its own `if not text_index_local.exists()`
block that runs unconditionally.
A single _text.npy with a wrong dimension caused a NumPy broadcast
error that killed the entire rebuild. Since text indexing is optional,
log a warning and skip the bad vector so the image index (and the rest
of the text index) still gets built.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 052ebf46bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend-api/src/api/services/search.py Outdated
Comment on lines +27 to +28
for rank, (image_id, _) in enumerate(image_results):
scores[image_id] = scores.get(image_id, 0) + 1 / (rank + k)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Start reciprocal-rank fusion at rank 1

reciprocal_rank_fusion uses enumerate(...) with the default zero-based rank, so the top hit is scored as 1/(k+0) instead of the standard RRF 1/(k+1). This over-weights every position and can change merged ordering compared with canonical RRF, which means hybrid ranking quality will drift from expected behavior whenever mode="hybrid" is used.

Useful? React with 👍 / 👎.

Comment on lines +121 to +124
text_jobs: list[tuple[str, Path]] = [
(text_index_key, text_index_local),
(text_mapping_key, cache_path / "text_mapping.json"),
]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Skip hard-failing index load on optional text artifact fetch

This block makes text_mapping.json a required download whenever text_index.faiss exists, and any download failure bubbles out before the image index is loaded. In environments with partial/corrupt text artifacts (or transient object-store failures), _load_index_version can fail entirely, taking down even image-only search despite healthy index.faiss and mapping.json.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9384010ed9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# A node may already have index.faiss cached from before text indexes
# existed, so this must run even when the image index is already on disk.
text_index_local = cache_path / "text_index.faiss"
if not text_index_local.exists():
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require mapping file before treating text cache as valid

This guard only checks for text_index.faiss, so a partial prior download (for example, index file written but text_mapping.json failed) will skip re-download on the next load. In that state ensure_text_index_loaded leaves _text_id_mapping empty, which makes search_text return no mapped results and silently degrades text/hybrid search until the cache is manually repaired.

Useful? React with 👍 / 👎.

Comment on lines +389 to +391
text_built = (tmp_path / "text_index.faiss").exists()
text_s3_key = self._storage.build_index_key(version, "text_index.faiss") if text_built else None
text_n_built = len(text_image_ids) if text_built and text_image_ids is not None else None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Derive text build metadata before tempdir cleanup

text_built is computed after leaving with tempfile.TemporaryDirectory(), so tmp_path has already been removed and (tmp_path / "text_index.faiss").exists() is always false. That means BuildResult.text_num_vectors and text_s3_key are always reported as None even when text artifacts were successfully built and uploaded, causing incorrect workflow/job metadata.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 228bb31b79

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +126 to +129
try:
txt_emb = storage.download_numpy(text_key)
except ClientError:
pass
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Tolerate optional text-embedding fetch failures

The optional text embedding fetch only suppresses ClientError, but storage.download_numpy() can also raise non-ClientError exceptions (for example ValueError from np.load on a malformed .npy, or transport-level botocore errors). In those cases the exception escapes _download_one, future.result() re-raises, and build_index_activity aborts even though image embeddings were successfully collected, which turns an optional text artifact problem into a full rebuild failure.

Useful? React with 👍 / 👎.

@listlessbird listlessbird linked an issue Feb 24, 2026 that may be closed by this pull request
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented Feb 24, 2026

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
memes a94c7d5 Commit Preview URL

Branch Preview URL
Feb 24 2026, 03:22 PM

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@listlessbird listlessbird merged commit b1a78b4 into main Feb 24, 2026
2 of 3 checks passed
@listlessbird listlessbird deleted the feat/hybrid-search branch February 24, 2026 15:22
listlessbird added a commit that referenced this pull request Feb 24, 2026
* feat: late fusion search with RRF and disk-cached text FAISS index

Build a separate text FAISS index alongside the image index during
rebuild, using the existing caption/OCR text embeddings stored in S3.
The text index is cached on disk but only loaded into memory lazily
on first use, keeping the default startup fast.

Search now supports a `mode` query parameter (image/text/hybrid).
In hybrid mode (the default), both indexes are queried independently
with the same text embedding and results are merged using Reciprocal
Rank Fusion (score = 1/(rank+k), k=60).

Closes #9

* fix: fetch text index artifacts independently of image index cache

Text index files were only downloaded inside the `if not index_file.exists()`
guard, so nodes that already had the image index cached from before text
indexes existed would never fetch text_index.faiss. This caused
has_text_index() to report false and hybrid mode to silently degrade
to image-only.

Move text artifact fetching to its own `if not text_index_local.exists()`
block that runs unconditionally.

* fix: skip mismatched text embeddings instead of aborting index build

A single _text.npy with a wrong dimension caused a NumPy broadcast
error that killed the entire rebuild. Since text indexing is optional,
log a warning and skip the bad vector so the image index (and the rest
of the text index) still gets built.

* feat: add text meta

* chore: tuning notebook

* chore: setup nbstripout

* fix: one off rrf, text index check

* fix: exception handle

* chore: add basic tests

* chore: test actions

* chore: branding

* fix: missing mode param in test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

late fusion the text embeddings text embeddings are untouched

1 participant