Add text embedding search with hybrid mode support#12
Conversation
Build a separate text FAISS index alongside the image index during rebuild, using the existing caption/OCR text embeddings stored in S3. The text index is cached on disk but only loaded into memory lazily on first use, keeping the default startup fast. Search now supports a `mode` query parameter (image/text/hybrid). In hybrid mode (the default), both indexes are queried independently with the same text embedding and results are merged using Reciprocal Rank Fusion (score = 1/(rank+k), k=60). Closes #9
Text index files were only downloaded inside the `if not index_file.exists()` guard, so nodes that already had the image index cached from before text indexes existed would never fetch text_index.faiss. This caused has_text_index() to report false and hybrid mode to silently degrade to image-only. Move text artifact fetching to its own `if not text_index_local.exists()` block that runs unconditionally.
A single _text.npy with a wrong dimension caused a NumPy broadcast error that killed the entire rebuild. Since text indexing is optional, log a warning and skip the bad vector so the image index (and the rest of the text index) still gets built.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 052ebf46bf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for rank, (image_id, _) in enumerate(image_results): | ||
| scores[image_id] = scores.get(image_id, 0) + 1 / (rank + k) |
There was a problem hiding this comment.
Start reciprocal-rank fusion at rank 1
reciprocal_rank_fusion uses enumerate(...) with the default zero-based rank, so the top hit is scored as 1/(k+0) instead of the standard RRF 1/(k+1). This over-weights every position and can change merged ordering compared with canonical RRF, which means hybrid ranking quality will drift from expected behavior whenever mode="hybrid" is used.
Useful? React with 👍 / 👎.
| text_jobs: list[tuple[str, Path]] = [ | ||
| (text_index_key, text_index_local), | ||
| (text_mapping_key, cache_path / "text_mapping.json"), | ||
| ] |
There was a problem hiding this comment.
Skip hard-failing index load on optional text artifact fetch
This block makes text_mapping.json a required download whenever text_index.faiss exists, and any download failure bubbles out before the image index is loaded. In environments with partial/corrupt text artifacts (or transient object-store failures), _load_index_version can fail entirely, taking down even image-only search despite healthy index.faiss and mapping.json.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9384010ed9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # A node may already have index.faiss cached from before text indexes | ||
| # existed, so this must run even when the image index is already on disk. | ||
| text_index_local = cache_path / "text_index.faiss" | ||
| if not text_index_local.exists(): |
There was a problem hiding this comment.
Require mapping file before treating text cache as valid
This guard only checks for text_index.faiss, so a partial prior download (for example, index file written but text_mapping.json failed) will skip re-download on the next load. In that state ensure_text_index_loaded leaves _text_id_mapping empty, which makes search_text return no mapped results and silently degrades text/hybrid search until the cache is manually repaired.
Useful? React with 👍 / 👎.
| text_built = (tmp_path / "text_index.faiss").exists() | ||
| text_s3_key = self._storage.build_index_key(version, "text_index.faiss") if text_built else None | ||
| text_n_built = len(text_image_ids) if text_built and text_image_ids is not None else None |
There was a problem hiding this comment.
Derive text build metadata before tempdir cleanup
text_built is computed after leaving with tempfile.TemporaryDirectory(), so tmp_path has already been removed and (tmp_path / "text_index.faiss").exists() is always false. That means BuildResult.text_num_vectors and text_s3_key are always reported as None even when text artifacts were successfully built and uploaded, causing incorrect workflow/job metadata.
Useful? React with 👍 / 👎.
9384010 to
f1c3aef
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 228bb31b79
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| try: | ||
| txt_emb = storage.download_numpy(text_key) | ||
| except ClientError: | ||
| pass |
There was a problem hiding this comment.
Tolerate optional text-embedding fetch failures
The optional text embedding fetch only suppresses ClientError, but storage.download_numpy() can also raise non-ClientError exceptions (for example ValueError from np.load on a malformed .npy, or transport-level botocore errors). In those cases the exception escapes _download_one, future.result() re-raises, and build_index_activity aborts even though image embeddings were successfully collected, which turns an optional text artifact problem into a full rebuild failure.
Useful? React with 👍 / 👎.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Deploying with
|
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs |
memes | a94c7d5 | Commit Preview URL Branch Preview URL |
Feb 24 2026, 03:22 PM |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
* feat: late fusion search with RRF and disk-cached text FAISS index Build a separate text FAISS index alongside the image index during rebuild, using the existing caption/OCR text embeddings stored in S3. The text index is cached on disk but only loaded into memory lazily on first use, keeping the default startup fast. Search now supports a `mode` query parameter (image/text/hybrid). In hybrid mode (the default), both indexes are queried independently with the same text embedding and results are merged using Reciprocal Rank Fusion (score = 1/(rank+k), k=60). Closes #9 * fix: fetch text index artifacts independently of image index cache Text index files were only downloaded inside the `if not index_file.exists()` guard, so nodes that already had the image index cached from before text indexes existed would never fetch text_index.faiss. This caused has_text_index() to report false and hybrid mode to silently degrade to image-only. Move text artifact fetching to its own `if not text_index_local.exists()` block that runs unconditionally. * fix: skip mismatched text embeddings instead of aborting index build A single _text.npy with a wrong dimension caused a NumPy broadcast error that killed the entire rebuild. Since text indexing is optional, log a warning and skip the bad vector so the image index (and the rest of the text index) still gets built. * feat: add text meta * chore: tuning notebook * chore: setup nbstripout * fix: one off rrf, text index check * fix: exception handle * chore: add basic tests * chore: test actions * chore: branding * fix: missing mode param in test
Summary
modequery parameter to search (image/text/hybrid), defaulting to hybridCloses #9