Add text embedding search with hybrid mode support by listlessbird · Pull Request #12 · listlessbird/mimeme

listlessbird · 2026-02-20T15:07:41Z

Summary

Build a separate text FAISS index alongside the image index during rebuild
Add mode query parameter to search (image/text/hybrid), defaulting to hybrid
Merge results in hybrid mode using Reciprocal Rank Fusion (score = 1/(rank+k), k=60)
Fix text index artifacts being skipped when image index was already cached

Closes #9

Build a separate text FAISS index alongside the image index during rebuild, using the existing caption/OCR text embeddings stored in S3. The text index is cached on disk but only loaded into memory lazily on first use, keeping the default startup fast. Search now supports a `mode` query parameter (image/text/hybrid). In hybrid mode (the default), both indexes are queried independently with the same text embedding and results are merged using Reciprocal Rank Fusion (score = 1/(rank+k), k=60). Closes #9

Text index files were only downloaded inside the `if not index_file.exists()` guard, so nodes that already had the image index cached from before text indexes existed would never fetch text_index.faiss. This caused has_text_index() to report false and hybrid mode to silently degrade to image-only. Move text artifact fetching to its own `if not text_index_local.exists()` block that runs unconditionally.

A single _text.npy with a wrong dimension caused a NumPy broadcast error that killed the entire rebuild. Since text indexing is optional, log a warning and skip the bad vector so the image index (and the rest of the text index) still gets built.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 052ebf46bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-20T15:12:22Z

+    for rank, (image_id, _) in enumerate(image_results):
+        scores[image_id] = scores.get(image_id, 0) + 1 / (rank + k)


Start reciprocal-rank fusion at rank 1

reciprocal_rank_fusion uses enumerate(...) with the default zero-based rank, so the top hit is scored as 1/(k+0) instead of the standard RRF 1/(k+1). This over-weights every position and can change merged ordering compared with canonical RRF, which means hybrid ranking quality will drift from expected behavior whenever mode="hybrid" is used.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-20T15:12:22Z

+                text_jobs: list[tuple[str, Path]] = [
+                    (text_index_key, text_index_local),
+                    (text_mapping_key, cache_path / "text_mapping.json"),
+                ]


Skip hard-failing index load on optional text artifact fetch

This block makes text_mapping.json a required download whenever text_index.faiss exists, and any download failure bubbles out before the image index is loaded. In environments with partial/corrupt text artifacts (or transient object-store failures), _load_index_version can fail entirely, taking down even image-only search despite healthy index.faiss and mapping.json.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9384010ed9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-23T16:20:19Z

+        # A node may already have index.faiss cached from before text indexes
+        # existed, so this must run even when the image index is already on disk.
+        text_index_local = cache_path / "text_index.faiss"
+        if not text_index_local.exists():


Require mapping file before treating text cache as valid

This guard only checks for text_index.faiss, so a partial prior download (for example, index file written but text_mapping.json failed) will skip re-download on the next load. In that state ensure_text_index_loaded leaves _text_id_mapping empty, which makes search_text return no mapped results and silently degrades text/hybrid search until the cache is manually repaired.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-23T16:20:19Z

+        text_built = (tmp_path / "text_index.faiss").exists()
+        text_s3_key = self._storage.build_index_key(version, "text_index.faiss") if text_built else None
+        text_n_built = len(text_image_ids) if text_built and text_image_ids is not None else None


Derive text build metadata before tempdir cleanup

text_built is computed after leaving with tempfile.TemporaryDirectory(), so tmp_path has already been removed and (tmp_path / "text_index.faiss").exists() is always false. That means BuildResult.text_num_vectors and text_s3_key are always reported as None even when text artifacts were successfully built and uploaded, causing incorrect workflow/job metadata.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 228bb31b79

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-24T14:56:55Z

+                try:
+                    txt_emb = storage.download_numpy(text_key)
+                except ClientError:
+                    pass


Tolerate optional text-embedding fetch failures

The optional text embedding fetch only suppresses ClientError, but storage.download_numpy() can also raise non-ClientError exceptions (for example ValueError from np.load on a malformed .npy, or transport-level botocore errors). In those cases the exception escapes _download_one, future.result() re-raises, and build_index_activity aborts even though image embeddings were successfully collected, which turns an optional text artifact problem into a full rebuild failure.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-24T15:04:42Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

chatgpt-codex-connector · 2026-02-24T15:08:25Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

chatgpt-codex-connector · 2026-02-24T15:17:09Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

cloudflare-workers-and-pages · 2026-02-24T15:17:09Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	memes	`a94c7d5`	Commit Preview URL Branch Preview URL	Feb 24 2026, 03:22 PM

chatgpt-codex-connector · 2026-02-24T15:19:44Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

chatgpt-codex-connector · 2026-02-24T15:21:29Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

* feat: late fusion search with RRF and disk-cached text FAISS index Build a separate text FAISS index alongside the image index during rebuild, using the existing caption/OCR text embeddings stored in S3. The text index is cached on disk but only loaded into memory lazily on first use, keeping the default startup fast. Search now supports a `mode` query parameter (image/text/hybrid). In hybrid mode (the default), both indexes are queried independently with the same text embedding and results are merged using Reciprocal Rank Fusion (score = 1/(rank+k), k=60). Closes #9 * fix: fetch text index artifacts independently of image index cache Text index files were only downloaded inside the `if not index_file.exists()` guard, so nodes that already had the image index cached from before text indexes existed would never fetch text_index.faiss. This caused has_text_index() to report false and hybrid mode to silently degrade to image-only. Move text artifact fetching to its own `if not text_index_local.exists()` block that runs unconditionally. * fix: skip mismatched text embeddings instead of aborting index build A single _text.npy with a wrong dimension caused a NumPy broadcast error that killed the entire rebuild. Since text indexing is optional, log a warning and skip the bad vector so the image index (and the rest of the text index) still gets built. * feat: add text meta * chore: tuning notebook * chore: setup nbstripout * fix: one off rrf, text index check * fix: exception handle * chore: add basic tests * chore: test actions * chore: branding * fix: missing mode param in test

listlessbird added 3 commits February 20, 2026 01:03

chatgpt-codex-connector Bot reviewed Feb 20, 2026

View reviewed changes

listlessbird added 2 commits February 23, 2026 21:46

feat: add text meta

1962385

chore: tuning notebook

f1c3aef

chatgpt-codex-connector Bot reviewed Feb 23, 2026

View reviewed changes

listlessbird force-pushed the main branch from 4943e71 to ed21d55 Compare February 24, 2026 14:48

listlessbird force-pushed the feat/hybrid-search branch from 9384010 to f1c3aef Compare February 24, 2026 14:48

chore: setup nbstripout

228bb31

chatgpt-codex-connector Bot reviewed Feb 24, 2026

View reviewed changes

listlessbird linked an issue Feb 24, 2026 that may be closed by this pull request

text embeddings are untouched #7

Closed

fix: one off rrf, text index check

1e3b9f2

fix: exception handle

dc140f2

listlessbird added 3 commits February 24, 2026 20:46

chore: add basic tests

025f3fd

chore: test actions

a662093

chore: branding

80603a6

fix: missing mode param in test

9dfd35c

Merge branch 'main' into feat/hybrid-search

a94c7d5

listlessbird merged commit b1a78b4 into main Feb 24, 2026
2 of 3 checks passed

listlessbird deleted the feat/hybrid-search branch February 24, 2026 15:22

		for rank, (image_id, _) in enumerate(image_results):
		scores[image_id] = scores.get(image_id, 0) + 1 / (rank + k)

Conversation

listlessbird commented Feb 20, 2026

Summary

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented Feb 24, 2026

Uh oh!

chatgpt-codex-connector Bot commented Feb 24, 2026

Uh oh!

chatgpt-codex-connector Bot commented Feb 24, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying with Cloudflare Workers

Uh oh!

chatgpt-codex-connector Bot commented Feb 24, 2026

Uh oh!

chatgpt-codex-connector Bot commented Feb 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cloudflare-workers-and-pages Bot commented Feb 24, 2026 •

edited

Loading