text embeddings are untouched

The text embeddings stored at ingestion time are never used.

At ingestion, two kinds of embeddings are produced:

`img_feats` → saved to `{sha256}.npy` → stored as `proc.embed_s3_key` (the image key)
`txt_feats` → saved to `{sha256}_text.npy` → stored as `text_embedding_key`

Then `build_index_activity` queries the DB for `proc.embed_s3_key` only and downloads those files, so only the image embeddings go into FAISS. The `_text.npy` files sit in R2 and are never read again.
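To make the gap concrete, here is a minimal sketch of the flow described above. It is not the project's code: the `store` dict stands in for R2, the `rows` list stands in for the DB, and the helpers `ingest` and `build_index_vectors` are hypothetical names. Only the key naming (`{sha256}.npy`, `{sha256}_text.npy`, `embed_s3_key`, `text_embedding_key`) comes from this issue.

```python
import numpy as np

def ingest(store, sha256, img_feats, txt_feats):
    """Save both embeddings and return the two keys recorded for the item."""
    embed_s3_key = f"{sha256}.npy"             # image embedding key
    text_embedding_key = f"{sha256}_text.npy"  # text embedding key
    store[embed_s3_key] = img_feats
    store[text_embedding_key] = txt_feats
    return {"embed_s3_key": embed_s3_key, "text_embedding_key": text_embedding_key}

def build_index_vectors(store, rows):
    """Mimic the index build: only embed_s3_key is ever looked up."""
    read_keys = [row["embed_s3_key"] for row in rows]
    vectors = np.stack([store[key] for key in read_keys]).astype("float32")
    return vectors, read_keys

store = {}  # hypothetical stand-in for the R2 bucket
rows = [ingest(store, h, np.random.rand(4), np.random.rand(4)) for h in ("aaa", "bbb")]

vectors, read_keys = build_index_vectors(store, rows)
never_read = sorted(set(store) - set(read_keys))
print(never_read)  # ['aaa_text.npy', 'bbb_text.npy']
```

Every key that is written but never read ends in `_text.npy`, which is the behaviour this issue reports.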

The text encoded at ingestion is the caption plus the OCR output from moondream2, combined into one string. The idea is to create a fused embedding: encode the image's own description and store it alongside the visual embedding. But since only the image embeddings go into FAISS and the text vector goes nowhere, it currently has zero effect on search.
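One possible way to make the text vector count is to fuse it with the image vector before indexing. The sketch below is only an illustration, not a decided design: it assumes both vectors have the same dimension and live in a shared embedding space (not confirmed in this issue), and `fuse` and `text_weight` are hypothetical names.

```python
import numpy as np

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    return v / np.linalg.norm(v)

def fuse(img_feats, txt_feats, text_weight=0.3):
    """Weighted sum of the two unit vectors, renormalized to unit length.

    text_weight is a made-up tuning knob: 0.0 reproduces the current
    image-only behaviour, 1.0 would index the text vector alone.
    """
    mixed = (1.0 - text_weight) * l2_normalize(img_feats) \
        + text_weight * l2_normalize(txt_feats)
    return l2_normalize(mixed).astype("float32")

fused = fuse(np.array([1.0, 0.0]), np.array([0.0, 1.0]), text_weight=0.5)
image_only = fuse(np.array([2.0, 0.0]), np.array([0.0, 1.0]), text_weight=0.0)
```

The fused vector would replace the image vector when the index is built. The sketch assumes the weighted sum is never the zero vector, which would need a guard in real code.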

text embeddings are untouched #7
