text embeddings are untouched #7

@listlessbird

Description

The text embeddings stored at ingestion time are completely unused.

At ingestion, two kinds of embeddings are produced:

img_feats → saved to {sha256}.npy → stored as proc.embed_s3_key (the image key)
txt_feats → saved to {sha256}_text.npy → stored as text_embedding_key

Then build_index_activity queries the DB for proc.embed_s3_key and downloads those — only the image embeddings go into FAISS. The _text.npy files sit in R2 and are never read again.

The text encoded at ingestion is the caption plus OCR output from moondream2, combined into one string. The idea is to create a fused embedding: encode the image's own description and store it alongside the visual embedding. But since only the image embeddings go into FAISS and the text vector goes nowhere, it has zero effect on search at the moment.
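
One possible direction, as a rough sketch rather than a fix: fuse the two vectors into one before the index build, so the caption + OCR text actually influences search. This assumes img_feats and txt_feats have the same dimension and live in a shared embedding space, which this issue does not confirm. `fuse_embeddings` and the `alpha` weight are hypothetical names, not existing code in the repo.

```python
import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def fuse_embeddings(img_feats, txt_feats=None, alpha: float = 0.5) -> np.ndarray:
    """Blend an image embedding with its text embedding (hypothetical helper).

    alpha is the weight given to the image vector; (1 - alpha) goes to the
    text vector. If no text embedding exists, the normalized image vector is
    returned so older items still get indexed.
    """
    img = l2_normalize(np.asarray(img_feats, dtype=np.float32).ravel())
    if txt_feats is None:
        return img

    txt = l2_normalize(np.asarray(txt_feats, dtype=np.float32).ravel())
    if img.shape != txt.shape:
        # Assumption: both encoders emit vectors of the same size.
        raise ValueError(f"dimension mismatch: {img.shape} vs {txt.shape}")

    # Weighted sum of unit vectors, re-normalized so that inner-product
    # search on the result behaves like cosine similarity.
    return l2_normalize(alpha * img + (1.0 - alpha) * txt)
```

In build_index_activity, the result of something like `fuse_embeddings(np.load(img_path), np.load(txt_path))` would go into FAISS in place of the raw image vector. Whether a weighted sum is the right fusion (versus concatenation or a second index for text) is an open design question.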
