- A single foundation model rarely knows a sofa, a smartphone and a floral dress equally well — yet a real
- marketplace catalog mixes all of them on the same shelf. This project studies where general-purpose visual
- encoders like CLIP and SigLIP 2 stop being enough for e-commerce retrieval, and what has to be rebuilt
- around them when the catalog is not one domain but twenty. Each query is first stripped of its context —
- mannequins, human models, living-room scenes, studio gradients — then routed to a category-specific
- expert whose embeddings are reranked with the fine-grained color and texture cues that generic models
- quietly discard. The whole pipeline is then compressed into an on-device Android demo, raising a second
- question the paper versions of Google Lens rarely address in the open: how much of a foundation-model
- retrieval system actually survives when it has to run on a phone?
+ This project explores visual product search for marketplace applications, where users find visually similar
+ items from a single photo. I build a retrieval pipeline based on vision-language embeddings, object segmentation,
+ category-aware routing, and color-aware reranking to improve search quality for fashion and marketplace-style product
+ images. The final system is demonstrated in an Android app that performs on-device image-based retrieval over a local
+ product catalog.
+