Turn raw CSV data into beautiful, interactive 3D embedding clusters with semantic search and a web UI.
git clone https://github.com/aGallea/embedding-clusters.git
cd embedding-clusters
uv sync --all-extras
RUNNING_MODE=SERVER uv run python -m embedding_cluster
Open http://localhost:8000.
Requires Python 3.13+ and uv.
- Embeddings — CLIP (images) + SentenceTransformer (text) with ChromaDB persistence
- Clustering — k-means with automatic cluster count suggestion
- 3D Visualization — interactive scatter plot with t-SNE, UMAP, or PCA reduction and multiple render modes (particles, sprites, instanced spheres)
- Semantic Search — find similar items by text or image URL, highlighted directly in the 3D view
- Cluster Drill-Down — inspect cluster items, sub-cluster within a cluster, and explore hierarchical structure
- AI-Powered Naming — auto-label clusters using OpenAI, Google, Anthropic, or Ollama via LiteLLM
- Annotations — rename, tag, and annotate clusters with persistent notes
- Web UI — CSV upload, live indexing progress via WebSocket, plot controls, and collection management
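As a rough illustration of what "automatic cluster count suggestion" can look like, the sketch below scores several candidate values of k with the mean silhouette coefficient and picks the best one. This is an assumption made for illustration: the project may use a different heuristic (for example an elbow method), and the function names `kmeans`, `silhouette`, and `suggest_cluster_count` are hypothetical. It uses only the standard library and a deterministic farthest-point initialization to stay self-contained; in practice scikit-learn's `KMeans` and `silhouette_score` would replace the hand-written parts.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm with farthest-point initialization.

    Assumes `points` contains at least k distinct points.
    Returns one cluster label per point.
    """
    # Deterministic init: start at the first point, then repeatedly add the
    # point that is farthest from its nearest already-chosen centroid.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two non-empty clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton contributes a score of 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, points[j]) for j in other) / len(other)
            for lab, other in clusters.items()
            if lab != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def suggest_cluster_count(points, k_min=2, k_max=6):
    """Return the k in [k_min, k_max] with the highest mean silhouette."""
    k_max = min(k_max, len(points) - 1)
    scores = {k: silhouette(points, kmeans(points, k)) for k in range(k_min, k_max + 1)}
    return max(scores, key=scores.get)


# Made-up 3D points forming three tight, well-separated groups.
demo_points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (10.0, 10.0, 10.0), (10.1, 10.0, 10.0), (10.0, 10.1, 10.0),
    (-10.0, 10.0, 0.0), (-10.1, 10.0, 0.0), (-10.0, 10.1, 0.0),
]
suggested_k = suggest_cluster_count(demo_points)
```

The silhouette score rewards clusters that are compact and far from their neighbours, so it tends to peak at the number of clearly separated groups. It costs O(n²) distance computations, so on large collections it would typically be computed on a sample.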
CSV → Select Fields → Download Model → Embed & Store
→ Configure Plot → 3D Visualization → Search & Explore
- Upload CSV — drag-and-drop in the web UI or pass a file path via CLI
- Select fields — choose text columns (SentenceTransformer) and/or image URL columns (CLIP) to embed
- Embed & store — rows are converted to vector embeddings and stored in a ChromaDB collection
- Plot — pick a collection, set cluster count (or auto-suggest), choose a reduction algorithm, and render an interactive 3D scatter plot
- Search — enter a text query or image URL to find and highlight the most similar items
- Drill down — click a cluster to inspect items, sub-cluster for hierarchical exploration, and annotate with AI-generated or custom names
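The search step above comes down to ranking stored embeddings by similarity to a query embedding. The following is a minimal sketch of that idea using cosine similarity over plain tuples. The vectors, item IDs, and the `top_k` helper are made up for illustration; in the application the embeddings come from SentenceTransformer or CLIP and the nearest-neighbour lookup is handled by ChromaDB, whose distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, items, k=3):
    """Return the IDs of the k items most similar to `query`, best first.

    `items` maps an item ID to its embedding vector.
    """
    ranked = sorted(
        items,
        key=lambda item_id: cosine_similarity(query, items[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
stored = {
    "cat": (1.0, 0.1, 0.0),
    "dog": (0.9, 0.3, 0.0),
    "car": (0.0, 0.1, 1.0),
}
matches = top_k((1.0, 0.0, 0.0), stored, k=2)
```

The returned IDs are what a UI would then highlight in the 3D view.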
For detailed usage of all three modes (SERVER, INDEX, PLOT), environment variables, and API endpoints, see docs/USAGE.md.
The application has three running modes controlled by the RUNNING_MODE
environment variable:
- SERVER — FastAPI backend + React SPA (the web UI)
- INDEX — CLI-only CSV embedding pipeline
- PLOT — CLI-only cluster visualization
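One way an entry point might select between these modes is sketched below. The `RunningMode` enum and `resolve_mode` function are hypothetical names, and rejecting a missing or unrecognised value is an assumption; the actual package may fall back to a default mode instead.

```python
import os
from enum import Enum


class RunningMode(str, Enum):
    SERVER = "SERVER"  # FastAPI backend + React SPA
    INDEX = "INDEX"    # CLI-only CSV embedding pipeline
    PLOT = "PLOT"      # CLI-only cluster visualization


def resolve_mode(env=None):
    """Read RUNNING_MODE from `env` (defaults to os.environ) and validate it."""
    env = os.environ if env is None else env
    raw = env.get("RUNNING_MODE", "")
    try:
        return RunningMode(raw)
    except ValueError:
        valid = ", ".join(mode.value for mode in RunningMode)
        # Assumption: an unset or unknown value is an error rather than a default.
        raise ValueError(
            f"Unknown RUNNING_MODE {raw!r}; expected one of: {valid}"
        ) from None


mode = resolve_mode({"RUNNING_MODE": "INDEX"})
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.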
See docs/ARCHITECTURE.md for the full system design, data flow diagrams, and component breakdown.
uv sync --all-extras # Install dependencies
uv run ruff check embedding_cluster/ tests/ # Lint
uv run ruff format embedding_cluster/ tests/ # Format
uv run mypy embedding_cluster/ # Type check (strict)
uv run pytest # Run tests
Frontend (React 19 + TypeScript + Vite + Tailwind CSS 4):
cd frontend && npm install && npm run dev # Dev server on :5173
cd frontend && npm run build # Production build
cd frontend && npm run test:e2e # Playwright E2E tests
See CONTRIBUTING.md for the full development guide.
| Layer | Technology |
|---|---|
| Backend | Python 3.13, FastAPI, Pydantic |
| Embeddings | SentenceTransformers, CLIP (HuggingFace) |
| Vector DB | ChromaDB |
| ML | scikit-learn (KMeans, t-SNE, PCA), UMAP |
| AI Naming | LiteLLM (OpenAI, Google, Anthropic, Ollama) |
| Frontend | React 19, TypeScript, Vite, Tailwind CSS 4 |
| 3D | React Three Fiber, Three.js, drei |
| State | Zustand, TanStack React Query |
| Testing | pytest, Playwright |
| CI | GitHub Actions (lint, typecheck, test with 90% coverage) |
See CONTRIBUTING.md for setup, code style, and PR guidelines.





