embedding-clusters

Turn raw CSV data into beautiful, interactive 3D embedding clusters with semantic search and a web UI.

Quick Start

git clone https://github.com/aGallea/embedding-clusters.git
cd embedding-clusters
uv sync --all-extras
RUNNING_MODE=SERVER uv run python -m embedding_cluster

Once the server is running, open http://localhost:8000 in a browser.

Requires Python 3.13+ and uv.

Features

Embeddings — CLIP (images) + SentenceTransformer (text) with ChromaDB persistence
Clustering — k-means with automatic cluster count suggestion
3D Visualization — interactive scatter plot with t-SNE, UMAP, or PCA reduction and multiple render modes (particles, sprites, instanced spheres)
Semantic Search — find similar items by text or image URL, highlighted directly in the 3D view
Cluster Drill-Down — inspect cluster items, sub-cluster within a cluster, and explore hierarchical structure
AI-Powered Naming — auto-label clusters using OpenAI, Google, Anthropic, or Ollama via LiteLLM
Annotations — rename, tag, and annotate clusters with persistent notes
Web UI — CSV upload, live indexing progress via WebSocket, plot controls, and collection management
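
This README does not say how the cluster count suggestion is computed. As a rough illustration only, one common approach is to try several values of k and keep the one with the highest mean silhouette score. The sketch below is dependency-free pure Python and is not the project's implementation (the project lists scikit-learn KMeans in its tech stack). The function names `kmeans`, `silhouette`, and `suggest_k`, the farthest-point initialization, and the sample points are all assumptions made for this example.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    # Deterministic farthest-point initialization (a simplification chosen
    # for this sketch; real libraries typically use k-means++).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; values near 1 mean tight, well-separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def suggest_k(points, k_max=6):
    """Try k = 2..k_max and return the k with the highest silhouette score."""
    upper = min(k_max, len(points) - 1)
    scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, upper + 1)}
    return max(scores, key=scores.get)


# Made-up 2D sample: three small, well-separated groups of four points each.
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]
suggested = suggest_k(points)
```

Silhouette scoring costs O(n²) distance computations, so on large collections a sampled subset or a cheaper criterion such as the inertia elbow would likely be preferable.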

Screenshots

How It Works

CSV → Select Fields → Download Model → Embed & Store
  → Configure Plot → 3D Visualization → Search & Explore

Upload CSV — drag-and-drop in the web UI or pass a file path via CLI
Select fields — choose text columns (SentenceTransformer) and/or image URL columns (CLIP) to embed
Embed & store — rows are converted to vector embeddings and stored in a ChromaDB collection
Plot — pick a collection, set cluster count (or auto-suggest), choose a reduction algorithm, and render an interactive 3D scatter plot
Search — enter a text query or image URL to find and highlight the most similar items
Drill down — click a cluster to inspect items, sub-cluster for hierarchical exploration, and annotate with AI-generated or custom names
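
The select, embed, store, and search steps above can be sketched end to end in a few lines. This is a toy illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for SentenceTransformer or CLIP, a plain dict stands in for the ChromaDB collection, and every function name and the sample CSV are invented for this example rather than taken from the project.

```python
import csv
import io
import math
import re
from collections import Counter


def select_fields(csv_text, text_columns):
    """Join the chosen text columns of each row into one document string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [" ".join(row[col] for col in text_columns) for row in reader]


def toy_embed(text):
    """Stand-in embedding: a sparse bag-of-words count vector.

    The real pipeline uses SentenceTransformer or CLIP; this placeholder
    only keeps the sketch dependency-free.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(store, query, top_k=2):
    """Rank stored documents by similarity to the query, best first."""
    q = toy_embed(query)
    scored = [(cosine(q, vec), doc) for doc, vec in store.items()]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]


# Made-up sample data for the sketch.
CSV_TEXT = """id,title,description
1,Red running shoes,lightweight shoes for road running
2,Blue ceramic mug,a mug for coffee and tea
3,Trail running jacket,waterproof jacket for trail running
"""

docs = select_fields(CSV_TEXT, ["title", "description"])
store = {doc: toy_embed(doc) for doc in docs}  # stands in for a ChromaDB collection
results = search(store, "coffee mug", top_k=1)
```

With real embeddings the ranking step is the same idea: embed the query with the same model used at indexing time, then return the nearest stored vectors.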

For detailed usage of all three modes (SERVER, INDEX, PLOT), environment variables, and API endpoints, see docs/USAGE.md.

Architecture

The application has three running modes controlled by the RUNNING_MODE environment variable:

SERVER — FastAPI backend + React SPA (the web UI)
INDEX — CLI-only CSV embedding pipeline
PLOT — CLI-only cluster visualization
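
For illustration, mode selection from an environment variable might look like the sketch below. It is a minimal example, not the project's entry point: `RunningMode`, `resolve_mode`, and `describe` are hypothetical names, and trimming and upper-casing the value is an assumption made here.

```python
import os
from enum import Enum


class RunningMode(str, Enum):
    """The three modes named in this README."""

    SERVER = "SERVER"
    INDEX = "INDEX"
    PLOT = "PLOT"


def resolve_mode(env=None):
    """Read RUNNING_MODE from the environment and validate it.

    Normalizing whitespace and case is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    raw = env.get("RUNNING_MODE", "").strip().upper()
    try:
        return RunningMode(raw)
    except ValueError:
        valid = ", ".join(mode.value for mode in RunningMode)
        raise ValueError(f"RUNNING_MODE must be one of {valid}; got {raw!r}") from None


def describe(mode):
    """Hypothetical dispatch table mapping each mode to what it starts."""
    actions = {
        RunningMode.SERVER: "FastAPI backend + React SPA",
        RunningMode.INDEX: "CLI-only CSV embedding pipeline",
        RunningMode.PLOT: "CLI-only cluster visualization",
    }
    return actions[mode]
```

Failing fast on an unknown value keeps a typo such as `RUNNING_MODE=SERVE` from silently starting the wrong mode.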

See docs/ARCHITECTURE.md for the full system design, data flow diagrams, and component breakdown.

Development

uv sync --all-extras                           # Install dependencies
uv run ruff check embedding_cluster/ tests/    # Lint
uv run ruff format embedding_cluster/ tests/   # Format
uv run mypy embedding_cluster/                 # Type check (strict)
uv run pytest                                  # Run tests

Frontend (React 19 + TypeScript + Vite + Tailwind CSS 4):

cd frontend && npm install && npm run dev      # Dev server on :5173
cd frontend && npm run build                   # Production build
cd frontend && npm run test:e2e                # Playwright E2E tests

See CONTRIBUTING.md for the full development guide.

Tech Stack

Layer	Technology
Backend	Python 3.13, FastAPI, Pydantic
Embeddings	SentenceTransformers, CLIP (HuggingFace)
Vector DB	ChromaDB
ML	scikit-learn (KMeans, t-SNE, PCA), UMAP
AI Naming	LiteLLM (OpenAI, Google, Anthropic, Ollama)
Frontend	React 19, TypeScript, Vite, Tailwind CSS 4
3D	React Three Fiber, Three.js, drei
State	Zustand, TanStack React Query
Testing	pytest, Playwright
CI	GitHub Actions (lint, typecheck, test with 90% coverage)

Contributing

See CONTRIBUTING.md for setup, code style, and PR guidelines.

License

MIT
