diff --git a/AGENTS.md b/AGENTS.md
index a5f0e1c..c243b03 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -64,8 +64,8 @@ uv run pytest
 # Run with coverage report
 uv run pytest --cov=embedding_cluster --cov-report=term-missing
 
-# Run with coverage enforcement (70% minimum)
-uv run pytest --cov=embedding_cluster --cov-report=term-missing --cov-fail-under=70
+# Run with coverage enforcement (90% minimum, matches CI)
+uv run pytest --cov=embedding_cluster --cov-report=term-missing --cov-fail-under=90
 
 # Run a single test file
 uv run pytest tests/test_settings.py -v
@@ -110,7 +110,7 @@ E2E tests require pre-indexed ChromaDB data. The `webServer` config in
 GitHub Actions workflow in `.github/workflows/ci.yml` runs on push/PR:
 - **lint** job: `ruff check` + `ruff format --check`
 - **typecheck** job: `mypy embedding_cluster/`
-- **test** job: `pytest --cov --cov-fail-under=70`
+- **test** job: `pytest --cov` (90% minimum enforced by coverage report)
 
 All jobs use `uv sync --all-extras` for dependency installation.
 
@@ -172,16 +172,35 @@ embedding_cluster/
   settings.py # Pydantic Settings (env var config)
   utils.py # Shared utilities (logging, ChromaDB helpers, image downloader)
   indexer.py # INDEX mode: CSV parsing, embedding generation, ChromaDB storage
-  scatter_plot.py # PLOT mode: Clustering, t-SNE, Dash visualization
+  scatter_plot.py # PLOT mode: Clustering, dimensionality reduction, visualization data
+  ai_naming.py # LLM-powered cluster naming via LiteLLM
+  annotations.py # Cluster annotation persistence (JSON sidecar files)
   csv/ # Sample data files
+  server/
+    app.py # FastAPI app factory, SPA serving
+    models.py # Pydantic request/response models
+    tasks.py # Background task registry
+    ws.py # WebSocket manager for live progress
+    routes/
+      ai.py # AI cluster naming endpoints
+      annotations.py # Cluster annotation CRUD
+      collections.py # ChromaDB collection management
+      csv.py # CSV upload and preview
+      index.py # Indexing jobs with WebSocket progress
+      plot.py # Plot computation, cluster detail, sub-clustering
+      search.py # Semantic search (text and image)
+frontend/
+  src/
+    App.tsx # Router, QueryClient, Zustand provider
+    api/ # Typed API client layer
+    components/ # UI components organized by page
+    hooks/ # useIndexWebSocket, usePlotData
+    pages/ # HomePage, IndexPage, PlotPage, SettingsPage
+    stores/ # Zustand plotStore (plot state management)
+    types/ # TypeScript interfaces mirroring backend models
 tests/
-  __init__.py
   conftest.py # Shared fixtures
-  test_settings.py # Settings env var parsing tests
-  test_utils.py # Utilities, Singleton, ImageDownloader tests
-  test_indexer.py # Indexer pipeline tests (mocked ML models)
-  test_scatter_plot.py # Scatter plot tests (mocked data)
-  test_main.py # Entry point dispatch tests
+  test_*.py # Unit tests for each backend module and route
 ```
 
 ### Key Dependencies
@@ -191,10 +210,10 @@ Runtime:
 - `chromadb` - Vector database for embedding storage
 - `transformers` / `sentence-transformers` - Text and image embedding models
 - `torch` - ML framework backend
-- `dash` / `plotly` - Interactive 3D visualization
-- `scikit-learn` - KMeans clustering and t-SNE
+- `fastapi` / `uvicorn` - Web server and REST API
+- `scikit-learn` - KMeans clustering and dimensionality reduction
 - `aiohttp` - Async HTTP for image downloads
-- `openai` - Optional GPT-based cluster naming
+- `litellm` - Multi-provider LLM integration for cluster naming
 - `numpy` / `Pillow` - Numerical and image processing
 
 Dev:
@@ -202,6 +221,7 @@ Dev:
 - `mypy` - Static type checking
 - `ruff` - Linting and formatting
 - `pre-commit` - Git hook management
+- `httpx` - Test client for FastAPI routes
 
 ## Git & Commit Conventions
 
@@ -231,9 +251,15 @@ Extensive pre-commit setup. Key hooks:
 
 ## Data Flow
 
-1. **INDEX mode**: CSV -> parse rows -> generate embeddings (CLIP for images,
-   SentenceTransformer for text) -> store in ChromaDB collections
-2. **PLOT mode**: ChromaDB collection -> StandardScaler -> KMeans clustering ->
-   t-SNE 3D projection -> Dash/Plotly interactive scatter plot
-
-ChromaDB data is persisted to `./chromadb/` directory (gitignored).
+1. **INDEX mode**: CSV → parse rows → generate embeddings (CLIP for images,
+   SentenceTransformer for text) → store in ChromaDB collections
+2. **PLOT mode**: ChromaDB collection → StandardScaler → KMeans clustering →
+   dimensionality reduction (t-SNE/UMAP/PCA) → 3D point data via REST API
+3. **SERVER mode**: FastAPI serves REST API + built React SPA. Long-running
+   jobs (indexing, plot computation) use a task registry with WebSocket
+   progress streaming.
+
+Persistent data:
+- `./chromadb/` — Vector database (gitignored)
+- `./uploads/` — Uploaded CSV files (gitignored)
+- `./annotations/` — Cluster annotations as JSON sidecar files (gitignored)
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..2e629ff
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,99 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Python + React application for generating, indexing, and visualizing embedding clusters from CSV data. Uses CLIP/SentenceTransformer for embeddings, ChromaDB for vector storage, k-means for clustering, and a React/Three.js frontend for 3D visualization.
+
+- **Python 3.13**, managed with [uv](https://docs.astral.sh/uv/)
+- **Package name**: `embedding_cluster` (underscore, not hyphen)
+- **Entry point**: `python -m embedding_cluster` dispatches to INDEX, PLOT, or SERVER mode via `RUNNING_MODE` env var
+
+## Commands
+
+### Backend
+
+```bash
+uv sync --all-extras # Install all dependencies
+RUNNING_MODE=SERVER uv run python -m embedding_cluster # Start server on :8000
+uv run ruff check embedding_cluster/ tests/ # Lint
+uv run ruff check --fix embedding_cluster/ tests/ # Lint with auto-fix
+uv run ruff format embedding_cluster/ tests/ # Format
+uv run mypy embedding_cluster/ # Type check (strict mode)
+uv run pytest # Run all tests
+uv run pytest tests/test_settings.py -v # Run single test file
+uv run pytest tests/test_settings.py::test_fn -v # Run single test function
+uv run pytest --cov=embedding_cluster --cov-report=term-missing --cov-fail-under=90 # Coverage (90% CI min)
+uv run pre-commit run --all-files # All pre-commit hooks
+```
+
+### Frontend
+
+```bash
+cd frontend && npm install # Install deps
+cd frontend && npm run dev # Dev server on :5173
+cd frontend && npm run build # Production build (output: frontend/dist)
+cd frontend && npm run lint # ESLint
+cd frontend && npm run test:e2e # 
Playwright E2E tests
+cd frontend && npx playwright test e2e/search.spec.ts # Single E2E test
+```
+
+E2E tests require pre-indexed ChromaDB data and a built frontend. The Playwright config auto-starts the FastAPI backend.
+
+## Architecture
+
+### Three Running Modes
+
+All controlled by `RUNNING_MODE` env var, dispatched in `__main__.py`:
+- **INDEX**: `indexer.py` — CSV parsing → embedding generation → ChromaDB storage
+- **PLOT**: `scatter_plot.py` — ChromaDB → StandardScaler → k-means → dimensionality reduction (t-SNE/UMAP/PCA)
+- **SERVER**: `server/app.py` — FastAPI backend serving REST API + built React SPA from `frontend/dist`
+
+### Backend Structure
+
+- `settings.py` — All config via env vars using `pydantic-settings` `BaseSettings`
+- `server/app.py` — FastAPI app factory, mounts route modules and serves SPA
+- `server/routes/` — API routes split by domain: `ai.py`, `annotations.py`, `collections.py`, `csv.py`, `index.py`, `plot.py`, `search.py`
+- `server/tasks.py` — Background task management for long-running operations
+- `server/ws.py` — WebSocket support for live progress
+- `ai_naming.py` — LLM-powered cluster naming via LiteLLM (supports OpenAI, Ollama)
+- `annotations.py` — Cluster annotation persistence (JSON sidecar files in `annotations/`)
+- `utils.py` — ChromaDB helpers, image downloader with retry, singleton pattern
+
+### Frontend Structure
+
+React 19 + TypeScript + Vite + Tailwind CSS 4:
+- `pages/` — `HomePage`, `IndexPage`, `PlotPage`, `SettingsPage`
+- `components/` — Organized by page: `home/`, `index/`, `plot/`, `csv/`
+- `stores/plotStore.ts` — Zustand store for plot state
+- `api/` — API client layer
+- `hooks/` — React Query hooks
+- 3D visualization uses React Three Fiber (`@react-three/fiber` + `@react-three/drei`)
+
+## Code Style
+
+### Python
+- **ruff**: line length 90, target py313
+- **mypy strict mode** — all functions need type annotations
+- Use `from __future__ import annotations` in every module
+- Modern syntax: 
`str | None` (not `Optional`), `list[str]` (not `List`)
+- Absolute imports only: `from embedding_cluster.settings import Settings`
+- Heavy imports behind `TYPE_CHECKING` blocks where possible
+- Logger per module: `logger = logging.getLogger(__name__)`
+
+### Git Conventions
+- **Conventional commits** enforced by commitizen: `type(scope): description`
+- Types: `feat`, `fix`, `docs`, `test`, `refactor`
+- **No direct commits to master** (enforced by pre-commit hook)
+- Branch naming: `feature-name` style (e.g., `feat/ollama-provider-integration`)
+
+### Pre-commit Hooks
+Extensive setup including: ruff, commitizen, yamllint, markdownlint, shellcheck, gitleaks (secret detection), hadolint, check-jsonschema, no-commit-to-branch. Install with:
+```bash
+uv run pre-commit install --install-hooks -t pre-commit -t commit-msg
+```
+
+## CI
+
+GitHub Actions (`.github/workflows/ci.yml`): lint → typecheck → test (90% coverage minimum). All jobs use `uv sync --all-extras`.
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
new file mode 100644
index 0000000..cde6c60
--- /dev/null
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,18 @@
+# Code of Conduct
+
+This project follows the
+[Contributor Covenant Code of Conduct v2.1](https://www.contributor-covenant.org/version/2/1/code_of_conduct/).
+
+Please read the full text at the link above. In summary, we are committed
+to providing a welcoming and inclusive experience for everyone.
+
+## Reporting
+
+If you experience or witness unacceptable behavior, please contact the
+project maintainer at **asafgallea@gmail.com**. All reports will be
+handled with discretion.
+
+## Attribution
+
+This Code of Conduct is adapted from the
+[Contributor Covenant](https://www.contributor-covenant.org), version 2.1.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000..753b542
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,171 @@
+# Contributing
+
+Thanks for your interest in contributing to embedding-clusters! 
This guide
+covers everything you need to get started.
+
+## Prerequisites
+
+- [Python 3.13+](https://www.python.org/downloads/)
+- [uv](https://docs.astral.sh/uv/getting-started/installation/) package
+  manager
+- [Node.js 18+](https://nodejs.org/) (for frontend development)
+
+## Setup
+
+```bash
+git clone https://github.com/aGallea/embedding-clusters.git
+cd embedding-clusters
+uv sync --all-extras
+uv run pre-commit install --install-hooks -t pre-commit -t commit-msg
+```
+
+For frontend work:
+
+```bash
+cd frontend
+npm install
+```
+
+## Running Locally
+
+Start the full application (backend + frontend):
+
+```bash
+RUNNING_MODE=SERVER uv run python -m embedding_cluster
+```
+
+For frontend development with hot reload:
+
+```bash
+# Terminal 1 — backend
+RUNNING_MODE=SERVER uv run python -m embedding_cluster
+
+# Terminal 2 — frontend dev server (proxies API to backend)
+cd frontend && npm run dev
+```
+
+The Vite dev server runs on `http://localhost:5173` and proxies `/api` and
+`/ws` requests to the backend on port 8000.
+
+## Testing
+
+### Backend (Python)
+
+```bash
+uv run pytest # Run all tests
+uv run pytest tests/test_settings.py -v # Single file
+uv run pytest tests/test_settings.py::test_fn # Single test
+uv run pytest --cov=embedding_cluster \
+  --cov-report=term-missing --cov-fail-under=90 # With coverage
+```
+
+Tests use `pytest-asyncio` in auto mode. CI enforces a **90% minimum
+coverage** threshold.
+
+### Frontend (E2E)
+
+```bash
+cd frontend
+npx playwright install chromium # First-time setup
+npm run build # Build required before E2E
+npm run test:e2e # Run tests
+npm run test:e2e:ui # Run with interactive UI
+```
+
+E2E tests require pre-indexed data in ChromaDB. See the
+[AGENTS.md](AGENTS.md) E2E section for setup instructions.
+
+## Code Style
+
+### Python
+
+- **ruff** for linting and formatting (line length 90, target py313)
+- **mypy** in strict mode — all functions require type annotations
+- `from __future__ import annotations` in every module
+- Modern type syntax: `str | None`, `list[str]`, `dict[str, Any]`
+- Absolute imports only: `from embedding_cluster.settings import Settings`
+- Heavy imports behind `TYPE_CHECKING` blocks where possible
+- Logger per module: `logger = logging.getLogger(__name__)`
+
+```bash
+uv run ruff check embedding_cluster/ tests/ # Lint
+uv run ruff check --fix embedding_cluster/ tests/ # Auto-fix
+uv run ruff format embedding_cluster/ tests/ # Format
+uv run mypy embedding_cluster/ # Type check
+```
+
+### Frontend (TypeScript)
+
+- ESLint with TypeScript and React hooks plugins
+- Tailwind CSS 4 for styling
+
+```bash
+cd frontend && npm run lint
+```
+
+## Pre-commit Hooks
+
+The project uses extensive pre-commit hooks that run automatically on
+commit. Key hooks include:
+
+- **ruff** — linting (with auto-fix) and formatting
+- **commitizen** — commit message validation
+- **gitleaks** — secret detection
+- **yamllint** / **markdownlint** — config file linting
+- **no-commit-to-branch** — prevents direct commits to master
+
+Run all hooks manually:
+
+```bash
+uv run pre-commit run --all-files
+```
+
+## Commit Messages
+
+This project uses [Conventional Commits](https://www.conventionalcommits.org/)
+enforced by [commitizen](https://commitizen-tools.github.io/commitizen/).
+
+Format: `type(scope): description`
+
+| Type | Use for |
+|------|---------|
+| `feat` | New features |
+| `fix` | Bug fixes |
+| `docs` | Documentation changes |
+| `test` | Adding or updating tests |
+| `refactor` | Code changes that neither fix bugs nor add features |
+
+Examples:
+
+```text
+feat(search): add image URL search support
+fix(indexer): handle empty CSV rows gracefully
+docs(readme): update quick start instructions
+test(server): add collection deletion tests
+```
+
+## Pull Request Process
+
+1. Create a branch from `master` (e.g. `feat/my-feature`)
+2. Make your changes and ensure all checks pass:
+   ```bash
+   uv run ruff check embedding_cluster/ tests/
+   uv run ruff format --check embedding_cluster/ tests/
+   uv run mypy embedding_cluster/
+   uv run pytest --cov=embedding_cluster --cov-fail-under=90
+   ```
+3. Push and open a pull request against `master`
+4. CI will run lint, typecheck, and test jobs automatically
+5. All conversations must be resolved before merging
+6. At least one approving review is required
+
+## Project Structure
+
+See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full system
+design and component breakdown.
+
+## Good First Issues
+
+Look for issues labeled
+[`good first issue`](https://github.com/aGallea/embedding-clusters/labels/good%20first%20issue)
+for beginner-friendly tasks.
diff --git a/README.md b/README.md
index 23661f9..879a10e 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,15 @@
 # embedding-clusters
 
-![python-version][python-version]
+[![python-version][python-badge]][python-url]
+[![CI][ci-badge]][ci-url]
+[![License: MIT][license-badge]][license-url]
 
-Turn raw CSV data into beautiful, interactive embedding clusters with fast
+Turn raw CSV data into beautiful, interactive 3D embedding clusters with
 semantic search and a web UI.
 
 ![3D cluster plot](docs/screenshots/3d-cluster-plot.png)
 
-## Quick Start (Web UI)
+## Quick Start
 
 ```bash
 git clone https://github.com/aGallea/embedding-clusters.git
@@ -18,143 +20,117 @@ RUNNING_MODE=SERVER uv run python -m embedding_cluster
 
 Open .
 
-## Features
+> Requires Python 3.13+ and [uv](https://docs.astral.sh/uv/).
 
-- **Embeddings & Storage**: CLIP (images) + SentenceTransformer (text) with
-  ChromaDB persistence.
-- **Clustering & Plot**: k-means clusters with 3D t-SNE, UMAP, or PCA.
-- **Search & Collections**: semantic search by text or image URL, collection
-  browsing and deletion.
-- **Web UI**: CSV upload, live progress, plot controls, and multiple render
-  modes.
+## Features
 
-## Visual Highlights
+- **Embeddings** — CLIP (images) + SentenceTransformer (text) with
+  ChromaDB persistence
+- **Clustering** — k-means with automatic cluster count suggestion
+- **3D Visualization** — interactive scatter plot with t-SNE, UMAP, or PCA
+  reduction and multiple render modes (particles, sprites, instanced spheres)
+- **Semantic Search** — find similar items by text or image URL, highlighted
+  directly in the 3D view
+- **Cluster Drill-Down** — inspect cluster items, sub-cluster within a
+  cluster, and explore hierarchical structure
+- **AI-Powered Naming** — auto-label clusters using OpenAI, Google,
+  Anthropic, or Ollama via LiteLLM
+- **Annotations** — rename, tag, and annotate clusters with persistent
+  notes
+- **Web UI** — CSV upload, live indexing progress via WebSocket, plot
+  controls, and collection management
+
+## Screenshots
 
 ![Semantic search demo](docs/gifs/semantic-search-mini.gif)
 
-![Home dashboard](docs/screenshots/home-page.png)
-![Index page config](docs/screenshots/index-page-config.png)
-![Index page progress](docs/screenshots/index-page-progress.png)
-![Semantic search results](docs/screenshots/semantic-search.png)
-
-## How Things Work
-
-The tool turns a CSV file into an interactive 3D cluster
-visualization in a few steps:
-
-1. 
**Upload CSV** -- Provide a CSV file containing your data.
-   The web UI lets you drag-and-drop; the CLI accepts a file path.
-2. **Select fields** -- Choose which columns to embed. Text fields
-   (e.g. product names) use a SentenceTransformer model; image URL
-   fields use a CLIP model. You can embed both in the same dataset.
-3. **Model download** -- The selected model is pulled from
-   [HuggingFace](https://huggingface.co) on first use and cached
-   locally for subsequent runs.
-4. **Embedding & storage** -- Each row is converted into a vector
-   embedding by the chosen model. Embeddings are stored in a
-   [ChromaDB](https://www.trychroma.com/) collection for
-   persistent, queryable vector storage.
-5. **Plot configuration** -- Pick a collection, set the number of
-   k-means clusters (or let the tool suggest one), and choose a
-   dimensionality reduction algorithm (t-SNE, UMAP, or PCA).
-6. **3D visualization** -- The reduced vectors are rendered as an
-   interactive 3D scatter plot. Hover for metadata, toggle cluster
-   visibility, switch render modes, or go fullscreen.
-7. **Semantic search** -- Enter a text query or paste an image URL
-   to find the most similar items. Matching points are highlighted
-   directly in the 3D view.
-8. **Cluster groupings** -- Toggle individual clusters on/off to
-   focus on specific groups. Use the optional GPT-powered naming
-   to label each cluster automatically.
-
-## Cluster Drill-Down and Annotation
-
-After generating a plot you can inspect, subdivide, and annotate
-individual clusters directly from the web UI.
-
-### Cluster Detail Panel
-
-Click a cluster name in the legend to open a side panel listing every
-item in that cluster. Items are sorted by distance to the centroid so
-the most representative points appear first. The panel supports
-pagination, displays item metadata, and shows image thumbnails when
-an image field is available.
-
-### Sub-Clustering
-
-Inside the detail panel, toggle **Sub-cluster** to re-run k-means
-within a single cluster. The result is rendered as a mini 3D scatter
-plot (PCA-reduced) so you can explore hierarchical structure without
-leaving the page.
-
-### Annotations
-
-Each cluster can be renamed, tagged, and annotated with free-form
-notes. Changes are saved automatically (debounced) and persisted as
-JSON sidecar files in the `annotations/` directory. Annotations
-survive page reloads and are scoped per plot job.
-
-### API Endpoints
-
-The feature exposes the following REST endpoints under `/api`:
-
-- `GET /plot/{job_id}/cluster/{index}` -- paginated cluster detail
-- `POST /plot/{job_id}/cluster/{index}/sub-cluster` -- sub-cluster
-  a single cluster with configurable k
-- `GET /annotations/{job_id}` -- fetch all annotations for a job
-- `PUT /annotations/{job_id}` -- update annotations
-- `DELETE /annotations/{job_id}` -- delete annotations
+| | |
+|---|---|
+| ![Home](docs/screenshots/home-page.png) | ![Index config](docs/screenshots/index-page-config.png) |
+| ![Index progress](docs/screenshots/index-page-progress.png) | ![Search](docs/screenshots/semantic-search.png) |
+
+## How It Works
 
 ```text
-CSV --> Select Fields --> Download Model --> Embed & Store
-  --> Configure Plot --> 3D Visualization --> Search & Explore
+CSV → Select Fields → Download Model → Embed & Store
+  → Configure Plot → 3D Visualization → Search & Explore
 ```
 
-## Using CLI
+1. **Upload CSV** — drag-and-drop in the web UI or pass a file path via CLI
+2. **Select fields** — choose text columns (SentenceTransformer) and/or
+   image URL columns (CLIP) to embed
+3. **Embed & store** — rows are converted to vector embeddings and stored in
+   a ChromaDB collection
+4. **Plot** — pick a collection, set cluster count (or auto-suggest), choose
+   a reduction algorithm, and render an interactive 3D scatter plot
+5. 
**Search** — enter a text query or image URL to find and highlight the
+   most similar items
+6. **Drill down** — click a cluster to inspect items, sub-cluster for
+   hierarchical exploration, and annotate with AI-generated or custom names
+
+For detailed usage of all three modes (SERVER, INDEX, PLOT), environment
+variables, and API endpoints, see [docs/USAGE.md](docs/USAGE.md).
+
+## Architecture
+
+The application has three running modes controlled by the `RUNNING_MODE`
+environment variable:
+
+- **SERVER** — FastAPI backend + React SPA (the web UI)
+- **INDEX** — CLI-only CSV embedding pipeline
+- **PLOT** — CLI-only cluster visualization
+
+See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full system design,
+data flow diagrams, and component breakdown.
 
-### Index (CLI)
+## Development
 
 ```bash
-RUNNING_MODE=INDEX \
-  LOCAL_CSV_FILENAME=./embedding_cluster/csv/fashion_small.csv \
-  ID_FIELD=id \
-  IMAGE_EMBEDDING_FIELDS='["imageUrl"]' \
-  CHROMADB_COLLECTION_PREFIX=fashion_ \
-  NUMBER_OF_ASYNC_TASKS=10 \
-  uv run python -m embedding_cluster
+uv sync --all-extras # Install dependencies
+uv run ruff check embedding_cluster/ tests/ # Lint
+uv run ruff format embedding_cluster/ tests/ # Format
+uv run mypy embedding_cluster/ # Type check (strict)
+uv run pytest # Run tests
 ```
 
-### Plot (CLI)
+Frontend (React 19 + TypeScript + Vite + Tailwind CSS 4):
 
 ```bash
-RUNNING_MODE=PLOT \
-  CHROMADB_COLLECTION_NAME=fashion_imageUrl \
-  TEXT_DISPLAY_FIELDS='["productDisplayName"]' \
-  IMAGE_FIELD=imageUrl \
-  uv run python -m embedding_cluster
+cd frontend && npm install && npm run dev # Dev server on :5173
+cd frontend && npm run build # Production build
+cd frontend && npm run test:e2e # Playwright E2E tests
 ```
 
-Key environment variables:
-
-- `RUNNING_MODE`: `INDEX`, `PLOT`, or `SERVER`
-- `TEXT_MODEL_NAME`: SentenceTransformer model name
-- `IMAGE_MODEL_NAME`: CLIP model name
-- `NUM_CLUSTERS`: k-means cluster count
-- `REDUCTION_ALGORITHM`: `tsne`, `umap`, or 
`pca`
+See [CONTRIBUTING.md](CONTRIBUTING.md) for the full development guide.
 
-## Development
+## Tech Stack
 
-```bash
-uv sync --all-extras
-uv run ruff check embedding_cluster/ tests/
-uv run ruff format embedding_cluster/ tests/
-uv run mypy embedding_cluster/
-uv run pytest
-```
+| Layer | Technology |
+|-------|-----------|
+| Backend | Python 3.13, FastAPI, Pydantic |
+| Embeddings | SentenceTransformers, CLIP (HuggingFace) |
+| Vector DB | ChromaDB |
+| ML | scikit-learn (KMeans, t-SNE, PCA), UMAP |
+| AI Naming | LiteLLM (OpenAI, Google, Anthropic, Ollama) |
+| Frontend | React 19, TypeScript, Vite, Tailwind CSS 4 |
+| 3D | React Three Fiber, Three.js, drei |
+| State | Zustand, TanStack React Query |
+| Testing | pytest, Playwright |
+| CI | GitHub Actions (lint, typecheck, test with 90% coverage) |
 
 ## Contributing
 
-Pull requests are welcome. For major changes, please open an issue first.
+See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, code style, and PR
+guidelines.
+
+## License
+
+[MIT](LICENSE)
 
-[python-version]: https://img.shields.io/badge/python-3.13-blue.svg
+[python-badge]: https://img.shields.io/badge/python-3.13-blue.svg
+[python-url]: https://www.python.org/downloads/
+[ci-badge]: https://github.com/aGallea/embedding-clusters/actions/workflows/ci.yml/badge.svg
+[ci-url]: https://github.com/aGallea/embedding-clusters/actions/workflows/ci.yml
+[license-badge]: https://img.shields.io/badge/License-MIT-green.svg
+[license-url]: LICENSE
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 0000000..d1dea2c
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,39 @@
+# Security Policy
+
+## Reporting a Vulnerability
+
+If you discover a security vulnerability, please report it responsibly by
+emailing **asafgallea@gmail.com**. Do not open a public issue.
+
+You can expect:
+
+- An acknowledgment within **48 hours**
+- A status update within **7 days**
+- Coordinated disclosure once a fix is available
+
+## Supported Versions
+
+| Version | Supported |
+|---------|-----------|
+| Latest on `master` | Yes |
+| Older releases | No |
+
+## Scope
+
+This policy covers the `embedding-clusters` application code, including:
+
+- The Python backend (FastAPI server, indexer, plot computation)
+- The React frontend
+- Configuration and build tooling
+
+## Security Considerations
+
+- **File uploads** — CSV uploads are saved to a sandboxed `./uploads/`
+  directory. The server validates file paths to prevent directory traversal.
+- **AI credentials** — LLM API keys are configured per-session in the
+  browser and sent per-request. They are not stored server-side.
+- **ChromaDB** — runs embedded (no network exposure). Data is stored
+  locally in `./chromadb/`.
+- **No authentication** — the application is designed for local or trusted
+  network use. Do not expose it to the public internet without adding an
+  authentication layer.
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..6815476
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,268 @@
+# Architecture
+
+This document describes the system design, component responsibilities, and
+data flow of **embedding-clusters**.
+
+## Overview
+
+The application converts CSV data into interactive 3D embedding
+visualizations. 
It has three running modes, all dispatched from a single
+entry point (`python -m embedding_cluster`):
+
+| Mode | Entry | Purpose |
+|------|-------|---------|
+| `SERVER` | `server/app.py` | FastAPI backend + React SPA |
+| `INDEX` | `indexer.py` | CLI embedding pipeline |
+| `PLOT` | `scatter_plot.py` | CLI cluster visualization |
+
+```text
+            __main__.py
+           /     |     \
+        /        |        \
+    INDEX     SERVER      PLOT
+      |          |          |
+  indexer.py  FastAPI  scatter_plot.py
+      |       /  |  \       |
+      |   routes | SPA      |
+      |          |          |
+      +--- ChromaDB --------+
+```
+
+## Backend Components
+
+### Configuration (`settings.py`)
+
+All configuration is driven by environment variables, parsed by
+`pydantic-settings` `BaseSettings`. Each setting has a `Field()` with a
+default value and description. List fields accept JSON-encoded strings
+(e.g. `'["field1","field2"]'`).
+
+### Indexing Pipeline (`indexer.py`)
+
+Responsible for the INDEX mode and also used by the server's indexing route.
+
+1. Read CSV rows (with optional start/stop line range)
+2. Load embedding models:
+   - **SentenceTransformer** for text fields
+   - **CLIP** (via HuggingFace Transformers) for image URL fields
+3. Generate embeddings in batches with semaphore-controlled concurrency
+4. Store embeddings + metadata in ChromaDB collections
+5. Report progress via callback (used by WebSocket in server mode)
+6. Support cancellation via `asyncio.Event`
+
+Images are downloaded asynchronously with exponential backoff retry
+(up to 6 attempts) using a singleton `ImageDownloader` backed by
+`aiohttp.ClientSession`.
+
+### Plot Computation (`scatter_plot.py`)
+
+Responsible for the PLOT mode and used by the server's plot route.
+
+1. Load embeddings from a ChromaDB collection
+2. Standardize with `StandardScaler`
+3. Reduce dimensions using t-SNE, UMAP, or PCA
+4. Cluster with KMeans
+5. Compute silhouette scores, centroids, and per-point distances
+6. 
Return structured point and cluster data
+
+Additional capabilities:
+- **Optimal cluster suggestion** — evaluates k=2..30 with inertia and
+  silhouette scores
+- **Sub-clustering** — re-run KMeans within a single cluster or on a
+  selected subset of points
+
+### AI Naming (`ai_naming.py`)
+
+Uses [LiteLLM](https://github.com/BerriAI/litellm) as a universal gateway
+to call any LLM provider (OpenAI, Google, Anthropic, Ollama) with a single
+interface. Generates short (max 5 words) descriptive names for clusters
+based on sampled items.
+
+### Annotations (`annotations.py`)
+
+Persists cluster metadata (name, notes, tags) as JSON sidecar files in
+`./annotations/`, one file per plot job. The `AnnotationManager` handles
+read/write with automatic timestamping.
+
+### Utilities (`utils.py`)
+
+- **Logging** — colored console formatter
+- **ChromaDB helpers** — collection creation, batch document initialization
+- **ImageDownloader** — singleton async image fetcher with retry logic
+- **ID generator** — random alphanumeric IDs for jobs and documents
+
+## Server Architecture
+
+The `SERVER` mode runs a FastAPI application that serves both the REST API
+and the built React SPA.
+
+### App Factory (`server/app.py`)
+
+`create_app()` assembles the FastAPI app:
+- Registers all API route modules under `/api`
+- Adds CORS middleware for frontend dev server (`localhost:5173`)
+- Serves the React SPA from `frontend/dist` (if built), with catch-all
+  fallback to `index.html` for client-side routing
+
+### Task Management (`server/tasks.py`)
+
+Long-running operations (indexing, plot computation) run as background
+async tasks tracked by an in-memory `TaskRegistry`:
+
+- Each job gets a unique ID and a `TaskState` with status, progress dict,
+  result, error, and a cancel event
+- Status lifecycle: `PENDING` → `RUNNING` → `COMPLETED` | `FAILED` | `CANCELLED`
+- Clients poll status via REST or subscribe via WebSocket
+
+### WebSocket Manager (`server/ws.py`)
+
+Manages per-job WebSocket connections for real-time progress streaming.
+Broadcasts JSON messages (progress, log, heartbeat, completed, error) to
+all connected clients for a given job ID.
+
+### API Routes (`server/routes/`)
+
+| Route module | Prefix | Responsibility |
+|-------------|--------|---------------|
+| `csv.py` | `/api/csv` | Upload and preview CSV files |
+| `index.py` | `/api/index` | Start/cancel indexing jobs, WebSocket progress |
+| `collections.py` | `/api/collections` | List, detail, delete ChromaDB collections |
+| `plot.py` | `/api/plot` | Compute plots, cluster detail, sub-clustering, suggest k |
+| `search.py` | `/api/search` | Semantic search (text or image query) |
+| `ai.py` | `/api/ai` | LLM cluster naming, connection testing, Ollama proxy |
+| `annotations.py` | `/api/annotations` | CRUD for cluster annotations |
+
+### Request/Response Models (`server/models.py`)
+
+All API contracts are defined as Pydantic models. The frontend TypeScript
+types in `frontend/src/types/index.ts` mirror these models.
+
+## Frontend Architecture
+
+The frontend is a React 19 SPA built with Vite and Tailwind CSS 4.
+
+### Routing (`App.tsx`)
+
+Four pages mapped via React Router:
+
+| Path | Page | Purpose |
+|------|------|---------|
+| `/` | `HomePage` | Collection browser, quick actions |
+| `/index` | `IndexPage` | CSV upload, embedding config, progress |
+| `/plot` | `PlotPage` | 3D visualization, search, annotations |
+| `/settings` | `SettingsPage` | AI provider configuration |
+
+### State Management
+
+- **Zustand** (`stores/plotStore.ts`) — single store for all plot-related
+  state: points, clusters, visibility, search results, drill-down path,
+  annotations, render mode, algorithm parameters
+- **TanStack React Query** — server state (collections, plot data polling)
+
+### 3D Visualization
+
+Uses [React Three Fiber](https://github.com/pmndrs/react-three-fiber)
+(`@react-three/fiber`) with `drei` helpers. Three render modes:
+
+1. **Particles** — GPU-accelerated point cloud (default, best performance)
+2. **Sprites** — image thumbnails at each point (when image field available)
+3. **Instanced Spheres** — 3D sphere meshes with lighting
+
+### API Client Layer (`api/`)
+
+Typed fetch wrappers organized by domain (`client.ts`, `indexing.ts`,
+`plot.ts`, `ai.ts`, `collections.ts`, `csv.ts`). All requests go through
+a shared `apiFetch()` utility with error handling.
+ +### Hooks + +- `useIndexWebSocket` — real-time indexing progress with stuck detection + (warning after 15s, error after 30s of silence) +- `usePlotData` — starts plot computation, polls for results every 2s + +## Data Flow + +### Indexing (Web UI) + +```text +Browser Server Storage + | | | + |-- POST /csv/upload ---------->| | + |<---- filename, columns -------| | + | | | + |-- POST /index/start -------->| | + |<---- job_id ------------------| | + | |-- load models | + |== WS /index/ws/{job_id} ====>| | + | |-- read CSV | + |<--- progress messages --------|-- embed rows ---------->| + |<--- log messages -------------|-- store in ChromaDB --->| + |<--- completed message --------| | +``` + +### Plot Generation + +```text +Browser Server Storage + | | | + |-- POST /plot/compute -------->| | + |<---- job_id ------------------| | + | |-- load embeddings <----| + |-- GET /plot/data/{id} ------->|-- reduce dimensions | + |<---- ready: false ------------|-- KMeans clustering | + |-- GET /plot/data/{id} ------->|-- compute centroids | + |<---- ready: true, data -------| | + | | | + |-- render 3D scene | | +``` + +### Semantic Search + +```text +Browser Server Storage + | | | + |-- POST /search -------------->| | + | |-- infer model type | + | |-- embed query | + | |-- ChromaDB.query() <----| + |<---- results + distances -----| | + | | | + |-- highlight in 3D scene | | +``` + +## Storage + +| Directory | Contents | Persistence | +|-----------|----------|-------------| +| `./chromadb/` | Vector database (embeddings + metadata) | Persistent, gitignored | +| `./uploads/` | User-uploaded CSV files | Persistent, gitignored | +| `./annotations/` | Cluster annotation JSON files | Persistent, gitignored | + +## Design Decisions + +### Why ChromaDB? + +ChromaDB provides embedded vector storage with no external dependencies. 
+Collections persist to disk automatically, support metadata filtering, and +offer nearest-neighbor search out of the box — exactly what this tool needs +without requiring a separate database server. + +### Why LiteLLM? + +Rather than coupling to a single LLM provider, LiteLLM provides a unified +interface to OpenAI, Google, Anthropic, and Ollama. Users can switch +providers from the settings page without code changes. + +### Why React Three Fiber? + +The 3D visualization needs to render thousands of points interactively. +React Three Fiber provides a React-native API over Three.js, enabling +declarative scene composition while retaining GPU-level performance through +instanced rendering and point clouds. + +### Job-Based Architecture + +Embedding generation and plot computation can take seconds to minutes. The +task registry pattern decouples request handling from execution, allowing +the frontend to poll or subscribe via WebSocket without blocking HTTP +connections. diff --git a/docs/USAGE.md b/docs/USAGE.md new file mode 100644 index 0000000..d3fa07e --- /dev/null +++ b/docs/USAGE.md @@ -0,0 +1,199 @@ +# Usage + +This guide covers all three running modes and their configuration. + +## Web UI (SERVER mode) + +The recommended way to use embedding-clusters: + +```bash +RUNNING_MODE=SERVER uv run python -m embedding_cluster +``` + +Open . The web UI provides: + +1. **Home** — browse existing collections, see item counts and model info +2. **Index** — upload a CSV, select fields to embed, configure models, + and watch real-time progress via WebSocket +3. **Plot** — pick a collection, set clustering parameters, and interact + with a 3D scatter plot +4. **Settings** — configure AI provider for cluster naming + +### Workflow + +1. Navigate to the **Index** page +2. Upload a CSV file (drag-and-drop or file picker) +3. Select which columns to embed: + - **Text fields** use a SentenceTransformer model + - **Image URL fields** use a CLIP model +4. 
Click **Start** and watch the progress bar +5. Navigate to the **Plot** page +6. Select the new collection and configure: + - Number of clusters (or click **Suggest** for auto-detection) + - Reduction algorithm: t-SNE, UMAP, or PCA + - Algorithm-specific parameters (perplexity, learning rate, etc.) +7. Click **Compute** to generate the 3D visualization +8. Explore: + - **Hover** points to see metadata + - **Search** by text or image URL to highlight similar items + - **Click** a cluster in the legend to drill down + - **Sub-cluster** within a cluster for hierarchical exploration + - **Annotate** clusters with names, tags, and notes + - **AI Name** clusters using your configured LLM provider + +### Render Modes + +The 3D plot supports three render modes, switchable from the plot controls: + +- **Particles** — GPU-accelerated point cloud (default, best for large + datasets) +- **Sprites** — image thumbnails at each point (requires an image field) +- **Instanced Spheres** — 3D sphere meshes with lighting effects + +## CLI: INDEX mode + +Embed CSV data into ChromaDB collections from the command line: + +```bash +RUNNING_MODE=INDEX \ + LOCAL_CSV_FILENAME=./embedding_cluster/csv/fashion_small.csv \ + ID_FIELD=id \ + IMAGE_EMBEDDING_FIELDS='["imageUrl"]' \ + CHROMADB_COLLECTION_PREFIX=fashion_ \ + NUMBER_OF_ASYNC_TASKS=10 \ + uv run python -m embedding_cluster +``` + +You can also embed text fields: + +```bash +RUNNING_MODE=INDEX \ + LOCAL_CSV_FILENAME=./data/products.csv \ + ID_FIELD=product_id \ + TEXT_EMBEDDING_FIELDS='["name", "description"]' \ + CHROMADB_COLLECTION_PREFIX=products_ \ + uv run python -m embedding_cluster +``` + +Or both text and image fields in the same run: + +```bash +RUNNING_MODE=INDEX \ + LOCAL_CSV_FILENAME=./data/catalog.csv \ + ID_FIELD=id \ + TEXT_EMBEDDING_FIELDS='["title"]' \ + IMAGE_EMBEDDING_FIELDS='["thumbnail_url"]' \ + CHROMADB_COLLECTION_PREFIX=catalog_ \ + uv run python -m embedding_cluster +``` + +## CLI: PLOT mode + +Generate a 
cluster visualization from an existing collection: + +```bash +RUNNING_MODE=PLOT \ + CHROMADB_COLLECTION_NAME=fashion_imageUrl \ + TEXT_DISPLAY_FIELDS='["productDisplayName"]' \ + IMAGE_FIELD=imageUrl \ + NUM_CLUSTERS=8 \ + REDUCTION_ALGORITHM=umap \ + uv run python -m embedding_cluster +``` + +## Environment Variables + +All configuration is via environment variables, parsed by +[pydantic-settings](https://docs.pydantic.dev/latest/concepts/pydantic_settings/). + +### General + +| Variable | Default | Description | +|----------|---------|-------------| +| `RUNNING_MODE` | — | `INDEX`, `PLOT`, or `SERVER` | +| `DEVICE` | `cpu` | PyTorch device (`cpu`, `mps`, `cuda`) | + +### Indexing + +| Variable | Default | Description | +|----------|---------|-------------| +| `LOCAL_CSV_FILENAME` | — | Path to CSV file | +| `ID_FIELD` | — | Column name for unique row IDs | +| `TEXT_EMBEDDING_FIELDS` | `[]` | JSON array of text columns to embed | +| `IMAGE_EMBEDDING_FIELDS` | `[]` | JSON array of image URL columns to embed | +| `TEXT_MODEL_NAME` | `BAAI/bge-small-en-v1.5` | SentenceTransformer model | +| `IMAGE_MODEL_NAME` | `openai/clip-vit-base-patch32` | CLIP model | +| `CHROMADB_COLLECTION_PREFIX` | — | Prefix for collection names | +| `NUMBER_OF_ASYNC_TASKS` | `5` | Concurrency limit for async operations | +| `BULK_SIZE` | `100` | Batch size for ChromaDB upserts | +| `START_LINE` | — | First CSV line to process (optional) | +| `STOP_LINE` | — | Last CSV line to process (optional) | + +### Plotting + +| Variable | Default | Description | +|----------|---------|-------------| +| `CHROMADB_COLLECTION_NAME` | — | Collection to visualize | +| `TEXT_DISPLAY_FIELDS` | `[]` | JSON array of metadata fields to show on hover | +| `IMAGE_FIELD` | — | Metadata field containing image URLs | +| `NUM_CLUSTERS` | `10` | Number of k-means clusters | +| `REDUCTION_ALGORITHM` | `tsne` | `tsne`, `umap`, or `pca` | +| `TSNE_PERPLEXITY` | `30` | t-SNE perplexity parameter | +| `TSNE_LEARNING_RATE` 
| `200` | t-SNE learning rate | +| `UMAP_N_NEIGHBORS` | `15` | UMAP neighbors parameter | +| `UMAP_MIN_DIST` | `0.1` | UMAP minimum distance | +| `UMAP_METRIC` | `cosine` | UMAP distance metric | + +## API Endpoints + +When running in SERVER mode, the following REST endpoints are available +under `/api`: + +### CSV + +- `POST /api/csv/upload` — upload a CSV file +- `POST /api/csv/preview` — preview columns and sample rows + +### Indexing + +- `POST /api/index/start` — start an indexing job +- `GET /api/index/status/{job_id}` — poll job progress +- `POST /api/index/cancel/{job_id}` — cancel a running job +- `WS /api/index/ws/{job_id}` — real-time progress via WebSocket + +### Collections + +- `GET /api/collections` — list all collections +- `GET /api/collections/{name}` — collection detail with metadata fields +- `DELETE /api/collections/{name}` — delete a collection + +### Plot + +- `POST /api/plot/compute` — start plot computation +- `GET /api/plot/data/{job_id}` — fetch computed plot data +- `GET /api/plot/{job_id}/cluster/{index}` — paginated cluster items +- `POST /api/plot/{job_id}/cluster/{index}/sub-cluster` — sub-cluster +- `POST /api/plot/{job_id}/sub-cluster` — sub-cluster by point IDs +- `POST /api/plot/suggest-clusters` — auto-suggest optimal cluster count +- `POST /api/plot/{job_id}/suggest-k` — suggest k for sub-clustering + +### Search + +- `POST /api/search` — semantic search by text or image URL + +### AI Naming + +- `POST /api/ai/name-clusters` — generate cluster names via LLM +- `POST /api/ai/name-sub-clusters` — generate sub-cluster names +- `POST /api/ai/test-connection` — validate LLM credentials +- `POST /api/ai/ollama/models` — list available Ollama models + +### Annotations + +- `GET /api/annotations/{job_id}` — fetch annotations for a job +- `PUT /api/annotations/{job_id}/cluster/{index}` — update annotation +- `DELETE /api/annotations/{job_id}` — delete all annotations for a job + +### Health + +- `GET /api/health` — returns `{"status": 
"ok"}`