aGallea · Apr 6, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -64,8 +64,8 @@ uv run pytest
 # Run with coverage report
 uv run pytest --cov=embedding_cluster --cov-report=term-missing
 
-# Run with coverage enforcement (70% minimum)
-uv run pytest --cov=embedding_cluster --cov-report=term-missing --cov-fail-under=70
+# Run with coverage enforcement (90% minimum, matches CI)
+uv run pytest --cov=embedding_cluster --cov-report=term-missing --cov-fail-under=90
 
 # Run a single test file
 uv run pytest tests/test_settings.py -v
@@ -110,7 +110,7 @@ E2E tests require pre-indexed ChromaDB data. The `webServer` config in
 GitHub Actions workflow in `.github/workflows/ci.yml` runs on push/PR:
 - **lint** job: `ruff check` + `ruff format --check`
 - **typecheck** job: `mypy embedding_cluster/`
-- **test** job: `pytest --cov --cov-fail-under=70`
+- **test** job: `pytest --cov` (90% minimum enforced by coverage report)
 
 All jobs use `uv sync --all-extras` for dependency installation.
 
@@ -172,16 +172,35 @@ embedding_cluster/
   settings.py          # Pydantic Settings (env var config)
   utils.py             # Shared utilities (logging, ChromaDB helpers, image downloader)
   indexer.py           # INDEX mode: CSV parsing, embedding generation, ChromaDB storage
-  scatter_plot.py      # PLOT mode: Clustering, t-SNE, Dash visualization
+  scatter_plot.py      # PLOT mode: Clustering, dimensionality reduction, visualization data
+  ai_naming.py         # LLM-powered cluster naming via LiteLLM
+  annotations.py       # Cluster annotation persistence (JSON sidecar files)
   csv/                 # Sample data files
+  server/
+    app.py             # FastAPI app factory, SPA serving
+    models.py          # Pydantic request/response models
+    tasks.py           # Background task registry
+    ws.py              # WebSocket manager for live progress
+    routes/
+      ai.py            # AI cluster naming endpoints
+      annotations.py   # Cluster annotation CRUD
+      collections.py   # ChromaDB collection management
+      csv.py           # CSV upload and preview
+      index.py         # Indexing jobs with WebSocket progress
+      plot.py          # Plot computation, cluster detail, sub-clustering
+      search.py        # Semantic search (text and image)
+frontend/
+  src/
+    App.tsx            # Router, QueryClient, Zustand provider
+    api/               # Typed API client layer
+    components/        # UI components organized by page
+    hooks/             # useIndexWebSocket, usePlotData
+    pages/             # HomePage, IndexPage, PlotPage, SettingsPage
+    stores/            # Zustand plotStore (plot state management)
+    types/             # TypeScript interfaces mirroring backend models
 tests/
-  __init__.py
   conftest.py          # Shared fixtures
-  test_settings.py     # Settings env var parsing tests
-  test_utils.py        # Utilities, Singleton, ImageDownloader tests
-  test_indexer.py      # Indexer pipeline tests (mocked ML models)
-  test_scatter_plot.py # Scatter plot tests (mocked data)
-  test_main.py         # Entry point dispatch tests
+  test_*.py            # Unit tests for each backend module and route
 ```
 
 ### Key Dependencies
@@ -191,17 +210,18 @@ Runtime:
 - `chromadb` - Vector database for embedding storage
 - `transformers` / `sentence-transformers` - Text and image embedding models
 - `torch` - ML framework backend
-- `dash` / `plotly` - Interactive 3D visualization
-- `scikit-learn` - KMeans clustering and t-SNE
+- `fastapi` / `uvicorn` - Web server and REST API
+- `scikit-learn` - KMeans clustering and dimensionality reduction
 - `aiohttp` - Async HTTP for image downloads
-- `openai` - Optional GPT-based cluster naming
+- `litellm` - Multi-provider LLM integration for cluster naming
 - `numpy` / `Pillow` - Numerical and image processing
 
 Dev:
 - `pytest` / `pytest-asyncio` / `pytest-cov` - Testing framework
 - `mypy` - Static type checking
 - `ruff` - Linting and formatting
 - `pre-commit` - Git hook management
+- `httpx` - Test client for FastAPI routes
 
 ## Git & Commit Conventions
 
@@ -231,9 +251,15 @@ Extensive pre-commit setup. Key hooks:
 
 ## Data Flow
 
-1. **INDEX mode**: CSV -> parse rows -> generate embeddings (CLIP for images,
-   SentenceTransformer for text) -> store in ChromaDB collections
-2. **PLOT mode**: ChromaDB collection -> StandardScaler -> KMeans clustering ->
-   t-SNE 3D projection -> Dash/Plotly interactive scatter plot
-
-ChromaDB data is persisted to `./chromadb/` directory (gitignored).
+1. **INDEX mode**: CSV → parse rows → generate embeddings (CLIP for images,
+   SentenceTransformer for text) → store in ChromaDB collections
+2. **PLOT mode**: ChromaDB collection → StandardScaler → KMeans clustering →
+   dimensionality reduction (t-SNE/UMAP/PCA) → 3D point data via REST API
+3. **SERVER mode**: FastAPI serves REST API + built React SPA. Long-running
+   jobs (indexing, plot computation) use a task registry with WebSocket
+   progress streaming.
+
+Persistent data:
+- `./chromadb/` — Vector database (gitignored)
+- `./uploads/` — Uploaded CSV files (gitignored)
+- `./annotations/` — Cluster annotations as JSON sidecar files (gitignored)
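
Note on the `./annotations/` entry above: a minimal sketch of what JSON sidecar persistence could look like. The one-file-per-collection layout and the names `sidecar_path`, `save_annotations`, and `load_annotations` are assumptions made for illustration; they are not taken from `annotations.py`.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def sidecar_path(base_dir: Path, collection: str) -> Path:
    """Hypothetical layout: one JSON file per ChromaDB collection."""
    return base_dir / f"{collection}.json"


def save_annotations(base_dir: Path, collection: str, notes: dict[str, str]) -> Path:
    """Write cluster-id -> note mappings to the collection's sidecar file."""
    base_dir.mkdir(parents=True, exist_ok=True)
    path = sidecar_path(base_dir, collection)
    path.write_text(json.dumps(notes, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_annotations(base_dir: Path, collection: str) -> dict[str, str]:
    """Read annotations back; a missing sidecar file means no annotations yet."""
    path = sidecar_path(base_dir, collection)
    if not path.exists():
        return {}
    data: dict[str, str] = json.loads(path.read_text(encoding="utf-8"))
    return data


if __name__ == "__main__":
    # Use a throwaway directory so the demo never touches ./annotations/.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp) / "annotations"
        save_annotations(base, "demo", {"0": "running shoes"})
        print(load_annotations(base, "demo"))
```

Cluster ids are stored as string keys because JSON object keys are always strings.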
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,99 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Python + React application for generating, indexing, and visualizing embedding clusters from CSV data. Uses CLIP/SentenceTransformer for embeddings, ChromaDB for vector storage, k-means for clustering, and a React/Three.js frontend for 3D visualization.
+
+- **Python 3.13**, managed with [uv](https://docs.astral.sh/uv/)
+- **Package name**: `embedding_cluster` (underscore, not hyphen)
+- **Entry point**: `python -m embedding_cluster` dispatches to INDEX, PLOT, or SERVER mode via `RUNNING_MODE` env var
+
+## Commands
+
+### Backend
+
+```bash
+uv sync --all-extras                                    # Install all dependencies
+RUNNING_MODE=SERVER uv run python -m embedding_cluster  # Start server on :8000
+uv run ruff check embedding_cluster/ tests/             # Lint
+uv run ruff check --fix embedding_cluster/ tests/       # Lint with auto-fix
+uv run ruff format embedding_cluster/ tests/            # Format
+uv run mypy embedding_cluster/                          # Type check (strict mode)
+uv run pytest                                           # Run all tests
+uv run pytest tests/test_settings.py -v                 # Run single test file
+uv run pytest tests/test_settings.py::test_fn -v        # Run single test function
+uv run pytest --cov=embedding_cluster --cov-report=term-missing --cov-fail-under=90  # Coverage (90% CI min)
+uv run pre-commit run --all-files                       # All pre-commit hooks
+```
+
+### Frontend
+
+```bash
+cd frontend && npm install                  # Install deps
+cd frontend && npm run dev                  # Dev server on :5173
+cd frontend && npm run build                # Production build (output: frontend/dist)
+cd frontend && npm run lint                 # ESLint
+cd frontend && npm run test:e2e             # Playwright E2E tests
+cd frontend && npx playwright test e2e/search.spec.ts  # Single E2E test
+```
+
+E2E tests require pre-indexed ChromaDB data and a built frontend. The Playwright config auto-starts the FastAPI backend.
+
+## Architecture
+
+### Three Running Modes
+
+All controlled by `RUNNING_MODE` env var, dispatched in `__main__.py`:
+- **INDEX**: `indexer.py` — CSV parsing → embedding generation → ChromaDB storage
+- **PLOT**: `scatter_plot.py` — ChromaDB → StandardScaler → k-means → dimensionality reduction (t-SNE/UMAP/PCA)
+- **SERVER**: `server/app.py` — FastAPI backend serving REST API + built React SPA from `frontend/dist`
+
+### Backend Structure
+
+- `settings.py` — All config via env vars using `pydantic-settings` `BaseSettings`
+- `server/app.py` — FastAPI app factory, mounts route modules and serves SPA
+- `server/routes/` — API routes split by domain: `ai.py`, `annotations.py`, `collections.py`, `csv.py`, `index.py`, `plot.py`, `search.py`
+- `server/tasks.py` — Background task management for long-running operations
+- `server/ws.py` — WebSocket support for live progress
+- `ai_naming.py` — LLM-powered cluster naming via LiteLLM (supports OpenAI, Ollama)
+- `annotations.py` — Cluster annotation persistence (JSON sidecar files in `annotations/`)
+- `utils.py` — ChromaDB helpers, image downloader with retry, singleton pattern
+
+### Frontend Structure
+
+React 19 + TypeScript + Vite + Tailwind CSS 4:
+- `pages/` — `HomePage`, `IndexPage`, `PlotPage`, `SettingsPage`
+- `components/` — Organized by page: `home/`, `index/`, `plot/`, `csv/`
+- `stores/plotStore.ts` — Zustand store for plot state
+- `api/` — API client layer
+- `hooks/` — React Query hooks
+- 3D visualization uses React Three Fiber (`@react-three/fiber` + `@react-three/drei`)
+
+## Code Style
+
+### Python
+- **ruff**: line length 90, target py313
+- **mypy strict mode** — all functions need type annotations
+- Use `from __future__ import annotations` in every module
+- Modern syntax: `str | None` (not `Optional`), `list[str]` (not `List`)
+- Absolute imports only: `from embedding_cluster.settings import Settings`
+- Heavy imports behind `TYPE_CHECKING` blocks where possible
+- Logger per module: `logger = logging.getLogger(__name__)`
+
+### Git Conventions
+- **Conventional commits** enforced by commitizen: `type(scope): description`
+- Types: `feat`, `fix`, `docs`, `test`, `refactor`
+- **No direct commits to master** (enforced by pre-commit hook)
+- Branch naming: `type/short-description` style (e.g., `feat/ollama-provider-integration`)
+
+### Pre-commit Hooks
+Extensive setup including: ruff, commitizen, yamllint, markdownlint, shellcheck, gitleaks (secret detection), hadolint, check-jsonschema, no-commit-to-branch. Install with:
+```bash
+uv run pre-commit install --install-hooks -t pre-commit -t commit-msg
+```
+
+## CI
+
+GitHub Actions (`.github/workflows/ci.yml`): lint → typecheck → test (90% coverage minimum). All jobs use `uv sync --all-extras`.
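
Note on the `RUNNING_MODE` dispatch described in CLAUDE.md: the sketch below shows one way an entry point could route to INDEX, PLOT, or SERVER. The handler functions are placeholders, and both the exact-match lookup and the `ValueError` on an unset or unknown mode are assumptions; the real `__main__.py` may behave differently.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Mapping


def run_index() -> str:
    return "index"  # placeholder for the indexing pipeline


def run_plot() -> str:
    return "plot"  # placeholder for plot computation


def run_server() -> str:
    return "server"  # placeholder for starting the web server


# Mode names come from the docs; the handlers are hypothetical.
HANDLERS: dict[str, Callable[[], str]] = {
    "INDEX": run_index,
    "PLOT": run_plot,
    "SERVER": run_server,
}


def dispatch(env: Mapping[str, str] = os.environ) -> str:
    """Look up RUNNING_MODE in the environment and call the matching handler."""
    mode = env.get("RUNNING_MODE", "")
    handler = HANDLERS.get(mode)
    if handler is None:
        valid = ", ".join(HANDLERS)
        raise ValueError(f"RUNNING_MODE must be one of {valid}; got {mode!r}")
    return handler()


if __name__ == "__main__":
    print(dispatch({"RUNNING_MODE": "SERVER"}))
```

Passing the environment as a mapping keeps the dispatcher easy to unit test without patching `os.environ`.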
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -0,0 +1,18 @@
+# Code of Conduct
+
+This project follows the
+[Contributor Covenant Code of Conduct v2.1](https://www.contributor-covenant.org/version/2/1/code_of_conduct/).
+
+Please read the full text at the link above. In summary, we are committed
+to providing a welcoming and inclusive experience for everyone.
+
+## Reporting
+
+If you experience or witness unacceptable behavior, please contact the
+project maintainer at **asafgallea@gmail.com**. All reports will be
+handled with discretion.
+
+## Attribution
+
+This Code of Conduct is adapted from the
+[Contributor Covenant](https://www.contributor-covenant.org), version 2.1.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,171 @@
+# Contributing
+
+Thanks for your interest in contributing to embedding-clusters! This guide
+covers everything you need to get started.
+
+## Prerequisites
+
+- [Python 3.13+](https://www.python.org/downloads/)
+- [uv](https://docs.astral.sh/uv/getting-started/installation/) package
+  manager
+- [Node.js 18+](https://nodejs.org/) (for frontend development)
+
+## Setup
+
+```bash
+git clone https://github.com/aGallea/embedding-clusters.git
+cd embedding-clusters
+uv sync --all-extras
+uv run pre-commit install --install-hooks -t pre-commit -t commit-msg
+```
+
+For frontend work:
+
+```bash
+cd frontend
+npm install
+```
+
+## Running Locally
+
+Start the full application (the backend also serves the built frontend from `frontend/dist`):
+
+```bash
+RUNNING_MODE=SERVER uv run python -m embedding_cluster
+```
+
+For frontend development with hot reload:
+
+```bash
+# Terminal 1 — backend
+RUNNING_MODE=SERVER uv run python -m embedding_cluster
+
+# Terminal 2 — frontend dev server (proxies API to backend)
+cd frontend && npm run dev
+```
+
+The Vite dev server runs on `http://localhost:5173` and proxies `/api` and
+`/ws` requests to the backend on port 8000.
+
+## Testing
+
+### Backend (Python)
+
+```bash
+uv run pytest                                  # Run all tests
+uv run pytest tests/test_settings.py -v        # Single file
+uv run pytest tests/test_settings.py::test_fn  # Single test
+uv run pytest --cov=embedding_cluster \
+  --cov-report=term-missing --cov-fail-under=90  # With coverage
+```
+
+Tests use `pytest-asyncio` in auto mode. CI enforces a **90% minimum
+coverage** threshold.
+
+### Frontend (E2E)
+
+```bash
+cd frontend
+npx playwright install chromium     # First-time setup
+npm run build                       # Build required before E2E
+npm run test:e2e                    # Run tests
+npm run test:e2e:ui                 # Run with interactive UI
+```
+
+E2E tests require pre-indexed data in ChromaDB. See the
+[AGENTS.md](AGENTS.md) E2E section for setup instructions.
+
+## Code Style
+
+### Python
+
+- **ruff** for linting and formatting (line length 90, target py313)
+- **mypy** in strict mode — all functions require type annotations
+- `from __future__ import annotations` in every module
+- Modern type syntax: `str | None`, `list[str]`, `dict[str, Any]`
+- Absolute imports only: `from embedding_cluster.settings import Settings`
+- Heavy imports behind `TYPE_CHECKING` blocks where possible
+- Logger per module: `logger = logging.getLogger(__name__)`
+
+```bash
+uv run ruff check embedding_cluster/ tests/       # Lint
+uv run ruff check --fix embedding_cluster/ tests/  # Auto-fix
+uv run ruff format embedding_cluster/ tests/       # Format
+uv run mypy embedding_cluster/                     # Type check
+```
+
+### Frontend (TypeScript)
+
+- ESLint with TypeScript and React hooks plugins
+- Tailwind CSS 4 for styling
+
+```bash
+cd frontend && npm run lint
+```
+
+## Pre-commit Hooks
+
+The project uses extensive pre-commit hooks that run automatically on
+commit. Key hooks include:
+
+- **ruff** — linting (with auto-fix) and formatting
+- **commitizen** — commit message validation
+- **gitleaks** — secret detection
+- **yamllint** / **markdownlint** — YAML and Markdown linting
+- **no-commit-to-branch** — prevents direct commits to master
+
+Run all hooks manually:
+
+```bash
+uv run pre-commit run --all-files
+```
+
+## Commit Messages
+
+This project uses [Conventional Commits](https://www.conventionalcommits.org/)
+enforced by [commitizen](https://commitizen-tools.github.io/commitizen/).
+
+Format: `type(scope): description`
+
+| Type | Use for |
+|------|---------|
+| `feat` | New features |
+| `fix` | Bug fixes |
+| `docs` | Documentation changes |
+| `test` | Adding or updating tests |
+| `refactor` | Code changes that neither fix bugs nor add features |
+
+Examples:
+
+```text
+feat(search): add image URL search support
+fix(indexer): handle empty CSV rows gracefully
+docs(readme): update quick start instructions
+test(server): add collection deletion tests
+```
+
+## Pull Request Process
+
+1. Create a branch from `master` (e.g. `feat/my-feature`)
+2. Make your changes and ensure all checks pass:
+   ```bash
+   uv run ruff check embedding_cluster/ tests/
+   uv run ruff format --check embedding_cluster/ tests/
+   uv run mypy embedding_cluster/
+   uv run pytest --cov=embedding_cluster --cov-fail-under=90
+   ```
+3. Push and open a pull request against `master`
+4. CI will run lint, typecheck, and test jobs automatically
+5. All conversations must be resolved before merging
+6. At least one approving review is required
+
+## Project Structure
+
+See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full system
+design and component breakdown.
+
+## Good First Issues
+
+Look for issues labeled
+[`good first issue`](https://github.com/aGallea/embedding-clusters/labels/good%20first%20issue)
+for beginner-friendly tasks.
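
Note on the `type(scope): description` format in the Commit Messages section of CONTRIBUTING.md: a small sketch of checking a commit header locally. It is not commitizen's implementation. The regular expression, the `parse_commit_header` helper, and the choice to treat the scope as optional are assumptions; only the five types come from the table in the guide.

```python
from __future__ import annotations

import re

# Types listed in the contributing guide's table.
ALLOWED_TYPES = ("feat", "fix", "docs", "test", "refactor")

# Assumed grammar: type, optional "(scope)", then ": " and a non-empty description.
COMMIT_RE = re.compile(
    rf"^(?P<type>{'|'.join(ALLOWED_TYPES)})"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r": (?P<description>\S.*)$"
)


def parse_commit_header(message: str) -> dict[str, str | None] | None:
    """Return the parsed parts of the first line, or None if it does not match."""
    lines = message.splitlines()
    header = lines[0] if lines else ""
    match = COMMIT_RE.match(header)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_commit_header("fix(indexer): handle empty CSV rows gracefully"))
```

Only the first line is checked, so a longer commit body does not affect the result.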