64 changes: 45 additions & 19 deletions AGENTS.md
@@ -64,8 +64,8 @@ uv run pytest
# Run with coverage report
uv run pytest --cov=embedding_cluster --cov-report=term-missing

# Run with coverage enforcement (70% minimum)
uv run pytest --cov=embedding_cluster --cov-report=term-missing --cov-fail-under=70
# Run with coverage enforcement (90% minimum, matches CI)
uv run pytest --cov=embedding_cluster --cov-report=term-missing --cov-fail-under=90

# Run a single test file
uv run pytest tests/test_settings.py -v
@@ -110,7 +110,7 @@ E2E tests require pre-indexed ChromaDB data. The `webServer` config in
GitHub Actions workflow in `.github/workflows/ci.yml` runs on push/PR:
- **lint** job: `ruff check` + `ruff format --check`
- **typecheck** job: `mypy embedding_cluster/`
- **test** job: `pytest --cov --cov-fail-under=70`
- **test** job: `pytest --cov` (90% minimum enforced by coverage report)

All jobs use `uv sync --all-extras` for dependency installation.

@@ -172,16 +172,35 @@ embedding_cluster/
settings.py # Pydantic Settings (env var config)
utils.py # Shared utilities (logging, ChromaDB helpers, image downloader)
indexer.py # INDEX mode: CSV parsing, embedding generation, ChromaDB storage
scatter_plot.py # PLOT mode: Clustering, t-SNE, Dash visualization
scatter_plot.py # PLOT mode: Clustering, dimensionality reduction, visualization data
ai_naming.py # LLM-powered cluster naming via LiteLLM
annotations.py # Cluster annotation persistence (JSON sidecar files)
csv/ # Sample data files
server/
app.py # FastAPI app factory, SPA serving
models.py # Pydantic request/response models
tasks.py # Background task registry
ws.py # WebSocket manager for live progress
routes/
ai.py # AI cluster naming endpoints
annotations.py # Cluster annotation CRUD
collections.py # ChromaDB collection management
csv.py # CSV upload and preview
index.py # Indexing jobs with WebSocket progress
plot.py # Plot computation, cluster detail, sub-clustering
search.py # Semantic search (text and image)
frontend/
src/
App.tsx # Router, QueryClient, Zustand provider
api/ # Typed API client layer
components/ # UI components organized by page
hooks/ # useIndexWebSocket, usePlotData
pages/ # HomePage, IndexPage, PlotPage, SettingsPage
stores/ # Zustand plotStore (plot state management)
types/ # TypeScript interfaces mirroring backend models
tests/
__init__.py
conftest.py # Shared fixtures
test_settings.py # Settings env var parsing tests
test_utils.py # Utilities, Singleton, ImageDownloader tests
test_indexer.py # Indexer pipeline tests (mocked ML models)
test_scatter_plot.py # Scatter plot tests (mocked data)
test_main.py # Entry point dispatch tests
test_*.py # Unit tests for each backend module and route
```

### Key Dependencies
@@ -191,17 +210,18 @@ Runtime:
- `chromadb` - Vector database for embedding storage
- `transformers` / `sentence-transformers` - Text and image embedding models
- `torch` - ML framework backend
- `dash` / `plotly` - Interactive 3D visualization
- `scikit-learn` - KMeans clustering and t-SNE
- `fastapi` / `uvicorn` - Web server and REST API
- `scikit-learn` - KMeans clustering and dimensionality reduction
- `aiohttp` - Async HTTP for image downloads
- `openai` - Optional GPT-based cluster naming
- `litellm` - Multi-provider LLM integration for cluster naming
- `numpy` / `Pillow` - Numerical and image processing

Dev:
- `pytest` / `pytest-asyncio` / `pytest-cov` - Testing framework
- `mypy` - Static type checking
- `ruff` - Linting and formatting
- `pre-commit` - Git hook management
- `httpx` - Test client for FastAPI routes

## Git & Commit Conventions

@@ -231,9 +251,15 @@ Extensive pre-commit setup. Key hooks:

## Data Flow

1. **INDEX mode**: CSV -> parse rows -> generate embeddings (CLIP for images,
SentenceTransformer for text) -> store in ChromaDB collections
2. **PLOT mode**: ChromaDB collection -> StandardScaler -> KMeans clustering ->
t-SNE 3D projection -> Dash/Plotly interactive scatter plot

ChromaDB data is persisted to `./chromadb/` directory (gitignored).
1. **INDEX mode**: CSV → parse rows → generate embeddings (CLIP for images,
SentenceTransformer for text) → store in ChromaDB collections
2. **PLOT mode**: ChromaDB collection → StandardScaler → KMeans clustering →
dimensionality reduction (t-SNE/UMAP/PCA) → 3D point data via REST API
3. **SERVER mode**: FastAPI serves REST API + built React SPA. Long-running
jobs (indexing, plot computation) use a task registry with WebSocket
progress streaming.

Persistent data:
- `./chromadb/` — Vector database (gitignored)
- `./uploads/` — Uploaded CSV files (gitignored)
- `./annotations/` — Cluster annotations as JSON sidecar files (gitignored)
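
The JSON sidecar idea for annotations can be pictured with a minimal sketch. The file naming (`<collection>.json`), the helper names, and the flat mapping from cluster id to note are all assumptions made for illustration; the project's own `annotations.py` may store things differently.

```python
import json
from pathlib import Path


def annotation_path(base_dir: Path, collection: str) -> Path:
    # Hypothetical naming scheme: one JSON file per collection.
    return base_dir / f"{collection}.json"


def save_annotations(base_dir: Path, collection: str, notes: dict[str, str]) -> Path:
    """Write the notes for one collection and return the sidecar file path."""
    base_dir.mkdir(parents=True, exist_ok=True)
    path = annotation_path(base_dir, collection)
    path.write_text(json.dumps(notes, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_annotations(base_dir: Path, collection: str) -> dict[str, str]:
    """Read the notes for one collection; a missing file means no notes yet."""
    path = annotation_path(base_dir, collection)
    if not path.exists():
        return {}
    return dict(json.loads(path.read_text(encoding="utf-8")))
```

Keeping annotations in plain files next to the vector store means they can be inspected or deleted by hand without touching ChromaDB.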
99 changes: 99 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,99 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Python + React application for generating, indexing, and visualizing embedding clusters from CSV data. Uses CLIP/SentenceTransformer for embeddings, ChromaDB for vector storage, k-means for clustering, and a React/Three.js frontend for 3D visualization.

- **Python 3.13**, managed with [uv](https://docs.astral.sh/uv/)
- **Package name**: `embedding_cluster` (underscore, not hyphen)
- **Entry point**: `python -m embedding_cluster` dispatches to INDEX, PLOT, or SERVER mode via `RUNNING_MODE` env var

## Commands

### Backend

```bash
uv sync --all-extras # Install all dependencies
RUNNING_MODE=SERVER uv run python -m embedding_cluster # Start server on :8000
uv run ruff check embedding_cluster/ tests/ # Lint
uv run ruff check --fix embedding_cluster/ tests/ # Lint with auto-fix
uv run ruff format embedding_cluster/ tests/ # Format
uv run mypy embedding_cluster/ # Type check (strict mode)
uv run pytest # Run all tests
uv run pytest tests/test_settings.py -v # Run single test file
uv run pytest tests/test_settings.py::test_fn -v # Run single test function
uv run pytest --cov=embedding_cluster --cov-report=term-missing --cov-fail-under=90 # Coverage (90% CI min)
uv run pre-commit run --all-files # All pre-commit hooks
```

### Frontend

```bash
cd frontend && npm install # Install deps
cd frontend && npm run dev # Dev server on :5173
cd frontend && npm run build # Production build (output: frontend/dist)
cd frontend && npm run lint # ESLint
cd frontend && npm run test:e2e # Playwright E2E tests
cd frontend && npx playwright test e2e/search.spec.ts # Single E2E test
```

E2E tests require pre-indexed ChromaDB data and a built frontend. The Playwright config auto-starts the FastAPI backend.

## Architecture

### Three Running Modes

All controlled by `RUNNING_MODE` env var, dispatched in `__main__.py`:
- **INDEX**: `indexer.py` — CSV parsing → embedding generation → ChromaDB storage
- **PLOT**: `scatter_plot.py` — ChromaDB → StandardScaler → k-means → dimensionality reduction (t-SNE/UMAP/PCA)
- **SERVER**: `server/app.py` — FastAPI backend serving REST API + built React SPA from `frontend/dist`
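
The mode dispatch described above might look roughly like the following sketch. The handler functions are stand-ins that only return a label, and the case-insensitive lookup and the `ValueError` on unknown modes are assumptions; the real `__main__.py` is not shown in this change.

```python
import os
from collections.abc import Callable, Mapping


def run_index() -> str:
    # Stand-in for the INDEX pipeline (CSV -> embeddings -> ChromaDB).
    return "index"


def run_plot() -> str:
    # Stand-in for the PLOT pipeline (scaling, clustering, reduction).
    return "plot"


def run_server() -> str:
    # Stand-in for starting the FastAPI server.
    return "server"


HANDLERS: dict[str, Callable[[], str]] = {
    "INDEX": run_index,
    "PLOT": run_plot,
    "SERVER": run_server,
}


def dispatch(env: Mapping[str, str]) -> str:
    """Pick a handler from RUNNING_MODE; reject unknown or missing values."""
    mode = env.get("RUNNING_MODE", "").upper()  # assumption: case-insensitive
    handler = HANDLERS.get(mode)
    if handler is None:
        raise ValueError(f"Unsupported RUNNING_MODE: {mode!r}")
    return handler()


if __name__ == "__main__":
    print(dispatch(os.environ))
```

Passing the environment in as a mapping keeps the dispatcher easy to test without mutating `os.environ`.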

### Backend Structure

- `settings.py` — All config via env vars using `pydantic-settings` `BaseSettings`
- `server/app.py` — FastAPI app factory, mounts route modules and serves SPA
- `server/routes/` — API routes split by domain: `ai.py`, `annotations.py`, `collections.py`, `csv.py`, `index.py`, `plot.py`, `search.py`
- `server/tasks.py` — Background task management for long-running operations
- `server/ws.py` — WebSocket support for live progress
- `ai_naming.py` — LLM-powered cluster naming via LiteLLM (supports OpenAI, Ollama)
- `annotations.py` — Cluster annotation persistence (JSON sidecar files in `annotations/`)
- `utils.py` — ChromaDB helpers, image downloader with retry, singleton pattern
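
To make the background-task idea concrete, here is a small in-memory registry built on `asyncio`. It is only a sketch under stated assumptions: `TaskRegistry`, `TaskState`, and the progress callback are hypothetical names, and the actual `server/tasks.py` and `server/ws.py` may be organized quite differently.

```python
import asyncio
import uuid
from collections.abc import Awaitable, Callable
from dataclasses import dataclass

ProgressFn = Callable[[float], None]
Job = Callable[[ProgressFn], Awaitable[None]]


@dataclass
class TaskState:
    status: str = "pending"  # pending -> running -> done | failed
    progress: float = 0.0    # fraction in [0, 1]
    error: str = ""


class TaskRegistry:
    """Hypothetical registry; a WebSocket handler could push TaskState to clients."""

    def __init__(self) -> None:
        self._states: dict[str, TaskState] = {}
        self._handles: dict[str, asyncio.Task[None]] = {}

    def submit(self, job: Job) -> str:
        """Schedule `job` on the running event loop and return its task id."""
        task_id = uuid.uuid4().hex
        state = TaskState()
        self._states[task_id] = state

        def report(fraction: float) -> None:
            # Clamp so a misbehaving job cannot report an out-of-range value.
            state.progress = max(0.0, min(1.0, fraction))

        async def runner() -> None:
            state.status = "running"
            try:
                await job(report)
            except Exception as exc:
                state.status, state.error = "failed", str(exc)
            else:
                state.status, state.progress = "done", 1.0

        self._handles[task_id] = asyncio.create_task(runner())
        return task_id

    def get(self, task_id: str) -> TaskState:
        return self._states[task_id]

    async def wait(self, task_id: str) -> TaskState:
        await self._handles[task_id]
        return self._states[task_id]
```

A route handler would call `submit(...)` and return the task id immediately, while a progress endpoint reads `get(task_id)` for as long as the job runs.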

### Frontend Structure

React 19 + TypeScript + Vite + Tailwind CSS 4:
- `pages/` — `HomePage`, `IndexPage`, `PlotPage`, `SettingsPage`
- `components/` — Organized by page: `home/`, `index/`, `plot/`, `csv/`
- `stores/plotStore.ts` — Zustand store for plot state
- `api/` — API client layer
- `hooks/` — React Query hooks
- 3D visualization uses React Three Fiber (`@react-three/fiber` + `@react-three/drei`)

## Code Style

### Python
- **ruff**: line length 90, target py313
- **mypy strict mode** — all functions need type annotations
- Use `from __future__ import annotations` in every module
- Modern syntax: `str | None` (not `Optional`), `list[str]` (not `List`)
- Absolute imports only: `from embedding_cluster.settings import Settings`
- Heavy imports behind `TYPE_CHECKING` blocks where possible
- Logger per module: `logger = logging.getLogger(__name__)`
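
A small module that follows these conventions could look like the sketch below. The function `first_label` is hypothetical and exists only to show the style; `collections.abc` stands in for a genuinely heavy import that would normally sit behind `TYPE_CHECKING`.

```python
from __future__ import annotations

import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Needed only by the type checker, so nothing is imported at runtime.
    from collections.abc import Sequence

logger = logging.getLogger(__name__)


def first_label(labels: Sequence[str], default: str | None = None) -> str | None:
    """Return the first non-blank label (stripped), or `default` if none exists."""
    for label in labels:
        stripped = label.strip()
        if stripped:
            return stripped
    logger.debug("No non-blank label found; falling back to %r", default)
    return default
```

Because of the `from __future__ import annotations` line, the `Sequence` annotation is never evaluated at runtime, which is what makes the `TYPE_CHECKING`-only import safe.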

### Git Conventions
- **Conventional commits** enforced by commitizen: `type(scope): description`
- Types: `feat`, `fix`, `docs`, `test`, `refactor`
- **No direct commits to master** (enforced by pre-commit hook)
- Branch naming: `type/short-description` style (e.g., `feat/ollama-provider-integration`)

### Pre-commit Hooks
Extensive setup including: ruff, commitizen, yamllint, markdownlint, shellcheck, gitleaks (secret detection), hadolint, check-jsonschema, no-commit-to-branch. Install with:
```bash
uv run pre-commit install --install-hooks -t pre-commit -t commit-msg
```

## CI

GitHub Actions (`.github/workflows/ci.yml`): lint → typecheck → test (90% coverage minimum). All jobs use `uv sync --all-extras`.
18 changes: 18 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,18 @@
# Code of Conduct

This project follows the
[Contributor Covenant Code of Conduct v2.1](https://www.contributor-covenant.org/version/2/1/code_of_conduct/).

Please read the full text at the link above. In summary, we are committed
to providing a welcoming and inclusive experience for everyone.

## Reporting

If you experience or witness unacceptable behavior, please contact the
project maintainer at **asafgallea@gmail.com**. All reports will be
handled with discretion.

## Attribution

This Code of Conduct is adapted from the
[Contributor Covenant](https://www.contributor-covenant.org), version 2.1.
171 changes: 171 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,171 @@
# Contributing

Thanks for your interest in contributing to embedding-clusters! This guide
covers everything you need to get started.

## Prerequisites

- [Python 3.13+](https://www.python.org/downloads/)
- [uv](https://docs.astral.sh/uv/getting-started/installation/) package
manager
- [Node.js 18+](https://nodejs.org/) (for frontend development)

## Setup

```bash
git clone https://github.com/aGallea/embedding-clusters.git
cd embedding-clusters
uv sync --all-extras
uv run pre-commit install --install-hooks -t pre-commit -t commit-msg
```

For frontend work:

```bash
cd frontend
npm install
```

## Running Locally

Start the full application (backend + frontend):

```bash
RUNNING_MODE=SERVER uv run python -m embedding_cluster
```

For frontend development with hot reload:

```bash
# Terminal 1 — backend
RUNNING_MODE=SERVER uv run python -m embedding_cluster

# Terminal 2 — frontend dev server (proxies API to backend)
cd frontend && npm run dev
```

The Vite dev server runs on `http://localhost:5173` and proxies `/api` and
`/ws` requests to the backend on port 8000.

## Testing

### Backend (Python)

```bash
uv run pytest # Run all tests
uv run pytest tests/test_settings.py -v # Single file
uv run pytest tests/test_settings.py::test_fn # Single test
uv run pytest --cov=embedding_cluster \
--cov-report=term-missing --cov-fail-under=90 # With coverage
```

Tests use `pytest-asyncio` in auto mode. CI enforces a **90% minimum
coverage** threshold.

### Frontend (E2E)

```bash
cd frontend
npx playwright install chromium # First-time setup
npm run build # Build required before E2E
npm run test:e2e # Run tests
npm run test:e2e:ui # Run with interactive UI
```

E2E tests require pre-indexed data in ChromaDB. See the
[AGENTS.md](AGENTS.md) E2E section for setup instructions.

## Code Style

### Python

- **ruff** for linting and formatting (line length 90, target py313)
- **mypy** in strict mode — all functions require type annotations
- `from __future__ import annotations` in every module
- Modern type syntax: `str | None`, `list[str]`, `dict[str, Any]`
- Absolute imports only: `from embedding_cluster.settings import Settings`
- Heavy imports behind `TYPE_CHECKING` blocks where possible
- Logger per module: `logger = logging.getLogger(__name__)`

```bash
uv run ruff check embedding_cluster/ tests/ # Lint
uv run ruff check --fix embedding_cluster/ tests/ # Auto-fix
uv run ruff format embedding_cluster/ tests/ # Format
uv run mypy embedding_cluster/ # Type check
```

### Frontend (TypeScript)

- ESLint with TypeScript and React hooks plugins
- Tailwind CSS 4 for styling

```bash
cd frontend && npm run lint
```

## Pre-commit Hooks

The project uses extensive pre-commit hooks that run automatically on
commit. Key hooks include:

- **ruff** — linting (with auto-fix) and formatting
- **commitizen** — commit message validation
- **gitleaks** — secret detection
- **yamllint** / **markdownlint** — config file linting
- **no-commit-to-branch** — prevents direct commits to master

Run all hooks manually:

```bash
uv run pre-commit run --all-files
```

## Commit Messages

This project uses [Conventional Commits](https://www.conventionalcommits.org/)
enforced by [commitizen](https://commitizen-tools.github.io/commitizen/).

Format: `type(scope): description`

| Type | Use for |
|------|---------|
| `feat` | New features |
| `fix` | Bug fixes |
| `docs` | Documentation changes |
| `test` | Adding or updating tests |
| `refactor` | Code changes that neither fix bugs nor add features |

Examples:

```text
feat(search): add image URL search support
fix(indexer): handle empty CSV rows gracefully
docs(readme): update quick start instructions
test(server): add collection deletion tests
```
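
For a quick local check of a commit header before committing, the format can be approximated with a regular expression. This is an illustrative sketch, not commitizen's actual rule set: the pattern, the lowercase-only scope, and the helper names are assumptions, and only the five types listed in the table are accepted.

```python
import re

ALLOWED_TYPES = ("feat", "fix", "docs", "test", "refactor")

# type, optional (scope), then ": " and a non-empty description
COMMIT_RE = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9_-]+)\))?"
    r": (?P<description>\S.*)$"
)


def is_valid_header(header: str) -> bool:
    """Return True if the first line of a commit message matches the pattern."""
    return COMMIT_RE.match(header) is not None


def parse_header(header: str) -> tuple[str, str, str]:
    """Split a header into (type, scope, description); scope may be empty."""
    match = COMMIT_RE.match(header)
    if match is None:
        raise ValueError(f"Not a conventional commit header: {header!r}")
    return match["type"], match["scope"] or "", match["description"]
```

The commitizen hook remains the source of truth; a helper like this is only useful for fast feedback in scripts or editors.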

## Pull Request Process

1. Create a branch from `master` (e.g. `feat/my-feature`)
2. Make your changes and ensure all checks pass:
```bash
uv run ruff check embedding_cluster/ tests/
uv run ruff format --check embedding_cluster/ tests/
uv run mypy embedding_cluster/
uv run pytest --cov=embedding_cluster --cov-fail-under=90
```
3. Push and open a pull request against `master`
4. CI will run lint, typecheck, and test jobs automatically
5. All conversations must be resolved before merging
6. At least one approving review is required

## Project Structure

See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full system
design and component breakdown.

## Good First Issues

Look for issues labeled
[`good first issue`](https://github.com/aGallea/embedding-clusters/labels/good%20first%20issue)
for beginner-friendly tasks.