Tools for downloading Gmail threads into a local ChromaDB collection, embedding their contents, and optionally clustering / labeling them back in Gmail. The repo provides two entry points:
- `index.py`: pulls Gmail threads, chunks message bodies, creates embeddings, and writes them to ChromaDB.
- `cluster.py`: loads the stored embeddings, runs HDBSCAN clustering, and can optionally apply generated labels to Gmail conversations.
- Python 3.10+ (3.11 recommended). macOS ships Python 3.9; install a newer interpreter via `brew install python@3.11` or a tool like `pyenv`.
- A Google Cloud project with the Gmail API enabled and OAuth client credentials downloaded as `desktop_credentials.json` in the repo root (desktop-app OAuth client). You can override the path via `GMAIL_CLIENT_SECRET`, and the tools will also fall back to `credentials.json` or `web_credentials.json` if present.
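The credential lookup order above can be sketched as a small helper. This is an illustrative function, not the repo's actual code; it assumes the documented precedence (`GMAIL_CLIENT_SECRET` first, then the three default filenames in the repo root):

```python
import os
from pathlib import Path
from typing import Optional

DEFAULT_SECRET_NAMES = (
    "desktop_credentials.json",
    "credentials.json",
    "web_credentials.json",
)

def find_client_secret(repo_root: str = ".") -> Optional[Path]:
    """Return the OAuth client secret path using the documented lookup order."""
    override = os.environ.get("GMAIL_CLIENT_SECRET")
    if override:
        return Path(override)  # explicit override wins, even if the file is missing
    for name in DEFAULT_SECRET_NAMES:
        candidate = Path(repo_root) / name
        if candidate.exists():
            return candidate
    return None  # caller should abort with a helpful message
```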
```bash
# 1) Create and activate the virtual environment
# Replace python3.11 with whichever 3.10+ interpreter you installed
python3.11 -m venv .venv
source .venv/bin/activate

# 2) Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

The first run of either script opens a browser window for Google OAuth and writes the resulting token to `token.json`.
The scripts infer behavior from these environment variables:
| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | Enables OpenAI-hosted embeddings (default model `text-embedding-3-small`). |
| `OPENAI_EMBEDDING_MODEL_NAME` | Overrides the OpenAI embedding model. |
| `LOCAL_EMBEDDING_MODEL_NAME` | SentenceTransformer model to load when no OpenAI key is present (defaults to `intfloat/multilingual-e5-large-instruct`). |
| `GEMINI_API_KEY` | Enables Gemini-powered summaries during clustering. |
| `GEMINI_GENERATION_MODEL` | Gemini model name override (`gemini-2.5-flash` by default, with automatic fallbacks). |
| `GMAIL_TOKEN_PATH` | Custom location for the OAuth token file (`token.json` by default). |
| `GMAIL_CLIENT_SECRET` | Custom path to your OAuth client secret (defaults to the first found among `desktop_credentials.json`, `credentials.json`, or `web_credentials.json`). |
Environment variables are automatically loaded from `.env` in the project root (or the file pointed to by `CS289A_ENV_FILE`) when `python-dotenv` is installed; otherwise, export them manually before running the scripts.
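With `python-dotenv` installed this happens automatically; purely for illustration, a minimal hand-rolled loader with the same precedence (already-exported shell variables win over `.env` entries) might look like the following. The function name and parsing rules are a sketch, not the repo's implementation:

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file; never override shell exports."""
    loaded = {}
    env_path = Path(os.environ.get("CS289A_ENV_FILE", path))
    if not env_path.exists():
        return loaded
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)  # shell exports take precedence
    return loaded
```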
When labeling clusters, the tool automatically tries `gemini-2.5-flash`, `gemini-1.5-flash-latest`, `gemini-1.5-flash`, `gemini-1.5-flash-001`, then `gemini-1.0-pro`, using the first model your account supports. Set `GEMINI_GENERATION_MODEL` to override the ordering.
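The fallback behavior amounts to trying candidates in order and keeping the first one the account supports. A rough sketch (the `is_supported` probe stands in for a real API availability check; names are illustrative):

```python
FALLBACK_MODELS = [
    "gemini-2.5-flash",
    "gemini-1.5-flash-latest",
    "gemini-1.5-flash",
    "gemini-1.5-flash-001",
    "gemini-1.0-pro",
]

def pick_generation_model(override=None, is_supported=None):
    """Return the first supported model; an override is tried before the defaults."""
    candidates = ([override] if override else []) + FALLBACK_MODELS
    for model in candidates:
        if is_supported is None or is_supported(model):
            return model
    raise RuntimeError("no supported Gemini model found")
```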
If no `OPENAI_API_KEY` is supplied, the scripts fall back to a local SentenceTransformer model, which requires PyTorch and downloads model weights on first use.
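The backend choice boils down to: OpenAI if a key is present, otherwise the local SentenceTransformer. A hypothetical selector mirroring the defaults documented above (not the repo's actual function):

```python
def choose_embedding_backend(env):
    """Pick (backend, model) from an environment-style mapping; sketch only."""
    if env.get("OPENAI_API_KEY"):
        return ("openai", env.get("OPENAI_EMBEDDING_MODEL_NAME",
                                  "text-embedding-3-small"))
    return ("local", env.get("LOCAL_EMBEDDING_MODEL_NAME",
                             "intfloat/multilingual-e5-large-instruct"))
```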
```bash
python index.py --user-email you@example.com \
  --max-threads 100 \
  --page-size 100 \
  --embed-batch 64 \
  [--reset] \
  [--log-level DEBUG]
```

- `--user-email` (required) selects the Gmail account to index.
- `--max-threads` controls the number of threads to pull (default 100).
- `--reset` clears existing ChromaDB entries for the user before re-indexing.
- `--log-level` overrides the default INFO logging verbosity (pass DEBUG for fine-grained tracing).
- Other flags tune pagination and embedding batch sizes.
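The chunking step in index.py is not specified in detail here; as a rough illustration only, a fixed-size character window with overlap (so text cut at a boundary still appears intact in one chunk) could look like:

```python
def chunk_text(body: str, max_chars: int = 1000, overlap: int = 100):
    """Split a message body into overlapping chunks (illustrative sizes)."""
    assert max_chars > overlap >= 0
    chunks = []
    start = 0
    while start < len(body):
        chunks.append(body[start:start + max_chars])
        start += max_chars - overlap  # step forward, keeping some context
    return chunks
```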
The script writes chunk metadata and embeddings into a user-specific namespace inside ChromaDB.
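One common way to build such a user-specific namespace is to derive a Chroma collection name from the email address. A hypothetical helper (Chroma restricts collection names to 3-63 characters of mostly alphanumerics, so the sketch slugifies and truncates; this is not necessarily how the repo does it):

```python
import re

def collection_name_for(user_email: str) -> str:
    """Derive a Chroma-safe collection name for one user (sketch only)."""
    slug = re.sub(r"[^a-z0-9._-]", "_", user_email.lower())
    return f"email_index_{slug}"[:63]
```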
```bash
python cluster.py --user-email you@example.com \
  --min-cluster-size 5 \
  --max-threads 1000 \
  [--output-file cluster_summary.txt] \
  [--apply-labels] \
  [--log-level DEBUG] \
  [--enable-sub-labels]
```

- `--apply-labels` pushes generated labels back to Gmail; omit it for a dry run.
- `--output-file` writes a detailed cluster summary (including thread subjects) to the given path.
- Provide `--collection` instead of `--user-email` to cluster a specific Chroma collection manually.
- `--log-level` switches the logging verbosity (DEBUG, INFO, WARNING, etc.).
- Live progress logs show thread assignment, cluster labeling, and Gmail label application status while the command runs.
- `--enable-sub-labels` turns on hierarchical sub-labeling within each cluster (optionally uses Gemini to label those subclusters).
The script prints a summary of discovered clusters, their representative subjects, and any outlier threads.
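HDBSCAN marks outliers with the label -1, so the summary essentially groups thread subjects by cluster label and sets outliers aside. A minimal sketch of that grouping (function name hypothetical):

```python
def summarize_clusters(labels, subjects):
    """Group HDBSCAN-style labels into {cluster: subjects} plus outliers."""
    clusters, outliers = {}, []
    for label, subject in zip(labels, subjects):
        if label == -1:
            outliers.append(subject)  # HDBSCAN's noise label
        else:
            clusters.setdefault(label, []).append(subject)
    return clusters, outliers
```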
The repo includes a lightweight FastAPI backend (api.py) and a React/Vite frontend under frontend/ to drive indexing, clustering, and visualization from the browser.
Backend (from repo root):

```bash
# activate venv first (see Quick Start above)
uvicorn api:app --port 8000
```

- Make sure your environment matches how you indexed (e.g., same `OPENAI_API_KEY` or none) so the backend opens the correct Chroma collection.
- By default, data is stored in `chroma_email_index/` under the repo root unless `CHROMA_PERSIST_DIRECTORY` is set.
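Resolving the storage location follows that rule: `CHROMA_PERSIST_DIRECTORY` wins when set, otherwise `chroma_email_index/` under the repo root. As a sketch (helper name illustrative):

```python
import os
from pathlib import Path

def chroma_persist_dir(repo_root: str = ".") -> Path:
    """Resolve where Chroma data lives, per the documented precedence."""
    override = os.environ.get("CHROMA_PERSIST_DIRECTORY")
    return Path(override) if override else Path(repo_root) / "chroma_email_index"
```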
Frontend (from repo root):

```bash
cd frontend
npm install
npm run dev
```

- The Vite dev server proxies API calls to `http://localhost:8000/api` by default. If you host the backend elsewhere, set `VITE_API_BASE` to your backend origin.
In the UI:
- Enter the same email you indexed. If it is new, click "Run indexing (async)" first.
- After indexing finishes, run clustering (sync or async). The result panel shows status/logs; clusters render below with subjects/senders (and subclusters if enabled).
- Click "Load embeddings & show PCA" to see a scatter plot for the current user's embeddings.
- "Load last cluster result" fetches the last completed clustering run on the server.
- The repository includes a managed virtual environment at `.venv`. Activate it with `source .venv/bin/activate` (or `.\.venv\Scripts\activate` on Windows).
- To refresh dependencies, run `pip install -r requirements.txt` after activation.
- OAuth consent: If authentication fails, delete `token.json` and re-run to trigger a fresh OAuth flow.
- Transient Gmail API errors: The scripts automatically retry common 5xx/429 responses. If a thread still fails after retries, rerun the command later; Gmail occasionally returns temporary backend errors.
- Chroma count errors: Some chromadb builds omit the `where=` argument on `collection.count()`. The tooling detects this and falls back to fetching IDs, but if you see repeated warnings, consider upgrading chromadb (`pip install --upgrade chromadb`).
- Chroma import errors: If importing `chromadb` fails, check the detailed exception in the log; the scripts abort with guidance when the package is missing.
- Embedding backends: Ensure either `OPENAI_API_KEY` is set or that your environment has GPU/CPU resources to load the default SentenceTransformer model.
- Python version: The scripts abort on interpreters older than 3.10. If you see `importlib.metadata` errors, recreate `.venv` with a newer Python (`python3.11 -m venv .venv`).
- macOS LibreSSL warning: System Python on macOS links against LibreSSL, triggering an urllib3 warning. Installing Python via Homebrew / pyenv (OpenSSL-based) removes the warning.

