logankm02/cs289a-grad-project

Gmail Thread Indexing & Clustering

Tools for downloading Gmail threads, embedding their contents into a local ChromaDB collection, and optionally clustering the threads and applying the generated labels back in Gmail. The repo provides two entry points:

  • index.py: pulls Gmail threads, chunks message bodies, creates embeddings, and writes them to ChromaDB.
  • cluster.py: loads the stored embeddings, runs HDBSCAN clustering, and can optionally apply generated labels to Gmail conversations.

Prerequisites

  • Python 3.10 or newer (3.11 recommended). The system Python on macOS is 3.9, so install a newer interpreter with brew install python@3.11 or a tool such as pyenv.
  • A Google Cloud project with the Gmail API enabled, and OAuth client credentials (desktop app OAuth client) downloaded as desktop_credentials.json in the repo root. You can override the path via GMAIL_CLIENT_SECRET; if the default file is missing, the tools fall back to credentials.json or web_credentials.json when present.
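
The credential lookup described above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the function name resolve_client_secret and its parameters are hypothetical, and only the environment variable and the three file names come from this README.

```python
import os
from pathlib import Path

# File names and their order are taken from this README.
CANDIDATES = ("desktop_credentials.json", "credentials.json", "web_credentials.json")


def resolve_client_secret(repo_root=".", env=None):
    """Return the OAuth client secret path, or None if nothing is found.

    Hypothetical helper: the real scripts may resolve the path differently.
    """
    env = os.environ if env is None else env
    override = env.get("GMAIL_CLIENT_SECRET")
    if override:
        # An explicit override wins over any file in the repo root.
        return Path(override)
    for name in CANDIDATES:
        candidate = Path(repo_root) / name
        if candidate.is_file():
            return candidate
    return None
```

Passing env explicitly keeps the helper easy to test without touching the real process environment.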

Quick Start

# 1) Create and activate the virtual environment
# Replace python3.11 with whichever 3.10+ interpreter you installed
python3.11 -m venv .venv
source .venv/bin/activate

# 2) Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

The first run of either script opens a browser window for Google OAuth and writes the resulting token to token.json.

Runtime Configuration

The scripts infer behavior from these environment variables:

| Variable | Purpose |
| --- | --- |
| OPENAI_API_KEY | Enables OpenAI-hosted embeddings (default model text-embedding-3-small). |
| OPENAI_EMBEDDING_MODEL_NAME | Overrides the OpenAI embedding model. |
| LOCAL_EMBEDDING_MODEL_NAME | SentenceTransformer model to load when no OpenAI key is present (defaults to intfloat/multilingual-e5-large-instruct). |
| GEMINI_API_KEY | Enables Gemini-powered summaries during clustering. |
| GEMINI_GENERATION_MODEL | Gemini model name override (gemini-2.5-flash by default, with automatic fallbacks). |
| GMAIL_TOKEN_PATH | Custom location for the OAuth token file (token.json by default). |
| GMAIL_CLIENT_SECRET | Custom path to your OAuth client secret (defaults to the first found among desktop_credentials.json, credentials.json, or web_credentials.json). |

Environment variables are automatically loaded from .env in the project root (or the file pointed to by CS289A_ENV_FILE) when python-dotenv is installed; otherwise export them manually before running the scripts.
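
As a rough sketch of that behavior, the optional python-dotenv import might be handled like this. The function names are hypothetical and the real scripts may differ; only CS289A_ENV_FILE, the .env default, and the python-dotenv dependency come from this README.

```python
import os
from pathlib import Path


def resolve_env_file(project_root=".", env=None):
    """Pick the env file: CS289A_ENV_FILE if set, else <project_root>/.env."""
    env = os.environ if env is None else env
    return Path(env.get("CS289A_ENV_FILE") or Path(project_root) / ".env")


def load_env(project_root="."):
    """Load variables from the env file when python-dotenv is available.

    Returns False when python-dotenv is not installed, in which case the
    variables have to be exported manually before running the scripts.
    """
    try:
        from dotenv import load_dotenv  # optional dependency
    except ImportError:
        return False
    return load_dotenv(resolve_env_file(project_root))
```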

When labeling clusters, the tool automatically tries gemini-2.5-flash, gemini-1.5-flash-latest, gemini-1.5-flash, gemini-1.5-flash-001, then gemini-1.0-pro, using the first model your account supports. Set GEMINI_GENERATION_MODEL to override the ordering.
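
One way to picture the fallback chain is the sketch below. It assumes that an override is tried first and that the default models remain as fallbacks; the README does not spell out the exact override semantics, and both function names and the is_supported callback are hypothetical.

```python
import os

# Default order as listed in this README.
DEFAULT_MODELS = [
    "gemini-2.5-flash",
    "gemini-1.5-flash-latest",
    "gemini-1.5-flash",
    "gemini-1.5-flash-001",
    "gemini-1.0-pro",
]


def candidate_models(env=None):
    """Return models to try, in order (assumption: the override goes first)."""
    env = os.environ if env is None else env
    override = env.get("GEMINI_GENERATION_MODEL")
    ordered = ([override] if override else []) + DEFAULT_MODELS
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(ordered))


def first_supported(candidates, is_supported):
    """Return the first model accepted by the is_supported callback, or None."""
    for model in candidates:
        if is_supported(model):
            return model
    return None
```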

If no OPENAI_API_KEY is supplied, the scripts fall back to a local SentenceTransformer model, which requires PyTorch and downloads model weights on first use.
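
The backend selection can be summarized with a small sketch. The function name and return shape are hypothetical; the variable names and default models are the ones documented in this README.

```python
import os


def select_embedding_backend(env=None):
    """Return (backend, model_name) based on the documented variables."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        model = env.get("OPENAI_EMBEDDING_MODEL_NAME") or "text-embedding-3-small"
        return "openai", model
    # No OpenAI key: use a local SentenceTransformer model instead.
    model = env.get("LOCAL_EMBEDDING_MODEL_NAME") or "intfloat/multilingual-e5-large-instruct"
    return "local", model
```

Embeddings from different models are generally not comparable, so index and cluster with the same backend settings.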

Index Gmail Threads

python index.py --user-email you@example.com \
                --max-threads 100 \
                --page-size 100 \
                --embed-batch 64 \
                [--reset] \
                [--log-level DEBUG]

  • --user-email (required) selects the Gmail account to index.
  • --max-threads caps the number of threads to pull (default 100).
  • --reset clears existing ChromaDB entries for the user before re-indexing.
  • --log-level overrides the default INFO logging verbosity (pass DEBUG for fine-grained tracing).
  • --page-size and --embed-batch tune the Gmail pagination and embedding batch sizes.

The script writes chunk metadata and embeddings into a user-specific namespace inside ChromaDB.
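
To illustrate what an embedding batch size such as --embed-batch 64 usually means in practice, here is a generic batching helper. It is a sketch under the assumption that chunks are embedded in fixed-size groups; the helper name is hypothetical and index.py may batch differently.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Example: 5 chunks with a batch size of 2 give batches of 2, 2, and 1.
for batch in batched(["c1", "c2", "c3", "c4", "c5"], 2):
    print(len(batch), batch)
```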

Cluster Stored Threads

python cluster.py --user-email you@example.com \
                  --min-cluster-size 5 \
                  --max-threads 1000 \
                  [--output-file cluster_summary.txt] \
                  [--apply-labels] \
                  [--log-level DEBUG] \
                  [--enable-sub-labels]

  • --apply-labels pushes generated labels back to Gmail; omit it for a dry run.
  • --output-file writes a detailed cluster summary (including thread subjects) to the given path.
  • Provide --collection instead of --user-email to cluster a specific Chroma collection manually.
  • --log-level switches the logging verbosity (DEBUG, INFO, WARNING, etc.).
  • Live progress logs show thread assignment, cluster labeling, and Gmail label application status while the command runs.
  • --enable-sub-labels turns on hierarchical sub-labeling within each cluster; Gemini can optionally label the resulting subclusters.

The script prints a summary of discovered clusters, their representative subjects, and any outlier threads.
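
HDBSCAN assigns the label -1 to points it treats as noise, which is how outlier threads can be separated from clustered ones. The sketch below shows that grouping step on plain lists; the function name and the input shape are hypothetical and not taken from cluster.py.

```python
from collections import defaultdict


def group_by_cluster(labels, subjects):
    """Split thread subjects into clusters and outliers.

    labels: one cluster label per thread, with -1 meaning noise (HDBSCAN convention).
    subjects: thread subjects in the same order as labels.
    """
    clusters = defaultdict(list)
    outliers = []
    for label, subject in zip(labels, subjects):
        if label == -1:
            outliers.append(subject)
        else:
            clusters[label].append(subject)
    return dict(clusters), outliers


clusters, outliers = group_by_cluster(
    [0, -1, 0, 1],
    ["Invoice March", "Hello?", "Invoice April", "Team offsite"],
)
print(clusters)
print(outliers)
```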

Run the API + UI

The repo includes a lightweight FastAPI backend (api.py) and a React/Vite frontend under frontend/ to drive indexing, clustering, and visualization from the browser.

Backend (from repo root):

# activate venv first (see Quick Start above)
uvicorn api:app --port 8000

  • Run the backend with the same environment you used for indexing (for example, the same OPENAI_API_KEY, or none) so that it opens the correct Chroma collection.
  • By default, data is stored in chroma_email_index/ under the repo root unless CHROMA_PERSIST_DIRECTORY is set.
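
The storage location rule can be expressed as a short sketch. The helper name is hypothetical; only the CHROMA_PERSIST_DIRECTORY variable and the chroma_email_index default come from this README.

```python
import os
from pathlib import Path


def chroma_persist_dir(repo_root=".", env=None):
    """Return the Chroma storage directory: the override if set, else the default."""
    env = os.environ if env is None else env
    return Path(env.get("CHROMA_PERSIST_DIRECTORY") or Path(repo_root) / "chroma_email_index")
```

With chromadb installed, a path like this could be passed to chromadb.PersistentClient(path=str(chroma_persist_dir())) to open the same on-disk store from a notebook or script.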

Frontend (from repo root):

cd frontend
npm install
npm run dev

  • The Vite dev server proxies API calls to http://localhost:8000/api by default. If you host the backend elsewhere, set VITE_API_BASE to your backend origin.

In the UI:

  1. Enter the same email you indexed. If it is new, click "Run indexing (async)" first.
  2. After indexing finishes, run clustering (sync or async). The result panel shows status/logs; clusters render below with subjects/senders (and subclusters if enabled).
  3. Click "Load embeddings & show PCA" to see a scatter plot for the current user's embeddings.
  4. "Load last cluster result" fetches the last completed clustering run on the server.

UI Sample

Iteration 2

[Screenshot: Clustering Demo 2nd Iteration UI]

Iteration 1

[Screenshot: Clustering Demo 1st Iteration UI]

Virtual Environment Notes

  • The repository now includes a managed virtual environment at .venv. Activate it with source .venv/bin/activate (or .\.venv\Scripts\activate on Windows).
  • To refresh dependencies, run pip install -r requirements.txt after activation.

Troubleshooting

  • OAuth consent: If authentication fails, delete token.json and re-run to trigger a fresh OAuth flow.
  • Transient Gmail API errors: The scripts automatically retry common 5xx/429 responses. If a thread still fails after retries, rerun the command later; Gmail occasionally returns temporary backend errors.
  • Chroma count errors: Some chromadb builds do not accept a where= argument in collection.count(). The tooling detects this and falls back to fetching IDs, but if you see repeated warnings, consider upgrading chromadb (pip install --upgrade chromadb).
  • Chroma import errors: If importing chromadb fails, check the detailed exception in the log; the scripts abort with guidance when the package is missing.
  • Embedding backends: Either set OPENAI_API_KEY or make sure your machine has enough CPU/GPU resources to load the default SentenceTransformer model.
  • Python version: The scripts abort on interpreters older than 3.10. If you see importlib.metadata errors, recreate .venv with a newer Python (python3.11 -m venv .venv).
  • macOS LibreSSL warning: System Python on macOS links against LibreSSL, triggering an urllib3 warning. Installing Python via Homebrew / pyenv (OpenSSL-based) removes the warning.
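
Two of the behaviors in the list above, retrying transient errors and falling back when count() rejects a filter, can be sketched as follows. This is an illustration under assumptions, not the repo's implementation: ApiError, call_with_retries, count_for_user, and the retry parameters are all hypothetical, and the real Gmail client raises its own error type.

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying 429/5xx errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiError as exc:
            last_attempt = attempt == max_attempts - 1
            if exc.status not in RETRYABLE_STATUS or last_attempt:
                raise
            # Wait 1s, 2s, 4s, ... plus a little random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))


def count_for_user(collection, where):
    """Count matching entries, tolerating a count() that takes no filter."""
    try:
        return collection.count(where=where)
    except TypeError:
        # count() does not accept where=: fetch the matching IDs and count them.
        return len(collection.get(where=where, include=[])["ids"])
```

The sleep parameter is injectable so the retry logic can be exercised in tests without real delays.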
