Cluster Optimization for Text Embeddings

End-to-end clustering pipeline for 45K text embeddings: data cleaning with ensemble outlier detection, optimal cluster count selection using 4 methods, and final KMeans clustering with t-SNE visualization.

Pipeline

Raw Data (45,895 texts with embeddings)
    │
    ▼
┌─────────────────────────────────┐
│  1. Data Cleaning               │
│  ├─ KNN Distance Outliers       │
│  ├─ Local Outlier Factor (LOF)  │
│  └─ Isolation Forest            │
│  → Ensemble vote (≥2 of 3)     │
└─────────────────────────────────┘
    │  43,304 clean samples (-5.6%)
    ▼
┌─────────────────────────────────┐
│  2. Optimal k Selection         │
│  ├─ Elbow Method                │
│  ├─ Silhouette Analysis         │
│  ├─ Calinski-Harabasz Index     │
│  └─ Davies-Bouldin Index        │
│  → Median of 4 methods          │
└─────────────────────────────────┘
    │  k = 57
    ▼
┌─────────────────────────────────┐
│  3. Final Clustering            │
│  ├─ KMeans (k=57, k-means++)   │
│  ├─ t-SNE 2D visualization     │
│  └─ Cluster analysis & export  │
└─────────────────────────────────┘
    │  57 clusters with centroids
    ▼
  Output CSV

Data Cleaning

Three outlier detection methods are applied independently, and a conservative ensemble voting approach removes points flagged by at least 2 out of 3 methods:

Method	Outliers Detected
KNN Distance (z-score > 1.0)	4,729
Local Outlier Factor	4,590
Isolation Forest	4,590
Combined (≥2 votes)	2,591

Result: 45,895 → 43,304 clean samples (5.6% removed)
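
The ensemble vote described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's code: the function name `ensemble_outlier_mask`, the neighbor count, and the `contamination` value are assumptions. A contamination of 0.1 would be consistent with the 4,590 points (about 10% of 45,895) reported for LOF and Isolation Forest, but the notebook's actual settings are not stated here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors


def ensemble_outlier_mask(X, n_neighbors=20, z_threshold=1.0,
                          contamination=0.1, min_votes=2, seed=0):
    """Return True for rows flagged as outliers by at least `min_votes` detectors.

    All default parameter values are assumptions chosen for illustration.
    """
    # Detector 1: mean distance to the k nearest neighbors, turned into a z-score.
    # Calling kneighbors() with no argument excludes each point from its own neighbors.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    dist, _ = nn.kneighbors()
    mean_dist = dist.mean(axis=1)
    z = (mean_dist - mean_dist.mean()) / mean_dist.std()
    knn_flag = z > z_threshold

    # Detector 2: Local Outlier Factor; fit_predict returns -1 for outliers.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    lof_flag = lof.fit_predict(X) == -1

    # Detector 3: Isolation Forest; fit_predict also returns -1 for outliers.
    iso = IsolationForest(contamination=contamination, random_state=seed)
    iso_flag = iso.fit_predict(X) == -1

    # A point is removed only if at least `min_votes` of the 3 detectors agree.
    votes = knn_flag.astype(int) + lof_flag.astype(int) + iso_flag.astype(int)
    return votes >= min_votes


# Synthetic demo: 200 inliers plus 5 points shifted far away from them.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(200, 8)), rng.normal(size=(5, 8)) + 12.0])
outlier_mask = ensemble_outlier_mask(X_demo, contamination=0.05)
X_clean = X_demo[~outlier_mask]
print(f"{len(X_demo)} -> {len(X_clean)} samples after cleaning")
```

Requiring two votes keeps a point that only one detector dislikes, which is why the combined count (2,591) is well below each individual count.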

Optimal k Selection

Four standard methods evaluate cluster counts from 2 to 80:

Method	Optimal k
Elbow Method	2
Silhouette Analysis	2
Calinski-Harabasz Index	57
Davies-Bouldin Index	57
Final (median)	57
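
As a rough sketch of how the four per-method candidates might be obtained, the code below fits KMeans over a range of k and scores each fit. It uses synthetic blobs rather than the real embeddings. The helper names `score_cluster_counts` and `pick_k` are hypothetical, and the elbow rule shown (largest second difference of inertia) is an assumption; the notebook may locate the elbow differently. How the four candidates are combined into the final k is left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)


def score_cluster_counts(X, k_values, seed=0):
    """Fit KMeans for each k and collect inertia plus three validity scores."""
    scores = {"inertia": [], "silhouette": [],
              "calinski_harabasz": [], "davies_bouldin": []}
    for k in k_values:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=seed).fit(X)
        scores["inertia"].append(km.inertia_)
        scores["silhouette"].append(silhouette_score(X, km.labels_))
        scores["calinski_harabasz"].append(calinski_harabasz_score(X, km.labels_))
        scores["davies_bouldin"].append(davies_bouldin_score(X, km.labels_))
    return {name: np.asarray(vals) for name, vals in scores.items()}


def pick_k(scores, k_values):
    """Return one candidate k per method."""
    k_values = np.asarray(k_values)
    # Assumed elbow heuristic: the interior k where inertia bends most sharply.
    curvature = np.diff(scores["inertia"], n=2)
    return {
        "elbow": int(k_values[1:-1][np.argmax(curvature)]),
        "silhouette": int(k_values[np.argmax(scores["silhouette"])]),          # higher is better
        "calinski_harabasz": int(k_values[np.argmax(scores["calinski_harabasz"])]),  # higher is better
        "davies_bouldin": int(k_values[np.argmin(scores["davies_bouldin"])]),  # lower is better
    }


# Synthetic demo: four well-separated 2D blobs, k scanned from 2 to 8.
X_blobs, _ = make_blobs(n_samples=400,
                        centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                        cluster_std=0.5, random_state=0)
k_range = list(range(2, 9))
candidates = pick_k(score_cluster_counts(X_blobs, k_range), k_range)
print(candidates)
```

On the real data the scan would cover k = 2 to 80. Silhouette scoring is quadratic in the number of samples, so passing `sample_size` to `silhouette_score` may be worth considering at that scale.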

Results

57 clusters from 43,304 text embeddings (3,072-dimensional)
Cluster sizes range from ~100 to ~1,600 items
The 2D t-SNE projection shows visually well-separated clusters
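
The final step (KMeans with k-means++ initialization, followed by a 2D t-SNE projection) could look roughly like this. It is a sketch on small synthetic data, and the function name `cluster_and_project`, the perplexity, and the seed are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE


def cluster_and_project(X, k, perplexity=30, seed=0):
    """Cluster X with KMeans and compute a 2D t-SNE embedding for plotting."""
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=seed).fit(X)
    # t-SNE is used only for visualization; the clustering itself runs
    # in the original embedding space.
    coords = TSNE(n_components=2, init="pca", perplexity=perplexity,
                  random_state=seed).fit_transform(X)
    return km.labels_, km.cluster_centers_, coords


# Synthetic demo: three groups of 50 points in 16 dimensions.
rng = np.random.default_rng(0)
X_small = np.vstack([rng.normal(loc=c, size=(50, 16)) for c in (0.0, 6.0, 12.0)])
labels, centroids, coords = cluster_and_project(X_small, k=3)

# Cluster sizes, analogous to the ~100 to ~1,600 range reported above.
sizes = np.bincount(labels, minlength=3)
print(sizes, centroids.shape, coords.shape)
```

The `coords` array can be passed to a matplotlib scatter plot colored by `labels`. Distances in a t-SNE plot are not preserved from the original space, so the plot is best read as a qualitative check.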

Tech Stack

Python — core language
scikit-learn — KMeans, t-SNE, LOF, Isolation Forest, silhouette/CH/DB metrics
pandas / NumPy — data manipulation
matplotlib / seaborn — visualization
SciPy — statistical aggregation (mode)

Project Structure

├── cluster_optimization.ipynb   # Full pipeline notebook
└── README.md

Usage

pip install numpy pandas scikit-learn matplotlib seaborn scipy

Open cluster_optimization.ipynb in Jupyter or Google Colab and run all cells. The notebook expects a CSV file with columns: _id, text, n_tokens, embedding (string representation of float arrays).
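
For reference, loading such a CSV into a numeric matrix might look like the sketch below. It assumes the embedding strings are list literals such as "[0.1, 0.2, 0.3]"; if the file stores them in another format, the parsing line would need to change. The function name `load_embeddings` and the in-memory sample data are hypothetical.

```python
import ast
import io

import numpy as np
import pandas as pd

REQUIRED_COLUMNS = {"_id", "text", "n_tokens", "embedding"}


def load_embeddings(csv_source):
    """Read the CSV and return (DataFrame, float32 matrix of embeddings)."""
    df = pd.read_csv(csv_source)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"CSV is missing columns: {sorted(missing)}")
    # Assumption: each cell is a Python/JSON-style list literal.
    # ast.literal_eval parses it without executing arbitrary code.
    vectors = df["embedding"].apply(ast.literal_eval)
    X = np.array(vectors.tolist(), dtype=np.float32)
    return df, X


# Tiny in-memory example standing in for the real file.
sample_csv = io.StringIO(
    '_id,text,n_tokens,embedding\n'
    'a,hello,1,"[0.1, 0.2, 0.3]"\n'
    'b,world,1,"[0.4, 0.5, 0.6]"\n'
)
df_sample, X_sample = load_embeddings(sample_csv)
print(X_sample.shape)
```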
