End-to-end clustering pipeline for 45K text embeddings: data cleaning with ensemble outlier detection, optimal cluster count selection using 4 methods, and final KMeans clustering with t-SNE visualization.
Raw Data (45,895 texts with embeddings)
                 │
                 ▼
┌─────────────────────────────────┐
│  1. Data Cleaning               │
│  ├─ KNN Distance Outliers       │
│  ├─ Local Outlier Factor (LOF)  │
│  └─ Isolation Forest            │
│  → Ensemble vote (≥2 of 3)      │
└─────────────────────────────────┘
                 │  43,304 clean samples (-5.6%)
                 ▼
┌─────────────────────────────────┐
│  2. Optimal k Selection         │
│  ├─ Elbow Method                │
│  ├─ Silhouette Analysis         │
│  ├─ Calinski-Harabasz Index     │
│  └─ Davies-Bouldin Index        │
│  → Median of 4 methods          │
└─────────────────────────────────┘
                 │  k = 57
                 ▼
┌─────────────────────────────────┐
│  3. Final Clustering            │
│  ├─ KMeans (k=57, k-means++)    │
│  ├─ t-SNE 2D visualization      │
│  └─ Cluster analysis & export   │
└─────────────────────────────────┘
                 │  57 clusters with centroids
                 ▼
Output CSV
Three outlier detection methods are applied independently, and a conservative ensemble voting approach removes points flagged by at least 2 out of 3 methods:
| Method | Outliers Detected |
|---|---|
| KNN Distance (z-score > 1.0) | 4,729 |
| Local Outlier Factor | 4,590 |
| Isolation Forest | 4,590 |
| Combined (≥2 votes) | 2,591 |
Result: 45,895 → 43,304 clean samples (5.6% removed)
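
The voting step can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function name `ensemble_outlier_mask`, `n_neighbors=20`, and `contamination=0.1` are assumptions (the last one chosen only because the LOF and Isolation Forest counts above are close to 10% of the data), and the KNN score is assumed to be the z-scored mean distance to the k nearest neighbours.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors


def ensemble_outlier_mask(X, n_neighbors=20, z_threshold=1.0,
                          contamination=0.1, min_votes=2, random_state=0):
    """Return a boolean mask, True where at least `min_votes` detectors flag a point.

    All default parameter values are illustrative assumptions.
    """
    # Detector 1: mean distance to the k nearest neighbours, converted to a z-score.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    mean_dist = dist[:, 1:].mean(axis=1)  # column 0 is the query point itself
    z = (mean_dist - mean_dist.mean()) / mean_dist.std()
    knn_flag = z > z_threshold

    # Detectors 2 and 3: both estimators label outliers as -1.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    lof_flag = lof.fit_predict(X) == -1
    iso = IsolationForest(contamination=contamination, random_state=random_state)
    iso_flag = iso.fit_predict(X) == -1

    # Majority vote across the three detectors.
    votes = knn_flag.astype(int) + lof_flag.astype(int) + iso_flag.astype(int)
    return votes >= min_votes
```

With such a mask, `X_clean = X[~ensemble_outlier_mask(X)]` would keep only the rows that received fewer than two votes.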
Four standard methods evaluate cluster counts from 2 to 80:
| Method | Optimal k |
|---|---|
| Elbow Method | 2 |
| Silhouette Analysis | 2 |
| Calinski-Harabasz Index | 57 |
| Davies-Bouldin Index | 57 |
| Final (median) | 57 |
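
A sweep of this kind might look like the sketch below. It is an assumption-laden illustration rather than the notebook's implementation: `score_k_range` and `pick_k_per_method` are hypothetical names, `n_init=10` is an arbitrary choice, and the elbow is located with a simple largest-second-difference heuristic that requires evenly spaced k values and may differ from the rule used in the notebook. The sketch stops at the per-method picks and does not aggregate them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)


def score_k_range(X, k_values, random_state=0):
    """Fit KMeans for each k and record inertia plus three internal validity scores."""
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=random_state).fit(X)
        rows.append({
            "k": k,
            "inertia": km.inertia_,
            "silhouette": silhouette_score(X, km.labels_),
            "calinski_harabasz": calinski_harabasz_score(X, km.labels_),
            "davies_bouldin": davies_bouldin_score(X, km.labels_),
        })
    return rows


def pick_k_per_method(rows):
    """Return the k preferred by each criterion."""
    ks = np.array([r["k"] for r in rows])
    inertia = np.array([r["inertia"] for r in rows])
    picks = {
        # Higher is better for silhouette and Calinski-Harabasz.
        "silhouette": int(ks[np.argmax([r["silhouette"] for r in rows])]),
        "calinski_harabasz": int(ks[np.argmax([r["calinski_harabasz"] for r in rows])]),
        # Lower is better for Davies-Bouldin.
        "davies_bouldin": int(ks[np.argmin([r["davies_bouldin"] for r in rows])]),
    }
    # Elbow heuristic (assumption): interior k with the sharpest bend in inertia.
    if len(ks) >= 3:
        picks["elbow"] = int(ks[1:-1][np.argmax(np.diff(inertia, 2))])
    return picks
```

For the range above this would be called as `pick_k_per_method(score_k_range(X_clean, range(2, 81)))`. On tens of thousands of 3,072-dimensional points the silhouette score is the expensive part; `silhouette_score` accepts a `sample_size` argument if an approximation is acceptable.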
- 57 clusters from 43,304 text embeddings (3,072-dimensional)
- Cluster sizes range from ~100 to ~1,600 items
- t-SNE 2D projection shows visually well-separated clusters
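
The final step could be reproduced along these lines. This is a hedged sketch: `cluster_and_project` is a hypothetical helper, and `n_init=10`, `init="pca"` for t-SNE, and the default perplexity are assumptions rather than settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE


def cluster_and_project(X, k=57, random_state=0):
    """Cluster X with KMeans and compute a 2D t-SNE projection for plotting."""
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=random_state).fit(X)
    # Number of points assigned to each cluster, indexed by cluster label.
    sizes = np.bincount(km.labels_, minlength=k)
    # t-SNE is used for visualization only; it does not influence the clustering.
    coords = TSNE(n_components=2, init="pca",
                  random_state=random_state).fit_transform(X)
    return km.labels_, km.cluster_centers_, sizes, coords
```

The returned `coords` can be passed to a scatter plot coloured by `labels`. Running t-SNE directly on 3,072-dimensional input is slow, so reducing the dimensionality first (for example with PCA) is a common optional step.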
- Python — core language
- scikit-learn — KMeans, t-SNE, LOF, Isolation Forest, silhouette/CH/DB metrics
- pandas / NumPy — data manipulation
- matplotlib / seaborn — visualization
- SciPy — statistical aggregation (mode)
├── cluster_optimization.ipynb # Full pipeline notebook
└── README.md
pip install numpy pandas scikit-learn matplotlib seaborn scipy

Open cluster_optimization.ipynb in Jupyter or Google Colab and run all cells. The notebook expects a CSV file with columns: _id, text, n_tokens, embedding (string representation of float arrays).
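
Loading such a file might look like the sketch below. It assumes each `embedding` cell is a Python/JSON-style list literal such as `"[0.1, -0.2, ...]"`; the helper name `load_embeddings` is hypothetical and the notebook may parse the column differently.

```python
import ast

import numpy as np
import pandas as pd

REQUIRED_COLUMNS = ["_id", "text", "n_tokens", "embedding"]


def load_embeddings(csv_path):
    """Read the input CSV and return (DataFrame, 2D float32 embedding matrix)."""
    df = pd.read_csv(csv_path)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    # Assumption: every cell is a list literal of floats with the same length.
    X = np.array([ast.literal_eval(s) for s in df["embedding"]], dtype=np.float32)
    return df, X
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for parsing the stored strings.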