A comprehensive data mining application built with Streamlit that implements clustering and classification algorithms from scratch, strictly following academic course material.
Before installing, make sure you have one of the following:
- Option 1: Docker installed on your system
- Option 2: Python 3.9+ installed
First, clone the repository to your local machine:
git clone <repository-url>
cd DM-Project

Docker ensures the application runs consistently across all systems without dependency issues.
docker build -t data-mining-app .
docker run -p 8501:8501 data-mining-app

Open your browser and navigate to:
http://localhost:8501
If you prefer running without Docker:
pip install -r requirements.txt
streamlit run app.py

Your browser should automatically open to:
http://localhost:8501
- Click "Browse files" in the sidebar
- Upload a CSV file containing your dataset
- The application will display your data with dimensions and column information
Navigate through the data cleaning workflow:
- Duplicates: Detect and remove duplicate rows
- Missing Values:
- View missing data summary
- Use "Auto Clean Data" to convert non-standard values to NaN
- Add custom missing indicators (e.g., "?", "N/A", "missing")
- Choose handling method: Drop rows, Fill with 0/Mean/Median, Forward/Backward fill
- Column Management: Delete unnecessary columns
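The duplicate and missing-value steps above map onto standard pandas operations. A minimal sketch (the column names and custom indicators are illustrative, not the app's actual schema):

```python
import numpy as np
import pandas as pd

# Toy dataset with a duplicate row and non-standard missing markers
df = pd.DataFrame({
    "age":    ["25", "?", "30", "30", "N/A"],
    "income": [50000, 60000, None, None, 55000],
})

df = df.drop_duplicates()                          # remove duplicate rows
df = df.replace(["?", "N/A", "missing"], np.nan)   # custom indicators -> NaN
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# One of the handling options: fill numeric columns with the column median
filled = df.fillna(df.median(numeric_only=True))
```

Dropping rows (`df.dropna()`) or forward/backward fill (`df.ffill()` / `df.bfill()`) are the other options the UI offers.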
- View boxplots for each numeric column
- Identify extreme values visually
- Choose to remove or keep outliers
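The boxplot view flags extreme values using the standard 1.5 × IQR whisker rule; a small sketch of that rule on toy data (not the app's code):

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])   # 95 is an extreme value
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # boxplot whisker bounds

outliers = values[(values < lower) | (values > upper)]
kept = values[(values >= lower) & (values <= upper)]
```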
- View descriptive statistics: Min, Max, Mean, Median, Mode, Q1, Q3
- Generate scatter plots to visualize relationships between attributes
- Automatically detect float columns that represent integers
- Convert them to proper integer types
Choose your normalization strategy:
- Min-Max Normalization: Scales values to [0, 1]
- Z-Score Standardization: Centers data with mean=0, std=1
- Skip: Proceed without normalization
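Both strategies reduce to one-line formulas; a quick sketch with NumPy (toy values; population standard deviation assumed for the z-score):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-Max normalization: (x - min) / (max - min) -> values in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: (x - mean) / std -> mean 0, std 1
zscore = (x - x.mean()) / x.std()
```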
Discover hidden patterns in your data:
- Choose X-axis and Y-axis features for 2D visualization
- Preview data distribution before clustering
- K-Means: Centroid-based clustering with elbow method
- K-Medoids (PAM): Medoid-based clustering, more robust to outliers
- AGNES: Hierarchical agglomerative clustering with dendrogram
- DIANA: Hierarchical divisive clustering
- DBSCAN: Density-based clustering with noise detection
- K-Means/K-Medoids: Number of clusters (k)
- AGNES: Linkage method (single, complete, average, ward)
- DBSCAN: eps (neighborhood radius), MinPts (minimum points)
- Single Algorithm: Run one algorithm with specific parameters
- Select All: Compare all algorithms automatically with optimal parameters
- 2D and 3D cluster visualizations
- Cluster distribution histograms
- Evaluation metrics: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index
- Elbow curves (K-Means, K-Medoids)
- Dendrograms (AGNES, DIANA)
- Best algorithm recommendation based on composite scores
Train classifiers to predict target classes:
- Choose the column you want to predict (class/label)
- View class distribution and balance
- Adjust train/test split ratio (default: 80/20)
- View sample counts per class in both sets
- k-NN (k-Nearest Neighbors): Distance-based classification
- Configure k (number of neighbors, 1-20)
- View accuracy and precision across different k values
- Naive Bayes: Probabilistic classifier using Bayes' theorem
- Gaussian distribution for continuous attributes
- C4.5 Decision Tree: Tree-based classifier using Gain Ratio
- Automatically builds tree with optimal splits
- SVM (Support Vector Machine): Maximum margin classifier
- Configure kernel (linear, RBF, polynomial)
- Adjust C parameter (regularization)
- Single Algorithm: Train one classifier with specific parameters
- Select All: Compare all four classifiers automatically
- Evaluation metrics: Accuracy, Precision, Recall, F-Measure
- Confusion matrices (table and heatmap)
- Performance plots (k-NN: Accuracy vs k, Precision vs k)
- Best algorithm recommendation
- Individual visualizations for each algorithm in comparison mode
DM-Project/
├── app.py                          # Main Streamlit application
├── requirements.txt                # Python dependencies
├── Dockerfile                      # Docker configuration
├── docker-compose.yml              # Docker Compose setup
├── data_mining_project_tasks.md    # Detailed project roadmap
├── README.md                       # This file
│
└── algorithms/
    ├── unsupervised/               # Clustering algorithms
    │   ├── interface.py            # Unsupervised learning UI
    │   ├── kmeans.py               # K-Means implementation
    │   ├── kmedoids.py             # K-Medoids (PAM) implementation
    │   ├── agnes.py                # AGNES hierarchical clustering
    │   ├── diana.py                # DIANA divisive clustering
    │   └── dbscan.py               # DBSCAN density-based clustering
    │
    └── supervised/                 # Classification algorithms
        ├── interface.py            # Supervised learning UI
        ├── knn.py                  # k-Nearest Neighbors
        ├── naive_bayes.py          # Gaussian Naive Bayes
        ├── c45.py                  # C4.5 Decision Tree
        └── svm.py                  # Support Vector Machine
- Manual centroid initialization and updates
- Iterative convergence checking
- Elbow method for optimal k detection
- Metrics: Inertia (WCSS), Silhouette Score, Davies-Bouldin, Calinski-Harabasz
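As a rough illustration of what a from-scratch K-Means loop involves (a simplified sketch, not the repository's `kmeans.py`, which may differ in initialization and convergence details):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # manual initialization
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute centroids as cluster means
        # (toy sketch; assumes no cluster goes empty)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):   # convergence check
            break
        centroids = new
    inertia = ((X - centroids[labels]) ** 2).sum()  # WCSS, used by the elbow method
    return labels, centroids, inertia
```

Running this for k = 1, 2, 3, … and plotting the inertia values produces the elbow curve.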
- Medoid-based clustering (more robust to outliers)
- PAM algorithm implementation
- Elbow method visualization
- Metrics: Same as K-Means
- Bottom-up hierarchical clustering
- Multiple linkage methods: Single, Complete, Average, Ward
- Dendrogram visualization
- Metrics: Silhouette Score, Davies-Bouldin, Calinski-Harabasz
- Top-down hierarchical clustering
- Recursive splitting strategy
- Dendrogram visualization
- Metrics: Same as AGNES
- Density-based clustering
- Automatic noise detection
- Core, Border, and Noise point classification
- Parameters: eps (neighborhood radius), MinPts (minimum points)
- Metrics: Separate metrics for clustered points and all points
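The core/border/noise classification can be sketched as a standalone building block (illustrative, not the repository's `dbscan.py`):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (a DBSCAN building block)."""
    n = len(X)
    # Pairwise Euclidean distances; each point counts as its own neighbor
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")                  # dense neighborhood
        elif any(j in core for j in neighbors[i]):
            labels.append("border")                # within eps of a core point
        else:
            labels.append("noise")                 # reachable from no core point
    return labels
```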
- 4-Step Process (from class material):
- Compute Euclidean distances to all training points
- Select k nearest neighbors
- Identify classes of k neighbors
- Majority vote for final prediction
- Multi-k evaluation (k=1 to k=20)
- Visualizations: Accuracy vs k, Precision vs k, Confusion matrix
- Metrics: Accuracy, Precision, Recall, F-Measure
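The four steps above can be sketched in a few lines (illustrative, not the repository's `knn.py`):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # Step 1: Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Step 2: select the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 3: identify the classes of those k neighbors
    classes = [y_train[i] for i in nearest]
    # Step 4: majority vote for the final prediction
    return Counter(classes).most_common(1)[0][0]
```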
- Training Phase:
- Prior probabilities: P(Ck) = Count(Ck) / Total
- Conditional probabilities: P(xi | Ck) using Gaussian distribution
- Prediction Phase:
- Posterior = P(Ck) × ∏ P(xi | Ck)
- Predicted class = argmax(Posterior)
- Key Assumption: Conditional independence between attributes
- Metrics: Accuracy, Precision, Recall, F-Measure, Confusion matrix
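A minimal sketch of the training and prediction phases described above (illustrative; the repository's `naive_bayes.py` may differ):

```python
import math

def gaussian_pdf(x, mean, var):
    # Gaussian likelihood P(xi | Ck) for a continuous attribute
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def nb_fit(X, y):
    model = {}
    for c in set(y):
        rows = [xi for xi, yi in zip(X, y) if yi == c]
        prior = len(rows) / len(X)                # P(Ck) = Count(Ck) / Total
        stats = []
        for j in range(len(X[0])):                # per-attribute mean and variance
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) or 1e-9
            stats.append((mean, var))
        model[c] = (prior, stats)
    return model

def nb_predict(model, x):
    # Posterior ∝ P(Ck) × ∏ P(xi | Ck); predict the argmax class
    best, best_p = None, -1.0
    for c, (prior, stats) in model.items():
        p = prior
        for xi, (mean, var) in zip(x, stats):
            p *= gaussian_pdf(xi, mean, var)
        if p > best_p:
            best, best_p = c, p
    return best
```

The per-attribute product is exactly where the conditional-independence assumption enters.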
- Training Phase:
- Calculate entropy for node purity
- Compute Information Gain for potential splits
- Calculate Split Information
- Gain Ratio = Information Gain / Split Information (key C4.5 improvement)
- Select attribute with maximum Gain Ratio
- Recursive tree building with stopping criteria
- Prediction Phase:
- Traverse tree from root to leaf
- Follow branches based on attribute values
- Metrics: Accuracy, Precision, Recall, F-Measure, Confusion matrix
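The Gain Ratio computation at the heart of C4.5 can be sketched for a categorical attribute (illustrative, not the repository's `c45.py`):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain Ratio = Information Gain / Split Information (the key C4.5 improvement)."""
    n = len(labels)
    info_gain = entropy(labels)      # start from the node's entropy
    split_info = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        w = len(subset) / n
        info_gain -= w * entropy(subset)   # subtract weighted child entropy
        split_info -= w * math.log2(w)     # Split Information penalizes wide splits
    return info_gain / split_info if split_info else 0.0
```

The tree builder calls this for every candidate attribute and splits on the maximum Gain Ratio.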
- Key Concepts:
- Find optimal hyperplane that maximizes margin between classes
- Support Vectors: Data points closest to decision boundary
- Kernel functions: Transform data to higher dimensions for non-linear separation
- Training Phase:
- Apply kernel transformation (linear, RBF, or polynomial)
- Optimize hyperplane to maximize margin
- Identify support vectors (critical training points)
- Prediction Phase:
- Determine which side of hyperplane test point falls on
- Assign class based on hyperplane decision
- Parameters:
- Kernel: Type of kernel function (linear, RBF, poly)
- C: Regularization parameter (controls margin vs misclassification trade-off)
- Metrics: Accuracy, Precision, Recall, F-Measure, Confusion matrix, Support vector counts
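As a rough sketch of the ideas above, here is a linear-kernel SVM trained by sub-gradient descent on the hinge loss (a simplification; the repository's `svm.py` and the kernelized case are more involved):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    """Minimize 0.5*||w||^2 + C * sum(max(0, 1 - y_i*(w·x_i + b))).

    Labels must be in {-1, +1}. Points with margin < 1 act like support vectors
    and pull the hyperplane; points outside the margin only feel regularization.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) < 1:          # inside margin: hinge sub-gradient
                w += lr * (C * yi * xi - w)
                b += lr * C * yi
            else:                              # outside margin: shrink w only
                w -= lr * w
    return w, b

def svm_predict(w, b, x):
    # Class is determined by which side of the hyperplane x falls on
    return 1 if w @ x + b >= 0 else -1
```

A larger C tolerates less misclassification at the cost of a narrower margin, which is the trade-off the C slider in the UI controls.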
- Silhouette Score (−1 to 1, higher is better)
  - Measures how similar points are to their own cluster vs. other clusters
  - Values near 1 indicate well-separated clusters
- Davies-Bouldin Index (0 to ∞, lower is better)
  - Measures average similarity between clusters
  - Lower values indicate better separation
- Calinski-Harabasz Index (0 to ∞, higher is better)
  - Ratio of between-cluster to within-cluster dispersion
  - Higher values indicate denser, well-separated clusters
- Composite Score
  - Normalized weighted combination of all metrics
  - Used to recommend the best algorithm
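The Silhouette Score, for example, can be computed directly from its definition (a minimal sketch; it assumes every cluster has at least two points):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) per point, averaged."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    scores = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = dist[i][own].mean()                        # cohesion: own cluster
        b = min(dist[i][labels == c].mean()            # separation: nearest other
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```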
- Accuracy (0 to 1, higher is better)
  - Proportion of correct predictions
  - Formula: (TP + TN) / Total
- Precision (0 to 1, higher is better)
  - Of all positive predictions, how many were correct
  - Formula: TP / (TP + FP)
- Recall (0 to 1, higher is better)
  - Of all actual positives, how many were identified
  - Formula: TP / (TP + FN)
- F-Measure (0 to 1, higher is better)
  - Harmonic mean of Precision and Recall
  - Formula: 2 × (Precision × Recall) / (Precision + Recall)
- Composite Score
  - Average of Accuracy, Precision, Recall, and F-Measure
  - Used to recommend the best algorithm
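All four classification metrics follow directly from the confusion-matrix counts; a minimal sketch for the binary case (the composite here is the simple average described above):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics above from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    composite = (accuracy + precision + recall + f_measure) / 4
    return accuracy, precision, recall, f_measure, composite
```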
- Python 3.9+
- Streamlit - Interactive web application framework
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Plotly - Interactive visualizations
- scikit-learn - Only for confusion matrix and validation metrics
- Docker - Containerization for reproducibility
This application strictly follows academic course material:
- No use of sklearn's clustering or classification algorithms
- Manual implementation of distance calculations, centroid/medoid updates, tree building, etc.
- Only sklearn used for: confusion matrix, train_test_split validation, and metric calculations
- k-NN: 4-step process with Euclidean distance
- Naive Bayes: Prior probabilities, Gaussian likelihood, posterior calculation
- C4.5: Entropy, Information Gain, Split Information, Gain Ratio (not just Information Gain)
- Clustering: All distance metrics, linkage methods, and evaluation metrics match course definitions
- Code comments explain academic concepts
- UI includes expandable metric explanations
- All formulas are documented
- Docker: Ensure Docker Desktop is running
- Python: Verify your Python version with `python --version` (must be 3.9+)
- Dependencies: Run `pip install -r requirements.txt` again
- Change port: `streamlit run app.py --server.port 8502`
- Docker: `docker run -p 8502:8501 data-mining-app`
- Large datasets may require more RAM
- Try reducing dataset size or using Docker with increased memory allocation
- Ensure Plotly is installed: `pip install plotly`
- Clear browser cache and refresh
This project is developed for academic purposes as part of a Data Mining course.
Ramy
Data Mining Course Project
Version 3.0
December 2025
- Course instructors for providing detailed algorithm specifications
- Academic material that guided the implementation
- Streamlit community for excellent documentation
For issues or questions about the application:
- Check the troubleshooting section above
- Review the detailed task plan in `data_mining_project_tasks.md`
- Verify your setup matches the prerequisites
✨ Enjoy exploring your data with rigorous, academically compliant algorithms!