Predictive Modeling on 8M+ Real Crime Records using PySpark on Databricks
_By Yogesh Sivakumar_
Built a binary classifier to predict whether a reported crime in Chicago was domestic or non-domestic, using 8M+ records from 2000–2025. Leveraged big-data tooling on Databricks with PySpark, SparkSQL, and SparkML, and applied feature engineering, statistical tests, and performance optimizations to handle the large dataset effectively.
- Predict domestic crimes using historical data
- Handle big data using scalable tools (PySpark & SparkSQL)
- Evaluate model performance and provide actionable insights
- Platform: Databricks Community Edition
- Tech Stack: PySpark, SparkSQL, SparkML
- Model: Random Forest Classifier
- Metrics: Accuracy, Precision, Recall, F1-Score
- Optimization: Caching, Repartitioning
| Metric | Value |
|---|---|
| Accuracy | 88.7% |
| Precision | 87.9% |
| Recall | 88.0% |
| F1-Score | 87.9% |
- Chicago Crime Data Analysis and Domestic Crime Prediction.pdf – Full project report
- Chicago_Crime_Analysis.html – Databricks notebook (HTML export)
- View on Databricks
- Hyperparameter tuning and cross-validation
- Use advanced models (e.g., GBT, XGBoost)
- Apply data balancing techniques (SMOTE)
- Integrate MLflow for tracking and reproducibility