YogeshSivakumar18/High-Volume-Crime-Data-Analytics-on-Databricks

Chicago Crime Analytics: Domestic Crime Prediction Using Big Data & SparkML

Predictive Modeling on 8M+ Real Crime Records using PySpark on Databricks
By Yogesh Sivakumar

Overview

Built a binary classifier to predict whether a reported crime in Chicago was domestic or non-domestic, using 8M+ records from 2000–2025. The project runs on Databricks with PySpark, SparkSQL, and SparkML. It covers feature engineering and statistical tests, and uses caching and repartitioning to keep processing of the full dataset manageable.

Objectives

  • Predict domestic crimes using historical data
  • Handle big data using scalable tools (PySpark & SparkSQL)
  • Evaluate model performance and provide actionable insights

Tools & Techniques

  • Platform: Databricks Community Edition
  • Tech Stack: PySpark, SparkSQL, SparkML
  • Model: Random Forest Classifier
  • Metrics: Accuracy, Precision, Recall, F1-Score
  • Optimization: Caching, Repartitioning

Key Results

Metric       Value
Accuracy     88.7%
Precision    87.9%
Recall       88.0%
F1-Score     87.9%
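For readers less familiar with these four metrics, the sketch below shows how they are derived from a binary confusion matrix. The `ConfusionCounts` class and `binary_metrics` function are hypothetical names used only for illustration, and "positive" is assumed to mean "domestic". The README does not say whether the reported values are for the domestic class alone or weighted averages across both classes, so this shows the plain positive-class definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # domestic crimes predicted as domestic
    fp: int  # non-domestic crimes predicted as domestic
    fn: int  # domestic crimes predicted as non-domestic
    tn: int  # non-domestic crimes predicted as non-domestic


def binary_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, precision, recall, and F1 for the positive class."""
    total = c.tp + c.fp + c.fn + c.tn
    predicted_pos = c.tp + c.fp
    actual_pos = c.tp + c.fn

    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / predicted_pos if predicted_pos else 0.0
    recall = c.tp / actual_pos if actual_pos else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because domestic incidents are likely a minority of all reports, accuracy alone can look strong even when the positive class is predicted poorly, which is why precision, recall, and F1 are reported alongside it.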

Files in this Repository

  • Chicago Crime Data Analysis and Domestic Crime Prediction.pdf – Full project report
  • Chicago_Crime_Analysis.html – Databricks notebook (HTML export)
  • 🌐 View on Databricks

Future Work

  • Hyperparameter tuning and cross-validation
  • Use advanced models (e.g., GBT, XGBoost)
  • Apply data balancing techniques (SMOTE)
  • Integrate MLflow for tracking and reproducibility

👥 Authors

Yogesh Sivakumar

