A machine learning solution for detecting fraudulent financial transactions, using an XGBoost classifier on a highly imbalanced dataset of more than 6.3 million transactions.
Google Colab Notebook: View & Run in Colab
This project implements a fraud detection system that analyzes financial transaction patterns to identify fraudulent activity, reaching a ROC-AUC of 0.9999 on the test set. The model uses balance changes, transaction types, and engineered features, and is intended for real-time scoring scenarios.
- Dataset Size: 6,362,620 transactions
- Fraud Rate: 0.129% (highly imbalanced)
- Time Period: 30 days (744 hours)
- Model Performance: ROC-AUC of 0.9999
- Comprehensive exploratory data analysis (EDA)
- Missing values and outlier analysis
- Transaction type distribution analysis
- Fraud pattern visualization
- errorBalanceOrig: Balance discrepancy detection on sender side
- errorBalanceDest: Balance discrepancy detection on receiver side
- isHighRiskType: Flags TRANSFER and CASH_OUT transactions
- isOriginDrained: Identifies completely emptied accounts
- isDestEmpty: Detects transactions to new/empty accounts
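
The engineered features listed above could be derived along the following lines. This is a minimal sketch, not the notebook's code: the formulas for `errorBalanceOrig`, `errorBalanceDest`, `isOriginDrained`, and `isDestEmpty` are assumptions based on the feature descriptions, and `add_fraud_features` is a hypothetical helper name.

```python
import pandas as pd

HIGH_RISK_TYPES = ["TRANSFER", "CASH_OUT"]


def add_fraud_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the five engineered columns added."""
    out = df.copy()
    # Assumed definition: zero when the sender's balances reconcile with the amount.
    out["errorBalanceOrig"] = out["newbalanceOrig"] + out["amount"] - out["oldbalanceOrg"]
    # Assumed definition: zero when the receiver's balances reconcile with the amount.
    out["errorBalanceDest"] = out["oldbalanceDest"] + out["amount"] - out["newbalanceDest"]
    # Flag the two transaction types in which fraud was observed.
    out["isHighRiskType"] = out["type"].isin(HIGH_RISK_TYPES).astype(int)
    # Assumed definition: the sender had funds before and none after.
    out["isOriginDrained"] = (
        (out["oldbalanceOrg"] > 0) & (out["newbalanceOrig"] == 0)
    ).astype(int)
    # Assumed definition: the receiver's balance was zero before the transaction.
    out["isDestEmpty"] = (out["oldbalanceDest"] == 0).astype(int)
    return out


# Two toy rows: a TRANSFER that drains the sender, and an ordinary PAYMENT.
sample = pd.DataFrame({
    "type": ["TRANSFER", "PAYMENT"],
    "amount": [1000.0, 50.0],
    "oldbalanceOrg": [1000.0, 200.0],
    "newbalanceOrig": [0.0, 150.0],
    "oldbalanceDest": [0.0, 10.0],
    "newbalanceDest": [0.0, 60.0],
})
features = add_fraud_features(sample)
print(features[["errorBalanceOrig", "errorBalanceDest", "isOriginDrained"]])
```

In the toy data, the TRANSFER row leaves the receiver's balance unchanged despite the incoming amount, so its `errorBalanceDest` is non-zero while the PAYMENT row reconciles exactly.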
- Algorithm: XGBoost Classifier
- Handling Imbalance: Strategic undersampling (10:1 ratio)
- Performance Metrics:
  - ROC-AUC Score: 0.9999
  - Average Precision: 0.9985
  - Fraud Recall: 99.75%
  - Fraud Precision: 87%
  - F1-Score: 0.93
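
The 10:1 undersampling step might look like the sketch below. It assumes the ratio means ten non-fraud rows kept for every fraud row and that undersampling is applied to the training split only; `undersample_indices` and its parameters are hypothetical names, not taken from the notebook.

```python
import numpy as np


def undersample_indices(y, ratio=10, seed=42):
    """Return shuffled row indices that keep every fraud row and at most
    `ratio` randomly chosen non-fraud rows per fraud row."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y == 1)
    normal_idx = np.flatnonzero(y == 0)
    # Never ask for more non-fraud rows than actually exist.
    n_keep = min(len(normal_idx), ratio * len(fraud_idx))
    kept_normal = rng.choice(normal_idx, size=n_keep, replace=False)
    idx = np.concatenate([fraud_idx, kept_normal])
    rng.shuffle(idx)
    return idx


# Toy labels: 5 fraud rows among 1,000 transactions.
labels = np.array([1] * 5 + [0] * 995)
idx = undersample_indices(labels)
print(len(idx), int(labels[idx].sum()))
```

The selected rows would then be used to fit the classifier (for example `xgboost.XGBClassifier`), while evaluation stays on an untouched test split so that the reported metrics reflect the natural fraud rate.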
- errorBalanceOrig (50.44%) - Balance inconsistencies on sender side
- isOriginDrained (46.09%) - Complete account drainage
- Transaction Amount (1.32%) - Avg fraud: ₹1.46M vs normal: ₹178K
- Transaction Type - Only TRANSFER and CASH_OUT show fraud activity
- 98.1% of fraud cases completely drain the origin account
- 65.2% of fraud destinations are empty accounts
- Fraudulent transactions are about 8x larger on average (₹1.46M vs ₹178K)
- Only TRANSFER and CASH_OUT transaction types exhibit fraud
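
Statistics like the ones above can be recomputed from the raw columns. The sketch below is illustrative only: `fraud_summary` is a hypothetical helper, the "drained" and "empty destination" definitions are assumptions, and the toy data is made up to exercise the function rather than to reproduce the reported percentages.

```python
import pandas as pd


def fraud_summary(df: pd.DataFrame) -> dict:
    """Summarize how fraud rows differ from non-fraud rows."""
    fraud = df[df["isFraud"] == 1]
    normal = df[df["isFraud"] == 0]
    # Assumed definition: sender had funds before and none after.
    drained = (fraud["oldbalanceOrg"] > 0) & (fraud["newbalanceOrig"] == 0)
    return {
        "drained_share": float(drained.mean()),
        "empty_dest_share": float((fraud["oldbalanceDest"] == 0).mean()),
        "amount_ratio": float(fraud["amount"].mean() / normal["amount"].mean()),
        "fraud_types": sorted(fraud["type"].unique()),
    }


toy = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT", "CASH_IN"],
    "amount": [800.0, 800.0, 100.0, 100.0],
    "oldbalanceOrg": [800.0, 1000.0, 500.0, 0.0],
    "newbalanceOrig": [0.0, 200.0, 400.0, 100.0],
    "oldbalanceDest": [0.0, 50.0, 0.0, 0.0],
    "isFraud": [1, 1, 0, 0],
})
summary = fraud_summary(toy)
print(summary)
```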
- Python 3.x
- Data Processing: NumPy, Pandas
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-learn, XGBoost
- Environment: Google Colab
Fraud-Detection-ML/
│
├── accredian_python_file.ipynb # Main Jupyter notebook
├── Data_Dictionary.txt # Dataset documentation
├── README.md # Project documentation
└── requirements.txt # Python dependencies (if applicable)
- True Negatives: 1,270,642
- False Positives: 239
- False Negatives: 4
- True Positives: 1,639
- Non-Fraud Precision: 1.00 (rounded to two decimals)
- Fraud Recall: 1.00 (rounded to two decimals; 1,639 of 1,643 fraud cases caught)
- Overall Accuracy: 99.98%
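
As a sanity check, the headline metrics can be rederived from the four confusion matrix counts listed above. The sketch uses only the standard definitions of precision, recall, F1, and accuracy; the variable names are illustrative.

```python
# Confusion matrix counts as reported for the test set.
tn, fp, fn, tp = 1_270_642, 239, 4, 1_639

fraud_precision = tp / (tp + fp)      # share of flagged transactions that are fraud
fraud_recall = tp / (tp + fn)         # share of fraud cases that were caught
f1 = 2 * fraud_precision * fraud_recall / (fraud_precision + fraud_recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)
non_fraud_precision = tn / (tn + fn)  # share of "not fraud" predictions that are correct

print(f"Fraud precision:     {fraud_precision:.4f}")
print(f"Fraud recall:        {fraud_recall:.4f}")
print(f"F1-score:            {f1:.4f}")
print(f"Accuracy:            {accuracy:.4%}")
print(f"Non-fraud precision: {non_fraud_precision:.6f}")
```

The results agree with the reported figures up to rounding. They also show why accuracy alone is a weak summary here: with so few fraud cases, the 239 false positives barely move accuracy but pull fraud precision down to roughly 87%.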
- Real-time Transaction Scoring - Deploy model as API endpoint
- Velocity Checks - Monitor rapid account drainage patterns
- Destination Verification - Enhanced KYC for new accounts
- Transaction Limits - Smart thresholds based on model predictions
- Behavioral Analytics - Customer baseline profiling
- Time-based Monitoring - Increased sensitivity during off-peak hours
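
One way to turn model predictions into transaction limits is a tiered decision rule, sketched below. Everything here is hypothetical: the function name `decide_action`, the threshold values, and the idea of halving the review threshold for large amounts are placeholders that would need tuning against real costs of false positives and missed fraud.

```python
def decide_action(fraud_probability, amount,
                  review_at=0.5, block_at=0.9, large_amount=200_000):
    """Map a model score to 'allow', 'review', or 'block'.

    All thresholds are illustrative defaults, not values from the project.
    """
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    # Be stricter with large transactions: send them to review at a lower score.
    effective_review = review_at / 2 if amount > large_amount else review_at
    if fraud_probability >= block_at:
        return "block"
    if fraud_probability >= effective_review:
        return "review"
    return "allow"


for score, amount in [(0.95, 100), (0.30, 100), (0.30, 500_000)]:
    print(score, amount, decide_action(score, amount))
```

In a deployed API, `fraud_probability` would come from the trained model's predicted probability for the fraud class, and the "review" tier gives analysts a way to absorb false positives without blocking customers outright.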
- Reduction in fraud detection time
- Decreased false positive rates
- Lower financial losses from fraud
- Improved customer satisfaction through reduced friction
numpy
pandas
matplotlib
seaborn
scikit-learn
xgboost
- Open the Google Colab link
- Run all cells sequentially
- Review visualizations and model performance metrics
The dataset contains synthetic financial transaction data with the following features:
- step: Time unit (1 step = 1 hour)
- type: Transaction type (CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER)
- amount: Transaction amount in local currency
- nameOrig: Customer initiating the transaction
- oldbalanceOrg: Sender's balance before the transaction
- newbalanceOrig: Sender's balance after the transaction
- nameDest: Recipient of the transaction
- oldbalanceDest: Recipient's initial balance
- newbalanceDest: Recipient's new balance
- isFraud: Fraud indicator (target variable)
- isFlaggedFraud: Business rule-based flag (transactions with an amount above 200K)
The project comprehensively addresses:
- Data cleaning methodology (missing values, outliers, multicollinearity)
- Model selection and justification (XGBoost)
- Variable selection process and feature engineering
- Model performance evaluation
- Key fraud prediction factors
- Logical validation of predictive features
- Prevention strategies and recommendations
- Effectiveness measurement frameworks
Sujal Kumar Nayak
- LinkedIn: linkedin.com/in/sujal-kumar-nayak
- Email: nayaksujalkumar@gmail.com
This project is created for educational and portfolio purposes.
- Dataset: Synthetic financial transaction data
- Inspiration: Real-world fraud detection systems used by PayPal, Stripe, and major banks
- Tools: Google Colab for cloud computing resources
Note: This project demonstrates advanced machine learning techniques for fraud detection. The model achieves near-perfect performance on the test set and provides actionable insights for real-world fraud prevention systems.