A complete Machine Learning fraud detection system featuring a modular pipeline, PostgreSQL persistence, an interactive CLI, and a real-time web dashboard. Developed as part of the FIAP Tech Challenge – Phase 3.
Want to see it in action? Run the dashboard in just 3 steps:
```bash
# 1. Start the Flask server
python run.py
# 2. Open in your browser
# http://127.0.0.1:5000
# 3. Test the detector!
# → Click “LEGÍTIMA” (legitimate) or “FRAUDULENTA” (fraudulent)
# → Click “EXECUTAR SIMULAÇÃO” (run simulation)
# → Watch real-time results:
# ✅ Model Prediction (classified by XGBoost)
# 🎯 Ground Truth (actual transaction type)
# 📊 Fraud probability (0–100%)
# 🔍 Confidence level (HIGH / MODERATE / LOW)
# ⚡ Inference latency (measured in real-time)
```

🎯 What you’ll see:
- Simulation Panel – trigger legitimate or fraudulent transactions in one click
- Real-Time Stats – total transactions, detected frauds, recall rate, latency
- Full History – all classifications with prediction vs actual label
💡 Tip: Try simulating multiple fraudulent transactions. The model will realistically miss some (~15% error), demonstrating that it’s not overfitted.
Develop a fraud detection system using an optimized XGBoost model with:
- ✅ Handling of highly imbalanced data (1:578 fraud ratio)
- ✅ Modular MVC-ML Pipeline (processing → training → inference)
- ✅ Robust validation with StratifiedKFold
- ✅ Automated Grid Search with hyperparameter versioning
- ✅ PostgreSQL (data) + Pickle (models) persistence
- ✅ Complete CLI with 4 operational modes
- ✅ Flask REST API backend for real-time simulation
- ✅ Interactive Web Dashboard for live monitoring and insights
The system follows a layered modular architecture for clear separation of concerns:
- Model (M): `src/models/` – SQLAlchemy ORM + ML configs
- View (V): `src/services/frontend/` – HTML templates + static assets
- Controller (C): `src/services/backend/` – Flask routes + logic
- ML Pipeline: `src/ml/` – data processing and training
- Services: `src/services/` – infrastructure components (DB, ML, Frontend, Backend)
```
ml-fraud-detector/
├── data/ # 📊 Datasets and configs
│ ├── archive/ # Older hyperparameter versions
│ ├── examples/ # Sample transactions
│ ├── creditcard.csv # Original dataset (284,807 rows)
│ └── xgboost_hyperparameters.json
│
├── database/ # 🗄️ PostgreSQL config
│ ├── migrations/ # SQL migrations
│ ├── docker-compose.yml # PostgreSQL 15 setup
│ └── schema.sql # Schema (7 pipeline + 2 webapp tables)
│
├── docs/ # 📚 Documentation
│ ├── images/ # Charts and screenshots
│ ├── API_ENDPOINTS.md # REST API docs
│ ├── DATA_ARCHITECTURE.md # PostgreSQL + Pickle architecture
│ ├── DECISOES_TECNICAS.md # ML optimizations
│ ├── EDA_REPORT.md # Exploratory data analysis
│ ├── MODEL_SELECTION.md # Model comparison
│ └── TRANSACTION_EXAMPLES.md # Examples
│
├── models/ # 🤖 Trained ML models
│ ├── archive/ # Older versions
│ ├── scalers.pkl # RobustScaler + StandardScaler
│ └── xgboost_v2.1.0.pkl # ⭐ Production model
│
├── reports/ # 📈 Generated analysis
│ └── feature_selection_analysis.json
│
├── src/ # 💻 Source code
│ ├── ml/ # 🧠 ML pipeline
│ ├── models/ # 📦 SQLAlchemy ORM models
│ ├── services/ # 🔌 Service layer
│ └── __init__.py
│
├── main.py # 🎯 CLI entry point
├── run.py # 🚀 Flask app entry
├── requirements.txt # Dependencies
└── README.md
```
```
CSV (raw) → [01] Load → [02] Outlier Analysis → [03] Missing Values
→ [04] Normalize → [05] Feature Engineering → [06] Feature Selection
→ [07] Train/Test Split → ✅ Ready for training
```
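Conceptually, each numbered step is an independent transformation, so the whole flow can be expressed as a simple chain. A minimal sketch, assuming every step exposes a `DataFrame → DataFrame` function (the names below are illustrative, not the actual modules under `src/ml/`):

```python
import pandas as pd

def run_pipeline(df: pd.DataFrame, steps) -> pd.DataFrame:
    """Apply each processing step in order, passing the result forward."""
    for step in steps:
        df = step(df)
    return df

# Hypothetical step functions corresponding to [02]-[07] above:
# df = pd.read_csv("data/creditcard.csv")                      # [01] Load
# df = run_pipeline(df, [analyze_outliers, handle_missing,
#                        normalize, engineer_features,
#                        select_features, split_train_test])
```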
PostgreSQL Tables
- Pipeline: `raw_transactions`, `cleaned_transactions`, … `test_features`
- Webapp: `classification_results`, `simulated_transactions`

Pickle Files
- `models/scalers.pkl` – RobustScaler + StandardScaler
- `models/xgboost_v2.1.0.pkl` – production model

JSON Configs
- `xgboost_hyperparameters.json` – active hyperparameters
- `archive/` – previous versions (auto-timestamped)
Hybrid Design Rationale
- PostgreSQL → traceability + analytics
- Pickle → fast model loading
- JSON → version control for reproducibility
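A minimal sketch of how the three layers come together at inference time; the file paths match the repository layout above, while the variable names are illustrative:

```python
import json
import pickle

# Fast-loading binary artifacts (Pickle layer)
with open("models/xgboost_v2.1.0.pkl", "rb") as f:
    model = pickle.load(f)
with open("models/scalers.pkl", "rb") as f:
    scalers = pickle.load(f)  # RobustScaler + StandardScaler

# Version-controlled configuration (JSON layer)
with open("data/xgboost_hyperparameters.json") as f:
    params = json.load(f)

# Inputs and outputs live in PostgreSQL (traceability layer);
# see the table list above.
```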
- Python 3.13, Pandas, XGBoost 3.0.5
- Flask 3.0.3, SQLAlchemy, PostgreSQL 15
- Docker Compose, ThreadPoolExecutor, COPY optimization
- Source: Kaggle – Credit Card Fraud Detection
- Transactions: 284,807 (2 days, September 2013)
- Features: 30 (PCA-transformed + Time, Amount)
- Target: Class (0=legit, 1=fraud)
- Imbalance: 492 frauds (0.172%) vs 284,315 legitimate
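The 1:578 ratio quoted above falls straight out of these counts, which is also where the model's `scale_pos_weight` of 577 comes from. A quick check:

```python
import pandas as pd

df = pd.read_csv("data/creditcard.csv")
counts = df["Class"].value_counts()
print(counts)                 # 0: 284315 (legit), 1: 492 (fraud)
print(counts[0] / counts[1])  # ≈ 577.9 → the 1:578 imbalance
```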
- Python ≥3.13
- PostgreSQL ≥15
- 2GB+ RAM
```bash
git clone <repo-url>
cd ml-fraud-detector
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

or faster:

```bash
uv pip install -e .
```

Option A: Docker (recommended)

```bash
cd database
docker-compose up -d
```

Option B: Local setup

```bash
createdb fraud_detection
createuser fraud_user --password fraud_pass_dev
psql -U fraud_user -d fraud_detection -f database/schema.sql
```

```bash
python main.py pipeline
```

Supports 4 main modes:
- `pipeline` – runs the full data processing pipeline (Steps 01–07)
- `train` – trains XGBoost with the current JSON hyperparameters
- `tune` – performs automated Grid Search and updates the hyperparameter JSON
- inference – runs inference on CSV or JSON input
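A hedged sketch of how such a dispatcher could look with argparse; the actual `main.py` may be wired differently, and the inference subcommand name (`predict` here) is hypothetical:

```python
import argparse

parser = argparse.ArgumentParser(description="ML fraud detector CLI")
sub = parser.add_subparsers(dest="mode", required=True)

sub.add_parser("pipeline", help="run processing steps 01-07")
sub.add_parser("train", help="train XGBoost with current JSON hyperparameters")
sub.add_parser("tune", help="Grid Search, then update the JSON")
predict = sub.add_parser("predict", help="inference (hypothetical name)")
predict.add_argument("input_file", help="CSV or JSON with transactions")

args = parser.parse_args()
# dispatch on args.mode ...
```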
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/simulate` | Generate and classify a transaction |
| GET | `/api/stats` | Get aggregated stats |
| GET | `/api/history` | Retrieve prediction history |
| GET | `/health` | Health check |
Full API reference: docs/API_ENDPOINTS.md
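A quick way to exercise the endpoints from Python; the request payload and response fields below are assumptions inferred from what the dashboard displays, so check `docs/API_ENDPOINTS.md` for the actual contract:

```python
import requests

BASE = "http://127.0.0.1:5000"

# Trigger a simulated transaction (payload shape is an assumption)
resp = requests.post(f"{BASE}/api/simulate", json={"type": "fraudulent"})
resp.raise_for_status()
print(resp.json())  # e.g. prediction, ground truth, probability, latency

print(requests.get(f"{BASE}/api/stats").json())  # aggregated stats
print(requests.get(f"{BASE}/health").json())     # health check
```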
Interactive web dashboard for simulation and monitoring:
- Run real-time simulations
- Track recall, latency, and fraud rate
- Review full prediction history
- Technical Reports: EDA, model selection, and data architecture
- Plans: MVP roadmap, Kafka scalability, changelog
- Architecture Docs: `src/ml/README.md`
`src/ml/models/configs.py`:

```python
xgboost_params = {
'colsample_bytree': 0.7,
'learning_rate': 0.3,
'max_depth': 6,
'min_child_weight': 1,
'n_estimators': 100,
'scale_pos_weight': 577,
'subsample': 0.7,
'eval_metric': 'aucpr'
}
```

| Step | Before | After | Gain | Optimization |
|---|---|---|---|---|
| Normalize | 90s | 16.5s | +81.6% | PostgreSQL COPY |
| Pipeline Total | 130s | 62s | +52% | Parallelization |
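The COPY gain in the Normalize row comes from streaming rows through PostgreSQL's bulk-load path instead of issuing per-row INSERTs. A minimal sketch with psycopg2; the table name and DSN below are illustrative:

```python
import io

import pandas as pd
import psycopg2

def bulk_load(df: pd.DataFrame, table: str, dsn: str) -> None:
    """Stream a DataFrame into PostgreSQL via COPY (one round trip)."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)

# bulk_load(normalized_df, "normalized_transactions",
#           "dbname=fraud_detection user=fraud_user password=fraud_pass_dev")
```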
| Metric | v1.1.0 | v2.0.0 | v2.1.0 ⭐ |
|---|---|---|---|
| PR-AUC | 0.8719 | 0.8847 | 0.8772 |
| Precision | 72.27% | 85.42% | 86.60% |
| Recall | 87.76% | 83.67% | 81.63% |
| F1-Score | 79.26% | 84.54% | 84.04% |
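A minimal sketch of how the `xgboost_params` above combine with the StratifiedKFold validation mentioned earlier; `X` and `y` (the processed feature matrix and `Class` labels) and the exact CV settings are assumptions:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

model = XGBClassifier(**xgboost_params)  # params from configs.py above
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# average_precision ≈ PR-AUC, the headline metric in the table above
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print(scores.mean())
```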
Includes:
- Pipeline tables (raw → processed)
- Webapp tables (`classification_results`, `simulated_transactions`)
✅ Completed
📋 Next Step: Optional Kafka streaming for scalability
The modular ML pipeline architecture is designed for easy Apache Airflow integration:
- Step-based structure (01-07) maps directly to Airflow DAGs
- Service layer separation enables distributed task execution
- JSON configs + PostgreSQL provide stateful orchestration support
- CLI modes (`train`, `tune`, `pipeline`) can become Airflow operators
This allows seamless migration from local execution to scheduled, distributed ML workflows without code restructuring.
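A hedged sketch of that mapping, assuming Airflow's BashOperator wraps the existing CLI; the DAG id, schedule, and task graph are illustrative and not part of this repository:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="fraud_detection_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    pipeline = BashOperator(task_id="pipeline",
                            bash_command="python main.py pipeline")
    train = BashOperator(task_id="train",
                         bash_command="python main.py train")
    pipeline >> train  # steps 01-07, then training
```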
In large-scale scenarios, Apache Kafka can enable distributed ingestion, parallel consumers, and real-time reprocessing.
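As a rough illustration, a consumer in that scenario could feed the existing inference service; the `kafka-python` client, topic name, and handler below are assumptions:

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transactions",                      # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    tx = message.value
    # classify(tx)  # e.g. reuse the model-loading code shown earlier
```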
- Dataset: Kaggle – Credit Card Fraud Detection
- Docs: XGBoost, Flask, PostgreSQL
- Victor Lucas Santos de Oliveira – LinkedIn
- Adrianny Lelis da Silva – LinkedIn
