This project builds a reusable machine learning pipeline that detects fraudulent transactions from any structured CSV file. Whether the data comes from PayPal, Stripe, or internal logs, the system validates the input, engineers meaningful features, runs multiple models, and outputs fraud risk scores.
- Accepts any transaction CSV file
- Validates column structure, data types, and missing values
- Cleans and transforms the data (encoding, scaling, feature creation)
- Trains and compares four ML models:
  - Logistic Regression
  - Decision Tree
  - Random Forest
  - K-Nearest Neighbors (KNN)
- Outputs fraud probability per transaction
- Summarizes model performance using Accuracy, Precision, Recall, and F1
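The four summary metrics can all be derived from a confusion matrix. A minimal, stdlib-only sketch (the labels below are made-up illustrations, not the project's data):

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives/negatives for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Return Accuracy, Precision, Recall, and F1 for one model's predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: 6 transactions, 2 actual frauds, one caught and one missed
y_true = [0, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 0]
print(summarize(y_true, y_pred))
```

Running `summarize` once per classifier gives the side-by-side comparison table described above.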
- Input Validation
  - Checks schema, nulls, types, duplicates
- Feature Engineering
  - Encodes categoricals, scales numerics, creates derived features
- Model Training & Evaluation
  - Benchmarks four classifiers
  - Compares metrics across models
- Scoring & Output
  - Generates fraud scores
  - Produces model comparison summary
- Optional Deployment
  - Streamlit or Flask interface for CSV upload and scoring
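The Input Validation stage could be sketched like this; the required column names and rules below are illustrative assumptions, not the project's actual schema:

```python
import csv
import io

REQUIRED_COLUMNS = {"transaction_id", "amount", "timestamp"}  # assumed schema

def validate_rows(reader):
    """Check schema, nulls, a numeric type, and duplicate IDs; return a list of issues."""
    issues = []
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
        return issues
    seen_ids = set()
    for i, row in enumerate(reader, start=1):
        if any(row[c] in ("", None) for c in REQUIRED_COLUMNS):
            issues.append(f"row {i}: null/empty value")
        try:
            float(row["amount"])
        except ValueError:
            issues.append(f"row {i}: non-numeric amount {row['amount']!r}")
        if row["transaction_id"] in seen_ids:
            issues.append(f"row {i}: duplicate transaction_id")
        seen_ids.add(row["transaction_id"])
    return issues

sample = "transaction_id,amount,timestamp\n1,12.50,2024-01-01\n1,oops,2024-01-02\n"
print(validate_rows(csv.DictReader(io.StringIO(sample))))
```

An empty issue list means the CSV is safe to hand to the cleaning and feature-engineering stages.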
- Modular Python pipeline
- Fraud scores per transaction
- Model comparison dashboard
- Ready-to-integrate output for analysts or downstream systems
Fraud patterns evolve. Static rules fail. ML adapts. This project turns raw data into actionable insight—fast, scalable, and production-ready.
Just run the notebook `04_model_training.ipynb`.
fraud_scoring_service/
├── data/
│ └── raw/
│ └── synthetic_fraud_dataset.csv # the untouched CSV of transactions from Kaggle
├── notebooks/
│ ├── _main_.ipynb
│ ├── 01_knn_baseline.ipynb
│ ├── 02_knn_with_scaling_and_onehot.ipynb
│ ├── 03_data_preparation.ipynb # validate schema/types, clean data, engineer features
│ ├── 04_model_training.ipynb # train classifiers and compare metrics
│ └── with_functions.ipynb
├── lib/
│ ├── functions.py # load functions
│ ├── validate_input.py # check for required columns, dtypes, nulls/duplicates
│ ├── clean_data.py # fill missing values, convert/fix Timestamp, drop duplicates
│ ├── feature_engineering.py # encode categoricals, scale numerics, derive new features
├── README.md
├── .gitattributes # Git attributes for tracked paths
├── .gitignore # files and folders excluded from Git commits
├── .config.yaml # configuration file (data paths, model parameters)
├── pyproject.toml # project metadata and dependency definitions
└── uv.lock # locked versions of all dependencies
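The transformations in `feature_engineering.py` (encode categoricals, scale numerics, derive features) might look roughly like this sketch; the field names, categories, and scaling bounds are illustrative assumptions only:

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over known categories."""
    return [1.0 if value == c else 0.0 for c in categories]

def min_max_scale(x, lo, hi):
    """Scale a numeric value into [0, 1] given the training min/max."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def engineer(tx):
    """Turn a raw transaction dict into a numeric feature vector."""
    features = []
    features.extend(one_hot(tx["channel"], ["web", "pos", "atm"]))   # encoded categorical
    features.append(min_max_scale(tx["amount"], 0.0, 1000.0))        # scaled numeric
    # derived feature: amount relative to account balance
    features.append(tx["amount"] / tx["balance"] if tx["balance"] else 1.0)
    return features

tx = {"channel": "web", "amount": 12.50, "balance": 150.0}
print(engineer(tx))
```

In the real pipeline the categories and min/max bounds must be learned from the training set and reused verbatim at scoring time, otherwise train and inference features drift apart.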
- [Presentation](https://docs.google.com/presentation/d/1DCRTmxcjTngXTsZA_19D8t6RzMZyaBRoeajUfCeMsb8/edit?usp=sharing)
- [Streamlit app](https://w7pjprojectfraudscoringservice-rrp8ex7co5rxczfmvvjnkx.streamlit.app/)
Try it yourself: enter a transaction with amount 12.50, balance 150, operation day 1, no previous fraud, on a weekend. With LogisticOversample the risk score comes out at 0.92; switch to LogisticRegression and it drops to about 36%. The model adaptation examples show how the models learn from inputs like these.
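Why do two logistic models disagree so much on one transaction? Each one maps the feature vector to a probability through a sigmoid over its own learned weights, and oversampling the fraud class shifts those weights toward higher scores. A minimal sketch of the scoring math (the weights and bias below are made up for illustration, not the trained models'):

```python
import math

def fraud_probability(features, weights, bias):
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical transaction: amount 12.50, balance 150, weekend flag 1, no prior fraud
features = [12.50, 150.0, 1.0, 0.0]
weights = [0.05, -0.01, 0.8, 2.0]   # illustrative weights only
print(round(fraud_probability(features, weights, bias=-1.0), 3))
```

The same transaction scored under a different weight vector (e.g. one fitted on oversampled fraud cases) lands at a different probability, which is exactly the LogisticOversample vs LogisticRegression gap seen in the demo.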