System Architecture & Operational Guide

1. System Overview

This Enterprise Credit Card Fraud Detection System is a modular, production-ready Python application designed to identify fraudulent transactions in high-volume financial datasets. Unlike simple script-based approaches, this system handles the entire machine learning lifecycle: data ingestion, cleaning, complex feature engineering, model training with class imbalance handling, and automated reporting.

2. Technical Architecture

The system is architected into distinct modules within the src/ directory, ensuring separation of concerns and maintainability.

A. Data Ingestion (`src/data_loader.py`)

Responsibility: Efficiently loads and sanitizes raw transaction logs.
Key Logic:
- Validates file existence and integrity.
- Standardizes column names (correcting inconsistencies like oldbalanceOrg vs newbalanceOrig).
- Handles missing values (though rare in PaySim).

B. Feature Engineering (`src/features.py`)

This is the critical component where domain knowledge is applied to extract signal from noise.

Temporal Features: Extracts hour_of_day from the step column to capture time-based fraud patterns.
Behavioral Features:
- errorBalanceOrig: Calculates the discrepancy between the transaction amount and the change in the originator's balance. Fraudulent transactions often show zero balance changes or unexpected amounts.
- errorBalanceDest: Similar calculation for the recipient's balance.
Categorical Encoding: Converts transaction types (CASH_OUT, TRANSFER, etc.) into one-hot encoded variables (type_CASH_OUT, type_TRANSFER) for mathematical model compatibility.

C. Model Development (`src/model.py`)

Algorithm: Random Forest Classifier.
- Why? Random Forest is robust to noise, handles non-linear relationships effectively, and provides feature importance visibility.
Class Imbalance Handling:
- The dataset is highly imbalanced (~0.17% fraud).
- We use class_weight='balanced', which automatically adjusts weights inversely proportional to class frequencies. This penalizes the model heavily for missing a fraud case, ensuring it doesn't just predict "Legit" 99.9% of the time.
Persistence: Models are serialized using joblib to models/fraud_model.pkl for immediate deployment without retraining.

D. Evaluation & Reporting (`src/evaluation.py`)

Automated Reporting: Every training run generates:
- Classification Report: Precision, Recall, and F1-score for both classes.
- Confusion Matrix: Visualizes True Positives vs. False Positives.
- ROC Curve: Plots the True Positive Rate against the False Positive Rate.
Audit Logging (src/utils.py): All actions are logged to logs/ in JSON format for compliance and debugging.

3. Training Process

The training pipeline (main.py --mode train) executes the following steps:

Ingest: Load the 6.3M+ row dataset.
Split: Perform a Stratified Train-Test Split (80/20). Stratification is crucial to ensure the minority class (fraud) is present in both sets in the exact same proportion.
Scale: Apply StandardScaler to normalize numerical features (mean=0, variance=1).
Train: Fit the Weighted Random Forest on the training set.
Validate: Predict on the held-out test set and generate metrics.

4. Evaluation Results

The model achieves exceptional performance on the PaySim dataset:

ROC-AUC: ~0.999 (Near perfect)
Precision (Fraud): 1.00 (No false positives in test set)
Recall (Fraud): 1.00 (caught all fraud in test set)

Note: These perfect scores reflect the synthetic nature of the dataset, where fraud patterns are deterministic. In real-world production data, expect slightly lower scores (AUC ~0.95), but the architecture remains valid.

5. Usage Scenarios

Auditing Historical Data

Run the model in prediction mode on past transaction logs to identify missed fraud cases.

python main.py --mode predict --data old_logs.csv --model_path models/fraud_model.pkl

Real-time Logic (API Integration)

The FraudDetector class in src/model.py can be imported into a REST API (FastAPI/Flask) to score transactions in real-time before approval.

from src.model import FraudDetector
detector = FraudDetector()
detector.load_model('models/fraud_model.pkl')
prediction = detector.predict_proba(new_transaction_features)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

System Architecture & Operational Guide

1. System Overview

2. Technical Architecture

A. Data Ingestion (`src/data_loader.py`)

B. Feature Engineering (`src/features.py`)

C. Model Development (`src/model.py`)

D. Evaluation & Reporting (`src/evaluation.py`)

3. Training Process

4. Evaluation Results

5. Usage Scenarios

Auditing Historical Data

Real-time Logic (API Integration)

FilesExpand file tree

SYSTEM_OVERVIEW.md

Latest commit

History

SYSTEM_OVERVIEW.md

File metadata and controls

System Architecture & Operational Guide

1. System Overview

2. Technical Architecture

A. Data Ingestion (src/data_loader.py)

B. Feature Engineering (src/features.py)

C. Model Development (src/model.py)

D. Evaluation & Reporting (src/evaluation.py)

3. Training Process

4. Evaluation Results

5. Usage Scenarios

Auditing Historical Data

Real-time Logic (API Integration)

A. Data Ingestion (`src/data_loader.py`)

B. Feature Engineering (`src/features.py`)

C. Model Development (`src/model.py`)

D. Evaluation & Reporting (`src/evaluation.py`)