This repository contains the group project for CS2207: Anomaly Detection in Credit Card Usage. The goal is to analyze credit card transaction behavior, cluster cardholders using DBSCAN, and identify outlier usage patterns that may indicate anomalous or unusual customer behavior.
Use Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to find clusters in high-dimensional credit card usage data and detect noise points as potential anomalies.
Key tasks:
- Perform exploratory data analysis on credit card usage metrics such as balances, purchase amounts, and transaction frequency.
- Clean and normalize the dataset, handling missing values and scaling features.
- Apply DBSCAN clustering and tune parameters using a k-distance plot.
- Test multiple `MinPts` values to identify robust clustering behavior.
- Profile the data points classified as noise and interpret them as outliers.
- `README.md` - Project overview and usage instructions.
- `pyproject.toml` - Project metadata and Python requirements.
- `code/experiments/` - Experimental analysis files, notebooks, and exploratory scripts.
- `code/finalized/` - Final implementation, cleaned scripts, and report-ready code.
- `data/` - Dataset files used for analysis.
- `requirements.txt` - Libraries used, with versions.
The assignment uses a credit card dataset for clustering analysis. The dataset should contain customer transaction and usage metrics appropriate for density-based clustering. Place the dataset files in the data/ folder before running the analysis.
If you have the dataset downloaded separately, rename it consistently (for example, `data/credit_card_usage.csv`).
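As a rough starting point for a first look at the file once it is in place, the sketch below reports each column's dtype, missing-value count, and numeric range. The helper name `summarize_dataset` and the column names in the demo frame are hypothetical and not taken from the actual dataset; the commented-out path reuses the example file name above.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, and numeric min/max."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "min": numeric.min(),   # NaN for non-numeric columns
        "max": numeric.max(),
    })


# With the real file this would be (path is the example name, an assumption):
# df = pd.read_csv("data/credit_card_usage.csv")

# Tiny stand-in frame with hypothetical column names, for illustration only.
demo = pd.DataFrame({
    "CUST_ID": ["a", "b", "c"],
    "BALANCE": [10.0, None, 30.0],
    "PURCHASES": [1.0, 2.0, 3.0],
})
summary = summarize_dataset(demo)
print(summary)
```

Columns with a nonzero `missing` count are the ones that need imputation or row removal before clustering.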
- Python 3.13 or later
- Recommended libraries: `numpy`, `pandas`, `scikit-learn`, `matplotlib`, `seaborn`, `scipy`
- Create and activate a virtual environment:
  - `python -m venv .venv`
  - `.\.venv\Scripts\Activate.ps1`
- Install the necessary libraries:
  - `pip install numpy pandas scikit-learn matplotlib seaborn scipy`
- Place the dataset in the `data/` folder.
- Exploratory Data Analysis
  - Inspect distributions of transaction features.
  - Visualize balances, purchase frequency, and usage statistics.
  - Identify missing values, correlations, and feature scales.
- Preprocessing
  - Handle missing values using imputation or row removal.
  - Normalize or scale numeric features so clustering works effectively.
  - Select features relevant for anomaly detection.
- DBSCAN Clustering
  - Generate a k-distance plot to choose an appropriate `epsilon`.
  - Experiment with different `MinPts` values.
  - Fit DBSCAN and examine cluster labels.
- Outlier Profiling
  - Identify points labeled as `-1` (noise).
  - Compare noise points to clustered points across key features.
  - Summarize what makes those points anomalous.
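The workflow above could be sketched roughly as follows with scikit-learn. This is a minimal illustration on synthetic data, not the project's finalized code: the helper names (`preprocess`, `k_distances`, `profile_noise`), the median imputation strategy, and the `eps` and `k` values are all assumptions chosen for the demo. In scikit-learn, `epsilon` and `MinPts` correspond to the `eps` and `min_samples` arguments of `DBSCAN`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.impute import SimpleImputer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def preprocess(X):
    """Impute missing values with the column median, then standardize."""
    X_imputed = SimpleImputer(strategy="median").fit_transform(X)
    return StandardScaler().fit_transform(X_imputed)


def k_distances(X_scaled, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    # k + 1 neighbors are requested because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_scaled)
    distances, _ = nn.kneighbors(X_scaled)
    return np.sort(distances[:, -1])


def profile_noise(X, labels):
    """Per-feature mean of noise points vs. clustered points (NaNs ignored)."""
    is_noise = labels == -1
    return np.nanmean(X[is_noise], axis=0), np.nanmean(X[~is_noise], axis=0)


# Synthetic stand-in for the real data: one dense group plus three isolated rows.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 3)),
    [[15.0, 15.0, 15.0], [-15.0, 15.0, -15.0], [15.0, -15.0, 0.0]],
])
X[0, 1] = np.nan  # simulate a missing value

X_scaled = preprocess(X)
kd = k_distances(X_scaled, k=4)  # plot kd and look for the elbow to pick eps
labels = DBSCAN(eps=1.0, min_samples=4).fit_predict(X_scaled)  # eps is illustrative
noise_mean, clustered_mean = profile_noise(X, labels)

print("noise points:", int((labels == -1).sum()))
print("noise mean per feature:", noise_mean)
print("clustered mean per feature:", clustered_mean)
```

Plotting `kd` (for example with `matplotlib`) gives the k-distance curve; on the real data, `eps` would be read off that curve rather than fixed in advance.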
- `DBSCAN` is especially useful when clusters have arbitrary shapes and when the dataset contains noise.
- The `epsilon` parameter controls neighborhood distance; choose it from the elbow of the k-distance curve.
- `MinPts` should typically be at least the dimensionality of the data plus one.
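To see how sensitive the result is to `MinPts`, one option is to hold `epsilon` fixed and refit for several candidate values. The sketch below does this on synthetic data; the helper name `sweep_min_pts`, the `eps` value, and the candidate list are assumptions for illustration. The first candidate follows the dimensionality-plus-one guideline above, while the larger multiples are other common rules of thumb, not project requirements.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def sweep_min_pts(X_scaled, eps, min_pts_values):
    """Fit DBSCAN once per MinPts value and record cluster and noise counts."""
    rows = []
    for min_pts in min_pts_values:
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_scaled)
        rows.append({
            "min_pts": min_pts,
            "n_clusters": len(set(labels.tolist()) - {-1}),  # -1 is noise, not a cluster
            "n_noise": int((labels == -1).sum()),
        })
    return rows


# Synthetic stand-in for already-scaled features: two well-separated groups in 4-D.
rng = np.random.default_rng(1)
X_scaled = np.vstack([
    rng.normal(0.0, 0.5, size=(200, 4)),
    rng.normal(5.0, 0.5, size=(200, 4)),
])

n_features = X_scaled.shape[1]
candidates = [n_features + 1, 2 * n_features, 4 * n_features]
results = sweep_min_pts(X_scaled, eps=0.8, min_pts_values=candidates)
for row in results:
    print(row)
```

With `eps` held fixed, raising `MinPts` can only keep or increase the number of noise points, so a noise count that stays roughly stable across the sweep suggests the flagged outliers are not an artifact of one parameter choice.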
Recommended final artifacts for this project:
- A cleaned Python script or notebook implementing the full workflow.
- A k-distance plot with the selected `epsilon`.
- A table or chart summarizing cluster sizes and noise count.
- Analysis of anomalous credit card usage behaviors.
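For the cluster-size table listed above, a small helper along these lines may be enough. It assumes only that `labels` is the integer array returned by DBSCAN, with `-1` marking noise; the function name `cluster_summary` and the example label array are hypothetical.

```python
import numpy as np
import pandas as pd


def cluster_summary(labels):
    """Tabulate the size and share of each DBSCAN label, flagging -1 as noise."""
    values, counts = np.unique(labels, return_counts=True)
    summary = pd.DataFrame({"label": values, "size": counts})
    summary["kind"] = np.where(summary["label"] == -1, "noise", "cluster")
    summary["share"] = summary["size"] / summary["size"].sum()
    return summary


# Hypothetical labels standing in for the output of DBSCAN.fit_predict.
example_labels = np.array([0, 0, 1, -1, 0, 1])
table = cluster_summary(example_labels)
print(table.to_string(index=False))
```

The resulting frame can be written to a report with `table.to_csv(...)` or passed to a bar chart.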
Author : Ankit Kumar
Email : ankit07chy@gmail.com