# Personal Finance Manager – Machine-Learning Toolkit
## Overview
This repository contains a collection of data-science utilities and Jupyter notebooks that help you analyse personal financial transactions.
The main objectives are:
1. Detect fraudulent transactions automatically.
2. Predict/categorise spending patterns and merchant categories.
3. Provide a reusable preprocessing pipeline for cleaning and engineering transaction data.
The project is **exploratory** in nature – it accompanies a series of blog posts and lectures on applying classic machine-learning techniques to structured and text-heavy financial data.
## Key Features
* **Fraud Detection** – Logistic Regression and Random Forest models trained on a balanced dataset (synthetic + real) to flag suspicious activity.
* **Spending Category Classification** – Multinomial Naive Bayes (NLP) model that learns from the transaction *Description* free-text field.
* **Synthetic Fraud Generator** – Quickly bootstrap models with `preprocessing/generate_fraud_data.py`.
* **Data Cleaning & Feature Engineering** – Ready-to-use helpers under `preprocessing/` and `utils/`.
* **Reproducible Notebooks** – Step-by-step walkthroughs for each modelling task.
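The two fraud-detection models above can be sketched as follows. This is a minimal, hedged illustration using synthetic stand-in features (the real feature set comes from `preprocessing/feature_engineering.py` and is not reproduced here); column meanings in the comments are assumptions, not the project's actual schema.

```python
# Minimal sketch: Logistic Regression baseline vs. Random Forest on
# synthetic transaction-like features. Feature semantics are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))  # e.g. amount, hour of day, merchant risk score
# Label a transaction fraudulent when a noisy linear score exceeds a threshold
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

baseline = LogisticRegression().fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(f"logistic accuracy: {baseline.score(X_te, y_te):.3f}")
print(f"forest accuracy:   {forest.score(X_te, y_te):.3f}")
```

See `fraudLogistic.ipynb` and `fraudRF.ipynb` for the full experiments, including class balancing and evaluation beyond plain accuracy.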
## Project Structure
```text
MLAIProject1/
├── notebooks/
│ ├── fraudLogistic.ipynb # Logistic Regression baseline
│ ├── fraudRF.ipynb # Random Forest improvement
│ └── NLPNaiveBayesFinal.ipynb # Text-based category classifier
│
├── preprocessing/
│ ├── cleaner.py # Placeholder for data-cleaning helpers
│ ├── feature_engineering.py # Generate model-ready features
│ └── generate_fraud_data.py # Create synthetic fraud samples
│
├── utils/
│ └── fillna.py # Merge raw files & patch missing values
│
├── raw/ # Original csv statements & generated fraud
├── processed/ # Cleaned / model-ready datasets
└── README # You are here
```
## Data
1. **Raw statements** – Exported CSVs from your bank or credit-card provider should be placed in `raw/`.
Example columns: `Transaction Date`, `Posted Date`, `Card No.`, `Description`, `Category`, `Debit`, `Credit`.
2. **Synthetic fraud** – Run `python preprocessing/generate_fraud_data.py` to create extra fraudulent transactions for class balance.
3. **Combined dataset** – Execute `utils/fillna.py` to merge raw + synthetic data and to patch missing *Debit/Credit* fields.
The resulting file is written to `processed/combined_transactions.csv`.
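The merge-and-patch step can be sketched in pandas. This is an assumption-laden illustration of what `utils/fillna.py` does, using the example columns listed above with toy values; the actual script may differ in file handling and patching rules.

```python
# Hedged sketch: concatenate raw and synthetic statements, then patch
# missing Debit/Credit fields with 0.0. Column names follow the README's
# example schema; the rows here are invented for illustration.
import pandas as pd

raw = pd.DataFrame({
    "Description": ["COFFEE SHOP", "GROCERY MART"],
    "Debit": [4.50, 23.10],
    "Credit": [None, None],
})
synthetic = pd.DataFrame({
    "Description": ["SUSPICIOUS WIRE"],
    "Debit": [None],
    "Credit": [None],
})

combined = pd.concat([raw, synthetic], ignore_index=True)
combined[["Debit", "Credit"]] = combined[["Debit", "Credit"]].fillna(0.0)

# A real script would then write the result, e.g.:
# combined.to_csv("processed/combined_transactions.csv", index=False)
print(combined)
```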
## Getting Started
1. **Clone the repo**
```bash
git clone https://github.com/your-username/MLAIProject1.git
cd MLAIProject1
```
2. **Create a virtual environment & install dependencies** (example)
```bash
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt  # requirements.txt should list pandas, numpy, scikit-learn, nltk, jupyter
```
3. **Prepare the data**
```bash
# (a) Generate synthetic fraud samples
python preprocessing/generate_fraud_data.py
# (b) Combine with your latest statement
python utils/fillna.py
```
4. **Open the notebooks**
```bash
jupyter notebook
```
Run the notebooks in the order listed in the Notebook Guide to reproduce the experiments.
## Notebook Guide
| Notebook | Goal |
|----------|------|
| `fraudLogistic.ipynb` | Baseline fraud-detection using Logistic Regression |
| `fraudRF.ipynb` | Improved ensemble approach with Random Forests |
| `NLPNaiveBayesFinal.ipynb` | Categorise transactions based on text using Naive Bayes |
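The text-based classifier in `NLPNaiveBayesFinal.ipynb` follows a standard bag-of-words pipeline; a minimal sketch is below. The training examples and category labels are toy data invented for illustration, and the exact vectoriser settings in the notebook may differ.

```python
# Hedged sketch: Multinomial Naive Bayes over the Description free-text
# field, using bag-of-words counts. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

descriptions = [
    "STARBUCKS COFFEE #1234", "PEETS COFFEE", "DUNKIN DONUTS",
    "SHELL OIL", "CHEVRON GAS STATION", "EXXON FUEL",
]
categories = ["Dining", "Dining", "Dining", "Gas", "Gas", "Gas"]

# CountVectorizer lowercases and tokenises; MultinomialNB fits word counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(descriptions, categories)

print(model.predict(["BLUE BOTTLE COFFEE"]))  # the shared "coffee" token points to Dining
```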
## Results
Detailed metrics (accuracy, precision-recall, ROC curves) are reported inside each notebook.
A summary table will be added here once experiments are finalised.
## Contributing
Pull requests are welcome! Feel free to open an issue or submit improvements, whether it’s code, documentation, or ideas for new models.