The Credit Card Fraud Detection problem involves modeling past credit card transactions, using the knowledge of which ones turned out to be fraudulent. The resulting model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing incorrect fraud classifications.
Dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
The provided code is an example of anomaly detection in a dataset using two outlier detection algorithms: Isolation Forest and Local Outlier Factor. Here's a breakdown of what the code does:
- Importing the required packages: The necessary libraries are imported, including numpy, pandas, matplotlib.pyplot, seaborn, sys, and scipy.
- Loading the dataset: The dataset is loaded with pandas from the file specified by the "LOCATION OF THE FILE" placeholder.
- Dataset exploration:
  - Printing the column names: The names of the columns in the dataset are printed.
  - Printing the shape of the data: The dimensions (number of rows and columns) of the data are printed.
  - Printing data statistics: Summary statistics of the dataset are printed, including count, mean, min, max, etc.
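The loading and exploration steps above can be sketched as follows. This is only an illustration: because the file location is a placeholder, the example builds a small synthetic DataFrame instead of reading the real file, and the column names `V1`, `V2`, and `Amount` as well as the row count are assumptions (only `Class` is named in the description).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset. With the real file, this would
# instead be a pd.read_csv(...) call on the placeholder location.
rng = np.random.default_rng(0)
n_rows = 200  # arbitrary size chosen for the example
data = pd.DataFrame({
    "V1": rng.normal(size=n_rows),                      # assumed feature name
    "V2": rng.normal(size=n_rows),                      # assumed feature name
    "Amount": rng.exponential(scale=50.0, size=n_rows), # assumed feature name
    "Class": (rng.random(n_rows) < 0.05).astype(int),   # 1 = fraud, 0 = valid
})

print(data.columns.tolist())  # column names
print(data.shape)             # (rows, columns)
print(data.describe())        # count, mean, std, min, quartiles, max
```
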
- Plotting histograms: Histograms for each parameter in the dataset are plotted using the `hist` function from pandas. The histograms visualize the distribution of values for each parameter.
- Determining the number of fraud cases: The code calculates the proportion of fraud cases by dividing the number of fraud instances (Class = 1) by the number of valid instances (Class = 0). The result is stored in the `outlier_fraction` variable. Note that this is the ratio of fraud to valid transactions, not the share of fraud among all transactions.
- Printing the number of fraud cases and valid transactions: The code prints the number of fraud cases and valid transactions in the dataset.
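A minimal sketch of the fraud-ratio calculation described above, using a tiny hand-made DataFrame rather than the real dataset. The `Amount` values and the variable names `fraud` and `valid` are assumptions made for the example; `outlier_fraction` and the `Class` column come from the description.

```python
import pandas as pd

# Toy data: 8 valid transactions (Class = 0) and 2 fraud cases (Class = 1).
data = pd.DataFrame({
    "Amount": [12.0, 5.5, 80.0, 3.2, 45.0, 7.9, 19.9, 60.0, 950.0, 1200.0],
    "Class":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

fraud = data[data["Class"] == 1]
valid = data[data["Class"] == 0]

# Ratio of fraud cases to valid cases, as in the description.
outlier_fraction = len(fraud) / float(len(valid))

print(outlier_fraction)                       # 0.25
print("Fraud cases: {}".format(len(fraud)))   # Fraud cases: 2
print("Valid transactions: {}".format(len(valid)))  # Valid transactions: 8
```

On a heavily imbalanced dataset the ratio to valid cases and the share of all cases are nearly identical, which is presumably why the ratio can serve as an estimate of the outlier proportion.
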
- Correlation matrix: The code computes the correlation matrix of the dataset using the `corr` function from pandas. It visualizes the correlation matrix as a heatmap using seaborn's `heatmap` function.
- Data preprocessing:
  - Getting the columns from the DataFrame: All the column names from the dataset are stored in the `columns` variable.
  - Filtering unwanted columns: The code removes the "Class" column from the list of columns, as it is the target variable.
  - Splitting into features and target: The features (X) and target (Y) variables are separated from the dataset.
- Printing the shapes of X and Y: The code prints the shapes (dimensions) of the feature matrix (X) and target vector (Y).
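The correlation and preprocessing steps above might look roughly like this. The toy DataFrame, its `V1` and `Amount` columns, and the names `corrmat` and `target` are assumptions for illustration; `columns`, `X`, `Y`, and `Class` follow the description.

```python
import pandas as pd

# Toy data standing in for the real dataset.
data = pd.DataFrame({
    "V1":     [0.1, -1.2, 0.7, 2.5, -0.3, 1.1],
    "Amount": [12.0, 80.5, 3.2, 950.0, 45.0, 7.9],
    "Class":  [0, 0, 0, 1, 0, 0],
})

# Pairwise correlations between all columns (3 x 3 here).
corrmat = data.corr()

# Collect the column names, then drop the target column.
columns = data.columns.tolist()
columns = [c for c in columns if c not in ["Class"]]
target = "Class"

X = data[columns]  # feature matrix
Y = data[target]   # target vector

print(X.shape)  # (6, 2)
print(Y.shape)  # (6,)
```

Passing `corrmat` to seaborn's `heatmap` function would then render the matrix as described.
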
- Importing necessary libraries: The code imports additional libraries required for classification evaluation, including `classification_report` and `accuracy_score` from sklearn.metrics, `IsolationForest` from sklearn.ensemble, and `LocalOutlierFactor` from sklearn.neighbors.
- Defining random state: The variable `state` is set to 1, which will be used as the random state for the outlier detection algorithms.
- Defining outlier detection tools: Two outlier detection algorithms, Isolation Forest and Local Outlier Factor, are defined as dictionary entries in the `classifiers` dictionary.
- Plotting the number of errors: For each outlier detection algorithm, the code fits the data and predicts the outliers. It then compares the predicted outliers with the actual class labels (fraud or valid) to calculate the number of errors. The accuracy score and classification report are also printed.
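The detection and evaluation steps above can be sketched as below. This is an illustration on synthetic two-dimensional data, not the original script: the data, the `n_neighbors=20` setting, the use of `outlier_fraction` as the `contamination` parameter, and the `results` dictionary are all assumptions. Both estimators return +1 for inliers and -1 for outliers, so the predictions are remapped to 0 (valid) and 1 (fraud) before being compared with the labels.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: 190 "valid" points near the origin, 10 "fraud" points far away.
rng = np.random.RandomState(1)
X_valid = rng.normal(0.0, 1.0, size=(190, 2))
X_fraud = rng.uniform(6.0, 8.0, size=(10, 2))
X = np.vstack([X_valid, X_fraud])
Y = np.array([0] * 190 + [1] * 10)

outlier_fraction = len(X_fraud) / float(len(X_valid))
state = 1  # random state, as in the description

classifiers = {
    "Isolation Forest": IsolationForest(
        contamination=outlier_fraction, random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20, contamination=outlier_fraction),  # n_neighbors assumed
}

results = {}  # hypothetical helper: error count per algorithm
for name, clf in classifiers.items():
    if name == "Local Outlier Factor":
        # In its default mode LOF only supports fit_predict on the same data.
        y_pred = clf.fit_predict(X)
    else:
        clf.fit(X)
        y_pred = clf.predict(X)

    # Remap: +1 (inlier) -> 0 (valid), -1 (outlier) -> 1 (fraud).
    y_pred = np.where(y_pred == 1, 0, 1)

    n_errors = int((y_pred != Y).sum())
    results[name] = n_errors

    print("{}: {} errors".format(name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
```

Because fraud is rare, accuracy alone can look high even when few fraud cases are caught, so the per-class precision and recall in the classification report are the more informative figures.
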