Detecting phishing websites using machine learning
Phishing is a type of attack in which an attacker disguises themselves as a trustworthy entity to trick the victim into giving up sensitive information such as login credentials. In this application we will try to detect phishing websites using the features that differentiate their domains from legitimate ones. We will create our own dataset, train and test various machine learning models using Jupyter Notebooks on IBM Watson Studio, and deploy the best model so that the application can use it for detection.
The following features are used to create the dataset:
- Using IP address - checks whether the URL contains an IP address
- HTTPS - checks for the existence of 'https', a trusted certificate authority, and the age of the certificate
- URL shortening - checks whether the URL has been shortened
- Having @ symbol - an '@' in the URL leads the browser to ignore everything preceding it
- Having double slash - a '//' after the protocol means that the user will be redirected (http://www.legitimate.com//http://www.phishing.com)
- Domain registration length - trustworthy domains are regularly paid for several years in advance
- Favicon - checks whether the favicon is loaded from the same domain
- HTTPS token - checks for the existence of an 'https' token in the domain part of the URL
- Request URL - examines whether the external objects contained within a webpage are loaded from another domain
- URL of anchor - checks whether the anchor tags point to a domain different from the website's
- Links in tags - tags (Meta, Script and Link) are expected to link to the same domain as the webpage
- Server Form Handler - checks whether it is blank or contains another domain name
- Submitting information to email
- Abnormal URL - checks whether the domain name (from WHOIS) is missing from the URL
- Redirect count
- Invisible iframe
- Age of domain
- Web traffic - Google rank for the page
- Statistical report - matches the site against the top 10 domains and top 10 IPs from PhishTank
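Several of the features above depend only on the URL string, so they can be illustrated without any network access. The following is a minimal sketch of five of them, not the code in feature_extraction.py: the function names, the short list of shortener hosts, and the 1/0 encoding are all assumptions made for illustration.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Hypothetical, non-exhaustive list of URL-shortening hosts (an assumption).
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly"}


def _host(url):
    """Return the lower-cased host part of the URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()


def has_ip_address(url):
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(_host(url))
        return True
    except ValueError:
        return False


def has_redirect_double_slash(url):
    """True if '//' appears anywhere after the leading 'scheme://'."""
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    return "//" in without_scheme


def extract_url_features(url):
    """Map a URL to a dict of 1/0 flags (the encoding is an assumption)."""
    return {
        "using_ip_address": int(has_ip_address(url)),
        "url_shortened": int(_host(url) in SHORTENER_HOSTS),
        "having_at_symbol": int("@" in url),
        "double_slash_redirect": int(has_redirect_double_slash(url)),
        "https_token_in_domain": int("https" in _host(url)),
    }


example = extract_url_features("http://www.legitimate.com//http://www.phishing.com")
```

This sketch does not handle obfuscated IP addresses (for example, hexadecimal encodings), and the remaining features need WHOIS lookups, certificate checks, or the page content.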
To get started:
- Sign up for an IBM Cloud account
- Log in to IBM Watson Studio
- Install Python 3.7
- Install the dependencies
pip install -r packages.txt
The dataset created for this application uses around 250 legitimate and 250 phishing URLs, with 20 features each as mentioned above. You can add more data and more features (feature_extraction.py) to the project to create your own dataset, as shown below.
The URLs for phishing websites were retrieved from verified_online.csv, and the URLs for legitimate websites were retrieved from top1m.csv.
- Create the dataset for the phishing websites
python create_dataset.py <file_with_phishing_url> <number_of_urls_to_use> <output_file> <target_value>
python create_dataset.py verified_online.csv 500 dataset2.csv 1
- Create the dataset for the legitimate websites
python create_dataset.py <file_with_legitimate_url> <number_of_urls_to_use> <output_file> <target_value>
python create_dataset.py top1m.csv 500 dataset2.csv 0
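The commands above take a URL file, a URL count, an output file, and a target value. The sketch below shows one way such a step could write labeled feature rows. It is not the actual create_dataset.py: `toy_features` is a hypothetical stand-in for the real feature extractor, and the sketch assumes that rows for both classes are appended to the same output.

```python
import csv
import io


def toy_features(url):
    """Hypothetical stand-in for the real extractor: two trivial flags."""
    return [int("@" in url), int(url.startswith("https://"))]


def append_labeled_rows(urls, limit, out_file, target):
    """Write one CSV row per URL: the features followed by the target label.

    Only the first `limit` URLs are used. Returns the number of rows written.
    """
    writer = csv.writer(out_file)
    written = 0
    for url in urls[:limit]:
        writer.writerow(toy_features(url) + [target])
        written += 1
    return written


# An in-memory buffer stands in for the output file in this sketch.
buffer = io.StringIO()
append_labeled_rows(["http://user@evil.example/login"], 500, buffer, 1)  # phishing
append_labeled_rows(["https://www.example.com/"], 500, buffer, 0)  # legitimate
rows = buffer.getvalue().splitlines()
```

Keeping the label as the last column makes it easy to separate features from the target when the file is later loaded into a notebook.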
Sign up for IBM's Watson Studio.
Note: By creating a project in Watson Studio, a free tier Object Storage service will be created in your IBM Cloud account. Take note of your service names, as you will need to select them in the following steps.
- On Watson Studio's Welcome Page, select New Project.
- Choose the Data Science option and click Create Project.
- Name your project, select the Cloud Object Storage service instance, and click Create.
- Drag and drop the dataset (csv) file you just created onto Watson Studio's dashboard to upload it to Cloud Object Storage.
- Create a New Notebook.
- Import the notebook found in this repository.
- Give the notebook a name, select a Python 3.5 runtime environment, then click Create.
To make the dataset available in the notebook, we need to refer to where it lives. Watson Studio automatically generates a connection to your Cloud Object Storage instance and gives access to your data.
- Go to the Files section to the right of the notebook and click Insert to code for the data you have uploaded. Choose Insert pandas DataFrame.
The steps in the notebook help you understand, analyze, and visualize the dataset. You will then go through preprocessing and feature engineering to make the data suitable for modeling. Finally, you will build several machine learning models and test them to compare their performance.
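To make the train-and-test idea concrete, here is a minimal sketch of logistic regression trained with batch gradient descent and scored on a held-out split. It is written in plain Python to stay self-contained and is not the notebook's code, which may well use a machine learning library instead. The data is synthetic (the label simply copies the first feature), and the learning rate and epoch count are arbitrary assumptions.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Predicted probability that the row x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(rows, labels, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = score(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def accuracy(weights, bias, rows, labels):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = sum((score(weights, bias, x) >= 0.5) == bool(y) for x, y in zip(rows, labels))
    return hits / len(rows)


# Synthetic toy data, NOT the project's dataset: three binary features,
# with the label equal to the first feature.
random.seed(0)
rows = [[random.randint(0, 1) for _ in range(3)] for _ in range(200)]
labels = [x[0] for x in rows]

# Hold out the last 25% of the rows for testing.
split = int(len(rows) * 0.75)
weights, bias = train_logistic_regression(rows[:split], labels[:split])
test_accuracy = accuracy(weights, bias, rows[split:], labels[split:])
```

Comparing models then amounts to computing the same held-out metric for each candidate and keeping the best one.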
- Navigate to your project and add a new machine learning model.
- Give it a name, choose a machine learning service, select model builder as the model type (logistic regression is one of the best models for our dataset and is available in the builder), select the default runtime, and select Manual.
- Add the reduced dataset to the model.
- Add a deployment
- Get the deployment url and the machine learning model instance tokens.
- Replace the deployment url and tokens in the check_url.py file
python check_url.py <url>
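For orientation, the sketch below shows how a script could prepare a scoring request for a deployed model. It is not the contents of check_url.py. The `fields`/`values` payload shape and the bearer-token header are assumptions that depend on the Watson Machine Learning API version, so check the details shown for your own deployment. The URL, token, and field names are placeholders.

```python
import json
import urllib.request


def build_scoring_request(deployment_url, token, fields, values):
    """Build (but do not send) a POST request carrying one feature row.

    The payload layout is an assumption; adapt it to your deployment.
    """
    payload = {"fields": fields, "values": [values]}
    return urllib.request.Request(
        deployment_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


# Placeholder values: replace them with your own deployment URL, token,
# and the feature names and values extracted for the URL being checked.
request = build_scoring_request(
    "https://example.com/your-deployment-scoring-url",
    "YOUR_TOKEN",
    ["using_ip_address", "having_at_symbol"],
    [0, 1],
)

# Sending is left commented out because it needs real credentials:
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```

Building the request separately from sending it makes it possible to inspect the payload before any credentials are used.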
https://www.researchgate.net/publication/277476345_Phishing_Websites_Features