A Simple Keyword Research Tool

Background

To design a good SEO strategy, access to large amounts of data is essential. However, quality data is becoming increasingly hard to obtain.

Currently, the available SEO tools are either expensive or offer low-quality data.

About this repository

This repository was created as a final project for a Data Analyst Bootcamp at Ironhack.

Objectives

1. Proof of concept:

Design a minimum working application for SEO keyword research within a limited time (around a week) and with zero budget.

2. Easy to deploy application:

Design an application that anyone can deploy online with only a few commands.

3. Prove that news can be a good predictor of search behavior abnormalities:

Users' search behavior is usually predictable and easy to forecast. However, sudden spikes in searches sometimes degrade the quality of the prediction models.

This project aims to show that combining data extracted from the news with the users' historical behavior can improve the quality of the prediction models.

You can see a working version of this repository at this URL.

Installation and requirements

To deploy this application, you need to meet a few minimum requirements.

Requirements

A server with MySQL (or similar) installed and about 100 MB available.
A Guardian API access key. It can be easily obtained here for free.
A New York Times API key. It can be easily obtained here for free.

Installation

Open the terminal and clone the repository

git clone https://github.com/sborto86/final-project

Inside the cloned repository, create a .streamlit directory and, inside it, a file called secrets.toml

mkdir .streamlit
touch ./.streamlit/secrets.toml

Open the secrets.toml file and add the necessary keys for the application to work in the following format

THE_GUARDIAN = "<The Guardian API key>"
NYT = "<The New York Times API key>"
SQLHOST = "<SQL server host address>"
SQLUSER = "<SQL database user name>"
SQLPW = "<SQL database password>"
SQLPORT = "<SQL port>"
SQLDB = "<name of the SQL database>"

Create the standards used to estimate Google search volume (this can take around 15 minutes). This step should also create the SQL database automatically. In the console, run the following Python script:

python ./db/standards_db.py

Create the database and the New York Times archive (this can take around 10 minutes). In the console, run the following Python script:

python nyarchive.py

Now that everything is ready, the application can be tested locally. In the console, run Streamlit:

streamlit run main.py

Finally, to deploy the application online, follow the instructions here.

Limitations

Most of the limitations of this application come from the difficulty of getting good, reliable data:

News is extracted from only two sources: The Guardian and The New York Times.
Google search data is global; regional data is not available.
The application is optimized for English keywords. Other languages can be used, but the predictions will be less accurate.
Low-volume keywords will not generate results. The estimated lower limit of detection is about 1,000 to 2,000 searches per day.
Only short keywords are accepted, with a maximum length of 3 words.
Data acquisition might be slow: 3 to 5 minutes per new keyword.
The historical data is limited to two years.
Google might block the requests. Data acquisition requires multiple calls to the Google Trends website, which might trigger its firewall.
Search volume values are estimates.

How it works?

A simple schema of the application structure

Data acquisition

Sources of data used in this application:

Google Trends: Web scraping using the pytrends library.

The Guardian: Live API calls.

The New York Times: News archives of the last two years are stored in the database, and any missing data is updated (if necessary) on every call.

1. Data Processing

1. Getting Google Trends data

First, let's see what Google Trends offers us:

The only information we obtain is a weekly average relative to the maximum.
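
To make "relative to the maximum" concrete, here is a minimal sketch of that kind of scaling: each value is expressed as a percentage of the largest value in the window, so the peak week becomes 100. The function name and the sample numbers are made up for illustration, and the exact computation Google applies is not public.

```python
def to_relative(weekly_searches: list[float]) -> list[int]:
    """Scale a series so that its maximum becomes 100."""
    peak = max(weekly_searches)
    if peak == 0:
        # No interest at all in the window: everything stays at 0
        return [0 for _ in weekly_searches]
    return [round(100 * value / peak) for value in weekly_searches]


# Invented weekly search counts
print(to_relative([500, 1000, 250, 750]))  # [50, 100, 25, 75]
```

Two keywords with very different absolute volumes can therefore produce the same relative curve, which is why an external reference is needed to recover absolute numbers.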

2. From relative data to absolute data

To convert from relative to absolute data, the information provided by Semrush (one of the most renowned SEO tools) was used.

You can read the full article by clicking here

The information we get from this article is the average monthly search volume of the term "youtube" from January through August 2022 (see below):

3. Creating standards

With this information, we then build a list of keywords ranging from high search volume down to the limit of detection of Google Trends:

Finally, by performing successive pairwise comparisons, we obtain an estimate of the absolute search volume of each keyword:
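
The idea of successive pair comparisons can be sketched as follows, assuming each comparison returns the average relative interest of two keywords on the same Google Trends scale. The keyword names, relative values, and anchor volume are invented for illustration and are not taken from the project; chain_standards is a hypothetical helper.

```python
def chain_standards(
    anchor_volume: float,
    comparisons: list[tuple[str, float, str, float]],
) -> dict[str, float]:
    """Propagate an absolute volume down a chain of pairwise comparisons.

    Each comparison is (known_keyword, known_relative, new_keyword, new_relative),
    where both relative values come from the same comparison.
    """
    first_keyword = comparisons[0][0]
    volumes = {first_keyword: anchor_volume}
    for known, known_rel, new, new_rel in comparisons:
        # Both values share one scale, so their ratio carries over to absolute volume
        volumes[new] = volumes[known] * new_rel / known_rel
    return volumes


# Hypothetical numbers, for illustration only
standards = chain_standards(
    1_000_000,
    [("keyword_a", 80, "keyword_b", 20), ("keyword_b", 50, "keyword_c", 10)],
)
print(standards)
```

Comparing only neighbouring keywords keeps both values of each pair well above zero, which a direct comparison between a very popular and a very rare keyword would not.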

Once these standards are created, we can proceed to extrapolate the absolute volume data of any keyword.

4. Getting the historical data

To calculate the absolute search volume of a new keyword, the following process is performed:

Find the most similar standard: By comparing the keyword with each standard in Google Trends, we find the standard whose search volume is closest to the keyword's.
Convert the relative volume to absolute volume: We use the standard to extrapolate the keyword search volume.
Retrieve the historical data: Once we have the search volume for the window of the standards (January to August 2022), we get the last two years of historical data for the keyword from Google Trends (relative data).
Obtain the absolute historical data: Finally, we use the absolute data obtained in the second step to calculate the historical data.
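
The last step above can be sketched as a simple rescaling, assuming the absolute volume estimated for the reference window is an average over that window. The function name, dates, and volumes are hypothetical.

```python
from statistics import mean


def to_absolute(
    relative_history: dict[str, float],
    window: list[str],
    window_volume: float,
) -> dict[str, float]:
    """Rescale a relative series so its average over `window` equals `window_volume`."""
    window_mean = mean(relative_history[week] for week in window)
    if window_mean == 0:
        raise ValueError("Keyword has no measurable interest in the reference window")
    factor = window_volume / window_mean
    return {week: value * factor for week, value in relative_history.items()}


# Invented relative values (0 to 100) keyed by week
history = {"2021-12-26": 40, "2022-01-02": 50, "2022-01-09": 100}
absolute = to_absolute(history, window=["2022-01-02", "2022-01-09"], window_volume=1500)
print(absolute)
```

Because the relative series is linear in the true volume, a single scaling factor derived from the reference window is enough to convert the whole two-year history.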

For example, if we search for "pizza", we get the following result:

Now that we have the Google historical data, we can proceed to scrape the newspapers (The Guardian and The New York Times).

The process is simple: for every day, we count the news articles in which the keyword appears in the headline or the summary.
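
A minimal sketch of this daily count is shown below. It assumes each article has already been reduced to a dictionary with date, headline, and summary fields; those field names and the sample articles are hypothetical and do not reflect the actual API responses.

```python
from collections import Counter


def count_mentions(articles: list[dict], keyword: str) -> Counter:
    """Count, per publication date, the articles mentioning `keyword` in headline or summary."""
    needle = keyword.lower()
    counts = Counter()
    for article in articles:
        text = f"{article.get('headline', '')} {article.get('summary', '')}".lower()
        if needle in text:
            counts[article["date"]] += 1
    return counts


# Invented articles, for illustration only
articles = [
    {"date": "2022-03-01", "headline": "Pizza prices rise", "summary": "Costs are up."},
    {"date": "2022-03-01", "headline": "Dining out", "summary": "Where to find pizza."},
    {"date": "2022-03-02", "headline": "Weather update", "summary": "Rain expected."},
]
mentions = count_mentions(articles, "pizza")
print(mentions)
```

This is a plain case-insensitive substring match, so it also counts words that merely contain the keyword; a stricter version could match on word boundaries.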

Using the same example as before we obtain something like this:

5. Getting everything together

After putting all the data together, we are ready to proceed to the next step, forecasting.

But first, let's look at another example. If we search for Covid-19:

As we can see here, there is a good correlation between the peaks in news published and the peaks in Google searches.

Looks like we are on the right track...

Data Storage

All the extracted and processed data is stored in a simple SQL database. The schema is shown below:

Machine Learning

To fit our data to a model and perform forecasting, the Facebook Prophet algorithm is used. There are several reasons for choosing this algorithm; the main ones are listed below:

It is fast: We need a model that does not add much to the processing time, as the web scraping process is already slow.
It is less sensitive to outliers than other models: Compared to other predictive models, it is less affected by abnormal peaks in search volume (outliers).
It is good at predicting yearly and weekly seasonality: The algorithm was designed specifically to model seasonality.
It can ignore periods of data: This is one of the most important reasons for choosing this model, as we want to exclude the periods where peaks are detected in the news.
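
The "ignore periods of data" point can be sketched as below. Prophet expects rows with a ds (date) column and a y (value) column, and its documentation recommends handling outliers by setting their y values to missing, since the model can be fitted with gaps. The helper name, the threshold, and the numbers are assumptions made for illustration; the project's actual peak detection may differ.

```python
def mask_news_peaks(
    history: dict[str, float],
    news_counts: dict[str, int],
    threshold: int,
) -> list[dict]:
    """Return Prophet-style rows with `y` set to None on days with a news peak."""
    rows = []
    for day, volume in history.items():
        is_peak = news_counts.get(day, 0) >= threshold
        # Keep the date but drop the value, so the model skips it when fitting
        rows.append({"ds": day, "y": None if is_peak else volume})
    return rows


# Invented search volumes and news counts
history = {"2022-01-01": 1000, "2022-01-02": 9000, "2022-01-03": 7000, "2022-01-04": 1200}
news = {"2022-01-02": 12, "2022-01-03": 9, "2022-01-04": 1}
rows = mask_news_peaks(history, news, threshold=5)
print(rows)
```

The resulting rows could then be loaded into a pandas DataFrame and passed to Prophet's fit method, leaving the masked days out of the seasonal estimate.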

Let's look at one example. If we search for Ukraine, we obtain the following results:

Future Improvements

Improve speed: Implement asynchronous calls by modifying the pytrends library to work with threads and use a random IP from a proxy list (to avoid being blocked by Google).
Improve integrity: The code was written in a limited amount of time and might need further debugging.
Include other sources of data to get better search estimations and trends from the news.
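
As a rough sketch of the speed improvement, the snippet below runs one fetch per keyword in a thread pool and picks a random proxy for each call. It does not touch pytrends: fetch_all, fake_fetch, and the proxy address are hypothetical stand-ins, and a real version would need to pass the chosen proxy to whatever performs the request.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_all(
    keywords: list[str],
    proxies: list[str],
    fetch: Callable[[str, str], object],
    max_workers: int = 4,
) -> dict[str, object]:
    """Run `fetch(keyword, proxy)` for every keyword in a thread pool."""

    def task(keyword: str) -> tuple[str, object]:
        # A different proxy may be chosen for every call
        return keyword, fetch(keyword, random.choice(proxies))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, keywords))


def fake_fetch(keyword: str, proxy: str) -> str:
    """Stand-in for a real network call, used only to make the sketch runnable."""
    return f"{keyword} via {proxy}"


results = fetch_all(["pizza", "pasta"], ["10.0.0.1:8080"], fake_fetch)
print(results)
```

Threads suit this workload because the time is spent waiting on the network rather than on computation.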
