This project was created to assist with web scraping. It provides a basic structure for organizing threaded web scraping.
Many projects need to scrape several websites concurrently, because scraping large amounts of data from them one at a time takes an inordinate amount of time. We created this project to help modularize that process.
**Note:** This tutorial assumes a working Windows 10 environment.
Step 1: Create a virtualenv environment. Run the virtualenv package. For installation instructions, visit https://virtualenv.pypa.io/en/latest/installation.html. We recommend installing it through pip.
(base) PS C:\documents\project> virtualenv venv
Step 2: Access your virtual environment named venv.
(base) PS C:\documents\project> ./venv/Scripts/activate
You should see:
(venv) PS C:\documents\project>
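If the prompt prefix is hard to spot, the same check can be made from Python. The snippet below is a minimal sketch and not part of the project; the function name `in_virtual_environment` is made up for illustration.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the interpreter it was created from.
    # Older virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
```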
Step 3: Install required packages.
beautifulsoup4 - https://pypi.org/project/beautifulsoup4/
requests - https://docs.python-requests.org/en/latest/
lxml - https://lxml.de/
(venv) PS C:\documents\project> pip install beautifulsoup4
(venv) PS C:\documents\project> pip install requests
(venv) PS C:\documents\project> pip install lxml
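To confirm that the three installs succeeded, a short check like the following may help. It is a sketch rather than part of the project: the helper name `missing_packages` and the `REQUIRED_MODULES` mapping are made up for illustration. It only looks the modules up and does not import them.

```python
from importlib.util import find_spec

# Import names differ from the pip package names in one case:
# the beautifulsoup4 package is imported as "bs4".
REQUIRED_MODULES = {
    "beautifulsoup4": "bs4",
    "requests": "requests",
    "lxml": "lxml",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose modules cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Run it inside the activated venv; any name it prints can be installed with pip as shown above.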
<Insert Gwen's Parser Tutorial. Please include the modularization (how to create parser class, what method to override, etc).> (First draft)
Follow the template in scrapeTemplate.py, replacing values as needed. You need one nested for loop for each level of pages the scraper must iterate through. The template assumes a first-level index list, followed by a sub-index of Aa, Ab, Ac, and so on. To adjust this, remove the sub-letter loop and rename the variables in the deeper for loop accordingly.
Note: This template only works if the given website has an alphabetized index.
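The nested-loop idea can be sketched as follows. This is not the content of scrapeTemplate.py: the function names and the URL layout (`<base_url>/<letter>/<letter><subletter>`) are assumptions made for illustration, and a real site will use its own scheme. No network requests are made.

```python
from string import ascii_lowercase, ascii_uppercase


def two_level_index_pages(base_url):
    """Yield one hypothetical URL per sub-index page (Aa, Ab, ..., Zz)."""
    for letter in ascii_uppercase:          # first level: A, B, C, ...
        for subletter in ascii_lowercase:   # second level: Aa, Ab, Ac, ...
            yield f"{base_url}/{letter}/{letter}{subletter}"


def one_level_index_pages(base_url):
    """Same idea with the sub-letter loop removed."""
    for letter in ascii_uppercase:
        yield f"{base_url}/{letter}"


if __name__ == "__main__":
    pages = list(two_level_index_pages("https://example.com/index"))
    print(len(pages), pages[0], pages[-1])
```

Each extra level of depth on the target site adds one more loop, which is why the one-level variant simply drops the inner loop.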
You can follow the "How to write parser" section to create your parser.
Step 2: Create a WebsiteThread class.
This program works by placing each individual website into a different thread, through inheritance of the Thread class.
Navigate to ~/ClientThreads/ClientThreads.py. It should look like the following:
from threading import Thread

# Scrapers
from scrapers.DrugsComScraper import DrugsComScraper
from scrapers.MayoclinicScraper import MayoclinicScraper

# Clients
from clients.WebsiteClient import WebsiteClient


class MayoClinicThread(Thread):
    def run(self):
        print("[START] MayoClinicClient")
        mayoClinicClient = WebsiteClient(
            name="Mayoclinic",
            base_url="https://www.mayoclinic.org/",
            ext=["", "drugs-supplements", "drug-list?letter=A"],
            verbose=False
        ).run(MayoclinicScraper)
        print("[END] MayoClinicClient")


class DrugsComThread(Thread):
    def run(self):
        print("[START] DrugsComClient")
        drugsComClient = WebsiteClient(
            name="drugs.com",
            base_url="https://www.drugs.com",
            ext=["", "drug_information.html"],
            verbose=False
        ).run(DrugsComScraper)
        print("[END] DrugsComClient")

Next, create your WebsiteThread class. An example of one is already added by default in ClientThreads.py:
class WikiThread(Thread):
    def run(self):
        # Empty for now
        pass

Populate the run() method of your WebsiteThread class by calling the WebsiteClient class with your website's specific information. Note that ext (the URL extensions) may be empty, depending on how you defined your parser's parse() function in the "How to write parser" section. Note: Ensure that ext has an empty string as its first element.
Your code will look like the following:
class WikiThread(Thread):
    def run(self):
        print("[START] WikiClient")
        wikiClient = WebsiteClient(
            name="Wikipedia",
            base_url="https://en.wikipedia.org/wiki/Medicine",
            ext=[""],  # example of empty extensions
            verbose=False
        ).run(WikiScraper)
        print("[END] WikiClient")

Notice the WikiScraper, which is the class you defined in the "How to write parser" section. Remember to import it at the top of ClientThreads.py, alongside the other scrapers. Now your parser should be almost ready to run.
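Before pointing a thread at a live website, it can be useful to check the wiring offline. The sketch below uses hypothetical stand-ins: `StubClient` and `StubScraper` are not the project's WebsiteClient or a real scraper, and how the real client uses the scraper class internally is an assumption. It also stores the result on the thread object, since the return value of run() is otherwise discarded.

```python
from threading import Thread


class StubScraper:
    """Hypothetical stand-in for a scraper class such as WikiScraper."""

    def parse(self, url):
        return f"parsed {url}"


class StubClient:
    """Hypothetical stand-in for WebsiteClient; the real class may differ."""

    def __init__(self, name, base_url, ext, verbose=False):
        self.name = name
        self.base_url = base_url
        self.ext = ext
        self.verbose = verbose

    def run(self, scraper_class):
        # Instantiate the scraper and hand it the base URL. No network access.
        return scraper_class().parse(self.base_url)


class StubThread(Thread):
    def __init__(self):
        super().__init__()
        self.result = None

    def run(self):
        # Keep the result on the instance so the caller can read it after join().
        self.result = StubClient(
            name="Example",
            base_url="https://example.com",
            ext=[""],
            verbose=False,
        ).run(StubScraper)


if __name__ == "__main__":
    thread = StubThread()
    thread.start()
    thread.join()
    print(thread.result)
```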
Step 3: Add your defined class to the THREADS list in main.py.
Before:
from ClientThreads.ClientThreads import *

THREADS = [DrugsComThread]


def main():
    # Runs every thread at the same time.
    for i in range(len(THREADS)):
        t = THREADS[i]()
        t.start()


if __name__ == "__main__":
    main()

After:
from ClientThreads.ClientThreads import *

THREADS = [DrugsComThread, WikiThread]


def main():
    # Runs every thread at the same time.
    for i in range(len(THREADS)):
        t = THREADS[i]()
        t.start()


if __name__ == "__main__":
    main()

Creating the parser and the modularization is the complicated part. To run, simply execute main.py:
(venv) PS C:\documents\project> python main.py
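Python waits for non-daemon threads before the interpreter exits, so main.py works as written. If main() needs to do something after every scraper has finished, such as printing a summary, one option is to keep the thread instances and join them. The following is a sketch of that variation, not the project's main.py; `SleepThread` is a hypothetical placeholder for the real website threads.

```python
import time
from threading import Thread


class SleepThread(Thread):
    """Hypothetical placeholder for a website thread such as DrugsComThread."""

    def run(self):
        time.sleep(0.1)


THREADS = [SleepThread, SleepThread]


def main():
    # Start every thread, keeping the instances so they can be joined.
    running = [thread_class() for thread_class in THREADS]
    for thread in running:
        thread.start()
    # Block until every thread has finished.
    for thread in running:
        thread.join()
    return running


if __name__ == "__main__":
    finished = main()
    print(f"{len(finished)} threads finished")
```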