webCrawler: Web Crawler Application

This Python application is a web crawler that systematically browses websites, processing each URL it encounters and storing relevant data. It focuses in particular on extracting information about politicians from the pages it visits.

Features

The web crawler provides the following key features:

  • Website Crawling: Visits each URL of a website and processes the content to extract useful data (a sketch of a single crawl step follows this list).
  • Customizable User Agents: Randomly selects a user agent from a list for each request, reducing the chance of being blocked by sites that throttle repeated requests from the same user agent.
  • URL Filtering: Filters out suspicious URLs and URLs disallowed by the website's robots.txt (see the robots.txt sketch near the end of this README), and only visits URLs it has not seen before.
  • Data Extraction: Extracts page content, including the page title and the text inside paragraph tags.
  • Politician Information Extraction: Searches the title of each visited page for politician names and classifies the page as potentially useful when one is found.
  • Data Storage: Stores data in a SQLite database and in Firestore. Every visited URL is saved, but additional data is stored only for potentially useful pages (those whose titles contain a politician's name); see the storage sketch after the Tech Stack section.
  • Error Handling: Handles the exceptions that can arise during crawling, such as timeouts and WebDriver exceptions.
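
As an illustration of how these features fit together, here is a minimal sketch of a single crawl step: pick a random user agent, fetch the page with headless Chrome via Selenium, parse it with BeautifulSoup, and check the title for politician names. The USER_AGENTS and POLITICIAN_NAMES lists and the function names are hypothetical; the actual logic lives in main.py and may differ.

```python
import random

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.chrome.options import Options

USER_AGENTS = ["Mozilla/5.0 ..."]    # hypothetical: list of user-agent strings
POLITICIAN_NAMES = {"Jane Doe"}      # hypothetical: names to look for in titles


def fetch_page(url):
    """Fetch a page with headless Chrome, using a randomly chosen user agent."""
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(30)
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def extract(html):
    """Pull out the title and paragraph text; flag pages whose title names a politician."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    useful = any(name.lower() in title.lower() for name in POLITICIAN_NAMES)
    return {"title": title, "paragraphs": paragraphs, "useful": useful}


def crawl(url):
    """One crawl step; skip the URL on the errors the README mentions."""
    try:
        return extract(fetch_page(url))
    except (TimeoutException, WebDriverException) as err:
        print(f"Skipping {url}: {err}")
        return None
```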

Tech Stack

  • Python: The application is written entirely in Python.
  • SQLite: A relational database used for storing URL data.
  • Firestore: Google's NoSQL cloud database used for storing more extensive data for potentially useful pages.
  • Selenium: A browser-automation tool, used here to fetch and render web pages.
  • BeautifulSoup: A library for extracting data from HTML and XML files, used to parse the HTML content of fetched pages.
  • urllib: A Python standard-library module for working with URLs.
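
To make the dual-store setup concrete, here is a minimal sketch of how SQLite and Firestore might be combined, following the split described under Data Storage above. The database filename, table schema, and collection name are illustrative, not necessarily those used by this repository.

```python
import sqlite3

from google.cloud import firestore  # pip install google-cloud-firestore

conn = sqlite3.connect("crawler.db")  # hypothetical filename
conn.execute(
    "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, useful INTEGER)"
)
fs = firestore.Client()  # reads credentials from GOOGLE_APPLICATION_CREDENTIALS


def store(url, page):
    # Every visited URL is recorded in SQLite.
    conn.execute(
        "INSERT OR IGNORE INTO urls (url, useful) VALUES (?, ?)",
        (url, int(page["useful"])),
    )
    conn.commit()
    # Only potentially useful pages get a full Firestore document.
    if page["useful"]:
        fs.collection("pages").add(  # "pages" is a hypothetical collection name
            {
                "url": url,
                "title": page["title"],
                "text": "\n".join(page["paragraphs"]),
            }
        )
```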

Prerequisites

  • Python 3
  • Selenium WebDriver
  • SQLite (bundled with Python as the sqlite3 module)
  • A Google Cloud project with Firestore enabled, plus credentials for the Firestore client

Installation

  1. Clone the repository: git clone https://github.com/pacoreyes/webCrawler.git
  2. Navigate to the project directory: cd webCrawler
  3. Install the required Python modules: pip install -r requirements.txt

Usage

  1. To run the crawler, execute python3 main.py
  2. The crawler resumes from the position saved in the POSITION_FILE. If this is the first time running the crawler (no position file exists yet), it starts from the beginning; see the sketch below.
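
One common way to implement such a resume file is to persist a single index and rewrite it after each processed URL. The filename and single-integer format below are assumptions; the actual format is defined by POSITION_FILE in the code.

```python
import os

POSITION_FILE = "position.txt"  # hypothetical name; the real one is set in the code


def load_position():
    """Return the saved crawl position, or 0 on the first run."""
    if not os.path.exists(POSITION_FILE):
        return 0
    with open(POSITION_FILE) as f:
        return int(f.read().strip() or 0)


def save_position(index):
    """Persist the current position so the crawler can resume after a restart."""
    with open(POSITION_FILE, "w") as f:
        f.write(str(index))
```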

Note: Make sure to respect the policies outlined in the robots.txt of the websites you crawl. Some sites do not allow certain pages to be crawled, and others limit the frequency of requests. A minimal robots.txt check is sketched below.
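
For reference, here is a minimal robots.txt check using Python's standard urllib.robotparser module; whether the repository implements the check exactly this way is an assumption.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url, user_agent="*"):
    """Check whether the robots.txt of the URL's host permits fetching it."""
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()  # fetches and parses robots.txt
    except OSError:
        return True  # no readable robots.txt: treat as allowed
    return parser.can_fetch(user_agent, url)


# Example:
# allowed_by_robots("https://example.com/some/page")  -> True or False
```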

About

This crawler was built as part of my PhD thesis to collect a large corpus of political discourse by politicians in the USA.
