This project is a distributed, depth-first web scraper built with Scrapy, designed to extract information from specified websites. It uses `scrapy-playwright` to handle JavaScript-rendered content, so data can be extracted from modern, dynamically rendered pages. The crawler supports distributed operation, allowing multiple instances to work on the same crawl concurrently.
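With `scrapy-playwright`, browser rendering is typically requested per request via the `playwright` meta key (the required download-handler settings are presumably configured in this project's `settings.py`). Below is a minimal sketch of that pattern; the spider name, URL, and parsed fields are placeholders, not this project's actual `web_spider`:

```python
import scrapy


class RenderedPageSpider(scrapy.Spider):
    """Illustrative only; not this project's web_spider."""

    name = "rendered_example"  # placeholder name

    def start_requests(self):
        # The "playwright" meta flag asks scrapy-playwright to fetch the page
        # with a real browser, so JavaScript-rendered content is included.
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        # The response body contains the rendered DOM.
        yield {"url": response.url, "title": response.css("title::text").get()}
```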
- Clone the repository:

  ```bash
  git clone [repository_url]
  cd search_engine_crawler
  ```

- Create and activate a Python virtual environment (optional, for local development/testing):

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install dependencies (optional, for local development/testing):

  ```bash
  pip install -r requirements.txt
  ```
The `web_spider` spider uses the default `start_urls` and `allowed_domains` defined in `search_engine_crawler/constants.py`. `sitemap_urls` can still be provided as a command-line argument.
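The exact contents of `constants.py` are not reproduced here; as a rough sketch, it might look like the following (the URLs, domains, and variable names are placeholders and may differ from the real file):

```python
# search_engine_crawler/constants.py -- illustrative sketch, not the real file.

# Default seed URLs used by web_spider when none are supplied at runtime.
START_URLS = [
    "https://example.com",
    "https://example.org",
]

# Domains the spider is allowed to follow links into.
ALLOWED_DOMAINS = [
    "example.com",
    "example.org",
]
```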
This project is configured to run using Docker Compose, which simplifies setup and deployment, especially with Redis for distributed crawling.
- Build and run the Docker containers:

  ```bash
  docker-compose up --build
  ```

  This command will:

  - Build the Docker images for the `init_urls` and `crawler` services.
  - Start a Redis server.
  - Run `init_urls` to push the initial URLs from `start_urls.txt` to Redis.
  - Start the `crawler` service, which will begin scraping URLs from Redis (a minimal sketch of this Redis hand-off appears after this list).
- Spawning multiple crawlers: to run multiple instances of the crawler for increased concurrency, use the `--scale` flag:

  ```bash
  docker-compose up --build --scale crawler=3
  ```

  Replace `3` with the desired number of crawler instances.
- Stopping the services:

  ```bash
  docker-compose down
  ```
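The hand-off between the `init_urls` and `crawler` services described above can be pictured roughly as follows. This is a sketch only: the Redis key name, host, and list semantics are assumptions for illustration, not the project's actual configuration.

```python
# Illustrative sketch of the Redis hand-off; the key name and host are assumptions.
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

# What the init_urls step conceptually does: push each seed URL onto a Redis list.
with open("start_urls.txt") as f:
    for line in f:
        url = line.strip()
        if url:
            r.rpush("crawler:start_urls", url)

# What each crawler instance conceptually does: pop URLs until the list is empty.
# Sharing one list is what lets several crawler containers split the work.
while True:
    url = r.lpop("crawler:start_urls")
    if url is None:
        break
    print(f"would schedule a crawl for {url}")
```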
To run the spider locally (without Docker Compose), ensure you have activated your virtual environment and installed dependencies.
To run the spider using the default URLs:

```bash
scrapy crawl web_spider
```

To provide `sitemap_urls` (if applicable):
```bash
scrapy crawl web_spider -a sitemap_urls="http://example.com/sitemap.xml"
```

Project structure:

- `scrapy.cfg`: Scrapy project configuration.
- `search_engine_crawler/`: The main Python package for the Scrapy project.
  - `__init__.py`: Makes `search_engine_crawler` a Python package.
  - `settings.py`: Scrapy project settings, including middleware and pipeline configurations.
  - `items.py`: Defines the data structure for scraped items.
  - `pipelines.py`: Processes scraped items (e.g., saving to a database; see the sketch below).
  - `middlewares.py`: Custom downloader and spider middlewares.
  - `constants.py`: Defines the default `start_urls` and `allowed_domains`.
  - `spiders/`: Directory containing the spider definitions.
    - `__init__.py`: Makes `spiders` a Python package.
    - `web_spider.py`: The main spider for crawling web pages, using `scrapy-playwright` for dynamic content.
- `requirements.txt`: Lists all Python dependencies.
- `scraped_data.db`: SQLite database for storing scraped data (if configured in the pipelines).
- `CHANGELOG.md`: Documents all notable changes to the project.
- `Dockerfile`: Defines the Docker image for the crawler and URL initializer.
- `docker-compose.yml`: Orchestrates the multi-container Docker application (Redis, URL initializer, crawler).
- `init_redis_urls.sh`: Script that pushes the initial URLs from `start_urls.txt` to Redis.
- `start_urls.txt`: Contains the list of initial URLs for the crawler.
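If database storage is enabled, `pipelines.py` is where scraped items would be written to `scraped_data.db`. A minimal sketch of such an SQLite pipeline, assuming a `pages` table and `url`/`title` item fields that are not specified in this README:

```python
# Illustrative SQLite item pipeline; table and field names are assumptions.
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("scraped_data.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Store (or overwrite) one scraped record per URL.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (item.get("url"), item.get("title")),
        )
        return item
```

A pipeline like this would be activated through the `ITEM_PIPELINES` setting in `settings.py`.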