Back to readme.
This application consists of five projects/modules, each described in a section below.
The lifetime of this application can be divided into two parts:
- The configuration of the application
- Running web scraper jobs via Quartz.NET

For details, see the following paragraphs.
The Application is the main startup project. It is responsible for configuring the application (EmailNotifier, WebScraperConfiguration) and for starting the web scraper.
It uses Spectre.Console for the interactive configuration guide and System.Text.Json.JsonSerializer for serializing and deserializing the configuration.
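A minimal sketch of how such a guide might persist its answers with System.Text.Json; the prompt texts, the `config.json` path, and the configuration shape shown here are illustrative assumptions (the real prompts and the WebScraperConfiguration shape live in the Application project):

```csharp
using System.IO;
using System.Text.Json;
using Spectre.Console;

// Ask the user for a couple of settings; the prompts are illustrative.
var smtpHost = AnsiConsole.Ask<string>("SMTP host for the EmailNotifier?");
var periodMinutes = AnsiConsole.Ask<int>("Scraping period in minutes?");

// Serialize the answers so the guide can be skipped on the next run.
var config = new { SmtpHost = smtpHost, ScrapePeriodMinutes = periodMinutes };
File.WriteAllText("config.json",
    JsonSerializer.Serialize(config, new JsonSerializerOptions { WriteIndented = true }));

// Deserialization on a later run would use JsonSerializer.Deserialize<T>.
```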
After the configuration process, the application starts the web scraper, which is implemented as an IHostedService.
For details, see Program.cs.
The Downloader is a module responsible for downloading web pages. It is implemented as an HttpClient wrapper and uses HtmlAgilityPack (HAP) to parse the incoming HTML.
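A hedged sketch of what such a wrapper might look like; the class and method names here are hypothetical, only the HttpClient and HtmlAgilityPack calls are real:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

public sealed class PageDownloader // hypothetical name for the wrapper
{
    private readonly HttpClient _client = new();

    public async Task<HtmlDocument> GetDocumentAsync(Uri address, CancellationToken token)
    {
        // Download the raw HTML and let HAP parse it into a queryable tree.
        var html = await _client.GetStringAsync(address, token);
        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document;
    }
}
```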
The WebScraper module is the core of this application; it is responsible for scraping the web pages, processing the scraped data, and storing the data in the database. Scraping is configured by the WebScraperConfiguration class, which is passed to the constructor of the web scraper's Startup class. To run the web scraper as a hosted service, use the configured Startup class to configure an instance of IHostBuilder, as sketched below.
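A plausible wiring under the assumption that Startup exposes a method for registering the scraper's services on the host builder; ConfigureHostBuilder is a hypothetical placeholder name, see Program.cs for the actual call:

```csharp
using Microsoft.Extensions.Hosting;

// webScraperConfiguration comes from the configuration step described above.
var startup = new WebScraper.Startup(webScraperConfiguration);

IHostBuilder builder = Host.CreateDefaultBuilder(args);
startup.ConfigureHostBuilder(builder); // hypothetical method name

// Build() produces the host; RunAsync() starts the registered IHostedService.
await builder.Build().RunAsync();
```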
The web scraper is based on the Quartz.NET library, a job scheduling library for .NET. Quartz runs three types of jobs (a scheduling sketch follows the list):
- ScrapeAuctionListsJob - scrapes the auction lists and stores the scraped data in the database. It uses a configured instance of the WebScraper class to scrape the data. More about this later.
- AuctionEndingUpdateJob - scheduled for every auction record created; it updates the stored data (mainly the current price) before the auction ends.
- DeleteOldRecordsJob - deletes records of auctions that ended more than a configured number of days ago from the database, using an instance of IUnitOfWork and an instance of IAuctionRecordRepository. This job runs once a day.
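As an illustration, the once-a-day DeleteOldRecordsJob could be scheduled with standard Quartz.NET calls like these; the identity strings are placeholders, and the actual scheduling lives inside the WebScraper module:

```csharp
using Quartz;
using Quartz.Impl;

IScheduler scheduler = await new StdSchedulerFactory().GetScheduler();
await scheduler.Start();

// Describe the job and a trigger that fires immediately and then daily.
IJobDetail job = JobBuilder.Create<DeleteOldRecordsJob>()
    .WithIdentity("delete-old-records")
    .Build();

ITrigger trigger = TriggerBuilder.Create()
    .WithIdentity("delete-old-records-daily")
    .StartNow()
    .WithSimpleSchedule(s => s.WithIntervalInHours(24).RepeatForever())
    .Build();

await scheduler.ScheduleJob(job, trigger);
```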
When the WebScraper.ScrapeAsync method is called, it starts the scraping process, which can be summarized as follows (see the Dataflow sketch after the list):
- Create `productPageLinkSink`, a buffer block connected to an action block that processes the links to the product pages.
- For each link in the `ScrapingJobDefinition`, call the `IProductListCrawler.Crawl(Uri productListStart, IProductListProcessor processor, ITargetBlock<IReadOnlyCollection<Uri>> productPageTarget, CancellationToken token)` method to crawl the product list page and send the product page links to the `productPageLinkSink`.
- When the `productPageLinkSink` receives a collection of product page links from the product list crawler, it creates an instance of `IUnitOfWork` (a unit of work) and a `productPageActionBlock`, which feeds parsed product pages one at a time to an `IAuctionRecordManager` contained in the unit of work. The collection of received product page links is passed to an instance of `IProductPageLinkHandler` through its method `HandleLinksAsync(IEnumerable<Uri> links, ITargetBlock<ProductPageParsingResult> targetBlock, CancellationToken cancellationToken)`. The `productPageActionBlock` is used as the `targetBlock` parameter of the `HandleLinksAsync` method. After the `productPageActionBlock` is completed, the unit of work is completed, which in the default implementation means that the `SaveChanges` method of the database context is called. The database context is derived from the abstract `DbContext` class that comes from Entity Framework Core; EF Core is used to create the database and to store the scraped data in it.
- When the `AuctionRecordManager` gets a newly scraped record, the new record is compared with the records that are already in the database. If the record is not in the database, it is added to the database and an `AuctionEndingUpdateJob` is scheduled for it. If the record is already in the database as a record of an already ended auction, a notification about the re-addition is sent, then the record is updated and the `AuctionEndingUpdateJob` is rescheduled for it.
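The pipeline above is built on TPL Dataflow. A minimal, self-contained sketch of the buffer-to-action wiring; the handler body is an illustrative stand-in for the unit-of-work logic:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var productPageLinkSink = new BufferBlock<IReadOnlyCollection<Uri>>();

var linkProcessor = new ActionBlock<IReadOnlyCollection<Uri>>(async links =>
{
    // In the real pipeline this is where the unit of work and the
    // productPageActionBlock are created and HandleLinksAsync is called.
    foreach (var link in links)
        Console.WriteLine($"processing {link}");
    await Task.CompletedTask;
});

// Propagate completion so completing the sink also completes the processor.
productPageLinkSink.LinkTo(linkProcessor,
    new DataflowLinkOptions { PropagateCompletion = true });

// Crawlers post batches of product page links into the sink.
productPageLinkSink.Post(new[] { new Uri("https://example.com/item/1") });
productPageLinkSink.Complete();
await linkProcessor.Completion;
```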
This module is responsible for crawling the auction lists. It uses an instance of IProductListProcessor to process a product list page and obtain both the address of the next product list page and the links to the product pages. The product page links are sent to a target block (ITargetBlock).
The default implementation of IProductListCrawler is ProductListCrawler.
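A short usage sketch based on the Crawl signature quoted earlier; the start address is a placeholder, and the sketch assumes Crawl is awaitable:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks.Dataflow;

// crawler and processor are instances supplied by the module, e.g. the
// default ProductListCrawler and an IProductListProcessor implementation.
var productPageTarget = new BufferBlock<IReadOnlyCollection<Uri>>();

await crawler.Crawl(
    new Uri("https://example.com/auctions?page=1"), // placeholder start page
    processor,
    productPageTarget, // receives batches of product page links
    CancellationToken.None);
```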
This module is responsible for sending emails. It uses MailKit to send the emails and MimeKit to create the email messages.
It is not used directly by the WebScraper module; instead, it is used by the EmailNotifier class, which is instantiated by the Application module and passed to the WebScraper.Startup class.
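For illustration, sending a notification with MailKit/MimeKit looks roughly like this; the host, credentials, and message text are placeholders:

```csharp
using MailKit.Net.Smtp;
using MailKit.Security;
using MimeKit;

// Build the message with MimeKit.
var message = new MimeMessage();
message.From.Add(MailboxAddress.Parse("scraper@example.com"));
message.To.Add(MailboxAddress.Parse("user@example.com"));
message.Subject = "Auction update";
message.Body = new TextPart("plain") { Text = "An ended auction was re-added." };

// Send it over SMTP with MailKit.
using var client = new SmtpClient();
await client.ConnectAsync("smtp.example.com", 587, SecureSocketOptions.StartTls);
await client.AuthenticateAsync("username", "password");
await client.SendAsync(message);
await client.DisconnectAsync(true);
```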
This module contains integration/unit tests for the WebScraper and Application modules.
The tests require Docker to be running on the machine because they use the Testcontainers.MsSql NuGet package to run a SQL Server instance in a Docker container. See WebScraperTests.
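The container lifecycle in such a test follows the package's builder API; a minimal sketch:

```csharp
using Testcontainers.MsSql;

// Start a disposable SQL Server instance in Docker for the test run.
var container = new MsSqlBuilder().Build();
await container.StartAsync();

// Connection string for the containerized SQL Server.
var connectionString = container.GetConnectionString();

// ... create the EF Core context against connectionString and run tests ...

await container.DisposeAsync();
```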
The email notification functionality is demonstrated and tested in EmailNotificationTests.
Back to readme.