Back to readme.
This application consists of five projects/modules, each described in a section below.
The lifetime of this application can be divided into two parts:
- The configuration of the application
- Running web scraper jobs via Quartz.NET

For details, see the following paragraphs.
The Application is the main startup project. It is responsible for configuring the application (EmailNotifier, WebScraperConfiguration) and for starting the web scraper.
It uses Spectre.Console for the interactive configuration guide and System.Text.Json.JsonSerializer for serializing and deserializing the configuration.
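A minimal sketch of how such a guide might persist its answers with System.Text.Json; the prompt texts, the `config.json` path, and the configuration shape shown here are illustrative assumptions (the real prompts and the WebScraperConfiguration shape live in the Application project):

```csharp
using System.IO;
using System.Text.Json;
using Spectre.Console;

// Ask the user for a couple of settings; the prompts are illustrative.
var smtpHost = AnsiConsole.Ask<string>("SMTP host for the EmailNotifier?");
var periodMinutes = AnsiConsole.Ask<int>("Scraping period in minutes?");

// Serialize the answers so the guide can be skipped on the next run.
var config = new { SmtpHost = smtpHost, ScrapePeriodMinutes = periodMinutes };
File.WriteAllText("config.json",
    JsonSerializer.Serialize(config, new JsonSerializerOptions { WriteIndented = true }));

// Deserialization on a later run would use JsonSerializer.Deserialize<T>.
```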
After the configuration process, the application starts the web scraper, which is implemented as an IHostedService.
For details, see Program.cs.
The Downloader is a module responsible for downloading web pages. It is implemented as an HttpClient wrapper and uses HtmlAgilityPack (HAP) to parse the incoming HTML.
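A hedged sketch of what such a wrapper might look like; the class and method names here are hypothetical, only the HttpClient and HtmlAgilityPack calls are real:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

public sealed class PageDownloader // hypothetical name for the wrapper
{
    private readonly HttpClient _client = new();

    public async Task<HtmlDocument> GetDocumentAsync(Uri address, CancellationToken token)
    {
        // Download the raw HTML and let HAP parse it into a queryable tree.
        var html = await _client.GetStringAsync(address, token);
        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document;
    }
}
```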
The WebScraper module is the core of this application; it is responsible for scraping the web pages, processing the scraped data, and storing the data in the database. Scraping is configured by the WebScraperConfiguration class, which is passed to the constructor of the web scraper's Startup class. To run the web scraper as a hosted service, use the configured Startup class to configure an instance of IHostBuilder, as sketched below.
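A plausible wiring under the assumption that Startup exposes a method for registering the scraper's services on the host builder; ConfigureHostBuilder is a hypothetical placeholder name, see Program.cs for the actual call:

```csharp
using Microsoft.Extensions.Hosting;

// webScraperConfiguration comes from the configuration step described above.
var startup = new WebScraper.Startup(webScraperConfiguration);

IHostBuilder builder = Host.CreateDefaultBuilder(args);
startup.ConfigureHostBuilder(builder); // hypothetical method name

// Build() produces the host; RunAsync() starts the registered IHostedService.
await builder.Build().RunAsync();
```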
The web scraper is based on the Quartz.NET library, a job scheduling library for .NET. Quartz runs three types of jobs (a scheduling sketch follows the list):
- ScrapeAuctionListsJob - scrapes the auction lists and stores the scraped data in the database. It uses a configured instance of the WebScraper class to scrape the data. More about this later.
- AuctionEndingUpdateJob - scheduled for every auction record created; it updates the stored data (mainly the current price) before the auction ends.
- DeleteOldRecordsJob - deletes records of auctions that ended more than a configured number of days ago from the database, using an instance of IUnitOfWork and an instance of IAuctionRecordRepository. This job runs once a day.
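As an illustration, the once-a-day DeleteOldRecordsJob could be scheduled with standard Quartz.NET calls like these; the identity strings are placeholders, and the actual scheduling lives inside the WebScraper module:

```csharp
using Quartz;
using Quartz.Impl;

IScheduler scheduler = await new StdSchedulerFactory().GetScheduler();
await scheduler.Start();

// Describe the job and a trigger that fires immediately and then daily.
IJobDetail job = JobBuilder.Create<DeleteOldRecordsJob>()
    .WithIdentity("delete-old-records")
    .Build();

ITrigger trigger = TriggerBuilder.Create()
    .WithIdentity("delete-old-records-daily")
    .StartNow()
    .WithSimpleSchedule(s => s.WithIntervalInHours(24).RepeatForever())
    .Build();

await scheduler.ScheduleJob(job, trigger);
```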
When the WebScraper.ScrapeAsync method is called, it starts the scraping process, which can be summarized as follows (see the Dataflow sketch after the list):
- Create `productPageLinkSink`, a buffer block connected to an action block that processes the links to the product pages.
- For each link in the `ScrapingJobDefinition`, call the `IProductListCrawler.Crawl(Uri productListStart, IProductListProcessor processor, ITargetBlock<IReadOnlyCollection<Uri>> productPageTarget, CancellationToken token)` method to crawl the product list page and send the product page links to the `productPageLinkSink`.
- When the `productPageLinkSink` receives a collection of product page links from the product list crawler, it creates an instance of `IUnitOfWork` (a unit of work) and a `productPageActionBlock`, which feeds parsed product pages one at a time to an `IAuctionRecordManager` contained in the unit of work. The collection of received product page links is passed to an instance of `IProductPageLinkHandler` through its method `HandleLinksAsync(IEnumerable<Uri> links, ITargetBlock<ProductPageParsingResult> targetBlock, CancellationToken cancellationToken)`. The `productPageActionBlock` is used as the `targetBlock` parameter of the `HandleLinksAsync` method. After the `productPageActionBlock` is completed, the unit of work is completed, which in the default implementation means that the `SaveChanges` method of the database context is called. The database context is derived from the abstract `DbContext` class that comes from Entity Framework Core; EF Core is used to create the database and to store the scraped data in it.
- When the `AuctionRecordManager` gets a newly scraped record, the new record is compared with the records that are already in the database. If the record is not in the database, it is added to the database and an `AuctionEndingUpdateJob` is scheduled for it. If the record is already in the database as a record of an already ended auction, a notification about the re-addition is sent, then the record is updated and the `AuctionEndingUpdateJob` is rescheduled for it.
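The pipeline above is built on TPL Dataflow. A minimal, self-contained sketch of the buffer-to-action wiring; the handler body is an illustrative stand-in for the unit-of-work logic:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var productPageLinkSink = new BufferBlock<IReadOnlyCollection<Uri>>();

var linkProcessor = new ActionBlock<IReadOnlyCollection<Uri>>(async links =>
{
    // In the real pipeline this is where the unit of work and the
    // productPageActionBlock are created and HandleLinksAsync is called.
    foreach (var link in links)
        Console.WriteLine($"processing {link}");
    await Task.CompletedTask;
});

// Propagate completion so completing the sink also completes the processor.
productPageLinkSink.LinkTo(linkProcessor,
    new DataflowLinkOptions { PropagateCompletion = true });

// Crawlers post batches of product page links into the sink.
productPageLinkSink.Post(new[] { new Uri("https://example.com/item/1") });
productPageLinkSink.Complete();
await linkProcessor.Completion;
```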
This module is responsible for crawling the auction lists. It uses an instance of IProductListProcessor to process a product list page and obtain both the address of the next product list page and the links to the product pages. The product page links are sent to a target block (ITargetBlock).
The default implementation of IProductListCrawler is ProductListCrawler.
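A short usage sketch based on the Crawl signature quoted earlier; the start address is a placeholder, and the sketch assumes Crawl is awaitable:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks.Dataflow;

// crawler and processor are instances supplied by the module, e.g. the
// default ProductListCrawler and an IProductListProcessor implementation.
var productPageTarget = new BufferBlock<IReadOnlyCollection<Uri>>();

await crawler.Crawl(
    new Uri("https://example.com/auctions?page=1"), // placeholder start page
    processor,
    productPageTarget, // receives batches of product page links
    CancellationToken.None);
```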
This module is responsible for sending emails. It uses MailKit to send the emails and MimeKit to create the email messages.
It is not used directly by the WebScraper module; instead, it is used by the EmailNotifier class, which is instantiated by the Application module and passed to the WebScraper.Startup class.
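For illustration, sending a notification with MailKit/MimeKit looks roughly like this; the host, credentials, and message text are placeholders:

```csharp
using MailKit.Net.Smtp;
using MailKit.Security;
using MimeKit;

// Build the message with MimeKit.
var message = new MimeMessage();
message.From.Add(MailboxAddress.Parse("scraper@example.com"));
message.To.Add(MailboxAddress.Parse("user@example.com"));
message.Subject = "Auction update";
message.Body = new TextPart("plain") { Text = "An ended auction was re-added." };

// Send it over SMTP with MailKit.
using var client = new SmtpClient();
await client.ConnectAsync("smtp.example.com", 587, SecureSocketOptions.StartTls);
await client.AuthenticateAsync("username", "password");
await client.SendAsync(message);
await client.DisconnectAsync(true);
```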
This module contains integration/unit tests for the WebScraper and Application modules.
The tests require Docker to be running on the machine because they use the Testcontainers.MsSql NuGet package to run a SQL Server instance in a Docker container. See WebScraperTests.
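The container lifecycle in such a test follows the package's builder API; a minimal sketch:

```csharp
using Testcontainers.MsSql;

// Start a disposable SQL Server instance in Docker for the test run.
var container = new MsSqlBuilder().Build();
await container.StartAsync();

// Connection string for the containerized SQL Server.
var connectionString = container.GetConnectionString();

// ... create the EF Core context against connectionString and run tests ...

await container.DisposeAsync();
```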
The email notification functionality is demonstrated and tested in EmailNotificationTests.
Back to readme.