Let's design a Web Crawler that will browse and download the World Wide Web.
It's a software program which browses the WWW in a methodical and automated manner, collecting documents by recursively fetching links from a set of starting pages.
Search engines use web crawling as a means to provide up-to-date data. Search engines download all pages and create an index on them to perform faster searches.
Other uses of web crawlers:
- Test web pages and links for valid syntax and structure.
- To search for copyright infringements.
- To maintain mirror sites for popular web sites.
- To monitor sites to see where their content changes.
Scalability: Our service needs to be scalable, since we'll be fetching hundreds of millions of web documents.
Extensibility: Our service should be designed in a way that allows newer functionality to be added to it. It should allow for newer document formats that need to be downloaded and processed in the future.
We should be asking a few questions here:
Is it a crawler for only HTML pages? Or should we fetch and store other media types like images and videos too?
It's important to clarify this because it will change the design. If we're writing a general-purpose crawler, we might want to break the parsing module into different sets of modules: one for HTML, another for images, another for videos, and so on, with each module handling a given media type.
For this design, let's assume our web crawler will deal with HTML only.
Let's assume HTTP for now. Again, it should not be hard to extend it to other protocols.
Let's assume we need to crawl 1 billion websites. Since one site can contain many URLs, assume an upper bound of 15 billion web pages.
Some web crawlers implement the Robots Exclusion Protocol, which allows webmasters to declare parts of their sites off limits to crawlers. The Robots Exclusion Protocol requires a web crawler to fetch a document called robots.txt, which contains these declarations for that site, before downloading any real content from it.
If we crawl 15B pages in 4 weeks, how many pages will we need to fetch per second?
15B / (4 weeks * 7 days * 86400 sec) ~= 6200 web pages/sec
What about storage? Page sizes vary, but since we are dealing with HTML only, let's assume an average page size of 100KB. If we also store 500 bytes of metadata with each page, the total storage we would need is:
15B * (100KB + 500 bytes) = 15B * 100.5 KB ~= 1.5 petabytes
We don't want to go beyond 70% capacity of our storage system, so the total storage we will need is:
1.5 petabytes / 0.7 ~= 2.14 petabytes
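These figures can be sanity-checked with a few lines of arithmetic (the ~2.2 PB here differs slightly from 2.14 PB above because it uses 1 KB = 1024 bytes):

```python
# Back-of-envelope check of the estimates above.
pages = 15_000_000_000            # 15B pages
crawl_seconds = 4 * 7 * 86_400    # 4 weeks of crawling, in seconds
pages_per_sec = pages / crawl_seconds            # ~6200 pages/sec

page_size = 100 * 1024            # 100 KB average HTML page, in bytes
metadata = 500                    # metadata bytes stored per page
raw_storage = pages * (page_size + metadata)     # ~1.5 petabytes
total_storage = raw_storage / 0.7                # ~2.2 PB at 70% capacity
```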
The basic algorithm of a web crawler is this:
- Taking in a list of seed URLs as input, pick a URL from the unvisited URL list.
- Find the URL host-name's IP address.
- Establish a connection to the host to download its corresponding documents.
- Parse the documents contents to look for new URLs.
- Add the new URLs to the list of unvisited URLs.
- Process the downloaded document, e.g., store it or index its contents.
- Go back to step 1.
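The steps above can be sketched as a simple loop (a minimal single-threaded sketch; `fetch`, `parse_links`, and `process` are hypothetical stand-ins for the real modules):

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, process, limit=1000):
    """Minimal BFS crawl loop; fetch/parse_links/process are injected stand-ins."""
    frontier = deque(seed_urls)          # unvisited URL list (FIFO => breadth-first)
    visited = set(seed_urls)
    while frontier and limit > 0:
        url = frontier.popleft()         # 1. pick a URL from the unvisited list
        doc = fetch(url)                 # 2-3. resolve the host and download the document
        if doc is None:
            continue
        for link in parse_links(doc):    # 4. parse the contents for new URLs
            if link not in visited:      # 5. add only unseen URLs to the frontier
                visited.add(link)
                frontier.append(link)
        process(url, doc)                # 6. store or index the document
        limit -= 1
    return visited
```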
Breadth-first or depth-first? Breadth-first search (BFS) is usually used. We can also use depth-first search (DFS), especially when the crawler has already established a connection with a website; in that situation, the crawler can just DFS all the URLs within that website to save some handshaking overhead.
Path-ascending crawling: Path-ascending crawling helps discover hidden or isolated resources. In this scheme, a crawler ascends to every path in each URL, like so:
given a seed URL of http://xyz.com/a/b/one.html
it will attempt to crawl /a/b/, /a/ and /
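A small helper can generate those ancestor paths (a sketch using Python's standard `urllib.parse`):

```python
from urllib.parse import urlparse

def ascending_paths(url):
    """Given a seed URL, list the ancestor paths a path-ascending crawler visits."""
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    parts = parsed.path.split("/")[1:-1]       # drop the leading "" and the document itself
    # Walk from the deepest directory up to the site root.
    return [base + "/" + "/".join(parts[:i] + [""]) if i else base + "/"
            for i in range(len(parts), -1, -1)]
```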
A large volume implies that the web crawler can only download a fraction of the web pages, so it's critical that the web crawler be intelligent enough to prioritize its downloads.
Web pages change frequently. By the time the crawler is downloading the last page from a site, earlier pages may have changed, or new pages may have been added.
Components of a bare minimum crawler:
- URL frontier: stores a list of URLs to download and prioritize which URLs should be crawled first.
- HTTP Fetcher: to retrieve a web page from the host's server.
- Extractor: to extract links from HTML documents.
- Duplicate Remover: to make sure the same content is not extracted twice.
- Datastore: to store retrieved pages, URLs and other metadata.
Assume the crawler is running on a single server, where multiple working threads are performing all the steps needed to download and process a document in a loop.
Step 1: Remove an absolute URL from the shared URL frontier for downloading. The URL begins with a scheme (e.g., "http") which identifies the network protocol that should be used to download it. We can implement these protocols in a modular way for extensibility, so that if our crawler later needs to support more protocols, it can easily be done.
Step 2: Based on the URL's scheme, the worker calls the appropriate protocol module to download the document.
Step 3: After downloading, the document is written into a Document Input Stream (DIS). This will enable other modules to re-read the document multiple times.
Step 4: The worker invokes the dedupe test to see whether this document (associated with a different URL) has already been seen before. If so, the document is not processed any further and the worker thread removes the next URL from the frontier.
Step 5: Process the downloaded document. Each document has a MIME type, like HTML page, image, or video. We can implement these MIME types in a modular way to allow for extensibility when our crawler needs to support more types. The worker invokes the process method of each processing module associated with that MIME type.
Step 6: The HTML processing module will extract links from the page. Each link is converted into an absolute URL and tested against a user-supplied filter to determine if it should be downloaded. If the URL passes the filter, the worker performs the URL-dedupe test, which checks if the URL has been seen before. If it's new, it is added to the URL frontier.
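Steps 1-6 can be sketched as a single worker loop (all names, such as `download`, `extractors`, and `url_filter`, are illustrative stand-ins for the modules described above):

```python
import hashlib
import io

def worker_loop(frontier, download, extractors, url_filter):
    """Sketch of steps 1-6. `download(url)` returns (bytes, mime_type);
    `extractors` maps a MIME type to a link-extraction module."""
    doc_seen = set()              # checksums of processed documents
    url_seen = set(frontier)      # URLs already added to the frontier
    while frontier:
        url = frontier.popleft()                         # step 1: take a URL
        content, mime = download(url)                    # step 2: protocol module
        dis = io.BytesIO(content)                        # step 3: Document Input Stream
        checksum = hashlib.md5(content).digest()[:8]     # step 4: document dedupe test
        if checksum in doc_seen:
            continue
        doc_seen.add(checksum)
        extract = extractors.get(mime)                   # step 5: MIME-based dispatch
        if extract is None:
            continue
        for link in extract(dis):                        # step 6: extract, filter, dedupe
            if url_filter(link) and link not in url_seen:
                url_seen.add(link)
                frontier.append(link)
    return url_seen
```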
Let's see how each component can be distributed onto multiple machines:
This is the data structure that contains all the URLs that are queued to be downloaded. We crawl by performing a breadth-first traversal of the Web, starting from the pages in the seed set. We can use a FIFO queue to implement this.
Since we have a huge list of URLs to crawl, we can distribute our URL frontier into multiple servers. Let's assume on each server we have multiple threads performing the crawling tasks. Our hash function maps each URL to the server responsible for crawling it.
Constraints for a distributed URL frontier:
- The crawler should not overload a server by downloading a lot of pages.
- Multiple machines should not connect to the same web server at the same time.
For each server, we can have distinct FIFO sub-queues, where each worker thread will remove URLs for crawling.
Once a new URL is added, we determine which sub-queue it belongs to by hashing the URL's canonical hostname. Our hash function maps each hostname to a thread number. Together, these two points imply that only one worker thread will download documents from a given web server, and by using a FIFO queue, it won't overload that web server.
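The hostname-to-thread mapping might look like this (a sketch; MD5 is used only as a stable hash so the mapping is consistent across processes and restarts):

```python
import hashlib
from urllib.parse import urlparse

def assign_subqueue(url, num_threads):
    """Map a URL's canonical hostname to a sub-queue/thread number, so every
    URL on the same web server is always crawled by the same single thread."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_threads
```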
The size would be in the 100s of millions of URLs. Therefore, we need to store the URLs on disk. We can implement our queues in such a way that they have separate buffers for enqueuing and dequeuing. Enqueuing buffer, once filled, will be dumped to the disk. Dequeuing buffers will keep a cache of URLs that need to be visited; periodically reading from the disk to fill the buffer.
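A sketch of such a disk-backed sub-queue, with an in-memory list standing in for the on-disk file (a real implementation would serialize the buffers to files):

```python
from collections import deque

class BufferedURLQueue:
    """Frontier sub-queue whose bulk lives on disk: a small enqueue buffer
    that is dumped to 'disk' when full, and a dequeue buffer refilled from
    disk. A plain list stands in for the on-disk FIFO file here."""
    def __init__(self, buffer_size=1000):
        self.buffer_size = buffer_size
        self.enqueue_buf = deque()
        self.dequeue_buf = deque()
        self.disk = []                           # stand-in for an on-disk file

    def push(self, url):
        self.enqueue_buf.append(url)
        if len(self.enqueue_buf) >= self.buffer_size:
            self.disk.extend(self.enqueue_buf)   # dump the full buffer to "disk"
            self.enqueue_buf.clear()

    def pop(self):
        if not self.dequeue_buf:
            # Refill the dequeue cache from disk first; if the disk is empty,
            # drain the not-yet-dumped enqueue buffer to preserve FIFO order.
            chunk = self.disk[:self.buffer_size]
            self.disk = self.disk[self.buffer_size:]
            self.dequeue_buf.extend(chunk or self.enqueue_buf)
            if not chunk:
                self.enqueue_buf.clear()
        return self.dequeue_buf.popleft() if self.dequeue_buf else None
```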
This will download the document corresponding to a given URL using the appropriate network protocol, like HTTP. Webmasters create a robots.txt file to make certain parts of their websites off limits for the crawler.
To avoid downloading this file on every request, our HTTP protocol module can maintain a cache mapping host-names to their robots exclusion rules.
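A sketch of that cache using Python's standard `urllib.robotparser`, where `fetch_robots(host)` is a hypothetical helper that downloads the robots.txt body for a host:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

robots_cache = {}   # host-name -> parsed robots exclusion rules

def allowed(url, fetch_robots, agent="*"):
    """Check a URL against its host's robots.txt, fetching and caching the
    rules once per host rather than on every request."""
    host = urlparse(url).netloc
    if host not in robots_cache:
        parser = RobotFileParser()
        parser.parse(fetch_robots(host).splitlines())
        robots_cache[host] = parser
    return robots_cache[host].can_fetch(agent, url)
```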
We cache the document locally using DIS to avoid downloading the document multiple times.
A DIS is an input stream that caches the doc's entire contents in memory. It also provides methods to re-read the document. Larger documents can be temporarily written to a backing file.
Each worker will have a DIS, which it reuses from document to document. After extracting a URL from the frontier, the worker passes that URL to the relevant protocol module (in our case, for HTTP) which initializes the DIS from a network connection to contain the document's contents. The worker then passes the DIS to all relevant processing modules.
To prevent processing a doc more than once, we perform a dedupe test on each doc to remove duplication.
We can calculate a 64-bit checksum (e.g., the first 64 bits of an MD5 or SHA hash) of every processed document and store it in a database. For each new document, we check whether its checksum is already in the set of previously calculated checksums to see if the document has been seen before.
We need to keep a unique set containing the checksums of all previously processed documents. Considering 15 billion distinct web pages, we would need about
15B * 8 bytes => 120 GB
We can have a small LRU cache on each server with everything backed by persistent storage.
Steps:
- Check if the checksum is present in the cache.
- If not, check if the checksum is in the backing storage.
- If found, ignore the document.
- Otherwise, add the checksum to the cache and the backing storage.
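These steps can be sketched as follows (a plain set stands in for the persistent store, and cache sizes are illustrative):

```python
from collections import OrderedDict

class DocDeduper:
    """Dedupe test sketch: a small LRU cache of checksums in front of a
    persistent store (a plain set stands in for the database here)."""
    def __init__(self, cache_size=1000):
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.store = set()                  # stand-in for persistent storage

    def seen_before(self, checksum):
        if checksum in self.cache:          # 1. check the LRU cache
            self.cache.move_to_end(checksum)
            return True
        if checksum in self.store:          # 2. fall back to the backing storage
            self._cache(checksum)
            return True
        self._cache(checksum)               # 3. new document: record it everywhere
        self.store.add(checksum)
        return False

    def _cache(self, checksum):
        self.cache[checksum] = True
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
```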
The URL filtering mechanism provides a customizable way to control the set of URLs that are downloaded. We can define filters to restrict URLs by domain, prefix, or protocol type.
Before adding the URL to the frontier, the worker thread consults the user-supplied URL filter.
Before we contact a web server, the crawler must use DNS to map the web server's hostname to an IP address. DNS name resolution will be a big bottleneck for our crawler given the number of URLs we are working with. To avoid repeated requests, we can start caching DNS results by building our own local DNS server.
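A minimal sketch of such a cache (note it ignores DNS TTLs for simplicity; `lookup` is injectable so the system resolver can be swapped for a custom local DNS server):

```python
import socket

dns_cache = {}   # hostname -> IP address

def resolve(hostname, lookup=socket.gethostbyname):
    """Resolve a hostname once and cache the result for later requests."""
    if hostname not in dns_cache:
        dns_cache[hostname] = lookup(hostname)
    return dns_cache[hostname]
```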
While extracting links, we will encounter multiple links to the same document. To avoid downloading and processing a doc multiple times, we use a URL dedupe test on each extracted link before adding it to the URL frontier.
We can store fixed-size checksums for all the URLs seen by our crawler in a database. To reduce the number of operations performed on the database, we can keep an in-memory cache of popular URLs on each host, shared by all threads. This works because links to some URLs are quite common, so caching them will lead to a high in-memory hit rate.
To keep a unique set of checksums of all previously seen URLs, we would need:
15B URLs * 4 bytes => 60 GB
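A fixed-size 4-byte checksum matching that estimate could be derived by truncating a hash (one possible choice, not the only one; note that at 15B URLs a 4-byte checksum will see some collisions, which a real system would need to tolerate or resolve):

```python
import hashlib

def url_checksum(url):
    """4-byte fixed-size checksum for the URL-seen test (truncated MD5)."""
    return hashlib.md5(url.encode()).digest()[:4]
```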
A crawl of the entire Web takes weeks to complete. To guard against failures, our crawler can write regular snapshots of its state to disk. An interrupted or aborted crawl can then easily be restarted from the latest checkpoint.
We should use consistent hashing for distribution among crawler servers. Consistent hashing will help in:
- replacing dead hosts
- distributing load among crawling servers
All crawling servers will perform regular checkpointing, storing their FIFO queues to disk. If a server goes down, we can replace it; in the meantime, consistent hashing shifts its load to the remaining servers.
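A minimal consistent-hash ring illustrating both properties (a sketch, not a production implementation; the virtual-node count is an assumed tuning knob):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map URLs to crawler servers. Virtual nodes smooth the load, and
    removing a dead host only remaps the keys that host owned."""
    def __init__(self, servers, vnodes=100):
        self.ring = []                       # sorted list of (point, server)
        for server in servers:
            self.add(server, vnodes)

    def _hash(self, key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add(self, server, vnodes=100):
        for i in range(vnodes):              # place vnodes points per server
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self.ring = [(p, s) for p, s in self.ring if s != server]

    def server_for(self, url):
        point = self._hash(url)
        # First ring point at or after the URL's point, wrapping around.
        i = bisect.bisect(self.ring, (point, "")) % len(self.ring)
        return self.ring[i][1]
```

Removing a dead server does not disturb URLs owned by the surviving servers, which is exactly what we need when a crawling server is being replaced.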

