Consecutive DNS errors crash a Tranco crawl #1116

@Mathis-Z

Problem

I am crawling the top 100k Tranco domains. Unfortunately, the Tranco list contains many domains that cannot be resolved using public DNS. Since each DNS resolution error counts against the TaskManager's maximum consecutive errors limit, a long enough streak of unresolvable domains will inevitably crash the crawler. Even worse, this only happens after the crawl has been running for quite a while, because such streaks don't occur in the top ~20k domains.
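To illustrate why the crash only shows up late, here is a minimal, self-contained model of a consecutive-failure limit. It is a sketch, not OpenWPM's actual implementation: `ConsecutiveFailureCounter`, `sites_visited_before_abort`, and the limit of 3 are all hypothetical names and values chosen for the example.

```python
class ConsecutiveFailureCounter:
    """Hypothetical model of a 'maximum consecutive errors' limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.streak = 0

    def record(self, success: bool) -> None:
        # Any success resets the streak; any failure extends it.
        if success:
            self.streak = 0
            return
        self.streak += 1
        if self.streak > self.limit:
            raise RuntimeError(
                f"{self.streak} consecutive failures exceed limit {self.limit}"
            )


def sites_visited_before_abort(outcomes, limit):
    """Return the index at which the limit tripped, or None if it never did."""
    counter = ConsecutiveFailureCounter(limit)
    for index, success in enumerate(outcomes):
        try:
            counter.record(success)
        except RuntimeError:
            return index
    return None


# Head of the list: failures are isolated, so the streak keeps resetting.
head = [True, False, True, True, False, True]
# Further down the list: several unresolvable domains in a row.
tail = [False] * 5

print(sites_visited_before_abort(head, limit=3))
print(sites_visited_before_abort(head + tail, limit=3))
```

In this model the head of the list never trips the limit, while the first long run of failures in the tail does, which matches the late crash described above.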

Expected Behavior

I would expect OpenWPM to handle DNS errors gracefully and continue a crawl even if many domains are unresolvable. The error could be recorded in some way, but in no case should the crawl abort. Perhaps OpenWPM's behavior on DNS errors should be configurable through a parameter (e.g., retry, mark as failed, or abort). This might also be useful for other types of errors (e.g., network temporarily unavailable).
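A rough sketch of what such a configurable policy could look like. None of this exists in OpenWPM: `DnsErrorPolicy`, `visit_with_policy`, and `max_retries` are hypothetical names, and `socket.gaierror` is only a stand-in for however a DNS failure would actually surface from the browser.

```python
import socket
from enum import Enum


class DnsErrorPolicy(Enum):
    """Hypothetical options for reacting to an unresolvable domain."""

    RETRY = "retry"
    MARK_FAILED = "mark_failed"
    ABORT = "abort"


def visit_with_policy(url, fetch, policy, max_retries=2):
    """Call fetch(url) and apply the policy if it raises socket.gaierror.

    Returns "ok" on success or "dns_failed" once the policy gives up.
    Under ABORT the DNS error is re-raised to stop the crawl.
    """
    # Only RETRY gets additional attempts; the other policies try once.
    attempts = 1 + (max_retries if policy is DnsErrorPolicy.RETRY else 0)
    for _ in range(attempts):
        try:
            fetch(url)
            return "ok"
        except socket.gaierror:
            if policy is DnsErrorPolicy.ABORT:
                raise
    # Record the failure and let the caller move on to the next site,
    # without counting it towards any consecutive-error limit.
    return "dns_failed"
```

With `MARK_FAILED`, a caller would store the `"dns_failed"` result next to the site and continue, so a streak of unresolvable domains would no longer end the crawl.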

Example Code

from pathlib import Path
import sys
import os

sys.path.append(os.path.abspath("openwpm"))
from openwpm.command_sequence import CommandSequence
from openwpm.commands.browser_commands import GetCommand
from openwpm.config import BrowserParams, ManagerParams
from openwpm.storage.sql_provider import SQLiteStorageProvider
from openwpm.task_manager import TaskManager


if __name__ == "__main__":
    # 100 domains that do not resolve, to reproduce a long streak of DNS errors.
    urls = [f"http://some-nonexistent-domain-nwjkf{i}.com" for i in range(100)]

    browser_params = [BrowserParams(display_mode="headless") for _ in range(4)]
    manager_params = ManagerParams(num_browsers=4)

    with TaskManager(
        manager_params,
        browser_params,
        SQLiteStorageProvider(Path("/tmp/crawl-data.sqlite")),
        None,
    ) as manager:
        print("Starting crawl...")

        for index, url in enumerate(urls):
            command_sequence = CommandSequence(url, site_rank=index)
            command_sequence.append_command(GetCommand(url=url, sleep=3), timeout=60)
            manager.execute_command_sequence(command_sequence)
