mzhang0/py-crawler
Requirements

Python 3

Beautiful Soup

lxml Parser
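A quick way to confirm the dependencies work together (the pip package names `beautifulsoup4` and `lxml` are assumptions; Beautiful Soup imports as `bs4`):

```python
# Install (assumed package names): pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup

# Parse a small page with the lxml backend and pull out the links,
# the core operation of any crawler.
html = '<html><body><a href="https://example.com">ex</a></body></html>'
soup = BeautifulSoup(html, "lxml")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```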

Usage

Directory setup:

py-crawler/
    crawler.py
    recover.py
    input/
        frontier.txt (optional)
        visited.txt (optional)
        count.txt (optional)
        filetypes.txt (optional)
        subdomains.txt (optional)
    output/
        ...

To run a fresh crawl:

$ python crawler.py

To resume a stopped or crashed crawl, copy frontier.txt, visited.txt, count.txt, filetypes.txt, and subdomains.txt from the output directory into the input directory, then run:

$ python crawler.py recovery
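The copy step above can be scripted; this is a sketch, not part of the repository (the helper name `prepare_recovery` is hypothetical, but the five file names come from the usage instructions):

```python
import shutil
from pathlib import Path

# The state files listed in the recovery instructions.
STATE_FILES = ["frontier.txt", "visited.txt", "count.txt",
               "filetypes.txt", "subdomains.txt"]

def prepare_recovery(output_dir="output", input_dir="input"):
    """Copy the crawl-state files from output/ into input/ for a recovery run."""
    dst = Path(input_dir)
    dst.mkdir(exist_ok=True)
    for name in STATE_FILES:
        src = Path(output_dir) / name
        if src.exists():              # skip files the crawl never produced
            shutil.copy(src, dst / name)
```

After calling `prepare_recovery()`, run `python crawler.py recovery` as shown above.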

Notes

If the crawler crashes or you interrupt it with CTRL+C, the contents of the frontier, visited list, subdomains list, crawl counter, and filetypes dictionary are dumped to .txt files in the output directory.
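A minimal sketch of the dump-on-interrupt pattern this note describes (the names `dump_state` and `crawl_loop`, and the exact file formats, are assumptions; crawler.py's actual implementation may differ):

```python
import json
from pathlib import Path

def dump_state(frontier, visited, subdomains, count, filetypes, out_dir="output"):
    """Write the in-memory crawl state to .txt files so a run can be resumed."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / "frontier.txt").write_text("\n".join(frontier))
    (out / "visited.txt").write_text("\n".join(sorted(visited)))
    (out / "subdomains.txt").write_text("\n".join(sorted(subdomains)))
    (out / "count.txt").write_text(str(count))
    (out / "filetypes.txt").write_text(json.dumps(filetypes))

def crawl_loop(frontier, visited, subdomains, filetypes):
    count = 0
    try:
        while frontier:
            url = frontier.pop(0)
            # ... fetch and parse url, extend frontier ...
            visited.add(url)
            count += 1
    except KeyboardInterrupt:   # CTRL+C: fall through to the state dump
        pass
    finally:                    # runs on clean exit, interrupt, or crash
        dump_state(frontier, visited, subdomains, count, filetypes)
```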

Logs are periodically written to log.csv, which can also be found in the output directory once the crawl begins.

If lxml cannot be installed, change PARSER in crawler.py to "html.parser".
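The swap described above, sketched with an automatic fallback (the try/except is an illustration; the README only asks you to change the constant by hand):

```python
from bs4 import BeautifulSoup

# Prefer lxml, but fall back to the stdlib parser if lxml is not installed.
try:
    import lxml  # noqa: F401
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"

soup = BeautifulSoup("<p>hello</p>", PARSER)
print(soup.p.text)
```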

About

Simple web crawler built using Python
