Requirements:

Python 3
Beautiful Soup
lxml parser
Directory setup:

py-crawler/
    crawler.py
    recover.py
    input/
        frontier.txt (optional)
        visited.txt (optional)
        count.txt (optional)
        filetypes.txt (optional)
        subdomains.txt (optional)
    output/
        ...

To run a fresh crawl:
$ python crawler.py

To resume a stopped or crashed crawl, copy frontier.txt, visited.txt, count.txt, filetypes.txt, and subdomains.txt from the output directory into the input directory, then run:

$ python crawler.py recovery

If the crawler crashes, or if you interrupt it via CTRL+C, the contents of the frontier, visited list, subdomains list, crawl counter, and filetypes dictionary are dumped into .txt files in the output directory.
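The state dump on crash or interrupt might look roughly like the sketch below. The function name, variable names, and serialization format here are illustrative assumptions, not the crawler's actual identifiers:

```python
import json
from pathlib import Path

OUTPUT_DIR = Path("output")  # assumed location of the dump files

def dump_state(frontier, visited, subdomains, count, filetypes):
    """Write the in-memory crawl state to .txt files so a later run can resume."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    (OUTPUT_DIR / "frontier.txt").write_text("\n".join(frontier))
    (OUTPUT_DIR / "visited.txt").write_text("\n".join(sorted(visited)))
    (OUTPUT_DIR / "subdomains.txt").write_text("\n".join(sorted(subdomains)))
    (OUTPUT_DIR / "count.txt").write_text(str(count))
    (OUTPUT_DIR / "filetypes.txt").write_text(json.dumps(filetypes))

# Example: dump a tiny crawl state, as would happen on CTRL+C.
dump_state(
    frontier=["http://example.com/a"],
    visited={"http://example.com"},
    subdomains={"example.com"},
    count=1,
    filetypes={"html": 1},
)
```

Recovery then amounts to reading the same files back from the input directory before the crawl loop starts.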
Logs are periodically written to log.csv, which can also be found in the output directory once the crawl begins.
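A periodic CSV logger along these lines would produce log.csv; the column names and the helper below are assumptions for illustration, not taken from crawler.py:

```python
import csv
from pathlib import Path

LOG_PATH = Path("output/log.csv")  # assumed log location

def log_row(count, url, status):
    """Append one row to log.csv, writing a header row on first use."""
    LOG_PATH.parent.mkdir(exist_ok=True)
    is_new = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["count", "url", "status"])
        writer.writerow([count, url, status])

log_row(1, "http://example.com", 200)
```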
If lxml cannot be installed, change PARSER in crawler.py to "html.parser".
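One way the PARSER switch could be made automatic is to fall back to the stdlib parser when lxml is missing; this is a sketch, not necessarily how crawler.py defines PARSER:

```python
from bs4 import BeautifulSoup

# Prefer lxml if it imports cleanly; otherwise use Python's built-in parser.
try:
    import lxml  # noqa: F401
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"

# Both parsers expose the same BeautifulSoup API for link extraction.
soup = BeautifulSoup("<html><body><a href='/page'>link</a></body></html>", PARSER)
links = [a["href"] for a in soup.find_all("a")]
print(links)
```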