A polite downloader for Common Crawl data, written in Rust.
```sh
cargo install ccdown
```

## Other methods
```sh
git clone https://github.com/4thel00z/ccdown.git
cd ccdown
cargo install --path .
```

Alternatively, grab the latest release for your platform from the releases page.
```sh
ccdown download-paths CC-MAIN-2025-08 warc ./paths
```

Supported subsets: `segment`, `warc`, `wat`, `wet`, `robotstxt`, `non200responses`, `cc-index`, `cc-index-table`

Crawl format: `CC-MAIN-YYYY-WW` or `CC-NEWS-YYYY-MM`
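The two crawl-name formats above are easy to sanity-check before passing them to the tool. A minimal stdlib sketch — the regex and helper are illustrative, not part of ccdown:

```python
import re

# Hypothetical validator for the two crawl-name formats accepted above:
# CC-MAIN-YYYY-WW (main crawls) and CC-NEWS-YYYY-MM (news crawls).
CRAWL_RE = re.compile(r"CC-(MAIN|NEWS)-\d{4}-\d{2}")

def is_valid_crawl(name: str) -> bool:
    """Return True if the name matches either crawl format."""
    return CRAWL_RE.fullmatch(name) is not None

print(is_valid_crawl("CC-MAIN-2025-08"))  # True
print(is_valid_crawl("CC-NEWS-2024-12"))  # True
print(is_valid_crawl("CC-MAIN-2025"))     # False
```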
```sh
ccdown download ./paths/warc.paths.gz ./data
```

| Flag | Description | Default |
|---|---|---|
| `-t` | Number of concurrent downloads | 10 |
| `-r` | Max retries per file | 1000 |
| `-p` | Show progress bars | off |
| `-f` | Flat file output (no directory structure) | off |
| `-n` | Numbered output (for Ungoliant Pipeline) | off |
| `-s` | Abort on unrecoverable errors (401, 403, 404) | off |
```sh
ccdown download -p -t 5 ./paths/warc.paths.gz ./data
```

Note: Keep threads at 10 or below. Too many concurrent requests will get you 403'd by the server, and those errors are unrecoverable.
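For reference, a `*.paths.gz` manifest is just a gzipped text file with one relative file path per line. A stdlib-only sketch that peeks at one — the sample paths are fabricated for illustration and built in memory rather than read from a real manifest:

```python
import gzip
import io

# Build a tiny in-memory stand-in for warc.paths.gz (real manifests list
# tens of thousands of files; these paths are illustrative, not real names).
sample = (
    "crawl-data/CC-MAIN-2025-08/segments/0001/warc/part-00000.warc.gz\n"
    "crawl-data/CC-MAIN-2025-08/segments/0001/warc/part-00001.warc.gz\n"
)
buf = io.BytesIO()
with gzip.open(buf, "wt") as f:
    f.write(sample)

# Reading it back: each non-empty line is one file to download.
buf.seek(0)
with gzip.open(buf, "rt") as f:
    paths = [line.strip() for line in f if line.strip()]

print(len(paths))  # 2
```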
## Python bindings
```sh
pip install ccdown
```

```python
from ccdown import Client

client = Client(threads=10, retries=1000, progress=True)

# Download the path manifest for a crawl
client.paths("CC-MAIN-2025-08", "warc").to("./paths")

# Download the actual data
client.download("./paths/warc.paths.gz").to("./data")

# Flat file output (no directory structure)
client.download("./paths/warc.paths.gz").files_only().to("./data")

# Numbered output + strict mode (abort on 401/403/404)
client.download("./paths/warc.paths.gz").numbered().strict().to("./data")
```

- `Client(threads=10, retries=1000, progress=False)` — create a client with shared config.
- `client.paths(snapshot, data_type)` — returns a builder. Call `.to(dst)` to download the path manifest.
- `client.download(path_file)` — returns a builder with chainable options:
  - `.files_only()` — flatten directory structure
  - `.numbered()` — enumerate output files (for Ungoliant)
  - `.strict()` — abort on unrecoverable HTTP errors
  - `.to(dst)` — execute the download
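Under the hood, each manifest entry is a path relative to Common Crawl's public data endpoint (`https://data.commoncrawl.org/` at the time of writing). A sketch of that URL construction — the helper is mine for illustration, not part of the ccdown API:

```python
# Common Crawl serves manifest entries relative to its public data endpoint.
BASE = "https://data.commoncrawl.org/"

def to_url(path: str) -> str:
    """Join one manifest path onto the data endpoint (illustrative helper)."""
    return BASE + path.lstrip("/")

print(to_url("crawl-data/CC-MAIN-2025-08/segments/0001/warc/part-00000.warc.gz"))
```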
## License

MIT OR Apache-2.0
