
ccdown

A polite downloader for Common Crawl data, written in Rust.



Install

cargo install ccdown
Other methods

From source

git clone https://github.com/4thel00z/ccdown.git
cd ccdown
cargo install --path .

Pre-built binaries

Grab the latest release for your platform from the releases page.

Usage

1. Download the path manifest for a crawl

ccdown download-paths CC-MAIN-2025-08 warc ./paths

Supported subsets: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table

Crawl format: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM
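
To catch a mistyped crawl ID or subset before invoking ccdown, a small pre-flight check can help. This is a minimal sketch, assuming the two ID shapes and the subset list above are exhaustive; `is_valid_crawl` and `is_valid_subset` are hypothetical helper names and are not part of ccdown.

```python
import re

# Subsets as listed in this README.
SUBSETS = {
    "segment", "warc", "wat", "wet", "robotstxt",
    "non200responses", "cc-index", "cc-index-table",
}

# Assumption: IDs are exactly CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM,
# with a four-digit year and a two-digit week or month.
CRAWL_RE = re.compile(r"CC-(MAIN|NEWS)-\d{4}-\d{2}")


def is_valid_crawl(crawl_id: str) -> bool:
    """Return True if crawl_id has one of the two documented shapes."""
    return CRAWL_RE.fullmatch(crawl_id) is not None


def is_valid_subset(subset: str) -> bool:
    """Return True if subset is one of the names listed in the README."""
    return subset in SUBSETS
```

The check only looks at the shape of the ID. It does not verify that the week or month is in range, or that the crawl actually exists on the server.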

2. Download the actual data

ccdown download ./paths/warc.paths.gz ./data

Options

Flag  Description                                     Default
-t    Number of concurrent downloads                  10
-r    Max retries per file                            1000
-p    Show progress bars                              off
-f    Flat file output (no directory structure)       off
-n    Numbered output (for Ungoliant Pipeline)        off
-s    Abort on unrecoverable errors (401, 403, 404)   off

Example

ccdown download -p -t 5 ./paths/warc.paths.gz ./data

Note: Keep -t at 10 or below. Too many concurrent requests will cause the server to respond with 403, and ccdown treats 403 as unrecoverable.
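
When ccdown is driven from a script, the thread ceiling from the note above can be enforced before the command is launched. The following is a rough sketch: `build_download_cmd` is a hypothetical helper, and it assumes that `-r` takes its value as a separate argument in the same way `-t` does in the example.

```python
MAX_THREADS = 10  # ceiling recommended in the note above


def build_download_cmd(paths_file, dst, threads=10, retries=1000,
                       progress=False, flat=False, numbered=False,
                       strict=False):
    """Build an argv list for `ccdown download` from the documented flags."""
    if not 1 <= threads <= MAX_THREADS:
        raise ValueError(f"threads must be between 1 and {MAX_THREADS}")
    cmd = ["ccdown", "download", "-t", str(threads), "-r", str(retries)]
    if progress:
        cmd.append("-p")  # show progress bars
    if flat:
        cmd.append("-f")  # flat file output
    if numbered:
        cmd.append("-n")  # numbered output
    if strict:
        cmd.append("-s")  # abort on 401, 403, 404
    # Positional arguments come last, as in the example above.
    cmd += [paths_file, dst]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building an argv list rather than a shell string avoids quoting problems with unusual paths.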

Python bindings

Install

pip install ccdown

Usage

from ccdown import Client

client = Client(threads=10, retries=1000, progress=True)

# Download the path manifest for a crawl
client.paths("CC-MAIN-2025-08", "warc").to("./paths")

# Download the actual data
client.download("./paths/warc.paths.gz").to("./data")

# Flat file output (no directory structure)
client.download("./paths/warc.paths.gz").files_only().to("./data")

# Numbered output + strict mode (abort on 401/403/404)
client.download("./paths/warc.paths.gz").numbered().strict().to("./data")

API

Client(threads=10, retries=1000, progress=False) — Create a client with shared config.

client.paths(snapshot, data_type) — Returns a builder. Call .to(dst) to download the path manifest.

client.download(path_file) — Returns a builder with chainable options:

  • .files_only() — flatten directory structure
  • .numbered() — enumerate output files (for Ungoliant)
  • .strict() — abort on unrecoverable HTTP errors
  • .to(dst) — execute the download
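
The builder calls above compose naturally when several subsets of the same crawl are needed. A minimal sketch, assuming only the `client.paths(snapshot, data_type).to(dst)` chain documented here; `fetch_manifests` is a hypothetical helper, and how `.to()` reports failures is not specified in this README, so errors are simply left to propagate.

```python
def fetch_manifests(client, snapshot, data_types, dst):
    """Download the path manifest for each data type into dst.

    `client` is expected to expose paths(snapshot, data_type), returning a
    builder with a to(dst) method, as described in the API section.
    Returns the data types that were requested, in order.
    """
    fetched = []
    for data_type in data_types:
        # One manifest download per subset; any exception propagates.
        client.paths(snapshot, data_type).to(dst)
        fetched.append(data_type)
    return fetched
```

With the real bindings this would be called as `fetch_manifests(Client(threads=10, retries=1000, progress=False), "CC-MAIN-2025-08", ["warc", "wet"], "./paths")`. Passing the client in as an argument keeps the helper easy to exercise with a stand-in object.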

License

MIT OR Apache-2.0
