GitHub - 4thel00z/ccdown: A rust based, resumable downloader cli and python library for Common Crawl data

A polite downloader for Common Crawl data, written in Rust.

Install

cargo install ccdown

Other methods

From source

git clone https://github.com/4thel00z/ccdown.git
cd ccdown
cargo install --path .

Pre-built binaries

Grab the latest release for your platform from the releases page.

Usage

1. Download the path manifest for a crawl

ccdown download-paths CC-MAIN-2025-08 warc ./paths

Supported subsets: segment warc wat wet robotstxt non200responses cc-index cc-index-table

Crawl format: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM

2. Download the actual data

ccdown download ./paths/warc.paths.gz ./data

Options

Flag	Description	Default
`-t`	Number of concurrent downloads	`10`
`-r`	Max retries per file	`1000`
`-p`	Show progress bars	off
`-f`	Flat file output (no directory structure)	off
`-n`	Numbered output (for Ungoliant Pipeline)	off
`-s`	Abort on unrecoverable errors (401, 403, 404)	off

Example

ccdown download -p -t 5 ./paths/warc.paths.gz ./data

Note: Keep threads at 10 or below. Too many concurrent requests will get you 403'd by the server, and those errors are unrecoverable.

Python bindings

Install

pip install ccdown

Usage

from ccdown import Client

client = Client(threads=10, retries=1000, progress=True)

# Download the path manifest for a crawl
client.paths("CC-MAIN-2025-08", "warc").to("./paths")

# Download the actual data
client.download("./paths/warc.paths.gz").to("./data")

# Flat file output (no directory structure)
client.download("./paths/warc.paths.gz").files_only().to("./data")

# Numbered output + strict mode (abort on 401/403/404)
client.download("./paths/warc.paths.gz").numbered().strict().to("./data")

API

Client(threads=10, retries=1000, progress=False) — Create a client with shared config.

client.paths(snapshot, data_type) — Returns a builder. Call .to(dst) to download the path manifest.

client.download(path_file) — Returns a builder with chainable options:

.files_only() — flatten directory structure
.numbered() — enumerate output files (for Ungoliant)
.strict() — abort on unrecoverable HTTP errors
.to(dst) — execute the download

License

MIT OR Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 101 Commits
.github		.github
src		src
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md
SECURITY.md		SECURITY.md
logo.png		logo.png
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Install

From source

Pre-built binaries

Usage

1. Download the path manifest for a crawl

2. Download the actual data

Options

Example

Install

Usage

API

License

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Install

From source

Pre-built binaries

Usage

1. Download the path manifest for a crawl

2. Download the actual data

Options

Example

Install

Usage

API

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages