This tutorial walks through creating an annotation from scratch, using the Curlie web directory as an example. By the end, you'll know how to take any dataset of websites and overlay it onto Common Crawl's index.
Already familiar with the concepts? See the quickstart guide for a condensed reference, or the YAML reference for all configuration options.
Common Crawl is a huge archive of web pages — billions of them. The index tells you what's in the archive, but not what kind of site each page came from.
An annotation adds that missing context. It's a file that says things like:
- "this domain is a cooking website"
- "this domain is in German"
- "this URL is listed in a human-curated directory"
You build it once, then join it against the Common Crawl index to filter or enrich the crawl data.
In its most basic form, an annotation is just a single YAML file that points at a public dataset:
```yaml
# join_majestic_million.yaml — that's it, that's the whole annotation
table:
  source:
    url: https://downloads.majestic.com/majestic_million.csv
    format: csv
right_columns:
  - GlobalRank
  - RefSubNets
join_columns:
  left: url_host_registered_domain
  right: Domain
```

This YAML says: "grab the Majestic Million CSV from the internet and join it against Common Crawl's index on the domain column." No download scripts, no conversion — just a YAML file describing where the data lives and how it connects.
You can find ready-made examples like this in the examples/external-data/ directory. Drop one into a query and you instantly have extra columns on every domain in a crawl.
For datasets that aren't a clean CSV with domain columns (most of them!), you'll need to do a bit more work. That's what the rest of this tutorial covers.
To run an annotation query, you need three YAML files:
```
python annotate.py left.yaml join.yaml action.yaml
                   │         │         │
                   │         │         └─ What to output (columns, filters)
                   │         └─ Your dataset (the annotation)
                   └─ Common Crawl's index (left side of the join)
```
Think of it like a database query: LEFT is the big table, JOIN adds your columns, ACTION is the SELECT/WHERE.
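To make the analogy concrete, here's a toy version of that three-part query using `sqlite3` from the Python standard library as a stand-in for DuckDB. All table contents and the `fetched_pages` column are made up for illustration; they are not real Common Crawl columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# LEFT: a toy stand-in for the Common Crawl host index
con.execute("CREATE TABLE left_index (surt_host_name TEXT, fetched_pages INTEGER)")
con.executemany("INSERT INTO left_index VALUES (?, ?)",
                [("com,example", 120), ("com,allrecipes,cooking", 800)])

# JOIN: a toy stand-in for your annotation parquet
con.execute("CREATE TABLE annotation (surt_host_name TEXT, category TEXT)")
con.execute("INSERT INTO annotation VALUES ('com,allrecipes,cooking', 'Home/Cooking')")

# ACTION: the SELECT/WHERE, i.e. which columns to output and how to filter
rows = con.execute("""
    SELECT l.surt_host_name, l.fetched_pages, a.category
    FROM left_index l
    JOIN annotation a ON l.surt_host_name = a.surt_host_name
    WHERE a.category LIKE '%Cooking%'
""").fetchall()
print(rows)  # → [('com,allrecipes,cooking', 800, 'Home/Cooking')]
```

Only hosts that appear in both tables and pass the filter survive, which is exactly what happens at scale when `annotate.py` joins your parquet against the index.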
You need a dataset that has website domains or URLs in it. It can be:
- A CSV or TSV file
- A JSON API
- An HTML page you can scrape
- A database dump
- ...
For this tutorial, we'll use Curlie.org — the largest human-edited web directory, with 2.9 million entries. It's available as a free download: a tar.gz of TSV files.
What to look for in a dataset:
- Does it have a column with domains or URLs? (required)
- What extra info does it provide? (categories, ratings, flags, scores)
- What license is it under? (respect the terms!)
- Is it freely downloadable?
Before you scrape or download, always check robots.txt! Open https://curlie.org/robots.txt in your browser. This file tells automated tools what they're allowed to access (see RFC 9309 for the spec). Curlie's robots.txt allows access to their download — so we're good. If a site's robots.txt blocks your user agent or the paths you need, respect that and look for an alternative data source.
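You can automate that robots.txt check with `urllib.robotparser` from the Python standard library. The rules below are a hypothetical example for illustration, not Curlie's actual robots.txt — always check the live file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, NOT the real contents of
# https://curlie.org/robots.txt — fetch and check the live file yourself.
robots_txt = """\
User-agent: *
Disallow: /search
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
rp.modified()  # mark the rules as loaded; can_fetch() answers False otherwise

print(rp.can_fetch("mybot", "https://curlie.org/directory-dl"))  # → True
print(rp.can_fetch("mybot", "https://curlie.org/search?q=x"))    # → False
```

In a real script you'd point `set_url()` at the site's robots.txt and call `read()` instead of parsing a hard-coded string.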
```shell
# Download (~170MB compressed)
curl -L -o curlie-rdf-all.tar.gz https://curlie.org/directory-dl
tar xzf curlie-rdf-all.tar.gz
```

Peek at what's inside:
```shell
head -3 curlie-rdf/rdf-Arts-c.tsv
```

```
http://www.awn.com/	Animation World Network	Provides information resources...	423945
http://www.toonhound.com/	Toonhound	British cartoon, animation...	423945
```
Four tab-separated columns: URL, title, description, category ID. That URL column is what we need.
The category files (*-s.tsv) map those IDs to human-readable paths:
```
423945	Arts/Animation	4343	Sites primarily related to...
```
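Before reaching for DuckDB, you can sanity-check the format with nothing but the standard library. This sketch parses one hard-coded row of each file type (sample rows copied from above, not read from disk) and resolves the category ID:

```python
import csv
import io

# One sample row from a sites file (url, title, description, category_id)...
site_row = ("http://www.awn.com/\tAnimation World Network\t"
            "Provides information resources...\t423945")
# ...and one from a category file (category_id, path, site_count, description)
cat_row = "423945\tArts/Animation\t4343\tSites primarily related to..."

url, title, description, category_id = next(
    csv.reader(io.StringIO(site_row), delimiter="\t"))
categories = {row[0]: row[1]
              for row in csv.reader(io.StringIO(cat_row), delimiter="\t")}

print(url, "->", categories[category_id])  # → http://www.awn.com/ -> Arts/Animation
```

The DuckDB script later in this tutorial does the same join, just across all the files at once.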
Common Crawl's index uses SURT keys to identify hosts. A SURT (Sort-friendly URI Rewriting Transform) reverses the domain:
```
www.example.com        → com,example
cooking.allrecipes.com → com,allrecipes,cooking
```
This lets the database do fast prefix lookups. Your annotation needs a surt_host_name column to join with the index.
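The examples above can be reproduced with a few lines of plain Python. This is only a sketch of the idea; the real conversion (the `surt` library, or the project's `utils.thing_to_surt_host_name`) also handles lowercasing, IDNs, ports, and other edge cases, and stripping a leading `www.` here is an assumption made to match the examples:

```python
def surt_host_name(hostname: str) -> str:
    """Sketch of a SURT host name: reverse the dot-separated labels.
    Dropping a leading 'www.' is an assumption, not part of plain SURT."""
    labels = hostname.lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ",".join(reversed(labels))

print(surt_host_name("www.example.com"))         # → com,example
print(surt_host_name("cooking.allrecipes.com"))  # → com,allrecipes,cooking
```

Because every host under `allrecipes.com` now starts with `com,allrecipes`, sorting by this key puts a whole domain's hosts in one contiguous range, which is what makes prefix lookups cheap.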
Here's the conversion script (curlie-convert.py):
```python
import duckdb
import surt as surt_lib
import utils  # project helpers, including thing_to_surt_host_name()

con = duckdb.connect()

# Step 3a: Read TSV files and join sites with categories
con.sql("""
    CREATE TABLE sites AS
    WITH raw_sites AS (
        SELECT url, category_id
        FROM read_csv('curlie-rdf/rdf-*-c.tsv', sep='\t', header=false,
                      columns={'url': 'VARCHAR', 'title': 'VARCHAR',
                               'description': 'VARCHAR', 'category_id': 'BIGINT'},
                      ignore_errors=true, null_padding=true, strict_mode=false, parallel=false)
    ),
    raw_cats AS (
        SELECT category_id, category_path
        FROM read_csv('curlie-rdf/rdf-*-s.tsv', sep='\t', header=false,
                      columns={'category_id': 'BIGINT', 'category_path': 'VARCHAR',
                               'site_count': 'INTEGER', 'description': 'VARCHAR',
                               'geo1': 'VARCHAR', 'geo2': 'VARCHAR'},
                      ignore_errors=true, null_padding=true, strict_mode=false, parallel=false)
    )
    SELECT s.url, c.category_path
    FROM raw_sites s LEFT JOIN raw_cats c ON s.category_id = c.category_id
    WHERE s.url LIKE 'http%'
""")

# Step 3b: Extract domains and compute SURTs
rows = con.sql("""
    SELECT DISTINCT url, split_part(split_part(url, '://', 2), '/', 1) AS domain
    FROM sites
""").fetchall()

surt_data = []
for url, domain in rows:
    try:
        surt_host = utils.thing_to_surt_host_name(domain)
        url_surtkey = surt_lib.surt(url)
        surt_data.append((url, surt_host, url_surtkey))
    except (ValueError, TypeError):
        pass  # skip unparseable URLs

# Step 3c: Write the parquet
con.execute("CREATE TABLE surt_lookup (url VARCHAR, surt_host_name VARCHAR, url_surtkey VARCHAR)")
con.executemany("INSERT INTO surt_lookup VALUES (?, ?, ?)", surt_data)
con.sql("""
    COPY (
        SELECT s.surt_host_name, s.url_surtkey, e.category_path AS category
        FROM sites e JOIN surt_lookup s ON e.url = s.url
        ORDER BY s.surt_host_name
    ) TO 'curlie.parquet' (FORMAT PARQUET)
""")
```

The real script also extracts languages, validates domains, and handles edge cases — but this is the core idea: read the source data → add SURT columns → write parquet.
This tells the system where your parquet is and how to join it:
```yaml
# join_curlie.yaml
table:
  local: ./curlie.parquet
right_columns:
  - domain
  - lang
  - category
join_columns:
  - surt_host_name
```

- `local` points at your parquet file
- `right_columns` lists the columns you want available in queries
- `join_columns` says what to join on; `surt_host_name` matches the host index
This is usually the same across all annotations and points at either Common Crawl's host index or its page index:
```yaml
# left_host_index.yaml
table:
  web_prefix: https://data.commoncrawl.org/
  paths: host-index-paths.gz
limits:
  grep: [CC-MAIN-2024-51]  # which crawl(s) to query
  #count: 1                # uncomment to test with just 1 file
```

This is the fun part — what do you want to find?
```yaml
# action_cooking.yaml — find cooking sites in all languages
sql: "SELECT DISTINCT {columns} FROM joined WHERE {where}"
columns: "surt_host_name, domain, lang, category"
where: >
  category LIKE '%Cooking%'
  OR category LIKE '%Kochen%'
  OR category LIKE '%Cuisine%'
  OR category LIKE '%Cucina%'
  OR category LIKE '%料理%'
```

Or a simple "show me everything" action:
```yaml
# action_star.yaml
sql: "SELECT {columns} FROM joined WHERE {where}"
columns: "*"
where: "category IS NOT NULL"
limits:
  count: 100
```

Clone the repository and set up a Python environment:
```shell
git clone https://github.com/commoncrawl/cc-index-annotations.git
cd cc-index-annotations
python -m venv .venv
source .venv/bin/activate  # on Windows WSL: same command
pip install -r requirements.txt
```

Navigate to the Curlie example and download the host index path list:
```shell
cd examples/curlie
# Download the list of host index parquet files (~2KB)
curl -L -o host-index-paths.gz \
  "https://data.commoncrawl.org/projects/host-index-testing/v2.paths.gz"
# Symlink the core scripts (or copy them)
ln -s ../../annotate.py ../../duck_utils.py ../../utils.py .
```

Now run the conversion and query:
```shell
# Convert Curlie TSV → parquet (takes ~2 minutes)
python curlie-convert.py
# Query: find cooking sites across all languages in a crawl
python annotate.py left_host_index.yaml join_curlie.yaml action_cooking.yaml
```

This outputs a CSV of every cooking-related domain that appears in the CC-MAIN-2024-51 crawl, with its Curlie category and language. 🎉
To share your annotation with the world, you need:
- The parquet file — host it somewhere accessible (S3, a web server, GitHub Releases)
- A join YAML — so others can plug it into their queries
- A README — what the data is, where it came from, what license it's under
See creating-annotations.md for the full spec, and yaml-reference.md for all YAML options.
| What | File | Purpose |
|---|---|---|
| Common Crawl index | `left_*.yaml` | The big table (billions of hosts/URLs) |
| Your dataset | `join_*.yaml` | Your annotation data as parquet |
| Your query | `action_*.yaml` | What columns to output, what to filter |
| Join type | Use when |
|---|---|
| `surt_host_name` | Your data is per-domain (e.g. "example.com is a news site") |
| `url_surtkey` | Your data is per-URL (e.g. "this specific page is about cooking") |
| `url_host_registered_domain` | Joining the URL index with domain-level data (asymmetric join) |
- Start small: use `count: 1` in limits to test with one parquet file before querying all crawls
- Debug with CSV: add `-d` to your fetch script to also write a CSV you can open in a spreadsheet
- Stack annotations: pass multiple join YAMLs to combine datasets in one query
- Use `prefix:`: when stacking, add `prefix: curlie_` to avoid column name collisions
- Check `examples/`: there are over 20 working examples covering spam lists, university rankings, fact-checkers, popularity rankings, and more
This project stands on the shoulders of some great open source tools:
- DuckDB — the in-process analytical database that makes all the SQL magic work, including reading remote parquet files over HTTP
- SURT — Internet Archive's Sort-friendly URI Rewriting Transform library, used to convert domains and URLs into the key format Common Crawl's index uses
- Apache Arrow / PyArrow — the columnar memory format and Parquet I/O that keeps everything fast
- Curlie.org — the largest human-edited web directory, used as the example throughout this tutorial. Originally the Open Directory Project (DMOZ), kept alive by volunteer editors since 2017