The annotation system uses three types of YAML configuration files, passed as arguments to annotate.py:
python annotate.py <left.yaml> [join.yaml ...] <action.yaml> [args ...]
All YAML files after the first and before the action are treated as join configurations. The last YAML file is always the action. Any arguments after the action YAML are passthrough arguments available as template variables.
Defines the "left" table — typically a Common Crawl host index or URL index.
| Key | Type | Required | Description |
|---|---|---|---|
table.local |
string | one of four | Path to a local .parquet file, a directory containing .parquet files, or a glob pattern (e.g. ./data-*.parquet) |
table.web_prefix |
string | one of four | HTTP(S) URL prefix (e.g. https://data.commoncrawl.org/) |
table.s3_prefix |
string | one of four | S3 URI prefix (e.g. s3://commoncrawl/) |
table.source |
dict | one of four | Direct URL to a CSV, parquet, or JSON file (see Source YAML below) |
table.paths |
string | required for web/s3 | Path to a file listing relative parquet paths. Supports .gz compressed files |
limits.grep |
list of strings | optional | Keep only paths containing any of these substrings (OR logic) |
limits.count |
integer | optional | Keep only the first N paths after grep filtering |
Exactly one of local, web_prefix, s3_prefix, or source must be specified.
When using local without paths:
- If it points to a file, that file is used directly
- If it points to a directory, all
.parquetfiles in it (and subdirectories) are auto-discovered - If it contains glob characters (
*,?), the pattern is expanded
# Local directory — auto-discovers all .parquet files
table:
local: /home/cc-pds/commoncrawl/projects/host-index-testing/v2/
limits:
grep: ["CC-MAIN-2025", "CC-MAIN-2024"]
count: 1# Local file — single parquet
table:
local: ./curlie.parquet# Local glob pattern
table:
local: ./data-*.parquet# HTTPS
table:
web_prefix: https://data.commoncrawl.org/
paths: host-index-paths.gz
limits:
grep: [CC-MAIN-2021-49]# S3
table:
s3_prefix: s3://commoncrawl/
paths: host-index-paths.gz
limits:
grep: ["CC-MAIN-2025", "CC-MAIN-2024"]
count: 1Defines a right-side table to join against the left table (or a previous join result). Multiple join YAMLs can be chained — they are processed in order, each joining against the result of the previous.
| Key | Type | Required | Description |
|---|---|---|---|
table.* |
yes | Same table and limits options as Left YAML (see above) |
|
limits.* |
optional | Same limits options as Left YAML |
|
right_columns |
list of strings | yes | Columns to select from the right table into the joined result |
join_columns |
list or dict | yes | Column(s) to join on. A list when names match in both tables, or a dict with left/right keys when column names differ (see examples) |
prefix |
string | optional | Prefix added to all right_columns in the result (e.g. tranco_ turns rank into tranco_rank). Useful when stacking multiple joins to avoid column name collisions |
join_type |
string | optional | OUTER (default) or INNER |
OUTER produces a LEFT OUTER JOIN — rows in the left table without a match in the right table are kept, with NULLs for the right columns. INNER drops unmatched rows from both sides.
# Host-level join on surt_host_name only
table:
local: ./
right_columns:
- wikipedia_spam
- wikipedia_shortener
join_columns:
- surt_host_name# Host-level join on surt_host_name + crawl, from S3
table:
s3_prefix: s3://commoncrawl/
paths: web-graph-outin-paths.gz
right_columns:
- webgraph_outdegree
- webgraph_indegree
join_columns:
- surt_host_name
- crawl# URL-level join
table:
web_prefix: https://data.commoncrawl.org/
paths: paths.urls.txt.gz
limits:
grep: [CC-MAIN-2020-05]
right_columns:
- gneissweb_technology
- gneissweb_science
- gneissweb_education
- gneissweb_medical
join_columns:
- url_surtkey
- crawl
- fetch_time# Asymmetric join — column names differ between left and right (single column)
table:
local: ./
right_columns:
- is_university
- country
- university_name
join_columns:
left: url_host_registered_domain
right: domain# Asymmetric join — multiple columns with different names
table:
local: ./
right_columns:
- score
join_columns:
left: [url_host_registered_domain, crawl]
right: [domain, crawl_id]join_columns supports two formats:
-
List — when column names are the same in both tables:
join_columns: - surt_host_name - crawl
-
Dict with left/right — when column names differ. Each side accepts a single string or a list. When using lists, columns are paired positionally:
join_columns: left: [url_host_registered_domain, crawl] right: [domain, crawl_id] # produces: left.url_host_registered_domain = right.domain # AND left.crawl = right.crawl_id
# Prefix to avoid column collisions when stacking multiple joins
table:
source:
url: https://downloads.majestic.com/majestic_million.csv
format: csv
prefix: majestic_
right_columns:
- GlobalRank
- RefSubNets
join_columns:
left: url_host_registered_domain
right: Domain
# Result columns: majestic_GlobalRank, majestic_RefSubNets# INNER join (only keep matched rows)
table:
web_prefix: https://data.commoncrawl.org/
paths: ../web-graph/web-graph-outin-paths.gz
limits:
grep: [CC-MAIN-2024-33]
join_type: INNER
right_columns:
- webgraph_outdegree
- webgraph_indegree
join_columns:
- surt_host_name
- crawlWhen multiple join YAMLs are provided, they are applied sequentially. Intermediate views are created as join_step_0, join_step_1, etc. The final join result is always named joined.
python annotate.py left.yaml join_webgraph.yaml join_wikipedia.yaml action.yaml
This produces: left → join with webgraph → join with wikipedia → joined view → action query.
Defines the SQL query to run against the final joined view.
| Key | Type | Required | Description |
|---|---|---|---|
sql |
string | yes | SQL template. Use {columns} and {where} as placeholders |
columns |
string | yes | Column list substituted into {columns} |
where |
string | yes | WHERE clause substituted into {where}. Supports template variables (see below) |
limits.count |
integer | optional | Maximum number of rows in the output. Appends LIMIT N to the query |
argv.surt_host_name |
boolean | optional | If true, passthrough arguments are converted to SURT format before substitution |
| Variable | Description |
|---|---|
{argv} |
The current passthrough argument (optionally SURT-converted) |
{and_tld} |
Auto-generated TLD filter, e.g. AND url_host_tld = 'org'. Only set when argv.surt_host_name: true |
{arg1}, {arg2}, ... |
Individual passthrough arguments by position |
When passthrough arguments are provided, the action runs once per argument. Output CSV files are named after the argument value (e.g. org,commoncrawl.csv). When no arguments are given, output goes to output.csv.
# Exact hostname lookup with SURT conversion
sql: "SELECT {columns} FROM joined WHERE {where} ORDER BY crawl ASC"
columns: "surt_host_name, crawl, hcrank10, webgraph_outdegree, webgraph_indegree"
where: "surt_host_name = '{argv}'{and_tld}"
argv:
surt_host_name: trueUsage: python annotate.py left.yaml join.yaml action.yaml commoncrawl.org
The argument commoncrawl.org is converted to SURT org,commoncrawl and {and_tld} becomes AND url_host_tld = 'org'.
# LIKE query for hostname prefix matching
sql: "SELECT {columns} FROM joined WHERE {where} ORDER BY hcrank10 DESC"
columns: "surt_host_name, crawl, hcrank10, webgraph_outdegree, webgraph_indegree"
where: "surt_host_name LIKE '{argv}%'{and_tld}"
argv:
surt_host_name: trueUsage: python annotate.py left.yaml join.yaml action.yaml .commoncrawl.org
# Fixed query, no arguments
sql: "SELECT {columns} FROM joined WHERE {where}"
columns: "surt_host_name, crawl, wikipedia_spam"
where: "wikipedia_spam = 1 AND crawl = 'CC-MAIN-2021-49'"# Multi-condition filter
sql: "SELECT {columns} FROM joined WHERE {where}"
columns: "surt_host_name, crawl, abuse_urlhaus_malware, abuse_ut1_malware"
where: "(abuse_urlhaus_malware = 1 OR abuse_ut1_malware = 1) AND crawl = 'CC-MAIN-2021-49'"table.source is the simplest way to use an external dataset — point directly at a URL. The format (CSV, parquet, JSON) is auto-detected from the file extension, or set explicitly. Works in both left and join position based on argument order.
This is the recommended method for distributing standalone annotations. A data provider can share a single YAML file and users can immediately join it against any Common Crawl index.
| Key | Type | Required | Description |
|---|---|---|---|
table.source.url |
string | one of two | URL or local path to a single file |
table.source.urls |
list | one of two | List of URLs/paths (loaded as a single table) |
table.source.format |
string | optional | csv, json, or parquet. Auto-detected from extension if omitted |
table.source.options |
dict | optional | Passed directly to DuckDB's read_csv(), read_json(), or read_parquet() |
Auto-detection: .csv/.tsv/.csv.gz → csv, .json/.jsonl/.ndjson → json, everything else → parquet.
# Tranco top-1M domain ranking (CSV without headers)
table:
source:
url: https://tranco-list.eu/download/L76X4/full
format: csv
options:
header: false
columns:
rank: INTEGER
domain: VARCHAR
right_columns:
- rank
join_columns:
left: url_host_registered_domain
right: domain# Local parquet file via source
table:
source:
url: ./my-annotation.parquet
right_columns:
- my_score
join_columns:
- surt_host_name# Multiple parquet files
table:
source:
urls:
- https://example.com/data-part1.parquet
- https://example.com/data-part2.parquet- Local and S3 globs work (
source.url: ./crawl=*/*.parquet) - HTTP URLs cannot glob — use
source.urlswith explicit paths, or useweb_prefix+pathsfor partitioned data table.sourceis best for single-file datasets; useweb_prefix/s3_prefixfor multi-partition CC infrastructure data
| Index type | Join columns | Description |
|---|---|---|
| Host index | surt_host_name |
Reversed hostname (e.g. com,example) |
| Host index | surt_host_name, crawl |
Hostname + crawl ID (e.g. CC-MAIN-2024-33) |
| URL index | url_surtkey, crawl, fetch_time |
Full SURT URL + crawl + timestamp |
File names are purely conventional — the system does not require any specific naming. The project uses this convention for clarity:
| Prefix | Purpose |
|---|---|
left_ |
Left table (index) configuration |
join_ |
Right table (annotation) join configuration |
action_ |
Query/action configuration |
Suffixes like _local, _web, _s3 indicate the data source type.
The table.paths key points to a file listing parquet file paths, one per line. These are relative to the prefix (web_prefix or s3_prefix). Lines are stripped of trailing whitespace.
Gzip-compressed files (.gz) are automatically decompressed. For local sources, glob wildcards (*) are supported in the paths value itself.
The system automatically configures DuckDB with:
- HTTP retries: 10 attempts, 2000ms wait between retries
- AWS credentials: auto-detected from credential chain, falls back to anonymous S3 access
- Object caching: enabled for HTTP sources
- Hive partitioning: enabled for all parquet reads
- Progress bar: enabled (disabled in CI)