Examples

Annotation examples

| Directory | Level | Description |
| --- | --- | --- |
| curlie/ | host | Curlie.org human-edited web directory: 2.9M entries, 90+ languages, 770K categories |
| slashtag/ | host | blekko human-curated topic categories: 120K+ domains, 1,280 tags |
| slashtag-url/ | url | Same as slashtag but at URL granularity |
| external-data/ | host | Ready-to-use external datasets, no fetch scripts needed (see below) |
| fineweb-edu/ | host | Educational quality scores from HuggingFace FineWeb-Edu (Llama3-70B rated, 0-5 scale) |
| gneissweb/ | host | Data Prep Kit's Gneissweb topic classification scores (technology, science, education, medical) |
| gneissweb-url/ | url | Same as gneissweb but at URL granularity |
| spam-abuse/ | host | Malware, phishing, and abuse flags from URLhaus, PhishTank, OpenPhish, and UT1 |
| tranco-top1m/ | host | Domain popularity ranking from Tranco |
| university-ranking/ | host | University identification (Hipo) and world rankings (CWUR 2025) |
| university-ranking-url/ | url | Same as university-ranking but at URL granularity |
| web-graph/ | host | Link metrics (outdegree, indegree) from Common Crawl's Web Graphs |
| web-graph-wikipedia/ | host | Multi-join example combining web-graph + wikipedia-spam |
| wikipedia/categories/ | host | Website classification from English Wikipedia categories (fact-checking, fake news, satirical, etc.) |
| wikipedia/categories-intl/ | host | Same as categories but across all language Wikipedias via langlinks auto-discovery |
| wikipedia/perennial/ | host | Wikipedia's Reliable Sources |
| wikipedia/spam/ | host | Spam and URL shortener flags from Wikipedia's spam list |

External data sources (external-data/)

Pre-built YAML files that stream external datasets directly at query time, so no local downloads are needed. They can be stacked as extra columns on any query.

| YAML | Source | Domains | License |
| --- | --- | --- | --- |
| join_tranco.yaml | Tranco top sites ranking | ~5.4M | CC BY-SA/BY-NC 4.0 |
| join_majestic_million.yaml | Majestic top 1M by referring subnets | 1M | CC BY 3.0 |
| join_cisa_gov_domains.yaml | CISA US .gov domain registry | ~12.6K | Public domain |
| join_gsa_nongov_federal.yaml | GSA US federal non-.gov domains | ~400 | Public domain |
| join_ifcn_factcheckers.yaml | IFCN verified fact-checkers | ~167 | Public |
| join_misinfo_domains.yaml | Lasser et al. misinformation domains | ~4.8K | CC BY-SA 4.0 |
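
To make "stacked as extra columns" concrete, here is a toy sketch of the idea in plain Python. It is not how annotate.py is implemented (this README does not describe its internals); the function name `stack_columns`, the column names, and all values are made up for illustration.

```python
# Toy illustration only: hosts, column names, and values are invented.
def stack_columns(left_rows, joins, key="host"):
    """Left-join each lookup table onto the rows, adding its columns."""
    out = []
    for row in left_rows:
        merged = dict(row)
        for lookup in joins:
            # A host missing from a source gains no columns from it.
            merged.update(lookup.get(row[key], {}))
        out.append(merged)
    return out


left = [{"host": "example.com"}, {"host": "example.org"}]
popularity = {"example.com": {"rank": 10}}          # stands in for a ranking source
gov_registry = {"example.org": {"is_gov": False}}   # stands in for a registry source
result = stack_columns(left, [popularity, gov_registry])
```

Each extra source contributes columns only for the hosts it knows about, and the left-hand rows are never dropped.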

Quick start

From the project root, fetch dependencies for an example:

make web-graph

Then run a query:

cd examples/web-graph
python annotate.py left_web_host_index.yaml join_web_outin.yaml action_surt_host_name.yaml commoncrawl.org

All examples follow the same pattern: python annotate.py <left.yaml> <join.yaml> [join.yaml ...] <action.yaml> [args]. See docs/yaml-reference.md for the full YAML spec.
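
As a rough sketch of that pattern, the helper below assembles the argument list in the documented order. `build_annotate_command` is a hypothetical name, not part of the project, and it only builds the command without running it; the file names are the ones from the quick start above.

```python
import shlex


def build_annotate_command(left, joins, action, args=()):
    """Assemble: python annotate.py <left> <join> [join ...] <action> [args]."""
    if not joins:
        # The documented pattern lists at least one join YAML.
        raise ValueError("at least one join YAML is required")
    return ["python", "annotate.py", left, *joins, action, *args]


cmd = build_annotate_command(
    "left_web_host_index.yaml",
    ["join_web_outin.yaml"],
    "action_surt_host_name.yaml",
    ["commoncrawl.org"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the example directory; additional join YAML files go into the `joins` list in the order they should be applied.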