Wikipedia publishes year pages at URLs such as https://en.wikipedia.org/wiki/1969. These pages list notable events that should be added to the Temporal Finetuning dataset.

The extract_year_topics.py script extracts historical events from Wikipedia year pages (e.g., https://en.wikipedia.org/wiki/2020) and enriches them with article references for use in the Temporal Finetuning dataset.
Key Features:
- Fetches year pages via the Wikipedia API and parses the HTML structure
- Extracts dated events from the "Events" section with full date parsing (year, month, day)
- Captures direct Wikipedia article links embedded in event text as primary references
- Searches the local Wikipedia MCP server to find additional related articles using hybrid search
- Calculates relevance scores combining title similarity, search ranking, and position
- Deduplicates all article references and looks up article IDs via the MCP server
- Stores enriched event data as JSON files in `${WIKI_DATA}/topics/` for downstream tools
The solution consists of a Python script, extract_year_topics.py, that performs the following steps:

**Year Page Retrieval**
- Fetches Wikipedia year pages (e.g., "151", "1990", ..., "2025") from the Wikipedia API
- Uses the Wikipedia API's `action=parse` endpoint to get clean HTML content
- Returns an error if the API call fails (no local fallback)
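The fetch step can be sketched as follows. This is an illustrative sketch, not the script's actual code: the function names are assumptions, while the `action=parse` parameters are the standard MediaWiki ones.

```python
# Sketch of the year-page fetch via the MediaWiki action API.
# Function names are illustrative, not the script's actual API.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def build_parse_params(year: int) -> dict:
    """Parameters for the action=parse endpoint, returning rendered HTML as JSON."""
    return {
        "action": "parse",
        "page": str(year),
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    }

def fetch_year_html(year: int) -> str:
    """Fetch the rendered HTML for a Wikipedia year page (e.g. '1990')."""
    resp = requests.get(API_URL, params=build_parse_params(year), timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if "error" in data:
        # No local fallback: surface the API failure to the caller
        raise RuntimeError(f"Wikipedia API error for {year}: {data['error']}")
    return data["parse"]["text"]
```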
**Topic Extraction**
- Parses the HTML structure to identify the Events section
- Extracts date-topic pairs from list items in the Events section
- Uses regex patterns to parse dates in various formats:
- "January 1 – Event description"
- "January 1-3 – Event description"
- "January – Event description"
- Captures both the specific date and the event description
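The three date formats above can be handled with a few ordered regex patterns. This is a simplified sketch under the assumption that patterns are tried most-specific first; the real script may cover more variants.

```python
# Sketch of the date-parsing regexes; pattern ordering matters
# (day-range before single-day before month-only).
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

DATE_PATTERNS = [
    # "January 1-3 – Event description" (day range: keep the first day)
    re.compile(rf"^({MONTHS})\s+(\d{{1,2}})\s*[-–]\s*\d{{1,2}}\s*[–—-]\s*(.+)$"),
    # "January 1 – Event description"
    re.compile(rf"^({MONTHS})\s+(\d{{1,2}})\s*[–—-]\s*(.+)$"),
    # "January – Event description" (month only, no day)
    re.compile(rf"^({MONTHS})\s*[–—-]\s*(.+)$"),
]

def parse_event(text):
    """Return (month, day, description); day is None for month-only entries."""
    for i, pat in enumerate(DATE_PATTERNS):
        m = pat.match(text.strip())
        if m:
            if i == 2:
                return m.group(1), None, m.group(2)
            return m.group(1), int(m.group(2)), m.group(3)
    return None
```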
**Direct Reference Extraction**
- Extracts Wikipedia article links directly from event text (e.g., links to people, places, events)
- Looks up article IDs via keyword search on the MCP server
- Assigns maximum relevance score (1.0) to direct links
- Deduplicates links by title (case-insensitive)
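Link extraction and deduplication can be sketched with BeautifulSoup (one of the script's dependencies). The function name and the exact namespace filter are assumptions; the article-ID lookup via the MCP server is omitted here.

```python
# Sketch of direct-link extraction from an event list item's HTML.
# Helper name is illustrative; article_id lookup via MCP is omitted.
from bs4 import BeautifulSoup

def extract_direct_links(li_html):
    """Pull /wiki/ article links from event HTML, deduplicated by title (case-insensitive)."""
    soup = BeautifulSoup(li_html, "html.parser")
    seen, refs = set(), []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Skip non-article namespaces (File:, Help:, citations, etc.)
        if not href.startswith("/wiki/") or ":" in href:
            continue
        title = a.get("title") or a.get_text(strip=True)
        if title.lower() in seen:
            continue
        seen.add(title.lower())
        refs.append({
            "title": title,
            "article_path": href.removeprefix("/wiki/"),
            "href": href,
            "source": "direct_link",
            "relevance_score": 1.0,  # direct links always get maximum relevance
        })
    return refs
```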
**Related Article Search & Relevance Scoring**
- For each topic, uses the Wikipedia MCP server to perform hybrid search
- Queries both keyword (BM25) and semantic (k-NN) search
- Calculates a relevance score based on:
- Search result ranking (position in results)
- Search score from MCP server
- Title similarity to topic text
- Combined weighted score: 40% title similarity + 40% search score + 20% position
- Filters out articles already in direct references (by title and article_id)
- Deduplicates results and returns top N articles (default: 5) sorted by relevance
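The weighted score can be sketched as below. For a dependency-free illustration this uses stdlib difflib in place of rapidfuzz for title similarity, and the rank normalization is an assumption; the weights are the 40/40/20 split described above.

```python
# Sketch of the combined relevance score: 40% title similarity +
# 40% search score + 20% position. Uses difflib instead of rapidfuzz
# for illustration; normalization details are assumptions.
from difflib import SequenceMatcher

def relevance_score(topic, title, search_score, rank, total):
    """Weighted combination of title similarity, MCP search score, and rank position."""
    title_sim = SequenceMatcher(None, topic.lower(), title.lower()).ratio()  # 0-1
    position = 1.0 - (rank / max(total, 1))  # earlier results score higher
    return round(0.4 * title_sim + 0.4 * search_score + 0.2 * position, 3)
```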
**Temporal Validation**
- After collecting direct references and related articles, cross-checks each article against the temporal augmentation data stored in PostgreSQL (via the MCP server's `POST /mcp/temporal` endpoint)
- For year X, any article whose `earliest_date` year is greater than X is excluded; this prevents anachronistic references (e.g., an article about COVID-19 appearing in a 2018 topic)
- Articles without temporal augmentation data are kept (no filtering applied)
- Results are cached in-memory so each article ID is only queried once across all topics
- The number of temporally excluded articles is reported in the processing summary
**Data Storage**
- Creates output directory: `${WIKI_DATA}/topics/`
- Stores results in JSON format: `${WIKI_DATA}/topics/year_topics_YYYY.json`
- Each file contains:
- Year metadata (year, extracted_date, source, total_topics)
- Array of topics with:
- Date fields (date, year, month, day, date_text)
- Topic description
- Array of `direct_references` (links found in event text):
  - Article title
  - Article path and href
  - Article ID (looked up via MCP server)
  - Relevance score (always 1.0 for direct links)
  - Source: "direct_link"
- Array of `related_articles` (found via search):
  - Article title
  - Article ID
  - Relevance score (0-1)
  - Search score
  - Title similarity score
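The write step can be sketched as follows. The payload structure mirrors the fields listed above; the function name and timestamp formatting are assumptions.

```python
# Sketch of the output write step; function name is illustrative.
import json
import os
from datetime import datetime, timezone
from pathlib import Path

def save_year_topics(year, topics, source="wikipedia_api"):
    """Write enriched topics to ${WIKI_DATA}/topics/year_topics_YYYY.json."""
    out_dir = Path(os.environ["WIKI_DATA"]) / "topics"
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "year": year,
        # UTC timestamp with trailing Z, matching the example output
        "extracted_date": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "source": source,
        "total_topics": len(topics),
        "topics": topics,
    }
    path = out_dir / f"year_topics_{year}.json"
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))
    return path
```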
Example output:
{
"year": 2020,
"extracted_date": "2026-03-04T13:17:17.540885Z",
"source": "wikipedia_api",
"total_topics": 288,
"topics": [
{
"year": 2020,
"date": "2020-01-01",
"month": 1,
"day": 1,
"date_text": "January 1",
"topic": "Flash floods struck Jakarta, Indonesia, killing 66 people in the worst flooding in over a decade.",
"direct_references": [
{
"title": "2020 Jakarta floods",
"article_path": "2020_Jakarta_floods",
"href": "/wiki/2020_Jakarta_floods",
"source": "direct_link",
"relevance_score": 1.0,
"article_id": 62718198
},
{
"title": "Jakarta",
"article_path": "Jakarta",
"href": "/wiki/Jakarta",
"source": "direct_link",
"relevance_score": 1.0,
"article_id": 16275
},
{
"title": "Indonesia",
"article_path": "Indonesia",
"href": "/wiki/Indonesia",
"source": "direct_link",
"relevance_score": 1.0,
"article_id": 14579
}
],
"related_articles": [
{
"title": "2007 Jakarta flood",
"article_id": 9931233,
"relevance_score": 0.298,
"search_score": 0.03,
"title_similarity": 0.243
},
{
"title": "Floods in Jakarta",
"article_id": 38269340,
"relevance_score": 0.26,
"search_score": 0.03,
"title_similarity": 0.298
}
]
}
]
}

Processing all 1,844 year pages (151–2025) took approximately 6.6 hours in a single uninterrupted run with 8 parallel workers against a local MCP server. Modern years take significantly longer because they have more events and article references to resolve.
| Year Range | Years | Avg Topics/Year | Avg Time/Year | Total Time |
|---|---|---|---|---|
| 151–500 | 341 | 7 | 3.8s | 21 min |
| 501–1000 | 484 | 9 | 5.1s | 41 min |
| 1001–1500 | 497 | 18 | 8.4s | 70 min |
| 1501–1800 | 300 | 36 | 15.9s | 79 min |
| 1801–1900 | 100 | 66 | 27.4s | 46 min |
| 1901–1950 | 50 | 130 | 54.6s | 46 min |
| 1951–2000 | 50 | 155 | 69.1s | 58 min |
| 2001–2025 | 25 | 140 | 81.6s | 34 min |
Total topics extracted: ~50,500 across all years. Some years are missing from Wikipedia (31 gaps, mostly in the 300–1600 range).
Use `--resume` to restart an interrupted run without re-processing completed years.
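The resume check can be sketched as a simple existence test on the output file; the helper name is illustrative and the exact skip criterion is an assumption.

```python
# Sketch of the --resume skip check: a year counts as complete
# if its output file already exists. Helper name is illustrative.
from pathlib import Path

def already_done(year, out_dir):
    """True if year_topics_YYYY.json already exists in the output directory."""
    return (Path(out_dir) / f"year_topics_{year}.json").exists()
```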
# Source the DeepRedAI environment (sets WIKI_DATA and other paths)
source deepred-env.sh
# Or set WIKI_DATA manually if not using deepred-env.sh:
# export WIKI_DATA="/mnt/data/wikipedia"
# Extract topics for a specific year
python scripts/extract_year_topics.py --year 1990
# Extract topics for a range of years
python scripts/extract_year_topics.py --start-year 1900 --end-year 2025
# Resume an interrupted range (skip already-saved years)
python scripts/extract_year_topics.py --start-year 1900 --end-year 2025 --resume
# Dry-run: extract topics from HTML only, no MCP lookups
python scripts/extract_year_topics.py --year 2020 --dry-run
# Adjust number of related articles per topic (default: 5)
python scripts/extract_year_topics.py --year 2020 --max-articles 10
# Set number of parallel workers for MCP lookups (default: 8)
python scripts/extract_year_topics.py --year 2020 --workers 4
# Override output directory (default: $WIKI_DATA/topics/)
python scripts/extract_year_topics.py --year 2020 --output-dir /tmp/topics
# Use verbose output for debugging
python scripts/extract_year_topics.py --year 2020 --verbose
# Save raw HTML for debugging analysis
python scripts/extract_year_topics.py --year 2020 --save-html

Dependencies:
- `requests` - HTTP client for Wikipedia API and MCP server calls
- `beautifulsoup4` - HTML parsing
- `rapidfuzz` - String similarity scoring (faster alternative to fuzzywuzzy)
- `tqdm` - Progress bars (optional, recommended)
- Access to the local Wikipedia MCP server (default: http://localhost:7000)
- Set environment configuration variables:
# Source the DeepRedAI environment (recommended — sets WIKI_DATA automatically)
source deepred-env.sh
# Or set paths manually:
# export WIKI_DATA="/mnt/data/wikipedia"
# Optional: Override MCP host/port if not using defaults
# export MCP_HOST="localhost"
# export MCP_PORT="7000"

- Install Python dependencies:
Install packages within the DeepRedAI virtual environment:
# Activate the DeepRedAI virtual environment (already done by deepred-env.sh)
source ${DEEPRED_VENV}/bin/activate
# Install required packages
pip install requests beautifulsoup4 rapidfuzz tqdm

- Verify MCP server is running:
# Check if MCP server is accessible
curl http://localhost:7000/health

See Wikipedia MCP for more information.
- Verify installation:
# Test with a single year
python scripts/extract_year_topics.py --year 2020 --verbose
# Check the output
ls -lh ${WIKI_DATA}/topics/
cat ${WIKI_DATA}/topics/year_topics_2020.json | jq '.topics[0]'