Wikipedia Year-Topics

Wikipedia publishes year pages at predictable URLs, e.g. https://en.wikipedia.org/wiki/1969

These pages list notable historical events that should be added to the Temporal Finetuning dataset.

Overview

The extract_year_topics.py script extracts historical events from Wikipedia year pages (e.g., https://en.wikipedia.org/wiki/2020) and enriches them with article references for use in the Temporal Finetuning dataset.

Key Features:

  • Fetches year pages via the Wikipedia API and parses the HTML structure
  • Extracts dated events from the "Events" section with full date parsing (year, month, day)
  • Captures direct Wikipedia article links embedded in event text as primary references
  • Searches the local Wikipedia MCP server to find additional related articles using hybrid search
  • Calculates relevance scores combining title similarity, search ranking, and position
  • Deduplicates all article references and looks up article IDs via the MCP server
  • Stores enriched event data as JSON files in ${WIKI_DATA}/topics/ for downstream tools

Solution

Architecture

The solution consists of a Python script extract_year_topics.py that:

  1. Year Page Retrieval

    • Fetches Wikipedia year pages (e.g., "151", "1990", ..., "2025") from the Wikipedia API
    • Uses the Wikipedia API's action=parse endpoint to get clean HTML content
    • Returns an error if the API call fails (no local fallback)
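
Step 1 might look like the following sketch using the Wikipedia `action=parse` endpoint; the function names and User-Agent string below are illustrative, not the script's actual API:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def build_parse_params(year: int) -> dict:
    """Query parameters for action=parse; formatversion=2 returns
    the rendered HTML as a plain string instead of a nested object."""
    return {
        "action": "parse",
        "page": str(year),
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    }

def fetch_year_page(year: int) -> str:
    """Fetch the rendered HTML of a Wikipedia year page, raising on failure."""
    resp = requests.get(
        API_URL,
        params=build_parse_params(year),
        headers={"User-Agent": "year-topics-sketch/0.1"},  # illustrative UA
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    if "error" in data:
        # No local fallback: surface the API error to the caller
        raise RuntimeError(f"Wikipedia API error: {data['error']}")
    return data["parse"]["text"]
```
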
  2. Topic Extraction

    • Parses the HTML structure to identify the Events section
    • Extracts date-topic pairs from list items in the Events section
    • Uses regex patterns to parse dates in various formats:
      • "January 1 – Event description"
      • "January 1-3 – Event description"
      • "January – Event description"
    • Captures both the specific date and the event description
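
A minimal sketch of step 2's date parsing, assuming each list item has already been flattened to plain text (the script's actual patterns may differ):

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
MONTH_NUM = {name: i + 1 for i, name in enumerate(MONTHS.split("|"))}

# Patterns for the formats listed above; a day range such as
# "January 1-3" keeps only the start day.
DATE_PATTERNS = [
    re.compile(rf"^(?P<month>{MONTHS})\s+(?P<day>\d{{1,2}})"
               rf"(?:\s*[-–]\s*\d{{1,2}})?\s*[–—-]\s*(?P<rest>.+)$"),
    re.compile(rf"^(?P<month>{MONTHS})\s*[–—-]\s*(?P<rest>.+)$"),
]

def parse_event_line(line: str):
    """Return {month, day, topic} for a dated event line, or None."""
    for pattern in DATE_PATTERNS:
        m = pattern.match(line.strip())
        if m:
            groups = m.groupdict()
            return {
                "month": MONTH_NUM[groups["month"]],
                "day": int(groups["day"]) if groups.get("day") else None,
                "topic": groups["rest"].strip(),
            }
    return None
```
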
  3. Direct Reference Extraction

    • Extracts Wikipedia article links directly from event text (e.g., links to people, places, events)
    • Looks up article IDs via keyword search on the MCP server
    • Assigns maximum relevance score (1.0) to direct links
    • Deduplicates links by title (case-insensitive)
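
Step 3 could be sketched with BeautifulSoup roughly as follows; the namespace filter (skipping any href containing a colon) is a simplification, since a few article titles do contain colons:

```python
from bs4 import BeautifulSoup

def extract_direct_references(event_html: str) -> list:
    """Collect unique Wikipedia article links from an event's HTML."""
    soup = BeautifulSoup(event_html, "html.parser")
    refs, seen = [], set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Keep only article links; skipping hrefs with a colon also drops
        # namespaced pages such as /wiki/File:... or /wiki/Category:...
        if not href.startswith("/wiki/") or ":" in href:
            continue
        title = a.get("title") or a.get_text(strip=True)
        if title.lower() in seen:  # case-insensitive dedup
            continue
        seen.add(title.lower())
        refs.append({
            "title": title,
            "article_path": href[len("/wiki/"):],
            "href": href,
            "source": "direct_link",
            "relevance_score": 1.0,  # direct links get maximum relevance
        })
    return refs
```
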
  4. Related Article Search & Relevance Scoring

    • For each topic, uses the Wikipedia MCP server to perform hybrid search
    • Queries both keyword (BM25) and semantic (k-NN) search
    • Calculates a relevance score based on:
      • Search result ranking (position in results)
      • Search score from MCP server
      • Title similarity to topic text
      • Combined weighted score: 40% title similarity + 40% search score + 20% position
    • Filters out articles already in direct references (by title and article_id)
    • Deduplicates results and returns top N articles (default: 5) sorted by relevance
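
The weighting in step 4 can be illustrated as below. The script uses rapidfuzz for title similarity, but stdlib `difflib.SequenceMatcher` stands in here so the sketch is dependency-free; the exact position formula is an assumption:

```python
from difflib import SequenceMatcher

def relevance_score(topic_text: str, title: str,
                    search_score: float, rank: int, total: int) -> float:
    """Weighted combination: 40% title similarity + 40% search score
    + 20% position (earlier search results score higher)."""
    title_sim = SequenceMatcher(None, topic_text.lower(), title.lower()).ratio()
    position = 1.0 - rank / max(total, 1)
    return round(0.4 * title_sim + 0.4 * search_score + 0.2 * position, 3)
```
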
  5. Temporal Validation

    • After collecting direct references and related articles, cross-checks each article against the temporal augmentation data stored in PostgreSQL (via the MCP server's POST /mcp/temporal endpoint)
    • For year X, any article whose earliest_date year is greater than X is excluded — this prevents anachronistic references (e.g., an article about COVID-19 appearing in a 2018 topic)
    • Articles without temporal augmentation data are kept (no filtering applied)
    • Results are cached in-memory so each article ID is only queried once across all topics
    • The number of temporally excluded articles is reported in the processing summary
  6. Data Storage

    • Creates output directory: ${WIKI_DATA}/topics/
    • Stores results in JSON format: ${WIKI_DATA}/topics/year_topics_YYYY.json
    • Each file contains:
      • Year metadata (year, extracted_date, source, total_topics)
      • Array of topics with:
        • Date fields (date, year, month, day, date_text)
        • Topic description
        • Array of direct_references (links found in event text):
          • Article title
          • Article path and href
          • Article ID (looked up via MCP server)
          • Relevance score (always 1.0 for direct links)
          • Source: "direct_link"
        • Array of related_articles (found via search):
          • Article title
          • Article ID
          • Relevance score (0-1)
          • Search score
          • Title similarity score
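
Step 6 can be sketched as a small writer whose output matches the data format shown below (the function name and argument handling are illustrative):

```python
import json
import os
from datetime import datetime, timezone

def save_year_topics(year: int, topics: list, wiki_data: str = None) -> str:
    """Write enriched topics for one year to ${WIKI_DATA}/topics/."""
    out_dir = os.path.join(wiki_data or os.environ.get("WIKI_DATA", "."),
                           "topics")
    os.makedirs(out_dir, exist_ok=True)
    payload = {
        "year": year,
        "extracted_date": datetime.now(timezone.utc)
                                  .isoformat().replace("+00:00", "Z"),
        "source": "wikipedia_api",
        "total_topics": len(topics),
        "topics": topics,
    }
    path = os.path.join(out_dir, f"year_topics_{year}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
    return path
```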

Data Format Example

{
  "year": 2020,
  "extracted_date": "2026-03-04T13:17:17.540885Z",
  "source": "wikipedia_api",
  "total_topics": 288,
  "topics": [
    {
      "year": 2020,
      "date": "2020-01-01",
      "month": 1,
      "day": 1,
      "date_text": "January 1",
      "topic": "Flash floods struck Jakarta, Indonesia, killing 66 people in the worst flooding in over a decade.",
      "direct_references": [
        {
          "title": "2020 Jakarta floods",
          "article_path": "2020_Jakarta_floods",
          "href": "/wiki/2020_Jakarta_floods",
          "source": "direct_link",
          "relevance_score": 1.0,
          "article_id": 62718198
        },
        {
          "title": "Jakarta",
          "article_path": "Jakarta",
          "href": "/wiki/Jakarta",
          "source": "direct_link",
          "relevance_score": 1.0,
          "article_id": 16275
        },
        {
          "title": "Indonesia",
          "article_path": "Indonesia",
          "href": "/wiki/Indonesia",
          "source": "direct_link",
          "relevance_score": 1.0,
          "article_id": 14579
        }
      ],
      "related_articles": [
        {
          "title": "2007 Jakarta flood",
          "article_id": 9931233,
          "relevance_score": 0.298,
          "search_score": 0.03,
          "title_similarity": 0.243
        },
        {
          "title": "Floods in Jakarta",
          "article_id": 38269340,
          "relevance_score": 0.26,
          "search_score": 0.03,
          "title_similarity": 0.298
        }
      ]
    }
  ]
}

Runtime Estimates

Processing all 1,844 year pages (151–2025) took approximately 6.6 hours in a single uninterrupted run with 8 parallel workers against a local MCP server. Modern years take significantly longer due to having more events and article references to resolve.

| Year Range | Years | Avg Topics/Year | Avg Time/Year | Total Time |
|------------|------:|----------------:|--------------:|-----------:|
| 151–500    |   341 |               7 |          3.8s |     21 min |
| 501–1000   |   484 |               9 |          5.1s |     41 min |
| 1001–1500  |   497 |              18 |          8.4s |     70 min |
| 1501–1800  |   300 |              36 |         15.9s |     79 min |
| 1801–1900  |   100 |              66 |         27.4s |     46 min |
| 1901–1950  |    50 |             130 |         54.6s |     46 min |
| 1951–2000  |    50 |             155 |         69.1s |     58 min |
| 2001–2025  |    25 |             140 |         81.6s |     34 min |

Total topics extracted: ~50,500 across all years. Some years are missing from Wikipedia (31 gaps, mostly in the 300–1600 range).

Use --resume to restart an interrupted run without re-processing completed years.
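
A sketch of the --resume check (the actual implementation may differ): a year is skipped when its output file already exists.

```python
import os

def should_skip(year: int, out_dir: str, resume: bool) -> bool:
    """With --resume, skip years whose output file already exists."""
    return resume and os.path.exists(
        os.path.join(out_dir, f"year_topics_{year}.json"))
```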

Usage

# Source the DeepRedAI environment (sets WIKI_DATA and other paths)
source deepred-env.sh

# Or set WIKI_DATA manually if not using deepred-env.sh:
# export WIKI_DATA="/mnt/data/wikipedia"

# Extract topics for a specific year
python scripts/extract_year_topics.py --year 1990

# Extract topics for a range of years
python scripts/extract_year_topics.py --start-year 1900 --end-year 2025

# Resume an interrupted range (skip already-saved years)
python scripts/extract_year_topics.py --start-year 1900 --end-year 2025 --resume

# Dry-run: extract topics from HTML only, no MCP lookups
python scripts/extract_year_topics.py --year 2020 --dry-run

# Adjust number of related articles per topic (default: 5)
python scripts/extract_year_topics.py --year 2020 --max-articles 10

# Set number of parallel workers for MCP lookups (default: 8)
python scripts/extract_year_topics.py --year 2020 --workers 4

# Override output directory (default: $WIKI_DATA/topics/)
python scripts/extract_year_topics.py --year 2020 --output-dir /tmp/topics

# Use verbose output for debugging
python scripts/extract_year_topics.py --year 2020 --verbose

# Save raw HTML for debugging analysis
python scripts/extract_year_topics.py --year 2020 --save-html

Dependencies

Required Packages

  • requests - HTTP client for Wikipedia API and MCP server calls
  • beautifulsoup4 - HTML parsing
  • rapidfuzz - String similarity scoring (faster alternative to fuzzywuzzy)
  • tqdm - Progress bars (optional, recommended)
  • A running local Wikipedia MCP server at http://localhost:7000 by default (a runtime service, not a pip package)

Installation Steps

  1. Set environment configuration variables:
# Source the DeepRedAI environment (recommended — sets WIKI_DATA automatically)
source deepred-env.sh

# Or set paths manually:
# export WIKI_DATA="/mnt/data/wikipedia"

# Optional: Override MCP host/port if not using defaults
# export MCP_HOST="localhost"
# export MCP_PORT="7000"
  2. Install Python dependencies:

Install packages within the DeepRedAI virtual environment:

# Activate the DeepRedAI virtual environment (already done by deepred-env.sh)
source ${DEEPRED_VENV}/bin/activate

# Install required packages
pip install requests beautifulsoup4 rapidfuzz tqdm
  3. Verify the MCP server is running:
# Check if MCP server is accessible
curl http://localhost:7000/health

See Wikipedia MCP for more information.

  4. Verify installation:
# Test with a single year
python scripts/extract_year_topics.py --year 2020 --verbose

# Check the output
ls -lh ${WIKI_DATA}/topics/
jq '.topics[0]' ${WIKI_DATA}/topics/year_topics_2020.json