diff --git a/README.md b/README.md
index 79fb7b910..6e4bf2318 100644
--- a/README.md
+++ b/README.md
@@ -22,6 +22,7 @@
 Full Documentation ·
+ Data Sources ·
 All Releases
diff --git a/docs/DATA_COLLECTION.md b/docs/DATA_COLLECTION.md
new file mode 100644
index 000000000..6aabf6633
--- /dev/null
+++ b/docs/DATA_COLLECTION.md
@@ -0,0 +1,1079 @@
+# World Monitor Data Collection Documentation
+
+This document provides comprehensive details about how the World Monitor application collects, processes, and presents data from over 150 external sources.
+
+## Table of Contents
+
+1. [Architecture Overview](#architecture-overview)
+2. [Data Sources](#data-sources)
+3. [Data Collection Methods](#data-collection-methods)
+4. [Refresh Intervals & Caching](#refresh-intervals--caching)
+5. [Authentication & API Keys](#authentication--api-keys)
+6. [Data Processing Pipeline](#data-processing-pipeline)
+7. [Reliability & Resilience](#reliability--resilience)
+
+---
+
+## Architecture Overview
+
+World Monitor employs a **distributed multi-source aggregation architecture** with the following components:
+
+### Infrastructure Layers
+
+1. **Vercel Edge Functions** (Primary)
+   - Server-side API proxies for secure key management
+   - CORS handling and rate limiting
+   - Cross-user caching via Upstash Redis
+   - Located in `/api` directory
+
+2. **Railway Relay Server**
+   - Real-time WebSocket streaming for AIS (vessel) and OpenSky (aircraft) data
+   - RSS feed proxy for sources blocked by Vercel IPs (UN News, CISA, etc.)
+   - Deployed separately from main application
+
+3. **Browser-Based Services**
+   - Client-side data aggregation and display
+   - Local caching and state management
+   - AI-powered analysis and classification
+   - Located in `/src/services` directory
+
+### Data Flow
+
+```
+External APIs/RSS → Vercel/Railway → Browser Services → UI Components
+                          ↓
+              Upstash Redis Cache (Cross-user)
+                          ↓
+              Browser localStorage (Per-user)
+```
+
+---
+
+## Data Sources
+
+### 1. Conflict & Geopolitical Events
+
+#### ACLED (Armed Conflict Location & Event Data)
+- **API**: `/api/acled` and `/api/acled-conflict`
+- **Data**: Protest events, conflict incidents, violence against civilians
+- **Coverage**: Global, 200+ countries
+- **Update Frequency**: Daily
+- **Authentication**: API access token required
+- **Source**: https://acleddata.com/
+- **License**: Free for researchers/non-commercial use
+
+#### UCDP (Uppsala Conflict Data Program)
+- **API**: `/api/ucdp` and `/api/ucdp-events`
+- **Data**: Organized violence, one-sided violence, non-state conflicts
+- **Coverage**: Global since 1989
+- **Update Frequency**: Daily for events, monthly for aggregates
+- **Authentication**: None (public API)
+- **Source**: https://ucdp.uu.se/
+
+#### GDELT (Global Database of Events, Language, and Tone)
+- **API**: `/api/gdelt-geo` and `/api/gdelt-doc`
+- **Data**: Geo-tagged events, document processing, tone analysis
+- **Coverage**: Real-time global events from news media
+- **Update Frequency**: Every 15 minutes
+- **Authentication**: None (public API)
+- **Source**: https://www.gdeltproject.org/
+
+#### UNHCR (UN Refugee Agency)
+- **API**: `/api/unhcr-population`
+- **Data**: Refugee populations, internally displaced persons, asylum seekers
+- **Coverage**: Global displacement data
+- **Update Frequency**: Quarterly with monthly updates for major crises
+- **Authentication**: None (public API)
+- **Source**: https://www.unhcr.org/
+- **License**: CC BY 4.0
+
+### 2. Real-Time Military & Aviation Tracking
+
+#### AISStream (Vessel Tracking)
+- **Connection**: WebSocket via Railway relay
+- **Data**: Live vessel positions, course, speed, ship details
+- **Coverage**: Global maritime traffic (AIS-equipped vessels)
+- **Update Frequency**: Real-time (every 10-60 seconds per vessel)
+- **Authentication**: API key required
+- **Source**: https://aisstream.io/
+- **Notes**: Uses Railway relay server to maintain persistent WebSocket connection
+
+#### OpenSky Network (Aircraft Tracking)
+- **API**: `/api/opensky` via Railway relay
+- **Data**: Aircraft positions, altitude, velocity, callsign
+- **Coverage**: Global air traffic with ADS-B coverage
+- **Update Frequency**: Real-time (every 10 seconds)
+- **Authentication**: OAuth2 (optional, for higher rate limits)
+- **Source**: https://opensky-network.org/
+- **License**: Free for non-commercial use
+
+#### Wingbits (Aircraft Enrichment)
+- **API**: Aircraft metadata enrichment
+- **Data**: Aircraft owner, operator, type, registration details
+- **Coverage**: Global aircraft registry
+- **Authentication**: API key required
+- **Source**: https://wingbits.com/
+- **Notes**: Used to enrich OpenSky data with ownership information
+
+#### FAA Status
+- **API**: `/api/faa-status`
+- **Data**: Airport delays, ground stops, flight restrictions
+- **Coverage**: US airports and airspace
+- **Update Frequency**: Every 5 minutes
+- **Authentication**: None (public API)
+- **Source**: https://www.faa.gov/
+
+### 3. Natural Disasters & Environmental Events
+
+#### GDACS (Global Disaster Alert and Coordination System)
+- **API**: Direct fetch to `https://www.gdacs.org/gdacsapi/api/events`
+- **Data**: Major disasters (earthquakes, floods, cyclones, droughts)
+- **Coverage**: Global, severity-based filtering
+- **Update Frequency**: Real-time alerts
+- **Authentication**: None (public API)
+- **Source**: https://www.gdacs.org/
+
+#### NASA EONET (Earth Observatory Natural Events Tracker)
+- **API**: Direct fetch to `https://eonet.gsfc.nasa.gov/api/v3/events`
+- **Data**: Wildfires, severe storms, volcanoes, floods, landslides, sea/lake ice
+- **Coverage**: Global satellite observations
+- **Update Frequency**: Near real-time (15-30 minute lag)
+- **Authentication**: None (public API)
+- **Source**: https://eonet.gsfc.nasa.gov/
+
+#### NASA FIRMS (Fire Information for Resource Management System)
+- **API**: `/api/firms-fires`
+- **Data**: Satellite fire detections (VIIRS SNPP/NOAA-20)
+- **Coverage**: Global, 375m resolution
+- **Update Frequency**: Near real-time (3-hour lag)
+- **Authentication**: API key required
+- **Source**: https://firms.modaps.eosdis.nasa.gov/
+
+#### USGS Earthquakes
+- **API**: `/api/earthquakes`
+- **Data**: Earthquake events magnitude 4.5+
+- **Coverage**: Global seismic network
+- **Update Frequency**: Near real-time (5-10 minute lag for M4.5+)
+- **Authentication**: None (public API)
+- **Source**: https://earthquake.usgs.gov/
+
+#### NWS (National Weather Service)
+- **API**: Direct fetch to `https://api.weather.gov/alerts/active`
+- **Data**: Severe weather alerts, warnings, watches
+- **Coverage**: United States
+- **Update Frequency**: Real-time
+- **Authentication**: None (public API)
+- **Source**: https://www.weather.gov/
+
+#### Open-Meteo (Climate Data)
+- **API**: Used by `/api/climate-anomalies`
+- **Data**: Temperature, precipitation, climate anomalies (via ERA5 reanalysis)
+- **Coverage**: Global gridded data
+- **Update Frequency**: Daily for historical, 5-day lag for ERA5
+- **Authentication**: None (public API, optional key for higher limits)
+- **Source**: https://open-meteo.com/
+- **Notes**: Processes Copernicus ERA5 data
+
+### 4. Financial Markets & Economics
+
+#### Finnhub
+- **API**: `/api/finnhub`
+- **Data**: Stock quotes, real-time prices, company data
+- **Coverage**: US and international equities
+- **Update Frequency**: Every 2 minutes (free tier: 60 calls/minute)
+- **Authentication**: API key required
+- **Source**: https://finnhub.io/
+- **License**: Free tier available
+
+#### Yahoo Finance
+- **API**: `/api/yahoo-finance`
+- **Data**: Stock prices, indices, historical data
+- **Coverage**: Global markets
+- **Update Frequency**: Every 2 minutes
+- **Authentication**: None (public API)
+- **Source**: https://finance.yahoo.com/
+
+#### CoinGecko
+- **API**: `/api/coingecko`
+- **Data**: Cryptocurrency prices, market cap, 24h changes
+- **Coverage**: Bitcoin, Ethereum, Solana, and major cryptocurrencies
+- **Update Frequency**: Every 2 minutes
+- **Authentication**: None (public API, free tier)
+- **Source**: https://www.coingecko.com/
+
+#### Polymarket
+- **API**: `/api/polymarket`
+- **Data**: Prediction market odds on political/world events
+- **Coverage**: Global event predictions
+- **Update Frequency**: Every 5 minutes
+- **Authentication**: None (public Gamma API)
+- **Source**: https://gamma-api.polymarket.com/
+- **Notes**: Shows real-money betting odds on outcomes
+
+#### FRED (Federal Reserve Economic Data)
+- **API**: `/api/fred-data`
+- **Data**: Economic indicators, interest rates, inflation, unemployment
+- **Coverage**: US economic data, 800,000+ time series
+- **Update Frequency**: Varies by series (daily to quarterly)
+- **Authentication**: API key required
+- **Source**: https://fred.stlouisfed.org/
+- **License**: Free for non-commercial use
+
+#### EIA (Energy Information Administration)
+- **API**: `/api/eia/*` endpoints
+- **Data**: Oil prices, petroleum production, inventory, natural gas
+- **Coverage**: US energy data
+- **Update Frequency**: Weekly for most series
+- **Authentication**: API key required
+- **Source**: https://www.eia.gov/
+- **License**: Public domain (US government)
+
+#### USA Spending
+- **API**: Direct fetch to `https://api.usaspending.gov/api/v2`
+- **Data**: US federal spending, defense contracts, grants
+- **Coverage**: All US government spending
+- **Update Frequency**: Daily
+- **Authentication**: None (public API)
+- **Source**: https://usaspending.gov/
+
+### 5. Internet Infrastructure & Outages
+
+#### Cloudflare Radar
+- **API**: `/api/cloudflare-outages`
+- **Data**: Internet outages, BGP events, traffic anomalies
+- **Coverage**: Global internet infrastructure
+- **Update Frequency**: Near real-time
+- **Authentication**: API token required (free Cloudflare account)
+- **Source**: https://radar.cloudflare.com/
+
+#### NGA Warnings (Undersea Cables)
+- **API**: `/api/nga-warnings`
+- **Data**: Cable repair warnings, navigation hazards
+- **Coverage**: Global undersea infrastructure
+- **Update Frequency**: As issued
+- **Authentication**: None (public notices)
+- **Source**: National Geospatial-Intelligence Agency
+
+### 6. Technology & Research
+
+#### arXiv
+- **API**: `/api/arxiv`
+- **Data**: Scientific papers, preprints (cs.AI category by default)
+- **Coverage**: Computer science, AI/ML research
+- **Update Frequency**: Every hour
+- **Authentication**: None (public API)
+- **Source**: https://arxiv.org/
+
+#### GitHub Trending
+- **API**: `/api/github-trending`
+- **Data**: Trending repositories by language and timeframe
+- **Coverage**: All public GitHub repositories
+- **Update Frequency**: Every 30 minutes
+- **Authentication**: None (web scraping)
+- **Source**: https://github.com/trending
+
+#### Hacker News
+- **API**: `/api/hackernews`
+- **Data**: Top stories, new stories, best stories
+- **Coverage**: Tech/startup news aggregation
+- **Update Frequency**: Every 5 minutes
+- **Authentication**: None (public Firebase API)
+- **Source**: https://news.ycombinator.com/
+
+### 7. Cybersecurity
+
+#### Cyber Threats API
+- **API**: `/api/cyber-threats`
+- **Data**: Command & control servers, malware hosts, phishing sites, malicious IPs
+- **Coverage**: Global threat intelligence
+- **Update Frequency**: Multiple times daily
+- **Authentication**: None (aggregates public threat feeds)
+- **Sources**:
+  - AlienVault OTX
+  - URLhaus (abuse.ch)
+  - ThreatFox (abuse.ch)
+  - Feodo Tracker
+  - PhishTank
+- **Notes**: Geo-locates threats for map display
+
+#### CISA Advisories
+- **Feed**: Via RSS proxy
+- **Data**: Cybersecurity alerts, advisories, vulnerability bulletins
+- **Coverage**: US government cybersecurity guidance
+- **Update Frequency**: As issued
+- **Authentication**: None
+- **Source**: https://www.cisa.gov/
+
+### 8. News Aggregation (100+ RSS Feeds)
+
+The application aggregates news from over 100 RSS feeds across multiple categories:
+
+#### Wire Services (Tier 1)
+- **Reuters** (World, Business)
+- **AP News**
+- **AFP**
+- **Bloomberg**
+
+#### Government Sources (Tier 1)
+- **White House** (press releases, statements)
+- **State Department** (briefings, travel advisories)
+- **Pentagon** (news releases, contracts)
+- **UN News**
+- **CISA** (cybersecurity alerts)
+- **Treasury, DOJ, DHS, CDC, FEMA**
+- **Federal Reserve, SEC**
+
+#### Major News Outlets (Tier 2)
+- **BBC World, BBC Middle East**
+- **Guardian World, Guardian Middle East**
+- **NPR News**
+- **CNN World**
+- **Al Jazeera**
+- **Financial Times**
+- **Politico**
+
+#### Defense & Intelligence (Tier 3)
+- **Defense One**
+- **Breaking Defense**
+- **The War Zone**
+- **Defense News**
+- **Janes**
+- **Foreign Policy**
+- **The Diplomat**
+- **Bellingcat**
+- **Krebs on Security**
+
+#### Think Tanks (Tier 3)
+- **CSIS** (Center for Strategic & International Studies)
+- **RAND Corporation**
+- **Brookings Institution**
+- **Carnegie Endowment**
+- **Atlantic Council**
+- **Foreign Affairs**
+- **Arms Control Association**
+- **Bulletin of the Atomic Scientists**
+- **RUSI, Wilson Center, GMF, CNAS**
+
+#### Regional News
+- **Middle East**: Al Jazeera, Al Arabiya, TRT World
+- **Africa**: All Africa, ISS Africa
+- **Latin America**: Latin America News
+- **Asia-Pacific**: Nikkei, The Diplomat
+
+#### Technology & Startups (Tech Variant)
+- **TechCrunch, VentureBeat, Ars Technica, The Verge**
+- **Y Combinator, a16z, Sequoia blogs**
+- **Crunchbase, CB Insights, PitchBook**
+- **Regional**: e27 (SEA), 36Kr (China), Inc42 (India), TechCabal (Africa)
+- **Think Tanks**: Stanford HAI, MIT Tech Review, OECD Digital
+
+#### Feed Processing
+- **Proxy**: Feeds routed through `/api/rss-proxy` (Vercel) or Railway relay
+- **Circuit Breakers**: Failed feeds enter 5-minute cooldown after 2 failures
+- **Caching**: 10-minute TTL in browser localStorage
+- **Classification**: AI-powered threat classification via Groq/OpenRouter
+- **Entity Extraction**: Automatic extraction of locations, organizations, CVEs
+
+---
+
+## Data Collection Methods
+
+### 1. REST API Calls
+
+Most data sources use standard HTTP/REST APIs:
+
+```javascript
+// Example from services/earthquakes.ts
+const response = await fetch('/api/earthquakes');
+const data = await response.json();
+```
+
+**Characteristics:**
+- Polling-based (request every X minutes)
+- Edge function proxies hide API keys
+- Rate limiting on server side
+- Response caching via Upstash Redis
+
+### 2. RSS Feed Parsing
+
+RSS/Atom feeds are fetched and parsed in the browser:
+
+```javascript
+// Example from services/rss.ts
+const response = await fetchWithProxy(feed.url);
+const text = await response.text();
+const parser = new DOMParser();
+const doc = parser.parseFromString(text, 'text/xml');
+```
+
+**Characteristics:**
+- Proxied through Vercel or Railway to handle CORS
+- DOM parser for XML processing
+- Deduplication by URL and title
+- Per-feed circuit breakers
+
+### 3. WebSocket Streaming
+
+Real-time data uses WebSocket connections:
+
+```javascript
+// Example from services/ais.ts
+const ws = new WebSocket(WS_RELAY_URL);
+ws.onmessage = (event) => {
+  const message = JSON.parse(event.data);
+  // Process real-time vessel position
+};
+```
+
+**Characteristics:**
+- Persistent connection via Railway relay
+- Real-time updates (sub-second latency)
+- Automatic reconnection on disconnect
+- Used for: AIS vessels, OpenSky aircraft
+
+### 4. GraphQL Queries
+
+Some sources use GraphQL (e.g., Polymarket):
+
+```javascript
+const response = await fetch('https://gamma-api.polymarket.com/query', {
+  method: 'POST',
+  body: JSON.stringify({ query: ... })
+});
+```
+
+### 5. Web Scraping
+
+Limited scraping for sources without APIs:
+
+```javascript
+// Example: GitHub trending (no official API)
+const html = await fetch('/api/github-trending').then(r => r.text());
+// Parse HTML to extract trending repos
+```
+
+**Note**: Only used when no official API exists
+
+---
+
+## Refresh Intervals & Caching
+
+### Update Frequencies (from `REFRESH_INTERVALS`)
+
+| Data Type | Refresh Interval | Rationale |
+|-----------|------------------|-----------|
+| **RSS Feeds** | 5 minutes | Balance freshness with rate limits |
+| **Markets/Crypto** | 2 minutes | Fast-moving market data |
+| **Prediction Markets** | 5 minutes | Slower-changing probabilities |
+| **AIS Vessels** | 10 minutes | Snapshot refresh (WebSocket is real-time) |
+| **arXiv Papers** | 1 hour | Papers published once daily |
+| **GitHub Trending** | 30 minutes | Rankings update slowly |
+| **Hacker News** | 5 minutes | Active discussion board |
+
+### Caching Strategy
+
+#### 1. Cross-User Cache (Upstash Redis)
+- **Purpose**: Share expensive operations across all users
+- **Location**: Vercel edge functions
+- **TTL**: Varies by data type (5-60 minutes)
+- **Use Cases**:
+  - AI-generated summaries (World Brief)
+  - Risk score calculations
+  - ACLED conflict data (API has rate limits)
+
+```javascript
+// Example from api/_upstash-cache.js
+const cached = await redis.get(`cache:${key}`);
+if (cached) return JSON.parse(cached);
+
+const fresh = await fetchFreshData();
+await redis.set(`cache:${key}`, JSON.stringify(fresh), { ex: 300 }); // 5 min TTL
+```
+
+#### 2. Per-User Cache (Browser localStorage)
+- **Purpose**: Avoid re-fetching data on page reload
+- **Location**: Browser localStorage
+- **TTL**: 10 minutes default
+- **Use Cases**:
+  - RSS feed items
+  - Market data
+  - Map layer data
+
+```javascript
+// Example from services/persistent-cache.ts
+export async function getPersistentCache