
🌐 Real API Validation Study Design

Validating Documentation Sweet Spot Findings with Live APIs


🎯 Study Objectives

Primary Goals:

  1. Validate mock endpoint findings using real-world API services
  2. Test documentation sweet spot hypothesis with actual API constraints
  3. Assess LLM behavior with real authentication, rate limits, and error conditions
  4. Strengthen research validity through real-world validation

Research Questions:

  • RQ1: Do our documentation sweet spot findings hold with real APIs?
  • RQ2: How do real-world constraints (rate limits, downtime) affect LLM success?
  • RQ3: Are there additional insights from real API testing not captured by mocks?

🔧 Selected Real APIs by Domain

1. 🌤️ Weather Domain

Primary Choice: OpenWeatherMap API

  • Free Tier: 1,000 calls/day, 60 calls/minute
  • Endpoint: GET /data/2.5/weather
  • Authentication: API key in query parameter (appid)
  • Cost: Free tier sufficient for testing
  • Reliability: 99.9% uptime, established service
  • Documentation Quality: Multiple levels available (quick start vs comprehensive)
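The call pattern described above can be sketched in a few lines. Note that `build_weather_request`, `parse_weather`, and the sample payload shape are illustrative assumptions, not part of any official client; the endpoint and `appid` query parameter follow the documented interface:

```python
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"

def build_weather_request(city: str, api_key: str):
    """Return the URL and query parameters for a current-weather lookup."""
    return BASE_URL, {"q": city, "appid": api_key, "units": "metric"}

def parse_weather(payload: dict) -> str:
    """Summarize the documented response fields (main, weather, name)."""
    return (f"{payload['name']}: {payload['weather'][0]['description']}, "
            f"{payload['main']['temp']}°C")
```

The actual call would then be something like `requests.get(url, params=params)`, with the parse step applied to the JSON body.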

Alternative: WeatherAPI.com

  • Free Tier: 1 million calls/month
  • Better rate limits for extensive testing

2. 📰 News Domain

Primary Choice: NewsAPI.org

  • Free Tier: 1,000 requests/day
  • Endpoint: GET /v2/top-headlines
  • Authentication: API key in header (X-API-Key)
  • Cost: Free tier sufficient
  • Reliability: High uptime, real-time data
  • Documentation: Well-structured with multiple complexity levels

Alternative: The Guardian API

  • Free Tier: 12,000 calls/day
  • No API key required for basic access

3. 💱 Currency Domain

Primary Choice: ExchangeRate-API.com

  • Free Tier: 1,500 requests/month
  • Endpoint: GET /v4/latest/{base_currency}
  • Authentication: API key in query parameter
  • Cost: Free tier sufficient
  • Reliability: High availability, real-time rates
  • Documentation: Simple to comprehensive options

Alternative: Fixer.io

  • Free Tier: 100 requests/month
  • More comprehensive but limited free tier

4. 👤 User Management Domain

Primary Choice: JSONPlaceholder + Custom Auth Layer

  • Free Tier: Unlimited (test service)
  • Endpoint: GET /users/{id}
  • Authentication: Custom header simulation
  • Cost: Free
  • Reliability: Designed for testing
  • Documentation: Can create multiple quality levels

Alternative: Supabase Auth API

  • Free Tier: 50,000 monthly active users
  • Real authentication but more complex setup

5. 📁 File Storage Domain

Primary Choice: Box API

  • Free Tier: 10GB storage, API access
  • Endpoint: GET /2.0/files/{file_id}
  • Authentication: Bearer token (OAuth2)
  • Cost: Free tier sufficient
  • Reliability: Enterprise-grade uptime
  • Documentation: Multiple complexity levels

Alternative: Dropbox API

  • Free Tier: 2GB storage, API access
  • Simpler authentication for testing

🔬 Real API Validation Experiment Design

Phase 1: API Setup and Standardization

1.1 API Account Creation

# API credentials management
import os

REAL_API_KEYS = {
    "openweathermap": os.getenv("OPENWEATHER_API_KEY"),
    "newsapi": os.getenv("NEWS_API_KEY"),
    "exchangerate": os.getenv("EXCHANGE_API_KEY"),
    "jsonplaceholder": "test_key_simulation",
    "box": os.getenv("BOX_ACCESS_TOKEN")
}

1.2 Standardized Real API Specifications

REAL_API_SPECS = {
    "weather": {
        "base_url": "https://api.openweathermap.org/data/2.5",
        "endpoint": "/weather",
        "auth_method": "query_param",
        "auth_param": "appid",
        "required_params": ["q"],  # location query
        "rate_limit": "60/minute",
        "expected_fields": ["main", "weather", "name"]
    },
    "news": {
        "base_url": "https://newsapi.org/v2",
        "endpoint": "/top-headlines",
        "auth_method": "header",
        "auth_header": "X-API-Key",
        "required_params": ["category"],
        "rate_limit": "1000/day",
        "expected_fields": ["articles", "totalResults"]
    },
    "currency": {
        "base_url": "https://v6.exchangerate-api.com/v6",
        "endpoint": "/latest/{base}",
        "auth_method": "path_param",
        "auth_position": "after_version",
        "required_params": ["base"],
        "rate_limit": "1500/month",
        "expected_fields": ["conversion_rates", "base_code"]
    },
    "user": {
        "base_url": "https://jsonplaceholder.typicode.com",
        "endpoint": "/users/{id}",
        "auth_method": "header",
        "auth_header": "Authorization",
        "required_params": ["id"],
        "rate_limit": "unlimited",
        "expected_fields": ["name", "email", "id"]
    },
    "file": {
        "base_url": "https://api.box.com/2.0",
        "endpoint": "/files/{file_id}",
        "auth_method": "header",
        "auth_header": "Authorization",
        "required_params": ["file_id"],
        "rate_limit": "1000/hour",
        "expected_fields": ["name", "size", "id"]
    }
}
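The spec entries above can drive request construction generically by dispatching on `auth_method`. The helper below is a sketch (its name and return shape are assumptions, not part of any API's SDK):

```python
def build_request(spec, api_key, path_params=None, query=None):
    """Return (url, query_params, headers) for a REAL_API_SPECS entry."""
    path_params, query, headers = dict(path_params or {}), dict(query or {}), {}
    endpoint = spec["endpoint"].format(**path_params)  # fill e.g. {base}, {id}
    if spec["auth_method"] == "query_param":
        url = spec["base_url"] + endpoint
        query[spec["auth_param"]] = api_key
    elif spec["auth_method"] == "header":
        url = spec["base_url"] + endpoint
        headers[spec["auth_header"]] = api_key
    else:  # "path_param": key sits between the version segment and the endpoint
        url = f'{spec["base_url"]}/{api_key}{endpoint}'
    return url, query, headers
```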

Phase 2: Documentation Quality Variants for Real APIs

2.1 Real API Documentation Levels

For each API, create 5 documentation quality levels using actual API documentation as source material:

  • Excellent (5.0/5.0): Official comprehensive documentation with all examples
  • Good (4.0/5.0): Official quick start guides with basic examples
  • Average (3.0/5.0): Community tutorials and simplified guides
  • Basic (2.0/5.0): Minimal official reference documentation
  • Poor (1.0/5.0): Incomplete or outdated documentation snippets

2.2 Real Documentation Sources

REAL_DOC_SOURCES = {
    "weather": {
        "excellent": "https://openweathermap.org/api/one-call-3",
        "good": "https://openweathermap.org/current",
        "average": "Community tutorial extracts",
        "basic": "API reference only",
        "poor": "Minimal endpoint description"
    },
    # Similar for other domains...
}

Phase 3: Enhanced Testing Framework

3.1 Real API Testing Infrastructure

class RealAPITestingFramework:
    def __init__(self):
        self.rate_limiters = {}
        self.api_health_monitors = {}
        self.cost_trackers = {}
        self.retry_strategies = {}
    
    async def test_with_real_api(self, code: str, api_domain: str, quality_level: str):
        """Test LLM-generated code with real APIs"""
        
        # Pre-flight checks
        if not self._check_api_health(api_domain):
            return self._create_failure_result("API_UNAVAILABLE")
        
        if not self._check_rate_limits(api_domain):
            return self._create_failure_result("RATE_LIMITED")
        
        # Execute with real API
        result = await self._execute_real_api_code(code, api_domain)
        
        # Track usage and costs
        self._update_usage_tracking(api_domain, result)
        
        return result

3.2 Real-World Error Handling

REAL_WORLD_SCENARIOS = {
    "rate_limiting": {
        "detection": "429 status code or rate limit headers",
        "handling": "Exponential backoff with jitter",
        "success_criteria": "Graceful degradation"
    },
    "api_downtime": {
        "detection": "5xx status codes or timeouts",
        "handling": "Circuit breaker pattern",
        "success_criteria": "Appropriate error messaging"
    },
    "authentication_expiry": {
        "detection": "401 with token expired message",
        "handling": "Token refresh or re-authentication",
        "success_criteria": "Automatic recovery"
    },
    "data_format_changes": {
        "detection": "Unexpected response structure",
        "handling": "Flexible parsing with fallbacks",
        "success_criteria": "Graceful handling of schema changes"
    }
}
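The "exponential backoff with jitter" handling named in the rate_limiting scenario can be sketched as a small retry wrapper. Here `RateLimitError` and `fetch` are placeholders for whatever the testing harness raises and calls on a 429 response:

```python
import random
import time

class RateLimitError(Exception):
    """Raised by the caller's fetch function on a 429 response."""

def with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Retry fetch on rate limiting, doubling the delay and adding jitter."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

The jitter term spreads retries out so concurrent clients do not all hit the API again at the same instant.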

Phase 4: Comparative Analysis Framework

4.1 Mock vs Real API Comparison Metrics

COMPARISON_METRICS = {
    "success_rate_correlation": {
        "mock_results": "Previous experiment success rates",
        "real_results": "Real API success rates",
        "analysis": "Pearson correlation coefficient"
    },
    "documentation_sweet_spot": {
        "mock_optimal": "Average/Basic (3.0-2.0/5.0)",
        "real_optimal": "To be determined",
        "analysis": "Sweet spot consistency check"
    },
    "error_handling_sophistication": {
        "mock_errors": "Simple 401/404/429 simulation",
        "real_errors": "Complex real-world error scenarios",
        "analysis": "LLM adaptation to complexity"
    },
    "implementation_robustness": {
        "mock_environment": "Controlled, predictable",
        "real_environment": "Variable, realistic",
        "analysis": "Code quality under real constraints"
    }
}

🛡️ Practical Considerations

1. 🔐 API Key Management and Security

Secure Key Storage

# Environment-based key management
import os
from cryptography.fernet import Fernet

class SecureAPIKeyManager:
    def __init__(self):
        # ENCRYPTION_KEY must be a url-safe base64-encoded 32-byte Fernet key
        self.encryption_key = os.environ["ENCRYPTION_KEY"].encode()
        self.cipher = Fernet(self.encryption_key)
    
    def get_api_key(self, service: str) -> str:
        """Retrieve and decrypt an API key from the environment"""
        encrypted_key = os.environ[f"{service.upper()}_API_KEY_ENCRYPTED"]
        return self.cipher.decrypt(encrypted_key.encode()).decode()
    
    def rotate_keys(self):
        """Implement key rotation strategy"""
        # Automated key rotation logic
        pass

Key Security Measures

  • Environment Variables: Store keys in .env files (not in code)
  • Encryption at Rest: Encrypt stored keys using Fernet
  • Access Logging: Log all API key usage for audit trails
  • Rotation Schedule: Rotate keys monthly or after experiments
  • Scope Limitation: Use read-only keys where possible

2. ⚡ Rate Limiting and Cost Management

Rate Limiting Strategy

import asyncio

class RateLimitManager:
    def __init__(self):
        self.limits = {
            "openweathermap": {"calls": 60, "period": 60},      # 60/minute
            "newsapi": {"calls": 1000, "period": 86400},        # 1000/day
            "exchangerate": {"calls": 1500, "period": 2592000}  # 1500/month
        }
        self.usage_tracking = {}
    
    async def check_and_wait(self, api_name: str):
        """Check rate limits and wait if necessary"""
        if self._is_rate_limited(api_name):
            wait_time = self._calculate_wait_time(api_name)
            await asyncio.sleep(wait_time)
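The `_is_rate_limited` check above is left as a stub; one way to implement it is a sliding-window counter over recent call timestamps. The class below is an illustrative sketch, not part of the framework:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of `period` seconds."""

    def __init__(self, max_calls, period):
        self.max_calls, self.period = max_calls, period
        self.calls = deque()  # timestamps of recent calls, oldest first

    def try_acquire(self, now=None):
        """Record a call if under the limit; return False when rate limited."""
        now = time.monotonic() if now is None else now
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()  # drop timestamps outside the window
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True
```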

Cost Estimation

COST_ANALYSIS = {
    "free_tier_limits": {
        "openweathermap": {"calls": 1000, "cost": 0},
        "newsapi": {"calls": 1000, "cost": 0},
        "exchangerate": {"calls": 1500, "cost": 0},
        "total_monthly_cost": 0
    },
    "experiment_requirements": {
        "total_tests": 75,  # 5 domains × 5 quality levels × 3 LLMs
        "calls_per_test": 3,  # Success + 2 error scenarios
        "total_calls": 225,
        "safety_margin": 2,
        "estimated_calls": 450
    },
    "cost_projection": {
        "within_free_tiers": True,
        "estimated_monthly_cost": 0,
        "risk_assessment": "Low"
    }
}
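The call-budget arithmetic above can be reproduced in a few lines, so the totals are easy to re-check if the number of domains, quality levels, or models changes:

```python
# Experiment call-budget arithmetic (figures from the cost analysis above)
DOMAINS, QUALITY_LEVELS, LLMS = 5, 5, 3
CALLS_PER_TEST = 3   # one success scenario + two error scenarios
SAFETY_MARGIN = 2    # headroom for retries and debugging runs

total_tests = DOMAINS * QUALITY_LEVELS * LLMS    # 75
total_calls = total_tests * CALLS_PER_TEST       # 225
estimated_calls = total_calls * SAFETY_MARGIN    # 450
```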

3. 🔧 API Downtime and Change Management

Health Monitoring

import asyncio
import time

class APIHealthMonitor:
    def __init__(self):
        self.health_endpoints = {
            "openweathermap": "https://api.openweathermap.org/data/2.5/weather?q=London&appid=test",
            "newsapi": "https://newsapi.org/v2/top-headlines?country=us&apiKey=test"
        }
    
    async def check_api_health(self, api_name: str) -> bool:
        """Check if API is available before testing"""
        try:
            response = await self._make_health_check(api_name)
            return response.status_code in [200, 401]  # 401 = auth error, but the API is up
        except Exception:
            return False
    
    async def wait_for_api_recovery(self, api_name: str, max_wait: int = 3600):
        """Wait for API to recover from downtime"""
        start_time = time.time()
        while time.time() - start_time < max_wait:
            if await self.check_api_health(api_name):
                return True
            await asyncio.sleep(60)  # Check every minute
        return False

Fallback Strategies

FALLBACK_STRATEGIES = {
    "primary_api_down": {
        "action": "Switch to alternative API",
        "alternatives": {
            "weather": "OpenWeatherMap → WeatherAPI.com",
            "news": "NewsAPI → Guardian API",
            "currency": "ExchangeRate-API → Fixer.io"
        }
    },
    "rate_limit_exceeded": {
        "action": "Implement exponential backoff",
        "max_wait": "24 hours",
        "alternative": "Switch to backup API"
    },
    "api_schema_change": {
        "action": "Update expected fields dynamically",
        "validation": "Flexible response parsing",
        "notification": "Alert for manual review"
    }
}
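A circuit breaker, named earlier under the api_downtime scenario, is one way to decide when to stop calling a failing primary API and trigger these fallbacks. A minimal sketch, with illustrative threshold and reset values:

```python
import time

class CircuitBreaker:
    """Open after repeated failures; allow a probe once reset_after elapses."""

    def __init__(self, failure_threshold=3, reset_after=60.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic time the breaker opened, or None

    def allow_request(self):
        if self.opened_at is None:
            return True  # closed: operating normally
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: let one probe through
        return False     # open: fail fast, use the fallback API

    def record_success(self):
        self.failures, self.opened_at = 0, None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```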

4. 📊 Comparison Methodology

Statistical Comparison Framework

import numpy as np

class MockVsRealComparison:
    def __init__(self, mock_results, real_results):
        self.mock_results = mock_results
        self.real_results = real_results
    
    def calculate_correlation(self):
        """Calculate correlation between mock and real results"""
        mock_scores = [r['success_score'] for r in self.mock_results]
        real_scores = [r['success_score'] for r in self.real_results]
        return np.corrcoef(mock_scores, real_scores)[0, 1]
    
    def analyze_sweet_spot_consistency(self):
        """Check if sweet spot pattern holds in both environments"""
        mock_sweet_spot = self._find_sweet_spot(self.mock_results)
        real_sweet_spot = self._find_sweet_spot(self.real_results)
        return {
            "mock_optimal": mock_sweet_spot,
            "real_optimal": real_sweet_spot,
            "consistency": mock_sweet_spot == real_sweet_spot
        }
    
    def identify_new_insights(self):
        """Identify insights only visible with real APIs"""
        return {
            "rate_limit_handling": self._analyze_rate_limit_behavior(),
            "error_recovery": self._analyze_error_recovery(),
            "real_world_robustness": self._analyze_robustness()
        }
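The RQ1 correlation check can also be run without the full framework. The sketch below implements the Pearson coefficient directly (equivalent to the np.corrcoef call above); the per-condition success rates are hypothetical placeholders, not experimental data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical success rates per documentation level (Poor → Excellent)
mock_scores = [0.55, 0.70, 0.85, 0.80, 0.60]
real_scores = [0.50, 0.65, 0.80, 0.75, 0.55]
r = pearson(mock_scores, real_scores)
```

An r near 1.0 would correspond to Scenario 1 (strong validation); values near 0 would indicate divergent results.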

🎯 Expected Outcomes and Value Assessment

Scenario 1: Strong Validation (High Correlation)

If real API results strongly correlate with mock results:

  • Validation: Confirms documentation sweet spot is universal
  • Research Impact: Strengthens validity and generalizability
  • Practical Value: High confidence in recommendations
  • Publication Value: Robust evidence for academic submission

Scenario 2: Partial Validation (Moderate Correlation)

If real API results partially align with mock results:

  • Insights: Identifies real-world factors affecting LLM performance
  • Research Impact: Reveals nuanced behavior patterns
  • Practical Value: More sophisticated guidance for practitioners
  • Publication Value: Comprehensive understanding of phenomenon

Scenario 3: Divergent Results (Low Correlation)

If real API results differ significantly from mock results:

  • Discovery: Uncovers limitations of controlled testing
  • Research Impact: Identifies critical real-world factors
  • Practical Value: Essential insights for production usage
  • Publication Value: Important negative results and methodology insights

Additional Insights Expected from Real APIs:

  1. Rate Limit Adaptation: How LLMs handle rate limiting in generated code
  2. Error Recovery: LLM ability to implement robust error handling
  3. Authentication Complexity: Performance with OAuth vs simple API keys
  4. Data Variability: Handling of real-world data inconsistencies
  5. Network Resilience: Code robustness under network conditions

🚀 Recommendation: Proceed with Real API Validation

Why Real API Testing is Valuable:

1. 🔬 Scientific Rigor

  • External Validity: Confirms findings generalize beyond controlled conditions
  • Replication: Demonstrates reproducibility in real-world settings
  • Robustness: Tests hypothesis under realistic constraints

2. 📈 Research Impact

  • Academic Credibility: Strengthens evidence for peer review
  • Industry Relevance: Provides practical, actionable insights
  • Methodological Contribution: Establishes gold standard for LLM API testing

3. 💡 Practical Insights

  • Production Guidance: Real-world recommendations for developers
  • Tool Development: Insights for AI-assisted development platforms
  • Documentation Strategy: Evidence-based guidance for API writers

4. 📊 Cost-Benefit Analysis

  • Low Cost: All APIs have sufficient free tiers
  • High Value: Significant research and practical benefits
  • Manageable Risk: Well-defined fallback strategies
  • Timeline: 2-3 weeks additional research time

Implementation Priority:

  1. Phase 1: Weather + News APIs (highest reliability, best free tiers)
  2. Phase 2: Currency + User Management APIs (moderate complexity)
  3. Phase 3: File Storage API (most complex authentication)

Success Criteria:

  • Minimum Viable Validation: 2 domains showing consistent patterns
  • Strong Validation: 4+ domains confirming sweet spot phenomenon
  • Exceptional Outcome: New insights not captured by mock testing

Recommendation: Proceed with real API validation study to strengthen research validity and uncover additional insights about LLM behavior in production environments. 🎯