- Validate mock endpoint findings using real-world API services
- Test documentation sweet spot hypothesis with actual API constraints
- Assess LLM behavior with real authentication, rate limits, and error conditions
- Strengthen research validity through real-world validation
- RQ1: Do our documentation sweet spot findings hold with real APIs?
- RQ2: How do real-world constraints (rate limits, downtime) affect LLM success?
- RQ3: Are there additional insights from real API testing not captured by mocks?
Primary Choice: OpenWeatherMap API
- Free Tier: 1,000 calls/day, 60 calls/minute
- Endpoint: GET /data/2.5/weather
- Authentication: API key in query parameter (appid)
- Cost: Free tier sufficient for testing
- Reliability: 99.9% uptime, established service
- Documentation Quality: Multiple levels available (quick start vs comprehensive)
Alternative: WeatherAPI.com
- Free Tier: 1 million calls/month
- Better rate limits for extensive testing
Primary Choice: NewsAPI.org
- Free Tier: 1,000 requests/day
- Endpoint: GET /v2/top-headlines
- Authentication: API key in header (X-API-Key)
- Cost: Free tier sufficient
- Reliability: High uptime, real-time data
- Documentation: Well-structured with multiple complexity levels
Alternative: The Guardian API
- Free Tier: 12,000 calls/day
- No API key required for basic access
Primary Choice: ExchangeRate-API.com
- Free Tier: 1,500 requests/month
- Endpoint: GET /v6/{api_key}/latest/{base_currency}
- Authentication: API key in URL path
- Cost: Free tier sufficient
- Reliability: High availability, real-time rates
- Documentation: Simple to comprehensive options
Alternative: Fixer.io
- Free Tier: 100 requests/month
- More comprehensive but limited free tier
Primary Choice: JSONPlaceholder + Custom Auth Layer
- Free Tier: Unlimited (test service)
- Endpoint: GET /users/{id}
- Authentication: Custom header simulation
- Cost: Free
- Reliability: Designed for testing
- Documentation: Can create multiple quality levels
Alternative: Supabase Auth API
- Free Tier: 50,000 monthly active users
- Real authentication but more complex setup
Primary Choice: Box API
- Free Tier: 10GB storage, API access
- Endpoint: GET /2.0/files/{file_id}
- Authentication: Bearer token (OAuth2)
- Cost: Free tier sufficient
- Reliability: Enterprise-grade uptime
- Documentation: Multiple complexity levels
Alternative: Dropbox API
- Free Tier: 2GB storage, API access
- Simpler authentication for testing
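The candidate APIs above place credentials in different parts of the request: a query parameter (OpenWeatherMap), a custom header (NewsAPI), or a bearer token (Box). The following is a minimal sketch of how a test harness might build each request shape without sending anything over the network. The function name `build_request` and its parameters are hypothetical and not part of the original plan.

```python
from urllib.parse import urlencode


def build_request(base_url, endpoint, auth_method, api_key,
                  params=None, auth_name=None, bearer=False):
    """Return (url, headers) for one call. Nothing is sent over the network.

    auth_method is "query_param" or "header" (names assumed for this sketch);
    auth_name is the query parameter or header that carries the credential.
    """
    query = dict(params or {})
    headers = {}
    if auth_method == "query_param":
        query[auth_name] = api_key
    elif auth_method == "header":
        headers[auth_name] = f"Bearer {api_key}" if bearer else api_key
    else:
        raise ValueError(f"unsupported auth_method: {auth_method}")

    url = base_url + endpoint
    if query:
        url += "?" + urlencode(query)
    return url, headers


# Query-parameter style (OpenWeatherMap)
weather_url, weather_headers = build_request(
    "https://api.openweathermap.org", "/data/2.5/weather",
    "query_param", "KEY", {"q": "London"}, auth_name="appid")

# Bearer-token style (Box)
box_url, box_headers = build_request(
    "https://api.box.com", "/2.0/files/123",
    "header", "TOKEN", auth_name="Authorization", bearer=True)
```

Keeping request construction separate from execution lets the same code be checked offline before any free-tier quota is spent.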
# API credentials management
import os
REAL_API_KEYS = {
"openweathermap": os.getenv("OPENWEATHER_API_KEY"),
"newsapi": os.getenv("NEWS_API_KEY"),
"exchangerate": os.getenv("EXCHANGE_API_KEY"),
"jsonplaceholder": "test_key_simulation",
"box": os.getenv("BOX_ACCESS_TOKEN")
}

REAL_API_SPECS = {
"weather": {
"base_url": "https://api.openweathermap.org/data/2.5",
"endpoint": "/weather",
"auth_method": "query_param",
"auth_param": "appid",
"required_params": ["q"], # location query
"rate_limit": "60/minute",
"expected_fields": ["main", "weather", "name"]
},
"news": {
"base_url": "https://newsapi.org/v2",
"endpoint": "/top-headlines",
"auth_method": "header",
"auth_header": "X-API-Key",
"required_params": ["category"],
"rate_limit": "1000/day",
"expected_fields": ["articles", "totalResults"]
},
"currency": {
"base_url": "https://v6.exchangerate-api.com/v6",
"endpoint": "/latest/{base}",
"auth_method": "path_param",
"auth_position": "after_version",
"required_params": ["base"],
"rate_limit": "1500/month",
"expected_fields": ["conversion_rates", "base_code"]
},
"user": {
"base_url": "https://jsonplaceholder.typicode.com",
"endpoint": "/users/{id}",
"auth_method": "header",
"auth_header": "Authorization",
"required_params": ["id"],
"rate_limit": "unlimited",
"expected_fields": ["name", "email", "id"]
},
"file": {
"base_url": "https://api.box.com/2.0",
"endpoint": "/files/{file_id}",
"auth_method": "header",
"auth_header": "Authorization",
"required_params": ["file_id"],
"rate_limit": "1000/hour",
"expected_fields": ["name", "size", "id"]
}
}

For each API, create 5 documentation quality levels using actual API documentation as source material:
- Excellent (5.0/5.0): Official comprehensive documentation with all examples
- Good (4.0/5.0): Official quick start guides with basic examples
- Average (3.0/5.0): Community tutorials and simplified guides
- Basic (2.0/5.0): Minimal official reference documentation
- Poor (1.0/5.0): Incomplete or outdated documentation snippets

REAL_DOC_SOURCES = {
"weather": {
"excellent": "https://openweathermap.org/api/one-call-3",
"good": "https://openweathermap.org/current",
"average": "Community tutorial extracts",
"basic": "API reference only",
"poor": "Minimal endpoint description"
},
# Similar for other domains...
}

class RealAPITestingFramework:
    def __init__(self):
        self.rate_limiters = {}
        self.api_health_monitors = {}
        self.cost_trackers = {}
        self.retry_strategies = {}

    async def test_with_real_api(self, code: str, api_domain: str, quality_level: str):
        """Test LLM-generated code with real APIs"""
        # Pre-flight checks
        if not self._check_api_health(api_domain):
            return self._create_failure_result("API_UNAVAILABLE")
        if not self._check_rate_limits(api_domain):
            return self._create_failure_result("RATE_LIMITED")
        # Execute with real API
        result = await self._execute_real_api_code(code, api_domain)
        # Track usage and costs
        self._update_usage_tracking(api_domain, result)
        return result

REAL_WORLD_SCENARIOS = {
"rate_limiting": {
"detection": "429 status code or rate limit headers",
"handling": "Exponential backoff with jitter",
"success_criteria": "Graceful degradation"
},
"api_downtime": {
"detection": "5xx status codes or timeouts",
"handling": "Circuit breaker pattern",
"success_criteria": "Appropriate error messaging"
},
"authentication_expiry": {
"detection": "401 with token expired message",
"handling": "Token refresh or re-authentication",
"success_criteria": "Automatic recovery"
},
"data_format_changes": {
"detection": "Unexpected response structure",
"handling": "Flexible parsing with fallbacks",
"success_criteria": "Graceful handling of schema changes"
}
}

COMPARISON_METRICS = {
"success_rate_correlation": {
"mock_results": "Previous experiment success rates",
"real_results": "Real API success rates",
"analysis": "Pearson correlation coefficient"
},
"documentation_sweet_spot": {
"mock_optimal": "Average/Basic (3.0-2.0/5.0)",
"real_optimal": "To be determined",
"analysis": "Sweet spot consistency check"
},
"error_handling_sophistication": {
"mock_errors": "Simple 401/404/429 simulation",
"real_errors": "Complex real-world error scenarios",
"analysis": "LLM adaptation to complexity"
},
"implementation_robustness": {
"mock_environment": "Controlled, predictable",
"real_environment": "Variable, realistic",
"analysis": "Code quality under real constraints"
}
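}

# Environment-based key management
import os
from cryptography.fernet import Fernet

class SecureAPIKeyManager:
    def __init__(self):
        # ENCRYPTION_KEY must hold a URL-safe base64-encoded 32-byte Fernet key
        self.encryption_key = os.getenv("ENCRYPTION_KEY")
        if self.encryption_key is None:
            raise RuntimeError("ENCRYPTION_KEY is not set")
        self.cipher = Fernet(self.encryption_key.encode())

    def get_api_key(self, service: str) -> str:
        """Retrieve and decrypt API key"""
        encrypted_key = os.getenv(f"{service.upper()}_API_KEY_ENCRYPTED")
        if encrypted_key is None:
            raise KeyError(f"No encrypted key configured for {service}")
        # Fernet tokens are bytes; environment variables are str
        return self.cipher.decrypt(encrypted_key.encode()).decode()

    def rotate_keys(self):
        """Implement key rotation strategy"""
        # Automated key rotation logic
        pass

- Environment Variables: Store keys in .env files (not in code)
- Encryption at Rest: Encrypt stored keys using Fernet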
- Access Logging: Log all API key usage for audit trails
- Rotation Schedule: Rotate keys monthly or after experiments
- Scope Limitation: Use read-only keys where possible
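The access-logging practice listed above can be illustrated with the standard library alone. This is a sketch under assumptions: the logger name `api_key_audit`, the helper names, and the `<SERVICE>_API_KEY` variable naming are hypothetical. The log records a short SHA-256 fingerprint, never the key itself.

```python
import hashlib
import logging
import os

# Hypothetical logger name; route it to a file or audit sink as needed.
audit_log = logging.getLogger("api_key_audit")


def key_fingerprint(key: str) -> str:
    """Short, non-reversible identifier so the key never appears in logs."""
    return hashlib.sha256(key.encode()).hexdigest()[:8]


def get_key_with_audit(service: str, env=os.environ) -> str:
    """Look up <SERVICE>_API_KEY (assumed naming) and log the access."""
    var_name = f"{service.upper()}_API_KEY"
    key = env.get(var_name)
    if key is None:
        audit_log.warning("key lookup failed service=%s var=%s", service, var_name)
        raise KeyError(var_name)
    audit_log.info("key accessed service=%s fingerprint=%s",
                   service, key_fingerprint(key))
    return key
```

Passing `env` as a parameter keeps the function testable with a plain dictionary instead of the real process environment.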
import asyncio

class RateLimitManager:
    def __init__(self):
        # "period" is the window length in seconds
        self.limits = {
            "openweathermap": {"calls": 60, "period": 60},  # 60/minute
            "newsapi": {"calls": 1000, "period": 86400},  # 1000/day
            "exchangerate": {"calls": 1500, "period": 2592000}  # 1500/month (30 days)
        }
        self.usage_tracking = {}

    async def check_and_wait(self, api_name: str):
        """Check rate limits and wait if necessary"""
        if self._is_rate_limited(api_name):
            wait_time = self._calculate_wait_time(api_name)
            await asyncio.sleep(wait_time)

COST_ANALYSIS = {
"free_tier_limits": {
"openweathermap": {"calls": 1000, "cost": 0},
"newsapi": {"calls": 1000, "cost": 0},
"exchangerate": {"calls": 1500, "cost": 0},
"total_monthly_cost": 0
},
"experiment_requirements": {
"total_tests": 75, # 5 domains × 5 quality levels × 3 LLMs
"calls_per_test": 3, # Success + 2 error scenarios
"total_calls": 225,
"safety_margin": 2,
"estimated_calls": 450
},
"cost_projection": {
"within_free_tiers": True,
"estimated_monthly_cost": 0,
"risk_assessment": "Low"
}
}

import asyncio
import time

class APIHealthMonitor:
    def __init__(self):
        self.health_endpoints = {
            "openweathermap": "https://api.openweathermap.org/data/2.5/weather?q=London&appid=test",
            "newsapi": "https://newsapi.org/v2/top-headlines?country=us&apiKey=test"
        }

    async def check_api_health(self, api_name: str) -> bool:
        """Check if API is available before testing"""
        try:
            response = await self._make_health_check(api_name)
            return response.status_code in [200, 401]  # 401 = auth error but API is up
        except Exception:
            # Timeouts and connection errors count as "unavailable"
            return False

    async def wait_for_api_recovery(self, api_name: str, max_wait: int = 3600):
        """Wait for API to recover from downtime"""
        start_time = time.time()
        while time.time() - start_time < max_wait:
            if await self.check_api_health(api_name):
                return True
            await asyncio.sleep(60)  # Check every minute
        return False

FALLBACK_STRATEGIES = {
"primary_api_down": {
"action": "Switch to alternative API",
"alternatives": {
"weather": "OpenWeatherMap → WeatherAPI.com",
"news": "NewsAPI → Guardian API",
"currency": "ExchangeRate-API → Fixer.io"
}
},
"rate_limit_exceeded": {
"action": "Implement exponential backoff",
"max_wait": "24 hours",
"alternative": "Switch to backup API"
},
"api_schema_change": {
"action": "Update expected fields dynamically",
"validation": "Flexible response parsing",
"notification": "Alert for manual review"
}
}

import numpy as np

class MockVsRealComparison:
    def __init__(self, mock_results, real_results):
        self.mock_results = mock_results
        self.real_results = real_results

    def calculate_correlation(self):
        """Calculate Pearson correlation between mock and real results"""
        # Both lists must be the same length and in the same test order
        mock_scores = [r['success_score'] for r in self.mock_results]
        real_scores = [r['success_score'] for r in self.real_results]
        return np.corrcoef(mock_scores, real_scores)[0, 1]

    def analyze_sweet_spot_consistency(self):
        """Check if sweet spot pattern holds in both environments"""
        mock_sweet_spot = self._find_sweet_spot(self.mock_results)
        real_sweet_spot = self._find_sweet_spot(self.real_results)
        return {
            "mock_optimal": mock_sweet_spot,
            "real_optimal": real_sweet_spot,
            "consistency": mock_sweet_spot == real_sweet_spot
        }

    def identify_new_insights(self):
        """Identify insights only visible with real APIs"""
        return {
            "rate_limit_handling": self._analyze_rate_limit_behavior(),
            "error_recovery": self._analyze_error_recovery(),
            "real_world_robustness": self._analyze_robustness()
        }

If real API results strongly correlate with mock results:
- Validation: Supports the documentation sweet spot holding beyond mock endpoints
- Research Impact: Strengthens validity and generalizability
- Practical Value: High confidence in recommendations
- Publication Value: Robust evidence for academic submission
If real API results partially align with mock results:
- Insights: Identifies real-world factors affecting LLM performance
- Research Impact: Reveals nuanced behavior patterns
- Practical Value: More sophisticated guidance for practitioners
- Publication Value: Comprehensive understanding of phenomenon
If real API results differ significantly from mock results:
- Discovery: Uncovers limitations of controlled testing
- Research Impact: Identifies critical real-world factors
- Practical Value: Essential insights for production usage
- Publication Value: Important negative results and methodology insights
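The three outcome scenarios above (strong correlation, partial alignment, significant divergence) need a decision rule before results come in. The sketch below computes the Pearson coefficient in pure Python and maps it to a scenario. The thresholds 0.7 and 0.3 and the function names are assumptions chosen for illustration, not values from the study; they would need to be fixed in advance.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def classify_outcome(r, strong=0.7, partial=0.3):
    """Map a correlation to one of the three scenarios (thresholds assumed)."""
    if r >= strong:
        return "strong_correlation"
    if r >= partial:
        return "partial_alignment"
    return "significant_divergence"
```

Fixing the thresholds before running the real-API experiments avoids choosing the scenario label after seeing the data.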
- Rate Limit Adaptation: How LLMs handle rate limiting in generated code
- Error Recovery: LLM ability to implement robust error handling
- Authentication Complexity: Performance with OAuth vs simple API keys
- Data Variability: Handling of real-world data inconsistencies
- Network Resilience: Code robustness under network conditions
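As a reference point for judging the rate-limit and error-recovery behaviors listed above, here is a minimal sketch of exponential backoff with jitter, the handling strategy the plan names for 429 responses. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap. The function names, the `(status, body)` return convention, and the default limits are assumptions for this sketch.

```python
import random
import time


def backoff_delays(max_retries=5, base=1.0, cap=60.0, rng=random.random):
    """Yield full-jitter delays: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(request_fn, max_retries=5, sleep=time.sleep, **kwargs):
    """Call request_fn() until it stops returning 429 or 5xx.

    request_fn is a hypothetical callable returning (status_code, body).
    After max_retries failed attempts, one final attempt is returned as-is.
    """
    for delay in backoff_delays(max_retries, **kwargs):
        status, body = request_fn()
        if status != 429 and status < 500:
            return status, body
        sleep(delay)
    return request_fn()
```

Injecting `sleep` and `rng` keeps the retry logic testable without real waiting, which matters when the same harness runs against quota-limited free tiers.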
- External Validity: Confirms findings generalize beyond controlled conditions
- Replication: Demonstrates reproducibility in real-world settings
- Robustness: Tests hypothesis under realistic constraints
- Academic Credibility: Strengthens evidence for peer review
- Industry Relevance: Provides practical, actionable insights
- Methodological Contribution: Establishes gold standard for LLM API testing
- Production Guidance: Real-world recommendations for developers
- Tool Development: Insights for AI-assisted development platforms
- Documentation Strategy: Evidence-based guidance for API writers
- Low Cost: All APIs have sufficient free tiers
- High Value: Significant research and practical benefits
- Manageable Risk: Well-defined fallback strategies
- Timeline: 2-3 weeks additional research time
- Phase 1: Weather + News APIs (highest reliability, best free tiers)
- Phase 2: Currency + User Management APIs (moderate complexity)
- Phase 3: File Storage API (most complex authentication)
- Minimum Viable Validation: 2 domains showing consistent patterns
- Strong Validation: 4+ domains confirming sweet spot phenomenon
- Exceptional Outcome: New insights not captured by mock testing
Recommendation: Proceed with real API validation study to strengthen research validity and uncover additional insights about LLM behavior in production environments. 🎯