
AI Content Processing API

A powerful FastAPI-based REST API for extracting text from various file types using OpenAI GPT and Google Gemini APIs.

🚀 Quick Start

1. Setup Environment

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template and add your API keys
cp env_example.txt .env
# Edit .env with your OpenAI and Google API keys

2. Start the API Server

# Option 1: Using the startup script (recommended)
python run_api.py

# Option 2: Direct uvicorn command
uvicorn api_server:app --host 0.0.0.0 --port 8000 --reload

# Option 3: Using the api_server module
python api_server.py

3. Access API Documentation

Once the server is running, FastAPI serves interactive documentation by default at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc
  • OpenAPI schema: http://localhost:8000/openapi.json

📋 Supported File Types

Documents (OpenAI Processor)

  • PDF: .pdf
  • Word: .doc, .docx
  • Excel: .xls, .xlsx
  • Text: .txt

Multimedia (Gemini Processor)

  • Video: .mp4, .avi, .mov, .mkv
  • Audio: .mp3, .wav, .m4a, .webm, .ogg

YouTube Content (YouTube Processor)

  • YouTube videos and audio content
  • Supports various YouTube URL formats
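The extension-to-processor routing above can be mirrored client-side to validate files before upload. A minimal sketch (the extension sets follow the tables above, and `processor_for` is a hypothetical helper; `"OpenAIProcessor"` appears in the response examples below, while `"GeminiProcessor"` is assumed by analogy — the server's own validation is authoritative):

```python
import os

# Extension-to-processor routing, mirroring the tables above.
DOCUMENT_EXTS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".txt"}
MULTIMEDIA_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".mp3", ".wav", ".m4a", ".webm", ".ogg"}

def processor_for(filename: str) -> str:
    """Return which processor would handle the given file, by extension."""
    ext = os.path.splitext(filename.lower())[1]
    if ext in DOCUMENT_EXTS:
        return "OpenAIProcessor"
    if ext in MULTIMEDIA_EXTS:
        return "GeminiProcessor"
    raise ValueError(f"Unsupported file type: {ext}")
```

A pre-check like this fails fast locally instead of waiting for an HTTP 400 from the API.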

🛠 API Endpoints

Health & Information

  • GET / - API information and welcome message
  • GET /health - Health check and configuration status
  • GET /supported-types - Detailed information about supported file types

Single File Processing

  • POST /extract - Upload and extract text from a single file
  • POST /extract-url - Extract text from a file at a URL
  • POST /extract-youtube - Extract content from a YouTube video

Batch Processing

  • POST /extract-batch - Upload and extract text from multiple files
  • POST /extract-batch-url - Extract text from multiple URLs
  • POST /extract-youtube-batch - Extract content from multiple YouTube videos

📖 API Usage Examples

1. Health Check

curl -X GET "http://localhost:8000/health"

2. Single File Upload

curl -X POST "http://localhost:8000/extract" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"

3. Extract from URL

curl -X POST "http://localhost:8000/extract-url" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "filename": "my_document.pdf"
  }'

4. YouTube Content Extraction

curl -X POST "http://localhost:8000/extract-youtube" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "title": "Custom Video Title"
  }'

5. Batch File Processing

curl -X POST "http://localhost:8000/extract-batch" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@file1.pdf" \
  -F "files=@file2.docx" \
  -F "files=@file3.mp4"

🔧 Configuration

Environment Variables

Create a .env file with the following variables:

# Required API Keys
OPENAI_API_KEY=sk-your-openai-api-key
GOOGLE_API_KEY=your-google-api-key

# Optional API Configuration
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=true
API_LOG_LEVEL=info

# Optional Model Configuration
OPENAI_MODEL=gpt-3.5-turbo
GEMINI_MODEL=models/gemini-1.5-flash
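A sketch of how these variables might be read at startup, with the defaults listed above (`load_config` is an illustrative helper; the project's actual loader may differ):

```python
import os

def load_config() -> dict:
    """Read API configuration from the environment, using the documented defaults."""
    return {
        "openai_api_key": os.getenv("OPENAI_API_KEY", ""),
        "google_api_key": os.getenv("GOOGLE_API_KEY", ""),
        "host": os.getenv("API_HOST", "0.0.0.0"),
        "port": int(os.getenv("API_PORT", "8000")),
        "reload": os.getenv("API_RELOAD", "true").lower() == "true",
        "log_level": os.getenv("API_LOG_LEVEL", "info"),
        "openai_model": os.getenv("OPENAI_MODEL", "gpt-3.5-turbo"),
        "gemini_model": os.getenv("GEMINI_MODEL", "models/gemini-1.5-flash"),
    }
```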

Processing Options

Many endpoints support these query parameters:

  • parallel: Enable parallel processing (default: true)
  • max_workers: Number of worker threads (default: 4, max: 16)

Example:

curl -X POST "http://localhost:8000/extract-batch?parallel=true&max_workers=8" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@file1.pdf" \
  -F "files=@file2.pdf"
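The same query parameters can be attached from Python with `requests`. A sketch that also clamps `max_workers` to the documented maximum of 16 before sending (`batch_extract` and `clamp_workers` are hypothetical helper names; the clamping is a client-side convenience, not something the API requires):

```python
from contextlib import ExitStack

def clamp_workers(n: int) -> int:
    """Clamp max_workers to the documented range (minimum 1, maximum 16)."""
    return max(1, min(n, 16))

def batch_extract(paths, parallel=True, max_workers=4,
                  base_url="http://localhost:8000"):
    """Upload several files to /extract-batch with processing options."""
    import requests  # deferred so clamp_workers stays dependency-free

    params = {"parallel": parallel, "max_workers": clamp_workers(max_workers)}
    with ExitStack() as stack:
        # Open every file and guarantee all handles are closed afterwards.
        files = [("files", stack.enter_context(open(p, "rb"))) for p in paths]
        response = requests.post(f"{base_url}/extract-batch",
                                 params=params, files=files)
    response.raise_for_status()
    return response.json()
```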

📊 Response Format

Successful Extraction Response

{
  "file_id": "550e8400-e29b-41d4-a716-446655440000",
  "file_info": {
    "name": "document.pdf",
    "extension": ".pdf",
    "size_mb": 2.5,
    "mime_type": "application/pdf",
    "url": "https://example.com/document.pdf"
  },
  "success": true,
  "extracted_text": "This is the extracted text content...",
  "processor_used": "OpenAIProcessor",
  "processing_time": 3.45,
  "text_length": 1250,
  "timestamp": "2024-01-15T10:30:00"
}

Batch Processing Response

{
  "batch_id": "batch-550e8400-e29b-41d4-a716-446655440000",
  "total_files": 3,
  "successful": 2,
  "failed": 1,
  "total_processing_time": 15.67,
  "total_characters": 5420,
  "results": [
    // Array of ExtractionResult objects
  ],
  "timestamp": "2024-01-15T10:30:00"
}
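A small helper can sanity-check a batch response against its own counts before using it (field names are taken from the example above; `summarize_batch` is an illustrative name):

```python
def summarize_batch(batch: dict) -> str:
    """Render a one-line summary and verify the success/failure counts add up."""
    total = batch["total_files"]
    ok = batch["successful"]
    failed = batch["failed"]
    if ok + failed != total:
        raise ValueError("successful + failed does not match total_files")
    return (f"{ok}/{total} files succeeded in "
            f"{batch['total_processing_time']:.2f}s "
            f"({batch['total_characters']} characters)")
```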

Error Response

{
  "error": "HTTP 400",
  "detail": "Unsupported file type: .xyz. Supported: .pdf, .doc, .docx, ...",
  "timestamp": "2024-01-15T10:30:00"
}
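The two payload shapes above differ in their keys (`error`/`detail` vs. `success`/`extracted_text`), so a client can branch on them. A sketch, assuming those key names as documented (`interpret_response` is a hypothetical helper):

```python
def interpret_response(payload: dict) -> tuple[bool, str]:
    """Return (ok, message) for either response shape shown above."""
    if "error" in payload:
        # Error response: {"error": "HTTP 400", "detail": "..."}
        return False, f"{payload['error']}: {payload.get('detail', '')}"
    if payload.get("success"):
        return True, payload.get("extracted_text", "")
    return False, "extraction failed"
```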

🐍 Python Client Example

import requests

# API base URL
BASE_URL = "http://localhost:8000"

def check_health():
    """Check API health."""
    response = requests.get(f"{BASE_URL}/health")
    return response.json()

def extract_from_file(file_path):
    """Extract text from a local file."""
    with open(file_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(f"{BASE_URL}/extract", files=files)
    return response.json()

def extract_from_url(url, filename=None):
    """Extract text from a URL."""
    data = {"url": url}
    if filename:
        data["filename"] = filename

    response = requests.post(
        f"{BASE_URL}/extract-url",
        json=data
    )
    return response.json()

def extract_from_youtube(youtube_url, title=None):
    """Extract content from YouTube."""
    data = {"url": youtube_url}
    if title:
        data["title"] = title

    response = requests.post(
        f"{BASE_URL}/extract-youtube",
        json=data
    )
    return response.json()

# Example usage
if __name__ == "__main__":
    # Health check
    health = check_health()
    print("API Status:", health["status"])

    # Extract from file
    result = extract_from_file("document.pdf")
    if result["success"]:
        print(f"Extracted {result['text_length']} characters")
        print("Text preview:", result["extracted_text"][:200])
    else:
        print("Error:", result["error"])

🚦 Production Deployment

Docker Deployment

FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["python", "run_api.py"]

Environment Configuration for Production

# Production settings
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=false
API_LOG_LEVEL=info

# Security (configure as needed)
CORS_ORIGINS=["https://yourdomain.com"]

Running with Gunicorn (Production)

pip install gunicorn
gunicorn api_server:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

🔒 Security Considerations

API Key Security

  • Store API keys in environment variables, never in code
  • Use secrets management in production
  • Rotate API keys regularly

File Upload Security

  • File size limits are enforced (100MB default)
  • File type validation based on extensions
  • Temporary files are automatically cleaned up
  • Consider implementing authentication for production use
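The size and type checks above can also be applied client-side so that oversized or unsupported files fail fast before upload (100 MB limit and the extension allow-list come from this document; `validate_upload` is an illustrative helper, and the server remains the source of truth):

```python
import os

# Allow-list from the "Supported File Types" section above.
ALLOWED_EXTS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".txt",
                ".mp4", ".avi", ".mov", ".mkv",
                ".mp3", ".wav", ".m4a", ".webm", ".ogg"}
MAX_SIZE_MB = 100  # documented default limit

def validate_upload(path: str) -> None:
    """Raise ValueError if the file would be rejected by the API."""
    ext = os.path.splitext(path.lower())[1]
    if ext not in ALLOWED_EXTS:
        raise ValueError(f"Unsupported file type: {ext}")
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > MAX_SIZE_MB:
        raise ValueError(f"File too large: {size_mb:.1f} MB (limit {MAX_SIZE_MB} MB)")
```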

Rate Limiting (Recommended)

from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# Add to endpoints (slowapi requires a `request: Request` parameter
# on every rate-limited endpoint):
@limiter.limit("10/minute")
async def extract_text_from_file(request: Request, ...):
    ...

📈 Monitoring and Logging

Built-in Logging

The API includes structured logging for:

  • Request processing
  • Error tracking
  • Performance monitoring
  • File cleanup operations

Health Monitoring

  • /health endpoint provides comprehensive status
  • API key configuration status
  • Processor availability
  • Supported file types

Performance Metrics

  • Processing time tracking
  • Success/failure rates
  • Character count statistics
  • Batch processing metrics

πŸ› Troubleshooting

Common Issues

  1. API Keys Not Working

    # Check your .env file
    cat .env
    
    # Verify API keys in health endpoint
    curl http://localhost:8000/health
  2. File Upload Fails

    • Check file size (max 100MB)
    • Verify file type is supported
    • Ensure proper multipart/form-data content type
  3. YouTube Processing Issues

    • Verify yt-dlp is installed: pip install yt-dlp
    • Check if video is publicly available
    • Some videos may have regional restrictions
  4. Memory Issues with Large Files

    • Use smaller batch sizes
    • Reduce max_workers parameter
    • Consider implementing streaming for very large files
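For "use smaller batch sizes", a simple chunking helper can split a long file list so each slice becomes its own /extract-batch request (the chunk size of 5 and the `upload_batch` helper are illustrative):

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. submit each group of 5 files as a separate /extract-batch request:
# for group in chunked(all_paths, 5):
#     upload_batch(group)  # hypothetical upload helper
```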

Debug Mode

# Run with debug logging
API_LOG_LEVEL=debug python run_api.py

🤝 Contributing

  1. Follow the project's coding standards (see cursor rules)
  2. Add tests for new endpoints
  3. Update documentation for API changes
  4. Ensure proper error handling
  5. Follow security best practices

📄 License

This project is licensed under the MIT License. See LICENSE file for details.