A powerful FastAPI-based REST API for extracting text from various file types using OpenAI GPT and Google Gemini APIs.
```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template and add your API keys
cp env_example.txt .env
# Edit .env with your OpenAI and Google API keys
```

Start the server with any of the following:

```bash
# Option 1: Using the startup script (recommended)
python run_api.py

# Option 2: Direct uvicorn command
uvicorn api_server:app --host 0.0.0.0 --port 8000 --reload

# Option 3: Using the api_server module
python api_server.py
```

Once the server is running, interactive documentation is available at:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- API Info: http://localhost:8000/
- PDF: `.pdf`
- Word: `.doc`, `.docx`
- Excel: `.xls`, `.xlsx`
- Text: `.txt`
- Video: `.mp4`, `.avi`, `.mov`, `.mkv`
- Audio: `.mp3`, `.wav`, `.m4a`, `.webm`, `.ogg`
- YouTube videos and audio content
- Supports various YouTube URL formats
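Since the API validates uploads by extension, a client can mirror that check before sending a file. A minimal sketch, assuming the extension list above is complete (the function name is illustrative, not part of the API):

```python
from pathlib import Path

# Supported extensions, as listed above
SUPPORTED = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".txt",
             ".mp4", ".avi", ".mov", ".mkv",
             ".mp3", ".wav", ".m4a", ".webm", ".ogg"}

def is_supported(filename: str) -> bool:
    """Client-side pre-check before uploading (illustrative)."""
    return Path(filename).suffix.lower() in SUPPORTED

print(is_supported("report.PDF"))   # True (case-insensitive)
print(is_supported("archive.zip"))  # False
```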
- `GET /` - API information and welcome message
- `GET /health` - Health check and configuration status
- `GET /supported-types` - Detailed information about supported file types
- `POST /extract` - Upload and extract text from a single file
- `POST /extract-url` - Extract text from a file at a URL
- `POST /extract-youtube` - Extract content from a YouTube video
- `POST /extract-batch` - Upload and extract text from multiple files
- `POST /extract-batch-url` - Extract text from multiple URLs
- `POST /extract-youtube-batch` - Extract content from multiple YouTube videos
```bash
curl -X GET "http://localhost:8000/health"
```

```bash
curl -X POST "http://localhost:8000/extract" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"
```

```bash
curl -X POST "http://localhost:8000/extract-url" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "filename": "my_document.pdf"
  }'
```

```bash
curl -X POST "http://localhost:8000/extract-youtube" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "title": "Custom Video Title"
  }'
```

```bash
curl -X POST "http://localhost:8000/extract-batch" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@file1.pdf" \
  -F "files=@file2.docx" \
  -F "files=@file3.mp4"
```

Create a `.env` file with the following variables:
```env
# Required API Keys
OPENAI_API_KEY=sk-your-openai-api-key
GOOGLE_API_KEY=your-google-api-key

# Optional API Configuration
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=true
API_LOG_LEVEL=info

# Optional Model Configuration
OPENAI_MODEL=gpt-3.5-turbo
GEMINI_MODEL=models/gemini-1.5-flash
```

Many endpoints support these query parameters:

- `parallel`: Enable parallel processing (default: `true`)
- `max_workers`: Number of worker threads (default: 4, max: 16)
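Under the hood, this style of parallel batch processing is typically a thread pool. A minimal stand-alone sketch of the `parallel`/`max_workers` behavior (hypothetical, not the project's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def extract(name):
    """Stand-in for a real per-file extraction call (hypothetical)."""
    return f"processed {name}"

def process_batch(files, parallel=True, max_workers=4):
    """Mimics the documented parallel/max_workers semantics."""
    max_workers = min(max_workers, 16)  # documented cap of 16 workers
    if not parallel or not files:
        return [extract(f) for f in files]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, files))  # map preserves input order

print(process_batch(["file1.pdf", "file2.pdf"], max_workers=8))
```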
Example:
```bash
curl -X POST "http://localhost:8000/extract-batch?parallel=true&max_workers=8" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@file1.pdf" \
  -F "files=@file2.pdf"
```

A successful single-file extraction returns:

```json
{
  "file_id": "550e8400-e29b-41d4-a716-446655440000",
  "file_info": {
    "name": "document.pdf",
    "extension": ".pdf",
    "size_mb": 2.5,
    "mime_type": "application/pdf",
    "url": "https://example.com/document.pdf"
  },
  "success": true,
  "extracted_text": "This is the extracted text content...",
  "processor_used": "OpenAIProcessor",
  "processing_time": 3.45,
  "text_length": 1250,
  "timestamp": "2024-01-15T10:30:00"
}
```

A batch request returns:

```json
{
  "batch_id": "batch-550e8400-e29b-41d4-a716-446655440000",
  "total_files": 3,
  "successful": 2,
  "failed": 1,
  "total_processing_time": 15.67,
  "total_characters": 5420,
  "results": [
    // Array of ExtractionResult objects
  ],
  "timestamp": "2024-01-15T10:30:00"
}
```

Errors are returned in a consistent format:

```json
{
  "error": "HTTP 400",
  "detail": "Unsupported file type: .xyz. Supported: .pdf, .doc, .docx, ...",
  "timestamp": "2024-01-15T10:30:00"
}
```
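Given the batch response shape above, a client can compute a quick summary. A small sketch using only the documented fields (the helper name is illustrative):

```python
def summarize_batch(batch: dict) -> str:
    """One-line summary from a batch response (fields as documented above)."""
    total = batch["total_files"]
    rate = batch["successful"] / total * 100 if total else 0.0
    return (f"{batch['successful']}/{total} files succeeded "
            f"({rate:.0f}%), {batch['total_characters']} characters "
            f"in {batch['total_processing_time']:.2f}s")

example = {
    "total_files": 3, "successful": 2, "failed": 1,
    "total_processing_time": 15.67, "total_characters": 5420,
}
print(summarize_batch(example))  # 2/3 files succeeded (67%), 5420 characters in 15.67s
```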
A simple Python client:

```python
import requests
import json

# API base URL
BASE_URL = "http://localhost:8000"

def check_health():
    """Check API health."""
    response = requests.get(f"{BASE_URL}/health")
    return response.json()

def extract_from_file(file_path):
    """Extract text from a local file."""
    with open(file_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(f"{BASE_URL}/extract", files=files)
    return response.json()

def extract_from_url(url, filename=None):
    """Extract text from a URL."""
    data = {"url": url}
    if filename:
        data["filename"] = filename
    response = requests.post(f"{BASE_URL}/extract-url", json=data)
    return response.json()

def extract_from_youtube(youtube_url, title=None):
    """Extract content from YouTube."""
    data = {"url": youtube_url}
    if title:
        data["title"] = title
    response = requests.post(f"{BASE_URL}/extract-youtube", json=data)
    return response.json()

# Example usage
if __name__ == "__main__":
    # Health check
    health = check_health()
    print("API Status:", health["status"])

    # Extract from file
    result = extract_from_file("document.pdf")
    if result["success"]:
        print(f"Extracted {result['text_length']} characters")
        print("Text preview:", result["extracted_text"][:200])
    else:
        print("Error:", result["error"])
```

Example `Dockerfile`:

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "run_api.py"]
```
```env
# Production settings
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=false
API_LOG_LEVEL=info

# Security (configure as needed)
CORS_ORIGINS=["https://yourdomain.com"]
```

Run with Gunicorn and Uvicorn workers:

```bash
pip install gunicorn
gunicorn api_server:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
```

- Store API keys in environment variables, never in code
- Use secrets management in production
- Rotate API keys regularly
- File size limits are enforced (100MB default)
- File type validation based on extensions
- Temporary files are automatically cleaned up
- Consider implementing authentication for production use
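The automatic temporary-file cleanup mentioned above is commonly implemented with a context manager. A minimal sketch (illustrative, not the project's actual code):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_upload(data: bytes, suffix: str):
    """Write upload bytes to a temp file and guarantee removal afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        yield path  # caller processes the file here
    finally:
        os.remove(path)  # cleanup runs even if processing raises

with temp_upload(b"%PDF-1.4 ...", ".pdf") as p:
    print(os.path.exists(p))  # True
# file is removed once the with-block exits
```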
Rate limiting can be added with `slowapi`:

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# Add to endpoints (slowapi requires a `request: Request` parameter):
@limiter.limit("10/minute")
async def extract_text_from_file(request: Request, ...):
    ...
```

The API includes structured logging for:
- Request processing
- Error tracking
- Performance monitoring
- File cleanup operations
The `/health` endpoint provides comprehensive status information:

- API key configuration status
- Processor availability
- Supported file types
Available metrics include:

- Processing time tracking
- Success/failure rates
- Character count statistics
- Batch processing metrics
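The counters listed above could be tracked with a small in-memory structure. A hedged sketch (all names illustrative, not the project's actual monitoring code):

```python
from dataclasses import dataclass

@dataclass
class ExtractionMetrics:
    """In-memory counters mirroring the metrics listed above (illustrative)."""
    successes: int = 0
    failures: int = 0
    total_characters: int = 0
    total_time: float = 0.0

    def record(self, success: bool, chars: int, seconds: float) -> None:
        if success:
            self.successes += 1
            self.total_characters += chars
        else:
            self.failures += 1
        self.total_time += seconds

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0

m = ExtractionMetrics()
m.record(True, 1250, 3.45)   # one successful extraction
m.record(False, 0, 1.2)      # one failure
print(m.success_rate)  # 0.5
```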
- **API Keys Not Working**

  ```bash
  # Check your .env file
  cat .env

  # Verify API keys in health endpoint
  curl http://localhost:8000/health
  ```

- **File Upload Fails**
  - Check file size (max 100MB)
  - Verify file type is supported
  - Ensure proper `multipart/form-data` content type

- **YouTube Processing Issues**
  - Verify yt-dlp is installed: `pip install yt-dlp`
  - Check if video is publicly available
  - Some videos may have regional restrictions

- **Memory Issues with Large Files**
  - Use smaller batch sizes
  - Reduce `max_workers` parameter
  - Consider implementing streaming for very large files

Run with debug logging:

```bash
API_LOG_LEVEL=debug python run_api.py
```

- Follow the project's coding standards (see cursor rules)
- Add tests for new endpoints
- Update documentation for API changes
- Ensure proper error handling
- Follow security best practices
This project is licensed under the MIT License. See LICENSE file for details.