A powerful FastAPI-based REST API for extracting text from various file types using OpenAI GPT and Google Gemini APIs.
```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template and add your API keys
cp env_example.txt .env
# Edit .env with your OpenAI and Google API keys
```

Start the server with any of the following:

```bash
# Option 1: Using the startup script (recommended)
python run_api.py

# Option 2: Direct uvicorn command
uvicorn api_server:app --host 0.0.0.0 --port 8000 --reload

# Option 3: Using the api_server module
python api_server.py
```

Once the server is running, interactive documentation is available at:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- API Info: http://localhost:8000/
- PDF: `.pdf`
- Word: `.doc`, `.docx`
- Excel: `.xls`, `.xlsx`
- Text: `.txt`
- Video: `.mp4`, `.avi`, `.mov`, `.mkv`
- Audio: `.mp3`, `.wav`, `.m4a`, `.webm`, `.ogg`
- YouTube videos and audio content
- Supports various YouTube URL formats
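Since the API validates uploads by extension, a client can mirror that check before sending a file. A minimal sketch, assuming the extension list above is complete (the function name is illustrative, not part of the API):

```python
from pathlib import Path

# Supported extensions, as listed above
SUPPORTED = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".txt",
             ".mp4", ".avi", ".mov", ".mkv",
             ".mp3", ".wav", ".m4a", ".webm", ".ogg"}

def is_supported(filename: str) -> bool:
    """Client-side pre-check before uploading (illustrative)."""
    return Path(filename).suffix.lower() in SUPPORTED

print(is_supported("report.PDF"))   # True (case-insensitive)
print(is_supported("archive.zip"))  # False
```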
- `GET /` - API information and welcome message
- `GET /health` - Health check and configuration status
- `GET /supported-types` - Detailed information about supported file types
- `POST /extract` - Upload and extract text from a single file
- `POST /extract-url` - Extract text from a file at a URL
- `POST /extract-youtube` - Extract content from a YouTube video
- `POST /extract-batch` - Upload and extract text from multiple files
- `POST /extract-batch-url` - Extract text from multiple URLs
- `POST /extract-youtube-batch` - Extract content from multiple YouTube videos
```bash
curl -X GET "http://localhost:8000/health"
```

```bash
curl -X POST "http://localhost:8000/extract" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"
```

```bash
curl -X POST "http://localhost:8000/extract-url" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "filename": "my_document.pdf"
  }'
```

```bash
curl -X POST "http://localhost:8000/extract-youtube" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "title": "Custom Video Title"
  }'
```

```bash
curl -X POST "http://localhost:8000/extract-batch" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@file1.pdf" \
  -F "files=@file2.docx" \
  -F "files=@file3.mp4"
```

Create a `.env` file with the following variables:
```env
# Required API Keys
OPENAI_API_KEY=sk-your-openai-api-key
GOOGLE_API_KEY=your-google-api-key

# Optional API Configuration
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=true
API_LOG_LEVEL=info

# Optional Model Configuration
OPENAI_MODEL=gpt-3.5-turbo
GEMINI_MODEL=models/gemini-1.5-flash
```

Many endpoints support these query parameters:

- `parallel`: Enable parallel processing (default: `true`)
- `max_workers`: Number of worker threads (default: 4, max: 16)
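Under the hood, this style of parallel batch processing is typically a thread pool. A minimal stand-alone sketch of the `parallel`/`max_workers` behavior (hypothetical, not the project's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def extract(name):
    """Stand-in for a real per-file extraction call (hypothetical)."""
    return f"processed {name}"

def process_batch(files, parallel=True, max_workers=4):
    """Mimics the documented parallel/max_workers semantics."""
    max_workers = min(max_workers, 16)  # documented cap of 16 workers
    if not parallel or not files:
        return [extract(f) for f in files]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, files))  # map preserves input order

print(process_batch(["file1.pdf", "file2.pdf"], max_workers=8))
```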
Example:
```bash
curl -X POST "http://localhost:8000/extract-batch?parallel=true&max_workers=8" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@file1.pdf" \
  -F "files=@file2.pdf"
```

A successful single-file extraction returns:

```json
{
  "file_id": "550e8400-e29b-41d4-a716-446655440000",
  "file_info": {
    "name": "document.pdf",
    "extension": ".pdf",
    "size_mb": 2.5,
    "mime_type": "application/pdf",
    "url": "https://example.com/document.pdf"
  },
  "success": true,
  "extracted_text": "This is the extracted text content...",
  "processor_used": "OpenAIProcessor",
  "processing_time": 3.45,
  "text_length": 1250,
  "timestamp": "2024-01-15T10:30:00"
}
```

A batch request returns:

```json
{
  "batch_id": "batch-550e8400-e29b-41d4-a716-446655440000",
  "total_files": 3,
  "successful": 2,
  "failed": 1,
  "total_processing_time": 15.67,
  "total_characters": 5420,
  "results": [
    // Array of ExtractionResult objects
  ],
  "timestamp": "2024-01-15T10:30:00"
}
```

Errors are returned in a consistent format:

```json
{
  "error": "HTTP 400",
  "detail": "Unsupported file type: .xyz. Supported: .pdf, .doc, .docx, ...",
  "timestamp": "2024-01-15T10:30:00"
}
```
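Given the batch response shape above, a client can compute a quick summary. A small sketch using only the documented fields (the helper name is illustrative):

```python
def summarize_batch(batch: dict) -> str:
    """One-line summary from a batch response (fields as documented above)."""
    total = batch["total_files"]
    rate = batch["successful"] / total * 100 if total else 0.0
    return (f"{batch['successful']}/{total} files succeeded "
            f"({rate:.0f}%), {batch['total_characters']} characters "
            f"in {batch['total_processing_time']:.2f}s")

example = {
    "total_files": 3, "successful": 2, "failed": 1,
    "total_processing_time": 15.67, "total_characters": 5420,
}
print(summarize_batch(example))  # 2/3 files succeeded (67%), 5420 characters in 15.67s
```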
A simple Python client:

```python
import requests
import json

# API base URL
BASE_URL = "http://localhost:8000"

def check_health():
    """Check API health."""
    response = requests.get(f"{BASE_URL}/health")
    return response.json()

def extract_from_file(file_path):
    """Extract text from a local file."""
    with open(file_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(f"{BASE_URL}/extract", files=files)
    return response.json()

def extract_from_url(url, filename=None):
    """Extract text from a URL."""
    data = {"url": url}
    if filename:
        data["filename"] = filename
    response = requests.post(f"{BASE_URL}/extract-url", json=data)
    return response.json()

def extract_from_youtube(youtube_url, title=None):
    """Extract content from YouTube."""
    data = {"url": youtube_url}
    if title:
        data["title"] = title
    response = requests.post(f"{BASE_URL}/extract-youtube", json=data)
    return response.json()

# Example usage
if __name__ == "__main__":
    # Health check
    health = check_health()
    print("API Status:", health["status"])

    # Extract from file
    result = extract_from_file("document.pdf")
    if result["success"]:
        print(f"Extracted {result['text_length']} characters")
        print("Text preview:", result["extracted_text"][:200])
    else:
        print("Error:", result["error"])
```

Example `Dockerfile`:

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "run_api.py"]
```
```env
# Production settings
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=false
API_LOG_LEVEL=info

# Security (configure as needed)
CORS_ORIGINS=["https://yourdomain.com"]
```

Run with Gunicorn and Uvicorn workers:

```bash
pip install gunicorn
gunicorn api_server:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
```

- Store API keys in environment variables, never in code
- Use secrets management in production
- Rotate API keys regularly
- File size limits are enforced (100MB default)
- File type validation based on extensions
- Temporary files are automatically cleaned up
- Consider implementing authentication for production use
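The automatic temporary-file cleanup mentioned above is commonly implemented with a context manager. A minimal sketch (illustrative, not the project's actual code):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_upload(data: bytes, suffix: str):
    """Write upload bytes to a temp file and guarantee removal afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        yield path  # caller processes the file here
    finally:
        os.remove(path)  # cleanup runs even if processing raises

with temp_upload(b"%PDF-1.4 ...", ".pdf") as p:
    print(os.path.exists(p))  # True
# file is removed once the with-block exits
```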
Rate limiting can be added with `slowapi`:

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# Add to endpoints (slowapi requires a `request: Request` parameter):
@limiter.limit("10/minute")
async def extract_text_from_file(request: Request, ...):
    ...
```

The API includes structured logging for:
- Request processing
- Error tracking
- Performance monitoring
- File cleanup operations
The `/health` endpoint provides comprehensive status information:

- API key configuration status
- Processor availability
- Supported file types
Available metrics include:

- Processing time tracking
- Success/failure rates
- Character count statistics
- Batch processing metrics
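The counters listed above could be tracked with a small in-memory structure. A hedged sketch (all names illustrative, not the project's actual monitoring code):

```python
from dataclasses import dataclass

@dataclass
class ExtractionMetrics:
    """In-memory counters mirroring the metrics listed above (illustrative)."""
    successes: int = 0
    failures: int = 0
    total_characters: int = 0
    total_time: float = 0.0

    def record(self, success: bool, chars: int, seconds: float) -> None:
        if success:
            self.successes += 1
            self.total_characters += chars
        else:
            self.failures += 1
        self.total_time += seconds

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0

m = ExtractionMetrics()
m.record(True, 1250, 3.45)   # one successful extraction
m.record(False, 0, 1.2)      # one failure
print(m.success_rate)  # 0.5
```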
- **API Keys Not Working**

  ```bash
  # Check your .env file
  cat .env

  # Verify API keys in health endpoint
  curl http://localhost:8000/health
  ```

- **File Upload Fails**
  - Check file size (max 100MB)
  - Verify file type is supported
  - Ensure proper `multipart/form-data` content type

- **YouTube Processing Issues**
  - Verify yt-dlp is installed: `pip install yt-dlp`
  - Check if video is publicly available
  - Some videos may have regional restrictions

- **Memory Issues with Large Files**
  - Use smaller batch sizes
  - Reduce `max_workers` parameter
  - Consider implementing streaming for very large files

Run with debug logging:

```bash
API_LOG_LEVEL=debug python run_api.py
```

- Follow the project's coding standards (see cursor rules)
- Add tests for new endpoints
- Update documentation for API changes
- Ensure proper error handling
- Follow security best practices
This project is licensed under the MIT License. See LICENSE file for details.