
Text Extraction API

A FastAPI-based REST API for extracting text from various file types using AI services (OpenAI + Gemini).

Features

  • Multi-format Support: Extract text from documents (PDF, DOC, TXT, XLS) and multimedia files (MP4, MP3, AVI, etc.)

  • AI-Powered: Uses OpenAI for documents and Google Gemini for multimedia

  • Audio-Only Processing: Video files are processed for audio transcription only (no video frame analysis)

  • YouTube Support: Direct transcription from YouTube videos via audio extraction

  • Web Page Support: Extract clean text from Wikipedia, blogs, articles, and other web content

  • RESTful API: Standard HTTP endpoints with JSON responses

  • File Upload: Support for single and batch file uploads

  • Async Processing: Built with FastAPI for high performance

  • Automatic Cleanup: Temporary files are cleaned up automatically

  • Error Handling: Comprehensive error handling with detailed messages

Supported File Types

Documents (OpenAI Processor)

  • .pdf - PDF documents
  • .txt - Plain text files
  • .doc, .docx - Microsoft Word documents
  • .xls, .xlsx - Microsoft Excel spreadsheets

Multimedia (Gemini Processor) - Audio Only

  • .mp4, .avi, .mov, .mkv - Video files (audio track transcription only)
  • .mp3, .wav, .m4a, .webm, .ogg - Audio files (full transcription)

YouTube Videos (YouTube Processor) - Audio Only

  • YouTube video URLs (youtube.com, youtu.be, m.youtube.com)
  • Downloads audio and generates transcripts using AI (no video analysis)

Web Pages (Web Processor) - HTML Content Extraction

  • Wikipedia pages (wikipedia.org)
  • Blog posts and articles
  • News websites and documentation
  • Any HTML-based web content
  • Automatic content cleaning and formatting
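The routing above can be sketched as a simple dispatch on URL host and file extension. This is an illustrative sketch of the mapping, not the API's actual implementation; the function name and constants are made up, and the real server's routing logic may differ.

```python
from urllib.parse import urlparse

# Hypothetical dispatch mirroring the supported-types tables above.
DOCUMENT_EXTS = {".pdf", ".txt", ".doc", ".docx", ".xls", ".xlsx"}
MULTIMEDIA_EXTS = {".mp4", ".avi", ".mov", ".mkv",
                   ".mp3", ".wav", ".m4a", ".webm", ".ogg"}
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be", "m.youtube.com"}

def pick_processor(url: str) -> str:
    """Return the processor name that would handle this URL."""
    parsed = urlparse(url)
    if parsed.netloc.lower() in YOUTUBE_HOSTS:
        return "YouTubeProcessor"
    path = parsed.path.lower()
    last_segment = path.rsplit("/", 1)[-1]
    ext = path[path.rfind("."):] if "." in last_segment else ""
    if ext in DOCUMENT_EXTS:
        return "OpenAIProcessor"
    if ext in MULTIMEDIA_EXTS:
        return "GeminiProcessor"
    return "WebProcessor"  # fall back to HTML content extraction
```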

Installation

  1. Clone the repository and navigate to the project directory

  2. Create a virtual environment:

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Install system dependencies for YouTube processing:

    # macOS
    brew install ffmpeg
    
    # Ubuntu/Debian
    sudo apt update && sudo apt install ffmpeg
    
    # Windows (using chocolatey)
    choco install ffmpeg
  5. Set up environment variables in .env file:

    OPENAI_API_KEY=your_openai_api_key_here
    GOOGLE_API_KEY=your_gemini_api_key_here
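
If the server reads these keys from the environment, a quick preflight check can catch a missing key before startup. A minimal sketch (the variable names match the .env entries above; how api_server.py actually loads them, e.g. via python-dotenv, is not shown here and the helper name is made up):

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")

def missing_keys(env=os.environ):
    """Return the required API keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```

Run this before starting the server; a non-empty result means your .env file is incomplete.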
    

Starting the Server

Standard Method

python api_server.py

For Python Version Issues

If you encounter module import errors, use the explicit Python path:

# Check your Python installation path
which python3
python3 --version

# Use explicit path if needed (example for Homebrew Python)
/opt/homebrew/opt/python@3.13/bin/python3.13 api_server.py

The server will start on http://localhost:8000

API Endpoints

Health Check

GET /health

Check API health and configuration status.

curl http://localhost:8000/health

Response:

{
    "status": "healthy",
    "message": "Text Extraction API is running",
    "supported_extensions": {...},
    "api_keys_configured": {...}
}

Get Supported Types

GET /supported-types

Get list of supported file types and processors.

curl http://localhost:8000/supported-types

Extract Text from URL

POST /extract-url

Extract text from a single file URL or web page.

For files (PDFs, documents, etc.):

curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf"}'

For web pages (Wikipedia, articles, etc.):

curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/History_of_Portugal"}'

For audio files (WebM, OGG, MP3, etc.):

curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.webm"}'
curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.ogg"}'

With custom filename:

curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/file", "filename": "document.pdf"}'

Extract Text from Multiple URLs

POST /extract-batch-url

Extract text from multiple file URLs or web pages (max 10 URLs).

curl -X POST http://localhost:8000/extract-batch-url \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/doc1.pdf", "https://en.wikipedia.org/wiki/Portugal", "https://example.com/doc2.txt"]}'

With custom filenames:

curl -X POST http://localhost:8000/extract-batch-url \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/file1", "https://example.com/file2"], "filenames": ["doc1.pdf", "doc2.txt"]}'
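
The batch request body can be built and validated client-side before sending. An illustrative helper (the 10-URL limit matches the endpoint's documented maximum; the function name is hypothetical):

```python
def build_batch_payload(urls, filenames=None):
    """Build the JSON body for POST /extract-batch-url, enforcing its limits."""
    if not urls:
        raise ValueError("at least one URL is required")
    if len(urls) > 10:
        raise ValueError("max 10 URLs per batch request")
    payload = {"urls": list(urls)}
    if filenames is not None:
        if len(filenames) != len(urls):
            raise ValueError("filenames must match urls one-to-one")
        payload["filenames"] = list(filenames)
    return payload
```

Pass the result straight to requests, e.g. `requests.post("http://localhost:8000/extract-batch-url", json=build_batch_payload(urls))`.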

Extract Text from YouTube Video

POST /extract-youtube

Extract transcript from a single YouTube video using audio processing.

curl -X POST http://localhost:8000/extract-youtube \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'

With custom title:

curl -X POST http://localhost:8000/extract-youtube \
  -H "Content-Type: application/json" \
  -d '{"url": "https://youtu.be/dQw4w9WgXcQ", "title": "Custom Video Title"}'

Extract Text from Multiple YouTube Videos

POST /extract-youtube-batch

Extract transcripts from multiple YouTube videos (max 10 videos).

curl -X POST http://localhost:8000/extract-youtube-batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ", "https://youtu.be/9bZkp7q19f0"]}'

With custom titles:

curl -X POST http://localhost:8000/extract-youtube-batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ", "https://youtu.be/9bZkp7q19f0"], "titles": ["Video 1", "Video 2"]}'
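
Since only youtube.com, youtu.be, and m.youtube.com URLs are supported, a client can pre-filter a batch before submitting it. A sketch (the accepted-host list is taken from the Supported File Types section; the helper itself is illustrative):

```python
from urllib.parse import urlparse

SUPPORTED_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be", "m.youtube.com"}

def validate_youtube_batch(urls):
    """Split URLs into (accepted, rejected) and enforce the 10-video limit."""
    if len(urls) > 10:
        raise ValueError("max 10 videos per batch request")
    accepted, rejected = [], []
    for url in urls:
        host = urlparse(url).netloc.lower()
        (accepted if host in SUPPORTED_HOSTS else rejected).append(url)
    return accepted, rejected
```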

Extract Text from Uploaded File

POST /extract

Upload and extract text from a single file.

curl -X POST -F "file=@document.pdf" http://localhost:8000/extract

Response:

{
    "file_id": "uuid-here",
    "file_info": {
        "name": "document.pdf",
        "extension": ".pdf",
        "size_mb": 1.5,
        "mime_type": "application/pdf"
    },
    "success": true,
    "extracted_text": "Extracted text content...",
    "processor_used": "OpenAIProcessor",
    "processing_time": 2.45,
    "text_length": 1234,
    "timestamp": "2025-06-25T14:18:55.955439"
}

Extract Text from Multiple Uploaded Files

POST /extract-batch

Upload and extract text from multiple files (max 10 files).

curl -X POST \
  -F "files=@document1.pdf" \
  -F "files=@document2.txt" \
  -F "files=@video.mp4" \
  http://localhost:8000/extract-batch

Response:

{
    "batch_id": "uuid-here",
    "total_files": 3,
    "successful": 3,
    "failed": 0,
    "total_processing_time": 15.2,
    "total_characters": 5678,
    "results": [...],
    "timestamp": "2025-06-25T14:18:55.955439"
}
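
With requests, the multipart field name is repeated (`files=...`) once per file, as in the curl example above. An illustrative helper that builds the field list and guesses MIME types (the function name is made up; the 10-file cap matches the documented limit):

```python
import mimetypes
from pathlib import Path

def build_upload_fields(paths):
    """Build the multipart 'files' entries for POST /extract-batch."""
    if len(paths) > 10:
        raise ValueError("max 10 files per batch request")
    fields = []
    for p in map(Path, paths):
        mime = mimetypes.guess_type(p.name)[0] or "application/octet-stream"
        # Each tuple becomes one repeated 'files' form field.
        fields.append(("files", (p.name, p.read_bytes(), mime)))
    return fields
```

Usage: `requests.post("http://localhost:8000/extract-batch", files=build_upload_fields(paths))`.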

Interactive Documentation

FastAPI serves interactive API documentation automatically at its default paths:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Python Client Example

import requests

# Health check
response = requests.get("http://localhost:8000/health")
print(response.json())

# Upload a file
with open("document.pdf", "rb") as f:
    files = {"file": ("document.pdf", f, "application/pdf")}
    response = requests.post("http://localhost:8000/extract", files=files)
    result = response.json()
    print(f"Extracted {result['text_length']} characters")
    print(result['extracted_text'][:200] + "...")

# Extract from web page
web_data = {
    "url": "https://en.wikipedia.org/wiki/History_of_Portugal"
}
response = requests.post("http://localhost:8000/extract-url", json=web_data)
result = response.json()
if result['success']:
    print(f"Web content extracted: {result['text_length']} characters")
    print(result['extracted_text'][:300] + "...")

For more examples, see api_client_examples.py and youtube_client_examples.py.

Response Format

Success Response

{
    "file_id": "unique-identifier",
    "file_info": {
        "name": "filename.ext",
        "extension": ".ext",
        "size_mb": 1.23,
        "mime_type": "mime/type"
    },
    "success": true,
    "extracted_text": "Content here...",
    "processor_used": "OpenAIProcessor",
    "processing_time": 1.23,
    "text_length": 456,
    "timestamp": "ISO-8601-timestamp"
}

Error Response

{
    "file_id": "unique-identifier",
    "file_info": {...},
    "success": false,
    "error": "Error description",
    "processor_used": null,
    "processing_time": 0.0,
    "text_length": 0,
    "timestamp": "ISO-8601-timestamp"
}
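
Both response shapes share the success flag, so a client can handle them with a single code path. A minimal sketch built only on the fields shown above (the helper name is hypothetical):

```python
def summarize_result(result: dict) -> str:
    """One-line summary of a single extraction result, success or error."""
    name = result.get("file_info", {}).get("name", "unknown")
    if result.get("success"):
        return (f"{name}: {result['text_length']} chars "
                f"via {result['processor_used']} "
                f"in {result['processing_time']:.2f}s")
    return f"{name}: FAILED - {result.get('error', 'unknown error')}"
```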

Error Codes

  • 400 Bad Request: Invalid file type or missing file
  • 503 Service Unavailable: Text extractor not initialized
  • 500 Internal Server Error: Processing failed

Configuration

Environment Variables

  • OPENAI_API_KEY: OpenAI API key for document processing
  • GOOGLE_API_KEY: Google API key for Gemini (multimedia processing)

Limits

  • Maximum 10 files per batch request
  • Temporary files are automatically cleaned up
  • Processing timeout varies by file type and size

Troubleshooting

Python Version Issues

If you encounter ModuleNotFoundError for web scraping packages:

  1. Check Python version and installation:

    python3 --version
    which python3
    python3 -c "import sys; print(sys.executable)"
  2. Verify package installation:

    python3 -c "import requests, bs4; print('Web scraping packages found')"
  3. Use explicit Python path if needed:

    # Find your Python installation
    ls /opt/homebrew/opt/python@*/bin/python*
    
    # Use explicit path
    /opt/homebrew/opt/python@3.13/bin/python3.13 api_server.py

Common Issues

  1. PyDub warnings: Warning: PyDub not available - No module named 'pyaudioop'

    • This is a warning and doesn't affect functionality
    • Audio processing still works through other methods
  2. FastAPI deprecation warnings:

    • These are warnings about on_event being deprecated
    • Functionality works normally, warnings can be ignored
  3. Web page extraction issues:

    • Some websites may block automated requests
    • Try different User-Agent strings if needed
    • Check if the website requires authentication
  4. SSL/OpenSSL warnings:

    • These are system-level warnings and don't affect API functionality
    • Consider updating system SSL libraries if needed

Performance Comparison

Method                    Speed            Requirements         Quality
/extract-url (Web Pages)  Fast (5-10s)     Web access           High (cleaned content)
/extract-youtube          Slow (30-60s)    Audio download + AI  High (AI transcription)
/extract (File Upload)    Medium (10-30s)  File upload          High (AI processing)

Web page extraction is fast and efficient for articles, Wikipedia pages, and other text-heavy websites.

Development

For development and testing, see the example files:

  • api_client_examples.py - General API usage examples
  • youtube_client_examples.py - YouTube-specific examples
  • url_client_examples.py - URL processing examples