
Text Extraction API

A FastAPI-based REST API for extracting text from various file types using AI services (OpenAI + Gemini).

Features

  • Multi-format Support: Extract text from documents (PDF, DOC, TXT, XLS) and multimedia files (MP4, MP3, AVI, etc.)

  • AI-Powered: Uses OpenAI for documents and Google Gemini for multimedia

  • Audio-Only Processing: Video files are processed for audio transcription only (no video frame analysis)

  • YouTube Support: Direct transcription from YouTube videos via audio extraction

  • Web Page Support: Extract clean text from Wikipedia, blogs, articles, and other web content

  • RESTful API: Standard HTTP endpoints with JSON responses

  • File Upload: Support for single and batch file uploads

  • Async Processing: Built with FastAPI for high performance

  • Automatic Cleanup: Temporary files are cleaned up automatically

  • Error Handling: Comprehensive error handling with detailed messages

Supported File Types

Documents (OpenAI Processor)

  • .pdf - PDF documents
  • .txt - Plain text files
  • .doc, .docx - Microsoft Word documents
  • .xls, .xlsx - Microsoft Excel spreadsheets

Multimedia (Gemini Processor) - Audio Only

  • .mp4, .avi, .mov, .mkv - Video files (audio track transcription only)
  • .mp3, .wav, .m4a, .webm, .ogg - Audio files (full transcription)

YouTube Videos (YouTube Processor) - Audio Only

  • YouTube video URLs (youtube.com, youtu.be, m.youtube.com)
  • Downloads audio and generates transcripts using AI (no video analysis)

Web Pages (Web Processor) - HTML Content Extraction

  • Wikipedia pages (wikipedia.org)
  • Blog posts and articles
  • News websites and documentation
  • Any HTML-based web content
  • Automatic content cleaning and formatting
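The routing above can be sketched as a simple dispatch on URL host and file extension. This is an illustrative sketch of the mapping, not the API's actual implementation; the function name and constants are made up, and the real server's routing logic may differ.

```python
from urllib.parse import urlparse

# Hypothetical dispatch mirroring the supported-types tables above.
DOCUMENT_EXTS = {".pdf", ".txt", ".doc", ".docx", ".xls", ".xlsx"}
MULTIMEDIA_EXTS = {".mp4", ".avi", ".mov", ".mkv",
                   ".mp3", ".wav", ".m4a", ".webm", ".ogg"}
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be", "m.youtube.com"}

def pick_processor(url: str) -> str:
    """Return the processor name that would handle this URL."""
    parsed = urlparse(url)
    if parsed.netloc.lower() in YOUTUBE_HOSTS:
        return "YouTubeProcessor"
    path = parsed.path.lower()
    last_segment = path.rsplit("/", 1)[-1]
    ext = path[path.rfind("."):] if "." in last_segment else ""
    if ext in DOCUMENT_EXTS:
        return "OpenAIProcessor"
    if ext in MULTIMEDIA_EXTS:
        return "GeminiProcessor"
    return "WebProcessor"  # fall back to HTML content extraction
```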

Installation

  1. Clone the repository and navigate to the project directory

  2. Create a virtual environment:

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Install system dependencies for YouTube processing:

    # macOS
    brew install ffmpeg
    
    # Ubuntu/Debian
    sudo apt update && sudo apt install ffmpeg
    
    # Windows (using chocolatey)
    choco install ffmpeg
  5. Set up environment variables in .env file:

    OPENAI_API_KEY=your_openai_api_key_here
    GOOGLE_API_KEY=your_gemini_api_key_here
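
If the server reads these keys from the environment, a quick preflight check can catch a missing key before startup. A minimal sketch (the variable names match the .env entries above; how api_server.py actually loads them, e.g. via python-dotenv, is not shown here and the helper name is made up):

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")

def missing_keys(env=os.environ):
    """Return the required API keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```

Run this before starting the server; a non-empty result means your .env file is incomplete.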
    

Starting the Server

Standard Method

python api_server.py

For Python Version Issues

If you encounter module import errors, use the explicit Python path:

# Check your Python installation path
which python3
python3 --version

# Use explicit path if needed (example for Homebrew Python)
/opt/homebrew/opt/python@3.13/bin/python3.13 api_server.py

The server will start on http://localhost:8000

API Endpoints

Health Check

GET /health

Check API health and configuration status.

curl http://localhost:8000/health

Response:

{
    "status": "healthy",
    "message": "Text Extraction API is running",
    "supported_extensions": {...},
    "api_keys_configured": {...}
}

Get Supported Types

GET /supported-types

Get list of supported file types and processors.

curl http://localhost:8000/supported-types

Extract Text from URL

POST /extract-url

Extract text from a single file URL or web page.

For files (PDFs, documents, etc.):

curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf"}'

For web pages (Wikipedia, articles, etc.):

curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/History_of_Portugal"}'

For audio files (WebM, OGG, MP3, etc.):

curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.webm"}'
curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.ogg"}'

With custom filename:

curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/file", "filename": "document.pdf"}'

Extract Text from Multiple URLs

POST /extract-batch-url

Extract text from multiple file URLs or web pages (max 10 URLs).

curl -X POST http://localhost:8000/extract-batch-url \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/doc1.pdf", "https://en.wikipedia.org/wiki/Portugal", "https://example.com/doc2.txt"]}'

With custom filenames:

curl -X POST http://localhost:8000/extract-batch-url \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/file1", "https://example.com/file2"], "filenames": ["doc1.pdf", "doc2.txt"]}'
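
The batch request body can be built and validated client-side before sending. An illustrative helper (the 10-URL limit matches the endpoint's documented maximum; the function name is hypothetical):

```python
def build_batch_payload(urls, filenames=None):
    """Build the JSON body for POST /extract-batch-url, enforcing its limits."""
    if not urls:
        raise ValueError("at least one URL is required")
    if len(urls) > 10:
        raise ValueError("max 10 URLs per batch request")
    payload = {"urls": list(urls)}
    if filenames is not None:
        if len(filenames) != len(urls):
            raise ValueError("filenames must match urls one-to-one")
        payload["filenames"] = list(filenames)
    return payload
```

Pass the result straight to requests, e.g. `requests.post("http://localhost:8000/extract-batch-url", json=build_batch_payload(urls))`.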

Extract Text from YouTube Video

POST /extract-youtube

Extract transcript from a single YouTube video using audio processing.

curl -X POST http://localhost:8000/extract-youtube \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'

With custom title:

curl -X POST http://localhost:8000/extract-youtube \
  -H "Content-Type: application/json" \
  -d '{"url": "https://youtu.be/dQw4w9WgXcQ", "title": "Custom Video Title"}'

Extract Text from Multiple YouTube Videos

POST /extract-youtube-batch

Extract transcripts from multiple YouTube videos (max 10 videos).

curl -X POST http://localhost:8000/extract-youtube-batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ", "https://youtu.be/9bZkp7q19f0"]}'

With custom titles:

curl -X POST http://localhost:8000/extract-youtube-batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ", "https://youtu.be/9bZkp7q19f0"], "titles": ["Video 1", "Video 2"]}'
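
Since only youtube.com, youtu.be, and m.youtube.com URLs are supported, a client can pre-filter a batch before submitting it. A sketch (the accepted-host list is taken from the Supported File Types section; the helper itself is illustrative):

```python
from urllib.parse import urlparse

SUPPORTED_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be", "m.youtube.com"}

def validate_youtube_batch(urls):
    """Split URLs into (accepted, rejected) and enforce the 10-video limit."""
    if len(urls) > 10:
        raise ValueError("max 10 videos per batch request")
    accepted, rejected = [], []
    for url in urls:
        host = urlparse(url).netloc.lower()
        (accepted if host in SUPPORTED_HOSTS else rejected).append(url)
    return accepted, rejected
```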

Extract Text from Uploaded File

POST /extract

Upload and extract text from a single file.

curl -X POST -F "file=@document.pdf" http://localhost:8000/extract

Response:

{
    "file_id": "uuid-here",
    "file_info": {
        "name": "document.pdf",
        "extension": ".pdf",
        "size_mb": 1.5,
        "mime_type": "application/pdf"
    },
    "success": true,
    "extracted_text": "Extracted text content...",
    "processor_used": "OpenAIProcessor",
    "processing_time": 2.45,
    "text_length": 1234,
    "timestamp": "2025-06-25T14:18:55.955439"
}

Extract Text from Multiple Uploaded Files

POST /extract-batch

Upload and extract text from multiple files (max 10 files).

curl -X POST \
  -F "files=@document1.pdf" \
  -F "files=@document2.txt" \
  -F "files=@video.mp4" \
  http://localhost:8000/extract-batch

Response:

{
    "batch_id": "uuid-here",
    "total_files": 3,
    "successful": 3,
    "failed": 0,
    "total_processing_time": 15.2,
    "total_characters": 5678,
    "results": [...],
    "timestamp": "2025-06-25T14:18:55.955439"
}
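
With requests, the multipart field name is repeated (`files=...`) once per file, as in the curl example above. An illustrative helper that builds the field list and guesses MIME types (the function name is made up; the 10-file cap matches the documented limit):

```python
import mimetypes
from pathlib import Path

def build_upload_fields(paths):
    """Build the multipart 'files' entries for POST /extract-batch."""
    if len(paths) > 10:
        raise ValueError("max 10 files per batch request")
    fields = []
    for p in map(Path, paths):
        mime = mimetypes.guess_type(p.name)[0] or "application/octet-stream"
        # Each tuple becomes one repeated 'files' form field.
        fields.append(("files", (p.name, p.read_bytes(), mime)))
    return fields
```

Usage: `requests.post("http://localhost:8000/extract-batch", files=build_upload_fields(paths))`.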

Interactive Documentation

FastAPI serves interactive API documentation automatically at its default paths:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Python Client Example

import requests

# Health check
response = requests.get("http://localhost:8000/health")
print(response.json())

# Upload a file
with open("document.pdf", "rb") as f:
    files = {"file": ("document.pdf", f, "application/pdf")}
    response = requests.post("http://localhost:8000/extract", files=files)
    result = response.json()
    print(f"Extracted {result['text_length']} characters")
    print(result['extracted_text'][:200] + "...")

# Extract from web page
web_data = {
    "url": "https://en.wikipedia.org/wiki/History_of_Portugal"
}
response = requests.post("http://localhost:8000/extract-url", json=web_data)
result = response.json()
if result['success']:
    print(f"Web content extracted: {result['text_length']} characters")
    print(result['extracted_text'][:300] + "...")

For more examples, see api_client_examples.py and youtube_client_examples.py.

Response Format

Success Response

{
    "file_id": "unique-identifier",
    "file_info": {
        "name": "filename.ext",
        "extension": ".ext",
        "size_mb": 1.23,
        "mime_type": "mime/type"
    },
    "success": true,
    "extracted_text": "Content here...",
    "processor_used": "OpenAIProcessor",
    "processing_time": 1.23,
    "text_length": 456,
    "timestamp": "ISO-8601-timestamp"
}

Error Response

{
    "file_id": "unique-identifier",
    "file_info": {...},
    "success": false,
    "error": "Error description",
    "processor_used": null,
    "processing_time": 0.0,
    "text_length": 0,
    "timestamp": "ISO-8601-timestamp"
}
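
Both response shapes share the success flag, so a client can handle them with a single code path. A minimal sketch built only on the fields shown above (the helper name is hypothetical):

```python
def summarize_result(result: dict) -> str:
    """One-line summary of a single extraction result, success or error."""
    name = result.get("file_info", {}).get("name", "unknown")
    if result.get("success"):
        return (f"{name}: {result['text_length']} chars "
                f"via {result['processor_used']} "
                f"in {result['processing_time']:.2f}s")
    return f"{name}: FAILED - {result.get('error', 'unknown error')}"
```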

Error Codes

  • 400 Bad Request: Invalid file type or missing file
  • 503 Service Unavailable: Text extractor not initialized
  • 500 Internal Server Error: Processing failed

Configuration

Environment Variables

  • OPENAI_API_KEY: OpenAI API key for document processing
  • GOOGLE_API_KEY: Google API key for Gemini (multimedia processing)

Limits

  • Maximum 10 files per batch request
  • Temporary files are automatically cleaned up
  • Processing timeout varies by file type and size

Troubleshooting

Python Version Issues

If you encounter ModuleNotFoundError for web scraping packages:

  1. Check Python version and installation:

    python3 --version
    which python3
    python3 -c "import sys; print(sys.executable)"
  2. Verify package installation:

    python3 -c "import requests, bs4; print('Web scraping packages found')"
  3. Use explicit Python path if needed:

    # Find your Python installation
    ls /opt/homebrew/opt/python@*/bin/python*
    
    # Use explicit path
    /opt/homebrew/opt/python@3.13/bin/python3.13 api_server.py

Common Issues

  1. PyDub warnings: Warning: PyDub not available - No module named 'pyaudioop'

    • This is a warning and doesn't affect functionality
    • Audio processing still works through other methods
  2. FastAPI deprecation warnings:

    • These are warnings about on_event being deprecated
    • Functionality works normally, warnings can be ignored
  3. Web page extraction issues:

    • Some websites may block automated requests
    • Try different User-Agent strings if needed
    • Check if the website requires authentication
  4. SSL/OpenSSL warnings:

    • These are system-level warnings and don't affect API functionality
    • Consider updating system SSL libraries if needed

Performance Comparison

Method                    Speed            Requirements         Quality
/extract-url (Web Pages)  Fast (5-10s)     Web access           High (cleaned content)
/extract-youtube          Slow (30-60s)    Audio download + AI  High (AI transcription)
/extract (File Upload)    Medium (10-30s)  File upload          High (AI processing)

Web page extraction is fast and efficient for articles, Wikipedia pages, and other text-heavy websites.

Development

For development and testing, see the example files:

  • api_client_examples.py - General API usage examples
  • youtube_client_examples.py - YouTube-specific examples
  • url_client_examples.py - URL processing examples