A FastAPI-based REST API for extracting text from various file types using AI services (OpenAI and Google Gemini).

## Features

- **Multi-format Support**: Extract text from documents (PDF, DOC, TXT, XLS) and multimedia files (MP4, MP3, AVI, etc.)
- **AI-Powered**: Uses OpenAI for documents and Google Gemini for multimedia
- **Audio-Only Processing**: Video files are processed for audio transcription only (no video frame analysis)
- **YouTube Support**: Direct transcription of YouTube videos via audio extraction
- **Web Page Support**: Extract clean text from Wikipedia, blogs, articles, and other web content
- **RESTful API**: Standard HTTP endpoints with JSON responses
- **File Upload**: Support for single and batch file uploads
- **Async Processing**: Built on FastAPI for high performance
- **Automatic Cleanup**: Temporary files are cleaned up automatically
- **Error Handling**: Comprehensive error handling with detailed messages
## Supported File Types

### Documents

- `.pdf` - PDF documents
- `.txt` - Plain text files
- `.doc`, `.docx` - Microsoft Word documents
- `.xls`, `.xlsx` - Microsoft Excel spreadsheets

### Multimedia

- `.mp4`, `.avi`, `.mov`, `.mkv` - Video files (audio track transcription only)
- `.mp3`, `.wav`, `.m4a`, `.webm`, `.ogg` - Audio files (full transcription)

### YouTube

- YouTube video URLs (`youtube.com`, `youtu.be`, `m.youtube.com`)
- Downloads audio and generates transcripts using AI (no video analysis)

### Web Pages

- Wikipedia pages (`wikipedia.org`)
- Blog posts and articles
- News websites and documentation
- Any HTML-based web content
- Automatic content cleaning and formatting
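The document-vs-multimedia split above can be sketched as a simple extension-to-backend routing table. This is an illustrative sketch only; the function and backend names here are not the project's actual internals:

```python
# Illustrative routing table mirroring the split described above
# (OpenAI for documents, Gemini for multimedia). The names are
# assumptions for this sketch, not the project's real class names.
DOCUMENT_EXTENSIONS = {".pdf", ".txt", ".doc", ".docx", ".xls", ".xlsx"}
MULTIMEDIA_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv",
                         ".mp3", ".wav", ".m4a", ".webm", ".ogg"}

def pick_backend(filename: str) -> str:
    """Return which AI service would process this file, by extension."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in DOCUMENT_EXTENSIONS:
        return "openai"
    if ext in MULTIMEDIA_EXTENSIONS:
        return "gemini"
    raise ValueError(f"Unsupported file type: {ext or filename}")

print(pick_backend("report.PDF"))  # documents go to OpenAI
print(pick_backend("talk.mp4"))    # video audio tracks go to Gemini
```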
## Installation

1. Clone the repository and navigate to the project directory.

2. Create a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install system dependencies for YouTube processing:

   ```bash
   # macOS
   brew install ffmpeg

   # Ubuntu/Debian
   sudo apt update && sudo apt install ffmpeg

   # Windows (using Chocolatey)
   choco install ffmpeg
   ```

5. Set up environment variables in a `.env` file:

   ```
   OPENAI_API_KEY=your_openai_api_key_here
   GOOGLE_API_KEY=your_gemini_api_key_here
   ```
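The server presumably loads these keys from the environment (projects like this typically use python-dotenv). As an illustration of the equivalent behavior, here is a minimal hand-rolled `.env` parser; it is a sketch, not the project's actual loading code:

```python
import os

def load_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines from .env-style text, skipping blanks and
    comments. A minimal stand-in for what python-dotenv does."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

env = load_env_file("OPENAI_API_KEY=sk-test\n# comment\nGOOGLE_API_KEY=g-test\n")
for key, value in env.items():
    os.environ.setdefault(key, value)  # don't clobber real environment values
```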
## Running the Server

```bash
python api_server.py
```

If you encounter module import errors, use the explicit Python path:

```bash
# Check your Python installation path
which python3
python3 --version

# Use the explicit path if needed (example for Homebrew Python)
/opt/homebrew/opt/python@3.13/bin/python3.13 api_server.py
```

The server will start on http://localhost:8000.
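After startup, a script can poll the `/health` endpoint (documented below) until the API reports ready. A hedged sketch: `fetch` is any zero-argument callable returning a dict shaped like the `/health` response, so it works with `requests.get(...).json()` or a test stub:

```python
import time

def wait_until_healthy(fetch, attempts: int = 10, delay: float = 0.5) -> bool:
    """Poll a health-check callable until it reports 'healthy'.

    `fetch` might be: lambda: requests.get("http://localhost:8000/health").json()
    Returns True once status == "healthy", False after `attempts` failed tries.
    """
    for _ in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except Exception:
            pass  # server not up yet; keep polling
        time.sleep(delay)
    return False

# Example with a stub instead of a live server:
responses = iter([ConnectionError(), {"status": "healthy"}])
def stub():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item

print(wait_until_healthy(stub, attempts=5, delay=0))  # True
```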
## API Endpoints

### GET /health

Check API health and configuration status.

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "healthy",
  "message": "Text Extraction API is running",
  "supported_extensions": {...},
  "api_keys_configured": {...}
}
```

### GET /supported-types

Get the list of supported file types and processors.

```bash
curl http://localhost:8000/supported-types
```

### POST /extract-url

Extract text from a single file URL or web page.
For files (PDFs, documents, etc.):

```bash
curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf"}'
```

For web pages (Wikipedia, articles, etc.):

```bash
curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/History_of_Portugal"}'
```

For audio files (WebM, OGG, MP3, etc.):

```bash
curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.webm"}'

curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.ogg"}'
```

With a custom filename:

```bash
curl -X POST http://localhost:8000/extract-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/file", "filename": "document.pdf"}'
```

### POST /extract-batch-url

Extract text from multiple file URLs or web pages (max 10 URLs).
```bash
curl -X POST http://localhost:8000/extract-batch-url \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/doc1.pdf", "https://en.wikipedia.org/wiki/Portugal", "https://example.com/doc2.txt"]}'
```

With custom filenames:

```bash
curl -X POST http://localhost:8000/extract-batch-url \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/file1", "https://example.com/file2"], "filenames": ["doc1.pdf", "doc2.txt"]}'
```

### POST /extract-youtube

Extract the transcript of a single YouTube video using audio processing.
```bash
curl -X POST http://localhost:8000/extract-youtube \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'
```

With a custom title:

```bash
curl -X POST http://localhost:8000/extract-youtube \
  -H "Content-Type: application/json" \
  -d '{"url": "https://youtu.be/dQw4w9WgXcQ", "title": "Custom Video Title"}'
```

### POST /extract-youtube-batch

Extract transcripts from multiple YouTube videos (max 10 videos).
```bash
curl -X POST http://localhost:8000/extract-youtube-batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ", "https://youtu.be/9bZkp7q19f0"]}'
```

With custom titles:

```bash
curl -X POST http://localhost:8000/extract-youtube-batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ", "https://youtu.be/9bZkp7q19f0"], "titles": ["Video 1", "Video 2"]}'
```

### POST /extract

Upload and extract text from a single file.
```bash
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
```

Response:

```json
{
  "file_id": "uuid-here",
  "file_info": {
    "name": "document.pdf",
    "extension": ".pdf",
    "size_mb": 1.5,
    "mime_type": "application/pdf"
  },
  "success": true,
  "extracted_text": "Extracted text content...",
  "processor_used": "OpenAIProcessor",
  "processing_time": 2.45,
  "text_length": 1234,
  "timestamp": "2025-06-25T14:18:55.955439"
}
```

### POST /extract-batch

Upload and extract text from multiple files (max 10 files).
```bash
curl -X POST \
  -F "files=@document1.pdf" \
  -F "files=@document2.txt" \
  -F "files=@video.mp4" \
  http://localhost:8000/extract-batch
```

Response:

```json
{
  "batch_id": "uuid-here",
  "total_files": 3,
  "successful": 3,
  "failed": 0,
  "total_processing_time": 15.2,
  "total_characters": 5678,
  "results": [...],
  "timestamp": "2025-06-25T14:18:55.955439"
}
```

## Interactive Documentation

FastAPI provides automatic interactive documentation:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
## Python Client Example

```python
import requests

# Health check
response = requests.get("http://localhost:8000/health")
print(response.json())

# Upload a file
with open("document.pdf", "rb") as f:
    files = {"file": ("document.pdf", f, "application/pdf")}
    response = requests.post("http://localhost:8000/extract", files=files)

result = response.json()
print(f"Extracted {result['text_length']} characters")
print(result['extracted_text'][:200] + "...")

# Extract from a web page
web_data = {"url": "https://en.wikipedia.org/wiki/History_of_Portugal"}
response = requests.post("http://localhost:8000/extract-url", json=web_data)
result = response.json()
if result['success']:
    print(f"Web content extracted: {result['text_length']} characters")
    print(result['extracted_text'][:300] + "...")
```

For more examples, see `api_client_examples.py` and `youtube_client_examples.py`.
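The YouTube endpoints accept `youtube.com`, `youtu.be`, and `m.youtube.com` URLs. The server performs its own validation, but a client can pre-check URLs cheaply before making a request; a sketch:

```python
from urllib.parse import urlparse

# Hosts the YouTube endpoints accept, per the supported-types list above
# (the www. variant is included to match the example URLs).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}

def is_youtube_url(url: str) -> bool:
    """Cheap client-side check before calling /extract-youtube."""
    host = (urlparse(url).hostname or "").lower()
    return host in YOUTUBE_HOSTS

print(is_youtube_url("https://youtu.be/dQw4w9WgXcQ"))     # True
print(is_youtube_url("https://example.com/watch?v=abc"))  # False
```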
## Response Format

### Success Response

```json
{
  "file_id": "unique-identifier",
  "file_info": {
    "name": "filename.ext",
    "extension": ".ext",
    "size_mb": 1.23,
    "mime_type": "mime/type"
  },
  "success": true,
  "extracted_text": "Content here...",
  "processor_used": "OpenAIProcessor",
  "processing_time": 1.23,
  "text_length": 456,
  "timestamp": "ISO-8601-timestamp"
}
```

### Error Response

```json
{
  "file_id": "unique-identifier",
  "file_info": {...},
  "success": false,
  "error": "Error description",
  "processor_used": null,
  "processing_time": 0.0,
  "text_length": 0,
  "timestamp": "ISO-8601-timestamp"
}
```

### HTTP Error Codes

- 400 Bad Request: Invalid file type or missing file
- 503 Service Unavailable: Text extractor not initialized
- 500 Internal Server Error: Processing failed
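Both response shapes share the same top-level fields, so a client can branch on `success`. A small helper sketch based on the documented shapes:

```python
def summarize_result(result: dict) -> str:
    """Turn a success or error response (shapes documented above)
    into a one-line summary string."""
    name = result.get("file_info", {}).get("name", "<unknown>")
    if result.get("success"):
        return (f"{name}: {result['text_length']} chars "
                f"via {result['processor_used']} in {result['processing_time']}s")
    return f"{name}: FAILED - {result.get('error', 'unknown error')}"

ok = {"file_info": {"name": "a.pdf"}, "success": True,
      "text_length": 456, "processor_used": "OpenAIProcessor",
      "processing_time": 1.23}
print(summarize_result(ok))  # a.pdf: 456 chars via OpenAIProcessor in 1.23s
```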
## Configuration

Environment variables:

- `OPENAI_API_KEY`: OpenAI API key for document processing
- `GOOGLE_API_KEY`: Google API key for Gemini (multimedia processing)
### Limits

- Maximum of 10 files per batch request
- Temporary files are cleaned up automatically
- Processing timeout varies by file type and size
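Given the 10-item batch limit, a client with more inputs can split them into batch-sized payloads before calling the batch endpoints. A sketch (the chunk size mirrors the documented limit):

```python
MAX_BATCH = 10  # documented per-request limit for the batch endpoints

def chunk_urls(urls: list) -> list:
    """Split a URL list into /extract-batch-url-sized payloads."""
    return [urls[i:i + MAX_BATCH] for i in range(0, len(urls), MAX_BATCH)]

batches = chunk_urls([f"https://example.com/doc{i}.pdf" for i in range(23)])
print([len(b) for b in batches])  # [10, 10, 3]
```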
## Troubleshooting

### Module Import Errors

If you encounter `ModuleNotFoundError` for web scraping packages:

1. Check the Python version and installation:

   ```bash
   python3 --version
   which python3
   python3 -c "import sys; print(sys.executable)"
   ```

2. Verify package installation:

   ```bash
   python3 -c "import requests, bs4; print('Web scraping packages found')"
   ```

3. Use an explicit Python path if needed:

   ```bash
   # Find your Python installation
   ls /opt/homebrew/opt/python@*/bin/python*

   # Use the explicit path
   /opt/homebrew/opt/python@3.13/bin/python3.13 api_server.py
   ```
### Common Warnings and Issues

1. PyDub warnings:

   ```
   Warning: PyDub not available - No module named 'pyaudioop'
   ```

   - This is only a warning and does not affect functionality
   - Audio processing still works through other methods

2. FastAPI deprecation warnings:

   - These are warnings about `on_event` being deprecated
   - Functionality works normally; the warnings can be ignored

3. Web page extraction issues:

   - Some websites may block automated requests
   - Try different User-Agent strings if needed
   - Check whether the website requires authentication

4. SSL/OpenSSL warnings:

   - These are system-level warnings and do not affect API functionality
   - Consider updating system SSL libraries if needed
## Performance Comparison

| Method | Speed | Requirements | Quality |
|---|---|---|---|
| `/extract-url` (web pages) | Fast (5-10s) | Web access | High (cleaned content) |
| `/extract-youtube` | Slow (30-60s) | Audio download + AI | High (AI transcription) |
| `/extract` (file upload) | Medium (10-30s) | File upload | High (AI processing) |

Web page extraction is fast and efficient for articles, Wikipedia pages, and other text-heavy websites.
## Examples

For development and testing, see the example files:

- `api_client_examples.py` - General API usage examples
- `youtube_client_examples.py` - YouTube-specific examples
- `url_client_examples.py` - URL processing examples