Photograph any document in a foreign language and get the English translation overlaid directly on the image, typically in under 90 seconds.
Upload an image containing text in any language. The app:
- Extracts every word from the image using AI vision
- Detects the source language automatically
- Translates everything to English
- Renders the English translation directly over the original image — preserving layout, tables, and structure
Example: A Lao government health report with 18 provinces, statistics, and tables → fully translated overlay image + clean extracted text, all in under 90 seconds.
Traditional OCR tools extract text but lose all layout context. Translation tools accept text but don't know where it came from. This project combines both into a single pipeline — the output is an image you can actually read, not just a block of raw translated text.
Use cases:
- Translating foreign government reports, medical documents, or infographics
- Extracting structured table data from images in any language
- Any workflow where someone receives image-based documents in a language they don't read
Open http://localhost:8000, upload any JPEG, PNG, or WebP image, and click Translate Image. Results appear in under 90 seconds.
The UI shows:
- Original and Translated Overlay images side by side
- Source text (extracted, in original language) and English translation as structured markdown
- 4-step pipeline tracker showing real-time progress
- Render complete toast showing exact time taken
             Image Upload
                  │
                  ▼
┌─────────────────────────────────────┐
│  Step 1: OCR Scan                   │
│  Gemini Vision reads every text     │
│  element and its position           │
└─────────────────┬───────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  Step 2: Language Detection         │
│  Unicode block analysis identifies  │
│  script → langdetect for Latin      │
└─────────────────┬───────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  Step 3: Translation                │
│  Pass 1: headers, stat boxes,       │
│  titles, footers + bounding boxes   │
│  Pass 2: all table rows (compact    │
│  format, prevents truncation)       │
└─────────────────┬───────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  Step 4: Render Overlay             │
│  Each text region gets a            │
│  semi-transparent fill box +        │
│  English text sized to match the    │
│  original visual weight             │
└─────────────────────────────────────┘
Why two Gemini passes? Gemini has a response length limit. Asking for all bounding boxes in one call causes truncation on dense documents (tables with 18+ rows). Pass 1 handles metadata, Pass 2 handles table rows using a compact array format — preventing truncation while keeping a single API key.
| Layer | Technology | Why |
|---|---|---|
| Backend | Python 3.11+, FastAPI, Uvicorn | Async, production-ready, auto docs at /docs |
| AI: OCR + Translation | Google Gemini Vision API (gemini-flash-latest) | One model handles OCR, translation, bounding boxes, and table extraction |
| Image processing | OpenCV, Pillow | Overlay rendering, colour sampling, font compositing |
| Language detection | Unicode block analysis + langdetect | Fast, no API needed — detects 50+ scripts from character ranges |
| Frontend | Vanilla JS, HTML, CSS | Zero dependencies, instant load, dark/light theme |
| Containerisation | Docker | One-command deployment |
Primary: gemini-flash-latest
Automatic fallback chain (handles quota exhaustion gracefully):
gemini-flash-latest → gemini-2.0-flash → gemini-2.5-flash → gemini-2.0-flash-lite
Each model is tried in order. If one returns a 429 (rate limit), the code reads the retryDelay from Google's response and waits exactly that long before trying the next model — no wasted retries.
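The fallback loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `RateLimited`, `call_with_fallback`, and the injected `call_model` callable are hypothetical names, and the sketch assumes the 429 response has already been parsed into a `retry_delay` in seconds.

```python
import time
from typing import Callable, Optional, Sequence

# Model order taken from the fallback chain above.
FALLBACK_CHAIN = (
    "gemini-flash-latest",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-2.0-flash-lite",
)


class RateLimited(Exception):
    """Hypothetical error for an HTTP 429, carrying the server's retry delay."""

    def __init__(self, retry_delay: float):
        super().__init__(f"rate limited, retry after {retry_delay}s")
        self.retry_delay = retry_delay


def call_with_fallback(
    call_model: Callable[[str], str],
    models: Sequence[str] = FALLBACK_CHAIN,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order; on a 429, wait retry_delay, then try the next."""
    last_error: Optional[RateLimited] = None
    for index, model in enumerate(models):
        try:
            return call_model(model)
        except RateLimited as err:
            last_error = err
            # Only wait when there is another model left to try.
            if index < len(models) - 1:
                sleep(err.retry_delay)
    raise RuntimeError("every model in the fallback chain was rate limited") from last_error
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in the app it would default to `time.sleep`.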
Free tier limits: 1,500 requests/day · 15 requests/minute · $0 cost
Any language Gemini Vision can read. Tested and confirmed working:
| Script | Languages |
|---|---|
| Lao | Lao |
| Thai | Thai |
| Arabic | Arabic, Urdu, Persian |
| Devanagari | Hindi, Sanskrit, Nepali |
| CJK | Chinese (Simplified + Traditional), Japanese, Korean |
| Cyrillic | Russian, Ukrainian, Bulgarian |
| Latin | English, French, German, Spanish, Portuguese, Vietnamese, Indonesian, and more |
| Other | Bengali, Tamil, Telugu, Kannada, Malayalam, Khmer, Myanmar, Hebrew |
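Script detection from Unicode character ranges, as used in Step 2 of the pipeline, could look something like the sketch below. The range table is a small illustrative subset, `detect_script_language` is a hypothetical name, and mapping a shared script to one default language (for example Cyrillic to "ru") is a simplifying assumption. A `None` result stands for "no known non-Latin script found", which is where langdetect would take over.

```python
from collections import Counter
from typing import Optional

# (first code point, last code point, ISO 639-1 code) for a few Unicode blocks.
SCRIPT_RANGES = [
    (0x0E80, 0x0EFF, "lo"),  # Lao
    (0x0E00, 0x0E7F, "th"),  # Thai
    (0x0600, 0x06FF, "ar"),  # Arabic
    (0x0900, 0x097F, "hi"),  # Devanagari
    (0x0400, 0x04FF, "ru"),  # Cyrillic
    (0x3040, 0x30FF, "ja"),  # Hiragana + Katakana
    (0xAC00, 0xD7AF, "ko"),  # Hangul syllables
    (0x4E00, 0x9FFF, "zh"),  # CJK unified ideographs
]


def detect_script_language(text: str) -> Optional[str]:
    """Return the code of the most frequent known script, or None if none match."""
    counts: Counter = Counter()
    for ch in text:
        code_point = ord(ch)
        for start, end, code in SCRIPT_RANGES:
            if start <= code_point <= end:
                counts[code] += 1
                break
    if not counts:
        return None  # probably Latin script: hand off to langdetect
    return counts.most_common(1)[0][0]
```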
text-ocr-translator/
├── backend/
│ ├── main.py # FastAPI app — routes, rate limiting, file cleanup
│ ├── ocr_engine.py # Pipeline orchestrator — connects all stages
│ ├── assets/fonts/ # Noto font family for multilingual overlay rendering
│ └── services/
│ ├── ocr_service.py # Two-pass Gemini Vision strategy
│ ├── overlay_service.py # OpenCV + Pillow overlay renderer
│ └── language_service.py # Unicode block analysis + langdetect
├── static/
│ └── script.js # Frontend — Promise.all fetch, toast, progress bar
├── templates/
│ └── index.html # Single-page UI with dark/light theme
├── uploads/ # Runtime — auto-cleaned after 1 hour
├── outputs/ # Runtime — auto-cleaned after 1 hour
├── .env.example # API key template
├── Dockerfile
├── docker-compose.yml
└── requirements.txt
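The uploads/ and outputs/ directories are described as auto-cleaned after 1 hour. A minimal sketch of such a cleanup pass, assuming file age is judged by modification time (the function name `remove_stale_files` is hypothetical, and how the app schedules the cleanup is not shown here):

```python
import time
from pathlib import Path
from typing import Iterable, List, Optional, Union

MAX_AGE_SECONDS = 60 * 60  # 1 hour, as stated for uploads/ and outputs/


def remove_stale_files(
    directories: Iterable[Union[str, Path]],
    max_age: float = MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> List[Path]:
    """Delete regular files older than max_age seconds; return the removed paths."""
    cutoff = (time.time() if now is None else now) - max_age
    removed: List[Path] = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip directories that do not exist yet
        for path in root.iterdir():
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                removed.append(path)
    return removed
```

Calling `remove_stale_files(["uploads", "outputs"])` periodically, for example from a background task, would keep disk usage bounded.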
Get a free Gemini API key at https://aistudio.google.com/app/apikey (no credit card, instant access).
git clone https://github.com/arunkmr13/text-ocr-translator.git
cd text-ocr-translator
cp .env.example .env
# Add your key: GEMINI_API_KEY=your_key_here

# Option 1: run with Docker
docker compose up --build
# App at http://localhost:8000

# Option 2: run locally in a virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
uvicorn backend.main:app --reload
# App at http://localhost:8000

Auto-generated docs available at http://localhost:8000/docs
POST /upload accepts an image and returns the translated overlay image plus the extracted and translated text.
Request
Content-Type: multipart/form-data
Body: file (JPEG / PNG / WebP, max 5 MB)
curl example
curl -X POST http://localhost:8000/upload \
  -F "file=@document.jpg"

Python example
import requests

with open("document.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/upload",
        files={"file": f},
    )

data = response.json()
print(data["translated_text"])    # English translation
print(data["detected_language"])  # ISO 639-1 code e.g. "lo", "th", "ar"
print(data["region_count"])       # Number of text regions rendered
print(data["translated_image"])   # Path to overlay image

Response
{
  "original_image": "uploads/uuid.jpg",
  "translated_image": "outputs/uuid.jpg",
  "extracted_text": "# ສະພາບໄຂ້ຍຸງລາຍ\n...",
  "translated_text": "# Dengue Fever Situation\n...",
  "detected_language": "lo",
  "region_count": 93,
  "warning": null
}

Error codes
| Code | Meaning |
|---|---|
| 413 | File exceeds the 5 MB size limit |
| 415 | Unsupported file type — only JPEG, PNG, WebP accepted |
| 422 | OCR failed — Gemini API error or no text found in image |
| 429 | Rate limit exceeded — 20 requests per 60 seconds per IP |
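The 429 row corresponds to a per-IP limit of 20 requests per 60 seconds. One way to implement such a limit in memory is a sliding window of timestamps per client, sketched below. This is an assumption about the approach, not the project's middleware; `SlidingWindowLimiter` is a hypothetical name.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(
        self,
        max_requests: int = 20,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # caller should respond with HTTP 429
        hits.append(now)
        return True
```

Because the state lives in process memory, a limiter like this resets on restart and is not shared between multiple workers.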
Why Gemini Vision instead of Tesseract + Google Translate?
Tesseract requires a language pack installed per language, struggles with non-Latin scripts, and has no table understanding. Gemini Vision reads any script out of the box, understands document structure, and returns both the translation and spatial coordinates from the same API, with no separate OCR or translation service.
Why vanilla JS instead of React?
The UI state is simple — one upload, one result. A framework would add build complexity with no benefit. Vanilla JS with Promise.all handles the concurrent fetch + animation correctly with zero dependencies.
Why two Gemini passes instead of one?
Gemini's response token limit causes truncation on dense documents. A single pass requesting all 90+ bounding boxes for an 18-province table gets cut off mid-response. Splitting the work into Pass 1 (metadata) and Pass 2 (table rows in compact array format) keeps each response under the limit, so the full document is covered.
Why height-driven font sizing?
The English overlay text must match the visual weight of the original language text. Since Gemini returns bounding boxes derived from the original glyph heights, using font_size = (bbox_height - padding) × 1.25 produces cap-heights that match the original text scale — regardless of language or document size.
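As a worked example of the sizing rule, the sketch below applies font_size = (bbox_height - padding) × 1.25 to a bounding box height in pixels. The default padding of 4 px, the minimum size of 8, and the function name `overlay_font_size` are assumptions made for illustration; only the formula itself comes from the text above.

```python
def overlay_font_size(
    bbox_height: float,
    padding: float = 4.0,   # assumed padding in pixels
    scale: float = 1.25,    # factor from the formula above
    min_size: int = 8,      # assumed floor so tiny boxes stay legible
) -> int:
    """Derive an overlay font size from the height of the original text box."""
    usable_height = max(bbox_height - padding, 0.0)
    return max(int(round(usable_height * scale)), min_size)


# A 24 px tall box with 4 px padding: (24 - 4) * 1.25 = 25
print(overlay_font_size(24))
```

Because the size depends only on box height, the same rule applies to any source script or image resolution.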
| Concern | How it's handled |
|---|---|
| API key exposure | Stored in .env, git-ignored, never hardcoded |
| File uploads | Magic-byte validation (not just MIME type), 5 MB limit |
| Rate limiting | 20 requests / 60s per IP, in-memory middleware |
| Disk usage | Auto-cleanup of uploads + outputs after 1 hour |
| Quota exhaustion | 4-model fallback chain with retryDelay-aware sleep |
| Large tables | Two-pass Gemini strategy prevents response truncation |
| Invalid JSON | Sanitisation pass handles malformed escapes from gemini-2.5-flash |
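Magic-byte validation means checking the first bytes of the file rather than trusting the declared MIME type. A minimal sketch for the three accepted formats follows. The function names are hypothetical, and treating 5 MB as 5 × 1024 × 1024 bytes is an assumption.

```python
from typing import Optional

MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumed interpretation of the 5 MB limit


def sniff_image_type(data: bytes) -> Optional[str]:
    """Identify JPEG, PNG, or WebP from the file signature; None if unknown."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def validate_upload(data: bytes) -> str:
    """Return the detected MIME type or raise ValueError with the matching status."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("413: file exceeds the 5 MB size limit")
    kind = sniff_image_type(data)
    if kind is None:
        raise ValueError("415: unsupported file type")
    return kind
```

In a FastAPI route, the `ValueError` cases would map to `HTTPException` responses with status 413 and 415.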
- Bounding box accuracy — Gemini estimates normalised coordinates rather than computing pixel-precise positions. Overlay alignment is approximate (±5% of image dimensions).
- Processing time — Two Gemini API calls per request = ~60-90 seconds total. Bottleneck is network latency to Google's API, not local computation.
- Free tier quota — 1,500 requests/day. Each translation uses 2 calls, so effectively 750 translations/day.
- Overlay on complex backgrounds — Semi-transparent fill boxes work well on solid/gradient backgrounds. Highly textured backgrounds may reduce overlay legibility.
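To make the bounding-box limitation concrete, the sketch below converts a normalised box to pixel coordinates. It assumes boxes arrive as [ymin, xmin, ymax, xmax] on a 0 to 1000 scale, which is the convention Gemini documents for object detection; the project's own prompt may use a different order or scale, so treat `to_pixel_box` and its defaults as hypothetical.

```python
from typing import Sequence, Tuple


def to_pixel_box(
    norm_box: Sequence[float],
    image_width: int,
    image_height: int,
    scale: float = 1000.0,  # assumed normalisation range
) -> Tuple[int, int, int, int]:
    """Convert [ymin, xmin, ymax, xmax] to a clamped (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = norm_box

    def clamp(value: float, upper: int) -> int:
        return max(0, min(int(round(value)), upper))

    left = clamp(xmin / scale * image_width, image_width)
    top = clamp(ymin / scale * image_height, image_height)
    right = clamp(xmax / scale * image_width, image_width)
    bottom = clamp(ymax / scale * image_height, image_height)
    return left, top, right, bottom


# On a 1200 x 800 image, a ±5% error is up to 60 px horizontally and 40 px vertically.
print(to_pixel_box([100, 250, 200, 750], image_width=1200, image_height=800))
```

Clamping keeps slightly out-of-range estimates inside the image instead of failing at render time.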
| Variable | Required | Description |
|---|---|---|
| GEMINI_API_KEY | ✅ | Google Gemini API key |
- figcaption — AI-powered image caption generator
- docx-report-engine — Automated Word document report generation
- whiteboard-digitiser — Hand-drawn whiteboard → Mermaid diagram
MIT — free to use, modify, and distribute.
Built by Arun Kumar