See2Say is an AI-powered accessibility platform that helps visually impaired users understand videos through natural speech. It converts video content into meaningful audio narration using modern Computer Vision and Generative AI.
Think of it as a companion that describes visual surroundings through audio.
- User uploads a video
- Video is broken into frames using OpenCV
- Each frame is captioned using BLIP (Vision-Language Model)
- All captions are merged into natural Hinglish narration using Gemini
- Final narration is converted into audio output
- User listens to the scene description without needing visual access
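The steps above can be sketched end to end. Every helper name here (`extract_frames`, `caption_frame`, and so on) is a hypothetical stand-in for the real modules; the stubs return dummy values purely to show how each stage hands data to the next.

```python
# Hypothetical end-to-end sketch of the See2Say flow.
# Each helper below is a stub; the real project implements these
# with OpenCV, BLIP, Gemini, and gTTS respectively.

def extract_frames(video_path):
    # Real version: cv2.VideoCapture plus interval sampling.
    return ["frame_0", "frame_1"]

def caption_frame(frame):
    # Real version: BLIP vision-language model.
    return f"caption for {frame}"

def summarize(captions):
    # Real version: Gemini merges captions into Hinglish narration.
    return " | ".join(captions)

def synthesize(text):
    # Real version: gTTS produces MP3 bytes.
    return text.encode("utf-8")

def describe_video(video_path):
    frames = extract_frames(video_path)
    captions = [caption_frame(f) for f in frames]
    narration = summarize(captions)
    audio = synthesize(narration)
    return {"captions": captions, "final_summary": narration, "audio": audio}
```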
- AI-based video understanding
- Vision to Language to Speech pipeline
- Hinglish narration (human-like, not robotic)
- Built for blind and low-vision users
- FastAPI backend (scalable)
- React + Tailwind frontend
- Hugging Face-ready deployment
fastapi
uvicorn
python-dotenv
google-genai
opencv-python
Pillow
transformers
torch
gtts
numpy
Technology Choices
- transformers + torch for BLIP image captioning
- google-genai for Gemini summarization
- opencv-python for frame extraction
- gtts for text-to-speech conversion
- fastapi for a clean, asynchronous API backend
{
"react": "^19.x",
"vite": "^7.x",
"tailwindcss": "^4.x"
}

- Minimal UI focused on accessibility
- Lightweight and fast performance
see2say/
├── Backend/
│ ├── main.py # FastAPI entry point
│ ├── app/
│ │ ├── routes.py # API routes
│ │ ├── models.py # BLIP captioning logic
│ │ ├── gemini_client.py # Gemini summarization
│ │ ├── utils.py # TTS and helper functions
│ └── uploads/ # Temporary video storage
│
├── frontend/
│ ├── src/
│ ├── package.json
│
├── requirements.txt
├── Dockerfile
├── .gitignore
└── README.md
Create a .env file (do not commit this file):
GEMINI_API_KEY=your_api_key_here
On Hugging Face Spaces, set this under Settings -> Repository Secrets.
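Locally, the key can be read like this (python-dotenv's `load_dotenv()` populates the process environment from the .env file; the helper name below is illustrative):

```python
import os

def get_gemini_key():
    """Fetch GEMINI_API_KEY from the environment.

    Locally, call dotenv.load_dotenv() once at startup so the .env file
    populates os.environ; on Hugging Face Spaces the Repository Secret
    is injected directly. The key is then passed to the google-genai
    client, e.g. genai.Client(api_key=get_gemini_key()).
    """
    key = os.getenv("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; see the .env setup above")
    return key
```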
Video Input
↓
Frame Extraction (OpenCV)
↓
Image Captioning (BLIP)
↓
Narration Generation (Gemini)
↓
Speech Synthesis (gTTS)
↓
Audio Output
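The final stage produces MP3 bytes (gTTS can write them into a BytesIO buffer via `write_to_fp`), and the API returns them as the base64 `audio` string shown in the Output example below. A minimal sketch of that encoding step, using dummy bytes in place of a real gTTS call:

```python
import base64

def audio_to_base64(audio_bytes: bytes) -> str:
    """Encode raw MP3 bytes as an ASCII string safe to embed in JSON."""
    return base64.b64encode(audio_bytes).decode("ascii")

def base64_to_audio(audio_b64: str) -> bytes:
    """Client-side inverse: recover playable MP3 bytes."""
    return base64.b64decode(audio_b64)
```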
Input
- Video file upload
Output
{
"captions": ["man walking on road", "car passing by"],
"final_summary": "Ek aadmi road ke side chal raha hai...",
"audio": "base64_audio_string"
}

pip install -r requirements.txt
uvicorn Backend.main:app --reload

cd frontend
npm install
npm run dev

- Backend deployed using Hugging Face Spaces (Docker)
- Models are not stored in GitHub repository
- BLIP model downloads at runtime from Hugging Face Hub
This approach keeps the repository clean, lightweight, and secure.
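The runtime download typically amounts to a `from_pretrained` call: the first call fetches the weights from the Hugging Face Hub into the local cache, and later calls reuse the cache. The model id below is the standard BLIP captioning checkpoint and is an assumption about which variant the project uses; imports are deferred so the module stays importable without the heavy dependencies.

```python
def load_captioner(model_id="Salesforce/blip-image-captioning-base"):
    """Download (on first call) and load the BLIP captioning model.

    transformers caches weights under ~/.cache/huggingface, so nothing
    needs to be committed to the Git repository. Imports are lazy so
    that merely importing this module does not require torch.
    """
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    return processor, model
```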
- .env file is never committed to version control
- Model weights are never pushed to the repository
- Secrets managed via Hugging Face settings or environment variables
- Accessibility for visually impaired users
- Video narration and summarization
- Multimodal AI research
- Assistive learning tools
- Real-time camera narration
- Multi-language support
- Emotion-aware descriptions
- Mobile-first UI optimization
- Offline TTS fallback capability
Accessibility is not merely a feature; it is a responsibility. See2Say is built to reduce dependency and give blind users greater confidence in understanding their visual environment.