See2Say is an AI-powered accessibility platform that helps visually impaired users understand videos through natural speech. It converts video content into meaningful audio narration using modern Computer Vision and Generative AI.
Think of it as a companion that describes visual surroundings through audio.
- User uploads a video
- Video is broken into frames using OpenCV
- Each frame is captioned using BLIP (Vision-Language Model)
- All captions are merged into natural Hinglish narration using Gemini
- Final narration is converted into audio output
- User listens to the scene description without needing visual access
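The steps above can be sketched end to end. Every helper name here (`extract_frames`, `caption_frame`, and so on) is a hypothetical stand-in for the real modules; the stubs return dummy values purely to show how each stage hands data to the next.

```python
# Hypothetical end-to-end sketch of the See2Say flow.
# Each helper below is a stub; the real project implements these
# with OpenCV, BLIP, Gemini, and gTTS respectively.

def extract_frames(video_path):
    # Real version: cv2.VideoCapture plus interval sampling.
    return ["frame_0", "frame_1"]

def caption_frame(frame):
    # Real version: BLIP vision-language model.
    return f"caption for {frame}"

def summarize(captions):
    # Real version: Gemini merges captions into Hinglish narration.
    return " | ".join(captions)

def synthesize(text):
    # Real version: gTTS produces MP3 bytes.
    return text.encode("utf-8")

def describe_video(video_path):
    frames = extract_frames(video_path)
    captions = [caption_frame(f) for f in frames]
    narration = summarize(captions)
    audio = synthesize(narration)
    return {"captions": captions, "final_summary": narration, "audio": audio}
```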
- AI-based video understanding
- Vision to Language to Speech pipeline
- Hinglish narration (human-like, not robotic)
- Built for blind and low-vision users
- FastAPI backend (scalable)
- React + Tailwind frontend
- Hugging Face-ready deployment
fastapi
uvicorn
python-dotenv
google-genai
opencv-python
Pillow
transformers
torch
gtts
numpy
Technology Choices
- transformers + torch for BLIP image captioning
- google-genai for Gemini summarization
- opencv-python for frame extraction
- gtts for text-to-speech conversion
- fastapi for a clean, asynchronous API backend
{
"react": "^19.x",
"vite": "^7.x",
"tailwindcss": "^4.x"
}

- Minimal UI focused on accessibility
- Lightweight and fast performance
see2say/
├── Backend/
│ ├── main.py # FastAPI entry point
│ ├── app/
│ │ ├── routes.py # API routes
│ │ ├── models.py # BLIP captioning logic
│ │ ├── gemini_client.py # Gemini summarization
│ │ ├── utils.py # TTS and helper functions
│ └── uploads/ # Temporary video storage
│
├── frontend/
│ ├── src/
│ ├── package.json
│
├── requirements.txt
├── Dockerfile
├── .gitignore
└── README.md
Create a .env file (do not commit this file):
GEMINI_API_KEY=your_api_key_here
On Hugging Face Spaces, set this under Settings -> Repository Secrets.
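Locally, the key can be read like this (python-dotenv's `load_dotenv()` populates the process environment from the .env file; the helper name below is illustrative):

```python
import os

def get_gemini_key():
    """Fetch GEMINI_API_KEY from the environment.

    Locally, call dotenv.load_dotenv() once at startup so the .env file
    populates os.environ; on Hugging Face Spaces the Repository Secret
    is injected directly. The key is then passed to the google-genai
    client, e.g. genai.Client(api_key=get_gemini_key()).
    """
    key = os.getenv("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; see the .env setup above")
    return key
```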
Video Input
↓
Frame Extraction (OpenCV)
↓
Image Captioning (BLIP)
↓
Narration Generation (Gemini)
↓
Speech Synthesis (gTTS)
↓
Audio Output
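The final stage produces MP3 bytes (gTTS can write them into a BytesIO buffer via `write_to_fp`), and the API returns them as the base64 `audio` string shown in the Output example below. A minimal sketch of that encoding step, using dummy bytes in place of a real gTTS call:

```python
import base64

def audio_to_base64(audio_bytes: bytes) -> str:
    """Encode raw MP3 bytes as an ASCII string safe to embed in JSON."""
    return base64.b64encode(audio_bytes).decode("ascii")

def base64_to_audio(audio_b64: str) -> bytes:
    """Client-side inverse: recover playable MP3 bytes."""
    return base64.b64decode(audio_b64)
```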
Input
- Video file upload
Output
{
"captions": ["man walking on road", "car passing by"],
"final_summary": "Ek aadmi road ke side chal raha hai...",
"audio": "base64_audio_string"
}

pip install -r requirements.txt
uvicorn Backend.main:app --reload

cd frontend
npm install
npm run dev

- Backend deployed using Hugging Face Spaces (Docker)
- Models are not stored in GitHub repository
- BLIP model downloads at runtime from Hugging Face Hub
This approach keeps the repository clean, lightweight, and secure.
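The runtime download typically amounts to a `from_pretrained` call: the first call fetches the weights from the Hugging Face Hub into the local cache, and later calls reuse the cache. The model id below is the standard BLIP captioning checkpoint and is an assumption about which variant the project uses; imports are deferred so the module stays importable without the heavy dependencies.

```python
def load_captioner(model_id="Salesforce/blip-image-captioning-base"):
    """Download (on first call) and load the BLIP captioning model.

    transformers caches weights under ~/.cache/huggingface, so nothing
    needs to be committed to the Git repository. Imports are lazy so
    that merely importing this module does not require torch.
    """
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    return processor, model
```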
- .env file is never committed to version control
- Model weights are never pushed to the repository
- Secrets managed via Hugging Face settings or environment variables
- Accessibility for visually impaired users
- Video narration and summarization
- Multimodal AI research
- Assistive learning tools
- Real-time camera narration
- Multi-language support
- Emotion-aware descriptions
- Mobile-first UI optimization
- Offline TTS fallback capability
Accessibility is not merely a feature; it is a responsibility. See2Say is built to reduce dependency and give blind users greater confidence in understanding their visual environment.