A sophisticated multimodal Retrieval-Augmented Generation (RAG) system that enables conversational interaction with YouTube video content through advanced AI analysis.
- Multimodal Analysis: Processes both visual frames and textual captions from YouTube videos
- Precise Timestamps: Generates accurate time references for all answers with ±20-second context windows
- Visual Evidence: Displays relevant video frames that support generated responses
- Conversational Interface: Natural language querying powered by Google's Gemini AI
- Efficient Indexing: Leverages Qdrant vector database for high-performance content retrieval
- Responsive UI: Clean, modern Streamlit interface with real-time progress indicators
- Create a virtual environment
- Clone the repository
- Install the requirements: `pip install -r requirements.txt`
- Configure your Gemini API key in `config/config.yaml`
- Run the app: `streamlit run app.py`
- Clear the DB index with the sidebar button in the UI before uploading a URL.
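
The ±20-second context window mentioned in the feature list can be illustrated with a minimal sketch. This is not the project's actual code: the names `ContextWindow`, `context_window`, and `format_timestamp` are hypothetical, and clamping the window to the video's duration is an assumption about how edge cases might be handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextWindow:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def context_window(timestamp: float, duration: float, margin: float = 20.0) -> ContextWindow:
    """Return [timestamp - margin, timestamp + margin], clamped to [0, duration]."""
    if duration < 0 or margin < 0:
        raise ValueError("duration and margin must be non-negative")
    start = max(0.0, timestamp - margin)
    end = min(duration, timestamp + margin)
    # A timestamp far outside the video collapses to an empty window at the nearest edge.
    if start > end:
        start = end = min(max(timestamp, 0.0), duration)
    return ContextWindow(start, end)


def format_timestamp(seconds: float) -> str:
    """Format non-negative seconds as M:SS, or H:MM:SS for an hour or more."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


if __name__ == "__main__":
    # Hypothetical example: an answer anchored at 1:15 in a 10-minute video.
    window = context_window(75.0, duration=600.0)
    print(format_timestamp(window.start), format_timestamp(window.end))  # 0:55 1:35
```

Clamping keeps the window inside the video when a match falls within 20 seconds of the beginning or end, so a player link built from `start` never points before 0:00 or past the final frame.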