A powerful FastAPI-based multi-modal ingestion system that processes PDFs, scanned documents, images, videos, YouTube links, and text files — then optionally performs semantic retrieval using FAISS + HuggingFace embeddings and refines answers using an LLM (Ollama via NPMAI).
- 📄 Extract text from searchable PDFs
- 🖨️ OCR for scanned PDFs
- 🖼️ Image OCR (Tesseract + OpenCV preprocessing)
- 🎥 Local video speech-to-text (Whisper)
- 📺 YouTube video transcription (yt-dlp + Whisper)
- 📃 Plain text processing
- 🧠 FAISS vector database creation & loading
- 🔎 Semantic similarity search
- ♻️ Iterative refinement using LLM (Ollama)
- 🗂 Automatic ingestion routing based on file type
```
Client Request
      ↓
/ingestion Endpoint
      ↓
File Type Detection
      ↓
Text Extraction (PDF/OCR/Video/etc.)
      ↓
Optional Vector DB Retrieval (FAISS)
      ↓
Refinement via LLM
      ↓
Final Response
```
GET /

Returns:

```json
{ "ok": true }
```

POST /ingestion

Form fields:

- `file` → Upload file (pdf, txt, mp4, jpg, png, etc.)
- `query` → Optional semantic query
- `DB_PATH` → Path to vector database
- `link` → YouTube link
- `output_path` → Download location for video
- `temperature` → LLM temperature
- `model` → Ollama model name
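As a quick illustration of how a client might assemble these fields, here is a minimal sketch. The helper name `build_ingestion_form` is hypothetical and not part of the API; it simply mirrors the optional/required parameters listed above.

```python
def build_ingestion_form(file_path=None, query=None, db_path=None, link=None,
                         output_path=None, temperature=0.7, model="llama3"):
    """Assemble the form fields for POST /ingestion.

    Only the optional fields that were actually supplied are included,
    mirroring the endpoint's optional parameters.
    """
    form = {"temperature": str(temperature), "model": model}
    if file_path:
        form["file"] = file_path
    if query:
        form["query"] = query
    if db_path:
        form["DB_PATH"] = db_path
    if link:
        form["link"] = link
    if output_path:
        form["output_path"] = output_path
    return form
```

The resulting dict can then be passed to an HTTP client of your choice (e.g. as multipart form data).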
| Type | Processing Method |
|---|---|
| PDF (text-based) | PyMuPDF |
| PDF (scanned) | pdf2image + Tesseract |
| Image | OpenCV + Tesseract |
| TXT | Direct read |
| MP4 | Whisper transcription |
| YouTube | yt-dlp + Whisper |
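The routing in the table above can be sketched as a simple extension-based dispatch. The handler names here are illustrative placeholders, not the actual function names in the codebase:

```python
from pathlib import Path

# Hypothetical dispatch table mirroring the processing matrix above.
HANDLERS = {
    ".pdf": "extract_pdf",        # PyMuPDF, with OCR fallback for scans
    ".txt": "read_text",
    ".mp4": "transcribe_video",   # Whisper
    ".jpg": "ocr_image",          # OpenCV + Tesseract
    ".jpeg": "ocr_image",
    ".png": "ocr_image",
}

def route(filename: str) -> str:
    """Pick a handler based on the uploaded file's extension."""
    ext = Path(filename).suffix.lower()
    try:
        return HANDLERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}")
```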
If query and DB_PATH are provided:
- Check if FAISS DB exists
- If yes → Load and perform similarity search
- If no → Create embeddings & save DB
- Retrieve top 4 chunks
- Send to LLM refine loop
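The actual pipeline delegates similarity search to FAISS; the dependency-free sketch below only illustrates the idea behind the "retrieve top 4 chunks" step, ranking chunks by cosine similarity to the query embedding:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, chunks, k=4):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(zip(chunks, chunk_vecs),
                    key=lambda cv: cosine(query_vec, cv[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:k]]
```

FAISS does the same ranking with approximate-nearest-neighbor indexes, which is what makes it fast at scale.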
- Embeddings: all-MiniLM-L6-v2
- Vector DB: FAISS
- Chunk Size: 1000
- Overlap: 200
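A simplified sketch of the chunking configuration above (fixed-size character windows with overlap, the basic idea behind LangChain's text splitters):

```python
def split_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks with overlap.

    Each chunk repeats the last `overlap` characters of the previous one,
    so context spanning a chunk boundary is not lost.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```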
For each retrieved chunk:
- Pass context to LLM
- Iteratively refine previous answer
- Return final refined response
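The refine loop above can be sketched as follows. This is an illustration of the pattern, not the project's exact prompts; `llm` stands in for any callable that maps a prompt string to a completion (e.g. a thin wrapper around Ollama):

```python
def refine(chunks, question, llm):
    """Iterative refinement: draft an answer from the first chunk, then ask
    the LLM to improve the running answer with each subsequent chunk."""
    answer = llm(f"Context:\n{chunks[0]}\n\nQuestion: {question}\nAnswer:")
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer:\n{answer}\n\n"
            f"Refine it using this additional context:\n{chunk}\n\n"
            f"Question: {question}\nRefined answer:"
        )
    return answer
```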
Install required packages:
```
pip install fastapi uvicorn
pip install langchain langchain-community
pip install faiss-cpu
pip install openai-whisper
pip install moviepy
pip install pytesseract
pip install pdf2image
pip install pymupdf
pip install yt-dlp
pip install opencv-python
pip install pillow
pip install numpy
```

Make sure:
- Tesseract OCR is installed on the system
- FFmpeg is installed
- Ollama is running locally
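A small startup check can catch missing system dependencies early. This helper is a suggestion, not part of the project; it only verifies that the required binaries are on `PATH`:

```python
import shutil

def check_external_tools(tools=("tesseract", "ffmpeg", "ollama")):
    """Return the list of required system binaries missing from PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Call it at startup and fail fast with a clear message if the returned list is non-empty.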
```
uvicorn main:app --reload
```

POST /ingestion

Form Data:

```
file = document.pdf
query = "Summarize key points"
DB_PATH = vector_db
model = llama3
temperature = 0.7
```
- GPU is disabled (`CUDA_VISIBLE_DEVICES=""`)
- Whisper model loads once (thread-safe singleton)
- FAISS uses dangerous deserialization (use trusted DB paths only)
- Temporary audio is saved as `temp.wav`
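The thread-safe singleton pattern mentioned above can be sketched like this. The `loader` argument stands in for the expensive model load (e.g. `whisper.load_model("base")`); the class is illustrative, not the project's actual implementation:

```python
import threading

class WhisperSingleton:
    """Load an expensive model exactly once, even under concurrent requests."""
    _lock = threading.Lock()
    _model = None

    @classmethod
    def get(cls, loader):
        if cls._model is None:          # fast path: no lock once loaded
            with cls._lock:
                if cls._model is None:  # double-checked locking
                    cls._model = loader()
        return cls._model
```

Double-checked locking avoids taking the lock on every request while still guaranteeing the loader runs only once.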
- Streaming responses
- Async video processing
- Chunk-level caching
- Background task queue
- Better refine logic
- Support for multiple vector stores
- Docker support
- FastAPI
- FAISS
- HuggingFace Embeddings
- Whisper
- OpenCV
- Tesseract OCR
- PyMuPDF
- yt-dlp
- npmai
MIT License
This system acts as a universal AI ingestion pipeline capable of processing multi-modal data and performing intelligent semantic retrieval with LLM refinement.
It can serve as:
- AI document assistant
- Video summarizer
- Research helper
- OCR intelligence engine
- Knowledge base system
