Submitted by: Krishom Basukala
Email: krishombasukala@gmail.com
Date: February 8, 2026
Repository: github.com/Krish-Om/rag-api
- ✅ PDF and TXT file upload support
- ✅ Text extraction (pdfplumber for PDF, direct reading for TXT)
- ✅ Two chunking strategies implemented (see the sketch after this checklist):
  - Fixed-size chunking: 800 characters with 100-character overlap
  - Semantic chunking: intelligent boundary detection using spaCy
- ✅ Embeddings generated using ONNX-optimized all-MiniLM-L6-v2 model (384D)
- ✅ Vector storage in Qdrant (as required - not FAISS/Chroma)
- ✅ Metadata stored in PostgreSQL database:
  - Document ID, filename, upload timestamp
  - Chunking strategy, chunk count
  - File size, document type
- ✅ Custom RAG implementation (no RetrievalQAChain used)
- ✅ Redis for chat memory with TOON optimization
- ✅ Multi-turn conversation support with context maintenance
- ✅ Interview booking using LLM-powered extraction:
  - Natural language booking requests
  - Extracts: name, email, date, time, interview type
  - Validates and provides suggestions for missing fields
- ✅ Booking information stored in PostgreSQL database
- ✅ Hybrid spaCy + LLM approach for robust extraction
- ✅ Clean, modular code following best practices
- ✅ Type hints throughout (Python 3.13 typing)
- ✅ Industry-standard project structure
- ✅ Comprehensive documentation
- ✅ Docker deployment ready
- ✅ Vector DB: Using Qdrant (NOT FAISS or Chroma)
- ✅ RAG: Custom implementation (NOT RetrievalQAChain)
- ✅ No UI: Backend-only as required
- ✅ Redis: Used for chat memory
- ✅ Booking: LLM-powered natural language extraction
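For illustration, here is a minimal sketch of the two chunking strategies (simplified; the function names are hypothetical and this is not the project's actual implementation):

```python
import spacy

def chunk_fixed(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size chunking: 800-character windows with 100-character overlap."""
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]

def chunk_semantic(text: str, max_chars: int = 800) -> list[str]:
    """Semantic chunking: accumulate whole spaCy sentences up to a size budget,
    so chunks never split mid-sentence."""
    nlp = spacy.load("en_core_web_sm")
    chunks: list[str] = []
    current = ""
    for sent in nlp(text).sents:
        if current and len(current) + len(sent.text) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sent.text + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```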
Prerequisites:
- Docker & Docker Compose
- 8GB+ RAM recommended
- 10GB disk space
```bash
# Clone repository
git clone https://github.com/Krish-Om/rag-api.git
cd rag-api

# Start all services
./deploy.sh up

# Wait ~2 minutes for services to initialize
# API will be available at: http://localhost:8000
```

Alternatively, start the services directly with Docker Compose:

```bash
docker compose up -d
```

```bash
# Health check
curl http://localhost:8000/api/v1/health

# Upload document
curl -X POST http://localhost:8000/api/v1/upload \
  -F "uploaded_file=@README.md" \
  -F "chunking_strategy=semantic"

# Chat with RAG
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is this API about?"}'

# Book interview
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "query": "I want to book a technical interview. My name is John Doe, email john@example.com, date is 2026-02-20, time is 2:00 PM"
  }'
```

Services:
- FastAPI Application: Main API server
- PostgreSQL 16: Document & booking metadata
- Qdrant: Vector database for embeddings (384D)
- Redis 7: Chat session memory with TOON optimization
- Ollama: Local LLM service (llama3.2:1b)
- ONNX Runtime: 78% smaller image, 67% less memory vs PyTorch (see the embedding sketch below)
- TOON Format: 40% token reduction for LLM prompts
- Hybrid Extraction: spaCy NER + LLM reasoning for bookings
- Type-Safe: Full typing with Pydantic models
- Document Upload: ~1-2s per document
- Embedding Generation: ~100ms per chunk
- Vector Search: <50ms for similarity queries
- LLM Response: 6-10s including retrieval
- Memory Usage: ~2.25GB total (all services)
- Interactive Docs: http://localhost:8000/docs (Swagger UI)
- Alternative Docs: http://localhost:8000/redoc (ReDoc)
- Detailed Guide: See README.md
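To give a sense of the ONNX embedding path, here is a hedged sketch (the model path is an assumption and the real artifact comes from scripts/convert_to_onnx.py; pooling details may differ from the project's code):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Hypothetical path; scripts/convert_to_onnx.py produces the actual artifact.
session = ort.InferenceSession("models/all-MiniLM-L6-v2.onnx")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(text: str) -> np.ndarray:
    """Return a 384-dimensional sentence embedding."""
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)
    # Assumes the exported model returns per-token embeddings: (batch, seq_len, 384).
    token_embeddings = session.run(None, dict(inputs))[0]
    # Mean-pool over tokens, masking out padding, then L2-normalize.
    mask = inputs["attention_mask"][..., None]
    pooled = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)
    return pooled[0] / np.linalg.norm(pooled[0])
```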
Upload and process documents with chunking and vectorization.
Request:
```bash
curl -X POST http://localhost:8000/api/v1/upload \
  -F "uploaded_file=@document.pdf" \
  -F "chunking_strategy=semantic"
```

Response:
```json
{
  "message": "Document successfully uploaded",
  "document_id": 8,
  "filename": "document.pdf",
  "chunks_created": 5,
  "processing_time_ms": 1234
}
```

Conversational RAG with booking support.
Request:
```bash
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the main topics in the uploaded documents?",
    "session_id": "optional-session-id"
  }'
```

Response (with context):
```json
{
  "response": "Based on the documents, the main topics include...",
  "session_id": "uuid-string",
  "context_used": true,
  "sources": [
    {
      "doc_id": 8,
      "content_preview": "Document excerpt...",
      "score": 0.85
    }
  ],
  "booking_info": null
}
```

Response (with booking):
```json
{
  "response": "I've created your booking...",
  "session_id": "uuid-string",
  "context_used": false,
  "sources": [],
  "booking_info": {
    "booking_detected": true,
    "booking_status": "valid",
    "extracted_info": {
      "name": "John Doe",
      "email": "john@example.com",
      "date": "2026-02-20",
      "time": "14:00",
      "type": "technical"
    },
    "missing_fields": [],
    "suggestions": [],
    "booking_id": 123
  }
}
```
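A minimal sketch of the hybrid spaCy + LLM idea behind the booking_info above (illustrative only; the function name, regex, and the exact spaCy/LLM split are assumptions, not the project's actual code):

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")
REQUIRED = ("name", "email", "date", "time", "type")

def extract_booking_fields(query: str) -> dict:
    """First pass: cheap deterministic extraction with spaCy NER plus a regex.
    Fields it cannot resolve are left for a second, LLM-powered pass."""
    fields: dict[str, str | None] = {k: None for k in REQUIRED}
    for ent in nlp(query).ents:
        if ent.label_ == "PERSON" and fields["name"] is None:
            fields["name"] = ent.text
        elif ent.label_ == "DATE" and fields["date"] is None:
            fields["date"] = ent.text
        elif ent.label_ == "TIME" and fields["time"] is None:
            fields["time"] = ent.text
    if m := re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", query):
        fields["email"] = m.group(0)
    return fields

fields = extract_booking_fields(
    "I want to book a technical interview. My name is John Doe, "
    "email john@example.com, date is 2026-02-20, time is 2:00 PM"
)
missing = [k for k, v in fields.items() if v is None]
# Anything left in `missing` (e.g. interview type) would be filled by the LLM
# pass, and reported via `missing_fields` / `suggestions` if still absent.
```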
- Test Coverage: 93% overall
- Status: ✅ ALL TESTS PASSED
- Environment: Docker Compose (production-like)
| Feature | Status | Details |
|---|---|---|
| Document Upload (TXT) | ✅ | 8 documents processed |
| Document Upload (PDF) | ✅ | Complex PDF handling |
| Semantic Chunking | ✅ | Intelligent boundaries |
| ONNX Embeddings | ✅ | 384D vectors, <100ms |
| Vector Storage | ✅ | Qdrant with 7+ vectors |
| RAG Context Retrieval | ✅ | Score 0.67+ for relevance |
| Multi-Turn Conversations | ✅ | Session memory working |
| Booking Detection | ✅ | Intent recognized |
| Booking Extraction | ✅ | All fields extracted |
| Database Persistence | ✅ | PostgreSQL verified |
Detailed Test Report: docs/TESTING.md
- One-command startup with Docker Compose
- All services containerized and orchestrated
- Health monitoring and logging
- Graceful error handling
- ONNX Runtime: 78% smaller Docker image (800MB vs 3.5GB)
- Memory Efficient: 67% less RAM usage (400MB vs 1.2GB)
- Fast Startup: 75% faster cold start (3-5s vs 15-20s)
- Token Optimization: TOON format saves 40% LLM tokens
- Hybrid Booking Extraction: spaCy + LLM for robustness
- Multi-Turn Context: Redis-backed conversation memory
- Custom RAG: No pre-built chains, full control (see the sketch after this list)
- Smart Chunking: Both fixed-size and semantic strategies
- Type Safety: Full type hints throughout
- Modular Design: Clean separation of concerns
- Documentation: Comprehensive README and API docs
- Best Practices: Follows industry standards
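Because the RAG pipeline is custom rather than a pre-built chain, the core loop stays small. A hedged sketch of the retrieve-then-generate flow, reusing the embed() helper sketched earlier (the collection name, payload key, and prompt wording are assumptions):

```python
import requests
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")

def answer(query: str, embed) -> str:
    """Custom RAG loop: embed the query, retrieve top chunks from Qdrant,
    stuff them into a prompt, and generate with the local Ollama model."""
    hits = qdrant.search(
        collection_name="documents",        # assumed collection name
        query_vector=embed(query).tolist(),
        limit=3,
    )
    context = "\n\n".join(h.payload["text"] for h in hits)  # assumed payload key
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:1b", "prompt": prompt, "stream": False},
        timeout=60,
    )
    return resp.json()["response"]
```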
Project structure:

```
.
├── README.md                # Main documentation
├── SUBMISSION.md            # This file
├── docs/
│   ├── TESTING.md           # Detailed test results
│   └── ONNX_OPTIMIZATION.md # Performance optimization guide
├── app/                     # Application source code
│   ├── api/                 # API endpoints
│   ├── services/            # Business logic
│   ├── database/            # Database models
│   └── config.py            # Configuration
├── scripts/                 # Utility scripts
│   ├── convert_to_onnx.py   # Model conversion
│   └── migrate_db.py        # Database migration
├── docker-compose.yml       # Service orchestration
├── Dockerfile               # API container definition
└── deploy.sh                # Deployment helper script
```
Qdrant:
- Best local vector database for development
- Excellent documentation and Python client
- Production-ready with horizontal scaling support
- Native Docker support

Ollama:
- Fully local LLM (no API keys required)
- Fast inference with quantized models
- Easy Docker deployment
- Cost-effective for demos and development

ONNX Runtime:
- 78% smaller deployments
- 67% less memory usage
- Same embedding quality
- Universal runtime (portable)

Redis (see the memory sketch below):
- Fast in-memory chat history
- TOON format: 40% token savings
- Simple session management
- Production-ready caching
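A minimal sketch of Redis-backed session memory (the key scheme, TTL value, and JSON serialization are assumptions; the project additionally encodes history in TOON format before prompting, which is omitted here):

```python
import json
import redis

r = redis.Redis.from_url("redis://localhost:6379", decode_responses=True)
SESSION_TTL = 24 * 3600  # assumed 24-hour expiry (listed under future enhancements)

def append_turn(session_id: str, role: str, content: str) -> None:
    """Push one chat turn onto the session's history list and refresh its TTL."""
    key = f"chat:{session_id}"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, SESSION_TTL)

def load_history(session_id: str, last_n: int = 10) -> list[dict]:
    """Fetch the most recent turns for multi-turn context."""
    return [json.loads(m) for m in r.lrange(f"chat:{session_id}", -last_n, -1)]
```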
```bash
./deploy.sh up
# or
docker compose up -d
```

Production:

```bash
# Set environment variables
export DATABASE_URL=postgresql://user:pass@prod-db:5432/ragdb
export QDRANT_URL=http://qdrant-cluster:6333
export REDIS_URL=redis://redis-cluster:6379

# Build and push
docker build -t rag-api:prod .
docker push rag-api:prod

# Deploy to orchestration platform
kubectl apply -f k8s/
```

Local development:

```bash
# Install dependencies
pip install -e .
python -m spacy download en_core_web_sm

# Start services (PostgreSQL, Redis, Qdrant, Ollama)
docker compose up -d postgres redis qdrant ollama

# Run API
uvicorn app.app:app --reload --port 8000
```

Performance benchmarks:

| Operation | Average | Target | Status |
|---|---|---|---|
| Document Upload | 1.2s | <5s | ✅ |
| Embedding Generation | 100ms | <500ms | ✅ |
| Vector Search | 35ms | <100ms | ✅ |
| LLM Response | 7s | <15s | ✅ |
| Booking Extraction | 6s | <15s | ✅ |
| Service | Memory | CPU | Disk |
|---|---|---|---|
| API | ~400MB | 5-15% | 800MB |
| PostgreSQL | ~150MB | 2-5% | 200MB |
| Qdrant | ~180MB | 3-8% | 150MB |
| Redis | ~20MB | 1-2% | 50MB |
| Ollama | ~1.5GB | 20-60% | 1.3GB |
| Total | ~2.25GB | 31-90% | 2.5GB |
- WebSocket Support: Real-time chat streaming
- OCR Integration: Scanned PDF support with pytesseract
- Session TTL: Automatic Redis cleanup (24-hour expiry)
- Rate Limiting: API request throttling
- Authentication: JWT token-based auth
- Monitoring: Prometheus + Grafana dashboards
- Horizontal scaling with load balancers
- Qdrant cluster mode for distributed vectors
- Redis Sentinel for high availability
- Async processing with Celery for large uploads
Krishom Basukala
- Email: krishombasukala@gmail.com
- LinkedIn: linkedin.com/in/krishom-basukala
- GitHub: github.com/Krish-Om
- Location: Bhaktapur, Nepal
- ✅ Available for 1+ year commitment
- ✅ Comfortable with 2-month notice period if selected
- Task Started: February 4, 2026
- Task Completed: February 8, 2026
- Total Time: 4 days (including testing and documentation)
- Public Repository: github.com/Krish-Om/rag-api
- Branch: `main` (production-ready code)
- License: MIT
Thank you for the opportunity to showcase my skills!
This submission demonstrates production-ready code, comprehensive testing, and professional documentation practices suitable for enterprise deployment.