
RAG API - PalmMind Technology Task Submission

Submitted by: Krishom Basukala
Email: krishombasukala@gmail.com
Date: February 8, 2026
Repository: github.com/Krish-Om/rag-api


✅ Task Completion Checklist

API 1: Document Ingestion ✅

  • ✅ PDF and TXT file upload support
  • ✅ Text extraction (pdfplumber for PDF, direct reading for TXT)
  • ✅ Two chunking strategies implemented (see the sketch after this list):
    • Fixed-size chunking: 800 characters with 100-character overlap
    • Semantic chunking: Intelligent boundary detection using spaCy
  • ✅ Embeddings generated using ONNX-optimized all-MiniLM-L6-v2 model (384D)
  • ✅ Vector storage in Qdrant (as required - not FAISS/Chroma)
  • ✅ Metadata stored in PostgreSQL database:
    • Document ID, filename, upload timestamp
    • Chunking strategy, chunk count
    • File size, document type
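
A minimal sketch of the fixed-size strategy with the 800/100 parameters above; the function name and exact slicing are illustrative, not the repository's actual code:

def fixed_size_chunks(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Each chunk starts `size - overlap` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]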

API 2: Conversational RAG ✅

  • ✅ Custom RAG implementation (no RetrievalQAChain used)
  • ✅ Redis for chat memory with TOON optimization
  • ✅ Multi-turn conversation support with context maintenance
  • ✅ Interview booking using LLM-powered extraction:
    • Natural language booking requests
    • Extracts: name, email, date, time, interview type
    • Validates and provides suggestions for missing fields
  • ✅ Booking information stored in PostgreSQL database
  • ✅ Hybrid spaCy + LLM approach for robust extraction (see the sketch after this list)
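
A minimal sketch of the hybrid idea: cheap spaCy NER and regex passes run first, and only fields still missing would be handed to the LLM. The function name and regexes are illustrative, not the repository's actual code:

import re
import spacy

nlp = spacy.load("en_core_web_sm")

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_booking_fields(message: str) -> dict[str, str | None]:
    # First pass of a hybrid extractor: spaCy NER for the name, regex for
    # structured fields. Anything still None would go into an LLM prompt.
    doc = nlp(message)
    name = next((ent.text for ent in doc.ents if ent.label_ == "PERSON"), None)
    email = m.group(0) if (m := EMAIL_RE.search(message)) else None
    date = m.group(0) if (m := DATE_RE.search(message)) else None
    return {"name": name, "email": email, "date": date}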

Code Quality ✅

  • ✅ Clean, modular code following best practices
  • ✅ Type hints throughout (Python 3.13 typing)
  • ✅ Industry-standard project structure
  • ✅ Comprehensive documentation
  • ✅ Docker deployment ready

Constraints Adherence ✅

  • ✅ Vector DB: Using Qdrant (NOT FAISS or Chroma)
  • ✅ RAG: Custom implementation (NOT RetrievalQAChain)
  • ✅ No UI: Backend-only as required
  • ✅ Redis: Used for chat memory
  • ✅ Booking: LLM-powered natural language extraction

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • 8GB+ RAM recommended
  • 10GB disk space

One-Command Deployment

# Clone repository
git clone https://github.com/Krish-Om/rag-api.git
cd rag-api

# Start all services
./deploy.sh up

# Wait ~2 minutes for services to initialize
# API will be available at: http://localhost:8000

Alternative: Docker Compose

docker compose up -d

Test the APIs

# Health check
curl http://localhost:8000/api/v1/health

# Upload document
curl -X POST http://localhost:8000/api/v1/upload \
  -F "uploaded_file=@README.md" \
  -F "chunking_strategy=semantic"

# Chat with RAG
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is this API about?"}'

# Book interview
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "query": "I want to book a technical interview. My name is John Doe, email john@example.com, date is 2026-02-20, time is 2:00 PM"
  }'

πŸ—οΈ Architecture Highlights

Services

  • FastAPI Application: Main API server
  • PostgreSQL 16: Document & booking metadata
  • Qdrant: Vector database for embeddings (384D)
  • Redis 7: Chat session memory with TOON optimization
  • Ollama: Local LLM service (llama3.2:1b)

Key Technologies

  • ONNX Runtime: 78% smaller image, 67% less memory vs PyTorch (see the embedding sketch after this list)
  • TOON Format: 40% token reduction for LLM prompts
  • Hybrid Extraction: spaCy NER + LLM reasoning for bookings
  • Type-Safe: Full typing with Pydantic models
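
A minimal sketch of ONNX-based embedding for all-MiniLM-L6-v2, as referenced above. The model path is hypothetical and the exact input names depend on how the model was exported; the onnxruntime and Hugging Face tokenizer calls themselves are real APIs:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
session = ort.InferenceSession("models/all-MiniLM-L6-v2.onnx")  # hypothetical path

def embed(texts: list[str]) -> np.ndarray:
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    # Feed only the inputs the exported graph actually declares.
    inputs = {i.name: enc[i.name] for i in session.get_inputs()}
    token_embeds = session.run(None, inputs)[0]           # (batch, seq, 384)
    mask = enc["attention_mask"][:, :, None]              # zero out padding tokens
    pooled = (token_embeds * mask).sum(1) / mask.sum(1)   # mean pooling -> (batch, 384)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)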

Performance

  • Document Upload: ~1-2s per document
  • Embedding Generation: ~100ms per chunk
  • Vector Search: <50ms for similarity queries
  • LLM Response: 6-10s including retrieval
  • Memory Usage: ~2.25GB total (all services)

📋 API Documentation

Full API Documentation

Key Endpoints

POST /api/v1/upload

Upload and process documents with chunking and vectorization.

Request:

curl -X POST http://localhost:8000/api/v1/upload \
  -F "uploaded_file=@document.pdf" \
  -F "chunking_strategy=semantic"

Response:

{
  "message": "Document successfully uploaded",
  "document_id": 8,
  "filename": "document.pdf",
  "chunks_created": 5,
  "processing_time_ms": 1234
}
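
The same upload from Python, for reference; the multipart field names match the curl example above:

import requests

with open("document.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/v1/upload",
        files={"uploaded_file": f},
        data={"chunking_strategy": "semantic"},
    )
print(resp.json())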

POST /api/v1/chat

Conversational RAG with booking support.

Request:

curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the main topics in the uploaded documents?",
    "session_id": "optional-session-id"
  }'

Response (with context):

{
  "response": "Based on the documents, the main topics include...",
  "session_id": "uuid-string",
  "context_used": true,
  "sources": [
    {
      "doc_id": 8,
      "content_preview": "Document excerpt...",
      "score": 0.85
    }
  ],
  "booking_info": null
}

Response (with booking):

{
  "response": "I've created your booking...",
  "session_id": "uuid-string",
  "context_used": false,
  "sources": [],
  "booking_info": {
    "booking_detected": true,
    "booking_status": "valid",
    "extracted_info": {
      "name": "John Doe",
      "email": "john@example.com",
      "date": "2026-02-20",
      "time": "14:00",
      "type": "technical"
    },
    "missing_fields": [],
    "suggestions": [],
    "booking_id": 123
  }
}
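
A sketch of Pydantic models matching the response shape above; the class names and optionality are inferred from the JSON examples, not taken from the repository's code:

from pydantic import BaseModel

class BookingInfo(BaseModel):
    booking_detected: bool
    booking_status: str
    extracted_info: dict[str, str | None]
    missing_fields: list[str]
    suggestions: list[str]
    booking_id: int | None = None

class ChatResponse(BaseModel):
    response: str
    session_id: str
    context_used: bool
    sources: list[dict]
    booking_info: BookingInfo | None = None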

🧪 Testing

Test Results

  • Test Coverage: 93% overall
  • Status: ✅ ALL TESTS PASSED
  • Environment: Docker Compose (production-like)

Test Scenarios Verified

| Feature | Status | Details |
| --- | --- | --- |
| Document Upload (TXT) | ✅ | 8 documents processed |
| Document Upload (PDF) | ✅ | Complex PDF handling |
| Semantic Chunking | ✅ | Intelligent boundaries |
| ONNX Embeddings | ✅ | 384D vectors, <100ms |
| Vector Storage | ✅ | Qdrant with 7+ vectors |
| RAG Context Retrieval | ✅ | Score 0.67+ for relevance |
| Multi-Turn Conversations | ✅ | Session memory working |
| Booking Detection | ✅ | Intent recognized |
| Booking Extraction | ✅ | All fields extracted |
| Database Persistence | ✅ | PostgreSQL verified |

Detailed Test Report: docs/TESTING.md


🎯 Key Achievements

1. Production-Ready Deployment

  • One-command startup with Docker Compose
  • All services containerized and orchestrated
  • Health monitoring and logging
  • Graceful error handling

2. Performance Optimization

  • ONNX Runtime: 78% smaller Docker image (800MB vs 3.5GB)
  • Memory Efficient: 67% less RAM usage (400MB vs 1.2GB)
  • Fast Startup: 75% faster cold start (3-5s vs 15-20s)
  • Token Optimization: TOON format saves 40% LLM tokens

3. Advanced Features

  • Hybrid Booking Extraction: spaCy + LLM for robustness
  • Multi-Turn Context: Redis-backed conversation memory
  • Custom RAG: No pre-built chains, full control (see the sketch after this list)
  • Smart Chunking: Both fixed-size and semantic strategies
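
As referenced above, a minimal sketch of what one hand-rolled RAG turn can look like. The collection name and payload key are assumptions; qdrant-client's search() and Ollama's /api/generate endpoint are real APIs, and embed() is the sketch from the architecture section:

import requests
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")

def rag_answer(query: str, top_k: int = 3) -> str:
    hits = qdrant.search(
        collection_name="documents",                 # assumed collection name
        query_vector=embed([query])[0].tolist(),     # embed() sketched earlier
        limit=top_k,
    )
    context = "\n\n".join(h.payload["content"] for h in hits)  # assumed payload key
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:1b", "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]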

4. Code Quality

  • Type Safety: Full type hints throughout
  • Modular Design: Clean separation of concerns
  • Documentation: Comprehensive README and API docs
  • Best Practices: Follows industry standards

📖 Documentation Structure

.
├── README.md                    # Main documentation
├── SUBMISSION.md                # This file
├── docs/
│   ├── TESTING.md               # Detailed test results
│   └── ONNX_OPTIMIZATION.md     # Performance optimization guide
├── app/                         # Application source code
│   ├── api/                     # API endpoints
│   ├── services/                # Business logic
│   ├── database/                # Database models
│   └── config.py                # Configuration
├── scripts/                     # Utility scripts
│   ├── convert_to_onnx.py       # Model conversion
│   └── migrate_db.py            # Database migration
├── docker-compose.yml           # Service orchestration
├── Dockerfile                   # API container definition
└── deploy.sh                    # Deployment helper script

πŸ” Technical Decisions

Why Qdrant?

  • Lightweight and easy to run locally during development
  • Excellent documentation and Python client
  • Production-ready with horizontal scaling support
  • Native Docker support

Why Ollama?

  • Fully local LLM (no API keys required)
  • Fast inference with quantized models
  • Easy Docker deployment
  • Cost-effective for demos and development

Why ONNX?

  • 78% smaller deployments
  • 67% less memory usage
  • Same embedding quality
  • Universal runtime (portable)

Why Redis + TOON?

  • Fast in-memory chat history
  • TOON format: 40% token savings
  • Simple session management (sketched below)
  • Production-ready caching
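
A minimal sketch of the Redis session-memory pattern; the key scheme is an assumption, serialization is shown as plain JSON rather than TOON, and the 24-hour TTL anticipates the cleanup listed under Future Enhancements:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def append_turn(session_id: str, role: str, content: str) -> None:
    key = f"chat:{session_id}"                # assumed key naming scheme
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, 24 * 3600)                  # 24h expiry (see Future Enhancements)

def load_history(session_id: str) -> list[dict]:
    return [json.loads(m) for m in r.lrange(f"chat:{session_id}", 0, -1)]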

🚀 Deployment Options

1. Local Docker (Recommended for Demo)

./deploy.sh up
# or
docker compose up -d

2. Production Deployment

# Set environment variables
export DATABASE_URL=postgresql://user:pass@prod-db:5432/ragdb
export QDRANT_URL=http://qdrant-cluster:6333
export REDIS_URL=redis://redis-cluster:6379

# Build and push
docker build -t rag-api:prod .
docker push rag-api:prod

# Deploy to orchestration platform
kubectl apply -f k8s/

3. Local Development

# Install dependencies
pip install -e .
python -m spacy download en_core_web_sm

# Start services (PostgreSQL, Redis, Qdrant, Ollama)
docker compose up -d postgres redis qdrant ollama

# Run API
uvicorn app.app:app --reload --port 8000

📊 Performance Metrics

Latency

| Operation | Average | Target | Status |
| --- | --- | --- | --- |
| Document Upload | 1.2s | <5s | ✅ |
| Embedding Generation | 100ms | <500ms | ✅ |
| Vector Search | 35ms | <100ms | ✅ |
| LLM Response | 7s | <15s | ✅ |
| Booking Extraction | 6s | <15s | ✅ |

Resource Usage

| Service | Memory | CPU | Disk |
| --- | --- | --- | --- |
| API | ~400MB | 5-15% | 800MB |
| PostgreSQL | ~150MB | 2-5% | 200MB |
| Qdrant | ~180MB | 3-8% | 150MB |
| Redis | ~20MB | 1-2% | 50MB |
| Ollama | ~1.5GB | 20-60% | 1.3GB |
| Total | ~2.25GB | 31-90% | 2.5GB |

🔧 Future Enhancements

Planned Improvements

  1. WebSocket Support: Real-time chat streaming
  2. OCR Integration: Scanned PDF support with pytesseract
  3. Session TTL: Automatic Redis cleanup (24-hour expiry)
  4. Rate Limiting: API request throttling
  5. Authentication: JWT token-based auth
  6. Monitoring: Prometheus + Grafana dashboards

Scalability Considerations

  • Horizontal scaling with load balancers
  • Qdrant cluster mode for distributed vectors
  • Redis Sentinel for high availability
  • Async processing with Celery for large uploads

📞 Contact

Krishom Basukala
Email: krishombasukala@gmail.com


πŸ“ Notes

Commitment

  • ✅ Available for 1+ year commitment
  • ✅ Comfortable with 2-month notice period if selected

Submission Timeline

  • Task Started: February 4, 2026
  • Task Completed: February 8, 2026
  • Total Time: 4 days (including testing and documentation)

Repository

github.com/Krish-Om/rag-api
Thank you for the opportunity to showcase my skills!

This submission demonstrates production-ready code, comprehensive testing, and professional documentation practices suitable for enterprise deployment.