
RAG API - PalmMind Technology Task Submission

Submitted by: Krishom Basukala
Email: krishombasukala@gmail.com
Date: February 8, 2026
Repository: github.com/Krish-Om/rag-api


✅ Task Completion Checklist

API 1: Document Ingestion ✅

  • ✅ PDF and TXT file upload support
  • ✅ Text extraction (pdfplumber for PDF, direct reading for TXT)
  • ✅ Two chunking strategies implemented (see the sketch after this list):
    • Fixed-size chunking: 800 characters with 100-character overlap
    • Semantic chunking: Intelligent boundary detection using spaCy
  • ✅ Embeddings generated using ONNX-optimized all-MiniLM-L6-v2 model (384D)
  • ✅ Vector storage in Qdrant (as required - not FAISS/Chroma)
  • ✅ Metadata stored in PostgreSQL database:
    • Document ID, filename, upload timestamp
    • Chunking strategy, chunk count
    • File size, document type
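
A minimal sketch of the fixed-size strategy with the 800/100 parameters above; the function name and exact slicing are illustrative, not the repository's actual code:

def fixed_size_chunks(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Each chunk starts `size - overlap` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]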

API 2: Conversational RAG ✅

  • ✅ Custom RAG implementation (no RetrievalQAChain used)
  • ✅ Redis for chat memory with TOON optimization
  • ✅ Multi-turn conversation support with context maintenance
  • ✅ Interview booking using LLM-powered extraction:
    • Natural language booking requests
    • Extracts: name, email, date, time, interview type
    • Validates and provides suggestions for missing fields
  • ✅ Booking information stored in PostgreSQL database
  • ✅ Hybrid spaCy + LLM approach for robust extraction (see the sketch after this list)
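
A minimal sketch of the hybrid idea: cheap spaCy NER and regex passes run first, and only fields still missing would be handed to the LLM. The function name and regexes are illustrative, not the repository's actual code:

import re
import spacy

nlp = spacy.load("en_core_web_sm")

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_booking_fields(message: str) -> dict[str, str | None]:
    # First pass of a hybrid extractor: spaCy NER for the name, regex for
    # structured fields. Anything still None would go into an LLM prompt.
    doc = nlp(message)
    name = next((ent.text for ent in doc.ents if ent.label_ == "PERSON"), None)
    email = m.group(0) if (m := EMAIL_RE.search(message)) else None
    date = m.group(0) if (m := DATE_RE.search(message)) else None
    return {"name": name, "email": email, "date": date}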

Code Quality ✅

  • ✅ Clean, modular code following best practices
  • ✅ Type hints throughout (Python 3.13 typing)
  • ✅ Industry-standard project structure
  • ✅ Comprehensive documentation
  • ✅ Docker deployment ready

Constraints Adherence ✅

  • ✅ Vector DB: Using Qdrant (NOT FAISS or Chroma)
  • ✅ RAG: Custom implementation (NOT RetrievalQAChain)
  • ✅ No UI: Backend-only as required
  • ✅ Redis: Used for chat memory
  • ✅ Booking: LLM-powered natural language extraction

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • 8GB+ RAM recommended
  • 10GB disk space

One-Command Deployment

# Clone repository
git clone https://github.com/Krish-Om/rag-api.git
cd rag-api

# Start all services
./deploy.sh up

# Wait ~2 minutes for services to initialize
# API will be available at: http://localhost:8000

Alternative: Docker Compose

docker compose up -d

Test the APIs

# Health check
curl http://localhost:8000/api/v1/health

# Upload document
curl -X POST http://localhost:8000/api/v1/upload \
  -F "uploaded_file=@README.md" \
  -F "chunking_strategy=semantic"

# Chat with RAG
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is this API about?"}'

# Book interview
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "query": "I want to book a technical interview. My name is John Doe, email john@example.com, date is 2026-02-20, time is 2:00 PM"
  }'

πŸ—οΈ Architecture Highlights

Services

  • FastAPI Application: Main API server
  • PostgreSQL 16: Document & booking metadata
  • Qdrant: Vector database for embeddings (384D)
  • Redis 7: Chat session memory with TOON optimization
  • Ollama: Local LLM service (llama3.2:1b)

Key Technologies

  • ONNX Runtime: 78% smaller image, 67% less memory vs PyTorch (see the embedding sketch after this list)
  • TOON Format: 40% token reduction for LLM prompts
  • Hybrid Extraction: spaCy NER + LLM reasoning for bookings
  • Type-Safe: Full typing with Pydantic models
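
A minimal sketch of ONNX-based embedding for all-MiniLM-L6-v2, as referenced above. The model path is hypothetical and the exact input names depend on how the model was exported; the onnxruntime and Hugging Face tokenizer calls themselves are real APIs:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
session = ort.InferenceSession("models/all-MiniLM-L6-v2.onnx")  # hypothetical path

def embed(texts: list[str]) -> np.ndarray:
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    # Feed only the inputs the exported graph actually declares.
    inputs = {i.name: enc[i.name] for i in session.get_inputs()}
    token_embeds = session.run(None, inputs)[0]           # (batch, seq, 384)
    mask = enc["attention_mask"][:, :, None]              # zero out padding tokens
    pooled = (token_embeds * mask).sum(1) / mask.sum(1)   # mean pooling -> (batch, 384)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)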

Performance

  • Document Upload: ~1-2s per document
  • Embedding Generation: ~100ms per chunk
  • Vector Search: <50ms for similarity queries
  • LLM Response: 6-10s including retrieval
  • Memory Usage: ~2.25GB total (all services)

📋 API Documentation

Full API Documentation

Key Endpoints

POST /api/v1/upload

Upload and process documents with chunking and vectorization.

Request:

curl -X POST http://localhost:8000/api/v1/upload \
  -F "uploaded_file=@document.pdf" \
  -F "chunking_strategy=semantic"

Response:

{
  "message": "Document successfully uploaded",
  "document_id": 8,
  "filename": "document.pdf",
  "chunks_created": 5,
  "processing_time_ms": 1234
}
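
The same upload from Python, for reference; the multipart field names match the curl example above:

import requests

with open("document.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/v1/upload",
        files={"uploaded_file": f},
        data={"chunking_strategy": "semantic"},
    )
print(resp.json())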

POST /api/v1/chat

Conversational RAG with booking support.

Request:

curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the main topics in the uploaded documents?",
    "session_id": "optional-session-id"
  }'

Response (with context):

{
  "response": "Based on the documents, the main topics include...",
  "session_id": "uuid-string",
  "context_used": true,
  "sources": [
    {
      "doc_id": 8,
      "content_preview": "Document excerpt...",
      "score": 0.85
    }
  ],
  "booking_info": null
}

Response (with booking):

{
  "response": "I've created your booking...",
  "session_id": "uuid-string",
  "context_used": false,
  "sources": [],
  "booking_info": {
    "booking_detected": true,
    "booking_status": "valid",
    "extracted_info": {
      "name": "John Doe",
      "email": "john@example.com",
      "date": "2026-02-20",
      "time": "14:00",
      "type": "technical"
    },
    "missing_fields": [],
    "suggestions": [],
    "booking_id": 123
  }
}
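
A sketch of Pydantic models matching the response shape above; the class names and optionality are inferred from the JSON examples, not taken from the repository's code:

from pydantic import BaseModel

class BookingInfo(BaseModel):
    booking_detected: bool
    booking_status: str
    extracted_info: dict[str, str | None]
    missing_fields: list[str]
    suggestions: list[str]
    booking_id: int | None = None

class ChatResponse(BaseModel):
    response: str
    session_id: str
    context_used: bool
    sources: list[dict]
    booking_info: BookingInfo | None = None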

🧪 Testing

Test Results

  • Test Coverage: 93% overall
  • Status: ✅ ALL TESTS PASSED
  • Environment: Docker Compose (production-like)

Test Scenarios Verified

| Feature | Status | Details |
| --- | --- | --- |
| Document Upload (TXT) | ✅ | 8 documents processed |
| Document Upload (PDF) | ✅ | Complex PDF handling |
| Semantic Chunking | ✅ | Intelligent boundaries |
| ONNX Embeddings | ✅ | 384D vectors, <100ms |
| Vector Storage | ✅ | Qdrant with 7+ vectors |
| RAG Context Retrieval | ✅ | Score 0.67+ for relevance |
| Multi-Turn Conversations | ✅ | Session memory working |
| Booking Detection | ✅ | Intent recognized |
| Booking Extraction | ✅ | All fields extracted |
| Database Persistence | ✅ | PostgreSQL verified |

Detailed Test Report: docs/TESTING.md


🎯 Key Achievements

1. Production-Ready Deployment

  • One-command startup with Docker Compose
  • All services containerized and orchestrated
  • Health monitoring and logging
  • Graceful error handling

2. Performance Optimization

  • ONNX Runtime: 78% smaller Docker image (800MB vs 3.5GB)
  • Memory Efficient: 67% less RAM usage (400MB vs 1.2GB)
  • Fast Startup: 75% faster cold start (3-5s vs 15-20s)
  • Token Optimization: TOON format saves 40% LLM tokens

3. Advanced Features

  • Hybrid Booking Extraction: spaCy + LLM for robustness
  • Multi-Turn Context: Redis-backed conversation memory
  • Custom RAG: No pre-built chains, full control (see the sketch after this list)
  • Smart Chunking: Both fixed-size and semantic strategies
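
As referenced above, a minimal sketch of what one hand-rolled RAG turn can look like. The collection name and payload key are assumptions; qdrant-client's search() and Ollama's /api/generate endpoint are real APIs, and embed() is the sketch from the architecture section:

import requests
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")

def rag_answer(query: str, top_k: int = 3) -> str:
    hits = qdrant.search(
        collection_name="documents",                 # assumed collection name
        query_vector=embed([query])[0].tolist(),     # embed() sketched earlier
        limit=top_k,
    )
    context = "\n\n".join(h.payload["content"] for h in hits)  # assumed payload key
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:1b", "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]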

4. Code Quality

  • Type Safety: Full type hints throughout
  • Modular Design: Clean separation of concerns
  • Documentation: Comprehensive README and API docs
  • Best Practices: Follows industry standards

📖 Documentation Structure

.
├── README.md                    # Main documentation
├── SUBMISSION.md                # This file
├── docs/
│   ├── TESTING.md               # Detailed test results
│   └── ONNX_OPTIMIZATION.md     # Performance optimization guide
├── app/                         # Application source code
│   ├── api/                     # API endpoints
│   ├── services/                # Business logic
│   ├── database/                # Database models
│   └── config.py                # Configuration
├── scripts/                     # Utility scripts
│   ├── convert_to_onnx.py       # Model conversion
│   └── migrate_db.py            # Database migration
├── docker-compose.yml           # Service orchestration
├── Dockerfile                   # API container definition
└── deploy.sh                    # Deployment helper script

πŸ” Technical Decisions

Why Qdrant?

  • Lightweight and easy to run locally during development
  • Excellent documentation and Python client
  • Production-ready with horizontal scaling support
  • Native Docker support

Why Ollama?

  • Fully local LLM (no API keys required)
  • Fast inference with quantized models
  • Easy Docker deployment
  • Cost-effective for demos and development

Why ONNX?

  • 78% smaller deployments
  • 67% less memory usage
  • Same embedding quality
  • Universal runtime (portable)

Why Redis + TOON?

  • Fast in-memory chat history
  • TOON format: 40% token savings
  • Simple session management (sketched below)
  • Production-ready caching
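
A minimal sketch of the Redis session-memory pattern; the key scheme is an assumption, serialization is shown as plain JSON rather than TOON, and the 24-hour TTL anticipates the cleanup listed under Future Enhancements:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def append_turn(session_id: str, role: str, content: str) -> None:
    key = f"chat:{session_id}"                # assumed key naming scheme
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, 24 * 3600)                  # 24h expiry (see Future Enhancements)

def load_history(session_id: str) -> list[dict]:
    return [json.loads(m) for m in r.lrange(f"chat:{session_id}", 0, -1)]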

🚀 Deployment Options

1. Local Docker (Recommended for Demo)

./deploy.sh up
# or
docker compose up -d

2. Production Deployment

# Set environment variables
export DATABASE_URL=postgresql://user:pass@prod-db:5432/ragdb
export QDRANT_URL=http://qdrant-cluster:6333
export REDIS_URL=redis://redis-cluster:6379

# Build and push
docker build -t rag-api:prod .
docker push rag-api:prod

# Deploy to orchestration platform
kubectl apply -f k8s/

3. Local Development

# Install dependencies
pip install -e .
python -m spacy download en_core_web_sm

# Start services (PostgreSQL, Redis, Qdrant, Ollama)
docker compose up -d postgres redis qdrant ollama

# Run API
uvicorn app.app:app --reload --port 8000

📊 Performance Metrics

Latency

| Operation | Average | Target | Status |
| --- | --- | --- | --- |
| Document Upload | 1.2s | <5s | ✅ |
| Embedding Generation | 100ms | <500ms | ✅ |
| Vector Search | 35ms | <100ms | ✅ |
| LLM Response | 7s | <15s | ✅ |
| Booking Extraction | 6s | <15s | ✅ |

Resource Usage

| Service | Memory | CPU | Disk |
| --- | --- | --- | --- |
| API | ~400MB | 5-15% | 800MB |
| PostgreSQL | ~150MB | 2-5% | 200MB |
| Qdrant | ~180MB | 3-8% | 150MB |
| Redis | ~20MB | 1-2% | 50MB |
| Ollama | ~1.5GB | 20-60% | 1.3GB |
| Total | ~2.25GB | 31-90% | 2.5GB |

🔧 Future Enhancements

Planned Improvements

  1. WebSocket Support: Real-time chat streaming
  2. OCR Integration: Scanned PDF support with pytesseract
  3. Session TTL: Automatic Redis cleanup (24-hour expiry)
  4. Rate Limiting: API request throttling
  5. Authentication: JWT token-based auth
  6. Monitoring: Prometheus + Grafana dashboards

Scalability Considerations

  • Horizontal scaling with load balancers
  • Qdrant cluster mode for distributed vectors
  • Redis Sentinel for high availability
  • Async processing with Celery for large uploads

📞 Contact

Krishom Basukala
Email: krishombasukala@gmail.com


πŸ“ Notes

Commitment

  • ✅ Available for 1+ year commitment
  • ✅ Comfortable with 2-month notice period if selected

Submission Timeline

  • Task Started: February 4, 2026
  • Task Completed: February 8, 2026
  • Total Time: 4 days (including testing and documentation)

Repository

github.com/Krish-Om/rag-api
Thank you for the opportunity to showcase my skills!

This submission demonstrates production-ready code, comprehensive testing, and professional documentation practices suitable for enterprise deployment.