Chat with Your Docs: End-to-End RAG Pipeline

A modular Tool-Assisted Retrieval-Augmented Generation (RAG) framework for building AI applications that generate grounded answers with citations from structured and unstructured knowledge sources.

Unlike simple vector-search demos, this project provides a multi-stage RAG pipeline with configurable retrieval, prompts, models, tools, and observability.

🚀 Get Started: See the Getting Started section to run the system locally.

🆕 What's New in v2.0

  • Multi‑LLM Pipeline Orchestration
    Use different LLM providers and models across pipeline stages through the vrraj‑llm‑adapter, based on cost, capabilities, and task suitability.

  • Stage-Specific Model Selection
    Runtime model selection per pipeline stage (rewrite, rerank, summarization, inference) via UI or API.

  • Registry-Driven LLM Integration
    Model capabilities, pricing, and parameter policies are read from the vrraj-llm-adapter registry, which can be extended or overridden with a custom registry.

  • Domain-Aware Prompt Registry
    Centralized prompt control layer that decouples prompts from application code. Prompts are grouped by domain/pipeline stage in a YAML-driven registry enabling rapid prompt experimentation and domain‑specific pipeline behavior.

  • Advanced Context Window Management
    Hybrid strategy combining summarized conversation history with recent verbatim turns to maintain context while controlling token usage. See Technical Overview for implementation details.

  • Cost Tracking and Observability
    Real-time stage visibility, token accounting, and per-stage cost tracking across the pipeline.

  • Response Post-processing
    Configurable output transformation layer, currently supporting Markdown → scoped HTML conversion and extensible post-processing workflows.

  • Embeddable Chat Experience
    Drop-in widget with comprehensive configuration via API params.

  • Domain-Based Access Controls
    Isolation and authorization enforced consistently across APIs and embedded clients.

  • Dual Chat Modes: Stateful and Stateless
    Support for both stateless (/chat) and stateful (/chat/{session_id}) chat patterns.

For additional details, see the Release Notes 2.0.

Chat with RAG v2 architecture diagram showing multi‑LLM orchestration, prompt registry, observability, embeddable chat, and domain controls

Auth & Security Note
This app enforces domain-based access controls across APIs and embedded widgets (domain isolation, collection separation, widget lockdown). See Security & Deployment for more details.


High-Level RAG Pipeline Overview

The system runs through two parallel workflows: an Ingestion Pipeline (build the knowledge base) and a Chat Orchestration Pipeline (retrieve + answer).

| Pipeline | Flow |
|---|---|
| Ingestion | Documents / URLs → Load Sources → Extract & Parse → Chunk & Normalize → Metadata Augmentation → Embeddings → Vector Storage |
| Chat | User Prompt → Query Rewrite → Retrieval → Rerank → Context Assembly → LLM Inference → Tool Execution → Response Synthesis → Post-Processing → Final Response |

%%{init: {'themeVariables': { 'fontSize': '16px', 'subgraphFontSize': '20px', 'subgraphTitleColor': '#1e6bb8'}}}%%
graph LR
    %% Theme Styling from your finalized Hub
    classDef core fill:#dcebe8,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8,font-weight:bold;
    classDef feat fill:#ffffff,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8;
    classDef highlight fill:#159957,stroke:#159957,stroke-width:2px,color:#fff,font-weight:bold;

    subgraph "CHAT ORCHESTRATION"
        direction LR
        U[User Prompt] --> QR[Query Rewrite]
        QR --> Search[Retrieval]

        %% Return path from Ingestion back to Chat
        R[Rerank] --> Ctx[Context Assembly] --> Inf[LLM Inference]

        Inf --> Tools{Tool Execution?}
        Tools -- "Yes" --> API[Tool Calls] --> Synth[Response Synthesis]
        Tools -- "No" --> Synth
        Synth --> Post
        Post --> Out[Final Response]
    end

    subgraph "INGESTION PIPELINE"
        direction LR
        S[Sources] --> P[Parse] --> C[Chunk] --> D[Add Metadata] --> E[Embed] --> DB[(Vector DB)]
    end

    %% PHYSICAL CONNECTIONS
    Search -- "Query" --> DB
    DB -- "Results" --> R

    %% Apply Themes
    %% Using 'core' style for the main entry/exit and database
    class U,Out,DB core;
    %% Using 'feat' style for standard logic steps
    class QR,Search,R,Ctx,API,Post,S,P,C,D,E feat;
    %% Using 'highlight' (Cayman Green) for the critical LLM stages
    class Inf,Synth highlight;

Example Use Cases

This project serves as a reference architecture for production-style Tool-Assisted RAG systems. Typical use cases include:

  • Document-grounded assistants for PDFs, HTML pages, internal documentation, and MediaWiki-based knowledge sources
  • Multi-LLM experimentation across pipeline stages for retrieval, reranking, summarization, and inference
  • Prompt and retrieval tuning using query rewrite, reranking, and domain-specific prompt configurations
  • Embeddable knowledge assistants for websites and internal portals
  • API-driven Tool-Assisted RAG workflows for chat, ingestion, embedded experiences, and runtime pipeline control
  • Domain-specific knowledge bases with separate collections, embeddings, and prompt domains
  • Observable Tool-Assisted RAG systems where each pipeline stage can be inspected, tuned, and cost-tracked

📸 Inference Pipeline in Action

The screenshot below shows a live chat run through the inference pipeline, highlighting key system capabilities:

  • Query rewrite for improved retrieval
  • Multi‑turn context preservation
  • Retrieval + inference working together
  • Optional tool calls
  • Multi‑model execution (OpenAI and Gemini)
  • HTML‑formatted responses with citations


Chat pipeline UI showing query rewriting, multi-turn context handling, explicit pipeline stages, tool invocation, and cited responses.

📦 Embedded RAG Chat Experience

The screenshot below illustrates the two configuration options for the chat widget:

  • Direct iframe embedding (simplest)
  • Embed loader script using HTML data-* attributes (advanced configuration)

Chat embedding options iframe and inline page

See the Embeddable Widget Configuration section below for implementation examples.

🖥️ Application Workspace

The workspace provides a central entry point to the main parts of the application. From here you can open the chat interface, manage documents, inspect the vector store, run batch ingestion, and configure embeddable chat experiences.


Content ingestion UI showing primary actions and indexing tools (batch upload, PDF/HTML/MediaWiki), estimation mode, and metadata controls.


Features

1. Dual Chat Modes

  • Stateless (/chat): Client-managed history for web frontends
  • Stateful (/chat/{session_id}): Server-managed sessions for backend and mobile integrations
  • Shared orchestration: Both modes use the same RAG pipeline

See details: Session-Based Chat API

2. High-Fidelity Ingestion

  • Multi-source extraction: Parse PDFs, MediaWiki, and HTML
  • Structured processing: Smart chunking, structure preservation, and configurable noise filtering
  • Batch ingestion: Process local directories (file://) or remote URLs with optional token and cost estimation

3. Advanced Orchestration

Coordinates retrieval, context management, prompt selection, model execution, tool use, observability, and output rendering in a deterministic multi-stage pipeline.

3.1 Pipeline Orchestration

  • Stage-aware execution: Query Rewrite → Retrieval → Rerank → Summarization → Inference → Tools → Post-processing
  • Stage-specific models: Configure models per stage based on cost, capabilities, and task suitability
  • API-driven control: Runtime pipeline configuration via FastAPI endpoints
  • Provider abstraction: Uses vrraj-llm-adapter and the YAML prompt registry to decouple models and prompts from application code
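
As a rough illustration of stage-specific model selection, the sketch below merges per-request overrides over a set of defaults. The stage names and the two model keys come from this README; the default mapping and the `resolve_model_keys` helper are hypothetical and are not the project's actual implementation.

```python
# Stage names follow the pipeline stages listed in this README.
STAGES = ("rewrite", "rerank", "summarization", "inference")

# Hypothetical defaults, chosen only for illustration.
DEFAULT_MODEL_KEYS = {stage: "openai:gpt-4o-mini" for stage in STAGES}


def resolve_model_keys(overrides=None):
    """Merge per-request overrides over the defaults, rejecting unknown stages."""
    overrides = overrides or {}
    unknown = set(overrides) - set(STAGES)
    if unknown:
        raise ValueError(f"unknown pipeline stage(s): {sorted(unknown)}")
    # Later keys win, so request-level choices replace the defaults.
    return {**DEFAULT_MODEL_KEYS, **overrides}


# Example: run inference on Gemini while the other stages keep their defaults.
model_keys = resolve_model_keys({"inference": "gemini:gemini-2.5-flash"})
```

Rejecting unknown stage names early is a design choice of this sketch; it surfaces typos in a request instead of silently ignoring them.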

3.2 Context Management

Long-running conversations remain coherent through a hybrid strategy combining a persistent summary with a bounded recent-history window, keeping context stable and cache-efficient.
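
A minimal sketch of the hybrid idea, assuming the history is a list of role/content dictionaries and that the summary is injected as a single system message. The function name, message shape, and default window size are assumptions for illustration; the real strategy is described in the Technical Overview.

```python
def build_context(summary, turns, max_recent_turns=4):
    """Return one summary message followed by the last N verbatim turns."""
    # Guard against turns[-0:], which would return the whole list.
    recent = turns[-max_recent_turns:] if max_recent_turns > 0 else []
    messages = []
    if summary:
        messages.append(
            {"role": "system", "content": f"Conversation summary: {summary}"}
        )
    messages.extend(recent)
    return messages


history = [{"role": "user", "content": f"turn {i}"} for i in range(6)]
context = build_context("User is comparing mountains.", history, max_recent_turns=2)
```

Because the summary stays at the front and only the tail of the history changes, the assembled prefix remains comparatively stable from turn to turn.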

3.3 Query Rewrite

Improves retrieval accuracy by selectively refining user intent before search. Rewrites are confidence-gated, context-aware, and configurable per request.
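
Confidence gating can be pictured as follows. This is a sketch under stated assumptions: the rewrite stage is assumed to return a rewritten query together with a confidence score between 0 and 1, and the `select_query` helper, its threshold default, and its `enabled` flag are hypothetical names rather than the project's API.

```python
def select_query(original, rewritten, confidence, threshold=0.7, enabled=True):
    """Use the rewritten query only when rewriting is enabled and confident."""
    if not enabled or not rewritten or confidence < threshold:
        return original
    return rewritten


# A low-confidence rewrite is discarded and the user's wording is searched as-is.
query = select_query("and for Mount Whitney", "closest airports to Mount Whitney", 0.9)
```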

3.4 Retrieval, Inference & Tools

  • Vector retrieval: Configurable Qdrant search with top-k and score thresholds
  • Context assembly: Builds prompts from instructions, retrieved context, documents, and tool results
  • Tool execution: Native function calling (web search, weather, airports)
  • Verified citations: Final responses include citations to source URLs and document sections where available
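
To make the top-k and score-threshold semantics concrete, here is a plain-Python sketch. In practice the vector database applies these limits during search; the `filter_hits` helper and the hit dictionaries below are hypothetical and only illustrate the selection rule.

```python
def filter_hits(hits, top_k=5, score_threshold=0.0):
    """Keep hits at or above the threshold, best first, capped at top_k."""
    kept = [hit for hit in hits if hit["score"] >= score_threshold]
    # sorted() returns a new list, so the caller's list is not mutated.
    return sorted(kept, key=lambda hit: hit["score"], reverse=True)[:top_k]


hits = [
    {"id": 1, "score": 0.90},
    {"id": 2, "score": 0.40},
    {"id": 3, "score": 0.75},
]
selected = filter_hits(hits, top_k=2, score_threshold=0.5)
```

A higher threshold trades recall for precision, while top-k bounds how much context reaches the reranker and the prompt.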

3.5 Observability & Cost Tracking

  • Real-time observability: SSE stream exposing pipeline stage execution and intermediate events
  • Per-stage accounting: Token usage and cost metrics for every turn
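
Per-stage accounting reduces to simple arithmetic once token counts and prices are known. The sketch below assumes prices are quoted per one million tokens; the `StageUsage` class, its field names, and the prices in the example are made up for illustration and do not reflect the adapter registry's real values.

```python
from dataclasses import dataclass


@dataclass
class StageUsage:
    stage: str
    input_tokens: int
    output_tokens: int
    input_price_per_1m: float   # hypothetical price per 1M input tokens
    output_price_per_1m: float  # hypothetical price per 1M output tokens

    @property
    def cost(self):
        total = (
            self.input_tokens * self.input_price_per_1m
            + self.output_tokens * self.output_price_per_1m
        )
        return total / 1_000_000


def turn_cost(usages):
    """Return (cost per stage, total cost) for one chat turn."""
    per_stage = {usage.stage: usage.cost for usage in usages}
    return per_stage, sum(per_stage.values())


per_stage, total = turn_cost([
    StageUsage("rewrite", 1000, 200, 1.0, 2.0),
    StageUsage("inference", 3000, 500, 1.0, 2.0),
])
```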

3.6 Post-Processing

  • HTML rendering: Markdown → scoped HTML conversion for rich display
  • Isolated stage: Output formatting evolves independently from core inference
  • Extensible pipeline: Supports custom post-processing workflows

4. Embeddable Chat Experience

  • Website-ready widget: Embed the full RAG experience into external sites
  • Configurable behavior: Control pipeline parameters through runtime configuration
  • Domain-aware isolation: Support separate knowledge bases and prompt domains per deployment

🚀 Getting Started

Get the system running in minutes using the provided Makefile. This setup uses Docker for the core infrastructure while maintaining a developer-friendly local environment through volume mounting.

Provider Note: The system supports both OpenAI and Gemini, and providers can be switched or mixed per pipeline stage after setup.

📋 1. Prerequisites

Ensure your environment meets these requirements before proceeding:

  • OS: macOS or Linux (Windows supported via Docker).
  • Git – required to clone the repository. Install: https://git-scm.com/downloads
  • Docker & Docker Compose: Required for the Qdrant v1.14.1 database and the web app container. Get Docker here
  • Python 3.10+: Required for local development, IDE support, and ingestion scripts.
  • LLM Provider API Key(s): Supports OpenAI and Gemini. For the model configuration details, see the Model Registry documentation.

Tip

The system requires an OpenAI or Gemini API key for LLM inference.
Add your key(s) to the .env file during setup.
See Configure API Keys and Budget Controls for guidance.

⚡ 2. Automated Setup (Preferred)

To bootstrap the environment quickly, run the setup script below.

Step 1: Run the bootstrap script

git clone https://github.com/vrraj/chat-with-rag.git
cd chat-with-rag
bash scripts/rag_setup.sh

The script will automatically:

  • create .env from .env.example (if it does not already exist)
  • start Docker services (make start)
  • create a Python virtual environment and install dependencies
  • seed the sample data (make seed)

Step 2: Add your API keys and restart the application

Open .env and add one or both keys:

OPENAI_API_KEY=your_openai_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here

Then restart the application:

make stop
make start

Launch Application: 👉 http://localhost:8000

Note: API keys are loaded when the application starts. If you add or change keys later, restart the application for the changes to take effect.

🛠️ 2.1 Manual Setup

Use this path if you want full control over each setup step instead of the bootstrap script.

Step 1: Clone the repository

git clone https://github.com/vrraj/chat-with-rag.git
cd chat-with-rag

Step 2: Create .env and add your API keys

cp .env.example .env

Then edit .env and add one or both keys:

OPENAI_API_KEY=your_openai_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here

Step 3: Start the application stack

make start

Note for macOS users: make start will attempt to launch Docker Desktop if it is not already running.

Step 4: Seed sample data (requires a local Python environment)

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
make seed

Optional: Deactivate the virtual environment after seeding

deactivate

Step 5: Open the application

Visit: 👉 http://localhost:8000

▶️ 2.2 Running & Managing the Application

Use the following Make targets to manage the application lifecycle:

  • Start the application stack (Qdrant + web app):

    make start
    
  • Stop the application stack:

    make stop
    

🔄 Updating the Docker Images

If you pull new changes, rebuild the application image so your local stack reflects the latest code and dependency updates. You do not need to stop the containers first; Docker Compose will rebuild the image and recreate the service automatically.

git pull
docker compose up --build -d

Because rebuilding can leave behind unused <none> images over time, it is a good idea to periodically prune dangling images:

docker image prune -f

Tip: Use docker compose up --build -d after pulling changes to Dockerfiles, Python dependencies, or other container-related files.

For additional Make targets (logs, reset, reseed, maintenance utilities), refer to:

For details on the stateless chat API (POST /chat) used by frontend/chat.html, including request/response shape and parameter options, see:

👉 docs/api-reference.md


🔄 Updating Your Local Copy

If you pull new changes from the repository, update your environment before restarting the application.

git pull

For most updates, restart the application stack:

make stop
make start

If changes affect container dependencies (for example Dockerfile or requirements.txt), rebuild the Docker image instead:

docker compose up --build -d
docker image prune -f

If Python dependencies for local scripts changed, refresh the virtual environment:

source venv/bin/activate
pip install -r requirements.txt

Configure API Keys and Budget Controls

Set up API keys and budget limits for your LLM provider (OpenAI, Gemini, or both).

The setup instructions in this section use OpenAI for reference. You may follow the same steps for Gemini.

Option A: OpenAI (Getting Started)

| Recommendation | Action | Rationale |
|---|---|---|
| Budget | Set a limit of $5–$10. | Establishes a safety ceiling for testing. |
| Dedicated Key | Name it chat-with-rag. | Isolates usage tracking for this specific project. |
| Alerts | Set a 50% notification. | Provides proactive cost control. |

Option B: Gemini

| Recommendation | Action | Rationale |
|---|---|---|
| Quota | Set a daily quota limit based on your budget. | Prevents unexpected cost overruns. |
| Dedicated Key | Name it chat-with-rag-gemini. | Isolates usage tracking for this project. |
| Monitoring | Enable usage alerts in Google Cloud Console. | Provides proactive cost visibility. |

Note: Gemini uses quota-based limits instead of hard dollar limits. Configure quotas in Google AI Studio or Google Cloud Console.


🧩 Prompt Registry (YAML)

This repo uses a YAML-based prompt registry to keep prompts centralized and avoid drift between code paths.

📁 Registry file

  • Path: prompts/prompt_registry.yaml
  • Role: Single source of truth for prompt behavior across the pipeline. It holds the stage prompt text and templates, including all default prompts and domain-specific overrides.
  • Current coverage: Inference and query rewrite are registry-driven; rerank and summarization use the registry for their fixed instructions/templates.

🎯 Prompt domains (params.prompt_domain)

You can select a prompt domain per request using params.prompt_domain.

  • If prompt_domain is empty or omitted, the system uses global_defaults.
  • If prompt_domain is set (example: mountains), the system applies domain-specific overrides (currently by appending additional domain system instructions).
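
The fallback-and-append behavior described above can be sketched as a small lookup. The registry layout used here (a `global_defaults` entry plus a `domains` mapping with a `system_append` field) and the prompt strings are assumptions for illustration; the actual keys are defined in prompts/prompt_registry.yaml. Falling back to the defaults for an unknown domain is also an assumption of this sketch.

```python
# Hypothetical in-memory stand-in for the YAML registry.
REGISTRY = {
    "global_defaults": {
        "system": "Answer using the provided context and cite sources.",
    },
    "domains": {
        "mountains": {"system_append": "Report elevations in metres."},
    },
}


def resolve_system_prompt(registry, prompt_domain=None):
    """Return the default system prompt, with domain instructions appended."""
    base = registry["global_defaults"]["system"]
    if not prompt_domain:
        return base  # empty or omitted domain: global defaults only
    domain = registry.get("domains", {}).get(prompt_domain)
    if domain is None:
        return base  # assumption: unknown domains fall back to the defaults
    extra = domain.get("system_append", "")
    return f"{base}\n\n{extra}" if extra else base


prompt = resolve_system_prompt(REGISTRY, "mountains")
```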

In the UI (frontend/chat.html), the Prompt Domain dropdown under Inference controls the value sent on every chat request.

Jinja2 Template System

The prompt registry uses Jinja2 templating to safely inject conversation history and RAG context into prompts:

  • Conversation Context: Summarized history + recent conversation turns
  • RAG Context: Retrieved documents + web search results
  • User Input: Current user question

This ensures safe separation of system instructions from dynamic data while maintaining consistent formatting across all pipeline stages.

For detailed configuration options, see the Configuration Reference.


📦 Batch Ingestion

Batch ingestion is the recommended way to build or refresh a knowledge base from multiple sources at once. It supports local documents, remote URLs, and mixed source sets, with optional estimation before indexing.

Note: Changing the embedding model requires re-embedding and rebuilding the vector index. See docs/technical-overview.md for the recommended re-ingestion workflow.

🎯 What It Does

Each source in the batch is processed through the same ingestion pipeline used elsewhere in the application:

load → extract → chunk → augment metadata → embed → store

This makes it the easiest way to populate or refresh Qdrant collections consistently at scale.

📁 How to Organize Documents

A practical pattern is to organize source files by topic or domain before ingestion.

data/
├── mountains/
│   ├── everest.pdf
│   ├── kilimanjaro.pdf
│   └── whitney.html
├── oceans/
│   ├── pacific.html
│   └── atlantic.html
└── travel/
    ├── italy-guide.pdf
    └── rome.html

This makes it easier to:

  • build domain-specific collections
  • keep metadata consistent
  • re-index a single topic area without rebuilding everything

💡 Typical Uses

  • ingest a folder of PDFs
  • index a curated list of webpages
  • process mixed source sets in a single batch
  • rebuild a collection after changing chunking or embedding settings

📄 Example Batch Configuration

{
  "items": [
    {
      "url": "file:///app/data/mountains/everest.pdf",
      "doc_type": "pdf",
      "skip_sections": ["References", "External links"]
    },
    {
      "url": "https://en.wikipedia.org/wiki/Mount_Whitney",
      "doc_type": "mediawiki"
    }
  ],
  "max_chunks": 100,
  "estimate": true,
  "force_delete": false
}

Start with "estimate": true to preview cost and processing behavior before committing a batch to storage.
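
Before submitting a batch, it can help to sanity-check the configuration locally. The sketch below is a hypothetical pre-flight check, not part of the project: the `validate_batch` helper and the rules it applies (allowed URL schemes, a boolean `estimate` flag) are assumptions inferred from the example above.

```python
from urllib.parse import urlparse

# Assumption: sources are local files (file://) or web URLs.
ALLOWED_SCHEMES = {"file", "http", "https"}


def validate_batch(config):
    """Return a list of problems found in a batch config; empty means OK."""
    problems = []
    items = config.get("items", [])
    if not items:
        problems.append("items must be a non-empty list")
    for index, item in enumerate(items):
        scheme = urlparse(item.get("url", "")).scheme
        if scheme not in ALLOWED_SCHEMES:
            problems.append(f"items[{index}]: unsupported url scheme {scheme!r}")
    if not isinstance(config.get("estimate", False), bool):
        problems.append("estimate must be true or false")
    return problems


config = {
    "items": [
        {"url": "file:///app/data/mountains/everest.pdf", "doc_type": "pdf"},
        {"url": "https://en.wikipedia.org/wiki/Mount_Whitney", "doc_type": "mediawiki"},
    ],
    "max_chunks": 100,
    "estimate": True,
    "force_delete": False,
}
problems = validate_batch(config)
```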

See Technical Documentation: Batch Ingestion for provider-specific limits, embedding batch sizing, and advanced ingestion workflows.


🪟 Embeddable Widget Configuration

A lightweight widget that embeds the full RAG pipeline into any website.

The widget exposes the same orchestration used by the main application (retrieval, reranking, context management, tool calling, and response post‑processing) while remaining easy to deploy and configure.

Supports domain isolation so different websites can use different knowledge bases and prompt domains.

For complete documentation on the embedded chat UI, see the Embedded Chat Guide.

⚙️ Configuration Options

The widget can be configured in two ways:

  • Direct iframe embedding (simplest)
  • Embed loader script using HTML data-* attributes (advanced configuration)

🖼️ Simple Example (iframe)

<iframe 
  src="https://your-server.com/chat-embed.html?top_k=5&show_citations=true&namespace=simple-chat"
  width="100%" 
  height="400px"
  style="border: 0; border-radius: 8px;"
  title="Embedded Chat">
</iframe>

🔧 Advanced Example (Embed Loader)

<!-- 1. Target container -->
<div id="chat-embed" 
     data-api-url="https://your-server.com"
     data-model_key="openai:gpt-4o-mini"
     data-temperature="0.7"
     data-top_k="10"
     data-show_processing_steps="true"
     data-show_citations="true"
     data-namespace="oceans">
</div>

<!-- 2. Embed loader script -->
<script src="https://your-server.com/static/embed-loader.js"
        data-target="#chat-embed"
        data-api-url="https://your-server.com"
        data-model_key="openai:gpt-4o-mini"
        data-temperature="0.7"
        data-top_k="10"
        data-show_processing_steps="true"
        data-show_citations="true"
        data-namespace="oceans">
</script>

The embed loader automatically initializes the widget and connects it to the configured backend API.
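
For the simpler iframe option, the query string can be generated rather than written by hand. This is a small sketch using only the standard library; the `build_embed_url` helper is hypothetical, and the parameter names are the ones shown in the iframe example (`top_k`, `show_citations`, `namespace`).

```python
from urllib.parse import urlencode


def build_embed_url(base_url, **params):
    """Build a chat-embed.html URL with widget options as query parameters."""
    # Booleans are lowercased so True becomes "true", as in the iframe example.
    normalized = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return f"{base_url.rstrip('/')}/chat-embed.html?{urlencode(normalized)}"


src = build_embed_url(
    "https://your-server.com",
    top_k=5,
    show_citations=True,
    namespace="simple-chat",
)
```

Generating the URL this way keeps values correctly escaped when a namespace or other option contains characters that are not URL-safe.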


🔄 Session-Based (Stateful) Chat API

The system supports both stateless and stateful chat architectures. The current web UI uses a stateless pattern with client-managed history, while the session-based API is better suited for backend, mobile, and multi-client integrations that benefit from server-managed conversation state.

🎯 Quick Overview

| Feature | Stateless (/chat) | Session-Based (/chat/{session_id}) |
|---|---|---|
| History Management | Client sends full history each request | Server maintains history automatically |
| Use Case | Web frontend, simple integrations | Mobile apps, backend systems, multi-device |
| Pipeline Quality | Identical RAG pipeline | Identical RAG pipeline |
| Setup | No setup required | Create session first |

%%{init: {'themeVariables': { 'nodePadding': '5', 'mainBkg': '#fff'}, 'flowchart': { 'curve': 'basis', 'rankSpacing': 30, 'nodeSpacing': 20}}}%%
graph TD
    %% Theme Styling - All borders unified to #1e6bb8
    classDef core fill:#e1f0f0,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8,font-weight:normal;
    classDef feat fill:#ffffff,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8;
    classDef logic fill:#f0fff4,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8,font-weight:normal;
    classDef stateful fill:#cae3e3,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8;
    classDef spacer opacity:0;

    %% Entry
    Start[User Message] --> Mode{Chat Mode}

    %% Stateless Path
    Mode -- "Stateless" --> SL[Stateless Flow]
    SL --> Hist[Send Full History + user_id]

    %% Stateful Path
    Mode -- "Stateful" --> SF[Stateful Flow]
    SF --> Sess[Create/Get Session]
    Sess --> Ctx[Get Session Context]

    %% Shared Orchestration
    Hist --> Pipe
    Ctx --> Pipe

   
    Pipe[🔄 Shared Orchestrator Pipeline] --> Steps
    Steps[Query Rewrite → Retrieval → Rerank → Inference → Tools]
 

    %% Exit Logic
    Steps --> Res[Response + Metrics]
    Res --> Out1[Return to Client]
    Res --> Out2[Update Session + Return]

    %% Applying Styles
    class Start core;
    class Mode,SL,Hist,Out1 feat;
    class SF,Sess,Ctx,Out2 stateful;
    class Pipe,Steps,Res logic;

🚀 Session-Based Chat API Examples

1. Create a Session

curl -X POST http://localhost:8000/chat/session
# Response: {"session_id": "12d8cd79-0ee8-4dcd-97a5-5983effcbccd"}

2. Send Messages (Context Preserved)

# First message
curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is Mount Everest?",
    "history": [],
    "params": {"top_k": 5, "temperature": 0.7, "max_output_tokens": 500}
  }'

# Follow-up (understands context from previous message)
curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "How tall is it?",
    "history": [],
    "params": {"top_k": 5, "temperature": 0.7, "max_output_tokens": 500}
  }'

3. Use Different Models

curl -X POST http://localhost:8000/chat/session-id \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain quantum computing",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500,
      "model_keys": {
        "inference": "gemini:gemini-2.5-flash"
      }
    }
  }'
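
The same session call can be issued from Python. The sketch below only builds the request with the standard library, mirroring the JSON body of the curl examples; the `build_chat_request` helper is hypothetical, and the response format is not assumed here.

```python
import json
from urllib.request import Request


def build_chat_request(base_url, session_id, message, params=None):
    """Build a POST request for /chat/{session_id} matching the curl examples."""
    body = {"message": message, "history": [], "params": params or {}}
    return Request(
        url=f"{base_url.rstrip('/')}/chat/{session_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "12d8cd79-0ee8-4dcd-97a5-5983effcbccd",
    "How tall is it?",
    {"top_k": 5, "model_keys": {"inference": "gemini:gemini-2.5-flash"}},
)
```

With the application running locally, the request can be sent using `urllib.request.urlopen(request)` and the body read from the returned response object.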

📚 When to Use Session-Based API

| Scenario | Recommended API | Reason |
|---|---|---|
| Web frontend | Stateless (/chat) | Simpler, client-managed state |
| Mobile apps | Session-based (/chat/{session_id}) | Server-side persistence |
| Backend integrations | Session-based | Automatic context management |
| Multi-device access | Session-based | Shared conversation state |
| Long-running conversations | Session-based | Automatic history management |

🔧 Key Benefits

  • Automatic context management - No need to send history in each request
  • Token-aware truncation - Prevents context overflow automatically
  • Multi-device support - Same session accessible from different clients
  • Identical pipeline quality - Same retrieval, rewrite, and inference as stateless
  • Model override support - Per-request model selection via model_keys
  • Session-based token accounting - Isolated cost tracking per session

📊 Token Accounting & Namespaces

Stateless vs Session-Based Token Tracking

| Approach | Namespace Pattern | Token Isolation | Use Case |
|---|---|---|---|
| Stateless | user_id:conversation_id | Per conversation | Web frontend, client-managed |
| Session-Based | session:{session_id} | Per session | Mobile apps, backend systems |

How Token Accounting Works

Stateless (/chat):

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is RAG?",
    "history": [],
    "params": {
      "user_id": "user123",
      "conversation_id": "conv456",
      "top_k": 5
    }
  }'
  • Namespace: user123:conv456
  • Token tracking: Isolated per conversation

Session-Based (/chat/{session_id}), where user_id is optional:

curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is RAG?",
    "history": [],
    "params": {
      "user_id": "user123",
      "top_k": 5
    }
  }'
  • Namespace: session:12d8cd79-0ee8-4dcd-97a5-5983effcbccd
  • Token tracking: Isolated per session
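
The two namespace patterns can be expressed as one small function. This is a sketch of the naming rule only; the `token_namespace` helper and its error handling are hypothetical, while the two formats are the ones listed above.

```python
def token_namespace(session_id=None, user_id=None, conversation_id=None):
    """Return the accounting namespace for a request."""
    if session_id:
        # Session-based requests are keyed by the session alone.
        return f"session:{session_id}"
    if user_id and conversation_id:
        # Stateless requests are keyed per user and conversation.
        return f"{user_id}:{conversation_id}"
    raise ValueError("need a session_id, or both user_id and conversation_id")


stateless_ns = token_namespace(user_id="user123", conversation_id="conv456")
session_ns = token_namespace(session_id="12d8cd79-0ee8-4dcd-97a5-5983effcbccd")
```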

Benefits of Namespace Isolation

  • Cost tracking - Monitor tokens per user/conversation/session
  • Cache management - Separate caches for different contexts
  • Resource isolation - Prevent cross-contamination of data
  • Usage analytics - Track patterns per namespace

📖 Learn More


🛠️ Included Tools

The chat pipeline supports optional tool use during inference.

Current built-in tools include:

  • Web Search: Adds external web results to the inference context when enabled
  • Weather: Returns current weather and forecast data for requested locations
  • Airport Lookup: Returns nearby airport information for travel- and location-based queries

Tool usage can be enabled per request via the application configuration and is integrated into the final response synthesis stage.


📚 Knowledge Base and Sample Data

When you run make seed, the system populates Qdrant with a curated dataset derived from approximately 70 Wikipedia pages focused on mountains and related geography topics. The dataset is indexed into two collections, document_index (OpenAI embeddings) and document_index_gemini (Gemini embeddings), allowing you to test the same content across different embedding models.

Note: The active collection is selected through the active_domain setting in backend/core/config.py, which determines both the Qdrant collection and the embedding model used by the system.

📄 Data Attribution

This project includes a sample knowledge base derived from Wikipedia.

  • Source: ~70 curated Wikipedia articles processed via a custom high-fidelity MediaWiki extraction pipeline.
  • Integrity: Source URLs and author metadata are preserved within the vector payloads to enable verified citations.
  • License: Distributed under CC BY-SA 4.0.
  • Full Credits: Detailed source links and compliance information can be found in docs/attributions.md.

🔍 Explore the Data

You can verify the indexed documents through the web interface or the command line:

| Method | Action |
|---|---|
| Frontend UI | Navigate to the "View Documents" page to see titles, URLs, and metadata. |
| Terminal (CLI) | Run the commands below to list the first 100 document titles. |

source venv/bin/activate
python scripts/qdrant_scripts/qdrant_ops.py --list-titles --limit 100

🔄 Managing Your Collections

The system supports domain-based collection management where each domain is tightly coupled with its embedding model to prevent dimension drift and ensure consistency. Selecting a collection automatically sets the compatible embedding model and vector dimensions.

Domain-Based Configuration

A single active_domain setting configures both the collection name and embedding model. This helps prevent dimension drift and ensures consistency. The default is set to mountains which uses the openai:embed_small model. You may modify or add to the configuration in backend/core/config.py.

Only one domain can be active at a time, and that defines the Qdrant collection and embedding model. This approach allows you to maintain multiple "knowledge bases" on the same database. You can swap between domains at any time just by changing the active_domain variable.

# In backend/core/config.py
DOMAIN_EMBEDDING_CONFIG = {
    "default": {
        "collection_name": "document_index",
        "embedding_model_key": "openai:embed_small"
    },
    "mountains": {
        "collection_name": "document_index", 
        "embedding_model_key": "openai:embed_small"
    },
    "oceans": {
        "collection_name": "document_index_gemini",
        "embedding_model_key": "gemini:native-embed"
    }
}

# Active domain selection (single change point)
active_domain: str = "mountains"
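
A lookup over this mapping might look like the sketch below. The dictionary repeats the configuration shown above; the `resolve_domain` helper is hypothetical, and raising an error for an unknown domain (instead of silently falling back to "default") is a design choice of this sketch, not confirmed project behavior.

```python
DOMAIN_EMBEDDING_CONFIG = {
    "default": {
        "collection_name": "document_index",
        "embedding_model_key": "openai:embed_small",
    },
    "mountains": {
        "collection_name": "document_index",
        "embedding_model_key": "openai:embed_small",
    },
    "oceans": {
        "collection_name": "document_index_gemini",
        "embedding_model_key": "gemini:native-embed",
    },
}


def resolve_domain(active_domain, config=DOMAIN_EMBEDDING_CONFIG):
    """Return (collection_name, embedding_model_key) for the active domain."""
    try:
        entry = config[active_domain]
    except KeyError:
        known = ", ".join(sorted(config))
        raise ValueError(
            f"unknown domain {active_domain!r}; expected one of: {known}"
        ) from None
    return entry["collection_name"], entry["embedding_model_key"]


collection, embedding_key = resolve_domain("oceans")
```

Resolving both values from a single entry is what keeps a collection from being queried with vectors of the wrong dimension.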

💬 Example Queries

The following examples are based on the seed data.

  1. Retrieval-based

    • “Where is Mount Kilimanjaro located?”
    • “Compare Kilimanjaro with Mount Whitney.”
    • “Elevation of Mont Blanc in the Alps, the current weather, and the closest airport”
    • “Which mountain is taller: Everest or Whitney?”
    • “Where is Mount Whitney located? What is the closest airport to it?”
    • “Tell me about hiking in Patagonia. What is the weather forecast for the area?”

  2. Multi-Turn RAG + Tool Calling Demo (Conversation Chain)

The following is a real conversation demonstrating:

  • Multi-turn context retention
  • Retrieval + rewrite + rerank
  • Automatic tool invocation (get_nearby_airports)
  • Weather queries over prior tool results
  • Final synthesis combining retrieval + tool calls
  • Per-stage cost and token metrics

USER: where is mount kilimanjaro ?

ASSISTANT: Mount Kilimanjaro is located in Tanzania....

USER: compare kilimanjaro with Mount whitney

ASSISTANT: Mount Kilimanjaro and Mount Whitney are both prominent peaks, but differ in height, location....

USER: what are the closest airports to Mount Kilimanjaro

ASSISTANT: The nearest airport to the coordinates (-3.0674, 37.3556) is Amboseli Airport (ASV • HKAM), ~48 km away. Tools Used: get_nearby_airports

USER: and for Mount Whitney

ASSISTANT: Closest airports to Mount Whitney (36.5785, -118.2919).... Tools Used: get_nearby_airports

USER: what is the weather forecast for mount whitney

ASSISTANT: Mount Whitney, California - 28 Oct High: 7.2 °C (45 °F).... Tools Used: get_weather

Multi-turn conversation with tool calls and citations. Preserves context and history across turns.



LLM Integration

This system uses the Python package vrraj-llm-adapter to provide a unified interface across multiple LLM providers.

The adapter normalizes model configuration, requests, responses, tool calls, and usage metrics across providers while allowing different models to be used across pipeline stages.

🔑 Key Capabilities

  • Multi-Provider Support: Works with OpenAI and Gemini models
  • Registry-Driven Model Configuration: Model capabilities, pricing, and parameter policies are defined in a centralized model registry
  • Provider-Agnostic Calls: The same application code works across providers
  • Custom Model Registries: Users can extend or override models without changing application code

⚙️ Custom Registry Path

To load a user-defined custom model registry, set:

export CUSTOM_REGISTRY_PATH=/path/to/your/custom_registry.py

See the Model Registry documentation for the models supported, default model definitions, reasoning model configurations, and guidance on extending the adapter with custom models.


πŸ—οΈ Technical Overview

Technical details about the system architecture, pipelines, design decisions, and engineering approach are available here:

πŸ‘‰ docs/technical-overview.md

This overview covers module structure, extraction pipeline, embedding flow, Qdrant indexing, batch ingestion (local PDFs + URLs with optional cost estimation), chat orchestration, SSE streaming, and frontend–backend integration.


🗂️ Project Structure

chat-with-rag/
├── backend/      # API, chat orchestration, ingestion pipeline, vector DB integration, tools
├── frontend/     # Chat UI, embed pages, static assets
├── scripts/      # Batch ingestion and maintenance utilities
├── prompts/      # YAML prompt registry
├── docs/         # Technical architecture and API documentation
├── data/         # Seed/demo datasets
└── images/       # README and documentation images

See docs/technical-overview.md for a deeper architectural breakdown of the system modules and pipelines.


🔐 Security & Deployment

This application includes a domain-based access control framework for APIs and embedded widgets.

🛡️ Included Security Controls

  • Domain-Based API Access: Chat and embedding endpoints can enforce domain-level access rules
  • Embeddable Widget Restrictions: chat-embed.html can be restricted to authorized domains
  • Collection Isolation: Separate domains can be mapped to different knowledge bases and prompt configurations

These security controls help prevent unauthorized access and ensure that different domains or websites can only access their designated knowledge bases and configurations.


📡 API Usage

For complete API documentation including usage examples, request/response formats, and integration guides, see the API Reference.


🧰 Qdrant Operations

Manage your vector collections with the Qdrant Operations CLI. Essential for backup, export, and collection maintenance.

# Export collection to JSONL (for backup/seeding)
python scripts/qdrant_scripts/qdrant_ops.py export

# List document titles and inspect collection
python scripts/qdrant_scripts/qdrant_ops.py list-titles

# Target specific collection (e.g., Gemini embeddings)
python scripts/qdrant_scripts/qdrant_ops.py --collection document_index_gemini export -f docs-index-seed-gemini.jsonl

👉 Technical Overview: Qdrant Operations CLI


⚖️ License & Usage

This project is source-available for personal, educational, and evaluation purposes.
It is permitted to run, modify, and fork the code for non-commercial use.

Redistribution, sublicensing, or commercial use of this project or derivative works requires explicit written permission from the author.

© 2026 Rajkumar Velliavitil. All Rights Reserved.
