A modular Tool-Assisted Retrieval-Augmented Generation (RAG) framework for building AI applications that generate grounded answers with citations from structured and unstructured knowledge sources.
Unlike simple vector-search demos, this project provides a multi-stage RAG pipeline with configurable retrieval, prompts, models, tools, and observability.
Get Started: See the Getting Started section to run the system locally.
- Multi-LLM Pipeline Orchestration
  Use different LLM providers and models across pipeline stages through the vrraj-llm-adapter, based on cost, capabilities, and task suitability.
- Stage-Specific Model Selection
  Runtime model selection per pipeline stage (rewrite, rerank, summarization, inference) via UI or API.
- Registry-Driven LLM Integration
  Model capabilities, pricing, and parameter policies are referenced from the adapter's (vrraj-llm-adapter) registry. This can be extended/overridden with a custom registry.
- Domain-Aware Prompt Registry
  Centralized prompt control layer that decouples prompts from application code. Prompts are grouped by domain/pipeline stage in a YAML-driven registry, enabling rapid prompt experimentation and domain-specific pipeline behavior.
- Advanced Context Window Management
  Hybrid strategy combining summarized conversation history with recent verbatim turns to maintain context while controlling token usage. See Technical Overview for implementation details.
- Cost Tracking and Observability
  Real-time stage visibility, token accounting, and per-stage cost tracking across the pipeline.
- Response Post-processing
  Configurable output transformation layer, currently supporting Markdown → scoped HTML conversion and extensible post-processing workflows.
- Embeddable Chat Experience
  Drop-in widget with comprehensive configuration via API params.
- Domain-Based Access Controls
  Isolation and authorization enforced consistently across APIs and embedded clients.
- Dual Chat Modes: Stateful and Stateless
  Support for both stateless (/chat) and stateful (/chat/{session_id}) chat patterns.
For additional details, see the Release Notes 2.0.
Auth & Security Note
This app enforces domain-based access controls across APIs and embedded widgets (domain isolation, collection separation, widget lockdown). See Security & Deployment for more details.
The system runs through two parallel workflows: an Ingestion Pipeline (build the knowledge base) and a Chat Orchestration Pipeline (retrieve + answer).
| Pipeline | Flow |
|---|---|
| Ingestion | Documents / URLs → Load Sources → Extract & Parse → Chunk & Normalize → Metadata Augmentation → Embeddings → Vector Storage |
| Chat | User Prompt → Query Rewrite → Retrieval → Rerank → Context Assembly → LLM Inference → Tool Execution → Response Synthesis → Post-Processing → Final Response |
%%{init: {'themeVariables': { 'fontSize': '16px', 'subgraphFontSize': '20px', 'subgraphTitleColor': '#1e6bb8'}}}%%
graph LR
%% Theme Styling from your finalized Hub
classDef core fill:#dcebe8,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8,font-weight:bold;
classDef feat fill:#ffffff,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8;
classDef highlight fill:#159957,stroke:#159957,stroke-width:2px,color:#fff,font-weight:bold;
subgraph "CHAT ORCHESTRATION"
direction LR
U[User Prompt] --> QR[Query Rewrite]
QR --> Search[Retrieval]
%% Return path from Ingestion back to Chat
R[Rerank] --> Ctx[Context Assembly] --> Inf[LLM Inference]
Inf --> Tools{Tool Execution?}
Tools -- "Yes" --> API[Tool Calls] --> Synth[Response Synthesis]
Tools -- "No" --> Synth
Synth --> Post[Post-Processing]
Post --> Out[Final Response]
end
subgraph "INGESTION PIPELINE"
direction LR
S[Sources] --> P[Parse] --> C[Chunk] --> D[Add Metadata] --> E[Embed] --> DB[(Vector DB)]
end
%% PHYSICAL CONNECTIONS
Search -- "Query" --> DB
DB -- "Results" --> R
%% Apply Themes
%% Using 'core' style for the main entry/exit and database
class U,Out,DB core;
%% Using 'feat' style for standard logic steps
class QR,Search,R,Ctx,API,Post,S,P,C,D,E feat;
%% Using 'highlight' (Cayman Green) for the critical LLM stages
class Inf,Synth highlight;
This project serves as a reference architecture for production-style Tool-Assisted RAG systems. Typical use cases include:
- Document-grounded assistants for PDFs, HTML pages, internal documentation, and MediaWiki-based knowledge sources
- Multi-LLM experimentation across pipeline stages for retrieval, reranking, summarization, and inference
- Prompt and retrieval tuning using query rewrite, reranking, and domain-specific prompt configurations
- Embeddable knowledge assistants for websites and internal portals
- API-driven Tool-Assisted RAG workflows for chat, ingestion, embedded experiences, and runtime pipeline control
- Domain-specific knowledge bases with separate collections, embeddings, and prompt domains
- Observable Tool-Assisted RAG systems where each pipeline stage can be inspected, tuned, and cost-tracked
The screenshot below shows a live chat run through the inference pipeline, highlighting key system capabilities:
- Query rewrite for improved retrieval
- Multi-turn context preservation
- Retrieval + inference working together
- Optional tool calls
- Multi-model execution (OpenAI and Gemini)
- HTML-formatted responses with citations
Chat pipeline UI showing query rewriting, multi-turn context handling, explicit pipeline stages, tool invocation, and cited responses.
The screenshot below illustrates the two configuration options for the chat widget:
- Direct iframe embedding (simplest)
- Embed loader script using HTML data-* attributes (advanced configuration)
See the Embeddable Widget Configuration section below for implementation examples.
The workspace provides a central entry point to the main parts of the application. From here you can open the chat interface, manage documents, inspect the vector store, run batch ingestion, and configure embeddable chat experiences.
Content ingestion UI showing primary actions and indexing tools (batch upload, PDF/HTML/MediaWiki), estimation mode, and metadata controls.
- Stateless (/chat) - Client-managed history for web frontends
- Stateful (/chat/{session_id}) - Server-managed sessions for backend and mobile integrations
- Shared orchestration - Both modes use the same RAG pipeline
See details: Session-Based Chat API
- Multi-source extraction - Parse PDFs, MediaWiki, and HTML
- Structured processing - Smart chunking, structure preservation, and configurable noise filtering
- Batch ingestion - Process local directories (file://) or remote URLs with optional token and cost estimation
Coordinates retrieval, context management, prompt selection, model execution, tool use, observability, and output rendering in a deterministic multi-stage pipeline.
- Stage-aware execution - Query Rewrite → Retrieval → Rerank → Summarization → Inference → Tools → Post-processing
- Stage-specific models - Configure models per stage based on cost, capabilities, and task suitability
- API-driven control - Runtime pipeline configuration via FastAPI endpoints
- Provider abstraction - Uses vrraj-llm-adapter and the YAML prompt registry to decouple models and prompts from application code
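The stage-specific model selection described above can be pictured as a set of defaults with per-request overrides layered on top. The following is a minimal sketch, not the project's implementation: the stage names and the model keys come from this README, while `DEFAULT_MODEL_KEYS` and `resolve_model_keys` are hypothetical names, and the defaults shown are placeholders rather than the shipped configuration.

```python
from typing import Dict, Optional

# Stage names come from the pipeline description; the default model keys
# below are illustrative placeholders, not the project's shipped defaults.
DEFAULT_MODEL_KEYS: Dict[str, str] = {
    "rewrite": "openai:gpt-4o-mini",
    "rerank": "openai:gpt-4o-mini",
    "summarization": "openai:gpt-4o-mini",
    "inference": "openai:gpt-4o-mini",
}


def resolve_model_keys(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge per-request overrides onto the defaults, rejecting unknown stages."""
    resolved = dict(DEFAULT_MODEL_KEYS)
    for stage, model_key in (overrides or {}).items():
        if stage not in resolved:
            raise ValueError(f"unknown pipeline stage: {stage!r}")
        resolved[stage] = model_key
    return resolved


# Override only the inference stage, as a request's model_keys might do.
selected = resolve_model_keys({"inference": "gemini:gemini-2.5-flash"})
print(selected["inference"])  # gemini:gemini-2.5-flash
print(selected["rewrite"])    # openai:gpt-4o-mini
```

Rejecting unknown stage names early is a design choice of this sketch; it surfaces typos in a request instead of silently ignoring them.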
Long-running conversations remain coherent through a hybrid strategy combining a persistent summary with a bounded recent-history window, keeping context stable and cache-efficient.
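As a rough illustration of the hybrid strategy, the sketch below places a persistent summary ahead of a bounded window of recent messages. It assumes messages are simple role/content dictionaries; `build_conversation_context` and the `recent_turns` parameter are hypothetical names, and the actual implementation is described in the Technical Overview.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_conversation_context(
    summary: str, history: List[Message], recent_turns: int = 4
) -> List[Message]:
    """Return the persistent summary followed by the last `recent_turns` messages verbatim."""
    context: List[Message] = []
    if summary:
        context.append({"role": "system", "content": f"Conversation summary: {summary}"})
    # Guard against recent_turns == 0, because history[-0:] would return everything.
    recent = history[-recent_turns:] if recent_turns > 0 else []
    context.extend(recent)
    return context


history = [
    {"role": "user", "content": "where is mount kilimanjaro ?"},
    {"role": "assistant", "content": "Mount Kilimanjaro is located in Tanzania."},
    {"role": "user", "content": "compare kilimanjaro with Mount whitney"},
    {"role": "assistant", "content": "They differ in height and location."},
    {"role": "user", "content": "and the closest airports?"},
]
context = build_conversation_context("User is asking about mountains.", history, recent_turns=2)
print(len(context))  # 3
```

Older turns are represented only through the summary, so the prompt size stays bounded no matter how long the conversation runs.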
Improves retrieval accuracy by selectively refining user intent before search. Rewrites are confidence-gated, context-aware, and configurable per request.
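Confidence gating can be sketched as a small decision function: the rewritten query is used only when rewriting is enabled and the rewrite clears a threshold. This is an illustration under assumptions: `RewriteResult`, its `confidence` field, and the `min_confidence` default are hypothetical and do not describe the project's actual data model or thresholds.

```python
from dataclasses import dataclass


@dataclass
class RewriteResult:
    query: str         # rewritten query proposed by the rewrite stage
    confidence: float  # hypothetical score in [0, 1] reported with the rewrite


def choose_search_query(
    original: str,
    rewrite: RewriteResult,
    min_confidence: float = 0.7,
    enabled: bool = True,
) -> str:
    """Use the rewrite only when enabled, non-empty, and above the threshold."""
    if not enabled:
        return original
    if rewrite.query.strip() and rewrite.confidence >= min_confidence:
        return rewrite.query
    return original


follow_up = "and for Mount Whitney"
confident = RewriteResult("closest airports to Mount Whitney", 0.9)
uncertain = RewriteResult("Mount Whitney", 0.4)
print(choose_search_query(follow_up, confident))  # closest airports to Mount Whitney
print(choose_search_query(follow_up, uncertain))  # and for Mount Whitney
```

Falling back to the original text when confidence is low keeps a poor rewrite from degrading retrieval.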
- Vector retrieval - Configurable Qdrant search with top-k and score thresholds
- Context assembly - Builds prompts from instructions, retrieved context, documents, and tool results
- Tool execution - Native function calling (web search, weather, airports)
- Verified citations - Final responses include citations to source URLs and document sections where available
- Real-time observability - SSE stream exposing pipeline stage execution and intermediate events
- Per-stage accounting - Token usage and cost metrics for every turn
- HTML rendering - Markdown → scoped HTML conversion for rich display
- Isolated stage - Output formatting evolves independently from core inference
- Extensible pipeline - Supports custom post-processing workflows
- Website-ready widget - Embed the full RAG experience into external sites
- Configurable behavior - Control pipeline parameters through runtime configuration
- Domain-aware isolation - Support separate knowledge bases and prompt domains per deployment
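The per-stage accounting listed above boils down to multiplying token counts by per-model prices and grouping the result by stage. The sketch below shows that arithmetic only. The model key and prices are made-up placeholders (real pricing comes from the adapter's registry), and `StageUsage`, `stage_cost`, and `turn_report` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StageUsage:
    stage: str
    model_key: str
    input_tokens: int
    output_tokens: int


# Placeholder prices in USD per 1M tokens for a made-up model key.
PRICES: Dict[str, Dict[str, float]] = {
    "example:model-a": {"input": 1.00, "output": 2.00},
}


def stage_cost(usage: StageUsage) -> float:
    """Cost of one stage call: tokens times price per million tokens."""
    price = PRICES[usage.model_key]
    return (
        usage.input_tokens * price["input"] + usage.output_tokens * price["output"]
    ) / 1_000_000


def turn_report(usages: List[StageUsage]) -> Dict[str, float]:
    """Sum costs per stage and add a 'total' entry for the whole turn."""
    report: Dict[str, float] = {}
    for usage in usages:
        report[usage.stage] = report.get(usage.stage, 0.0) + stage_cost(usage)
    report["total"] = sum(report.values())
    return report


report = turn_report([
    StageUsage("rewrite", "example:model-a", 1_000, 100),
    StageUsage("inference", "example:model-a", 4_000, 500),
])
print(report)
```

With these placeholder prices the rewrite stage costs 0.0012 USD and inference 0.005 USD, for a turn total of 0.0062 USD.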
Get the system running in minutes using the provided Makefile. This setup uses Docker for the core infrastructure while maintaining a developer-friendly local environment through volume mounting.
Provider Note: The system supports both OpenAI and Gemini, and providers can be switched or mixed per pipeline stage after setup.
Ensure your environment meets these requirements before proceeding:
- OS: macOS or Linux (Windows supported via Docker).
- Git: Required to clone the repository. Install: https://git-scm.com/downloads
- Docker & Docker Compose: Required for the Qdrant v1.14.1 database and the web app container. Get Docker here
- Python 3.10+: Required for local development, IDE support, and ingestion scripts.
- LLM Provider API Key(s): Supports OpenAI and Gemini. For the model configuration details, see the Model Registry documentation.
Tip: The system requires an OpenAI or Gemini API key for LLM inference. Add your key(s) to the .env file during setup. See Configure API Keys and Budget Controls for guidance.
To bootstrap the environment quickly, run the setup script below.
Step 1: Run the bootstrap script
git clone https://github.com/vrraj/chat-with-rag.git
cd chat-with-rag
bash scripts/rag_setup.sh
The script will automatically:
- create .env from .env.example (if it does not already exist)
- start Docker services (make start)
- create a Python virtual environment and install dependencies
- seed the sample data (make seed)
Step 2: Add your API keys and restart the application
Open .env and add one or both keys:
OPENAI_API_KEY=your_openai_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
Then restart the application:
make stop
make start
Launch Application: http://localhost:8000
Note: API keys are loaded when the application starts. If you add or change keys later, restart the application for the changes to take effect.
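Before restarting, it can help to confirm that .env holds real keys rather than the template placeholders. The following is a small standalone sketch and not part of the repository: it assumes simple KEY=VALUE lines and treats any value starting with `your_` as an unfilled placeholder; `parse_env` and `configured_providers` are hypothetical helper names.

```python
from typing import Dict, List

PROVIDER_KEYS = {"openai": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_providers(env: Dict[str, str]) -> List[str]:
    """Return providers whose key is set and is not the template placeholder."""
    ready = []
    for provider, variable in PROVIDER_KEYS.items():
        value = env.get(variable, "")
        if value and not value.startswith("your_"):
            ready.append(provider)
    return ready


SAMPLE_ENV = """
# LLM provider keys
OPENAI_API_KEY=sk-example-not-a-real-key
GEMINI_API_KEY=your_gemini_api_key_here
"""
print(configured_providers(parse_env(SAMPLE_ENV)))  # ['openai']
```

To check a real file, pass its contents instead of `SAMPLE_ENV`, for example `parse_env(open(".env").read())`.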
Use this path if you want full control over each setup step instead of the bootstrap script.
Step 1: Clone the repository
git clone https://github.com/vrraj/chat-with-rag.git
cd chat-with-rag
Step 2: Create .env and add your API keys
cp .env.example .env
Then edit .env and add one or both keys:
OPENAI_API_KEY=your_openai_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
Step 3: Start the application stack
make start
Note for macOS users: make start will attempt to launch Docker Desktop if it is not already running.
Step 4: Seed sample data (requires a local Python environment)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
make seed
Optional: Deactivate the virtual environment after seeding
deactivate
Step 5: Open the application
Visit: http://localhost:8000
Use the following Make targets to manage the application lifecycle:
- Start the application stack (Qdrant + web app):
  make start
- Stop the application stack:
  make stop
If you pull new changes, rebuild the application image so your local stack reflects the latest code and dependency updates. You do not need to stop the containers first; Docker Compose will rebuild the image and recreate the service automatically.
git pull
docker compose up --build -d
Because rebuilding can leave behind unused <none> images over time, it is a good idea to periodically prune dangling images:
docker image prune -f
Tip: Use docker compose up --build -d after pulling changes to Dockerfiles, Python dependencies, or other container-related files.
For additional Make targets (logs, reset, reseed, maintenance utilities), refer to:
- the Makefile in the project root
- docs/technical-overview.md
For details on the stateless chat API (POST /chat) used by frontend/chat.html, including request/response shape and parameter options, see:
If you pull new changes from the repository, update your environment before restarting the application.
git pull
For most updates, restart the application stack:
make stop
make start
If changes affect container dependencies (for example Dockerfile or requirements.txt), rebuild the Docker image instead:
docker compose up --build -d
docker image prune -f
If Python dependencies for local scripts changed, refresh the virtual environment:
source venv/bin/activate
pip install -r requirements.txt
Set up your LLM provider, OpenAI and/or Gemini (API keys and budget limits).
The setup instructions in this section use OpenAI for reference. You may follow the same steps for Gemini.
| Recommendation | Action | Rationale |
|---|---|---|
| Budget | Set a limit of $5-$10. | Establishes a safety ceiling for testing. |
| Dedicated Key | Name it chat-with-rag. | Isolates usage tracking for this specific project. |
| Alerts | Set a 50% notification. | Provides proactive cost control. |
| Recommendation | Action | Rationale |
|---|---|---|
| Quota | Set a daily quota limit based on your budget. | Prevents unexpected cost overruns. |
| Dedicated Key | Name it chat-with-rag-gemini. | Isolates usage tracking for this project. |
| Monitoring | Enable usage alerts in Google Cloud Console. | Provides proactive cost visibility. |
Note: Gemini uses quota-based limits instead of hard dollar limits. Configure quotas in Google AI Studio or Google Cloud Console.
This repo uses a YAML-based prompt registry to keep prompts centralized and avoid drift between code paths.
- Path: prompts/prompt_registry.yaml
- Role: Source of truth for stage prompt text and templates.
- Implementation Detail: All default prompts and domain-specific overrides are defined in prompts/prompt_registry.yaml, which acts as the single source of truth for prompt behavior across the pipeline.
- Current coverage: Inference and query rewrite are registry-driven; rerank and summarization use the registry for their fixed instructions/templates.
You can select a prompt domain per request using params.prompt_domain.
- If prompt_domain is empty or omitted, the system uses global_defaults.
- If prompt_domain is set (example: mountains), the system applies domain-specific overrides (currently by appending additional domain system instructions).
In the UI (frontend/chat.html), the Prompt Domain dropdown under Inference controls the value sent on every chat request.
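The fallback and append behavior described above can be sketched with a plain dictionary standing in for the loaded YAML. The registry shape here is an assumption for illustration: only `global_defaults`, `prompt_domain`, and the `mountains` domain come from this README, while the `domains`, `system`, and `system_append` keys, the prompt text, and `resolve_system_prompt` are hypothetical. Treating an unknown domain like an omitted one is also an assumption of this sketch.

```python
from typing import Dict, Optional

# Hypothetical in-memory shape of a loaded prompt registry.
REGISTRY: Dict[str, Dict] = {
    "global_defaults": {
        "inference": {"system": "Answer using the retrieved context and cite sources."},
    },
    "domains": {
        "mountains": {
            "inference": {"system_append": "Report elevations in metres and feet."},
        },
    },
}


def resolve_system_prompt(stage: str, prompt_domain: Optional[str] = None) -> str:
    """Start from the global default and append domain instructions when present."""
    base = REGISTRY["global_defaults"][stage]["system"]
    if not prompt_domain:
        return base
    override = REGISTRY["domains"].get(prompt_domain, {}).get(stage, {})
    extra = override.get("system_append")
    return f"{base}\n\n{extra}" if extra else base


print(resolve_system_prompt("inference"))
print(resolve_system_prompt("inference", "mountains"))
```

Appending rather than replacing keeps the shared grounding and citation instructions in force for every domain.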
The prompt registry uses Jinja2 templating to safely inject conversation history and RAG context into prompts:
- Conversation Context: Summarized history + recent conversation turns
- RAG Context: Retrieved documents + web search results
- User Input: Current user question
This ensures safe separation of system instructions from dynamic data while maintaining consistent formatting across all pipeline stages.
For detailed configuration options, see the Configuration Reference.
Batch ingestion is the recommended way to build or refresh a knowledge base from multiple sources at once. It supports local documents, remote URLs, and mixed source sets, with optional estimation before indexing.
Note: Changing the embedding model requires re-embedding and rebuilding the vector index. See docs/technical-overview.md for the recommended re-ingestion workflow.
Each source in the batch is processed through the same ingestion pipeline used elsewhere in the application:
load → extract → chunk → augment metadata → embed → store
This makes it the easiest way to populate or refresh Qdrant collections consistently at scale.
A practical pattern is to organize source files by topic or domain before ingestion.
data/
├── mountains/
│   ├── everest.pdf
│   ├── kilimanjaro.pdf
│   └── whitney.html
├── oceans/
│   ├── pacific.html
│   └── atlantic.html
└── travel/
    ├── italy-guide.pdf
    └── rome.html
This makes it easier to:
- build domain-specific collections
- keep metadata consistent
- re-index a single topic area without rebuilding everything
- ingest a folder of PDFs
- index a curated list of webpages
- process mixed source sets in a single batch
- rebuild a collection after changing chunking or embedding settings
{
"items": [
{
"url": "file:///app/data/mountains/everest.pdf",
"doc_type": "pdf",
"skip_sections": ["References", "External links"]
},
{
"url": "https://en.wikipedia.org/wiki/Mount_Whitney",
"doc_type": "mediawiki"
}
],
"max_chunks": 100,
"estimate": true,
"force_delete": false
}
Start with "estimate": true to preview cost and processing behavior before committing a batch to storage.
See Technical Documentation: Batch Ingestion for provider-specific limits, embedding batch sizing, and advanced ingestion workflows.
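For a folder organized by topic, the batch request body can be generated rather than written by hand. The sketch below only builds the JSON payload shown earlier; it does not call any endpoint. The field names and the /app/data container path come from the example request, whereas the "html" doc_type value, the extension mapping, and `build_batch_request` are assumptions.

```python
import json
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# "pdf" appears in the request example; "html" is an assumed value.
DOC_TYPES = {".pdf": "pdf", ".html": "html"}


def build_batch_request(
    relative_paths: Iterable[str],
    container_root: str = "/app/data",
    estimate: bool = True,
    max_chunks: int = 100,
) -> Dict:
    """Turn paths relative to the data folder into a batch ingestion payload."""
    items: List[Dict[str, str]] = []
    for rel in sorted(relative_paths):
        path = PurePosixPath(rel)
        doc_type = DOC_TYPES.get(path.suffix.lower())
        if doc_type is None:
            continue  # skip file types this sketch does not know how to label
        items.append({"url": f"file://{container_root}/{path}", "doc_type": doc_type})
    return {
        "items": items,
        "max_chunks": max_chunks,
        "estimate": estimate,
        "force_delete": False,
    }


payload = build_batch_request(
    ["mountains/everest.pdf", "mountains/whitney.html", "mountains/notes.txt"]
)
print(json.dumps(payload, indent=2))
```

The payload defaults to `estimate=True`, matching the advice to preview cost before indexing.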
A lightweight widget that embeds the full RAG pipeline into any website.
The widget exposes the same orchestration used by the main application (retrieval, reranking, context management, tool calling, and response post-processing) while remaining easy to deploy and configure.
Supports domain isolation so different websites can use different knowledge bases and prompt domains.
For complete documentation on the embedded chat UI, see the Embedded Chat Guide.
The widget can be configured in two ways:
- Direct iframe embedding (simplest)
- Embed loader script using HTML data-* attributes (advanced configuration)
<iframe
src="https://your-server.com/chat-embed.html?top_k=5&show_citations=true&namespace=simple-chat"
width="100%"
height="400px"
style="border: 0; border-radius: 8px;"
title="Embedded Chat">
</iframe>
<!-- 1. Target container -->
<div id="chat-embed"
data-api-url="https://your-server.com"
data-model_key="openai:gpt-4o-mini"
data-temperature="0.7"
data-top_k="10"
data-show_processing_steps="true"
data-show_citations="true"
data-namespace="oceans">
</div>
<!-- 2. Embed loader script -->
<script src="https://your-server.com/static/embed-loader.js"
data-target="#chat-embed"
data-api-url="https://your-server.com"
data-model_key="openai:gpt-4o-mini"
data-temperature="0.7"
data-top_k="10"
data-show_processing_steps="true"
data-show_citations="true"
data-namespace="oceans">
</script>
The embed loader automatically initializes the widget and connects it to the configured backend API.
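When iframe snippets are generated server-side, the `src` URL with its query parameters can be assembled programmatically. This is a minimal sketch using only the Python standard library; the parameter names mirror the iframe example above, and `build_embed_url` is a hypothetical helper, not part of the project.

```python
from urllib.parse import urlencode


def build_embed_url(base_url: str, **params) -> str:
    """Return the chat-embed.html URL with widget options as query parameters."""
    query = {
        # Render booleans as lowercase true/false, as in the iframe example.
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return f"{base_url.rstrip('/')}/chat-embed.html?{urlencode(query)}"


url = build_embed_url(
    "https://your-server.com", top_k=5, show_citations=True, namespace="simple-chat"
)
print(url)
# https://your-server.com/chat-embed.html?top_k=5&show_citations=true&namespace=simple-chat
```

Using `urlencode` ensures that values containing spaces or reserved characters are escaped correctly.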
The system supports both stateless and stateful chat architectures. The current web UI uses a stateless pattern with client-managed history, while the session-based API is better suited for backend, mobile, and multi-client integrations that benefit from server-managed conversation state.
| Feature | Stateless (/chat) | Session-Based (/chat/{session_id}) |
|---|---|---|
| History Management | Client sends full history each request | Server maintains history automatically |
| Use Case | Web frontend, simple integrations | Mobile apps, backend systems, multi-device |
| Pipeline Quality | Identical RAG pipeline | Identical RAG pipeline |
| Setup | No setup required | Create session first |
%%{init: {'themeVariables': { 'nodePadding': '5', 'mainBkg': '#fff'}, 'flowchart': { 'curve': 'basis', 'rankSpacing': 30, 'nodeSpacing': 20}}}%%
graph TD
%% Theme Styling - All borders unified to #1e6bb8
classDef core fill:#e1f0f0,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8,font-weight:normal;
classDef feat fill:#ffffff,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8;
classDef logic fill:#f0fff4,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8,font-weight:normal;
classDef stateful fill:#cae3e3,stroke:#1e6bb8,stroke-width:1px,color:#1e6bb8;
classDef spacer opacity:0;
%% Entry
Start[User Message] --> Mode{Chat Mode}
%% Stateless Path
Mode -- "Stateless" --> SL[Stateless Flow]
SL --> Hist[Send Full History + user_id]
%% Stateful Path
Mode -- "Stateful" --> SF[Stateful Flow]
SF --> Sess[Create/Get Session]
Sess --> Ctx[Get Session Context]
%% Shared Orchestration
Hist --> Pipe
Ctx --> Pipe
Pipe[Shared Orchestrator Pipeline] --> Steps
Steps["Query Rewrite → Retrieval → Rerank → Inference → Tools"]
%% Exit Logic
Steps --> Res[Response + Metrics]
Res --> Out1[Return to Client]
Res --> Out2[Update Session + Return]
%% Applying Styles
class Start core;
class Mode,SL,Hist,Out1 feat;
class SF,Sess,Ctx,Out2 stateful;
class Pipe,Steps,Res logic;
curl -X POST http://localhost:8000/chat/session
# Response: {"session_id": "12d8cd79-0ee8-4dcd-97a5-5983effcbccd"}
# First message
curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
-H "Content-Type: application/json" \
-d '{
"message": "What is Mount Everest?",
"history": [],
"params": {"top_k": 5, "temperature": 0.7, "max_output_tokens": 500}
}'
# Follow-up (understands context from previous message)
curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
-H "Content-Type: application/json" \
-d '{
"message": "How tall is it?",
"history": [],
"params": {"top_k": 5, "temperature": 0.7, "max_output_tokens": 500}
}'
curl -X POST http://localhost:8000/chat/session-id \
-H "Content-Type: application/json" \
-d '{
"message": "Explain quantum computing",
"history": [],
"params": {
"top_k": 5,
"temperature": 0.7,
"max_output_tokens": 500,
"model_keys": {
"inference": "gemini:gemini-2.5-flash"
}
}
}'
| Scenario | Recommended API | Reason |
|---|---|---|
| Web frontend | Stateless (/chat) | Simpler, client-managed state |
| Mobile apps | Session-based (/chat/{session_id}) | Server-side persistence |
| Backend integrations | Session-based | Automatic context management |
| Multi-device access | Session-based | Shared conversation state |
| Long-running conversations | Session-based | Automatic history management |
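For backend integrations, the session-based curl calls shown earlier translate directly into Python. The sketch below builds the request with the standard library and leaves sending to the caller, because the response format is not described in this section. The URL pattern and body fields follow the curl examples; `build_session_message` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Optional
from urllib.request import Request


def build_session_message(
    base_url: str,
    session_id: str,
    message: str,
    params: Optional[Dict[str, Any]] = None,
) -> Request:
    """Build a POST /chat/{session_id} request; history stays empty because the server tracks it."""
    body = {"message": message, "history": [], "params": params or {}}
    return Request(
        url=f"{base_url.rstrip('/')}/chat/{session_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_session_message(
    "http://localhost:8000",
    "12d8cd79-0ee8-4dcd-97a5-5983effcbccd",
    "How tall is it?",
    {"top_k": 5, "model_keys": {"inference": "gemini:gemini-2.5-flash"}},
)
print(request.get_method(), request.full_url)
```

With the stack running, the request can be sent with `urllib.request.urlopen(request)`.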
- Automatic context management - No need to send history in each request
- Token-aware truncation - Prevents context overflow automatically
- Multi-device support - Same session accessible from different clients
- Identical pipeline quality - Same retrieval, rewrite, and inference as stateless
- Model override support - Per-request model selection via model_keys
- Session-based token accounting - Isolated cost tracking per session
| Approach | Namespace Pattern | Token Isolation | Use Case |
|---|---|---|---|
| Stateless | user_id:conversation_id | Per conversation | Web frontend, client-managed |
| Session-Based | session:{session_id} | Per session | Mobile apps, backend systems |
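The two namespace patterns in the table can be sketched as a key-derivation function plus a ledger that accumulates tokens per key. This illustrates the isolation idea only: the key formats come from the table, while `namespace_for` and `TokenLedger` are hypothetical names and the in-memory storage is an assumption.

```python
from collections import defaultdict
from typing import Dict, Optional


def namespace_for(
    session_id: Optional[str] = None,
    user_id: Optional[str] = None,
    conversation_id: Optional[str] = None,
) -> str:
    """Derive the accounting namespace for a session-based or stateless request."""
    if session_id:
        return f"session:{session_id}"
    if user_id and conversation_id:
        return f"{user_id}:{conversation_id}"
    raise ValueError("need session_id, or both user_id and conversation_id")


class TokenLedger:
    """Accumulates token counts separately for each namespace."""

    def __init__(self) -> None:
        self._totals: Dict[str, int] = defaultdict(int)

    def record(self, namespace: str, tokens: int) -> None:
        self._totals[namespace] += tokens

    def total(self, namespace: str) -> int:
        return self._totals.get(namespace, 0)


ledger = TokenLedger()
ledger.record(namespace_for(user_id="user123", conversation_id="conv456"), 820)
ledger.record(namespace_for(session_id="12d8cd79"), 1_150)
ledger.record(namespace_for(session_id="12d8cd79"), 300)
print(ledger.total("user123:conv456"))   # 820
print(ledger.total("session:12d8cd79"))  # 1450
```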
Stateless (/chat):
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{
"message": "What is RAG?",
"history": [],
"params": {
"user_id": "user123",
"conversation_id": "conv456",
"top_k": 5
}
}'
- Namespace: user123:conv456
- Token tracking: Isolated per conversation
Session-Based (/chat/{session_id}):
curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
-H "Content-Type: application/json" \
-d '{
"message": "What is RAG?",
"history": [],
"params": {
"user_id": "user123",
"top_k": 5
}
}'
The user_id parameter is optional for session-based requests.
- Namespace: session:12d8cd79-0ee8-4dcd-97a5-5983effcbccd
- Token tracking: Isolated per session
- Cost tracking - Monitor tokens per user/conversation/session
- Cache management - Separate caches for different contexts
- Resource isolation - Prevent cross-contamination of data
- Usage analytics - Track patterns per namespace
- Technical Overview - Detailed architecture and implementation
- API Reference - Complete API documentation and examples
The chat pipeline supports optional tool use during inference.
Current built-in tools include:
- Web Search - Adds external web results to the inference context when enabled
- Weather - Returns current weather and forecast data for requested locations
- Airport Lookup - Returns nearby airport information for travel- and location-based queries
Tool usage can be enabled per request via the application configuration and is integrated into the final response synthesis stage.
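Conceptually, tool execution maps a tool name and JSON arguments emitted by the model onto a local function. The sketch below shows that dispatch pattern with stub handlers. The tool names `get_weather` and `get_nearby_airports` appear in the example conversation in this README; the argument names, the return shapes, and `run_tool_call` are hypothetical, and the real tools call external services.

```python
import json
from typing import Any, Callable, Dict


def get_weather(location: str) -> Dict[str, Any]:
    # Stub: a real tool would query a weather service here.
    return {"location": location, "forecast": "unavailable in this sketch"}


def get_nearby_airports(latitude: float, longitude: float) -> Dict[str, Any]:
    # Stub: a real tool would look up airports near the coordinates.
    return {"latitude": latitude, "longitude": longitude, "airports": []}


TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_weather": get_weather,
    "get_nearby_airports": get_nearby_airports,
}


def run_tool_call(name: str, arguments_json: str) -> Dict[str, Any]:
    """Dispatch one tool call; unknown tools return an error payload instead of raising."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return handler(**json.loads(arguments_json))


result = run_tool_call("get_nearby_airports", '{"latitude": 36.5785, "longitude": -118.2919}')
print(result)
```

Returning an error payload for unknown tools lets the synthesis stage report the problem to the user instead of failing the whole turn.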
When you run make seed, the system populates Qdrant with a curated dataset derived from approximately 70 Wikipedia pages focused on mountains and related geography topics. The dataset is indexed into two collections, document_index (OpenAI embeddings) and document_index_gemini (Gemini embeddings), allowing you to test the same content across different embedding models.
Note: The active collection is selected through the active_domain setting in backend/core/config.py, which determines both the Qdrant collection and the embedding model used by the system.
This project includes a sample knowledge base derived from Wikipedia.
- Source: ~70 curated Wikipedia articles processed via a custom high-fidelity MediaWiki extraction pipeline.
- Integrity: Source URLs and author metadata are preserved within the vector payloads to enable verified citations.
- License: Distributed under CC BY-SA 4.0.
- Full Credits: Detailed source links and compliance information can be found in docs/attributions.md.
You can verify the indexed documents through the web interface or the command line:
| Method | Action |
|---|---|
| Frontend UI | Navigate to the "View Documents" page to see titles, URLs, and metadata. |
| Terminal (CLI) | Run the following to list the first 100 document titles: |
source venv/bin/activate
python scripts/qdrant_scripts/qdrant_ops.py --list-titles --limit 100
The system supports domain-based collection management where each domain is tightly coupled with its embedding model to prevent dimension drift and ensure consistency. Selecting a collection automatically sets the compatible embedding model and vector dimensions.
A single active_domain setting configures both the collection name and embedding model. This helps prevent dimension drift and ensures consistency.
The default is set to mountains which uses the openai:embed_small model. You may modify or add to the configuration in backend/core/config.py.
Only one domain can be active at a time, and that defines the Qdrant collection and embedding model. This approach allows you to maintain multiple "knowledge bases" on the same database. You can swap between domains at any time just by changing the active_domain variable.
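The coupling between domain, collection, and embedding model can be made explicit with a small consistency check. The sketch below copies the DOMAIN_EMBEDDING_CONFIG mapping from backend/core/config.py and adds two hypothetical helpers, `validate_domains` and `resolve_domain`, which are not part of the project. The check rejects any configuration that pairs one collection with two different embedding models, which is the situation that would mix vector dimensions.

```python
from typing import Dict, Tuple

# Copied from the DOMAIN_EMBEDDING_CONFIG mapping in backend/core/config.py.
DOMAIN_EMBEDDING_CONFIG: Dict[str, Dict[str, str]] = {
    "default": {"collection_name": "document_index", "embedding_model_key": "openai:embed_small"},
    "mountains": {"collection_name": "document_index", "embedding_model_key": "openai:embed_small"},
    "oceans": {"collection_name": "document_index_gemini", "embedding_model_key": "gemini:native-embed"},
}


def validate_domains(config: Dict[str, Dict[str, str]]) -> None:
    """Reject configs where one collection is paired with two embedding models."""
    model_by_collection: Dict[str, str] = {}
    for domain, entry in config.items():
        collection = entry["collection_name"]
        model = entry["embedding_model_key"]
        known = model_by_collection.setdefault(collection, model)
        if known != model:
            raise ValueError(
                f"{collection!r} is mapped to both {known!r} and {model!r} (domain {domain!r})"
            )


def resolve_domain(
    active_domain: str, config: Dict[str, Dict[str, str]] = DOMAIN_EMBEDDING_CONFIG
) -> Tuple[str, str]:
    """Return (collection_name, embedding_model_key) for the active domain."""
    if active_domain not in config:
        raise ValueError(f"unknown domain: {active_domain!r}")
    entry = config[active_domain]
    return entry["collection_name"], entry["embedding_model_key"]


validate_domains(DOMAIN_EMBEDDING_CONFIG)
print(resolve_domain("mountains"))  # ('document_index', 'openai:embed_small')
print(resolve_domain("oceans"))     # ('document_index_gemini', 'gemini:native-embed')
```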
# In backend/core/config.py
DOMAIN_EMBEDDING_CONFIG = {
"default": {
"collection_name": "document_index",
"embedding_model_key": "openai:embed_small"
},
"mountains": {
"collection_name": "document_index",
"embedding_model_key": "openai:embed_small"
},
"oceans": {
"collection_name": "document_index_gemini",
"embedding_model_key": "gemini:native-embed"
}
}
# Active domain selection (single change point)
active_domain: str = "mountains"
Tip: The following examples are based on the seed data.
- "Where is Mount Kilimanjaro located?"
- "Compare Kilimanjaro with Mount Whitney."
- "Elevation of Mont Blanc in the Alps, the current weather and the closest airport"
- "Which mountain is taller: Everest or Whitney?"
- "Where is Mount Whitney located? What is the closest airport to it?"
- "Tell me about hiking in Patagonia. What is the weather forecast for the area?"
The following is a real conversation demonstrating:
- Multi-turn context retention
- Retrieval + rewrite + rerank
- Automatic tool invocation (get_nearby_airports)
- Weather queries over prior tool results
- Final synthesis combining retrieval + tool calls
- Per-stage cost and token metrics
USER: where is mount kilimanjaro ?
ASSISTANT: Mount Kilimanjaro is located in Tanzania....
USER: compare kilimanjaro with Mount whitney
ASSISTANT: Mount Kilimanjaro and Mount Whitney are both prominent peaks, but differ in height, location....
USER: what are the closest airports to Mount Kilimanjaro
ASSISTANT: The nearest airport to the coordinates (-3.0674, 37.3556) is Amboseli Airport (ASV • HKAM), ~48 km away. Tools Used: get_nearby_airports
USER: and for Mount Whitney
ASSISTANT: Closest airports to Mount Whitney (36.5785, -118.2919).... Tools Used: get_nearby_airports
USER: what is the weather forecast for mount whitney
ASSISTANT: Mount Whitney, California - 28 Oct High: 7.2 °C (45 °F).... Tools Used: get_weather
Multi-turn conversation with tool calls and citations. Preserves context and history across turns.
This system uses the Python package vrraj-llm-adapter to provide a unified interface across multiple LLM providers.
The adapter normalizes model configuration, requests, responses, tool calls, and usage metrics across providers while allowing different models to be used across pipeline stages.
- Multi-Provider Support - Works with OpenAI and Gemini models
- Registry-Driven Model Configuration - Model capabilities, pricing, and parameter policies are defined in a centralized model registry
- Provider-Agnostic Calls - The same application code works across providers
- Custom Model Registries - Users can extend or override models without changing application code
To load a user-defined custom model registry, set:
export CUSTOM_REGISTRY_PATH=/path/to/your/custom_registry.py
See the Model Registry documentation for the models supported, default model definitions, reasoning model configurations, and guidance on extending the adapter with custom models.
Technical details about the system architecture, pipelines, design decisions, and engineering approach are available here:
docs/technical-overview.md
This overview covers module structure, extraction pipeline, embedding flow, Qdrant indexing, batch ingestion (local PDFs + URLs with optional cost estimation), chat orchestration, SSE streaming, and frontend-backend integration.
chat-with-rag/
├── backend/    # API, chat orchestration, ingestion pipeline, vector DB integration, tools
├── frontend/   # Chat UI, embed pages, static assets
├── scripts/    # Batch ingestion and maintenance utilities
├── prompts/    # YAML prompt registry
├── docs/       # Technical architecture and API documentation
├── data/       # Seed/demo datasets
└── images/     # README and documentation images
See docs/technical-overview.md for a deeper architectural breakdown of the system modules and pipelines.
This application includes a domain-based access control framework for APIs and embedded widgets.
- Domain-Based API Access - Chat and embedding endpoints can enforce domain-level access rules
- Embeddable Widget Restrictions - chat-embed.html can be restricted to authorized domains
- Collection Isolation - Separate domains can be mapped to different knowledge bases and prompt configurations
These security controls help prevent unauthorized access and ensure that different domains or websites can only access their designated knowledge bases and configurations.
For complete API documentation including usage examples, request/response formats, and integration guides, see the API Reference.
Manage your vector collections with the Qdrant Operations CLI. Essential for backup, export, and collection maintenance.
# Export collection to JSONL (for backup/seeding)
python scripts/qdrant_scripts/qdrant_ops.py export
# List document titles and inspect collection
python scripts/qdrant_scripts/qdrant_ops.py list-titles
# Target specific collection (e.g., Gemini embeddings)
python scripts/qdrant_scripts/qdrant_ops.py --collection document_index_gemini export -f docs-index-seed-gemini.jsonl
Technical Overview: Qdrant Operations CLI
This project is source-available for personal, educational, and evaluation purposes.
It is permitted to run, modify, and fork the code for non-commercial use.
Redistribution, sublicensing, or commercial use of this project or derivative works requires explicit written permission from the author.
© 2026 Rajkumar Velliavitil. All Rights Reserved

