Build a full-stack RAG (Retrieval-Augmented Generation) chatbot that supports both anonymous and authenticated users, with persistent chat history and production-ready AWS deployment.
- User Management: Anonymous users (first-time visitors) vs authenticated users (returning users)
- RAG Integration: Vector search + AI generation for contextual responses
- Chat Persistence: Store and retrieve conversation history
- Production Deployment: AWS-ready with scalability and security
- Modern Tech Stack: FastAPI + React + TypeScript
Server/
├── app/
│ ├── api/v1/endpoints/ # REST API routes
│ ├── auth/ # JWT authentication
│ ├── core/ # Configuration & database
│ ├── models/ # SQLAlchemy ORM models
│ ├── schemas/ # Pydantic data validation
│ └── services/ # Business logic layer
├── alembic/ # Database migrations
└── requirements.txt # Dependencies
- Alembic: Versions the database schema and applies migrations
- Service Layer: Business logic is isolated from API endpoints, which keeps route handlers thin and readable
- Middleware Pattern: Protects sensitive routes (such as chat endpoints) behind authentication
- Pydantic: Response and request validation to ensure data integrity
- ACID Compliance: Full transactional support
- JSON Support: Native JSONB for flexible metadata storage (chat sources, retrieved context)
- Relational Structure: Maps naturally onto users, chat sessions, and message history
- Managed Service: Automated backups, patching, and maintenance when run on Amazon RDS
- Monitoring: CloudWatch integration for performance metrics
- Cost Effective: On-demand pricing, with reserved instances available for steady production workloads
-- Users Table (Supports both anonymous and authenticated users)
CREATE TABLE users (
id SERIAL PRIMARY KEY,
identifier VARCHAR(255) UNIQUE NOT NULL, -- UUID for anonymous, email for authenticated
user_type VARCHAR(20) NOT NULL, -- 'anonymous' or 'authenticated'
email VARCHAR(255) UNIQUE, -- NULL for anonymous users
hashed_password VARCHAR(255), -- NULL for anonymous users
first_name VARCHAR(100),
last_name VARCHAR(100),
is_active BOOLEAN DEFAULT true,
is_verified BOOLEAN DEFAULT false,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
last_login TIMESTAMP
);
-- Chat Sessions
CREATE TABLE chat_sessions (
id SERIAL PRIMARY KEY,
session_id UUID UNIQUE NOT NULL, -- Session identifier
user_id INTEGER REFERENCES users(id),
title VARCHAR(255) NOT NULL,
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
last_activity TIMESTAMP DEFAULT NOW()
);
-- Chat Messages
CREATE TABLE chat_messages (
id SERIAL PRIMARY KEY,
session_id INTEGER REFERENCES chat_sessions(id),
content TEXT NOT NULL,
role VARCHAR(20) NOT NULL, -- 'user' or 'assistant'
retrieved_context JSONB, -- RAG retrieved documents
sources JSONB, -- Source metadata
tokens_used INTEGER DEFAULT 0,
processing_time INTEGER DEFAULT 0, -- Response time in milliseconds
created_at TIMESTAMP DEFAULT NOW()
);
- JSONB for Metadata: Flexible storage of RAG context and sources
- UUID Session IDs: Unique session identifiers for frontend
- Soft Deletes: is_active flags instead of hard deletes
- Audit Trail: created_at, updated_at, last_activity timestamps
- Foreign Key Constraints: Enforce referential integrity between users, sessions, and messages
- Alembic: Version-controlled database migrations
- Forward-Only: Production-safe migration approach
- Rollback Support: Development environment rollbacks
# Anonymous Users (Stateless)
- UUID-based identification
- JWT tokens with anonymous claims
- No database persistence for user data
- Chat history tied to session tokens
# Authenticated Users (Stateful)
- Email-based registration/login
- Database persistence with full profiles
- JWT tokens with user ID claims
- Persistent chat history across sessions
Anonymous User → Chat → Convert to Authenticated → Preserve History
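The conversion flow can be sketched as an in-place update of the user row, so that existing chat sessions keep pointing at the same user id. This is a minimal sketch, not the project's implementation: it assumes anonymous users get a row in users (as the schema comments suggest), uses an in-memory SQLite database as a stand-in for PostgreSQL, trims the tables to the columns involved, and takes an already hashed password. The function names are hypothetical.

```python
import sqlite3
import uuid

# Trimmed-down stand-in for the users and chat_sessions tables (assumption).
SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    identifier TEXT UNIQUE NOT NULL,
    user_type TEXT NOT NULL,
    email TEXT UNIQUE,
    hashed_password TEXT
);
CREATE TABLE chat_sessions (
    id INTEGER PRIMARY KEY,
    session_id TEXT UNIQUE NOT NULL,
    user_id INTEGER REFERENCES users(id),
    title TEXT NOT NULL
);
"""


def create_anonymous_user(conn: sqlite3.Connection) -> int:
    """Insert an anonymous user identified by a fresh UUID and return its id."""
    cur = conn.execute(
        "INSERT INTO users (identifier, user_type) VALUES (?, 'anonymous')",
        (str(uuid.uuid4()),),
    )
    return cur.lastrowid


def convert_anonymous_user(conn, user_id: int, email: str, hashed_password: str) -> None:
    """Upgrade the row in place; chat_sessions.user_id stays valid, so history is kept."""
    with conn:  # single transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE users SET identifier = ?, email = ?, hashed_password = ?, "
            "user_type = 'authenticated' WHERE id = ? AND user_type = 'anonymous'",
            (email, email, hashed_password, user_id),
        )
        if cur.rowcount != 1:
            raise ValueError("user does not exist or is not anonymous")


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
uid = create_anonymous_user(conn)
conn.execute(
    "INSERT INTO chat_sessions (session_id, user_id, title) VALUES (?, ?, ?)",
    (str(uuid.uuid4()), uid, "Declined transaction"),
)
convert_anonymous_user(conn, uid, "ada@example.com", "<bcrypt hash goes here>")
```

Updating the existing row avoids re-parenting sessions and messages during conversion; the trade-off is that the anonymous UUID is overwritten by the email in identifier.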
- JWT Tokens: Stateless authentication supporting both user types
- Token Claims: user_id, identifier, user_type for flexible user handling
- Middleware Protection: Global auth middleware + endpoint-level validation
- CORS Configuration: Secure cross-origin requests
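To make the token claims concrete, the following is a minimal HS256 sketch built only on the Python standard library. It is an illustration under assumptions, not the project's code: a real deployment would more likely use an established JWT library, and SECRET_KEY, the function names, and the exp handling are hypothetical. The 30-minute lifetime matches the access-token expiration stated elsewhere in this document.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"change-me"  # hypothetical; load from a secrets store in practice
ACCESS_TOKEN_TTL_SECONDS = 30 * 60  # 30-minute access tokens


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> bytes:
    return hmac.new(SECRET_KEY, signing_input.encode("ascii"), hashlib.sha256).digest()


def create_access_token(user_id, identifier, user_type, now=None) -> str:
    """Build a signed token carrying the three claims plus an expiry."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "user_id": user_id,        # None for anonymous users in this sketch
        "identifier": identifier,  # UUID for anonymous, email for authenticated
        "user_type": user_type,    # 'anonymous' or 'authenticated'
        "exp": now + ACCESS_TOKEN_TTL_SECONDS,
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input))}"


def decode_access_token(token: str, now=None) -> dict:
    """Verify signature and expiry; raise ValueError on any failure."""
    now = int(time.time()) if now is None else now
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = _sign(f"{header_b64}.{payload_b64}")
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims["exp"] <= now:
        raise ValueError("token expired")
    return claims
```

Because both user types share one claim set, middleware can branch on user_type after a single verification step instead of maintaining two token formats.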
# Pluggable Embedding Providers
- OpenAI: text-embedding-3-small
- NVIDIA NIM: nvidia/nv-embed-v1 (1024-dim)
# Vector Database: Pinecone
- 1024-dimensional vectors
- Automatic dimension normalization
- Similarity search with configurable thresholds
An NVIDIA API key is used to access embedding models similar to llama-text-embed-v2, because the OpenAI embeddings did not perform well.
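One way to picture the pluggable provider design is a small interface plus a normalization step that forces every vector to the index dimension. The sketch below is an assumption about how this could look: EmbeddingProvider, FakeProvider, and embed_with_fallback are hypothetical names, the fake provider stands in for real API clients, and "dimension normalization" is interpreted here as truncate-or-pad followed by L2 normalization, which the source does not spell out.

```python
import hashlib
import math
from typing import Protocol, Sequence

TARGET_DIM = 1024  # index dimension stated above


class EmbeddingProvider(Protocol):
    """Anything with an embed() method can be plugged in."""

    def embed(self, text: str) -> list[float]: ...


class FakeProvider:
    """Deterministic stand-in for a real provider client (hypothetical)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim

    def embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self.dim)]


def normalize_dimension(vector: Sequence[float], target_dim: int = TARGET_DIM) -> list[float]:
    """Truncate or zero-pad to target_dim, then rescale to unit length."""
    resized = list(vector[:target_dim]) + [0.0] * max(0, target_dim - len(vector))
    norm = math.sqrt(sum(x * x for x in resized))
    return [x / norm for x in resized] if norm > 0 else resized


def embed_with_fallback(text: str, providers: Sequence[EmbeddingProvider]) -> list[float]:
    """Try providers in order and return the first successful, normalized vector."""
    last_error = None
    for provider in providers:
        try:
            return normalize_dimension(provider.embed(text))
        except Exception as exc:  # network errors, rate limits, and so on
            last_error = exc
    raise RuntimeError("all embedding providers failed") from last_error
```

Note that vectors from different models are not comparable even when they have the same length, so an index should be populated and queried with the same provider; resizing only keeps the upsert from being rejected on dimension.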
Query → Query Expansion → Embedding Generation → Vector Search → Context Retrieval → AI Generation → Response
The system employs a query expansion strategy to make sure it retrieves as many relevant documents as possible:
1. Keyword Extraction & Combinations
- Extract meaningful keywords from user queries
- Generate single keyword searches for broad matching
- Create bigram combinations for phrase matching
- Combine top keywords in various combinations
2. Question-Based Variations
- Add question starters (How, What, Why, When, Where) to non-question queries
- Generate multiple question formats for better coverage
- Include both question and non-question versions
3. Action-Oriented Expansions
- Add action verbs: help, support, troubleshoot, resolve, fix, solve
- Include information-seeking variations: information about, guide for
- Cover different user intents and search patterns
4. Synonym-Based Searches
- Map financial terms to common synonyms
- Example: "transaction declined" → "payment rejected", "purchase failed"
- Ensures matching even when users use different terminology
5. Context-Specific Expansions
- Add domain-specific contexts for financial queries
- Include: banking, financial services, payment processing, customer support
- Broadens search scope while maintaining relevance, especially for a targeted knowledge base
Example: For query "One of my transactions got declined, what should I do?"
- Generates 30+ search variations including:
- Single keywords: "transaction", "declined", "payment", "rejected"
- Phrases: "transaction declined", "payment rejected", "help with declined transaction"
- Questions: "How to fix declined transaction", "What is declined transaction"
- Context: "banking transaction declined", "customer support declined transaction"
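The expansion strategies above can be sketched in a few lines. This is a simplified illustration rather than the system's actual expander: the stopword list, templates, synonym map, and naive singularization are all assumptions, and it produces fewer variations than the 30+ mentioned above.

```python
import re

# All word lists below are illustrative assumptions, not the project's data.
STOPWORDS = {
    "a", "an", "the", "of", "my", "one", "got", "what", "should",
    "i", "do", "to", "is", "it", "and", "how", "with", "for",
}
QUESTION_TEMPLATES = ("How to fix {}", "What is {}")
ACTION_PREFIXES = ("help with", "troubleshoot", "resolve", "fix")
CONTEXTS = ("banking", "financial services", "payment processing", "customer support")
PHRASE_SYNONYMS = {"transaction declined": ["payment rejected", "purchase failed"]}


def extract_keywords(query: str) -> list[str]:
    """Lowercase, drop stopwords, and naively singularize plural nouns."""
    keywords = []
    for word in re.findall(r"[a-z]+", query.lower()):
        if word in STOPWORDS:
            continue
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]  # naive singularization: "transactions" -> "transaction"
        if word not in keywords:
            keywords.append(word)
    return keywords


def expand_query(query: str) -> list[str]:
    """Return the original query plus keyword, question, action, synonym, and context variations."""
    keywords = extract_keywords(query)
    bigrams = [" ".join(pair) for pair in zip(keywords, keywords[1:])]
    phrases = bigrams or keywords  # fall back to single keywords for short queries
    variations = [query, *keywords, *bigrams]
    for phrase in phrases:
        variations += [template.format(phrase) for template in QUESTION_TEMPLATES]
        variations += [f"{prefix} {phrase}" for prefix in ACTION_PREFIXES]
        variations += PHRASE_SYNONYMS.get(phrase, [])
        variations += [f"{context} {phrase}" for context in CONTEXTS]
    return list(dict.fromkeys(variations))  # de-duplicate while keeping order


variations = expand_query("One of my transactions got declined, what should I do?")
```

Each variation would then be embedded and searched separately, with the results merged and de-duplicated before generation.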
Client/
├── components/
│ ├── auth/ # Authentication components
│ └── chat/ # Chat interface components
├── contexts/ # React context for state management (specifically auth)
├── hooks/ # Hooks required for state management (specifically auth)
├── services/ # API client, utilities, and types
└── pages/ # Route-based components
- React Context: Global auth state and user management
- API Integration: Axios for HTTP requests with interceptors
- Local State: React hooks for component-level state (a future iteration could move to SSR or TanStack for cleaner state management)
/api/v1/auth/
├── POST /register # User registration
├── POST /login # User authentication
├── POST /anonymous # Anonymous user creation
├── POST /convert-anonymous # User conversion
├── GET /me # Current user info
└── POST /change-password # Password management
/api/v1/chat/
├── POST /send # Send message
├── GET /sessions # List chat sessions
├── GET /sessions/{id} # Get specific session
├── POST /sessions # Create new session
└── DELETE /sessions/{id} # Delete session
{
"message": "AI response content",
"session_id": "uuid",
"sources": [{"title": "...", "content": "..."}],
"tokens_used": 150,
"processing_time": 1200
}
- JWT Tokens: Secure, stateless authentication
- Token Expiration: 30-minute access tokens
- Password Security: bcrypt hashing with salt
- CORS Protection: Configurable allowed origins
- Global Middleware: Automatic route protection
- Input Validation: Pydantic schemas for all inputs
- SQL Injection Prevention: SQLAlchemy ORM
- Environment Variables: Secure configuration management (AWS Parameter Store, Secrets Manager)
- HTTPS Only: TLS encryption in production
- Audit Logging: Request/response logging capability
Frontend: AWS Amplify
├── Global CDN distribution
├── Automatic HTTPS
├── Git-based deployments
└── Built-in CI/CD
Backend: AWS App Runner
├── Serverless container platform
├── Auto-scaling capabilities
├── Health check integration
└── Managed infrastructure
Database: Amazon RDS
├── Managed PostgreSQL
├── Automated backups
├── Multi-AZ availability
└── Security group isolation
Amplify (Frontend) → App Runner (Backend) → RDS (Database)
↓
Pinecone (Vector DB)
- Development: Local environment with hot reload
- Staging: App Runner with staging database
- Production: App Runner with production RDS + monitoring
- App Runner: Automatic scaling based on concurrent requests per instance
- RDS: Read replicas for high read loads
- Pinecone: Managed vector database that scales as more documents and vectors are added
- Database Indexing: Optimized queries with proper indexes
- Caching Strategy: Redis-ready for session caching
- Connection Pooling: Database connection management
- Async Processing: Non-blocking I/O operations
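As an illustration of non-blocking I/O in this setting, expanded queries could be searched concurrently instead of one after another. The sketch below is an assumption about how that might be applied: search_one is a hypothetical stand-in for the real vector-search call, and the concurrency cap is an arbitrary example value.

```python
import asyncio


async def search_one(query: str) -> list[str]:
    """Hypothetical stand-in for a network call to the vector database."""
    await asyncio.sleep(0.01)  # simulates I/O latency without blocking the event loop
    return [f"doc for {query}"]


async def search_all(queries: list[str], max_concurrency: int = 5) -> list[str]:
    """Run searches concurrently, capped so the upstream service is not flooded."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(query: str) -> list[str]:
        async with semaphore:
            return await search_one(query)

    # gather() returns results in the same order as the input queries
    results = await asyncio.gather(*(bounded(q) for q in queries))
    return [doc for docs in results for doc in docs]


docs = asyncio.run(search_all(["transaction declined", "payment rejected"]))
```

With many query variations per message, bounded concurrency keeps total latency close to that of the slowest single search rather than the sum of all of them.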
- Health Checks: /health endpoint for load balancers
- Logging: Structured logging for debugging
- Metrics: CloudWatch integration ready
- Error Handling: Comprehensive exception management
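Structured logging can be sketched with the standard logging module by emitting one JSON object per record, which log aggregators such as CloudWatch can then filter by field. This is a minimal sketch under assumptions: JsonFormatter, build_logger, and the session_id and processing_time_ms field names are hypothetical, not taken from the project.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    EXTRA_FIELDS = ("session_id", "processing_time_ms")  # hypothetical field names

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to the stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger
```

A call such as `build_logger("chat").info("chat response sent", extra={"session_id": "abc", "processing_time_ms": 1200})` would then produce a single JSON line carrying both fields.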
Challenge: How to maintain chat history for anonymous users without database storage?
Solution:
- UUID-based identification in JWT tokens
- Session-based chat history with ephemeral storage
- Conversion path to authenticated accounts
Challenge: How to integrate multiple embedding providers with consistent vector dimensions and ensure comprehensive document retrieval?
Solution:
- Pluggable Embedding Architecture: Support for multiple providers (OpenAI text-embedding-3-small, NVIDIA NIM nvidia/nv-embed-v1)
- Advanced Query Expansion: 30+ search variations including keywords, synonyms, questions, and context-specific expansions
- Configurable RAG Parameters: Chunk size (1000), overlap (200), similarity threshold (0.08), top-k results (5)
- Quality Filtering: Duplicate detection, relevance scoring, and content similarity checks
- Fallback Mechanisms: Graceful fallbacks when embedding services are unavailable
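The configurable parameters and quality filtering can be illustrated with a small sketch. It rests on assumptions: chunk size and overlap are treated as character counts (the source does not give the unit), duplicate detection is reduced to whitespace- and case-insensitive text matching, and Match, chunk_text, and select_context are hypothetical names.

```python
from dataclasses import dataclass

CHUNK_SIZE = 1000            # assumed to be characters
CHUNK_OVERLAP = 200
SIMILARITY_THRESHOLD = 0.08
TOP_K = 5


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[start:start + size] for start in range(0, max(len(text) - overlap, 1), step)]


@dataclass(frozen=True)
class Match:
    text: str
    score: float  # similarity score returned by the vector search


def select_context(matches, threshold: float = SIMILARITY_THRESHOLD, top_k: int = TOP_K):
    """Keep the best-scoring unique matches at or above the threshold, up to top_k."""
    seen = set()
    kept = []
    for match in sorted(matches, key=lambda m: m.score, reverse=True):
        key = " ".join(match.text.lower().split())  # crude duplicate key
        if match.score < threshold or key in seen:
            continue
        seen.add(key)
        kept.append(match)
        if len(kept) == top_k:
            break
    return kept
```

A low threshold such as 0.08 lets almost every match through, so in this arrangement top-k and de-duplication do most of the filtering.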
Challenge: How to provide a responsive chat interface with async processing?
Solution:
- Optimistic UI updates
- Loading states and error handling
- Efficient API design with minimal round trips
This architecture provides a solid foundation for a production-ready RAG chatbot with:
- Security First: Comprehensive authentication and authorization
- Developer Experience: Clean code structure and documentation
- Production Ready: AWS deployment strategy with monitoring
- Future Proof: Architecture that is adaptable and scalable