Vector Knowledge Base - System Architecture

Overview

This is a production-ready Flask application for building RAG (Retrieval-Augmented Generation) systems with PDF document processing, semantic search, and OpenAI integration. The system allows users to upload PDF documents, extract and chunk the text, create vector embeddings, and perform semantic searches using natural language queries.

User Preferences

Preferred communication style: Simple, everyday language.

System Architecture

High-Level Architecture

The application follows a modular Flask-based architecture with clear separation of concerns:

  1. Web Layer: Flask routes for UI and API endpoints
  2. Service Layer: Business logic for PDF processing, vector operations, and embeddings
  3. Data Layer: ChromaDB for vector storage with SQLite backend
  4. Storage Layer: Replit Object Storage for persistence and backups

Technology Stack

  • Backend Framework: Flask 3.1.0 with Gunicorn for production deployment
  • Vector Database: ChromaDB 0.6.3 with persistent SQLite storage
  • AI/ML: OpenAI API for embeddings (text-embedding-ada-002 model)
  • PDF Processing: PyMuPDF for robust text extraction
  • Authentication: HTTP Basic Auth with session management
  • Cloud Storage: Replit Object Storage for data persistence
  • Frontend: Bootstrap-based responsive UI with dark theme

Key Components

1. PDF Processing Service (services/pdf_processor.py)

  • Extracts text from PDF documents using PyMuPDF
  • Implements memory management and timeout controls
  • Handles large files with streaming processing
  • Provides error recovery and garbage collection

2. Vector Store Service (services/vector_store.py)

  • Manages ChromaDB operations and vector embeddings
  • Implements text chunking (500 tokens per chunk)
  • Provides semantic search with configurable similarity thresholds
  • Handles document metadata and retrieval
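
The fixed-size chunking described above can be sketched as follows. This is a minimal illustration, not the code in services/vector_store.py: the function name chunk_text is hypothetical, and whitespace-separated words stand in for model tokens, since the document does not say which tokenizer is used.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    """Split text into chunks of at most `chunk_size` tokens.

    Assumption: tokens are approximated by whitespace-separated words.
    A real deployment would likely count tokens with the embedding
    model's own tokenizer instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]
```

With the default size, a 1,200-word document yields three chunks of 500, 500, and 200 words.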

3. Embedding Service (services/embedding_service.py)

  • Integrates with OpenAI's text-embedding-ada-002 model
  • Implements privacy-aware logging (no content exposure)
  • Provides fallback mechanisms for deployment environments
  • Handles API rate limiting and error recovery
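
One common way to handle rate limiting and transient errors is retry with exponential backoff. The sketch below shows that general pattern only; the names call_with_backoff and TransientAPIError are hypothetical, and the actual retry policy in services/embedding_service.py is not described in this document.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical marker for retryable failures such as rate limits."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on TransientAPIError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter spreads out retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the behavior can be exercised in tests without real waiting.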

4. Authentication System

  • Web Auth (web/http_auth.py): Session-based authentication for web interface
  • API Auth (api/auth.py): API key authentication with multiple header support
  • Supports both Bearer tokens and X-API-KEY headers for compatibility
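
Accepting a key from either header can be sketched as below. The helper names extract_api_key and is_authorized are hypothetical, and the precedence (Bearer first, then X-API-KEY) is an assumption; api/auth.py may order or validate these differently.

```python
import hmac
from typing import Mapping, Optional


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the key from 'Authorization: Bearer <key>' or 'X-API-KEY'."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = normalized.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    key = normalized.get("x-api-key", "").strip()
    return key or None


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Compare the supplied key with the expected one in constant time."""
    supplied = extract_api_key(headers)
    if not supplied or not expected_key:
        return False
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

In this sketch the expected key would come from the VKB_API_KEY environment variable listed under Required Environment Variables.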

5. Privacy Protection System (utils/privacy_log_handler.py)

  • Comprehensive PII filtering in all log outputs
  • Query content redaction to prevent data leakage
  • API key and credential protection
  • Pattern-based sensitive data detection
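
Pattern-based redaction can be attached to Python's standard logging module as a filter. The following is a sketch under assumptions: the three patterns and the class name RedactingFilter are illustrative only, and the real rules in utils/privacy_log_handler.py are presumably broader.

```python
import logging
import re

# Illustrative patterns only (assumptions, not the project's real rules).
REDACTION_PATTERNS = [
    (re.compile(r"\bsk-[A-Za-z0-9_-]{8,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"(?i)(bearer\s+)\S+"), r"\1[REDACTED_TOKEN]"),
]


def redact(message: str) -> str:
    """Apply every redaction pattern to the message in order."""
    for pattern, replacement in REDACTION_PATTERNS:
        message = pattern.sub(replacement, message)
    return message


class RedactingFilter(logging.Filter):
    """Rewrite each record so formatted output never contains matches."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so values passed as args are redacted too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # keep the record, now sanitized
```

A filter of this kind is added to a handler with handler.addFilter(RedactingFilter()), so every record passing through that handler is sanitized before formatting.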

Data Flow

Document Upload Flow

  1. Upload: User uploads PDF via web interface or API
  2. Processing: PDF text is extracted and validated
  3. Chunking: Text is split into 500-token chunks
  4. Embedding: Each chunk is sent to OpenAI for vector embedding
  5. Storage: Embeddings and metadata stored in ChromaDB
  6. Backup: Automatic backup to Object Storage (if configured)

Query Processing Flow

  1. Query Input: User submits natural language query
  2. Embedding: Query is converted to vector embedding
  3. Search: Vector similarity search in ChromaDB
  4. Ranking: Results ranked by similarity score
  5. Response: Top results returned with metadata and scores
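
Steps 3 to 5 can be illustrated in plain Python. In the application the search itself is delegated to ChromaDB, so this sketch only shows the underlying idea; cosine similarity as the metric, the function names, and the top_k and min_similarity parameters are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    top_k: int = 5,
    min_similarity: float = 0.0,
) -> List[Tuple[str, float]]:
    """Score (chunk_id, vector) pairs against the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunks]
    kept = [item for item in scored if item[1] >= min_similarity]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]
```

Raising min_similarity trades recall for precision: fewer chunks are returned, but each is closer to the query.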

Persistence and Backup Flow

  1. Automatic Backup: After uploads (1-hour intervals)
  2. Cloud Storage: Data synced to Replit Object Storage
  3. Restoration: Automatic restore on application startup
  4. Cleanup: Backup rotation to manage storage quotas
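
The interval check and the rotation step can be sketched as two small functions. Several things here are assumptions: that "1-hour intervals" means at most one backup per hour, that backup names embed a sortable timestamp (for example a hypothetical backup_20250101T120000), and that five backups are kept.

```python
from typing import List, Optional


def backup_due(last_backup: Optional[float], now: float,
               interval_seconds: float = 3600.0) -> bool:
    """True if no backup exists yet or the interval has elapsed."""
    return last_backup is None or now - last_backup >= interval_seconds


def select_backups_to_delete(backup_names: List[str], keep: int = 5) -> List[str]:
    """Return the names that fall outside the `keep` newest backups.

    Assumption: names sort chronologically because they embed a
    fixed-width timestamp.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    newest_first = sorted(backup_names, reverse=True)
    return newest_first[keep:]
```

Keeping selection separate from deletion makes the rotation policy easy to test without touching object storage.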

External Dependencies

Required APIs

  • OpenAI API: For generating text embeddings
    • Model: text-embedding-ada-002
    • Dimensions: 1536
    • Required for core functionality

Required Services

  • Replit Object Storage: For data persistence across deployments
  • Replit Secrets: For secure credential management

Required Environment Variables

  • OPENAI_API_KEY: OpenAI API access key
  • VKB_API_KEY: Custom API key for application access
  • SESSION_SECRET: Secret for session encryption
  • BASIC_AUTH_USERNAME: Web interface username
  • BASIC_AUTH_PASSWORD: Web interface password
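
A startup check for these variables might look like the sketch below. The function name missing_env_vars is hypothetical, and treating an empty value as missing is an assumption rather than documented behavior.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_ENV_VARS = (
    "OPENAI_API_KEY",
    "VKB_API_KEY",
    "SESSION_SECRET",
    "BASIC_AUTH_USERNAME",
    "BASIC_AUTH_PASSWORD",
)


def missing_env_vars(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling this once at startup and failing fast when the list is non-empty avoids discovering a missing secret only on the first request.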

Deployment Strategy

Replit Deployment Architecture

The application is designed specifically for Replit's deployment environment:

  1. Development Environment: Full features with file logging
  2. Production Environment: Streamlined with console logging only
  3. Concurrency: Gunicorn workers handle concurrent requests
  4. Health Monitoring: Built-in health check endpoints

Data Persistence Strategy

Critical design decision: Data must survive Replit deployments and restarts.

Solution Chosen: Multi-layer persistence approach

  • Primary Storage: /home/runner/data/chromadb (guaranteed persistent)
  • Backup Storage: Replit Object Storage (cloud persistence)
  • Auto-Recovery: Restore from cloud on startup if local data missing
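
The auto-recovery rule (restore only when local data is missing) can be sketched as follows. The function name ensure_local_data and the restore callback are hypothetical stand-ins: the callback represents whatever code downloads the latest backup from Replit Object Storage, and "missing" is assumed to mean an absent or empty directory.

```python
from pathlib import Path
from typing import Callable


def ensure_local_data(data_dir: Path, restore: Callable[[Path], None]) -> bool:
    """Restore into `data_dir` if it is absent or empty.

    Returns True when a restore was triggered, False when existing
    local data was left untouched.
    """
    if data_dir.is_dir() and any(data_dir.iterdir()):
        return False  # local data present: skip the cloud round trip
    data_dir.mkdir(parents=True, exist_ok=True)
    restore(data_dir)  # hypothetical: fetch the latest cloud backup
    return True
```

Checking local storage first keeps normal startups fast and reserves the slower cloud restore for the case where local data is gone.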

Alternatives Considered:

  • Local-only storage (rejected due to Replit restart behavior)
  • Cloud-only storage (rejected due to latency concerns)

Security Architecture

  • Authentication: Multi-method auth (Basic Auth + API keys)
  • Privacy Protection: Comprehensive PII filtering throughout system
  • Secure Defaults: Safe configuration for production deployment
  • Credential Management: Environment variable-based secrets

Performance Optimizations

  • Memory Management: Aggressive garbage collection for large PDFs
  • Chunking Strategy: Fixed 500-token chunks, sized for embedding efficiency
  • Backup Rotation: Limited backup history to prevent storage overflow
  • Query Caching: ChromaDB handles vector similarity caching

The system prioritizes reliability, security, and ease of deployment while maintaining high performance for document processing and search operations.