Vector Knowledge Base - System Architecture

Overview

This is a production-ready Flask application for building RAG (Retrieval-Augmented Generation) systems with PDF document processing, semantic search, and OpenAI integration. The system allows users to upload PDF documents, extract and chunk the text, create vector embeddings, and perform semantic searches using natural language queries.

User Preferences

Preferred communication style: Simple, everyday language.

System Architecture

High-Level Architecture

The application follows a modular Flask-based architecture with clear separation of concerns:

Web Layer: Flask routes for UI and API endpoints
Service Layer: Business logic for PDF processing, vector operations, and embeddings
Data Layer: ChromaDB for vector storage with SQLite backend
Storage Layer: Replit Object Storage for persistence and backups

Technology Stack

Backend Framework: Flask 3.1.0 with Gunicorn for production deployment
Vector Database: ChromaDB 0.6.3 with persistent SQLite storage
AI/ML: OpenAI API for embeddings (text-embedding-ada-002 model)
PDF Processing: PyMuPDF for robust text extraction
Authentication: HTTP Basic Auth with session management
Cloud Storage: Replit Object Storage for data persistence
Frontend: Bootstrap-based responsive UI with dark theme

Key Components

1. PDF Processing Service (`services/pdf_processor.py`)

Extracts text from PDF documents using PyMuPDF
Implements memory management and timeout controls
Handles large files with streaming processing
Provides error recovery and garbage collection
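
The timeout and garbage-collection controls listed above could be sketched roughly as follows. This is a minimal illustration, not the contents of `services/pdf_processor.py`: it takes any iterable of page texts in place of a real PyMuPDF document, and `extract_with_budget`, `ExtractionTimeout`, `max_seconds`, and `gc_every` are hypothetical names.

```python
import gc
import time
from typing import Iterable, Iterator


class ExtractionTimeout(Exception):
    """Raised when extraction exceeds its time budget (hypothetical name)."""


def extract_with_budget(
    pages: Iterable[str],
    max_seconds: float = 60.0,
    gc_every: int = 50,
) -> Iterator[str]:
    """Yield page texts one at a time while enforcing a time budget.

    `pages` stands in for a lazily read PDF; the parameters are
    illustrative assumptions, not the project's real settings.
    """
    deadline = time.monotonic() + max_seconds
    for index, text in enumerate(pages, start=1):
        if time.monotonic() > deadline:
            raise ExtractionTimeout(f"gave up after page {index - 1}")
        yield text
        if index % gc_every == 0:
            # Periodically release memory held by already-processed pages.
            gc.collect()
```

Because the function is a generator, a caller can stream pages into the chunker without holding the whole document in memory.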

2. Vector Store Service (`services/vector_store.py`)

Manages ChromaDB operations and vector embeddings
Implements text chunking (500 tokens per chunk)
Provides semantic search with configurable similarity thresholds
Handles document metadata and retrieval
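
As a rough sketch of the 500-token chunking step, the function below splits text into fixed-size chunks. It assumes whitespace-separated words as a stand-in for tokens; the actual service may count tokens with a model-specific tokenizer, and `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    """Split text into chunks of at most `chunk_size` tokens.

    Assumption: whitespace-separated words approximate tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]
```

For example, a 1,200-word document yields three chunks of 500, 500, and 200 words.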

3. Embedding Service (`services/embedding_service.py`)

Integrates with OpenAI's text-embedding-ada-002 model
Implements privacy-aware logging (no content exposure)
Provides fallback mechanisms for deployment environments
Handles API rate limiting and error recovery
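
One common way to handle rate limiting is retrying with exponential backoff. The sketch below shows that pattern in isolation, under the assumption that the embedding call is wrapped in a zero-argument function. `call_with_backoff` and its parameters are hypothetical and not taken from `services/embedding_service.py`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying with exponential backoff when it raises.

    For brevity this retries on any exception; real code would likely
    retry only on rate-limit or transient network errors.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.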

4. Authentication System

Web Auth (`web/http_auth.py`): Session-based authentication for web interface
API Auth (`api/auth.py`): API key authentication with multiple header support
Supports both Bearer tokens and X-API-KEY headers for compatibility
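
Accepting the key from either header might look like the sketch below. It works on a plain mapping of headers rather than a Flask request, and `extract_api_key` and `is_authorized` are hypothetical helpers, not the functions in `api/auth.py`.

```python
import hmac
from typing import Mapping, Optional


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the key from `Authorization: Bearer <key>` or `X-API-KEY`."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    key = lowered.get("x-api-key", "").strip()
    return key or None


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Compare the supplied key with the expected one in constant time."""
    supplied = extract_api_key(headers)
    if not supplied or not expected_key:
        return False
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

Using `hmac.compare_digest` avoids leaking information about the key through comparison timing.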

5. Privacy Protection System (`utils/privacy_log_handler.py`)

Comprehensive PII filtering in all log outputs
Query content redaction to prevent data leakage
API key and credential protection
Pattern-based sensitive data detection
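
Pattern-based redaction can be attached to Python's standard `logging` module as a filter. The sketch below is an assumption about how such a handler could work, not the code in `utils/privacy_log_handler.py`. The two patterns and the names `redact` and `RedactingFilter` are illustrative only.

```python
import logging
import re

# Illustrative patterns only; a real deployment would need a broader set.
_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9_-]{8,}"), "[REDACTED_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def redact(text: str) -> str:
    """Replace every match of a sensitive pattern with a placeholder."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


class RedactingFilter(logging.Filter):
    """Scrub sensitive values from log records before they are emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so values passed as args are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True
```

A filter like this is attached with `handler.addFilter(RedactingFilter())`, so every record passing through that handler is scrubbed.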

Data Flow

Document Upload Flow

Upload: User uploads PDF via web interface or API
Processing: PDF text is extracted and validated
Chunking: Text is split into 500-token chunks
Embedding: Each chunk is sent to OpenAI for vector embedding
Storage: Embeddings and metadata stored in ChromaDB
Backup: Automatic backup to Object Storage (if configured)

Query Processing Flow

Query Input: User submits natural language query
Embedding: Query is converted to vector embedding
Search: Vector similarity search in ChromaDB
Ranking: Results ranked by similarity score
Response: Top results returned with metadata and scores
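
The search and ranking steps can be illustrated in pure Python. In the application ChromaDB performs the similarity search, so the sketch below only shows the underlying idea with cosine similarity; `rank_chunks`, `top_k`, and `min_score` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and return the best matches."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    # Drop results below the similarity threshold, then sort best-first.
    kept = [item for item in scored if item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]
```

Here `min_score` plays the role of the configurable similarity threshold mentioned for the vector store.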

Persistence and Backup Flow

Automatic Backup: Runs after uploads, spaced at 1-hour intervals
Cloud Storage: Data synced to Replit Object Storage
Restoration: Automatic restore on application startup
Cleanup: Backup rotation to manage storage quotas
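
The interval check and rotation step could be expressed as two small helpers. This sketch assumes backup names embed a sortable timestamp and that a fixed number of backups is retained. `backup_due`, `plan_rotation`, and the default `keep=5` are hypothetical and not taken from the project.

```python
from typing import List, Tuple


def backup_due(
    last_backup: float, now: float, interval_seconds: float = 3600.0
) -> bool:
    """True once at least one interval has passed since the last backup."""
    return now - last_backup >= interval_seconds


def plan_rotation(
    backup_names: List[str], keep: int = 5
) -> Tuple[List[str], List[str]]:
    """Return (kept, to_delete), newest first.

    Assumption: names sort chronologically, e.g. `backup_20240102T000000`.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    newest_first = sorted(backup_names, reverse=True)
    return newest_first[:keep], newest_first[keep:]
```

Separating the decision (what to delete) from the deletion itself makes the rotation logic testable without touching storage.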

External Dependencies

Required APIs

OpenAI API: For generating text embeddings
- Model: text-embedding-ada-002
- Dimensions: 1536
- Required for core functionality

Required Services

Replit Object Storage: For data persistence across deployments
Replit Secrets: For secure credential management

Required Environment Variables

OPENAI_API_KEY: OpenAI API access key
VKB_API_KEY: Custom API key for application access
SESSION_SECRET: Secret for session encryption
BASIC_AUTH_USERNAME: Web interface username
BASIC_AUTH_PASSWORD: Web interface password
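
A startup check for these variables helps surface configuration problems early. The following is a minimal sketch; `REQUIRED_VARS` mirrors the list above, while `missing_settings` is a hypothetical helper.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "VKB_API_KEY",
    "SESSION_SECRET",
    "BASIC_AUTH_USERNAME",
    "BASIC_AUTH_PASSWORD",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

An application could call `missing_settings()` at startup and refuse to run, or log a warning, if the returned list is not empty.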

Deployment Strategy

Replit Deployment Architecture

The application is designed specifically for Replit's deployment environment:

Development Environment: Full features with file logging
Production Environment: Streamlined with console logging only
Automatic Scaling: Gunicorn handles concurrent requests
Health Monitoring: Built-in health check endpoints

Data Persistence Strategy

Critical design decision: Data must survive Replit deployments and restarts.

Solution Chosen: Multi-layer persistence approach

Primary Storage: /home/runner/data/chromadb (guaranteed persistent)
Backup Storage: Replit Object Storage (cloud persistence)
Auto-Recovery: Restore from cloud on startup if local data missing

Alternatives Considered:

Local-only storage (rejected due to Replit restart behavior)
Cloud-only storage (rejected due to latency concerns)
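
The auto-recovery rule (restore from cloud only when local data is missing) can be sketched as below. The cloud restore is passed in as a callback so the example makes no assumptions about the Replit Object Storage client. `ensure_local_data` and its return values are hypothetical.

```python
from pathlib import Path
from typing import Callable


def ensure_local_data(
    data_dir: Path,
    restore_from_cloud: Callable[[Path], bool],
) -> str:
    """Make sure the local store exists, restoring it if necessary.

    Returns "local" if data was already present, "restored" if the
    callback reported a successful restore, and "empty" otherwise.
    """
    if data_dir.is_dir() and any(data_dir.iterdir()):
        return "local"
    data_dir.mkdir(parents=True, exist_ok=True)
    return "restored" if restore_from_cloud(data_dir) else "empty"
```

Checking the local directory first keeps startup fast in the common case and falls back to the slower cloud copy only after a fresh deployment.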

Security Architecture

Authentication: Multi-method auth (Basic Auth + API keys)
Privacy Protection: Comprehensive PII filtering throughout system
Secure Defaults: Safe configuration for production deployment
Credential Management: Environment variable-based secrets

Performance Optimizations

Memory Management: Aggressive garbage collection for large PDFs
Chunking Strategy: Fixed 500-token chunks, chosen for embedding efficiency
Backup Rotation: Limited backup history to prevent storage overflow
Query Caching: ChromaDB handles vector similarity caching

The system prioritizes reliability, security, and ease of deployment while maintaining high performance for document processing and search operations.
