This is a production-ready Flask application for building RAG (Retrieval-Augmented Generation) systems with PDF document processing, semantic search, and OpenAI integration. The system allows users to upload PDF documents, extract and chunk the text, create vector embeddings, and perform semantic searches using natural language queries.
Preferred communication style: Simple, everyday language.
The application follows a modular Flask-based architecture with clear separation of concerns:
- Web Layer: Flask routes for UI and API endpoints
- Service Layer: Business logic for PDF processing, vector operations, and embeddings
- Data Layer: ChromaDB for vector storage with SQLite backend
- Storage Layer: Replit Object Storage for persistence and backups
- Backend Framework: Flask 3.1.0 with Gunicorn for production deployment
- Vector Database: ChromaDB 0.6.3 with persistent SQLite storage
- AI/ML: OpenAI API for embeddings (text-embedding-ada-002 model)
- PDF Processing: PyMuPDF for robust text extraction
- Authentication: HTTP Basic Auth with session management
- Cloud Storage: Replit Object Storage for data persistence
- Frontend: Bootstrap-based responsive UI with dark theme
- Extracts text from PDF documents using PyMuPDF
- Implements memory management and timeout controls
- Handles large files with streaming processing
- Provides error recovery and garbage collection
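
The streaming, timeout, and garbage-collection behavior listed above could look roughly like the sketch below. It is an illustration under assumptions, not the project's PDF processor: `stream_page_text`, `timeout_s`, `gc_every`, and `clock` are hypothetical names, and the page source is passed in so the example runs without PyMuPDF installed.

```python
import gc
import time
from typing import Callable, Iterable, Iterator, TypeVar

P = TypeVar("P")


def stream_page_text(
    pages: Iterable[P],
    get_text: Callable[[P], str],
    timeout_s: float = 60.0,
    gc_every: int = 50,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield text one page at a time instead of loading the whole file.

    Raises TimeoutError once the overall time budget is exceeded and
    forces a garbage-collection pass every `gc_every` pages.
    """
    if gc_every <= 0:
        raise ValueError("gc_every must be positive")
    deadline = clock() + timeout_s
    for index, page in enumerate(pages, start=1):
        if clock() > deadline:
            raise TimeoutError(f"extraction exceeded {timeout_s} seconds")
        yield get_text(page)
        if index % gc_every == 0:
            gc.collect()  # release memory held by already processed pages
```

With PyMuPDF, the caller would presumably pass an opened document as `pages` and `lambda page: page.get_text()` as `get_text`.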
- Manages ChromaDB operations and vector embeddings
- Implements text chunking (500 tokens per chunk)
- Provides semantic search with configurable similarity thresholds
- Handles document metadata and retrieval
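
The 500-token chunking mentioned above can be sketched as follows. This is a minimal illustration, not the project's implementation: it approximates tokens with whitespace-separated words, which a real tokenizer would count differently, and `chunk_text` and `max_tokens` are hypothetical names.

```python
from typing import List


def chunk_text(text: str, max_tokens: int = 500) -> List[str]:
    """Split text into chunks of at most `max_tokens` words.

    Assumption: whitespace-separated words stand in for model tokens,
    so real token counts per chunk will differ.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]
```

A 1,200-word document would come back as three chunks of 500, 500, and 200 words.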
- Integrates with OpenAI's text-embedding-ada-002 model
- Implements privacy-aware logging (no content exposure)
- Provides fallback mechanisms for deployment environments
- Handles API rate limiting and error recovery
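
The source does not say which strategy the embedding service uses for rate limiting and error recovery. One common approach is retry with exponential backoff and jitter, sketched below with hypothetical names (`call_with_backoff`, `max_attempts`, `base_delay`, `retry_on`); the wrapped function stands in for whatever call produces the embedding.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on the given exceptions with growing delays."""
    if max_attempts <= 0:
        raise ValueError("max_attempts must be positive")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable")
```

Injecting `sleep` keeps the helper testable without real waiting.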
- Web Auth (web/http_auth.py): Session-based authentication for web interface
- API Auth (api/auth.py): API key authentication with multiple header support
- Supports both Bearer tokens and X-API-KEY headers for compatibility
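
Accepting the key from either header might look like the sketch below. It is not the contents of api/auth.py: `extract_api_key` and `is_authorized` are hypothetical helpers, and giving the Bearer token precedence over X-API-KEY is an assumption.

```python
import hmac
from typing import Mapping, Optional


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the key from 'Authorization: Bearer <key>' or 'X-API-KEY'."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = normalized.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    key = normalized.get("x-api-key", "").strip()
    return key or None


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Compare the presented key to the expected one in constant time."""
    presented = extract_api_key(headers)
    if not presented or not expected_key:
        return False
    return hmac.compare_digest(presented.encode(), expected_key.encode())
```

Using `hmac.compare_digest` avoids leaking how many leading characters of the key matched through response timing.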
- Comprehensive PII filtering in all log outputs
- Query content redaction to prevent data leakage
- API key and credential protection
- Pattern-based sensitive data detection
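
Pattern-based redaction of log output can be sketched with a standard `logging.Filter`. The three patterns below (keys starting with `sk-`, email addresses, bearer tokens) are illustrative assumptions and not the project's actual rule set; `redact` and `RedactingFilter` are hypothetical names.

```python
import logging
import re

# Illustrative patterns only; a real deployment would maintain its own list.
_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9_-]{8,}"), "[REDACTED_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),
    (re.compile(r"(bearer\s+)\S+", re.IGNORECASE), r"\1[REDACTED_TOKEN]"),
]


def redact(text: str) -> str:
    """Replace every match of a sensitive pattern with a placeholder."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class RedactingFilter(logging.Filter):
    """Redact the fully formatted message before any handler sees it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # message is already formatted
        return True
```

Attaching the filter to each handler, for example `handler.addFilter(RedactingFilter())`, applies it to every record that handler emits.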
- Upload: User uploads PDF via web interface or API
- Processing: PDF text is extracted and validated
- Chunking: Text is split into 500-token chunks
- Embedding: Each chunk is sent to OpenAI for vector embedding
- Storage: Embeddings and metadata stored in ChromaDB
- Backup: Automatic backup to Object Storage (if configured)
- Query Input: User submits natural language query
- Embedding: Query is converted to vector embedding
- Search: Vector similarity search in ChromaDB
- Ranking: Results ranked by similarity score
- Response: Top results returned with metadata and scores
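
The search and ranking steps above are performed by ChromaDB in the real system. The pure-Python sketch below only illustrates the idea of scoring by cosine similarity, applying a threshold, and returning the top results; `rank_chunks`, `top_k`, and `min_score` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return up to `top_k` (chunk_id, score) pairs, best match first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    kept = [pair for pair in scored if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

Raising `min_score` trades recall for precision: fewer chunks come back, but each is closer to the query.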
- Automatic Backup: After uploads (1-hour intervals)
- Cloud Storage: Data synced to Replit Object Storage
- Restoration: Automatic restore on application startup
- Cleanup: Backup rotation to manage storage quotas
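
The interval and rotation rules above might be expressed as in the sketch below. Two assumptions are baked in: "1-hour intervals" is read as "back up at most once per hour", and backup names embed a sortable timestamp so that lexical order matches chronological order. `backup_due`, `split_backups`, and `keep` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def backup_due(
    last_backup: Optional[datetime],
    now: datetime,
    interval: timedelta = timedelta(hours=1),
) -> bool:
    """True if no backup exists yet or the last one is at least `interval` old."""
    return last_backup is None or now - last_backup >= interval


def split_backups(names: List[str], keep: int = 5) -> Tuple[List[str], List[str]]:
    """Return (kept, to_delete), keeping the `keep` newest backups.

    Assumption: names sort chronologically, e.g. 'backup-20240103T000000.tar.gz'.
    """
    if keep < 0:
        raise ValueError("keep must not be negative")
    newest_first = sorted(names, reverse=True)
    return newest_first[:keep], newest_first[keep:]
```

The caller would then delete every object in the second list from storage.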
- OpenAI API: For generating text embeddings
- Model: text-embedding-ada-002
- Dimensions: 1536
- Required for core functionality
- Replit Object Storage: For data persistence across deployments
- Replit Secrets: For secure credential management
- OPENAI_API_KEY: OpenAI API access key
- VKB_API_KEY: Custom API key for application access
- SESSION_SECRET: Secret for session encryption
- BASIC_AUTH_USERNAME: Web interface username
- BASIC_AUTH_PASSWORD: Web interface password
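
A startup check for these variables could look like the sketch below. Treating all five as required is an assumption (the source only lists them), and `missing_vars` is a hypothetical helper.

```python
import os
from typing import List, Mapping, Optional

# Variable names come from the list above; requiring all of them is an assumption.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "VKB_API_KEY",
    "SESSION_SECRET",
    "BASIC_AUTH_USERNAME",
    "BASIC_AUTH_PASSWORD",
)


def missing_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Failing fast at startup when `missing_vars()` is non-empty gives a clearer error than a failed request later on.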
The application is designed specifically for Replit's deployment environment:
- Development Environment: Full features with file logging
- Production Environment: Streamlined with console logging only
- Automatic Scaling: Gunicorn handles concurrent requests
- Health Monitoring: Built-in health check endpoints
Critical design decision: Data must survive Replit deployments and restarts.
Solution Chosen: Multi-layer persistence approach
- Primary Storage: /home/runner/data/chromadb (guaranteed persistent)
- Backup Storage: Replit Object Storage (cloud persistence)
- Auto-Recovery: Restore from cloud on startup if local data missing
Alternatives Considered:
- Local-only storage (rejected due to Replit restart behavior)
- Cloud-only storage (rejected due to latency concerns)
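
The startup recovery rule (use local data when present, otherwise restore from the cloud) can be sketched as below. The restore step is passed in as a callable because the actual Object Storage calls are not shown in the source; `ensure_local_data` and `restore_from_cloud` are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def ensure_local_data(
    data_dir: Path,
    restore_from_cloud: Callable[[Path], bool],
) -> str:
    """Return 'local', 'restored', or 'empty' depending on what was found.

    `restore_from_cloud` should copy the latest backup into `data_dir`
    and return True on success, False if no backup was available.
    """
    if data_dir.is_dir() and any(data_dir.iterdir()):
        return "local"  # existing data wins; the cloud is not contacted
    data_dir.mkdir(parents=True, exist_ok=True)
    return "restored" if restore_from_cloud(data_dir) else "empty"
```

Checking the local directory first keeps the common restart path fast and avoids the cloud latency cited as the reason for rejecting cloud-only storage.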
- Authentication: Multi-method auth (Basic Auth + API keys)
- Privacy Protection: Comprehensive PII filtering throughout system
- Secure Defaults: Safe configuration for production deployment
- Credential Management: Environment variable-based secrets
- Memory Management: Aggressive garbage collection for large PDFs
- Chunking Strategy: Fixed 500-token chunks for embedding efficiency
- Backup Rotation: Limited backup history to prevent storage overflow
- Query Caching: ChromaDB handles vector similarity caching
The system prioritizes reliability, security, and ease of deployment while maintaining high performance for document processing and search operations.