BEACON is an AI-powered document management and RAG (Retrieval-Augmented Generation) system for government education policies.
- 🌐 Web Scraping: Automated document collection from government websites
- 📄 Document Management: Upload, process, and organize documents
- 🤖 AI Chat: Ask questions and get answers from documents
- 👥 Role-Based Access: Different permissions for different user types
- 🔍 Smart Search: Vector + keyword hybrid search
- 📊 Document Families: Version tracking and deduplication
- Backend: FastAPI (Python)
- Frontend: React + Vite
- Database: PostgreSQL + pgvector
- AI Models: Google Gemini, OpenRouter (Llama 3.3)
- Storage: Supabase S3
- Embeddings: BGE-M3 (multilingual)
Purpose: REST API for all operations
Key Files:
- backend/main.py - FastAPI app initialization, CORS, routes
- backend/database.py - SQLAlchemy models (User, Document, Institution, etc.)
- backend/routers/auth_router.py - Login, signup, email verification
- backend/routers/document_router.py - Document upload, download, approval
- backend/routers/enhanced_web_scraping_router.py - Web scraping endpoints
- backend/routers/chat_router.py - AI chat with documents
What it does:
- Handles HTTP requests from frontend
- Manages database operations
- Enforces role-based access control
- Coordinates AI agents and tools
Purpose: Automatically scrape documents from government websites
Key Files:
- Agent/web_scraping/enhanced_processor.py - Main scraping orchestrator
- Agent/web_scraping/site_scrapers/moe_scraper.py - Ministry of Education scraper
- Agent/web_scraping/site_scrapers/ugc_scraper.py - UGC scraper
- Agent/web_scraping/site_scrapers/aicte_scraper.py - AICTE scraper
- Agent/web_scraping/document_identity_manager.py - Deduplication logic
- Agent/web_scraping/sliding_window_manager.py - Pagination handling
- Agent/web_scraping/session_storage.py - Persist scraping sessions
What it does:
- Visits government websites
- Finds document links (PDFs, DOCs)
- Downloads documents
- Uploads to Supabase S3
- Extracts metadata using AI
- Saves to database
- Detects duplicates and updates
Deduplication Strategy (3 levels):
- URL-based: Same URL = duplicate
- Content hash: Same SHA256 = duplicate
- Normalized URL: Similar URLs = duplicate
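The three levels can be sketched as follows (a minimal illustration; `normalize_url` and `is_duplicate` here are hypothetical stand-ins for the real logic in Agent/web_scraping/document_identity_manager.py):

```python
import hashlib
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str) -> str:
    """Level 3: drop query strings, fragments, and trailing slashes so
    near-identical links compare equal."""
    p = urlparse(url.lower())
    return urlunparse((p.scheme, p.netloc, p.path.rstrip("/"), "", "", ""))

def is_duplicate(url: str, content: bytes, seen_urls: set, seen_hashes: set) -> bool:
    # Level 1: exact URL match
    if url in seen_urls:
        return True
    # Level 2: identical bytes (SHA256 content hash)
    if hashlib.sha256(content).hexdigest() in seen_hashes:
        return True
    # Level 3: normalized URL match
    return normalize_url(url) in {normalize_url(u) for u in seen_urls}
```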
Purpose: Extract structured information from documents using AI
Key Files:
- Agent/metadata/extractor.py - AI-powered metadata extraction
- Agent/metadata/reranker.py - Rerank search results by relevance
What it does:
- Extracts: title, department, document type, date, summary, keywords
- Uses: Google Gemini (gemma-3-12b) - 14,400 requests/day
- Fallback: OpenRouter if Gemini fails
- Quality validation: Ensures metadata meets standards
Metadata Fields Extracted:
{
"title": "National Education Policy 2020",
"department": "Ministry of Education",
"document_type": "policy",
"date_published": "2020-07-29",
"keywords": ["education", "policy", "reform"],
"summary": "Comprehensive education reform...",
"key_topics": ["curriculum", "teacher training"],
"entities": {
"departments": ["MoE", "UGC"],
"locations": ["India"],
"people": ["Minister"]
}
}

Purpose: Convert text to vectors for semantic search
Key Files:
- Agent/embeddings/bge_embedder.py - BGE-M3 multilingual embeddings
- Agent/vector_store/pgvector_store.py - PostgreSQL vector storage
- Agent/chunking/adaptive_chunker.py - Smart text chunking
What it does:
- Splits documents into chunks (adaptive size)
- Generates 1024-dim embeddings (BGE-M3)
- Stores in PostgreSQL with pgvector
- Enables semantic search (cosine similarity)
Embedding Model: BAAI/bge-m3
- Dimension: 1024
- Languages: 100+ (Hindi, Tamil, Telugu, Bengali, etc.)
- Use case: Multilingual document search
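Under the hood, semantic search just ranks stored chunk vectors by cosine similarity to the query vector. A NumPy sketch of that ranking step (pgvector performs the equivalent server-side):

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3):
    """Rank chunk vectors by cosine similarity to the query vector and
    return (index, score) pairs for the k best matches."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                     # cosine similarity per chunk
    top = np.argsort(-sims)[:k]      # indices of the k best matches
    return [(int(i), float(sims[i])) for i in top]
```

With pgvector, the same ordering is expressed in SQL via the cosine-distance operator: `ORDER BY embedding <=> :query_embedding LIMIT k`.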
Purpose: Embed documents on-demand to save compute
Key Files:
- Agent/lazy_rag/lazy_embedder.py - On-demand embedding
- Agent/tools/lazy_search_tools.py - Search with lazy embedding
How it works:
- User searches for "Indo-Norwegian program"
- System checks: Are relevant docs embedded?
- If NO: Rank unembedded docs by metadata (BM25)
- Embed top 3 most relevant docs
- Search pgvector for matches
- Return results
Benefits:
- First search: Slower (needs embedding)
- Subsequent searches: Fast (uses cached embeddings)
- Saves compute: Only embed what's needed
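The lazy flow can be sketched as follows (a toy keyword-overlap score stands in for the real BM25 ranking, and `embed_fn` for the real embedder; all names here are hypothetical):

```python
def lazy_search(query: str, docs: dict, embedded: set, embed_fn, top_n: int = 3):
    """docs: {doc_id: metadata_text}. Embed only the docs most likely to
    answer the query; vector search then covers them as usual."""
    pending = {d: meta for d, meta in docs.items() if d not in embedded}
    if pending:
        q_terms = set(query.lower().split())
        # Toy stand-in for BM25: rank by keyword overlap with metadata
        scored = sorted(
            pending,
            key=lambda d: len(q_terms & set(pending[d].lower().split())),
            reverse=True,
        )
        for doc_id in scored[:top_n]:   # embed only the top-N candidates
            embed_fn(doc_id)
            embedded.add(doc_id)
    return embedded
```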
Purpose: Answer questions using documents
Key Files:
- Agent/rag_agent/react_agent.py - ReAct agent with tools
- Agent/rag_enhanced/family_aware_retriever.py - Document family search
What it does:
- User asks: "What is the Indo-Norwegian program?"
- Agent searches documents
- Finds relevant chunks
- Generates answer using LLM
- Cites sources with document IDs
LLM Used: OpenRouter (Llama 3.3 70B)
- 200 requests/day (FREE)
- Excellent for Q&A
- No API version issues
Tools Available:
- search_documents_lazy() - Search all documents
- search_specific_document_lazy() - Search one document
- enhanced_search_documents() - Family-aware search
Purpose: Group related documents and track versions
Key Files:
Agent/document_families/family_manager.py- Family management
What it does:
- Groups: "NEP 2020 v1", "NEP 2020 v2" → Same family
- Tracks: Version numbers, superseded documents
- Detects: Updates to existing documents
- Improves RAG: Retrieves latest version
Example Family:
Family: "National Education Policy"
├── NEP 2020 Draft (v1.0) - Superseded
├── NEP 2020 Final (v2.0) - Superseded
└── NEP 2020 Amendment (v2.1) - Latest ✅
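A minimal sketch of the versioning rule (integer versions for simplicity; the real implementation is `find_or_create_family()` in family_manager.py):

```python
def add_to_family(families: dict, family_name: str, doc_title: str):
    """families: {name: [{title, version, superseded}]}. A new version
    supersedes the previous latest document in the family."""
    versions = families.setdefault(family_name, [])
    if versions:
        versions[-1]["superseded"] = True      # previous latest is superseded
        version = versions[-1]["version"] + 1
    else:
        version = 1                            # first doc creates the family
    versions.append({"title": doc_title, "version": version, "superseded": False})
    return versions[-1]
```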
Purpose: User interface for all features
Key Files:
- frontend/src/App.jsx - Main app, routing
- frontend/src/pages/admin/EnhancedWebScrapingPage.jsx - Scraping UI
- frontend/src/pages/admin/DocumentManagementPage.jsx - Document management
- frontend/src/pages/ChatPage.jsx - AI chat interface
- frontend/src/pages/auth/LoginPage.jsx - Login/signup
- frontend/src/components/layout/Sidebar.jsx - Navigation
What it does:
- Provides UI for all features
- Handles authentication
- Displays documents, stats, logs
- Real-time scraping progress
- Chat interface with citations
Files Involved:
- frontend/src/pages/admin/EnhancedWebScrapingPage.jsx - UI
- backend/routers/enhanced_web_scraping_router.py - API
- Agent/web_scraping/enhanced_processor.py - Scraping logic
- Agent/web_scraping/site_scrapers/moe_scraper.py - Site-specific scraper
- Agent/metadata/extractor.py - Metadata extraction
- backend/utils/supabase_storage.py - S3 upload
Flow:
User clicks "Scrape" → API call → enhanced_processor.py
→ Site scraper finds links → Downloads PDFs
→ Uploads to S3 → Extracts metadata (AI)
→ Saves to database → Returns stats
Key Features:
- Multi-page pagination (up to 100 pages)
- Deduplication (3 levels)
- Stop button (graceful shutdown)
- Progress logging
- Family detection
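The stop button works by setting a flag that the scraping loop checks between documents, so the current document finishes cleanly. A minimal sketch (class and method names here are hypothetical):

```python
import threading

class ScrapeJob:
    """Graceful-shutdown sketch: the API sets a stop flag, and the
    scraping loop checks it before processing each next document."""
    def __init__(self):
        self._stop = threading.Event()

    def request_stop(self):
        self._stop.set()

    def run(self, links, process):
        done = 0
        for link in links:
            if self._stop.is_set():   # exit cleanly between documents
                break
            process(link)
            done += 1
        return done
```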
Files Involved:
- frontend/src/pages/admin/DocumentManagementPage.jsx - UI
- backend/routers/document_router.py - API
- backend/utils/text_extractor.py - Text extraction
- backend/utils/supabase_storage.py - S3 upload
- Agent/metadata/extractor.py - Metadata extraction
Flow:
User uploads PDF → API receives file
→ Extracts text (PyPDF2/Tesseract OCR)
→ Uploads to S3 → Extracts metadata
→ Saves to database → Returns document ID
Supported Formats:
- PDF (with OCR fallback)
- DOCX, DOC
- TXT
- PPTX
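The OCR fallback hinges on a quality check of the PyPDF2 output. A sketch of such a heuristic (the thresholds are illustrative assumptions, not BEACON's actual values):

```python
def needs_ocr(extracted_text: str, min_chars: int = 200, min_alpha_ratio: float = 0.5) -> bool:
    """Heuristic: if direct extraction returned almost nothing, or mostly
    non-letter noise, the PDF is probably scanned and should go through
    Tesseract OCR instead."""
    text = extracted_text.strip()
    if len(text) < min_chars:
        return True
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) < min_alpha_ratio
```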
Files Involved:
- frontend/src/pages/ChatPage.jsx - Chat UI
- backend/routers/chat_router.py - Chat API
- Agent/rag_agent/react_agent.py - RAG agent
- Agent/tools/lazy_search_tools.py - Search tools
- Agent/lazy_rag/lazy_embedder.py - Lazy embedding
- Agent/vector_store/pgvector_store.py - Vector search
- Agent/metadata/reranker.py - Result reranking
Flow:
User asks question → Chat API
→ RAG agent processes query
→ Searches documents (lazy embedding if needed)
→ Retrieves relevant chunks
→ LLM generates answer with citations
→ Returns formatted response
Search Strategy:
- Check if docs are embedded
- If not: Rank by metadata, embed top 3
- Vector search (70%) + BM25 (30%)
- Rerank results by relevance
- Return top 5 chunks
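The 70/30 fusion step can be sketched as follows (a minimal illustration; the real hybrid retriever may normalize and merge differently):

```python
def hybrid_score(vector_scores: dict, bm25_scores: dict,
                 w_vec: float = 0.7, w_kw: float = 0.3):
    """Blend per-document vector-similarity and BM25 scores. Each score
    set is normalized to [0, 1] first; a missing score counts as 0."""
    def norm(scores):
        if not scores:
            return {}
        hi = max(scores.values()) or 1.0
        return {k: v / hi for k, v in scores.items()}
    v, b = norm(vector_scores), norm(bm25_scores)
    merged = {i: w_vec * v.get(i, 0.0) + w_kw * b.get(i, 0.0)
              for i in set(v) | set(b)}
    return sorted(merged.items(), key=lambda kv: -kv[1])
```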
Files Involved:
- backend/routers/auth_router.py - Authentication
- backend/database.py - User model
- backend/constants/roles.py - Role definitions
- All routers - Access checks
Roles:
- Developer: Full access to everything
- Ministry Admin: Access to all public + ministry docs
- University Admin: Access to public + own institution docs
- Document Officer: Upload and manage docs
- Student: Read-only access to approved docs
- Public Viewer: Public documents only
How it works:
# In every API endpoint
current_user = get_current_user(token)
if current_user.role != "developer":
    # Filter documents by role
    query = query.filter_by(visibility_level="public")

Files Involved:
- Agent/document_families/family_manager.py - Family logic
- backend/database.py - DocumentFamily model
- Agent/web_scraping/enhanced_processor.py - Family detection
- Agent/rag_enhanced/family_aware_retriever.py - Family search
How it works:
- Document scraped: "NEP 2020 Final.pdf"
- Check: Does family exist for "NEP 2020"?
- If YES: Add to family, increment version
- If NO: Create new family
- Mark previous version as superseded
- Update family centroid embedding
Benefits:
- Avoid duplicate results in search
- Always retrieve latest version
- Track document evolution
- Better RAG accuracy
BEACON/
├── backend/ # Backend API
│ ├── main.py # FastAPI app
│ ├── database.py # Database models
│ ├── routers/ # API endpoints
│ │ ├── auth_router.py # Authentication
│ │ ├── document_router.py # Document management
│ │ ├── chat_router.py # AI chat
│ │ └── enhanced_web_scraping_router.py # Web scraping
│ ├── utils/ # Utilities
│ │ ├── text_extractor.py # Text extraction
│ │ └── supabase_storage.py # S3 upload
│ └── constants/ # Constants
│ └── roles.py # Role definitions
│
├── Agent/ # AI Agent System
│ ├── web_scraping/ # Web scraping
│ │ ├── enhanced_processor.py # Main scraper
│ │ ├── site_scrapers/ # Site-specific scrapers
│ │ │ ├── moe_scraper.py # Ministry of Education
│ │ │ ├── ugc_scraper.py # UGC
│ │ │ └── aicte_scraper.py # AICTE
│ │ ├── document_identity_manager.py # Deduplication
│ │ └── sliding_window_manager.py # Pagination
│ │
│ ├── metadata/ # Metadata extraction
│ │ ├── extractor.py # AI metadata extraction
│ │ └── reranker.py # Result reranking
│ │
│ ├── embeddings/ # Embeddings
│ │ ├── bge_embedder.py # BGE-M3 embeddings
│ │ └── embedding_config.py # Config
│ │
│ ├── vector_store/ # Vector storage
│ │ └── pgvector_store.py # PostgreSQL + pgvector
│ │
│ ├── lazy_rag/ # Lazy RAG
│ │ └── lazy_embedder.py # On-demand embedding
│ │
│ ├── rag_agent/ # RAG agent
│ │ └── react_agent.py # ReAct agent
│ │
│ ├── rag_enhanced/ # Enhanced RAG
│ │ └── family_aware_retriever.py # Family search
│ │
│ ├── document_families/ # Document families
│ │ └── family_manager.py # Family management
│ │
│ ├── tools/ # Agent tools
│ │ └── lazy_search_tools.py # Search tools
│ │
│ └── chunking/ # Text chunking
│ └── adaptive_chunker.py # Smart chunking
│
├── frontend/ # React Frontend
│ ├── src/
│ │ ├── App.jsx # Main app
│ │ ├── pages/ # Pages
│ │ │ ├── admin/
│ │ │ │ ├── EnhancedWebScrapingPage.jsx
│ │ │ │ └── DocumentManagementPage.jsx
│ │ │ ├── ChatPage.jsx
│ │ │ └── auth/
│ │ │ └── LoginPage.jsx
│ │ ├── components/ # Components
│ │ │ └── layout/
│ │ │ └── Sidebar.jsx
│ │ └── services/
│ │ └── api.js # API client
│ └── package.json
│
├── .env # Environment variables
├── requirements.txt # Python dependencies
└── README.md # Documentation
Step-by-Step:
1. User Initiates Scraping (Frontend)
   - File: frontend/src/pages/admin/EnhancedWebScrapingPage.jsx
   - User clicks the "Scrape" button for a source
   - Sends a POST request to /api/enhanced-web-scraping/scrape-enhanced

2. API Receives Request (Backend)
   - File: backend/routers/enhanced_web_scraping_router.py
   - Validates that the user is an admin
   - Creates a job ID for tracking
   - Calls enhanced_scrape_source()

3. Scraping Orchestration (Agent)
   - File: Agent/web_scraping/enhanced_processor.py
   - Function: enhanced_scrape_source()
   - Gets source details from the database
   - Selects the appropriate site scraper

4. Site-Specific Scraping
   - File: Agent/web_scraping/site_scrapers/moe_scraper.py (or ugc/aicte)
   - Function: scrape_documents()
   - Visits the website
   - Finds document links using CSS selectors
   - Handles pagination (sliding window)

5. Document Download
   - File: Agent/web_scraping/enhanced_processor.py
   - Function: _download_and_process_document()
   - Downloads the PDF/DOC file
   - Saves it to a temp directory

6. Deduplication Check
   - File: Agent/web_scraping/document_identity_manager.py
   - Function: is_duplicate()
   - Checks 3 levels: URL match, content hash (SHA256), normalized URL
   - If duplicate: skip or update

7. Upload to S3
   - File: backend/utils/supabase_storage.py
   - Function: upload_to_supabase()
   - Uploads the file to the Supabase bucket
   - Returns the public URL

8. Text Extraction
   - File: backend/utils/text_extractor.py
   - Function: extract_text()
   - Extracts text from the PDF/DOC
   - Falls back to OCR if needed

9. Metadata Extraction (AI)
   - File: Agent/metadata/extractor.py
   - Function: extract_metadata()
   - Uses Google Gemini (gemma-3-12b)
   - Extracts title, department, type, date, summary, keywords
   - Validates quality

10. Family Detection
    - File: Agent/document_families/family_manager.py
    - Function: find_or_create_family()
    - Checks whether the document belongs to an existing family
    - Creates a new family or adds to the existing one
    - Updates version numbers

11. Save to Database
    - File: Agent/web_scraping/enhanced_processor.py
    - Creates the Document record
    - Creates the DocumentMetadata record
    - Links to the DocumentFamily
    - Sets the approval status

12. Return Results
    - File: backend/routers/enhanced_web_scraping_router.py
    - Returns stats: documents_new, documents_updated, documents_duplicate, families_created, families_updated

13. Frontend Updates
    - File: frontend/src/pages/admin/EnhancedWebScrapingPage.jsx
    - Displays a success toast with stats
    - Refreshes the document list
    - Updates counters
Files Touched (in order):
1. EnhancedWebScrapingPage.jsx (UI)
2. enhanced_web_scraping_router.py (API)
3. enhanced_processor.py (Orchestrator)
4. moe_scraper.py (Site scraper)
5. document_identity_manager.py (Dedup)
6. supabase_storage.py (S3 upload)
7. text_extractor.py (Text extraction)
8. extractor.py (Metadata AI)
9. family_manager.py (Families)
10. database.py (Save)
11. enhanced_web_scraping_router.py (Response)
12. EnhancedWebScrapingPage.jsx (Display)
Step-by-Step:
1. User Asks Question (Frontend)
   - File: frontend/src/pages/ChatPage.jsx
   - User types: "What is the Indo-Norwegian program?"
   - Sends a POST request to /api/chat/query

2. Chat API Receives Request (Backend)
   - File: backend/routers/chat_router.py
   - Validates user authentication
   - Gets the user's role and institution
   - Calls the RAG agent

3. RAG Agent Initialization (Agent)
   - File: Agent/rag_agent/react_agent.py
   - Function: PolicyRAGAgent.__init__()
   - Initializes the LLM (OpenRouter - Llama 3.3 70B)
   - Sets up the search tools

4. Agent Processes Query
   - File: Agent/rag_agent/react_agent.py
   - Function: query()
   - Analyzes the question
   - Decides which tool to use

5. Search Tool Invoked
   - File: Agent/tools/lazy_search_tools.py
   - Function: search_documents_lazy()
   - Applies role-based filters

6. Check Embedding Status
   - File: Agent/tools/lazy_search_tools.py
   - Queries the database for unembedded documents
   - Filters by user access permissions

7. Metadata-Based Ranking (if unembedded docs exist)
   - File: Agent/tools/lazy_search_tools.py
   - Uses BM25 to rank documents by metadata (title, keywords, summary)
   - Selects the top 3 most relevant

8. Lazy Embedding (On-demand)
   - File: Agent/lazy_rag/lazy_embedder.py
   - Function: embed_document()
   - Chunks the text (adaptive chunking)
   - Generates embeddings (BGE-M3)
   - Stores them in pgvector

9. Vector Search
   - File: Agent/vector_store/pgvector_store.py
   - Function: search()
   - Generates the query embedding
   - Searches pgvector (cosine similarity)
   - Applies role-based filters
   - Returns the top matches

10. Hybrid Search (Optional)
    - File: Agent/retrieval/hybrid_retriever.py
    - Combines vector search (70%) + BM25 (30%)
    - Merges and ranks results

11. Reranking (Optional)
    - File: Agent/metadata/reranker.py
    - Function: rerank()
    - Uses an LLM to rerank by relevance
    - Returns the top 5 results

12. Answer Generation
    - File: Agent/rag_agent/react_agent.py
    - The LLM receives the question + relevant chunks
    - Generates an answer with citations
    - Formats the response

13. Return to User
    - File: backend/routers/chat_router.py
    - Returns the formatted answer
    - Includes source citations, document IDs, and approval status

14. Frontend Displays Answer
    - File: frontend/src/pages/ChatPage.jsx
    - Shows the answer with citations
    - Links to source documents
    - Displays approval badges
Files Touched (in order):
1. ChatPage.jsx (UI)
2. chat_router.py (API)
3. react_agent.py (RAG agent)
4. lazy_search_tools.py (Search)
5. lazy_embedder.py (Embedding)
6. adaptive_chunker.py (Chunking)
7. bge_embedder.py (Embeddings)
8. pgvector_store.py (Vector search)
9. reranker.py (Reranking)
10. react_agent.py (Answer generation)
11. chat_router.py (Response)
12. ChatPage.jsx (Display)
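In the answer-generation step, retrieved chunks are typically numbered in the prompt so the LLM can cite them by source. A hypothetical sketch of that assembly (the actual prompt format in react_agent.py may differ):

```python
def build_prompt(question: str, chunks: list) -> str:
    """chunks: [{'doc_id': int, 'text': str}]. Number each chunk so the
    LLM can cite sources as [1], [2], ... in its answer."""
    context = "\n\n".join(
        f"[{i}] (doc {c['doc_id']}) {c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer using ONLY the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```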
Step-by-Step:
1. User Uploads File (Frontend)
   - File: frontend/src/pages/admin/DocumentManagementPage.jsx
   - User selects a PDF file
   - Fills in the metadata form
   - Clicks "Upload"

2. API Receives File (Backend)
   - File: backend/routers/document_router.py
   - Function: upload_document()
   - Validates the file type and size
   - Checks user permissions

3. Save Temporary File
   - File: backend/routers/document_router.py
   - Saves to a temp directory
   - Generates a unique filename

4. Text Extraction
   - File: backend/utils/text_extractor.py
   - Function: extract_text()
   - For PDF: uses PyPDF2
   - If quality is low: falls back to Tesseract OCR
   - For DOCX: uses python-docx
   - Returns the extracted text

5. Upload to S3
   - File: backend/utils/supabase_storage.py
   - Function: upload_to_supabase()
   - Uploads to the Supabase bucket
   - Returns the public URL

6. Metadata Extraction (AI)
   - File: Agent/metadata/extractor.py
   - Function: extract_metadata()
   - Uses Gemini (gemma-3-12b)
   - Extracts structured metadata
   - Validates quality

7. Create Database Records
   - File: backend/routers/document_router.py
   - Creates the Document record: filename, file_type, s3_url, extracted_text, uploader_id, approval_status, visibility_level
   - Creates the DocumentMetadata record: title, department, document_type, date_published, keywords, summary

8. Family Assignment (Optional)
   - File: Agent/document_families/family_manager.py
   - Checks for an existing family
   - Assigns to the family or creates a new one

9. Return Success
   - File: backend/routers/document_router.py
   - Returns the document ID and metadata

10. Frontend Updates
    - File: frontend/src/pages/admin/DocumentManagementPage.jsx
    - Shows a success message
    - Refreshes the document list
    - Clears the form
Files Touched (in order):
1. DocumentManagementPage.jsx (UI)
2. document_router.py (API)
3. text_extractor.py (Text extraction)
4. supabase_storage.py (S3 upload)
5. extractor.py (Metadata AI)
6. family_manager.py (Families)
7. database.py (Save)
8. document_router.py (Response)
9. DocumentManagementPage.jsx (Display)
pgvector
- What: PostgreSQL extension for vector similarity search
- Why: Fast semantic search without a separate vector DB
- How: Stores 1024-dim embeddings, uses cosine similarity

BGE-M3
- What: Multilingual embedding model
- Why: Supports 100+ languages including Hindi
- How: Converts text to 1024-dim vectors

Lazy RAG
- What: Embed documents on-demand
- Why: Saves compute; only embed what's needed
- How: Rank by metadata, embed top matches

Document Families
- What: Group related documents
- Why: Track versions, avoid duplicates
- How: Similarity matching + version tracking

Hybrid Search
- What: Vector + keyword search
- Why: Better accuracy than either alone
- How: 70% vector + 30% BM25

Role-Based Access Control
- What: Different permissions per role
- Why: Security and privacy
- How: Filter queries by role and institution
# Database
DATABASE_HOSTNAME=aws-1-ap-south-1.pooler.supabase.com
DATABASE_PORT=5432
DATABASE_NAME=postgres
DATABASE_USERNAME=postgres.ppqdbqzlfxddfroxlycx
DATABASE_PASSWORD=<your-database-password>
# AI Models
GOOGLE_API_KEY=<your-google-api-key>
OPENROUTER_API_KEY=sk-or-v1-288a791142fc9234...
OPENROUTER_MODEL=meta-llama/llama-3.3-70b-instruct:free
# LLM Providers
METADATA_LLM_PROVIDER=gemini # gemma-3-12b (14,400/day)
RAG_LLM_PROVIDER=openrouter # Llama 3.3 (200/day)
RERANKER_PROVIDER=openrouter # Llama 3.3 (200/day)
# Storage
SUPABASE_URL=https://ppqdbqzlfxddfroxlycx.supabase.co
SUPABASE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
SUPABASE_BUCKET_NAME=Docs
# Quality Control
DELETE_DOCS_WITHOUT_METADATA=false
REQUIRE_TITLE=false
REQUIRE_SUMMARY=false

requirements.txt:

fastapi==0.104.1
uvicorn==0.24.0
sqlalchemy==2.0.23
psycopg2-binary==2.9.9
pgvector==0.2.3
langchain==0.1.0
langchain-google-genai==0.0.6
langchain-openai==0.0.2
sentence-transformers==2.2.2
rank-bm25==0.2.2
python-multipart==0.0.6
python-jose==3.3.0
passlib==1.7.4
bcrypt==4.1.1
httpx==0.25.2
beautifulsoup4==4.12.2
PyPDF2==3.0.1
python-docx==1.1.0
pytesseract==0.3.10
Pillow==10.1.0

- Speed: ~10 documents/minute
- Quota: 14,400 documents/day
- Deduplication: 33.5% duplicates found
- Success Rate: 95%+
- First search: 2-5 seconds (with lazy embedding)
- Subsequent searches: <1 second
- Accuracy: 85% (family-aware retrieval)
- Quota: 200 queries/day
- Database: PostgreSQL (Supabase)
- Files: Supabase S3 bucket
- Embeddings: pgvector (in PostgreSQL)
- Total docs: 1779+ documents
Check:
- Agent/web_scraping/enhanced_processor.py - Logs
- Site scraper (moe/ugc/aicte) - CSS selectors
- Metadata extraction - API quota
Check:
- Agent/rag_agent/react_agent.py - LLM initialization
- .env - RAG_LLM_PROVIDER setting
- OpenRouter API key validity
Check:
- Agent/vector_store/pgvector_store.py - Embeddings exist
- Role-based filters - User permissions
- Document approval status
Check:
- Agent/metadata/extractor.py - API errors
- Google API quota (14,400/day)
- Fallback to OpenRouter
BEACON is a comprehensive document management and RAG system with:
✅ Automated web scraping from government sites
✅ AI-powered metadata extraction (14,400/day quota)
✅ Intelligent chat with document citations
✅ Role-based access control for security
✅ Document families for version tracking
✅ Lazy RAG for efficient embedding
✅ Hybrid search for better accuracy
✅ Multilingual support (100+ languages)
Total Capacity: 14,600 operations/day
Tech Stack: FastAPI + React + PostgreSQL + AI
Status: Production-ready ✅
For questions or support, refer to individual file documentation or contact the development team.