A Flask + MongoDB search engine with TF-IDF ranking, real-time scraping (SSE), type-ahead suggestions, and a vanilla JS frontend. Now with separate scraping processes for improved performance and reliability.
Prerequisites
- Python 3.10+
- MongoDB 5+
Setup
# In project root
python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# macOS/Linux
# source .venv/bin/activate
pip install -r requirements.txt
# Install Playwright browsers (required for scraping)
python setup_playwright.py
# Create .env (see below) then start
python backend/app.py
# Or use the startup script (recommended)
python start_server.py
Create .env in the project root:
MONGO_URI=mongodb://localhost:27017/datasaurus
DB_NAME=datasaurus
ADMIN_USERNAME=admin
ADMIN_PASSWORD=dinosaurus123
DOC_COLLECTION=documents
SCRAPE_LOG_COLLECTION=scrape_logs
HISTORY_COLLECTION=history
AUDIT_COLLECTION=audit_log
SEARCH_QUERIES_COLLECTION=search_queries
SUGGESTIONS_SEED_COLLECTION=suggestions_seed
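These variables would typically be read once at startup. A minimal sketch of such a loader (the defaults shown are assumptions for illustration, not necessarily what backend/config.py actually uses):

```python
import os

def load_settings() -> dict:
    """Read the .env-style settings documented above from the environment.
    Defaults mirror the example values; treat them as assumptions."""
    return {
        "mongo_uri": os.getenv("MONGO_URI", "mongodb://localhost:27017/datasaurus"),
        "db_name": os.getenv("DB_NAME", "datasaurus"),
        "doc_collection": os.getenv("DOC_COLLECTION", "documents"),
        "admin_username": os.getenv("ADMIN_USERNAME", "admin"),
    }
```

In the real app a library like python-dotenv would load the .env file into the environment before this runs.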
- Search page: query input, tabs (All/News/Images), suggestions dropdown.
- Admin page: login with ADMIN_USERNAME/ADMIN_PASSWORD.
- Web Scraping: start/cancel scraping with live progress and process monitoring.
- History & Analytics: load/export/clear history; accuracy report; system stats.
- Import/Export: import JSON; export all docs.
- Suggestions: upload seed CSV (column: phrase), rebuild index.
The scraping system now runs as separate processes, providing several benefits:
- Non-blocking: Search functionality remains fully available while scraping runs
- Process isolation: Scraping failures don't affect the main application
- Better resource management: Each scraping job runs in its own process
- Process monitoring: Real-time status updates and PID tracking
- Graceful shutdown: Proper cleanup when processes are cancelled
- Scraping processes are spawned using subprocess.Popen
- Process status monitoring via the /api/scrape/status endpoint
- Automatic cleanup when processes complete or are cancelled
- SSE (Server-Sent Events) for real-time progress updates
- Process ID tracking for better debugging and monitoring
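The process lifecycle described above can be sketched roughly as follows; the job registry, function names, and command arguments here are hypothetical, and the real server's wiring may differ:

```python
import subprocess
import sys

# Hypothetical in-memory registry of scraping jobs, keyed by PID.
JOBS: dict[int, subprocess.Popen] = {}

def start_scrape(cmd: list[str]) -> int:
    """Spawn a scraping job in its own process and track its PID."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    JOBS[proc.pid] = proc
    return proc.pid

def scrape_status(pid: int) -> str:
    """What /api/scrape/status might report for one job."""
    proc = JOBS.get(pid)
    if proc is None:
        return "unknown"
    return "running" if proc.poll() is None else f"exited({proc.returncode})"

def cancel_scrape(pid: int) -> None:
    """Graceful shutdown: terminate the process and wait for cleanup."""
    proc = JOBS.get(pid)
    if proc and proc.poll() is None:
        proc.terminate()
        proc.wait(timeout=10)
```

Because the scraper runs in a separate process, an exception or hang there cannot block the Flask worker serving search requests.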
Base URL: http://localhost:5000/api
Authentication header for protected endpoints:
X-Admin-Token: <token>
POST /login
- Body: { "username": "admin", "password": "change_me" }
- Response: { "token": "..." }
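A client-side sketch of this login exchange, assuming only the request and response shapes documented above (the helper names are illustrative, not part of the project):

```python
import json
from urllib import request

def build_login_request(base_url: str, username: str, password: str) -> request.Request:
    """Build the POST /login request described above."""
    body = json.dumps({"username": username, "password": password}).encode()
    return request.Request(
        f"{base_url}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def token_from_response(raw: bytes) -> str:
    """Extract the admin token from the JSON response body."""
    return json.loads(raw)["token"]
```

The returned token is then sent as the X-Admin-Token header on protected endpoints.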
Test
curl -s -X POST http://localhost:5000/api/login \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"change_me"}'
POST /search
- Body: { "query": "tyrannosaurus" }
POST /search/news
- Body: { "query": "dinosaurs" }
POST /search/images
- Body: { "query": "fossils" }
Test
curl -s -X POST http://localhost:5000/api/search \
-H "Content-Type: application/json" \
-d '{"query":"tyrannosaurus"}'
GET /suggestions?q=&limit=5
Test
curl -s "http://localhost:5000/api/suggestions?q=tyr&limit=5"
POST /scrape
- Example body:
{
"max_pages": 20,
"max_depth": 2,
"concurrency": 5,
"timeout": 12,
"retries": 2,
"ignore_existing": false
}
POST /scrape/cancel
GET /scrape/status
GET /scrape-history
GET /documents
GET /stream (SSE)
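The /stream endpoint speaks standard Server-Sent Events, so a consumer only needs a small parser for `event:`/`data:` lines. A minimal sketch (not the project's frontend code):

```python
def parse_sse(chunk: str) -> list[tuple[str, str]]:
    """Parse SSE text into (event, data) pairs.
    A blank line terminates each event; the default event name is 'message'."""
    events = []
    event, data = "message", []
    for line in chunk.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":
            if data:
                events.append((event, "\n".join(data)))
            event, data = "message", []
    return events
```

In the browser, the same work is done by the built-in EventSource API, which is presumably what the Admin page uses for live progress.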
Test
TOKEN=$(curl -s -X POST http://localhost:5000/api/login -H "Content-Type: application/json" -d '{"username":"admin","password":"change_me"}' | python -c "import sys,json;print(json.load(sys.stdin)['token'])")
curl -s -X POST http://localhost:5000/api/scrape \
-H "X-Admin-Token: $TOKEN" \
-H "Content-Type: application/json" \
-d '{"max_pages":10,"max_depth":2,"concurrency":5,"timeout":12,"retries":2,"ignore_existing":true}'
# Check scraping process status
curl -s -H "X-Admin-Token: $TOKEN" http://localhost:5000/api/scrape/status
# View live events (may stream indefinitely)
curl -N http://localhost:5000/api/stream | cat
GET /export
POST /import?mode=merge|replace&confirm=OVERWRITE
- multipart form-data with key: file
Test export
curl -s -H "X-Admin-Token: $TOKEN" http://localhost:5000/api/export
Test import
# export.json should be like: { "docs": [ {"url":"https://...","title":"...","content":"...","tokens":["..."]} ] }
curl -s -X POST "http://localhost:5000/api/import?mode=merge" \
-H "X-Admin-Token: $TOKEN" \
-F file=@export.json
GET /scrape-history
GET /history/export
DELETE /history/clear
Test clear
curl -s -X DELETE http://localhost:5000/api/history/clear -H "X-Admin-Token: $TOKEN"
POST /suggestions/seed
- CSV upload with column: phrase (multipart key: file)
POST /suggestions/rebuild
Test seed (CSV)
curl -s -X POST http://localhost:5000/api/suggestions/seed \
-H "X-Admin-Token: $TOKEN" \
-F file=@backend/data/suggestions_seed.csv
Follow these steps after creating and activating your virtual environment:
- Install Python packages (includes Playwright):
pip install -r requirements.txt
- Install Playwright browser binaries (one-time):
# Recommended
python setup_playwright.py
# Alternative (Windows-compatible)
python install_playwright_browsers.py
# Manual fallback
python -m playwright install
# Linux only (system dependencies)
python -m playwright install-deps
# Install only Chromium (optional)
python -m playwright install chromium
- Verify the installation:
python test_playwright.py
If you see successes for the test URLs, Playwright is ready for scraping.
The scraper now uses Playwright for headless browser automation. Configure in backend/config.py:
# Playwright settings
PLAYWRIGHT_HEADLESS = True # Run browser in headless mode
PLAYWRIGHT_BROWSER = "chromium" # chromium, firefox, webkit
PLAYWRIGHT_VIEWPORT = {"width": 1920, "height": 1080}
PLAYWRIGHT_WAIT_FOR_LOAD = "networkidle" # load, domcontentloaded, networkidle
PLAYWRIGHT_SCREENSHOT = False # Enable for debugging
PLAYWRIGHT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
PLAYWRIGHT_USE_CHROME = True # Use installed Chrome instead of bundled shell
Benefits of Playwright:
- JavaScript rendering for dynamic content
- Better handling of SPAs and modern websites
- More reliable than requests-based scraping
- Screenshot capability for debugging
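As a rough illustration, the config values above map onto Playwright's launch and navigation arguments along these lines. The constant values are copied from the snippet above; mapping PLAYWRIGHT_USE_CHROME to `channel="chrome"` is an assumption about how the scraper selects the installed Chrome build:

```python
# Constants mirroring backend/config.py (values copied from the README).
PLAYWRIGHT_HEADLESS = True
PLAYWRIGHT_BROWSER = "chromium"
PLAYWRIGHT_VIEWPORT = {"width": 1920, "height": 1080}
PLAYWRIGHT_WAIT_FOR_LOAD = "networkidle"
PLAYWRIGHT_USE_CHROME = True

def launch_kwargs() -> dict:
    """Arguments for browser_type.launch(); channel='chrome' picks
    the locally installed Chrome instead of the bundled browser."""
    kwargs = {"headless": PLAYWRIGHT_HEADLESS}
    if PLAYWRIGHT_BROWSER == "chromium" and PLAYWRIGHT_USE_CHROME:
        kwargs["channel"] = "chrome"
    return kwargs

def goto_kwargs() -> dict:
    """Arguments for page.goto(); wait_until controls when navigation
    is considered finished (load, domcontentloaded, networkidle)."""
    return {"wait_until": PLAYWRIGHT_WAIT_FOR_LOAD}
```

The scraper would then call roughly `p.chromium.launch(**launch_kwargs())` and `page.goto(url, **goto_kwargs())` inside a `sync_playwright()` context.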
- Expand synonym normalization in backend/api/routes.py
- Boost title/URL influence in backend/representation/tfidf.py
- Curate sources and adjust backend/data/stopwords.csv
- TF-IDF is rebuilt automatically after scrape/import
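To make the title-boost idea concrete, here is a toy TF-IDF ranker that weights title occurrences more heavily than body occurrences. It is an illustrative sketch, not the code in backend/representation/tfidf.py:

```python
import math
from collections import Counter

def tfidf_rank(query_tokens, docs, title_boost=2.0):
    """Rank doc ids by TF-IDF score against query_tokens.
    docs maps doc_id -> {"title": [tokens], "body": [tokens]}."""
    n = len(docs)
    # Document frequency: a term counts if it appears in title or body.
    df = Counter()
    for d in docs.values():
        for t in set(d["title"]) | set(d["body"]):
            df[t] += 1
    scores = {}
    for doc_id, d in docs.items():
        tf = Counter(d["body"])
        for t in d["title"]:
            tf[t] += title_boost  # boost terms appearing in the title
        score = 0.0
        for q in query_tokens:
            if q in tf:
                idf = math.log((1 + n) / (1 + df[q])) + 1  # smoothed IDF
                score += tf[q] * idf
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Raising title_boost is one knob for increasing title influence; the project's tfidf.py may expose this differently.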
Test that searching works while scraping is running:
# Run the concurrent operations test
python test_concurrent_operations.py
This test will:
- Start a scraping process
- Perform multiple searches concurrently
- Verify that searches complete successfully while scraping runs
- Report success rates and performance metrics
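The shape of such a check, stripped to its essentials (dummy callables stand in for real HTTP search/scrape requests; the actual test script likely does more):

```python
import threading
import time

def run_concurrency_check(search, scrape, n_searches=5):
    """Run searches while a scrape is in flight; return the success rate."""
    results = []
    scraper = threading.Thread(target=scrape)
    scraper.start()
    for _ in range(n_searches):
        try:
            search()          # should succeed even while scraping runs
            results.append(True)
        except Exception:
            results.append(False)
    scraper.join()
    return sum(results) / len(results)
```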
- 404 warnings during scrape: page not found; scraper skips and continues.
- Progress bar not moving: keep Admin page open (SSE connected); avoid proxy buffering.
- Missing env: ensure .env exists at project root with required variables.
- Process not starting: check that psutil is installed (pip install psutil).
- Scraping process hangs: use the cancel button or restart the server.
- backend/ ... Flask API, scraping, TF-IDF, suggestions
- scr/ ... frontend SPA (vanilla JS)
- requirements.txt
For educational use. Respect site policies when scraping.
Created by:
- Castro, Fredderico
- Caingcoy, Christian
- Aparri, Serg Michael
- Concepcion, Paul Dexter