A Flask + MongoDB search engine with TF-IDF ranking, real-time scraping (SSE), type-ahead suggestions, and a vanilla JS frontend. Now with separate scraping processes for improved performance and reliability.
Prerequisites
- Python 3.10+
- MongoDB 5+
Setup
# In project root
python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# macOS/Linux
# source .venv/bin/activate
pip install -r requirements.txt
# Install Playwright browsers (required for scraping)
python setup_playwright.py
# Create .env (see below) then start
python backend/app.py
# Or use the startup script (recommended)
python start_server.py
Create .env in the project root:
MONGO_URI=mongodb://localhost:27017/datasaurus
DB_NAME=datasaurus
ADMIN_USERNAME=admin
ADMIN_PASSWORD=dinosaurus123
DOC_COLLECTION=documents
SCRAPE_LOG_COLLECTION=scrape_logs
HISTORY_COLLECTION=history
AUDIT_COLLECTION=audit_log
SEARCH_QUERIES_COLLECTION=search_queries
SUGGESTIONS_SEED_COLLECTION=suggestions_seed
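These variables would typically be read once at startup. A minimal sketch of such a loader (the defaults shown are assumptions for illustration, not necessarily what backend/config.py actually uses):

```python
import os

def load_settings() -> dict:
    """Read the .env-style settings documented above from the environment.
    Defaults mirror the example values; treat them as assumptions."""
    return {
        "mongo_uri": os.getenv("MONGO_URI", "mongodb://localhost:27017/datasaurus"),
        "db_name": os.getenv("DB_NAME", "datasaurus"),
        "doc_collection": os.getenv("DOC_COLLECTION", "documents"),
        "admin_username": os.getenv("ADMIN_USERNAME", "admin"),
    }
```

In the real app a library like python-dotenv would load the .env file into the environment before this runs.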
- Search page: query input, tabs (All/News/Images), suggestions dropdown.
- Admin page: login with ADMIN_USERNAME/ADMIN_PASSWORD.
- Web Scraping: start/cancel scraping with live progress and process monitoring.
- History & Analytics: load/export/clear history; accuracy report; system stats.
- Import/Export: import JSON; export all docs.
- Suggestions: upload seed CSV (column: phrase), rebuild index.
The scraping system now runs as separate processes, providing several benefits:
- Non-blocking: Search functionality remains fully available while scraping runs
- Process isolation: Scraping failures don't affect the main application
- Better resource management: Each scraping job runs in its own process
- Process monitoring: Real-time status updates and PID tracking
- Graceful shutdown: Proper cleanup when processes are cancelled
- Scraping processes are spawned using subprocess.Popen
- Process status monitoring via the /api/scrape/status endpoint
- Automatic cleanup when processes complete or are cancelled
- SSE (Server-Sent Events) for real-time progress updates
- Process ID tracking for better debugging and monitoring
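The process lifecycle described above can be sketched roughly as follows; the job registry, function names, and command arguments here are hypothetical, and the real server's wiring may differ:

```python
import subprocess
import sys

# Hypothetical in-memory registry of scraping jobs, keyed by PID.
JOBS: dict[int, subprocess.Popen] = {}

def start_scrape(cmd: list[str]) -> int:
    """Spawn a scraping job in its own process and track its PID."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    JOBS[proc.pid] = proc
    return proc.pid

def scrape_status(pid: int) -> str:
    """What /api/scrape/status might report for one job."""
    proc = JOBS.get(pid)
    if proc is None:
        return "unknown"
    return "running" if proc.poll() is None else f"exited({proc.returncode})"

def cancel_scrape(pid: int) -> None:
    """Graceful shutdown: terminate the process and wait for cleanup."""
    proc = JOBS.get(pid)
    if proc and proc.poll() is None:
        proc.terminate()
        proc.wait(timeout=10)
```

Because the scraper runs in a separate process, an exception or hang there cannot block the Flask worker serving search requests.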
Base URL: http://localhost:5000/api
Authentication header for protected endpoints:
X-Admin-Token: <token>
POST /login
- Body: { "username": "admin", "password": "change_me" }
- Response: { "token": "..." }
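A client-side sketch of this login exchange, assuming only the request and response shapes documented above (the helper names are illustrative, not part of the project):

```python
import json
from urllib import request

def build_login_request(base_url: str, username: str, password: str) -> request.Request:
    """Build the POST /login request described above."""
    body = json.dumps({"username": username, "password": password}).encode()
    return request.Request(
        f"{base_url}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def token_from_response(raw: bytes) -> str:
    """Extract the admin token from the JSON response body."""
    return json.loads(raw)["token"]
```

The returned token is then sent as the X-Admin-Token header on protected endpoints.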
Test
curl -s -X POST http://localhost:5000/api/login \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"change_me"}'
POST /search
- Body: { "query": "tyrannosaurus" }
POST /search/news
- Body: { "query": "dinosaurs" }
POST /search/images
- Body: { "query": "fossils" }
Test
curl -s -X POST http://localhost:5000/api/search \
-H "Content-Type: application/json" \
-d '{"query":"tyrannosaurus"}'
GET /suggestions?q=&limit=5
Test
curl -s "http://localhost:5000/api/suggestions?q=tyr&limit=5"
POST /scrape
- Example body:
{
"max_pages": 20,
"max_depth": 2,
"concurrency": 5,
"timeout": 12,
"retries": 2,
"ignore_existing": false
}
POST /scrape/cancel
GET /scrape/status
GET /scrape-history
GET /documents
GET /stream (SSE)
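The /stream endpoint speaks standard Server-Sent Events, so a consumer only needs a small parser for `event:`/`data:` lines. A minimal sketch (not the project's frontend code):

```python
def parse_sse(chunk: str) -> list[tuple[str, str]]:
    """Parse SSE text into (event, data) pairs.
    A blank line terminates each event; the default event name is 'message'."""
    events = []
    event, data = "message", []
    for line in chunk.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":
            if data:
                events.append((event, "\n".join(data)))
            event, data = "message", []
    return events
```

In the browser, the same work is done by the built-in EventSource API, which is presumably what the Admin page uses for live progress.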
Test
TOKEN=$(curl -s -X POST http://localhost:5000/api/login -H "Content-Type: application/json" -d '{"username":"admin","password":"change_me"}' | python -c "import sys,json;print(json.load(sys.stdin)['token'])")
curl -s -X POST http://localhost:5000/api/scrape \
-H "X-Admin-Token: $TOKEN" \
-H "Content-Type: application/json" \
-d '{"max_pages":10,"max_depth":2,"concurrency":5,"timeout":12,"retries":2,"ignore_existing":true}'
# Check scraping process status
curl -s -H "X-Admin-Token: $TOKEN" http://localhost:5000/api/scrape/status
# View live events (may stream indefinitely)
curl -N http://localhost:5000/api/stream | cat
GET /export
POST /import?mode=merge|replace&confirm=OVERWRITE
- multipart form-data with key: file
Test export
curl -s -H "X-Admin-Token: $TOKEN" http://localhost:5000/api/export
Test import
# export.json should be like: { "docs": [ {"url":"https://...","title":"...","content":"...","tokens":["..."]} ] }
curl -s -X POST "http://localhost:5000/api/import?mode=merge" \
-H "X-Admin-Token: $TOKEN" \
-F file=@export.json
GET /scrape-history
GET /history/export
DELETE /history/clear
Test clear
curl -s -X DELETE http://localhost:5000/api/history/clear -H "X-Admin-Token: $TOKEN"
POST /suggestions/seed
- CSV upload with column: phrase (multipart key: file)
POST /suggestions/rebuild
Test seed (CSV)
curl -s -X POST http://localhost:5000/api/suggestions/seed \
-H "X-Admin-Token: $TOKEN" \
-F file=@backend/data/suggestions_seed.csv
Follow these steps after creating and activating your virtual environment:
- Install Python packages (includes Playwright):
pip install -r requirements.txt
- Install Playwright browser binaries (one-time):
# Recommended
python setup_playwright.py
# Alternative (Windows-compatible)
python install_playwright_browsers.py
# Manual fallback
python -m playwright install
# Linux only (system dependencies)
python -m playwright install-deps
# Install only Chromium (optional)
python -m playwright install chromium
- Verify the installation:
python test_playwright.py
If you see successes for the test URLs, Playwright is ready for scraping.
The scraper now uses Playwright for headless browser automation. Configure in backend/config.py:
# Playwright settings
PLAYWRIGHT_HEADLESS = True # Run browser in headless mode
PLAYWRIGHT_BROWSER = "chromium" # chromium, firefox, webkit
PLAYWRIGHT_VIEWPORT = {"width": 1920, "height": 1080}
PLAYWRIGHT_WAIT_FOR_LOAD = "networkidle" # load, domcontentloaded, networkidle
PLAYWRIGHT_SCREENSHOT = False # Enable for debugging
PLAYWRIGHT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
PLAYWRIGHT_USE_CHROME = True # Use installed Chrome instead of bundled shell
Benefits of Playwright:
- JavaScript rendering for dynamic content
- Better handling of SPAs and modern websites
- More reliable than requests-based scraping
- Screenshot capability for debugging
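As a rough illustration, the config values above map onto Playwright's launch and navigation arguments along these lines. The constant values are copied from the snippet above; mapping PLAYWRIGHT_USE_CHROME to `channel="chrome"` is an assumption about how the scraper selects the installed Chrome build:

```python
# Constants mirroring backend/config.py (values copied from the README).
PLAYWRIGHT_HEADLESS = True
PLAYWRIGHT_BROWSER = "chromium"
PLAYWRIGHT_VIEWPORT = {"width": 1920, "height": 1080}
PLAYWRIGHT_WAIT_FOR_LOAD = "networkidle"
PLAYWRIGHT_USE_CHROME = True

def launch_kwargs() -> dict:
    """Arguments for browser_type.launch(); channel='chrome' picks
    the locally installed Chrome instead of the bundled browser."""
    kwargs = {"headless": PLAYWRIGHT_HEADLESS}
    if PLAYWRIGHT_BROWSER == "chromium" and PLAYWRIGHT_USE_CHROME:
        kwargs["channel"] = "chrome"
    return kwargs

def goto_kwargs() -> dict:
    """Arguments for page.goto(); wait_until controls when navigation
    is considered finished (load, domcontentloaded, networkidle)."""
    return {"wait_until": PLAYWRIGHT_WAIT_FOR_LOAD}
```

The scraper would then call roughly `p.chromium.launch(**launch_kwargs())` and `page.goto(url, **goto_kwargs())` inside a `sync_playwright()` context.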
- Expand synonym normalization in backend/api/routes.py
- Boost title/URL influence in backend/representation/tfidf.py
- Curate sources and adjust backend/data/stopwords.csv
- TF-IDF is rebuilt automatically after scrape/import
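To make the title-boost idea concrete, here is a toy TF-IDF ranker that weights title occurrences more heavily than body occurrences. It is an illustrative sketch, not the code in backend/representation/tfidf.py:

```python
import math
from collections import Counter

def tfidf_rank(query_tokens, docs, title_boost=2.0):
    """Rank doc ids by TF-IDF score against query_tokens.
    docs maps doc_id -> {"title": [tokens], "body": [tokens]}."""
    n = len(docs)
    # Document frequency: a term counts if it appears in title or body.
    df = Counter()
    for d in docs.values():
        for t in set(d["title"]) | set(d["body"]):
            df[t] += 1
    scores = {}
    for doc_id, d in docs.items():
        tf = Counter(d["body"])
        for t in d["title"]:
            tf[t] += title_boost  # boost terms appearing in the title
        score = 0.0
        for q in query_tokens:
            if q in tf:
                idf = math.log((1 + n) / (1 + df[q])) + 1  # smoothed IDF
                score += tf[q] * idf
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Raising title_boost is one knob for increasing title influence; the project's tfidf.py may expose this differently.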
Test that searching works while scraping is running:
# Run the concurrent operations test
python test_concurrent_operations.py
This test will:
- Start a scraping process
- Perform multiple searches concurrently
- Verify that searches complete successfully while scraping runs
- Report success rates and performance metrics
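The shape of such a check, stripped to its essentials (dummy callables stand in for real HTTP search/scrape requests; the actual test script likely does more):

```python
import threading
import time

def run_concurrency_check(search, scrape, n_searches=5):
    """Run searches while a scrape is in flight; return the success rate."""
    results = []
    scraper = threading.Thread(target=scrape)
    scraper.start()
    for _ in range(n_searches):
        try:
            search()          # should succeed even while scraping runs
            results.append(True)
        except Exception:
            results.append(False)
    scraper.join()
    return sum(results) / len(results)
```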
- 404 warnings during scrape: page not found; scraper skips and continues.
- Progress bar not moving: keep Admin page open (SSE connected); avoid proxy buffering.
- Missing env: ensure .env exists at project root with required variables.
- Process not starting: check that psutil is installed (pip install psutil).
- Scraping process hangs: use the cancel button or restart the server.
- backend/ ... Flask API, scraping, TF-IDF, suggestions
- scr/ ... frontend SPA (vanilla JS)
- requirements.txt
For educational use. Respect site policies when scraping.
Created by:
- Castro, Fredderico
- Caingcoy, Christian
- Aparri, Serg Michael
- Concepcion, Paul Dexter