Universal Web Scraper


Enterprise-grade web scraping framework showcasing modern JavaScript architecture, containerization, and DevOps best practices.

🎯 Portfolio Highlights

This project demonstrates:

  • Modern JavaScript (ES2024, Node.js 22+) with enterprise patterns
  • Scalable Architecture - Plugin-based parser system with dependency injection
  • Production DevOps - Docker containerization, CI/CD pipelines, comprehensive testing
  • Type Safety - Runtime validation with Zod schemas
  • Performance - Concurrent processing with intelligent queue management
  • Observability - Structured logging and monitoring capabilities

🚀 Technical Features

  • 🏗️ Pluggable Architecture - Extensible parser system with auto-discovery
  • 🎭 Browser Automation - Headless Chrome via Playwright with smart pooling
  • ✅ Type-Safe Validation - Runtime schema validation with Zod
  • 📊 Observability - Structured JSON logging with Pino
  • 🔄 Resilient Processing - Built-in queue management, retries, and error handling
  • 🗄️ Database Integration - Cassandra NoSQL for distributed storage, deduplication, and analytics
  • 🐳 Containerized - Multi-stage Docker builds optimized for production
  • 🧪 Test-Driven - Comprehensive Jest test suite with mocking
  • 🔧 DevOps Ready - GitHub Actions CI/CD with automated testing and security scanning

🏗️ Architecture & Tech Stack

Component          Technology             Purpose
Runtime            Node.js 22+            Modern JavaScript runtime
Scraping Engine    Crawlee + Playwright   Enterprise web scraping framework
Type Safety        Zod                    Runtime schema validation
Logging            Pino                   High-performance structured logging
Testing            Jest                   Unit and integration testing
Code Quality       ESLint + Prettier      Automated code formatting & linting
Containerization   Docker + Kubernetes    Multi-stage builds & orchestration
CI/CD              GitHub Actions         Automated testing & deployment

🚀 Quick Start

# Install dependencies
npm install

# Run a single URL
npm start -- --url https://example.com --parser generic-news

# Process multiple URLs from file
npm start -- --file seeds.txt --parser auto

# With Cassandra database integration (optional)
make cassandra-dev                    # Start Cassandra locally
node scripts/cassandra-utils.js seed # Populate initial data
npm start -- --parser generic-news   # Uses database + file fallback

Example Usage

The seeds.txt file supports multiple formats:

https://www.example.com/article1
https://www.example.com/article2
{"url": "https://news.site.com", "parser": "generic-news"}
{"url": "https://weibo.com/user", "parser": "weibo"}

⚙️ Configuration Options

Parameter       Description                              Default
--url           Single URL to process                    -
--file          Path to seeds file                       -
--parser        Parser to use (auto for detection)       auto
--concurrency   Concurrent requests                      2
--maxRequests   Maximum requests to process              unlimited
--delayMin      Minimum delay between requests (ms)      500
--delayMax      Maximum delay between requests (ms)      1500
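
These options can be combined on a single run; for example (the file name and values below are illustrative):

# Process a seeds file with higher concurrency and longer randomized delays
npm start -- --file seeds.txt --parser auto --concurrency 4 --delayMin 1000 --delayMax 3000

# Cap a run at 100 requests with a specific parser
npm start -- --file seeds.txt --parser generic-news --maxRequests 100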

🐳 Docker Deployment

Multi-stage containerization for different environments:

# Build and run production container
make docker-build && make docker-run

# Development with live reload
make docker-dev

# Run test suite in container
make docker-test

Container Architecture

  • Development Stage: Full toolchain with hot reload
  • Testing Stage: Isolated environment for CI/CD
  • Production Stage: Minimal runtime optimized for performance
  • Security Features: Non-root user, health checks, minimal attack surface

☸️ Kubernetes Deployment

Enterprise-ready Kubernetes manifests for cloud deployment:

# Deploy to Kubernetes cluster
kubectl apply -f k8s/

# Or use the convenience script
./k8s/deploy.sh

# Check deployment status
kubectl get pods -n web-scraper
kubectl logs -f deployment/web-scraper -n web-scraper

Kubernetes Features

  • Namespace Isolation - Dedicated namespace for resource organization
  • ConfigMap Management - Environment configuration without secrets
  • Resource Limits - CPU/memory constraints for predictable performance
  • Health Probes - Liveness and readiness checks for reliability
  • Service Discovery - Internal networking and load balancing

🔧 Extending the Framework

Creating Custom Parsers

The plugin architecture allows easy extension for new sites:

import { BaseParser } from '../core/base-parser.js'
import { z } from 'zod'

const MyParserSchema = z.object({
  id: z.string(),
  title: z.string(),
  content: z.string().optional(),
  publishedAt: z.date().optional(),
})

export default class MyCustomParser extends BaseParser {
  id = 'my-site'
  domains = ['example.com']
  schema = MyParserSchema

  async canParse(url) {
    return /example\.com/.test(url)
  }

  async parse(page, ctx) {
    const title = await page.textContent('h1')
    const content = await page.textContent('article')

    return this.schema.parse({
      id: ctx.request.id,
      title,
      content,
      publishedAt: new Date(),
    })
  }
}

Parser Auto-Discovery

The framework automatically registers any parser found in the src/parsers/ directory, enabling plug-and-play support for new sites.
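
As a rough sketch of how directory-based discovery can work (the discoverParsers function, registry map, and ESM assumptions here are illustrative, not the framework's actual API):

// Illustrative sketch of parser auto-discovery; the real framework code may differ.
// Assumes ESM and a parser class exported as default from each file in src/parsers/.
import { readdir } from 'node:fs/promises'
import path from 'node:path'
import { pathToFileURL } from 'node:url'

export async function discoverParsers(dir = 'src/parsers') {
  const registry = new Map()
  for (const file of await readdir(dir)) {
    if (!file.endsWith('.js')) continue
    const mod = await import(pathToFileURL(path.join(dir, file)).href) // dynamic import
    const parser = new mod.default()
    registry.set(parser.id, parser) // keyed by the parser's id field, e.g. 'my-site'
  }
  return registry
}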

⚡ Environment Configuration

Deployment Configuration

The framework uses a single crawlee.json configuration file optimized for all environments:

{
  "persistStorage": false,
  "logLevel": "INFO"
}

The same configuration file works across deployment scenarios:

🎯 Universal Configuration Benefits:

  • Development: Clean runs with no storage persistence
  • Production: Add CRAWLEE_PERSIST_STORAGE=true if storage needed
  • Serverless: Perfect as-is (no storage, minimal footprint)
  • Docker: Works in containers without modification

Environment Overrides

Override configuration through environment variables when needed:

# Enable storage for production if needed
export CRAWLEE_PERSIST_STORAGE=true
export CRAWLEE_LOG_LEVEL=ERROR

# Run with overrides
npm start

Key Configuration Benefits

  • 🚀 No storage by default - Prevents disk bloat, perfect for serverless
  • 📊 Appropriate logging - INFO level provides good visibility
  • 🔧 Environment flexibility - Override via env vars when needed
  • 🐳 Docker-ready - Works in containers without modification

Serverless Deployment

Optimized for AWS Lambda, Google Cloud Functions, and similar platforms:

  • No filesystem persistence required
  • Minimal logging overhead
  • Memory-efficient browser management
  • Environment variable configuration support

🔄 Data Processing

Parser Selection Logic

  1. Forced via the CLI (--parser <id>); pass --parser auto to disable forcing.
  2. Otherwise, the first registered parser whose canParse(url) returns truthy.
  3. Fallback: attempt generic article extraction; if a title is found and the generic-news parser exists, use it (see the sketch below).
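
A minimal sketch of that order (selectParser and the registry shape are illustrative, not the framework's actual API):

// Minimal sketch of the selection order above; not the framework's real code.
async function selectParser(url, registry, forcedId = 'auto') {
  // 1. A parser forced via --parser <id> wins; 'auto' disables forcing.
  if (forcedId !== 'auto' && registry.has(forcedId)) return registry.get(forcedId)

  // 2. Otherwise, the first registered parser whose canParse(url) returns truthy.
  for (const parser of registry.values()) {
    if (await parser.canParse(url)) return parser
  }

  // 3. Fall back to the generic-news parser when it is registered (the real
  //    logic also checks that generic article extraction finds a title first).
  return registry.get('generic-news')
}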

Output Format

Validated JSON objects are printed to stdout (one per line), with structured logs written to stderr:

{
  "id": "article_123",
  "title": "Sample Article",
  "content": "...",
  "publishedAt": "2024-01-01"
}

Results can be easily adapted for various outputs (database, message queue, file storage).
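
Because each result is a single JSON line on stdout, adapting the output is usually a matter of piping it into a small consumer. A hypothetical example (consume.js is not part of the repository):

// consume.js - hypothetical downstream consumer reading NDJSON from stdin,
// e.g. `npm start -- --file seeds.txt 2> scrape.log | node consume.js`
import readline from 'node:readline'

const rl = readline.createInterface({ input: process.stdin })
for await (const line of rl) {
  const article = JSON.parse(line) // one validated object per line
  // Insert into a database, publish to a queue, or append to a file here
  console.error(`stored ${article.id}: ${article.title}`)
}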

🔒 Production Considerations

Rate Limiting & Reliability

  • Configurable concurrency limits (default: 2 concurrent requests)
  • Random delays between requests (500-1500ms) for respectful scraping
  • Built-in retry mechanisms with exponential backoff (delays and backoff are sketched below)
  • Request queue management for large-scale operations
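
As an illustration of the delay behavior described above (generic formulas, not the framework's internal implementation):

// Generic sketch: a randomized inter-request delay within [delayMin, delayMax]
function randomDelay(delayMin = 500, delayMax = 1500) {
  return delayMin + Math.random() * (delayMax - delayMin)
}

// Generic sketch: exponential backoff with jitter for the nth retry attempt
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  const capped = Math.min(baseMs * 2 ** attempt, maxMs)
  return capped / 2 + Math.random() * (capped / 2)
}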

Security & Monitoring

  • Structured logging for audit trails and debugging
  • Health checks for containerized deployments
  • Non-root user execution in Docker containers
  • Environment-based configuration management

🧪 Testing & Quality Assurance

# Run test suite
npm test

# Generate coverage report
npm run test:coverage

# Code quality checks
npm run lint && npm run format:check

# Full validation pipeline
npm run validate

📈 Future Enhancements

  • Storage Adapters: S3, PostgreSQL, MongoDB integrations
  • Monitoring: Prometheus metrics and alerting
  • Scaling: Distributed processing with message queues
  • Advanced Parsing: ML-based content extraction
  • API Gateway: RESTful interface for remote operations

📋 Technical Requirements

  • Node.js: 22.0.0 or higher
  • Memory: 2GB+ recommended for browser operations
  • Storage: Configurable (local filesystem or external)
  • Network: HTTP/HTTPS access for target sites

📄 License

MIT License - See LICENSE file for details.


Built with ❤️ to demonstrate enterprise-grade JavaScript architecture and modern DevOps practices.
