Universal Web Scraper


Enterprise-grade web scraping framework showcasing modern JavaScript architecture, containerization, and DevOps best practices.

🎯 Portfolio Highlights

This project demonstrates:

  • Modern JavaScript (ES2024, Node.js 22+) with enterprise patterns
  • Scalable Architecture - Plugin-based parser system with dependency injection
  • Production DevOps - Docker containerization, CI/CD pipelines, comprehensive testing
  • Type Safety - Runtime validation with Zod schemas
  • Performance - Concurrent processing with intelligent queue management
  • Observability - Structured logging and monitoring capabilities

🚀 Technical Features

  • 🏗️ Pluggable Architecture - Extensible parser system with auto-discovery
  • 🎭 Browser Automation - Headless Chrome via Playwright with smart pooling
  • ✅ Type-Safe Validation - Runtime schema validation with Zod
  • 📊 Observability - Structured JSON logging with Pino
  • 🔄 Resilient Processing - Built-in queue management, retries, and error handling
  • 🗄️ Database Integration - Cassandra NoSQL for distributed storage, deduplication, and analytics
  • 🐳 Containerized - Multi-stage Docker builds optimized for production
  • 🧪 Test-Driven - Comprehensive Jest test suite with mocking
  • 🔧 DevOps Ready - GitHub Actions CI/CD with automated testing and security scanning

🏗️ Architecture & Tech Stack

Component          Technology             Purpose
Runtime            Node.js 22+            Modern JavaScript runtime
Scraping Engine    Crawlee + Playwright   Enterprise web scraping framework
Type Safety        Zod                    Runtime schema validation
Logging            Pino                   High-performance structured logging
Testing            Jest                   Unit and integration testing
Code Quality       ESLint + Prettier      Automated code formatting & linting
Containerization   Docker + Kubernetes    Multi-stage builds & orchestration
CI/CD              GitHub Actions         Automated testing & deployment

🚀 Quick Start

# Install dependencies
npm install

# Run a single URL
npm start -- --url https://example.com --parser generic-news

# Process multiple URLs from file
npm start -- --file seeds.txt --parser auto

# With Cassandra database integration (optional)
make cassandra-dev                    # Start Cassandra locally
node scripts/cassandra-utils.js seed # Populate initial data
npm start -- --parser generic-news   # Uses database + file fallback

Example Usage

The seeds.txt file supports multiple formats:

https://www.example.com/article1
https://www.example.com/article2
{"url": "https://news.site.com", "parser": "generic-news"}
{"url": "https://weibo.com/user", "parser": "weibo"}

⚙️ Configuration Options

Parameter       Description                              Default
--url           Single URL to process                    -
--file          Path to seeds file                       -
--parser        Parser to use (auto for detection)       auto
--concurrency   Concurrent requests                      2
--maxRequests   Maximum requests to process              unlimited
--delayMin      Minimum delay between requests (ms)      500
--delayMax      Maximum delay between requests (ms)      1500
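
These options can be combined on a single run; for example (the file name and values below are illustrative):

# Process a seeds file with higher concurrency and longer randomized delays
npm start -- --file seeds.txt --parser auto --concurrency 4 --delayMin 1000 --delayMax 3000

# Cap a run at 100 requests with a specific parser
npm start -- --file seeds.txt --parser generic-news --maxRequests 100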

🐳 Docker Deployment

Multi-stage containerization for different environments:

# Build and run production container
make docker-build && make docker-run

# Development with live reload
make docker-dev

# Run test suite in container
make docker-test

Container Architecture

  • Development Stage: Full toolchain with hot reload
  • Testing Stage: Isolated environment for CI/CD
  • Production Stage: Minimal runtime optimized for performance
  • Security Features: Non-root user, health checks, minimal attack surface

☸️ Kubernetes Deployment

Enterprise-ready Kubernetes manifests for cloud deployment:

# Deploy to Kubernetes cluster
kubectl apply -f k8s/

# Or use the convenience script
./k8s/deploy.sh

# Check deployment status
kubectl get pods -n web-scraper
kubectl logs -f deployment/web-scraper -n web-scraper

Kubernetes Features

  • Namespace Isolation - Dedicated namespace for resource organization
  • ConfigMap Management - Environment configuration without secrets
  • Resource Limits - CPU/memory constraints for predictable performance
  • Health Probes - Liveness and readiness checks for reliability
  • Service Discovery - Internal networking and load balancing

🔧 Extending the Framework

Creating Custom Parsers

The plugin architecture allows easy extension for new sites:

import { BaseParser } from '../core/base-parser.js'
import { z } from 'zod'

const MyParserSchema = z.object({
  id: z.string(),
  title: z.string(),
  content: z.string().optional(),
  publishedAt: z.date().optional(),
})

export default class MyCustomParser extends BaseParser {
  id = 'my-site'
  domains = ['example.com']
  schema = MyParserSchema

  async canParse(url) {
    return /example\.com/.test(url)
  }

  async parse(page, ctx) {
    const title = await page.textContent('h1')
    const content = await page.textContent('article')

    return this.schema.parse({
      id: ctx.request.id,
      title,
      content,
      publishedAt: new Date(),
    })
  }
}

Parser Auto-Discovery

The framework automatically registers any parser found in the src/parsers/ directory, enabling plug-and-play support for new sites.
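
As a rough sketch of how directory-based discovery can work (the discoverParsers function, registry map, and ESM assumptions here are illustrative, not the framework's actual API):

// Illustrative sketch of parser auto-discovery; the real framework code may differ.
// Assumes ESM and a parser class exported as default from each file in src/parsers/.
import { readdir } from 'node:fs/promises'
import path from 'node:path'
import { pathToFileURL } from 'node:url'

export async function discoverParsers(dir = 'src/parsers') {
  const registry = new Map()
  for (const file of await readdir(dir)) {
    if (!file.endsWith('.js')) continue
    const mod = await import(pathToFileURL(path.join(dir, file)).href) // dynamic import
    const parser = new mod.default()
    registry.set(parser.id, parser) // keyed by the parser's id field, e.g. 'my-site'
  }
  return registry
}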

⚡ Environment Configuration

Deployment Configuration

The framework uses a single crawlee.json configuration file optimized for all environments:

{
  "persistStorage": false,
  "logLevel": "INFO"
}

The same configuration file works across deployment scenarios:

🎯 Universal Configuration Benefits:

  • Development: Clean runs with no storage persistence
  • Production: Add CRAWLEE_PERSIST_STORAGE=true if storage needed
  • Serverless: Perfect as-is (no storage, minimal footprint)
  • Docker: Works in containers without modification

Environment Overrides

Override configuration through environment variables when needed:

# Enable storage for production if needed
export CRAWLEE_PERSIST_STORAGE=true
export CRAWLEE_LOG_LEVEL=ERROR

# Run with overrides
npm start

Key Configuration Benefits

  • 🚀 No storage by default - Prevents disk bloat, perfect for serverless
  • 📊 Appropriate logging - INFO level provides good visibility
  • 🔧 Environment flexibility - Override via env vars when needed
  • 🐳 Docker-ready - Works in containers without modification

Serverless Deployment

Optimized for AWS Lambda, Google Cloud Functions, and similar platforms:

  • No filesystem persistence required
  • Minimal logging overhead
  • Memory-efficient browser management
  • Environment variable configuration support

🔄 Data Processing

Parser Selection Logic

  1. Forced via the CLI (--parser <id>); pass --parser auto to disable forcing.
  2. Otherwise, the first registered parser whose canParse(url) returns truthy.
  3. Fallback: attempt generic article extraction; if a title is found and the generic-news parser exists, use it (see the sketch below).
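
A minimal sketch of that order (selectParser and the registry shape are illustrative, not the framework's actual API):

// Minimal sketch of the selection order above; not the framework's real code.
async function selectParser(url, registry, forcedId = 'auto') {
  // 1. A parser forced via --parser <id> wins; 'auto' disables forcing.
  if (forcedId !== 'auto' && registry.has(forcedId)) return registry.get(forcedId)

  // 2. Otherwise, the first registered parser whose canParse(url) returns truthy.
  for (const parser of registry.values()) {
    if (await parser.canParse(url)) return parser
  }

  // 3. Fall back to the generic-news parser when it is registered (the real
  //    logic also checks that generic article extraction finds a title first).
  return registry.get('generic-news')
}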

Output Format

Validated JSON objects are printed to stdout (one per line), with structured logs written to stderr:

{
  "id": "article_123",
  "title": "Sample Article",
  "content": "...",
  "publishedAt": "2024-01-01"
}

Results can be easily adapted for various outputs (database, message queue, file storage).
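
Because each result is a single JSON line on stdout, adapting the output is usually a matter of piping it into a small consumer. A hypothetical example (consume.js is not part of the repository):

// consume.js - hypothetical downstream consumer reading NDJSON from stdin,
// e.g. `npm start -- --file seeds.txt 2> scrape.log | node consume.js`
import readline from 'node:readline'

const rl = readline.createInterface({ input: process.stdin })
for await (const line of rl) {
  const article = JSON.parse(line) // one validated object per line
  // Insert into a database, publish to a queue, or append to a file here
  console.error(`stored ${article.id}: ${article.title}`)
}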

🔒 Production Considerations

Rate Limiting & Reliability

  • Configurable concurrency limits (default: 2 concurrent requests)
  • Random delays between requests (500-1500ms) for respectful scraping
  • Built-in retry mechanisms with exponential backoff (delays and backoff are sketched below)
  • Request queue management for large-scale operations
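
As an illustration of the delay behavior described above (generic formulas, not the framework's internal implementation):

// Generic sketch: a randomized inter-request delay within [delayMin, delayMax]
function randomDelay(delayMin = 500, delayMax = 1500) {
  return delayMin + Math.random() * (delayMax - delayMin)
}

// Generic sketch: exponential backoff with jitter for the nth retry attempt
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  const capped = Math.min(baseMs * 2 ** attempt, maxMs)
  return capped / 2 + Math.random() * (capped / 2)
}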

Security & Monitoring

  • Structured logging for audit trails and debugging
  • Health checks for containerized deployments
  • Non-root user execution in Docker containers
  • Environment-based configuration management

🧪 Testing & Quality Assurance

# Run test suite
npm test

# Generate coverage report
npm run test:coverage

# Code quality checks
npm run lint && npm run format:check

# Full validation pipeline
npm run validate

📈 Future Enhancements

  • Storage Adapters: S3, PostgreSQL, MongoDB integrations
  • Monitoring: Prometheus metrics and alerting
  • Scaling: Distributed processing with message queues
  • Advanced Parsing: ML-based content extraction
  • API Gateway: RESTful interface for remote operations

📋 Technical Requirements

  • Node.js: 22.0.0 or higher
  • Memory: 2GB+ recommended for browser operations
  • Storage: Configurable (local filesystem or external)
  • Network: HTTP/HTTPS access for target sites

📄 License

MIT License - See LICENSE file for details.


Built with ❤️ to demonstrate enterprise-grade JavaScript architecture and modern DevOps practices.
