A highly scalable, serverless distributed web crawler that processes URLs through AWS SQS queues and stores structured results in S3. Designed for enterprise-grade web scraping with built-in monitoring and auto-scaling capabilities.
- Fully Serverless Architecture: Leverages AWS managed services for minimal operational overhead
- Elastic Scaling: Automatically scales workers based on queue depth
- Dockerized Workers: Containerized scraping tasks for consistent execution
- Persistent Storage: All crawl results stored in S3 with structured JSON format
- Comprehensive Monitoring: Built-in CloudWatch logging and metrics
- Fault-Tolerant Design: Message queue ensures no URL is lost during processing
- Cost-Effective: Pay only for resources used during crawling operations
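The "structured JSON" results stored in S3 might look roughly like the sketch below. The field names here are hypothetical, chosen only to illustrate the idea of one JSON record per crawled page; they are not the project's actual schema:

```python
import json
from datetime import datetime, timezone

def build_result_record(url: str, status_code: int, title: str, links: list) -> str:
    """Serialize one crawl result as a JSON document (hypothetical schema)."""
    record = {
        "url": url,
        "status_code": status_code,
        "title": title,
        "links": links,
        "crawled_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)

doc = build_result_record("https://example.com", 200, "Example Domain", [])
```

Each record would then be written to the output bucket under a key derived from the URL.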
```mermaid
flowchart LR
    A[User] -->|Submit URL| B[Frontend]
    B -->|API Call| C[Backend API]
    C -->|Queue Message| D[SQS]
    D -->|Poll| E[ECS Fargate Workers]
    E -->|Store Results| F[S3 Bucket]
    E -->|Logs| G[CloudWatch]
    H[Auto Scaling] -->|Scale| E
    I[Monitoring] -->|Metrics| G
```
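The scraping step the Fargate workers perform (fetch with Requests, parse with BeautifulSoup4, per the stack below) could look roughly like this; the function names and extracted fields are illustrative, not the project's actual code:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def parse_page(html: str, base_url: str) -> dict:
    """Extract the page title and absolute outbound links from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return {"title": title, "links": links}

def crawl(url: str, timeout: int = 10) -> dict:
    """Fetch one URL and return its parsed result."""
    resp = requests.get(
        url,
        timeout=timeout,
        headers={"User-Agent": "Mozilla/5.0 (compatible; MyCrawler)"},
    )
    resp.raise_for_status()
    return parse_page(resp.text, url)
```

Keeping `parse_page` separate from the HTTP call makes the parsing logic testable without network access.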
| Component | Technology |
|---|---|
| Frontend | React.js |
| Backend API | FastAPI (Python) |
| Queue Service | Amazon SQS (Standard Queue) |
| Compute | AWS ECS Fargate |
| Storage | Amazon S3 (Standard Storage) |
| Monitoring | AWS CloudWatch |
| Scraping | BeautifulSoup4, Requests |
- AWS Account with admin permissions
- AWS CLI configured (`aws configure`)
- Docker installed
- Python 3.10+
- Node.js 16+ (if using frontend)
- Set Environment Variables:

  ```bash
  cp .env.example .env  # Update values in .env
  ```

- Deploy API:

  ```bash
  cd backend/
  python -m pip install -r requirements.txt
  uvicorn app.main:app --host 0.0.0.0 --port 8000
  ```

- Build and Push Container:

  ```bash
  docker build -t web-crawler .
  aws ecr get-login-password | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
  docker tag web-crawler:latest <account-id>.dkr.ecr.<region>.amazonaws.com/web-crawler:latest
  docker push <account-id>.dkr.ecr.<region>.amazonaws.com/web-crawler:latest
  ```

- Create ECS Service:

  ```bash
  aws ecs create-service \
    --cluster web-crawler-cluster \
    --service-name web-crawler \
    --task-definition web-crawler-task \
    --desired-count 2 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-12345],securityGroups=[sg-12345]}"
  ```
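Behind the API's URL-submission endpoint, the backend presumably validates each URL before enqueuing it on SQS. A minimal sketch of that logic, with the SQS client passed in so it can be stubbed in tests; the function names are assumptions, not the project's actual code:

```python
from urllib.parse import urlparse

def validate_url(url: str) -> bool:
    """Accept only absolute http(s) URLs; reject anything else up front."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def submit_url(url: str, sqs_client, queue_url: str) -> bool:
    """Validate one URL and enqueue it; return False instead of queuing bad input."""
    if not validate_url(url):
        return False
    sqs_client.send_message(QueueUrl=queue_url, MessageBody=url)
    return True
```

In production `sqs_client` would be a `boto3` SQS client; injecting it keeps the handler testable without AWS credentials.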
| Variable | Description | Example Value |
|---|---|---|
| `AWS_REGION` | AWS region for resources | `us-east-1` |
| `WEB_CRAWLER_QUEUE_URL` | SQS Queue URL | `https://sqs.us-east-1.amazonaws.com/...` |
| `WEB_CRAWLER_OUTPUT_BUCKET` | S3 bucket for results | `web-crawler-results-12345` |
| `MAX_RETRIES` | Maximum retry attempts per URL | `3` |
| `REQUEST_TIMEOUT` | HTTP request timeout in seconds | `10` |
| `USER_AGENT` | Custom user agent for requests | `Mozilla/5.0 (compatible; MyCrawler)` |
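Workers can read these variables once at startup into a typed config object. A sketch, assuming the variable names in the table and using its example values as defaults (the class itself is illustrative, not the project's actual code):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class CrawlerConfig:
    region: str
    queue_url: str
    output_bucket: str
    max_retries: int
    request_timeout: int
    user_agent: str

    @classmethod
    def from_env(cls) -> "CrawlerConfig":
        env = os.environ
        return cls(
            region=env.get("AWS_REGION", "us-east-1"),
            # The queue URL and bucket have no sensible defaults; fail fast if unset.
            queue_url=env["WEB_CRAWLER_QUEUE_URL"],
            output_bucket=env["WEB_CRAWLER_OUTPUT_BUCKET"],
            max_retries=int(env.get("MAX_RETRIES", "3")),
            request_timeout=int(env.get("REQUEST_TIMEOUT", "10")),
            user_agent=env.get("USER_AGENT", "Mozilla/5.0 (compatible; MyCrawler)"),
        )
```

Failing fast on the required variables surfaces misconfiguration at container start rather than mid-crawl.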
```json
{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "SQSQueueApproximateNumberOfMessagesVisible",
    "ResourceLabel": "queue-name/web-crawler-queue"
  },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 300
}
```

- Start Local Services:

  ```bash
  docker-compose up -d
  ```

- Run Tests:

  ```bash
  pytest backend/tests/
  pytest crawler/tests/
  ```

- Submit Test URL:

  ```bash
  curl -X POST "http://localhost:8000/submit-url" \
    -H "Content-Type: application/json" \
    -d '{"url":"https://example.com"}'
  ```

- View Worker Logs:

  ```bash
  aws logs tail /ecs/web-crawler --follow
  ```

- Check Queue Status:

  ```bash
  aws sqs get-queue-attributes \
    --queue-url <QUEUE_URL> \
    --attribute-names ApproximateNumberOfMessages
  ```

- Inspect S3 Results:

  ```bash
  aws s3 ls s3://web-crawler-results-12345/results/
  ```
Key metrics to monitor:

- SQS Metrics: `ApproximateNumberOfMessagesVisible`, `ApproximateAgeOfOldestMessage`
- ECS Metrics: `CPUUtilization`, `MemoryUtilization`
- Custom Metrics: `CrawlSuccessRate`, `AverageProcessingTime`

Example CloudWatch Dashboard:

```bash
aws cloudwatch put-dashboard \
  --dashboard-name WebCrawler \
  --dashboard-body file://dashboard.json
```

| Issue | Solution |
|---|---|
| Messages stuck in queue | Check worker logs, increase task count |
| High error rate | Verify timeout settings, check target sites' robots.txt |
| Slow processing | Review scraper logic, optimize BeautifulSoup selectors |
| Permission denied errors | Verify IAM roles for ECS tasks |
| Container failing to start | Check ECS task definition, verify container health checks |
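The retry behavior governed by `MAX_RETRIES` (relevant when tuning the timeout issues above) can be sketched as a small helper with exponential backoff. The fetch callable is passed in so the helper stays testable without network access; this is an illustration of the pattern, not the project's actual implementation:

```python
import time

def fetch_with_retries(fetch, url: str, max_retries: int = 3, backoff: float = 1.0):
    """Call fetch(url), retrying up to max_retries attempts with exponential backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as exc:  # in practice, catch requests.RequestException
            last_error = exc
            # Wait backoff, 2*backoff, 4*backoff, ... between attempts.
            time.sleep(backoff * (2 ** attempt))
    raise last_error
```

After `max_retries` failures the final exception propagates, which lets SQS redrive the message (or route it to a dead-letter queue) rather than losing the URL.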
- Dynamic Content: Does not execute JavaScript (consider adding headless browser support)
- Rate Limiting: No built-in rate limiting (implement in future version)
- Duplicate URLs: Basic URL deduplication only (hash-based)
- Large Files: Not optimized for large file downloads
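The hash-based deduplication mentioned above can be as simple as a set of digests over lightly normalized URLs. A sketch; the project's actual normalization rules may differ:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def url_fingerprint(url: str) -> str:
    """Normalize trivially (lowercase scheme/host, drop fragment) and hash."""
    parts = urlsplit(url)
    normalized = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",  # fragments never reach the server, so ignore them
    ))
    return hashlib.sha256(normalized.encode()).hexdigest()

class Deduplicator:
    def __init__(self):
        self._seen = set()

    def is_new(self, url: str) -> bool:
        """True the first time a URL (modulo normalization) is seen."""
        fp = url_fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

An in-memory set only deduplicates within one worker; sharing it across workers would need external state such as DynamoDB or ElastiCache.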
- Add headless browser support (Playwright)
- Implement sophisticated URL deduplication
- Add rate limiting middleware
- Develop result analysis module
- Create Terraform deployment option
- Add Prometheus/Grafana monitoring
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request