SuData is a comprehensive data pipeline system that automates the collection, processing, and monitoring of streaming data from multiple sources. Built with modern DevOps practices, it features automated CI/CD pipelines, multi-environment deployments, and robust monitoring capabilities.
+-------------------+    +-------------------+    +-------------------+
| Telegram Scraper  |    |  TikTok Scraper   |    |  YouTube Scraper  |
|                   |    |                   |    |                   |
+---------+---------+    +---------+---------+    +---------+---------+
          |                        |                        |
          | (Raw Data)             | (Raw Data)             | (Raw Data)
          v                        v                        v
+---------------------------------------------------------------------+
|                      Refinery Service (Python)                      |
|                     (Processes & Refines Data)                      |
+---------------------------------------------------------------------+
                                   |
                                   | (Refined Data)
                                   v
+---------------------------------------------------------------------+
|                          Output / Storage                           |
+---------------------------------------------------------------------+
                  |                                 ^
                  |                                 | (Health Status)
                  v                                 |
+---------------------------------------------------------------------+
|            SuData Symphony (Orchestration & Monitoring)             |
|               (Starts, Stops, Manages Health Checks)                |
+---------------------------------------------------------------------+
                                   ^
                                   | (Health Status)
                                   |
                         +-------------------+
                         |     Dashboard     |
                         |                   |
                         +-------------------+
- Multi-Source Data Collection: Automated scraping from Telegram, TikTok, and YouTube
- Real-time Processing: Intelligent data refinement and transformation
- Web Dashboard: Interactive monitoring and visualization interface
- Health Monitoring: Comprehensive system health checks and alerts
- Automated Testing: Unit, integration, and E2E test suites
- Multi-Environment Support: Development, staging, and production environments
- Docker Containerization: Consistent deployment across environments
- Automated Deployments: GitHub Actions-powered CI/CD pipeline
- Rollback Capabilities: Automated rollback on deployment failures
- Quality Gates: Code quality, security scanning, and coverage requirements
- Next.js 14: React-based web framework
- TypeScript: Type-safe JavaScript development
- Tailwind CSS: Utility-first CSS framework
- Chart.js: Data visualization library
- Python 3.11+: Core processing language
- FastAPI: High-performance API framework
- AsyncIO: Asynchronous programming support
- Pandas: Data manipulation and analysis
Each Python service (Telegram, TikTok, YouTube Scrapers, and Refinery Service) utilizes its own dedicated virtual environment (e.g., tele-venv, tt-venv, yt-venv, .venv respectively). These virtual environments, along with other large binary files (like dnnl.lib and torch_cpu.dll), are intentionally excluded from Git via the .gitignore file to prevent issues with GitHub's file size limits. When setting up the project, pnpm install will handle the creation and population of these virtual environments.
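Because each service has its own virtual environment, the interpreter path differs per service and per platform. The following is a minimal sketch of resolving it; the dictionary keys and the `venv_python` helper are illustrative names, not part of the project, while the venv directory names and the `Scripts`/`bin` layout come from this README.

```python
import sys
from pathlib import Path

# Venv directory names as listed in this README. The dictionary keys are
# illustrative labels only, not identifiers the project necessarily uses.
VENV_DIRS = {
    "telegram-scraper": "tele-venv",
    "tiktok-scraper": "tt-venv",
    "youtube-scraper": "yt-venv",
    "refinery-service": ".venv",
}


def venv_python(service_dir: Path, venv_name: str, platform: str = sys.platform) -> Path:
    """Return the Python interpreter path inside a service's virtual environment."""
    if platform.startswith("win"):
        # Windows venvs place the interpreter under Scripts/
        return service_dir / venv_name / "Scripts" / "python.exe"
    # Linux and macOS venvs place it under bin/
    return service_dir / venv_name / "bin" / "python"
```

For example, `venv_python(Path("apps/telegram-scraper"), VENV_DIRS["telegram-scraper"])` yields the interpreter used for the one-time Telegram authentication described under troubleshooting.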
- Docker: Containerization platform
- GitHub Actions: CI/CD automation
- GitHub Container Registry: Docker image storage
- pnpm: Fast, disk space efficient package manager
- Vitest: Unit testing for frontend
- Jest: JavaScript testing framework
- Pytest: Python testing framework
- Playwright: End-to-end testing
- ESLint: JavaScript/TypeScript linting
- Flake8: Python code linting
- Node.js 18+
- Python 3.11+
- Docker & Docker Compose
- pnpm
- Clone the repository
  git clone https://github.com/O96a/SuData.git
  cd SuData
- Install dependencies
  pnpm install
- Start services using SuData Symphony (Recommended)
  Navigate to the project root and run:
  python scripts/sudata-symphony.py
  Note: SuData Symphony now includes automated port cleanup at startup to prevent conflicts. The --no-confirm flag is enabled by default for non-interactive startup.
The SuData Symphony is our production-ready orchestration system that manages all services with proper startup order, health monitoring, and graceful shutdown.
Features:
- Smart Startup: Automatic service discovery and ordered startup
- Health Monitoring: Real-time health checks for all services
- User-Friendly: Interactive confirmation and colored logging
- Graceful Shutdown: Proper cleanup and resource management
- Easy Configuration: YAML-based service configuration
- Auto-Recovery: Automatic restart of failed services
- Process Monitoring: Continuous health checks and status tracking
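Ordered startup and graceful shutdown can be illustrated with a topological sort. This is only a sketch: the `DEPENDS_ON` map below is a hypothetical dependency graph loosely based on the architecture diagram, and the actual ordering used by SuData Symphony is defined by the project, not by this example.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services that should be
# running before it starts. The real ordering is defined by SuData Symphony.
DEPENDS_ON = {
    "telegram-scraper": set(),
    "tiktok-scraper": set(),
    "youtube-scraper": set(),
    "refinery-service": {"telegram-scraper", "tiktok-scraper", "youtube-scraper"},
    "dashboard": {"refinery-service"},
}


def startup_order(depends_on):
    """Return service names so that every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shut down in reverse, so dependents stop before what they rely on."""
    return list(reversed(startup_order(depends_on)))
```

Stopping in reverse order is one common way to get a clean shutdown, since no service is left running against a dependency that has already exited.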
Quick Commands:
# Start all services
python scripts/sudata-symphony.py
# Start with specific log level
python scripts/sudata-symphony.py --log-level INFO
# To gracefully stop all services, press Ctrl+C in the terminal running SuData Symphony.
# Service status can be monitored via the Dashboard.

The Service Manager handles the lifecycle of all services with the following capabilities:
- Service Lifecycle Management: Start, stop, and restart services
- Health Monitoring: Continuous health checks via /health endpoints
- Automatic Recovery: Configurable auto-restart for failed services
- Logging: Centralized logging for all services
- Cross-Platform: Works on both Windows and Unix-based systems
Services are configured in scripts/config/services.yaml with options for:
- Virtual environment management
- Custom startup commands
- Environment variables
- Health check endpoints
- Auto-recovery settings
For detailed configuration options, see Service Manager Documentation.
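To make the configuration options above more concrete, here is a sketch of how one service entry might be represented after the YAML file is loaded. Every field name in `ServiceConfig` is an assumption chosen to mirror the listed options; the actual schema of scripts/config/services.yaml is described in the Service Manager Documentation and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServiceConfig:
    """Hypothetical in-memory shape of one service entry (field names assumed)."""
    name: str
    command: List[str]                                   # custom startup command
    venv: Optional[str] = None                           # virtual environment directory
    env: Dict[str, str] = field(default_factory=dict)    # environment variables
    health_url: Optional[str] = None                     # health check endpoint
    auto_restart: bool = True                            # auto-recovery switch
    max_restarts: int = 3                                # auto-recovery limit


def parse_service(name: str, raw: dict) -> ServiceConfig:
    """Validate a raw mapping (e.g. loaded from YAML) and build a ServiceConfig."""
    if not raw.get("command"):
        raise ValueError(f"service {name!r} needs a non-empty 'command'")
    return ServiceConfig(
        name=name,
        command=list(raw["command"]),
        venv=raw.get("venv"),
        env=dict(raw.get("env", {})),
        health_url=raw.get("health_url"),
        auto_restart=bool(raw.get("auto_restart", True)),
        max_restarts=int(raw.get("max_restarts", 3)),
    )
```

Validating entries up front means a missing startup command is reported at load time rather than when the service fails to launch.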
Access the monitoring dashboard at http://localhost:3007 to view:
- Service status and health
- Resource usage (CPU, memory)
- Logs and error messages
- Performance metrics
| Service | Port | Description |
|---|---|---|
| TikTok Scraper | 3000 | TikTok data collection |
| YouTube Scraper | 3002 | YouTube data collection |
| Dashboard | 3007 | Monitoring interface |
| Refinery Service | 3004 | Data processing |
| Telegram Scraper | 3005 | Telegram data collection |
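The port table can be kept as a single lookup so that health URLs are derived rather than repeated. The sketch below uses the ports from the table; the service keys, `health_url`, and `find_port_collisions` are illustrative names, not project identifiers.

```python
from collections import defaultdict

# Ports copied from the table above; the keys are illustrative labels.
SERVICE_PORTS = {
    "tiktok-scraper": 3000,
    "youtube-scraper": 3002,
    "refinery-service": 3004,
    "telegram-scraper": 3005,
    "dashboard": 3007,
}


def health_url(service: str, host: str = "localhost") -> str:
    """Build the /health URL for a service from its configured port."""
    return f"http://{host}:{SERVICE_PORTS[service]}/health"


def find_port_collisions(ports: dict) -> dict:
    """Return {port: [services]} for any port assigned to more than one service."""
    by_port = defaultdict(list)
    for service, port in ports.items():
        by_port[port].append(service)
    return {port: names for port, names in by_port.items() if len(names) > 1}
```

Running `find_port_collisions` against the map is a cheap sanity check before startup whenever a port assignment changes.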
Each service exposes a health check endpoint at /health that returns:
- Service status (healthy/unhealthy)
- Uptime
- Version information
- Dependencies status
Example health check response:
{
  "status": "healthy",
  "timestamp": "2023-01-01T00:00:00Z",
  "version": "1.0.0",
  "details": {
    "database": "connected",
    "disk_space": "sufficient"
  }
}

Service logs are available in the following locations:
- Console Output: Direct output from each service
- Log Files: Stored in the logs/ directory at the project root (e.g., /SuData/logs/telegram_scraper.log)
- Dashboard: Real-time log viewing in the monitoring interface
Configure alerts for:
- Service failures
- Resource constraints
- Performance degradation
- Failed health checks
Alerts can be sent to:
- Slack/Teams
- Discord
- Custom webhooks
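As an illustration of the alert targets above, the sketch below builds the JSON body for a webhook alert without sending it. Slack incoming webhooks accept a `text` field and Discord webhooks accept a `content` field; the custom payload shape, the message format, and both function names are assumptions made for this example.

```python
import json
import urllib.request


def build_alert_payload(target: str, service: str, reason: str) -> dict:
    """Build a webhook JSON body for the given target type."""
    message = f"[SuData] {service}: {reason}"
    if target == "slack":
        return {"text": message}        # Slack incoming webhooks read "text"
    if target == "discord":
        return {"content": message}     # Discord webhooks read "content"
    # Custom webhook: hypothetical shape, adjust to whatever the receiver expects.
    return {"service": service, "reason": reason, "message": message}


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request for the webhook URL."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would deliver the alert; keeping construction separate from sending makes the payload easy to test.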
- Access the dashboard
  - Development: http://localhost:3000
  - Staging: http://localhost:3001

- Install Python and JavaScript dependencies
  # From the project root, pnpm will install all dependencies for all services
  pnpm install
- Run tests
  pnpm run test:all
- Start development servers
  pnpm run start:all
- Unit Tests: pnpm run test:unit
- Integration Tests: pnpm run test:integration
- End-to-End Tests: pnpm run test:e2e
- Performance Tests: pnpm run test:performance
- Coverage Reports: pnpm run test:coverage
Deployment is currently a manual process. Automated CI/CD pipelines are configured via GitHub Actions for continuous integration and testing.
- Staging Deployment (Automatic)
  - Triggered on merge to the main branch
  - Automated build, test, and deploy
  - Health checks and notifications
- Production Deployment (Manual)
  - Manual workflow dispatch
  - Requires confirmation input
  - Comprehensive health checks
  - Automated rollback on failure
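The "health checks, then automated rollback on failure" step can be sketched as a small control loop. This is not the project's GitHub Actions workflow; `deploy`, `is_healthy`, and `rollback` are placeholder callables supplied by the caller, and the retry count and delay are assumed defaults.

```python
import time
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    attempts: int = 3,
    delay_s: float = 0.0,
) -> bool:
    """Deploy, poll health up to `attempts` times, and roll back if never healthy."""
    deploy()
    for _ in range(attempts):
        if is_healthy():
            return True          # deployment accepted
        time.sleep(delay_s)      # give the service time to come up
    rollback()                   # health never passed: restore previous version
    return False
```

Retrying the health check a few times before rolling back avoids reverting a deployment that is merely slow to start.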
Each service provides comprehensive health monitoring endpoints:
| Service | Port | Endpoints | Status |
|---|---|---|---|
| Telegram Scraper | 3005 | /health, /health/detailed, / | ✅ Implemented |
| TikTok Scraper | 3000 | /health, /health/detailed, / | ✅ Implemented |
| YouTube Scraper | 3002 | /health, /health/detailed, / | ✅ Implemented |
| Refinery Service | 3004 | /health | ✅ Available |
| Dashboard | 3007 | Dashboard UI + service monitoring | ✅ Running |
Each health endpoint provides:
- Basic Health: Service operational status (healthy/error/stopped)
- Detailed Statistics:
  - Number of monitored channels/streamers
  - Recent output files (last 24 hours)
  - Error count and activity metrics
  - File details (size, modification time, age)
- Service Information: Version, uptime, and configuration details
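The "recent output files" statistic with size, modification time, and age could be gathered roughly as follows. This is a sketch under assumptions: the function name, the returned keys, and the flat (non-recursive) directory scan are choices made for the example, not the scrapers' actual implementation.

```python
import time
from pathlib import Path
from typing import Optional


def recent_files(output_dir: Path, max_age_hours: float = 24.0, now: Optional[float] = None) -> list:
    """List files in output_dir modified within the last max_age_hours."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    results = []
    for path in sorted(output_dir.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        stat = path.stat()
        if stat.st_mtime >= cutoff:
            results.append({
                "name": path.name,
                "size_bytes": stat.st_size,
                "modified": stat.st_mtime,                     # seconds since epoch
                "age_seconds": max(0.0, now - stat.st_mtime),
            })
    return results
```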
The web dashboard automatically monitors all service health endpoints:
- 🟢 Green: Service running and healthy
- 🟡 Amber: Service status unknown (health endpoint unavailable)
- 🔴 Red: Service error or health check failed
- ⚪ Gray: Service stopped
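The color rules above amount to a small mapping from reported status to indicator. The sketch below assumes the status strings `healthy` and `stopped` mentioned in this README, uses `None` to stand for an unreachable health endpoint, and treats any other value as an error; the function name is illustrative.

```python
from typing import Optional


def status_color(status: Optional[str]) -> str:
    """Map a service's reported status to a dashboard indicator color."""
    if status is None:
        return "amber"   # health endpoint unavailable, status unknown
    if status == "healthy":
        return "green"   # running and healthy
    if status == "stopped":
        return "gray"    # service stopped
    return "red"         # error or failed health check
```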
# Check Telegram scraper health
curl http://localhost:3005/health
# Check TikTok scraper detailed stats
curl http://localhost:3000/health/detailed
# Check YouTube scraper status
curl http://localhost:3002/health
# View all services in dashboard
open http://localhost:3007

- Service Health: Real-time health status monitoring with automatic refresh
- Performance Metrics: File processing statistics and error tracking
- Resource Utilization: Service uptime and activity monitoring
- Error Tracking: Recent error count and log analysis
- Deployment Status: Service availability and health endpoint monitoring
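The curl checks shown earlier can also be scripted. The following sketch polls a set of /health URLs with the standard library and reports `unknown` when an endpoint cannot be reached or returns something other than a JSON object. The function names and the injectable `fetch` parameter are assumptions for illustration; the response is assumed to carry a `status` field as in the example health response.

```python
import http.client
import json
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional


def fetch_health(url: str, timeout: float = 2.0) -> Optional[dict]:
    """GET a health endpoint and return the parsed JSON body, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return None  # unreachable, timed out, or not valid JSON


def poll_services(urls: Dict[str, str], fetch: Callable[[str], Optional[dict]] = fetch_health) -> Dict[str, str]:
    """Return {service: status}, using 'unknown' when no usable response arrives."""
    statuses = {}
    for name, url in urls.items():
        body = fetch(url)
        if isinstance(body, dict):
            statuses[name] = str(body.get("status", "unknown"))
        else:
            statuses[name] = "unknown"
    return statuses
```

Calling `poll_services({"telegram-scraper": "http://localhost:3005/health"})` performs a real request; passing a stub as `fetch` keeps the logic testable without running services.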
NODE_ENV=development
LOG_LEVEL=DEBUG
DASHBOARD_PORT=3007

NODE_ENV=staging
LOG_LEVEL=DEBUG
DASHBOARD_PORT=3007
REFINERY_PORT=3004

NODE_ENV=production
LOG_LEVEL=INFO
DASHBOARD_PORT=3007
REFINERY_PORT=3004

- Dependency Scanning: Automated vulnerability detection
- Container Security: Image scanning and security policies
- Secrets Management: Secure handling of sensitive configuration
- Access Control: Role-based access and authentication
- No hardcoded secrets in code
- Regular dependency updates
- Container image scanning
- Secure environment variable management
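In line with keeping configuration out of the code, the environment variables shown earlier (NODE_ENV, LOG_LEVEL, DASHBOARD_PORT, REFINERY_PORT) can be read once at startup. This is a minimal sketch: the `Settings` class, the `load_settings` helper, and the fallback defaults are assumptions, not the project's actual loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    node_env: str
    log_level: str
    dashboard_port: int
    refinery_port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment; the defaults here are assumed values."""
    env = os.environ if env is None else env
    return Settings(
        node_env=env.get("NODE_ENV", "development"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        dashboard_port=int(env.get("DASHBOARD_PORT", "3007")),
        refinery_port=int(env.get("REFINERY_PORT", "3004")),
    )
```

Accepting an optional mapping instead of always reading `os.environ` makes it straightforward to exercise each environment's values in tests.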
- Fork the repository
- Create a feature branch
- Make changes with tests
- Submit a pull request
- CI pipeline validates changes
- Code review and merge
- ✅ All tests pass
- ✅ Code coverage ≥ 70%
- ✅ Linting and formatting checks pass
- ✅ Security scans pass
- ✅ Documentation updated
- Dashboard Load Time: < 2 seconds
- API Response Time: < 1 second
- Data Processing: 500+ records/minute
- System Uptime: 99.9% target
- Docker multi-stage builds
- Efficient dependency caching
- Parallel test execution
- Resource optimization
- Services Not Starting
  If services are not starting, check the individual service logs in the logs/ directory at the project root for specific error messages.
- Health Check Failures
  If services are failing health checks, ensure they are running and accessible on their configured ports. You can manually check a service's health endpoint using curl:
  curl http://localhost:[PORT]/health
  Replace [PORT] with the service's actual port (e.g., 3005 for Telegram Scraper).
- Telegram Scraper Authentication Prompt
  If the Telegram Scraper prompts for a phone number or bot token, the authentication session file is missing or invalid. Perform a one-time interactive authentication:
  - Ensure sudata-symphony.py is not running.
  - Navigate to apps/telegram-scraper in your terminal.
  - Run tele-venv/Scripts/python.exe main.py (Windows) or ./tele-venv/bin/python main.py (Linux/Mac).
  - Follow the prompts to enter your phone number, verification code, and 2FA password (if applicable).
  - Once authenticated, a session file will be created. You can then restart sudata-symphony.py.
- Dashboard Showing Red/Errors Despite Services Running
  If sudata-symphony.py reports services as healthy but the dashboard shows errors, it might be a caching issue or a mismatch in the dashboard's internal health check URLs. Try:
  - Hard refreshing your browser (Ctrl+F5 or Cmd+Shift+R) for http://localhost:3007/.
  - Clearing your browser's cache for http://localhost:3007/.
- Port Conflicts
  sudata-symphony.py now includes automated port cleanup at startup. If you still encounter "Address already in use" errors, manually identify and terminate the conflicting process (Windows commands shown):
  # Find process using a specific port (e.g., 3005)
  netstat -ano | findstr :3005
  # Terminate the process using its PID (replace <PID>)
  taskkill /PID <PID> /F
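For a platform-independent way to confirm a port conflict before hunting for the process, a short connection probe is enough. This sketch only detects whether something is accepting TCP connections on the port; it does not identify or terminate the process, and `port_in_use` is an illustrative helper rather than part of sudata-symphony.py.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(3005)` returning True while the Telegram Scraper is stopped suggests another process is holding the port.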
- ✅ CI/CD Pipeline Implementation
- ✅ Multi-Environment Deployment
- ✅ Automated Health Checks
- ✅ Rollback Capabilities
- ✅ Comprehensive Documentation
- Advanced Monitoring Dashboard
- Enhanced Security Features
- Performance Optimization
- Additional Data Sources
This project is licensed under the MIT License - see the LICENSE file for details.
SuData 2025