sudaverse/SuData

SuData - Comprehensive Data Pipeline System

🚀 Overview

SuData is a comprehensive data pipeline system that automates the collection, processing, and monitoring of streaming data from multiple sources. Built with modern DevOps practices, it features automated CI/CD pipelines, multi-environment deployments, and robust monitoring capabilities.

Architecture

+-------------------+    +-------------------+    +-------------------+
|  Telegram Scraper |    |   TikTok Scraper  |    |  YouTube Scraper  |
|                   |    |                   |    |                   |
+---------+---------+    +---------+---------+    +---------+---------+
          |                        |                        |
          | (Raw Data)             | (Raw Data)             | (Raw Data)
          v                        v                        v
+---------------------------------------------------------------------+
|                     Refinery Service (Python)                       |
|                     (Processes & Refines Data)                      |
+---------------------------------------------------------------------+
                                 |
                                 | (Refined Data)
                                 v
+---------------------------------------------------------------------+
|                         Output / Storage                            |
+---------------------------------------------------------------------+
                              |  ^                                
                              |  | (Health Status)                 
                              v  |     
+---------------------------------------------------------------------+
|             SuData Symphony (Orchestration & Monitoring)            |
|                 (Starts, Stops, Manages Health Checks)              |
+---------------------------------------------------------------------+
                                 ^                                
                                 | (Health Status)                 
                                 |                                
                        +-------------------+    
                        |     Dashboard     |    
                        |                   |    
                        +-------------------+    

🎯 Features

Core Functionality

  • Multi-Source Data Collection: Automated scraping from Telegram, TikTok, and YouTube
  • Real-time Processing: Intelligent data refinement and transformation
  • Web Dashboard: Interactive monitoring and visualization interface
  • Health Monitoring: Comprehensive system health checks and alerts

DevOps & CI/CD

  • Automated Testing: Unit, integration, and E2E test suites
  • Multi-Environment Support: Development, staging, and production environments
  • Docker Containerization: Consistent deployment across environments
  • Automated Deployments: GitHub Actions-powered CI/CD pipeline
  • Rollback Capabilities: Automated rollback on deployment failures
  • Quality Gates: Code quality, security scanning, and coverage requirements

🛠️ Technology Stack

Frontend

  • Next.js 14: React-based web framework
  • TypeScript: Type-safe JavaScript development
  • Tailwind CSS: Utility-first CSS framework
  • Chart.js: Data visualization library

Backend Services

  • Python 3.11+: Core processing language
  • FastAPI: High-performance API framework
  • AsyncIO: Asynchronous programming support
  • Pandas: Data manipulation and analysis

Virtual Environments & Large Files

Each Python service uses its own dedicated virtual environment: tele-venv for the Telegram Scraper, tt-venv for the TikTok Scraper, yt-venv for the YouTube Scraper, and .venv for the Refinery Service. These virtual environments, along with other large binary files (such as dnnl.lib and torch_cpu.dll), are intentionally excluded from Git via the .gitignore file so that the repository stays within GitHub's file size limits. When setting up the project, pnpm install creates and populates these virtual environments.
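
Because the interpreter sits at a different path on Windows and on Unix-like systems, anything that launches a service has to resolve it per platform. The following is a minimal sketch, assuming the standard venv layout (Scripts/python.exe on Windows, bin/python elsewhere). The function name venv_python is hypothetical and is not taken from the SuData codebase.

```python
import os
from pathlib import Path


def venv_python(venv_dir, os_name=os.name):
    """Return the interpreter path inside a virtual environment.

    Assumes the standard venv layout. `os_name` is a parameter only so
    that both platform branches can be exercised from one machine.
    """
    venv = Path(venv_dir)
    if os_name == "nt":  # Windows
        return venv / "Scripts" / "python.exe"
    return venv / "bin" / "python"


# Example: interpreter for the Telegram Scraper's environment
telegram_python = venv_python("tele-venv")
```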

DevOps & Infrastructure

  • Docker: Containerization platform
  • GitHub Actions: CI/CD automation
  • GitHub Container Registry: Docker image storage
  • pnpm: Fast, disk space efficient package manager

Testing & Quality

  • Vitest: Unit testing for frontend
  • Jest: JavaScript testing framework
  • Pytest: Python testing framework
  • Playwright: End-to-end testing
  • ESLint: JavaScript/TypeScript linting
  • Flake8: Python code linting

🚀 Quick Start

Prerequisites

  • Node.js 18+
  • Python 3.11+
  • Docker & Docker Compose
  • pnpm

Installation

  1. Clone the repository

    git clone https://github.com/O96a/SuData.git
    cd SuData
  2. Install dependencies

    pnpm install
  3. Start services using SuData Symphony (Recommended)

    Navigate to the project root and run:

    python scripts/sudata-symphony.py

    Note: SuData Symphony now includes automated port cleanup at startup to prevent conflicts. The --no-confirm flag is enabled by default for non-interactive startup.

🎼 Service Management & Monitoring

SuData Symphony - Service Orchestration

The SuData Symphony is our production-ready orchestration system that manages all services with proper startup order, health monitoring, and graceful shutdown.

Features:

  • 🚀 Smart Startup: Automatic service discovery and ordered startup
  • 💚 Health Monitoring: Real-time health checks for all services
  • 🎯 User-Friendly: Interactive confirmation and colored logging
  • 🛡️ Graceful Shutdown: Proper cleanup and resource management
  • 🔧 Easy Configuration: YAML-based service configuration
  • 🔄 Auto-Recovery: Automatic restart of failed services
  • 📊 Process Monitoring: Continuous health checks and status tracking

Quick Commands:

# Start all services
python scripts/sudata-symphony.py

# Start with specific log level
python scripts/sudata-symphony.py --log-level INFO

# To gracefully stop all services, press Ctrl+C in the terminal running SuData Symphony.
# Service status can be monitored via the Dashboard.

Service Manager

The Service Manager handles the lifecycle of all services with the following capabilities:

  • Service Lifecycle Management: Start, stop, and restart services
  • Health Monitoring: Continuous health checks via /health endpoints
  • Automatic Recovery: Configurable auto-restart for failed services
  • Logging: Centralized logging for all services
  • Cross-Platform: Works on both Windows and Unix-based systems
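
One way to reason about configurable auto-restart is as a small policy object that counts consecutive failed health checks. The sketch below is illustrative only: the class name RestartPolicy and the thresholds failure_threshold and max_restarts are assumptions, not settings documented by SuData.

```python
class RestartPolicy:
    """Decide when a service should be restarted (hypothetical policy)."""

    def __init__(self, failure_threshold=3, max_restarts=5):
        self.failure_threshold = failure_threshold  # failures before a restart
        self.max_restarts = max_restarts            # upper bound on restarts
        self.consecutive_failures = 0
        self.restarts = 0

    def record(self, healthy):
        """Record one health check; return True if a restart is due now."""
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if (self.consecutive_failures >= self.failure_threshold
                and self.restarts < self.max_restarts):
            self.restarts += 1
            self.consecutive_failures = 0
            return True
        return False
```

Capping the number of restarts keeps a service that fails immediately on startup from being restarted in a tight loop.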

Service Configuration

Services are configured in scripts/config/services.yaml with options for:

  • Virtual environment management
  • Custom startup commands
  • Environment variables
  • Health check endpoints
  • Auto-recovery settings

For detailed configuration options, see Service Manager Documentation.
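
As an illustration of how the options listed above might look once loaded into memory, here is a hedged sketch. The actual schema of services.yaml is not shown in this README, so every field name below (command, venv, env, health_endpoint, port, auto_restart) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceConfig:
    """Hypothetical in-memory shape of one service entry."""
    name: str
    command: list                             # custom startup command
    venv: str = ""                            # virtual environment directory
    env: dict = field(default_factory=dict)   # extra environment variables
    health_endpoint: str = "/health"          # health check path
    port: int = 0
    auto_restart: bool = True                 # auto-recovery setting

    def health_url(self, host="localhost"):
        """Build the full health check URL for this service."""
        return f"http://{host}:{self.port}{self.health_endpoint}"


# Example entry using the Telegram Scraper's venv name and port
telegram = ServiceConfig(
    name="telegram-scraper",
    command=["python", "main.py"],
    venv="tele-venv",
    port=3005,
)
```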

Monitoring Dashboard

Access the monitoring dashboard at http://localhost:3007 to view:

  • Service status and health
  • Resource usage (CPU, memory)
  • Logs and error messages
  • Performance metrics

Service Ports

| Service          | Port | Description              |
|------------------|------|--------------------------|
| TikTok Scraper   | 3000 | TikTok data collection   |
| YouTube Scraper  | 3002 | YouTube data collection  |
| Dashboard        | 3007 | Monitoring interface     |
| Refinery Service | 3004 | Data processing          |
| Telegram Scraper | 3005 | Telegram data collection |
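
When ports are reassigned, a quick sanity check can confirm that no two services share one. A small sketch using the ports listed above; the helper name find_port_collisions is hypothetical.

```python
SERVICE_PORTS = {
    "TikTok Scraper": 3000,
    "YouTube Scraper": 3002,
    "Refinery Service": 3004,
    "Telegram Scraper": 3005,
    "Dashboard": 3007,
}


def find_port_collisions(ports):
    """Return {port: [service names]} for ports assigned more than once."""
    seen = {}
    for name, port in ports.items():
        seen.setdefault(port, []).append(name)
    return {port: names for port, names in seen.items() if len(names) > 1}


collisions = find_port_collisions(SERVICE_PORTS)
```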

Health Checks

Each service exposes a health check endpoint at /health that returns:

  • Service status (healthy/unhealthy)
  • Uptime
  • Version information
  • Dependencies status

Example health check response:

{
  "status": "healthy",
  "timestamp": "2023-01-01T00:00:00Z",
  "version": "1.0.0",
  "details": {
    "database": "connected",
    "disk_space": "sufficient"
  }
}
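
A client can consume this response with the standard library alone. The sketch below assumes the response shape shown in the example above; the function name parse_health is hypothetical.

```python
import json
from datetime import datetime, timezone


def parse_health(body):
    """Parse a /health response body into (is_healthy, timestamp)."""
    payload = json.loads(body)
    is_healthy = payload.get("status") == "healthy"
    # Replace the trailing "Z" so fromisoformat() also works on Python
    # versions older than 3.11.
    raw = payload["timestamp"].replace("Z", "+00:00")
    return is_healthy, datetime.fromisoformat(raw)


example = """
{
  "status": "healthy",
  "timestamp": "2023-01-01T00:00:00Z",
  "version": "1.0.0",
  "details": {"database": "connected", "disk_space": "sufficient"}
}
"""
healthy, checked_at = parse_health(example)
```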

Logs

Service logs are available in the following locations:

  • Console Output: Direct output from each service
  • Log Files: Stored in logs/ directory at the project root (e.g., /SuData/logs/telegram_scraper.log)
  • Dashboard: Real-time log viewing in the monitoring interface

Alerts

Configure alerts for:

  • Service failures
  • Resource constraints
  • Performance degradation
  • Failed health checks

Alerts can be sent to:

  • Email
  • Slack/Teams
  • Discord
  • Custom webhooks
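
For the custom webhook case, an alert is typically a JSON POST. This README does not specify the payload format, so the field names service and reason below are assumptions, and the URL is a placeholder. The sketch builds the request without sending it.

```python
import json
import urllib.request


def build_alert_request(webhook_url, service, reason):
    """Build (but do not send) a JSON POST request for a webhook alert."""
    payload = {"service": service, "reason": reason}  # hypothetical fields
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_alert_request(
    "https://example.invalid/hooks/sudata",  # placeholder URL
    "Telegram Scraper",
    "failed health check",
)
# urllib.request.urlopen(request, timeout=5) would deliver the alert.
```

Slack, Teams, and Discord each expect their own payload shape, so a real integration would adapt the body per target.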

Once the services are running, access the dashboard at http://localhost:3007.

Development Setup

  1. Install Python and JavaScript dependencies

    # From the project root, pnpm will install all dependencies for all services
    pnpm install
  2. Run tests

    pnpm run test:all
  3. Start development servers

    pnpm run start:all

🧪 Testing

Test Suites

  • Unit Tests: pnpm run test:unit
  • Integration Tests: pnpm run test:integration
  • End-to-End Tests: pnpm run test:e2e
  • Performance Tests: pnpm run test:performance
  • Coverage Reports: pnpm run test:coverage

Test Coverage

  • Target: 70% minimum coverage
  • Current Status: tracked on Codecov

🚀 Deployment

Deployment Process

Continuous integration, testing, and staging deployments are automated via GitHub Actions. Production deployment is currently triggered manually.

  1. Staging Deployment (Automatic)

    • Triggered on merge to main branch
    • Automated build, test, and deploy
    • Health checks and notifications
  2. Production Deployment (Manual)

    • Manual workflow dispatch
    • Requires confirmation input
    • Comprehensive health checks
    • Automated rollback on failure

📊 Monitoring & Health Checks

Health API Endpoints

Each service provides comprehensive health monitoring endpoints:

| Service          | Port | Endpoints                         | Status         |
|------------------|------|-----------------------------------|----------------|
| Telegram Scraper | 3005 | /health, /health/detailed, /      | ✅ Implemented |
| TikTok Scraper   | 3000 | /health, /health/detailed, /      | ✅ Implemented |
| YouTube Scraper  | 3002 | /health, /health/detailed, /      | ✅ Implemented |
| Refinery Service | 3004 | /health                           | ✅ Available   |
| Dashboard        | 3007 | Dashboard UI + service monitoring | ✅ Running     |

Health API Features

Each health endpoint provides:

  • Basic Health: Service operational status (healthy/error/stopped)
  • Detailed Statistics:
    • Number of monitored channels/streamers
    • Recent output files (last 24 hours)
    • Error count and activity metrics
    • File details (size, modification time, age)
  • Service Information: Version, uptime, and configuration details
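
The "recent output files" statistic can be derived from file modification times. A minimal sketch, assuming output files live in a single directory; the function name recent_files and the returned field names are hypothetical, not the services' actual response schema.

```python
import time
from pathlib import Path


def recent_files(directory, max_age_hours=24, now=None):
    """List files modified within the last `max_age_hours`, newest first."""
    now = time.time() if now is None else now
    results = []
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        stat = path.stat()
        age_hours = (now - stat.st_mtime) / 3600
        if age_hours <= max_age_hours:
            results.append({
                "name": path.name,
                "size_bytes": stat.st_size,
                "modified": stat.st_mtime,
                "age_hours": round(age_hours, 2),
            })
    return sorted(results, key=lambda item: item["modified"], reverse=True)
```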

Dashboard Integration

The web dashboard automatically monitors all service health endpoints:

  • 🟢 Green: Service running and healthy
  • 🟡 Amber: Service status unknown (health endpoint unavailable)
  • 🔴 Red: Service error or health check failed
  • ⚪ Gray: Service stopped
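
The color rules above amount to a small mapping. The sketch below is one possible reading of them, not the dashboard's actual code; the function name status_color and the state strings are assumptions.

```python
def status_color(state, endpoint_reachable=True):
    """Map a service state to the dashboard color described above."""
    if state == "stopped":
        return "gray"
    if not endpoint_reachable:
        return "amber"  # status unknown: health endpoint unavailable
    if state == "healthy":
        return "green"
    return "red"        # service error, or the health check failed
```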

Health Check Examples

# Check Telegram scraper health
curl http://localhost:3005/health

# Check TikTok scraper detailed stats
curl http://localhost:3000/health/detailed

# Check YouTube scraper status
curl http://localhost:3002/health

# View all services in the dashboard (open is macOS-only; use xdg-open on Linux or start on Windows)
open http://localhost:3007
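
The same checks can be scripted. Below is a minimal sketch using only the standard library and the ports listed in this README; the names fetch_status and check_all are hypothetical, and a service that cannot be reached is reported as "unknown".

```python
import json
import urllib.request

SERVICES = {
    "Telegram Scraper": 3005,
    "TikTok Scraper": 3000,
    "YouTube Scraper": 3002,
    "Refinery Service": 3004,
}


def fetch_status(url, timeout=3):
    """Return the "status" field from a health endpoint, or "unknown"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):  # connection errors or invalid JSON
        return "unknown"
    if isinstance(payload, dict):
        return payload.get("status", "unknown")
    return "unknown"


def check_all(services=SERVICES, fetch=fetch_status):
    """Query /health on every service and return {name: status}."""
    return {
        name: fetch(f"http://localhost:{port}/health")
        for name, port in services.items()
    }
```

Calling check_all() with the services running returns one status per service. The fetch parameter exists so the function can be exercised without network access.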

Monitoring Features

  • Service Health: Real-time health status monitoring with automatic refresh
  • Performance Metrics: File processing statistics and error tracking
  • Resource Utilization: Service uptime and activity monitoring
  • Error Tracking: Recent error count and log analysis
  • Deployment Status: Service availability and health endpoint monitoring

🔧 Configuration

Environment Variables

Development

NODE_ENV=development
LOG_LEVEL=DEBUG
DASHBOARD_PORT=3007

Staging

NODE_ENV=staging
LOG_LEVEL=DEBUG
DASHBOARD_PORT=3007
REFINERY_PORT=3004

Production

NODE_ENV=production
LOG_LEVEL=INFO
DASHBOARD_PORT=3007
REFINERY_PORT=3004
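
A Python service could read these variables as shown in the sketch below. The variable names come from the lists above; the Settings class, the load_settings function, and the fallback defaults are assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    node_env: str
    log_level: str
    dashboard_port: int
    refinery_port: int


def load_settings(env=None):
    """Build Settings from environment variables, with assumed defaults."""
    env = os.environ if env is None else env
    return Settings(
        node_env=env.get("NODE_ENV", "development"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        dashboard_port=int(env.get("DASHBOARD_PORT", "3007")),
        refinery_port=int(env.get("REFINERY_PORT", "3004")),
    )


# Example: the production values listed above
production = load_settings({
    "NODE_ENV": "production",
    "LOG_LEVEL": "INFO",
    "DASHBOARD_PORT": "3007",
    "REFINERY_PORT": "3004",
})
```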

📖 Documentation

Architecture & Design

Operations & Deployment

Development

🔐 Security

Security Features

  • Dependency Scanning: Automated vulnerability detection
  • Container Security: Image scanning and security policies
  • Secrets Management: Secure handling of sensitive configuration
  • Access Control: Role-based access and authentication

Security Practices

  • No hardcoded secrets in code
  • Regular dependency updates
  • Container image scanning
  • Secure environment variable management

🀝 Contributing

Development Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with tests
  4. Submit a pull request
  5. CI pipeline validates changes
  6. Code review and merge

Pull Request Requirements

  • ✅ All tests pass
  • ✅ Code coverage ≥ 70%
  • ✅ Linting and formatting checks pass
  • ✅ Security scans pass
  • ✅ Documentation updated

📈 Performance

Benchmarks

  • Dashboard Load Time: < 2 seconds
  • API Response Time: < 1 second
  • Data Processing: 500+ records/minute
  • System Uptime: 99.9% target

Optimization Features

  • Docker multi-stage builds
  • Efficient dependency caching
  • Parallel test execution
  • Resource optimization

🐛 Troubleshooting

Common Issues

  1. Services Not Starting: Check the individual service logs in the logs/ directory at the project root for specific error messages.

  2. Health Check Failures: If services are failing health checks, ensure they are running and accessible on their configured ports. You can manually check a service's health endpoint using curl:

    curl http://localhost:[PORT]/health

    Replace [PORT] with the service's actual port (e.g., 3005 for Telegram Scraper).

  3. Telegram Scraper Authentication Prompt: If the Telegram Scraper prompts for a phone number or bot token, the authentication session file is missing or invalid. Perform a one-time interactive authentication:

    1. Ensure sudata-symphony.py is not running.
    2. Navigate to apps/telegram-scraper in your terminal.
    3. Run tele-venv/Scripts/python.exe main.py (Windows) or ./tele-venv/bin/python main.py (Linux/Mac).
    4. Follow the prompts to enter your phone number, verification code, and 2FA password (if applicable).
    5. Once authenticated, a session file will be created. You can then restart sudata-symphony.py.
  4. Dashboard Showing Red/Errors Despite Services Running: If sudata-symphony.py reports services as healthy but the dashboard shows errors, the cause may be a caching issue or a mismatch in the dashboard's internal health check URLs. Try:

    • Hard refreshing your browser (Ctrl+F5 or Cmd+Shift+R) for http://localhost:3007/.
    • Clearing your browser's cache for http://localhost:3007/.
  5. Port Conflicts: sudata-symphony.py includes automated port cleanup at startup. If you still encounter "Address already in use" errors, manually identify and terminate the conflicting process:

    # Windows: find the process using a specific port (e.g., 3005)
    netstat -ano | findstr :3005
    # Windows: terminate the process using its PID (replace <PID>)
    taskkill /PID <PID> /F
    # Linux/macOS: find the process using the port, then terminate it
    lsof -i :3005
    kill <PID>
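
A port can also be probed from Python before starting a service. This is a minimal, cross-platform sketch using the standard socket module; the function name is_port_in_use is hypothetical and is not part of sudata-symphony.py.

```python
import socket


def is_port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        # connect_ex() returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0
```

For example, is_port_in_use(3005) indicates whether the Telegram Scraper's port is already taken.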

Support Resources

📊 Project Status

Current Version: v1.0.0

Recent Updates

  • ✅ CI/CD Pipeline Implementation
  • ✅ Multi-Environment Deployment
  • ✅ Automated Health Checks
  • ✅ Rollback Capabilities
  • ✅ Comprehensive Documentation

Upcoming Features

  • 🔄 Advanced Monitoring Dashboard
  • 🔄 Enhanced Security Features
  • 🔄 Performance Optimization
  • 🔄 Additional Data Sources

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


SuData 2025
