🃏 DECK: Development Execution Control Kernel

"The Smart Queue That Keeps CLIDE Organized Across All Your Repositories"

What is DECK?

DECK (Development Execution Control Kernel) is a lightweight, file-based queue management system designed specifically for managing concurrent CLIDE development sessions across multiple repositories. Using UNIX file locking primitives and simple directory structures, DECK ensures that your autonomous development system scales gracefully without overwhelming your resources, while maintaining clear separation between different repositories and their issues.

DECK is CLIDE's traffic controller - making sure the right number of issues get worked on simultaneously across all your repositories while queuing overflow work intelligently with full repository context.

Core Philosophy

File-Based Simplicity

DECK uses the filesystem as its database. Lock files represent active work tickets, directories organize queue states, and standard UNIX tools provide all the reliability needed. No databases, no complex dependencies - just rock-solid filesystem operations.

Atomic Operations

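Every DECK operation uses atomic filesystem primitives. Creating locks, moving files between directories, and releasing resources all happen atomically, preventing race conditions even in high-concurrency scenarios. Moving a ticket between waiting/ and active/ is a single rename, which is atomic as long as both directories live on the same filesystem, as they do in the layout described below.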
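
As an illustration of the two primitives this relies on, here is a minimal Python sketch of exclusive creation and rename. It is not DECK's actual implementation; the function names try_claim and move_ticket are hypothetical and chosen only for this example.

```python
import os
import tempfile


def try_claim(path: str) -> bool:
    """Create `path` only if it does not exist yet; return True if we created it."""
    try:
        # O_CREAT | O_EXCL turns "check, then create" into one atomic step,
        # so two processes can never both succeed for the same path.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def move_ticket(src: str, dst: str) -> None:
    """Move a ticket file; atomic only if src and dst share a filesystem."""
    os.rename(src, dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "active"))
        lock = os.path.join(root, "myorg_api_124.lock")
        print(try_claim(lock))  # first caller creates the file
        print(try_claim(lock))  # second caller is refused
        move_ticket(lock, os.path.join(root, "active", "myorg_api_124.lock"))
```

Exclusive creation decides which of two competing callers owns a ticket, while rename changes a ticket's state without ever exposing a half-moved file.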

Self-Healing Design

DECK can reconstruct its entire state by examining GitHub issue labels and active tmux sessions. Crashed processes, system reboots, or corrupted state files don't break the system - DECK rebuilds and continues.

Zero-Dependency Architecture

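DECK's queue operations rely only on standard UNIX utilities: flock, mkdir, mv, ls, and rm. Note that flock is part of util-linux, so it is present on most Linux distributions but may need to be installed separately on other UNIX-like systems. Beyond that, the queue itself needs no additional packages and no runtime dependencies, and it has no version conflicts to manage.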

System Architecture

Directory Structure

DECK organizes work using a simple but powerful directory layout:

/var/clide/deck/
├── active/                        # Currently running CLIDE sessions
│   ├── myorg_myrepo_123.lock     # Active development ticket
│   ├── myorg_api_124.lock        # Active development ticket
│   └── otherorg_frontend_125.lock # Active development ticket
├── waiting/                       # Queued work waiting for available slots
│   ├── myorg_myrepo_126.lock     # Queued ticket
│   ├── myorg_api_127.lock        # Queued ticket
│   └── otherorg_frontend_128.lock # Queued ticket
├── config/
│   └── deck.conf                 # DECK configuration
└── logs/
    └── deck.log                  # Operation history

Lock File Format

Each lock file contains JSON metadata about the work item:

{
  "repository": "myorg/myrepo",
  "issue_number": 123,
  "created_at": "2025-08-22T10:30:00Z",
  "started_at": "2025-08-22T10:32:15Z",
  "session_name": "clide-myorg_myrepo_123",
  "worktree_path": "./worktrees/myorg_myrepo_123-add-auth",
  "branch_name": "issue-123-add-authentication",
  "status": "implementing",
  "pid": 12345,
  "priority": "normal"
}
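
A small Python sketch of how a consumer might load and validate this metadata. The Ticket class is hypothetical. Which fields are optional is an assumption: a ticket still in waiting/ presumably has no started_at, pid, or session yet.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Ticket:
    # Fields assumed to be present on every ticket.
    repository: str
    issue_number: int
    created_at: str
    status: str
    # Fields assumed to be filled in only once work has started.
    priority: str = "normal"
    started_at: Optional[str] = None
    session_name: Optional[str] = None
    worktree_path: Optional[str] = None
    branch_name: Optional[str] = None
    pid: Optional[int] = None

    @classmethod
    def from_json(cls, text: str) -> "Ticket":
        # Unknown keys or missing required keys raise TypeError here.
        ticket = cls(**json.loads(text))
        if "/" not in ticket.repository:
            raise ValueError("repository must look like 'owner/repo'")
        if not isinstance(ticket.issue_number, int) or ticket.issue_number <= 0:
            raise ValueError("issue_number must be a positive integer")
        return ticket

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

Rejecting malformed files at load time keeps a corrupted lock from silently occupying an active slot.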

Configuration System

DECK's behavior is controlled through a simple configuration file:

# /var/clide/deck/config/deck.conf
MAX_CONCURRENT=3
QUEUE_DIR="/var/clide/deck"
TIMEOUT_HOURS=24
AUTO_CLEANUP=true
PRIORITY_LABELS="urgent,security,hotfix"
LOG_LEVEL="info"
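
A minimal sketch of reading this KEY=VALUE format from Python, assuming one setting per line, optional double quotes around values, and # for comments. The function names are hypothetical. The fallback defaults simply mirror the example values above and are not documented defaults.

```python
def parse_deck_conf(text: str) -> dict:
    """Parse KEY=VALUE lines into a dict of raw strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def typed_settings(raw: dict) -> dict:
    """Convert raw strings to typed values (defaults are assumptions)."""
    labels = raw.get("PRIORITY_LABELS", "")
    return {
        "max_concurrent": int(raw.get("MAX_CONCURRENT", "3")),
        "queue_dir": raw.get("QUEUE_DIR", "/var/clide/deck"),
        "timeout_hours": int(raw.get("TIMEOUT_HOURS", "24")),
        "auto_cleanup": raw.get("AUTO_CLEANUP", "true").lower() == "true",
        "priority_labels": [p.strip() for p in labels.split(",") if p.strip()],
        "log_level": raw.get("LOG_LEVEL", "info"),
    }
```

Keeping parsing and type conversion separate makes it easy to report a malformed line without also guessing at its meaning.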

Core Operations

Ticket Queueing (Fire-and-Forget)

When a GitHub webhook triggers, DECK immediately queues work without blocking:

Process Flow:

  1. Webhook handler calls deck queue [repo] [issue_number] and returns immediately
  2. DECK creates a lock file in the waiting/ directory, named owner_repo_issue.lock
  3. DECK signals its background daemon that queue state changed
  4. Webhook returns HTTP 200 in milliseconds - no waiting
  5. Background daemon processes queue changes asynchronously

Repository-Scoped Tickets:

  • Ticket naming: owner_repo_issuenumber.lock (e.g., myorg_api_123.lock)
  • Supports multiple repositories in same DECK instance
  • Unambiguous identification across all managed repositories
  • tmux session naming: clide-owner_repo_issue for consistency
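
The queueing flow and naming scheme above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not DECK's code: the function names are hypothetical, the signal file is assumed to be a .queue-changed file in the queue root, and the initial status value "queued" is invented for the example.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def ticket_name(repo: str, issue_number: int) -> str:
    """Map 'owner/repo' and an issue number to 'owner_repo_issue'."""
    owner, name = repo.split("/", 1)
    return f"{owner}_{name}_{issue_number}"


def queue_ticket(root: str, repo: str, issue_number: int,
                 priority: str = "normal") -> str:
    """Create a waiting ticket and touch the signal file; returns the lock path."""
    waiting = os.path.join(root, "waiting")
    os.makedirs(waiting, exist_ok=True)
    name = ticket_name(repo, issue_number)
    path = os.path.join(waiting, name + ".lock")
    meta = {
        "repository": repo,
        "issue_number": issue_number,
        "created_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "session_name": "clide-" + name,
        "status": "queued",  # assumed initial status
        "priority": priority,
    }
    try:
        # Mode "x" fails if the file exists, so a repeated webhook
        # delivery does not overwrite an already queued ticket.
        with open(path, "x", encoding="utf-8") as fh:
            json.dump(meta, fh)
    except FileExistsError:
        return path
    # "Touch" the signal file so the daemon knows the queue changed.
    with open(os.path.join(root, ".queue-changed"), "a"):
        pass
    return path
```

The caller does nothing beyond two small file operations, which is what lets the webhook handler return immediately. A fuller version would also check active/ for an existing ticket before queueing.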

Ticket Release

When CLIDE completes work (PR created or issue closed), it releases its ticket:

Process Flow:

  1. Remove lock file from active/ directory
  2. Check waiting/ directory for queued work
  3. Atomically move first queued ticket to active/
  4. Notify newly activated CLIDE session to begin work
  5. Log the transition for audit trail

Queue Promotion:

  • First-in-first-out promotion from waiting to active
  • Priority queue support for urgent issues
  • Automatic notification of promoted sessions
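
One way to combine first-in-first-out promotion with priority support is to sort waiting tickets by priority rank first and creation time second. The sketch below assumes the four priority names listed under Priority Levels later in this document, lower-cased to match the "normal" value in the lock file example. The function name and the rank table are hypothetical.

```python
from typing import Dict, List

# Lower rank is promoted first (assumed ordering).
PRIORITY_RANK = {"critical": 0, "high": 1, "normal": 2, "low": 3}


def promotion_order(tickets: List[Dict]) -> List[Dict]:
    """Return tickets sorted by priority, then by created_at (oldest first)."""
    def key(ticket: Dict):
        rank = PRIORITY_RANK.get(ticket.get("priority", "normal"),
                                 PRIORITY_RANK["normal"])
        # Timestamps in the same UTC format sort correctly as plain strings.
        return (rank, ticket["created_at"])
    return sorted(tickets, key=key)
```

Strict priority ordering like this can starve low-priority tickets under sustained load, so a real scheduler would need some form of aging on top of it.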

Status Monitoring

DECK provides real-time visibility into system state:

Active Work Display:

  • List all currently running CLIDE sessions
  • Show progress indicators and time elapsed
  • Display worktree paths and branch information
  • Monitor resource usage per session

Queue Visibility:

  • Show waiting tickets in order
  • Estimate wait times based on historical data
  • Display queue depth and processing rate
  • Identify bottlenecks and capacity issues

State Reconstruction

DECK's self-healing capability rebuilds queue state from external sources:

Multi-Repository GitHub Scanning:

  • Find all open issues with 🚀 clide-working label across managed repositories
  • Verify corresponding tmux sessions exist with naming pattern clide-owner_repo_issue
  • Create missing lock files for active work using owner_repo_issue.lock format
  • Remove orphaned locks for non-existent sessions or closed issues

Repository-Aware Consistency:

  • Validate lock file metadata against GitHub issue status per repository
  • Detect and resolve duplicate or conflicting tickets across repositories
  • Clean up stale locks from crashed processes with full repo/issue context
  • Reconcile queue state with GitHub issue status maintaining repository scope
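
The core of this reconciliation is a set comparison. The following sketch assumes the three inputs have already been collected: ticket names for open issues carrying the working label, tmux session names, and ticket names that currently have a lock in active/. The function name is hypothetical, and how the inputs are gathered is out of scope here.

```python
from typing import Set, Tuple

SESSION_PREFIX = "clide-"


def reconcile(labeled: Set[str], sessions: Set[str],
              locks: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (locks to create, locks to remove) for the active/ directory.

    All names use the owner_repo_issue form, without the .lock suffix.
    """
    # Work counts as live only if the issue is labeled AND its session exists.
    live = {name for name in labeled if SESSION_PREFIX + name in sessions}
    to_create = live - locks   # live work that has no lock file
    to_remove = locks - live   # locks whose issue or session is gone
    return to_create, to_remove
```

Labeled issues without a matching session are left out of both results on purpose. Whether to requeue them or flag them for review is a policy decision this sketch does not make.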

DECK Daemon Architecture

Background Process Management

DECK operates through a persistent background daemon that handles all queue processing:

Daemon Responsibilities:

  • Continuous monitoring of queue state changes
  • Automatic promotion of waiting tickets to active status
  • CLIDE session spawning and lifecycle management
  • Resource monitoring and capacity management
  • Health checks and self-recovery operations

Service Integration:

  • Runs as systemd service for automatic startup and recovery
  • Configurable restart policies for high availability
  • Log rotation and monitoring integration
  • Graceful shutdown with work preservation

Signal-Based Communication:

  • Filesystem signals (touch files) trigger daemon processing
  • No polling overhead - the daemon sleeps until work is available
  • Atomic operations prevent race conditions
  • Efficient resource utilization

Queue Processing Cycle

The daemon operates in a continuous processing cycle:

Monitoring Phase:

  • Watch for .queue-changed signal files
  • Detect system resource changes
  • Monitor active session health
  • Check configuration updates

Processing Phase:

  • Count available active slots vs configuration limits
  • Identify next tickets for promotion based on priority
  • Atomically move tickets from waiting to active
  • Spawn CLIDE sessions for newly activated work
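
A minimal sketch of the slot-counting and promotion steps, assuming the active/ and waiting/ layout shown earlier. For brevity it orders waiting tickets by file modification time only and ignores priority. Spawning the CLIDE session is left to the caller. The function names are hypothetical.

```python
import os
import tempfile
from typing import List


def list_tickets(directory: str) -> List[str]:
    """Return .lock file names in `directory`, oldest modification time first."""
    names = [f for f in os.listdir(directory) if f.endswith(".lock")]
    return sorted(
        names,
        key=lambda f: (os.path.getmtime(os.path.join(directory, f)), f),
    )


def promote_waiting(root: str, max_concurrent: int) -> List[str]:
    """Move waiting tickets into active/ until all slots are used."""
    active = os.path.join(root, "active")
    waiting = os.path.join(root, "waiting")
    free = max_concurrent - len(list_tickets(active))
    promoted = []
    for name in list_tickets(waiting):
        if free <= 0:
            break
        # A rename within the same filesystem is atomic, so a ticket is
        # never visible in both directories or in neither.
        os.rename(os.path.join(waiting, name), os.path.join(active, name))
        promoted.append(name)
        free -= 1
    return promoted
```

The returned list tells the daemon which sessions to start. If several daemon instances could run at once, the whole cycle would also need to be serialized, for example with flock.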

Maintenance Phase:

  • Clean up completed or failed sessions
  • Update ticket metadata and logs
  • Perform health checks on system components
  • Report metrics and status information

Advanced Features

Priority Queue Management

DECK supports priority-based queue management:

Priority Levels:

  • Critical: Security issues, production outages
  • High: Urgent features, important bug fixes
  • Normal: Standard development work
  • Low: Nice-to-have features, cleanup tasks

Priority Detection:

  • Automatic priority assignment based on GitHub labels
  • Manual priority override through DECK commands
  • Queue jumping for escalated issues
  • Fair scheduling to prevent starvation
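
As a rough illustration of label-based priority detection with a manual override, consider the sketch below. The mapping is an assumption: this document does not say which level a label in PRIORITY_LABELS maps to, so the example treats any match as "high". The function name is hypothetical.

```python
from typing import Iterable, Optional

LEVELS = ("critical", "high", "normal", "low")


def priority_from_labels(labels: Iterable[str], priority_labels: str,
                         override: Optional[str] = None) -> str:
    """Pick a priority level from issue labels, unless an override is given."""
    if override is not None:
        if override not in LEVELS:
            raise ValueError(f"unknown priority level: {override!r}")
        return override  # a manual override always wins
    wanted = {p.strip().lower() for p in priority_labels.split(",") if p.strip()}
    if any(label.strip().lower() in wanted for label in labels):
        return "high"  # assumed level for any matching label
    return "normal"
```

For example, priority_from_labels(["bug", "security"], "urgent,security,hotfix") yields "high", while an issue with only a "bug" label stays at "normal".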

Resource-Aware Scheduling

DECK monitors system resources to optimize concurrent work:

Capacity Management:

  • Dynamic adjustment of max concurrent sessions
  • Resource usage monitoring per CLIDE session
  • Automatic throttling during high system load
  • Smart scheduling based on issue complexity estimates

Load Balancing:

  • Distribution of work across multiple CLIDE instances
  • Affinity scheduling for related issues
  • Resource allocation based on historical patterns
  • Predictive scaling for queue management

Timeout and Cleanup Management

DECK prevents runaway sessions and resource leaks:

Session Timeouts:

  • Configurable maximum development time per issue
  • Warning notifications before timeout
  • Graceful session termination with state preservation
  • Automatic issue commenting about timeout scenarios

Orphan Detection:

  • Regular scanning for lock files without corresponding processes
  • Cleanup of stale worktrees and tmux sessions
  • Recovery of work state from git branches
  • Notification of cleanup actions for audit trail
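
Both checks can be expressed against the pid and started_at fields of a lock file. The sketch below assumes timestamps in the UTC format shown in the lock file example and a timeout measured from started_at. The function names are hypothetical.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 performs the check without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_timed_out(started_at: str, timeout_hours: int,
                 now: Optional[datetime] = None) -> bool:
    """Return True if more than `timeout_hours` have passed since started_at."""
    started = datetime.strptime(started_at, "%Y-%m-%dT%H:%M:%SZ")
    started = started.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - started > timedelta(hours=timeout_hours)


def is_stale(ticket: dict, timeout_hours: int,
             now: Optional[datetime] = None) -> bool:
    """A ticket is stale if its process is gone or it has run too long."""
    if not pid_alive(ticket["pid"]):
        return True
    return is_timed_out(ticket["started_at"], timeout_hours, now)
```

A PID check alone can be fooled by PID reuse after a reboot, which is one reason to cross-check against the tmux session name as well.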

Integration Hooks

DECK provides integration points for external systems:

Webhook Integration:

  • Status change notifications to monitoring systems
  • Queue depth alerts for capacity planning
  • Performance metrics for analytics dashboards
  • Custom hooks for organization-specific workflows

Logging and Metrics:

  • Structured logging for all DECK operations
  • Performance metrics collection and reporting
  • Queue analytics for optimization insights
  • Integration with existing monitoring infrastructure

Operational Commands

Basic Queue Management

DECK provides a simple command-line interface optimized for webhook responsiveness:

deck queue [repo] [issue_number] [priority]

  • Queue work ticket for the specified repository and issue (returns immediately)
  • Repository format: "owner/repo" (e.g., "myorg/myrepo")
  • Optional priority override for urgent work
  • Creates waiting ticket and signals daemon
  • Fire-and-forget operation for webhook compatibility

deck release [repo] [issue_number]

  • Release work ticket for completed repository issue
  • Signals daemon to promote next queued item
  • Clean up associated resources and metadata
  • Log completion for audit and metrics

deck status [--detailed] [--json]

  • Show current queue state and active work
  • Optional detailed view with resource usage
  • JSON output for programmatic integration
  • Real-time updates for monitoring dashboards

Administrative Operations

DECK includes powerful administrative capabilities:

deck daemon [start|stop|restart|status]

  • Control the DECK background daemon
  • Start/stop queue processing services
  • Check daemon health and process status
  • Restart daemon for configuration changes

deck rebuild

  • Reconstruct queue state from GitHub and system state
  • Resolve inconsistencies and orphaned resources
  • Validate all lock files and metadata
  • Report actions taken for audit purposes

deck cleanup [--dry-run]

  • Remove stale locks and orphaned resources
  • Clean up abandoned worktrees and sessions
  • Garbage collect old log files and temporary data
  • Optional dry-run mode for safety

deck configure [setting] [value]

  • Update DECK configuration without restart
  • Validate configuration changes before applying
  • Support for hot-reloading of most settings
  • Backup and restore configuration states

Monitoring and Debugging

DECK provides comprehensive observability:

deck monitor [--continuous]

  • Real-time queue monitoring with live updates
  • Resource usage tracking per active session
  • Queue depth trends and processing rates
  • Alert notifications for unusual conditions

deck logs [--tail] [--grep pattern]

  • Access structured logs for all DECK operations
  • Filtering and searching capabilities
  • Integration with standard UNIX log tools
  • Export capabilities for external analysis

deck health

  • Comprehensive system health check
  • Validation of all DECK components and dependencies
  • Performance benchmarking and capacity assessment
  • Recommendations for optimization and scaling

Error Handling and Recovery

Fault Tolerance

DECK handles failures gracefully without losing work:

Process Failures:

  • Automatic detection of crashed CLIDE sessions
  • Recovery of work state from git branches
  • Notification of failures through configured channels
  • Requeue options for recoverable failures

System Failures:

  • Queue state persistence across system restarts
  • Automatic reconstruction after unclean shutdowns
  • Validation and repair of corrupted lock files
  • Graceful degradation during resource constraints

Network Failures:

  • Retry logic for GitHub API operations
  • Offline operation capabilities where possible
  • Queue state synchronization after connectivity restoration
  • Conflict resolution for concurrent modifications

Data Integrity

DECK is designed to keep the queue consistent under concurrent access and after failures:

Atomic Operations:

  • All file operations use atomic move semantics
  • Consistent state even during concurrent access
  • Lock file integrity through checksums and validation
  • Transaction-like behavior for complex operations

Consistency Checks:

  • Regular validation of queue state against external sources
  • Detection and resolution of split-brain scenarios
  • Automatic repair of minor inconsistencies
  • Escalation procedures for unresolvable conflicts

Performance and Scalability

Efficient Queue Operations

DECK optimizes for high-throughput, low-latency operations:

Fast Path Operations:

  • O(1) ticket acquisition for available slots
  • Minimal filesystem operations per transaction
  • Cached state for frequently accessed data
  • Batched operations for bulk queue management

Scalability Features:

  • Horizontal scaling across multiple DECK instances
  • Shared queue state through distributed locking
  • Load balancing for optimal resource utilization
  • Automatic capacity adjustment based on demand

Resource Optimization

DECK minimizes system resource usage:

Memory Efficiency:

  • Minimal memory footprint through file-based storage
  • Lazy loading of metadata and configuration
  • Efficient data structures for queue operations
  • Garbage collection of temporary resources

Disk Efficiency:

  • Compact lock file format with minimal metadata
  • Log rotation and archival for long-term storage
  • Temporary file cleanup and resource reclamation
  • Optimal directory structure for filesystem performance

Security and Access Control

File System Security

DECK implements appropriate security measures:

Access Control:

  • Proper file permissions for all DECK resources
  • User/group isolation for multi-tenant environments
  • Secure temporary file creation and cleanup
  • Protection against symlink attacks and path traversal

Audit Trail:

  • Comprehensive logging of all DECK operations
  • Tamper-evident log files with integrity checking
  • User attribution for all administrative actions
  • Integration with system audit frameworks

Why DECK Works Perfectly with CLIDE

Natural Integration

DECK and CLIDE work together seamlessly:

  • Fire-and-Forget Webhooks: Webhook handlers return in milliseconds
  • Automatic Processing: Background daemon spawns CLIDE sessions asynchronously
  • Simple Interface: Just call deck queue and deck release
  • Self-Healing: Both systems can reconstruct state independently
  • GitHub Integration: DECK uses GitHub labels as source of truth

Operational Benefits

The DECK + CLIDE combination provides:

  • Maximum Webhook Responsiveness: No timeouts or blocking operations
  • Predictable Resource Usage: Never overwhelm your development environment
  • Fair Work Distribution: Issues get processed in appropriate order
  • Automatic Recovery: System restarts don't lose work or break queues
  • Easy Monitoring: Simple file-based state is easy to inspect and debug

Scaling Advantages

DECK enables CLIDE to scale gracefully:

  • Decoupled Architecture: Webhook intake separated from work processing
  • Capacity Management: Handle any number of issues without resource exhaustion
  • Load Distribution: Spread work across multiple CLIDE instances
  • Performance Optimization: Queue analytics identify bottlenecks and improvements
  • Future-Proof Architecture: Add more CLIDE workers as needed without system changes

DECK transforms CLIDE from a single-issue processor into a scalable, enterprise-ready autonomous development platform that can manage work across your entire organization's repository portfolio. Simple in concept, robust in implementation, and powerful in practice.

Example Multi-Repository Operation:

  • myorg/api issue #45 → clide-myorg_api_45 session → myorg_api_45.lock
  • myorg/frontend issue #12 → clide-myorg_frontend_12 session → myorg_frontend_12.lock
  • otherorg/mobile issue #99 → clide-otherorg_mobile_99 session → otherorg_mobile_99.lock

Your development queue is now as organized and reliable as your best project manager - but it never sleeps, and it works on all your repositories simultaneously.