feat: add persistent orchestrator service, CLI client, and remote dev improvements #42

Merged
l50 merged 1 commit into main from jayson/cap-858-overhaul-ares-k8s-multi-agent-architecture-for-production on Jan 14, 2026
l50 (Contributor) commented Jan 14, 2026

Key Changes:

  • Introduced a persistent orchestrator service for multi-agent operations via Redis
  • Added CLI and client for submitting and monitoring orchestrator operations remotely
  • Enhanced worker validation for Kubernetes pod prerequisites and nsenter usage
  • Improved the Redis task queue with operation locking and 24-hour result retention (previously 1 hour)

Added:

  • Orchestrator service for Kubernetes - Implements a persistent service
    (OrchestratorService) that listens for operation requests on a Redis queue,
    manages multi-agent operations, and reports status updates to Redis.
    (src/ares/core/orchestrator_service.py)
  • CLI for submitting and managing operations - New async CLI tool
    (src/ares/cli_ops.py) using cyclopts, enabling users to submit operations,
    check status, and wait for completion via the orchestrator service.
  • Orchestrator client API - Provides functions to submit operations, poll for
    status, and wait for completion through Redis (src/ares/core/orchestrator_client.py)
  • Remote development Taskfile - Added .taskfiles/remote/Taskfile.yaml for
    efficient syncing, hot-reloading, status checks, and log tailing in K8s pods
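
A minimal sketch of the submit/status round trip described above, assuming a Redis list as the request queue and one hash per operation for status. The key names, `submit_operation`, `process_next`, `get_status`, and the `InMemoryRedis` stand-in are illustrative assumptions, not the contents of orchestrator_client.py or orchestrator_service.py. The stand-in mirrors only the redis-py method names `rpush`, `lpop`, `hset`, and `hgetall` so the example runs without a server.

```python
import json
import uuid
from collections import defaultdict, deque


class InMemoryRedis:
    """Tiny stand-in exposing the few redis-py-style methods used below."""

    def __init__(self):
        self.lists = defaultdict(deque)
        self.hashes = defaultdict(dict)

    def rpush(self, name, value):
        self.lists[name].append(value)
        return len(self.lists[name])

    def lpop(self, name):
        return self.lists[name].popleft() if self.lists[name] else None

    def hset(self, name, mapping=None):
        self.hashes[name].update(mapping or {})

    def hgetall(self, name):
        return dict(self.hashes[name])


REQUEST_QUEUE = "ares:operations:requests"  # hypothetical key name


def status_key(operation_id):
    return f"ares:operations:{operation_id}:status"  # hypothetical key layout


def submit_operation(client, payload):
    """Client side: record a queued status, then enqueue the request."""
    operation_id = uuid.uuid4().hex
    client.hset(status_key(operation_id), mapping={"state": "queued"})
    client.rpush(REQUEST_QUEUE, json.dumps({"id": operation_id, "payload": payload}))
    return operation_id


def process_next(client, handler):
    """Service side: take one request, run it, and record the outcome."""
    raw = client.lpop(REQUEST_QUEUE)
    if raw is None:
        return None  # queue is empty
    request = json.loads(raw)
    key = status_key(request["id"])
    client.hset(key, mapping={"state": "running"})
    try:
        result = handler(request["payload"])
        client.hset(key, mapping={"state": "completed", "result": json.dumps(result)})
    except Exception as exc:
        client.hset(key, mapping={"state": "failed", "error": str(exc)})
    return request["id"]


def get_status(client, operation_id):
    """Client side: read the current status hash for an operation."""
    return client.hgetall(status_key(operation_id))
```

A caller would submit with `submit_operation(client, {...})` and poll `get_status` until the state is `completed` or `failed`. Against a real server, a `redis.Redis(decode_responses=True)` client should accept the same calls, though a long-running service would more likely block on the queue than poll it.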

Changed:

  • Orchestrator startup and agent management - Updated orchestrator logic to
    acquire and extend exclusive operation locks, wait for required workers before
    starting, and properly clean up locks on completion (src/ares/core/orchestrator.py)
  • RedisTaskQueue enhancements - Implemented operation locking methods
    (acquire_operation_lock, release_operation_lock, extend_operation_lock)
    and extended result TTL to 24 hours for improved reliability (src/ares/core/task_queue.py)
  • Worker startup validation - Added comprehensive pod prerequisite checks,
    including process namespace sharing, tools container detection, and nsenter
    capability validation for safe remote command execution (src/ares/core/worker.py)
  • Execution mode detection - Refactored remote execution logic for
    environment-aware auto-detection (ssm, k8s, or local) (src/ares/core/remote.py)
  • Documentation - Updated remote development workflow docs for new direct sync
    and hot-reload methods, and removed legacy ConfigMap instructions
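
The execution mode auto-detection above could look roughly like the following. This is a sketch under stated assumptions, not the logic in remote.py: `detect_execution_mode` and the `ARES_SSM_INSTANCE_ID` marker are hypothetical, and the precedence (ssm before k8s before local) is a guess. The only environment fact relied on is that Kubernetes injects `KUBERNETES_SERVICE_HOST` into pod containers.

```python
import os
from typing import Mapping, Optional


def detect_execution_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "ssm", "k8s", or "local" based on the surrounding environment."""
    env = os.environ if env is None else env
    # Hypothetical marker: assumes an SSM target is signalled by an instance ID.
    if env.get("ARES_SSM_INSTANCE_ID"):
        return "ssm"
    # Kubernetes injects this variable into the containers of every pod.
    if env.get("KUBERNETES_SERVICE_HOST"):
        return "k8s"
    # Nothing recognisable found, so fall back to local execution.
    return "local"
```

Passing the environment as a mapping keeps the function easy to test: `detect_execution_mode({})` returns `"local"` without touching the real process environment.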

Removed:

  • Outdated ConfigMap patch/cleanup steps and volume mount references from remote
    development documentation

…t tooling

**Added:**

- Introduced `src/ares/cli_ops.py`, an async CLI for submitting operations to the orchestrator service and tracking their status
- Added `src/ares/core/orchestrator_client.py` for programmatic operation submission, status, and wait-for-completion via Redis
- Implemented `src/ares/core/orchestrator_service.py` as a persistent orchestrator pod service for Kubernetes, processing operation requests from Redis
- Created `.taskfiles/remote/Taskfile.yaml` with tasks for syncing code to running K8s pods, hot reload, selective sync, pod status, logs, and exec
- Extended `src/ares/core/task_queue.py` with Redis-based operation locking (acquire, release, extend) to ensure single active orchestrator per operation
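
The acquire/release/extend semantics can be illustrated with a small in-memory model. This is a sketch of the general pattern, not the code in `task_queue.py`: the class name `OperationLocks`, the method signatures, and the owner-token scheme are assumptions. On Redis, this pattern is commonly built on `SET` with the `NX` and `EX` options for acquisition, with release and extension done as an atomic compare-then-act step.

```python
import time
import uuid


class OperationLocks:
    """In-memory model of per-operation locks with an owner token and a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # operation_id -> (owner_token, expires_at)

    def _live_owner(self, operation_id):
        """Return the current owner token, dropping the entry if it expired."""
        entry = self._locks.get(operation_id)
        if entry is None or entry[1] <= self._clock():
            self._locks.pop(operation_id, None)
            return None
        return entry[0]

    def acquire(self, operation_id, ttl):
        """Return a fresh owner token, or None if another owner holds the lock."""
        if self._live_owner(operation_id) is not None:
            return None
        token = uuid.uuid4().hex
        self._locks[operation_id] = (token, self._clock() + ttl)
        return token

    def extend(self, operation_id, token, ttl):
        """Push the expiry out, but only for the current owner."""
        owner = self._live_owner(operation_id)
        if owner is None or owner != token:
            return False
        self._locks[operation_id] = (token, self._clock() + ttl)
        return True

    def release(self, operation_id, token):
        """Drop the lock, but only for the current owner."""
        owner = self._live_owner(operation_id)
        if owner is None or owner != token:
            return False
        del self._locks[operation_id]
        return True
```

The owner token is what keeps a slow orchestrator from releasing or extending a lock that has already expired and been taken over by another instance; the TTL is what frees the operation if the holder crashes without releasing.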

**Changed:**

- Refactored `src/ares/core/orchestrator.py` to support orchestrator service workflow:
  - Added operation locking, background lock extension, and required worker wait logic
  - Improved health monitoring and agent startup coordination
  - Ensured lock release and connection cleanup on completion
- Enhanced `src/ares/core/worker.py` with pod validation:
  - Validates prerequisites for nsenter execution (process namespace, tools PID, CAP_SYS_ADMIN)
  - Improved tools container PID detection (via /proc and ps)
- Improved `src/ares/core/remote.py` with robust execution mode auto-detection for SSM, K8s orchestrator, and K8s worker pods
- Updated integration and unit tests to expect task result TTL of 24 hours (86400s) for better debugging and recovery
- Simplified and updated `docs/remote-development.md` to reflect direct `kubectl cp` syncing, new tasks, and troubleshooting steps
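
One way a `CAP_SYS_ADMIN` prerequisite check can be done is by reading the effective capability mask from `/proc/self/status`. The sketch below assumes that approach; the function names are hypothetical, and how `worker.py` actually performs its validation is not shown in this PR description. The bit index 21 for `CAP_SYS_ADMIN` comes from the Linux capability definitions.

```python
CAP_SYS_ADMIN = 21  # bit index of CAP_SYS_ADMIN in the Linux capability mask


def parse_effective_caps(status_text: str) -> int:
    """Extract the CapEff hex mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("CapEff line not found")


def has_cap_sys_admin(status_text: str) -> bool:
    """Report whether the effective capability set includes CAP_SYS_ADMIN."""
    return bool((parse_effective_caps(status_text) >> CAP_SYS_ADMIN) & 1)
```

In a pod, the input would come from something like `Path("/proc/self/status").read_text()`. A failing check is a natural place to emit an actionable message naming the missing capability.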

**Removed:**

- Deprecated ConfigMap-based sync and patch workflow in documentation in favor of direct pod sync

linear Bot commented Jan 14, 2026

CAP-858 Overhaul Ares K8s Multi-Agent Architecture for Production

Description:
Revamp the existing Ares multi-agent Kubernetes architecture to address critical gaps preventing production deployment. This includes health monitoring, worker readiness, prerequisite validation, operation locking, and deployment documentation. The goal is to ensure the system is robust, reliable, and easy to deploy at scale.


Objective:

Make the Ares Kubernetes multi-agent architecture production-ready by resolving critical stability, reliability, and usability issues, ensuring robust orchestration and deployment.


Scope of Work:

  • Implement orchestrator health monitoring to detect dead workers
  • Add worker readiness verification before task dispatch
  • Validate pod prerequisites at startup with clear error messaging
  • Extend result TTL from 1 hour to 24 hours
  • Implement operation locking to prevent concurrent orchestrators on the same operation
  • Auto-detect execution mode without manual environment variables
  • Integrate vulnerability priority queue into the exploitation workflow

Dependencies:

  • Access to current Kubernetes manifests and deployment environment
  • Reference to OVERHAUL.md for implementation details
  • Permissions to update orchestrator and worker codebases
  • Coordination with DevOps for deployment validation

Acceptance Criteria:

  1. Orchestrator waits until all worker pods report ready before dispatching tasks.
  2. Worker pods perform prerequisite checks at startup; failures produce clear, actionable error messages.
  3. Health monitoring reliably detects and reports offline or crashed workers to the orchestrator.
  4. Result data persists for at least 24 hours, preventing loss of long-operation history.
  5. Operation locking mechanism enforces only one orchestrator per operation_id.
  6. Execution mode is set automatically based on environment/context.
  7. Vulnerability exploitation is prioritized according to the configured queue.
  8. Deployment documentation enables a new engineer to set up the system without prior tribal knowledge.
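
Criterion 1 amounts to a bounded wait on worker readiness. The following sketch assumes the orchestrator can ask for the set of currently ready workers through a callable; `wait_for_workers`, its parameters, and the worker names in the usage note are hypothetical and not taken from the Ares codebase.

```python
import time


def wait_for_workers(get_ready, required, timeout, poll_interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until every required worker is ready or the timeout elapses.

    Returns the set of workers still missing; an empty set means all are ready.
    """
    required = set(required)
    deadline = clock() + timeout
    while True:
        missing = required - set(get_ready())
        if not missing:
            return set()
        if clock() >= deadline:
            return missing
        sleep(poll_interval)
```

For example, `wait_for_workers(get_ready, ["recon", "exploit"], timeout=60)` returns an empty set once both workers report ready. Returning the missing names rather than a boolean lets the caller log exactly which workers never came up.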

Additional Notes:

  • Refer to OVERHAUL.md for detailed technical guidance and requirements.
  • Ensure all code changes are covered by unit and integration tests where applicable.
  • Coordinate with security team for review of the vulnerability queue integration.
  • Use clear logging for all new error and status messages.
  • Documentation should include troubleshooting for common deployment issues.

dreadnode-renovate-bot (Bot) added the labels area/docs (Changes made to project documentation), area/pre-commit (Changes made to pre-commit hooks), and area/tests on Jan 14, 2026
l50 changed the title from "```" to "feat: add persistent orchestrator service, CLI client, and remote dev improvements" on Jan 14, 2026
l50 merged commit 5866ab3 into main on Jan 14, 2026
9 of 11 checks passed
l50 deleted the jayson/cap-858-overhaul-ares-k8s-multi-agent-architecture-for-production branch on January 14, 2026 23:48