feat: add persistent orchestrator service, CLI client, and remote dev improvements #42
Merged
l50 merged 1 commit on Jan 14, 2026
…t tooling

**Added:**
- Introduced `src/ares/cli_ops.py`, an async CLI for submitting and tracking operations to the orchestrator service
- Added `src/ares/core/orchestrator_client.py` for programmatic operation submission, status, and wait-for-completion via Redis
- Implemented `src/ares/core/orchestrator_service.py` as a persistent orchestrator pod service for Kubernetes, processing operation requests from Redis
- Created `.taskfiles/remote/Taskfile.yaml` with tasks for syncing code to running K8s pods, hot reload, selective sync, pod status, logs, and exec
- Extended `src/ares/core/task_queue.py` with Redis-based operation locking (acquire, release, extend) to ensure single active orchestrator per operation

**Changed:**
- Refactored `src/ares/core/orchestrator.py` to support orchestrator service workflow:
  - Added operation locking, background lock extension, and required worker wait logic
  - Improved health monitoring and agent startup coordination
  - Ensured lock release and connection cleanup on completion
- Enhanced `src/ares/core/worker.py` with pod validation:
  - Validates prerequisites for nsenter execution (process namespace, tools PID, CAP_SYS_ADMIN)
  - Improved tools container PID detection (via /proc and ps)
- Improved `src/ares/core/remote.py` with robust execution mode auto-detection for SSM, K8s orchestrator, and K8s worker pods
- Updated integration and unit tests to expect task result TTL of 24 hours (86400s) for better debugging and recovery
- Simplified and updated `docs/remote-development.md` to reflect direct `kubectl cp` syncing, new tasks, and troubleshooting steps

**Removed:**
- Deprecated ConfigMap-based sync and patch workflow in documentation in favor of direct pod sync
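The nsenter prerequisites listed for `src/ares/core/worker.py` (tools PID, CAP_SYS_ADMIN) can be illustrated with a minimal sketch. This is not the code from the PR: the function names `has_cap_sys_admin` and `find_pid_by_comm` are hypothetical, and the real worker may detect these conditions differently. The sketch relies only on two Linux facts: the effective capability mask is exposed as `CapEff` in `/proc/<pid>/status`, and CAP_SYS_ADMIN is capability number 21.

```python
from pathlib import Path
from typing import Optional

CAP_SYS_ADMIN_BIT = 21  # capability number of CAP_SYS_ADMIN on Linux


def has_cap_sys_admin(status_text: str) -> bool:
    """Check the CapEff mask from /proc/<pid>/status text for CAP_SYS_ADMIN."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split(":", 1)[1].strip(), 16)
            return bool(mask & (1 << CAP_SYS_ADMIN_BIT))
    return False  # no CapEff line found, so treat the capability as missing


def find_pid_by_comm(proc_root: Path, comm: str) -> Optional[int]:
    """Return the lowest PID under a /proc-style tree whose comm matches.

    Hypothetical helper: the process name that identifies the tools
    container is an assumption supplied by the caller.
    """
    pids = sorted(int(p.name) for p in proc_root.iterdir() if p.name.isdigit())
    for pid in pids:
        try:
            name = (proc_root / str(pid) / "comm").read_text().strip()
        except OSError:
            continue  # process exited or is not readable
        if name == comm:
            return pid
    return None


if __name__ == "__main__":
    # Inspect the current process; meaningful only on Linux.
    status_file = Path("/proc/self/status")
    if status_file.exists():
        print("CAP_SYS_ADMIN:", has_cap_sys_admin(status_file.read_text()))
        print("tools PID:", find_pid_by_comm(Path("/proc"), "sleep"))
```

Seeing another container's processes under `/proc` at all requires the pod to share its process namespace, so a `None` result from the PID lookup is one possible symptom of that prerequisite being unmet.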
CAP-858 Overhaul Ares K8s Multi-Agent Architecture for Production
Description: Objective: Make the Ares Kubernetes multi-agent architecture production-ready by resolving critical stability, reliability, and usability issues, ensuring robust orchestration and deployment.
Key Changes:
Added:
- A persistent orchestrator service (`OrchestratorService`) that listens for operation requests on a Redis queue, manages multi-agent operations, and reports status updates to Redis (`src/ares/core/orchestrator_service.py`)
- An async CLI (`src/ares/cli_ops.py`) using `cyclopts`, enabling users to submit operations, check status, and wait for completion via the orchestrator service
- An orchestrator client for programmatic operation submission, status, and wait for completion through Redis (`src/ares/core/orchestrator_client.py`)
- Remote development tasks in `.taskfiles/remote/Taskfile.yaml` for efficient syncing, hot-reloading, status checks, and log tailing in K8s pods
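The submit, status, and wait-for-completion flow of the client can be sketched as follows. This is an illustration under stated assumptions, not the API of `orchestrator_client.py`: the queue name, key layout, status strings, and function signatures are all hypothetical, and `InMemoryStore` stands in for the Redis connection so the example runs without a server.

```python
import asyncio
import json
import time
import uuid


class InMemoryStore:
    """Stand-in for the small subset of Redis used here (one list, string keys)."""

    def __init__(self):
        self.queue = []
        self.keys = {}

    async def rpush(self, name, value):
        self.queue.append((name, value))

    async def get(self, name):
        return self.keys.get(name)

    async def set(self, name, value):
        self.keys[name] = value


REQUEST_QUEUE = "ares:operations:requests"  # hypothetical queue name


def status_key(operation_id):
    return f"ares:operations:{operation_id}:status"  # hypothetical key layout


async def submit_operation(store, payload):
    """Record a pending status, enqueue the request, and return the new id."""
    operation_id = uuid.uuid4().hex
    request = json.dumps({"operation_id": operation_id, "payload": payload})
    await store.set(status_key(operation_id), "pending")
    await store.rpush(REQUEST_QUEUE, request)
    return operation_id


async def wait_for_completion(store, operation_id, timeout=30.0, poll_interval=0.05):
    """Poll the status key until a terminal status appears or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = await store.get(status_key(operation_id))
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation {operation_id} still {status!r}")
        await asyncio.sleep(poll_interval)


async def demo():
    store = InMemoryStore()
    op_id = await submit_operation(store, {"target": "example"})

    async def fake_service():
        # Plays the role of the orchestrator service finishing the operation.
        await asyncio.sleep(0.01)
        await store.set(status_key(op_id), "completed")

    service = asyncio.create_task(fake_service())
    status = await wait_for_completion(store, op_id, timeout=1.0, poll_interval=0.005)
    await service
    return status


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Polling keeps the sketch simple; a Redis-backed client could instead block on a list or subscribe to a channel, and the PR text does not say which approach is used.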
Changed:
- Refactored the orchestrator to acquire and extend exclusive operation locks, wait for required workers before starting, and properly clean up locks on completion (`src/ares/core/orchestrator.py`)
- Added Redis-based operation locking (`acquire_operation_lock`, `release_operation_lock`, `extend_operation_lock`) and extended result TTL to 24 hours (86400s) for improved reliability (`src/ares/core/task_queue.py`)
- Enhanced the worker with pod validation, including process namespace sharing, tools container detection, and nsenter capability validation for safe remote command execution (`src/ares/core/worker.py`)
- Improved remote execution with environment-aware auto-detection (ssm, k8s, or local) (`src/ares/core/remote.py`)
- Updated `docs/remote-development.md` to cover direct sync and hot-reload methods, and removed legacy ConfigMap instructions
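The exclusive operation lock in the list above can be sketched with a token-based lock and a TTL. The function names come from the PR description, but their signatures, the key layout, and the token scheme are assumptions. `InMemoryStore` is a stand-in that mimics Redis `SET key value NX EX ttl` semantics so the example runs without a server.

```python
import time
import uuid


class InMemoryStore:
    """Stand-in for Redis string keys with expiry; the clock is injectable."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._data[key]  # lazily drop expired keys
            return None
        return entry

    def set_if_absent(self, key, value, ttl):
        """Mimics SET key value NX EX ttl: succeeds only if the key is free."""
        if self._live(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True

    def get(self, key):
        entry = self._live(key)
        return entry[0] if entry else None

    def expire(self, key, ttl):
        entry = self._live(key)
        if entry is None:
            return False
        self._data[key] = (entry[0], self._clock() + ttl)
        return True

    def delete(self, key):
        return self._data.pop(key, None) is not None


def lock_key(operation_id):
    return f"ares:lock:operation:{operation_id}"  # hypothetical key layout


def acquire_operation_lock(store, operation_id, ttl=60):
    """Return an owner token if the lock was free, otherwise None."""
    token = uuid.uuid4().hex
    return token if store.set_if_absent(lock_key(operation_id), token, ttl) else None


def extend_operation_lock(store, operation_id, token, ttl=60):
    """Reset the TTL, but only while the caller still owns the lock."""
    if store.get(lock_key(operation_id)) != token:
        return False
    return store.expire(lock_key(operation_id), ttl)


def release_operation_lock(store, operation_id, token):
    """Delete the lock, but only while the caller still owns it."""
    if store.get(lock_key(operation_id)) != token:
        return False
    return store.delete(lock_key(operation_id))
```

The owner token prevents one orchestrator from releasing or extending a lock that expired and was taken over by another. Against a real Redis server the check-then-modify steps in extend and release would need to run atomically, for example in a server-side script; the sketch skips this because the stand-in is single-threaded.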
Removed:
- ConfigMap-based sync and patch workflow from the remote development documentation
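The environment-aware auto-detection (ssm, k8s, or local) listed under Changed for `src/ares/core/remote.py` might follow a pattern like the sketch below. Only `KUBERNETES_SERVICE_HOST` is a real signal here (Kubernetes sets it inside pods); the `ARES_EXECUTION_MODE` override and the `ARES_SSM_INSTANCE_ID` variable are hypothetical names invented for illustration, and the PR does not show which signals the actual code inspects.

```python
import os
from typing import Mapping, Optional

VALID_MODES = ("ssm", "k8s", "local")


def detect_execution_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick an execution mode from environment variables."""
    env = os.environ if env is None else env

    # Hypothetical explicit override; wins when it names a known mode.
    override = env.get("ARES_EXECUTION_MODE", "").lower()
    if override in VALID_MODES:
        return override

    # Kubernetes injects this variable into every container in a pod.
    if env.get("KUBERNETES_SERVICE_HOST"):
        return "k8s"

    # Hypothetical variable naming an SSM-managed target instance.
    if env.get("ARES_SSM_INSTANCE_ID"):
        return "ssm"

    return "local"


if __name__ == "__main__":
    print(detect_execution_mode())
```

Passing the environment as a mapping keeps the function testable without mutating `os.environ`.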