From 5f7c0f7f9fdc64a28bbca974b506848030153cd9 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 03:46:28 +0530
Subject: [PATCH 01/20] doc: add multi-agent runtime design proposal for LFX
 mentorship

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 884 ++++++++++++++++++++
 1 file changed, 884 insertions(+)
 create mode 100644 docs/design/multi-agent-runtime-proposal.md

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
new file mode 100644
index 00000000..4f186e0e
--- /dev/null
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -0,0 +1,884 @@
+---
+title: Multi-Agent Runtime Design Proposal
+authors:
+  - "@Abhinav-kodes"
+creation-date: 2025-05-12
+---
+
+# Multi-Agent Runtime: Design Proposal
+
+Author: Abhinav Singh
+
+> **Status:** Draft  
+> **Target Version:** AgentCube v0.x  
+> **Relates to:** Issue [#301](https://github.com/volcano-sh/agentcube/issues/301)
+
+---
+
+## Table of Contents
+
+1. [Overview](#overview)
+2. [Motivation](#motivation)
+3. [Goals and Non-Goals](#goals-and-non-goals)
+4. [Use Cases](#use-cases)
+5. [Design](#design)
+   - [Architecture](#architecture)
+   - [CRD Specification](#crd-specification)
+   - [Key Design Decisions](#key-design-decisions)
+   - [Store Layout](#store-layout)
+   - [Core Implementation](#core-implementation-createsandboxgroup)
+   - [Topological Sort and Cycle Detection](#topological-sort-and-cycle-detection)
+6. [API](#api)
+   - [New Endpoints](#new-endpoints)
+   - [Request and Response Types](#request-and-response-types)
+   - [Store Interface Additions](#store-interface-additions)
+   - [CRD Types](#crd-types)
+   - [SandboxInfo Extensions](#sandboxinfo-extensions)
+7. [Controller: MultiAgentRuntimeReconciler](#controller-multiagentruntimereconciler)
+8. [Garbage Collection](#garbage-collection)
+9. [Router Integration](#router-integration)
+10. [SDK Integration](#sdk-integration)
+11. [Backward Compatibility](#backward-compatibility)
+12. [File Change Map](#file-change-map)
+13. [Implementation Plan](#implementation-plan)
+14. [What Stays Unchanged](#what-stays-unchanged)
+15. [Alternatives Considered](#alternatives-considered)
+16. [Future Enhancements](#future-enhancements)
+
+---
+
+## Overview
+
+This document proposes `MultiAgentRuntime`, a new custom resource for the AgentCube project. It introduces a declarative orchestration layer that allows users to define a group of collaborating `AgentRuntime` roles as a single unit with unified lifecycle management.
+
+Each role references an existing `AgentRuntime` CRD by name. The system manages startup ordering, dependency endpoint injection, per-role warm pools, failure handling, and garbage collection for the entire group as a single unit.
+
+The design is intentionally additive: it reuses the existing transactional sandbox creation pipeline (`createSandbox`, `rollbackSandboxCreation`, `WatchSandboxOnce`, and the store interface) without modification. The multi-agent layer sits above these primitives as a pure composition layer.
+
+---
+
+## Motivation
+
+Complex AI workloads increasingly require multiple specialized agents working together. The existing `example/pcap-analyzer/` already demonstrates this pattern: a planner agent coordinates with a code-interpreter agent to analyze network packet captures. Today, achieving this requires users to:
+
+1. Create each agent sandbox independently via separate API calls.
+2. Discover endpoints and wire inter-agent communication by hand.
+3. Manage lifecycle (idle timeout, TTL, cleanup) for each sandbox individually.
+4. Implement custom rollback logic if any sandbox fails to start.
+
+This manual approach is fragile and does not scale. A single client disconnect mid-creation can leave a partially created group with no cleanup path. There is no first-class notion of a "group" in the store, so GC cannot reason about group-level lifecycle. Warm pools are unavailable at the group level. Three-agent or DAG-structured topologies require custom application-level orchestration code.
+
+`MultiAgentRuntime` addresses all of these gaps by promoting the group from an informal convention to a first-class API object with full lifecycle management.
+
+> **Note:** The existing single-agent `AgentRuntime` and `CodeInterpreter` creation flows are not modified by this proposal. `MultiAgentRuntime` is a pure composition layer that calls the same `createSandbox()` pipeline for each role.
+
+---
+
+## Goals and Non-Goals
+
+### Goals
+
+- Provide a single CRD to declare a group of collaborating `AgentRuntime` roles and their relationships.
+- Support atomic creation and rollback: failure of any role undoes all previously created sandboxes.
+- Support topological startup ordering via a `dependencies[]` field with cycle detection.
+- Support per-role warm pools using the existing `SandboxTemplate` + `SandboxWarmPool` + `SandboxClaim` machinery.
+- Provide a `/topology` endpoint so the coordinator can discover worker endpoints at runtime.
+- Support a `BestEffort` startup policy for cases where some workers are optional.
+- Extend GC to clean up group manifests alongside member sandboxes.
+- Add a `MultiAgentRuntimeReconciler` for post-creation self-healing.
+
+### Non-Goals
+
+- Cross-namespace groups. All roles must be in the same namespace as the `MultiAgentRuntime` resource.
+- Cross-cluster groups. All sandboxes are created in the same Kubernetes cluster.
+- Runtime re-configuration of group topology (e.g., adding or removing roles after creation).
+- Built-in inter-agent message passing or shared memory. Agents communicate over cluster-internal networking using injected endpoints; the transport layer is the application's responsibility.
+- Multi-tenancy isolation between groups. Namespace-level RBAC from the existing `AgentRuntime` applies.
+
+---
+
+## Use Cases
+
+1. **Research team with planner + code-interpreter**
+   A research team deploys a planner agent that breaks complex queries into steps and a code-interpreter agent that executes generated code. Today, the `example/pcap-analyzer/` demonstrates this pattern with manual orchestration. With `MultiAgentRuntime`, the team declares both agents as roles with `dependencies: [planner]` on the code-interpreter, and the system handles startup ordering, endpoint injection, and unified cleanup.
+
+2. **Fan-out analysis pipeline**
+   A security team runs a coordinator agent that fans out to three parallel analysis agents (network, filesystem, process). Each analyzer runs independently with no inter-dependencies. The coordinator discovers all worker endpoints via the `/topology` endpoint and dispatches tasks. `MultiAgentRuntime` with `startupPolicy: BestEffort` allows the pipeline to operate in degraded mode if one analyzer fails to start.
+
+3. **DAG-structured data processing**
+   A data engineering team chains agents in a DAG: an ingestion agent feeds a transformation agent, which feeds both a validation agent and a storage agent. Dependencies ensure each stage starts only after its predecessors are ready. Endpoint injection eliminates manual service discovery.
+
+4. **Latency-sensitive warm pool deployment**
+   A production API team needs sub-second group creation for a coordinator + two workers. By setting `warmPoolSize: 3` on each role, pre-warmed sandboxes are claimed via `SandboxClaim` at group creation time, reducing cold-start latency from minutes to near-zero.
+
+---
+
+## Design
+
+### Architecture
+
+```mermaid
+graph TD
+    Client["External Client"]
+    Router["Router (pkg/router)<br/>proxies to coordinator sessionID only"]
+    WM["WorkloadManager (pkg/workloadmanager)<br/>createSandboxGroup()"]
+    S1["Sandbox/Pod<br/>[planner] (coordinator)"]
+    S2["Sandbox/Pod<br/>[researcher]"]
+    S3["Sandbox/Pod<br/>[coder]"]
+    Store["Store (Redis/Valkey)<br/>sandbox:{sessionID} entries<br/>agentgroup:{groupSessionID} manifest"]
+
+    Client -->|external request| Router
+    Router -->|create group| WM
+    WM -->|createSandbox| S1
+    WM -->|createSandbox| S2
+    WM -->|createSandbox| S3
+    WM -->|SaveAgentGroup| Store
+    S2 <-.->|cluster-internal| S1
+    S3 <-.->|cluster-internal| S1
+    S3 <-.->|cluster-internal| S2
+```
+
+The coordinator is the only role exposed externally through the Router. All inter-agent traffic flows over cluster-internal pod IPs, never touching the Router proxy. This keeps inter-agent latency low and does not require any changes to the Router's proxy logic.
+
+### CRD Specification
+
+```yaml
+apiVersion: runtime.agentcube.volcano.sh/v1alpha1
+kind: MultiAgentRuntime
+metadata:
+  name: research-team
+  namespace: default
+spec:
+  # startupPolicy controls failure semantics during group creation.
+  # Atomic (default): any role failure rolls back all previously created sandboxes.
+  # BestEffort: coordinator must succeed; worker failures are recorded in the group manifest.
+  startupPolicy: Atomic
+  roles:
+    - name: planner
+      runtimeRef: planner-agent   # name of an existing AgentRuntime CRD in this namespace
+      isCoordinator: true         # exactly one role must be marked as coordinator
+      warmPoolSize: 2
+    - name: researcher
+      runtimeRef: researcher-agent
+      warmPoolSize: 3
+      dependencies: [planner]     # planner must be ready before researcher is created
+    - name: coder
+      runtimeRef: coder-agent
+      dependencies: [planner, researcher]
+  sessionTimeout: 15m
+  maxSessionDuration: 8h
+```
+
+> **Example mapping:** The `example/pcap-analyzer/` pattern maps directly onto this spec: `planner` references the planner `AgentRuntime`, and the code-interpreter maps to a worker role with `dependencies: [planner]`. The manual orchestration code in `pcap_analyzer.py` is replaced entirely by this declarative CRD.
+
+### Key Design Decisions
+
+#### Reference-based role definitions
+
+Each role's `runtimeRef` points to an existing `AgentRuntime` CRD in the same namespace. `MultiAgentRuntime` does not inline a pod spec, security context, resource requirements, or image. All of these are already defined and validated in the referenced `AgentRuntime`. This means updating an agent's image or resource limits in the `AgentRuntime` automatically applies to every group that references it, with no duplication.
+
+This decision mirrors the pattern used by `CodeInterpreter`, which also separates the "what" (container spec in the CRD) from the lifecycle management concerns.
+
+#### Flat role list with dependency DAG
+
+Roles are declared in a flat `roles[]` list. Each role optionally declares `dependencies[]` referencing other roles by name. This represents an arbitrary directed acyclic graph (DAG): linear pipelines, fan-out/fan-in, swarms (no dependencies), and peer topologies are all expressible without any structural change to the API.
+
+At creation time, `topoSort()` runs Kahn's algorithm over the dependency graph. Roles with no remaining dependencies are created first. Dependency endpoints are injected into each subsequent role before its sandbox is created. If a cycle is detected, the request is rejected immediately with an error message identifying the involved roles.
+
+#### Coordinator designation
+
+Exactly one role must have `isCoordinator: true`. This is enforced at request time; zero or multiple coordinators result in a validation error. The coordinator's session ID is the only one registered with the Router for external traffic. All other roles are cluster-internal.
+
+The coordinator concept is intentionally separate from the orchestrator concept. A coordinator is simply the external entrypoint; it does not need to control other agents. This supports gateway-style topologies where the coordinator routes requests but does not plan or execute.
+
+#### Per-role warm pools
+
+Each role may set `warmPoolSize`. When it is greater than zero, the `MultiAgentRuntimeReconciler` creates a `SandboxTemplate` + `SandboxWarmPool` pair for that role, using `controllerutil.SetControllerReference` to bind them to the `MultiAgentRuntime`. At group creation time, warm roles are provisioned via `SandboxClaim` rather than direct `Sandbox` creation, reducing cold-start latency from approximately 2 minutes per role to near-zero.
+
+This reuses the exact same `SandboxTemplate`/`SandboxWarmPool`/`SandboxClaim` machinery already implemented in `CodeInterpreterReconciler`, with no changes to that machinery.
+
+#### Startup policies
+
+**`Atomic` (default):** All roles must succeed. If any role fails, the deferred rollback function in `createSandboxGroup()` calls `rollbackSandboxCreation()` for every previously created sandbox. The call returns an error. This is the correct default for production workloads where a partial group is worse than no group.
+
+**`BestEffort`:** The coordinator must succeed. Worker failures are recorded in the group manifest with `status: "failed"`, and the call returns successfully with partial topology information. The response signals which roles are unavailable. This is appropriate for workloads where some workers are optional or can be retried asynchronously.
+
+In both policies, coordinator failure always causes full rollback and an error return, regardless of how many workers succeeded.
+
+#### Dependency endpoint injection
+
+Before a dependent role's sandbox is created, the verified pod IPs of its dependencies are injected as environment variables into the pod template. The naming convention is:
+
+```
+AGENTCUBE_DEP_{ROLE_NAME_UPPER}_ENDPOINT = {podIP}:{port}
+```
+
+For a role with `dependencies: [planner]`, the pod receives:
+
+```
+AGENTCUBE_DEP_PLANNER_ENDPOINT = 10.0.0.4:8080
+```
+
+Injection happens in-memory inside `createSandboxGroup()`: the pod template of the `Sandbox` returned by `buildSandboxByAgentRuntime()` is mutated before the sandbox is passed to `createSandbox()`. The referenced `AgentRuntime` CRD object in the informer cache is never written.
+
+### Store Layout
+
+Individual sandbox store entries gain two new fields that associate them with their parent group:
+
+```
+sandbox:{sessionID-planner}    -> SandboxInfo{ ..., GroupSessionID: "grp-xxx", Role: "planner" }
+sandbox:{sessionID-researcher} -> SandboxInfo{ ..., GroupSessionID: "grp-xxx", Role: "researcher" }
+sandbox:{sessionID-coder}      -> SandboxInfo{ ..., GroupSessionID: "grp-xxx", Role: "coder" }
+```
+
+A separate group manifest key stores aggregated role metadata:
+
+```
+agentgroup:{grp-xxx} -> AgentGroupManifest{
+    GroupSessionID: "grp-xxx",
+    CreatedAt: ...,
+    Roles: [
+        { Name: "planner",    SessionID: "...", Endpoint: "10.0.0.4:8080", Status: "ready" },
+        { Name: "researcher", SessionID: "...", Endpoint: "10.0.0.5:8080", Status: "ready" },
+        { Name: "coder",      SessionID: "...", Endpoint: "10.0.0.6:8080", Status: "failed" }
+    ]
+}
+```
+
+> **Backward compatibility:** Standalone sandboxes (those not belonging to any group) have empty `GroupSessionID` and `Role` fields. All existing store queries over `sandbox:` keys are unaffected. The `omitempty` JSON tag ensures these fields are absent from serialized standalone entries, so existing store consumers that unmarshal `SandboxInfo` do not break.
+
+### Core Implementation: `createSandboxGroup()`
+
+```go
+func (s *Server) createSandboxGroup(
+    ctx context.Context,
+    mar *runtimev1alpha1.MultiAgentRuntime,
+    dynamicClient dynamic.Interface,
+) (*types.CreateAgentGroupResponse, error) {
+
+    groupSessionID := "grp-" + uuid.New().String()
+    var created []createdRole
+
+    needGroupRollback := true
+    defer func() {
+        if !needGroupRollback {
+            return
+        }
+        for _, c := range created {
+            // rollbackSandboxCreation is called as-is (no changes to the function).
+            s.rollbackSandboxCreation(dynamicClient, c.sandbox, nil, c.sessionID)
+        }
+    }()
+
+    orderedRoles, err := topoSort(mar.Spec.Roles)
+    if err != nil {
+        return nil, err // descriptive cycle error
+    }
+
+    for _, role := range orderedRoles {
+        // buildSandboxByAgentRuntime is called as-is (no changes to the function).
+        sandbox, sandboxEntry, err := buildSandboxByAgentRuntime(
+            mar.Namespace, role.RuntimeRef, s.informers,
+        )
+        if err != nil {
+            return nil, fmt.Errorf("role %s: build sandbox: %w", role.Name, err)
+        }
+        sandboxEntry.GroupSessionID = groupSessionID
+        sandboxEntry.Role = role.Name
+
+        if len(role.Dependencies) > 0 {
+            injectDependencyEndpoints(&sandbox.Spec.PodTemplate, role.Dependencies, created)
+        }
+
+        resultChan := s.sandboxController.WatchSandboxOnce(ctx, sandbox.Namespace, sandbox.Name)
+        defer s.sandboxController.UnWatchSandbox(sandbox.Namespace, sandbox.Name)
+
+        // createSandbox is called as-is (no changes to the function).
+        resp, err := s.createSandbox(ctx, dynamicClient, sandbox, nil, sandboxEntry, resultChan)
+        if err != nil {
+            if mar.Spec.StartupPolicy == StartupPolicyBestEffort && !role.IsCoordinator {
+                klog.Warningf("group %s: role %s failed (BestEffort policy): %v", groupSessionID, role.Name, err)
+                recordRoleFailure(groupSessionID, role.Name)
+                continue
+            }
+            return nil, fmt.Errorf("role %s: %w", role.Name, err)
+        }
+
+        created = append(created, createdRole{
+            name:      role.Name,
+            resp:      resp,
+            sandbox:   sandbox,
+            sessionID: sandboxEntry.SessionID,
+        })
+    }
+
+    manifest := buildGroupManifest(groupSessionID, created)
+    if err := s.storeClient.SaveAgentGroup(ctx, groupSessionID, manifest); err != nil {
+        return nil, fmt.Errorf("save group manifest: %w", err)
+    }
+
+    needGroupRollback = false
+    return buildGroupResponse(groupSessionID, created), nil
+}
+```
+
+**Key properties:**
+
+- The deferred rollback calls the existing `rollbackSandboxCreation()` function, without modification, for every sandbox in `created`.
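+- Roles are created in topological order. Under `Atomic` policy, a dependency's endpoint is guaranteed to be in `created` before the dependent role's sandbox is created. Under `BestEffort`, a failed worker is skipped and never appended to `created`, so its endpoint is not injected into roles that depend on it.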
+- `buildSandboxByAgentRuntime()`, `createSandbox()`, `WatchSandboxOnce()`, and `rollbackSandboxCreation()` are all called as-is.
+- The `needGroupRollback` flag is only cleared after `SaveAgentGroup` succeeds. A store failure after all sandboxes are created will roll back the Kubernetes resources, maintaining consistency between the cluster state and the store.
+
+The following sequence diagram illustrates the creation flow for a 3-role group under `Atomic` policy:
+
+```mermaid
+sequenceDiagram
+    participant Client as External Client
+    participant Router as Router
+    participant WM as WorkloadManager
+    participant Store as Store (Redis/Valkey)
+    participant K8s as Kubernetes API
+
+    Client->>Router: POST /v1/multi-agent-runtime
+    Router->>WM: Forward (create group)
+    WM->>WM: topoSort(roles) -> [planner, researcher, coder]
+
+    Note over WM, K8s: Role 1: planner (coordinator)
+    WM->>Store: StoreSandbox(placeholder)
+    WM->>K8s: Create Sandbox [planner]
+    K8s-->>WM: Sandbox Ready
+    WM->>Store: UpdateSandbox(ready)
+
+    Note over WM, K8s: Role 2: researcher (depends on planner)
+    WM->>WM: injectDependencyEndpoints(planner IP)
+    WM->>Store: StoreSandbox(placeholder)
+    WM->>K8s: Create Sandbox [researcher]
+    K8s-->>WM: Sandbox Ready
+    WM->>Store: UpdateSandbox(ready)
+
+    Note over WM, K8s: Role 3: coder (depends on planner, researcher)
+    WM->>WM: injectDependencyEndpoints(planner IP, researcher IP)
+    WM->>Store: StoreSandbox(placeholder)
+    WM->>K8s: Create Sandbox [coder]
+    K8s-->>WM: Sandbox Ready
+    WM->>Store: UpdateSandbox(ready)
+
+    WM->>Store: SaveAgentGroup(manifest)
+    WM-->>Router: CreateAgentGroupResponse
+    Router-->>Client: 200 OK + groupSessionId
+```
+
+### Topological Sort and Cycle Detection
+
+```go
+func topoSort(roles []RoleSpec) ([]RoleSpec, error) {
+    inDegree := make(map[string]int)
+    adj      := make(map[string][]string)
+    roleMap  := make(map[string]RoleSpec)
+
+    for _, r := range roles {
+        roleMap[r.Name] = r
+        if _, exists := inDegree[r.Name]; !exists {
+            inDegree[r.Name] = 0
+        }
+        for _, dep := range r.Dependencies {
+            adj[dep] = append(adj[dep], r.Name)
+            inDegree[r.Name]++
+        }
+    }
+
+    var queue []string
+    for name, deg := range inDegree {
+        if deg == 0 {
+            queue = append(queue, name)
+        }
+    }
+
+    var sorted []RoleSpec
+    for len(queue) > 0 {
+        name := queue[0]
+        queue = queue[1:]
+        sorted = append(sorted, roleMap[name])
+        for _, neighbor := range adj[name] {
+            inDegree[neighbor]--
+            if inDegree[neighbor] == 0 {
+                queue = append(queue, neighbor)
+            }
+        }
+    }
+
+    if len(sorted) != len(roles) {
+        // Name the roles left with nonzero in-degree (on a cycle or downstream of one).
+        var cycled []string
+        for name, deg := range inDegree {
+            if deg > 0 {
+                cycled = append(cycled, name)
+            }
+        }
+        sort.Strings(cycled)
+        return nil, fmt.Errorf("dependency cycle detected among roles: %v", cycled)
+    }
+    return sorted, nil
+}
+```
+
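+The algorithm is Kahn's BFS-based topological sort and runs in O(V+E). Kahn's algorithm produces a complete ordering only when the graph is acyclic, so `len(sorted) < len(roles)` signals a cycle. The roles left with nonzero in-degree are either on a cycle or downstream of one, and their names are included in the error message to aid debugging. A dependency that names an undeclared role also leaves its dependent with nonzero in-degree and would be reported as a cycle, so dependency names should be validated before sorting.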
+
+---
+
+## API
+
+### New Endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/v1/multi-agent-runtime` | Create a new agent group. Returns group session ID and coordinator entrypoints. |
+| `DELETE` | `/v1/multi-agent-runtime/sessions/:groupSessionId` | Delete all sandboxes in the group and remove the group manifest from the store. |
+| `GET` | `/v1/multi-agent-runtime/groups/:groupSessionId/topology` | Return the group manifest including all role endpoints and statuses. Intended for use by the coordinator at startup to discover worker endpoints. |
+
+### Request and Response Types
+
+#### Create Group Request
+
+```go
+type CreateAgentGroupRequest struct {
+    Kind      string `json:"kind"`      // "MultiAgentRuntime"
+    Name      string `json:"name"`      // MultiAgentRuntime CRD name
+    Namespace string `json:"namespace"`
+}
+```
+
+#### Create Group Response
+
+```go
+type CreateAgentGroupResponse struct {
+    GroupSessionID string                   `json:"groupSessionId"`
+    Roles          []AgentGroupRoleResponse `json:"roles"`
+}
+
+type AgentGroupRoleResponse struct {
+    Name      string `json:"name"`
+    SessionID string `json:"sessionId"`
+    Endpoint  string `json:"endpoint"`
+    Status    string `json:"status"` // "ready" | "failed"
+}
+```
+
+#### Group Manifest (stored in Redis/Valkey)
+
+```go
+type AgentGroupManifest struct {
+    GroupSessionID string           `json:"groupSessionId"`
+    Roles          []AgentGroupRole `json:"roles"`
+    CreatedAt      time.Time        `json:"createdAt"`
+}
+
+type AgentGroupRole struct {
+    Name      string `json:"name"`
+    SessionID string `json:"sessionId"`
+    Endpoint  string `json:"endpoint"`
+    Status    string `json:"status"` // "ready" | "failed"
+}
+```
+
+### Store Interface Additions
+
+Four new methods are added to the `Store` interface in `pkg/store/interface.go`. All existing methods are unchanged.
+
+```go
+// SaveAgentGroup persists a group manifest keyed by groupSessionID.
+// Key format: agentgroup:{groupSessionID}
+SaveAgentGroup(ctx context.Context, groupSessionID string, manifest *types.AgentGroupManifest) error
+
+// GetAgentGroup retrieves a group manifest by groupSessionID.
+// Returns ErrNotFound if the key does not exist.
+GetAgentGroup(ctx context.Context, groupSessionID string) (*types.AgentGroupManifest, error)
+
+// DeleteAgentGroup removes a group manifest by groupSessionID.
+DeleteAgentGroup(ctx context.Context, groupSessionID string) error
+
+// UpdateAgentGroupRoleStatus updates the status of a specific role within a
+// group manifest via read-modify-write. Used by the reconciler during self-healing.
+UpdateAgentGroupRoleStatus(ctx context.Context, groupSessionID, roleName, status string) error
+```
+
+Both `store_redis.go` and `store_valkey.go` implement these methods using the key prefix `agentgroup:`. The serialization format is JSON, consistent with existing `sandbox:` entries.
+
+> **Note:** `UpdateAgentGroupRoleStatus` performs a non-atomic read-modify-write on the manifest JSON, so concurrent writers could race. Group manifests are updated infrequently (by the reconciler during self-healing and by GC when a member sandbox is deleted), so optimistic concurrency control is deferred in the initial implementation.
+
+### CRD Types
+
+```go
+// MultiAgentRuntime defines a group of collaborating AgentRuntime roles with
+// unified lifecycle management.
+//
+// +genclient
+// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
+// +kubebuilder:object:root=true
+// +kubebuilder:subresource:status
+// +kubebuilder:resource:scope=Namespaced
+// +kubebuilder:printcolumn:name="Ready",type="boolean",JSONPath=".status.ready"
+// +kubebuilder:printcolumn:name="Policy",type="string",JSONPath=".spec.startupPolicy"
+// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
+type MultiAgentRuntime struct {
+    metav1.TypeMeta   `json:",inline"`
+    metav1.ObjectMeta `json:"metadata,omitempty"`
+    Spec              MultiAgentRuntimeSpec   `json:"spec"`
+    Status            MultiAgentRuntimeStatus `json:"status,omitempty"`
+}
+
+type MultiAgentRuntimeSpec struct {
+    // StartupPolicy controls failure behavior during group creation.
+    // +kubebuilder:default="Atomic"
+    // +kubebuilder:validation:Enum=Atomic;BestEffort
+    StartupPolicy StartupPolicyType `json:"startupPolicy,omitempty"`
+
+    // Roles defines the set of agent roles in this group.
+    // At least one role must be present, and exactly one must have IsCoordinator=true.
+    // +kubebuilder:validation:MinItems=1
+    Roles []RoleSpec `json:"roles"`
+
+    // SessionTimeout is the idle timeout applied to all sandboxes in the group.
+    // Defaults to 15m.
+    // +kubebuilder:default="15m"
+    SessionTimeout *metav1.Duration `json:"sessionTimeout,omitempty"`
+
+    // MaxSessionDuration is the absolute TTL for all sandboxes in the group.
+    // Defaults to 8h.
+    // +kubebuilder:default="8h"
+    MaxSessionDuration *metav1.Duration `json:"maxSessionDuration,omitempty"`
+}
+
+type RoleSpec struct {
+    // Name is the unique identifier for this role within the group.
+    // +kubebuilder:validation:MinLength=1
+    Name string `json:"name"`
+
+    // RuntimeRef is the name of an existing AgentRuntime CRD in the same namespace.
+    // +kubebuilder:validation:MinLength=1
+    RuntimeRef string `json:"runtimeRef"`
+
+    // IsCoordinator marks this role as the external entrypoint for the group.
+    // Exactly one role must be marked as coordinator.
+    // +optional
+    IsCoordinator bool `json:"isCoordinator,omitempty"`
+
+    // WarmPoolSize specifies the number of pre-warmed sandboxes for this role.
+    // When set, the reconciler creates a SandboxTemplate + SandboxWarmPool for this role.
+    // +optional
+    // +kubebuilder:validation:Minimum=0
+    WarmPoolSize *int32 `json:"warmPoolSize,omitempty"`
+
+    // Dependencies lists the names of roles that must be ready before this role is created.
+    // Circular dependencies are rejected at request time.
+    // +optional
+    Dependencies []string `json:"dependencies,omitempty"`
+}
+
+type StartupPolicyType string
+
+const (
+    // StartupPolicyAtomic rolls back all created sandboxes if any role fails.
+    StartupPolicyAtomic     StartupPolicyType = "Atomic"
+    // StartupPolicyBestEffort allows worker failures; coordinator failure still rolls back everything.
+    StartupPolicyBestEffort StartupPolicyType = "BestEffort"
+)
+
+type MultiAgentRuntimeStatus struct {
+    // Conditions reflect the current state of the MultiAgentRuntime.
+    // Standard conditions: Ready, Degraded, Failed.
+    Conditions []metav1.Condition `json:"conditions,omitempty"`
+
+    // Ready is true when all required roles are running and healthy.
+    Ready bool `json:"ready,omitempty"`
+}
+```
+
+### SandboxInfo Extensions
+
+Two fields are added to `SandboxInfo` in `pkg/common/types/sandbox.go`. Both fields are empty for standalone sandboxes, preserving full backward compatibility.
+
+```go
+type SandboxInfo struct {
+    // ... existing fields unchanged ...
+
+    // GroupSessionID associates this sandbox with a MultiAgentRuntime group.
+    // Empty string for standalone (non-group) sandboxes.
+    GroupSessionID string `json:"groupSessionId,omitempty"`
+
+    // Role identifies this sandbox's role within its group.
+    // Empty string for standalone sandboxes.
+    Role string `json:"role,omitempty"`
+}
+```
+
+---
+
+## Controller: `MultiAgentRuntimeReconciler`
+
+A new `MultiAgentRuntimeReconciler` in `pkg/workloadmanager/multiagent_controller.go` manages the lifecycle of `MultiAgentRuntime` resources. It is registered with the existing `controller-runtime` manager already wired in `cmd/workload-manager/main.go` alongside `CodeInterpreterReconciler`.
+
+The reconciler uses `GenerationChangedPredicate` to avoid reconcile loops triggered by status-only updates, consistent with `CodeInterpreterReconciler`.
+
+### Warm Pool Management (Phase 2)
+
+For each role with `warmPoolSize > 0`, the reconciler ensures a `SandboxTemplate` and `SandboxWarmPool` exist with the correct spec. Both resources are created with `controllerutil.SetControllerReference` pointing to the `MultiAgentRuntime`, so they are garbage collected when the `MultiAgentRuntime` is deleted. If the `warmPoolSize` changes, the reconciler updates the `SandboxWarmPool` spec in place.
+
+### Self-Healing (Phase 4)
+
+The reconciler watches for `Sandbox` objects whose `GroupSessionID` matches a known group. On pod failure:
+
+- **`Atomic` policy**: the reconciler calls `handleDeleteAgentGroup()` to tear down all remaining sandboxes and delete the group manifest. It sets a `Failed` condition on the `MultiAgentRuntimeStatus`.
+- **`BestEffort` policy**: the reconciler attempts to create a replacement sandbox for the failed role. On success, it marks the role `ready` via `UpdateAgentGroupRoleStatus()` and records the replacement sandbox's endpoint in the group manifest. On repeated failure, it sets a `Degraded` condition.
+
+### Status Conditions
+
+| Condition | Meaning |
+|-----------|---------|
+| `Ready=True` | All roles are running and healthy |
+| `Ready=False, reason=Creating` | Group creation is in progress |
+| `Degraded=True` | One or more workers failed (BestEffort policy only) |
+| `Failed=True` | Group has been torn down due to a critical failure |
+
+---
+
+## Garbage Collection
+
+The existing GC in `pkg/workloadmanager/garbage_collection.go` is extended with group awareness. When the GC deletes a sandbox that has a non-empty `GroupSessionID`:
+
+1. It calls `GetAgentGroup()` to retrieve the group manifest.
+2. It removes the deleted role from the manifest.
+3. If no roles remain in the manifest, it calls `DeleteAgentGroup()` to remove the `agentgroup:` key from the store.
+4. If other roles remain, it calls `SaveAgentGroup()` with the updated manifest.
+
+This ensures that group manifests do not accumulate indefinitely in the store after their member sandboxes expire. The existing idle-timeout and TTL logic for individual sandboxes is not modified. Group membership is an additional cleanup concern layered on top of existing GC, not a replacement.
+
+---
+
+## Router Integration
+
+`pkg/router/session_manager.go` is extended with a `MultiAgentRuntimeKind` case in the endpoint resolution switch:
+
+```go
+case types.MultiAgentRuntimeKind:
+    endpoint = m.workloadMgrAddr + "/v1/multi-agent-runtime"
+```
+
+The Router tracks only the coordinator's session ID for external request routing. Worker endpoints are stored in the group manifest and are internal-only. No changes are required to the Router's proxy logic.
+
+> **Note:** The Router does not need to know that a session belongs to a group. It proxies requests to the coordinator's sandbox exactly as it would proxy requests to any standalone `AgentRuntime` sandbox. The group abstraction is fully transparent to the Router.
+
+---
+
+## SDK Integration
+
+The Python SDK exposes a `MultiAgentRuntimeClient` that wraps the three new HTTP endpoints:
+
+```python
+from agentcube import MultiAgentRuntimeClient
+
+client = MultiAgentRuntimeClient(
+    base_url="https://router.example.com",
+    # auth=...,  # same auth options as existing clients
+)
+
+# Create a group
+group = client.create_group(
+    name="research-team",
+    namespace="default",
+)
+print(f"Group created: {group.group_session_id}")
+print(f"Coordinator endpoint: {group.roles[0].endpoint}")
+
+# Discover worker topology (coordinator calls this at startup)
+topology = client.get_topology(group.group_session_id)
+for role in topology.roles:
+    print(f"  {role.name}: {role.endpoint} ({role.status})")
+
+# Delete the group
+client.delete_group(group.group_session_id)
+```
+
+Token lifecycle, retry logic, and error handling follow the same patterns as the existing `CodeInterpreterClient`.
+
+---
+
+## Backward Compatibility
+
+This feature is fully backward compatible. No existing behavior changes unless the user creates a `MultiAgentRuntime` resource:
+
+| Concern | Impact |
+|---------|--------|
+| Existing `AgentRuntime` creation flow | Unchanged. `createSandbox()` is called as-is. |
+| Existing `CodeInterpreter` creation flow | Unchanged. |
+| Existing store schema | Two new `omitempty` fields (`GroupSessionID`, `Role`) added to `SandboxInfo`. Existing entries deserialize with zero values. No migration required. |
+| Existing GC logic | Unchanged for standalone sandboxes. Group cleanup is additive. |
+| Existing Router proxy | Unchanged. Group awareness is limited to the endpoint switch. |
+| Store key namespace | New `agentgroup:` prefix does not collide with existing `sandbox:` prefix. |
+| API surface | Three new endpoints under `/v1/multi-agent-runtime`. No changes to existing endpoints. |
+
+---
+
+## File Change Map
+
+### New Files
+
+| File | Description |
+|------|-------------|
+| `pkg/apis/runtime/v1alpha1/multiagentruntime_types.go` | CRD types with kubebuilder markers |
+| `pkg/workloadmanager/multiagent_controller.go` | `MultiAgentRuntimeReconciler` |
+| `pkg/workloadmanager/multiagent_controller_test.go` | Reconciler unit tests |
+| `sdk-python/agentcube/multi_agent.py` | `MultiAgentRuntimeClient` for the Python SDK |
+| `sdk-python/examples/multi_agent_usage.py` | End-to-end usage example |
+| `test/e2e/multi_agent_runtime.yaml` | E2E test fixtures |
+| `docs/design/multi-agent-runtime-proposal.md` | This document |
+| `manifests/charts/base/crds/runtime.agentcube.volcano.sh_multiagentruntimes.yaml` | Auto-generated by `make gen-crd` |
+
+### Modified Files
+
+| File | Change |
+|------|--------|
+| `pkg/apis/runtime/v1alpha1/register.go` | Add `MultiAgentRuntimeKind`, `MultiAgentRuntimeListKind`, `MultiAgentRuntimeGroupVersionKind` |
+| `pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go` | Regenerated by `make generate` |
+| `pkg/common/types/types.go` | Add `MultiAgentRuntimeKind` constant |
+| `pkg/common/types/sandbox.go` | Add `GroupSessionID`, `Role` to `SandboxInfo`; add `AgentGroupManifest`, `AgentGroupRole`, group request/response types |
+| `pkg/api/errors.go` | Add `ErrMultiAgentRuntimeNotFound`; add `multiAgentRuntimeResource` in `workloadResource()` switch |
+| `pkg/workloadmanager/informers.go` | Add `MultiAgentRuntimeGVR`; add informer wiring and cache sync |
+| `pkg/workloadmanager/workload_builder.go` | Add `GroupSessionID`, `Role` fields to `sandboxEntry` struct |
+| `pkg/workloadmanager/sandbox_helper.go` | Propagate `GroupSessionID` and `Role` in `buildSandboxPlaceHolder()` and `buildSandboxInfo()` |
+| `pkg/workloadmanager/handlers.go` | Add `handleMultiAgentRuntimeCreate`, `createSandboxGroup`, `handleDeleteAgentGroup`, `handleGetGroupTopology` |
+| `pkg/workloadmanager/handlers_test.go` | Add group creation and rollback test cases |
+| `pkg/workloadmanager/server.go` | Add 3 new routes under `/v1/multi-agent-runtime` |
+| `pkg/workloadmanager/garbage_collection.go` | Group manifest cleanup when last member sandbox is GC'd |
+| `pkg/store/interface.go` | Add `SaveAgentGroup`, `GetAgentGroup`, `DeleteAgentGroup`, `UpdateAgentGroupRoleStatus` |
+| `pkg/store/store_redis.go` | Implement all 4 group methods |
+| `pkg/store/store_redis_test.go` | Group CRUD tests |
+| `pkg/store/store_valkey.go` | Implement all 4 group methods |
+| `pkg/store/store_valkey_test.go` | Group CRUD tests |
+| `pkg/router/session_manager.go` | Add `MultiAgentRuntimeKind` case in endpoint switch |
+| `cmd/workload-manager/main.go` | Phase 1: HTTP routes; Phase 4: reconciler wiring |
+| `sdk-python/agentcube/__init__.py` | Export `MultiAgentRuntimeClient` |
+| `test/e2e/e2e_test.go` | Add `TestMultiAgentRuntimeCreate`, `TestMultiAgentRuntimeRollback` |
+
+---
+
+## Implementation Plan
+
+### Phase 1 - Core Foundation (Weeks 1-4)
+
+Deliverables that satisfy the mentorship's expected outcomes on their own.
+
+- Define `MultiAgentRuntime` CRD types with kubebuilder markers; run `make generate` + `make gen-crd`.
+- Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet).
+- Add `GroupSessionID` + `Role` to `SandboxInfo`; propagate through `buildSandboxPlaceHolder()` + `buildSandboxInfo()`.
+- Implement all 4 store methods in `store_redis.go` + `store_valkey.go` with full unit test coverage.
+- Add `MultiAgentRuntimeKind` to Router endpoint switch.
+- Extend GC to clean up `agentgroup:` manifest keys when last member sandbox is deleted.
+- Unit tests: `createSandboxGroup()` with atomic rollback on partial failure, store CRUD, coordinator validation, cycle detection.
+- E2E test: kind cluster (same setup as existing E2E), create a 3-role group, verify all sandboxes running, delete group, verify cleanup.
+- User guide: YAML example + `kubectl` workflow.
+
+### Phase 2 - Warm Pools Per Role (Weeks 5-6)
+
+- Implement `warmPoolSize` field handling in `MultiAgentRuntimeReconciler`.
+- Reconciler creates `SandboxTemplate` + `SandboxWarmPool` per warm role with owner references.
+- Group creation uses `SandboxClaim` for warm roles, cold `Sandbox` creation for others.
+- Add E2E test comparing cold-start vs warm-start group creation latency.
+
+### Phase 3 - DAG Startup and Topology (Weeks 7-8)
+
+- Implement `dependencies[]` field: `topoSort()` + `injectDependencyEndpoints()`.
+- Add `GET /v1/multi-agent-runtime/groups/:groupSessionId/topology` endpoint.
+- Add `get_topology()` to Python SDK `MultiAgentRuntimeClient`.
+- E2E test: verify dependency endpoint env vars are present in dependent pod environment.
+
+### Phase 4 - StartupPolicy and Self-Healing (Weeks 9-11)
+
+- Implement `BestEffort` startup policy in `createSandboxGroup()`.
+- Implement `MultiAgentRuntimeReconciler` self-healing:
+  - `Atomic`: tear down entire group on worker pod crash.
+  - `BestEffort`: attempt role restart, update group manifest with new endpoint.
+- Add per-role status conditions to `MultiAgentRuntimeStatus`.
+- Wire reconciler into `cmd/workload-manager/main.go`.
+
+### Phase 5 - Observability and Documentation (Week 12)
+
+- Add Prometheus metrics:
+  - `agentcube_group_creation_duration_seconds` (histogram)
+  - `agentcube_group_role_failures_total` (counter, labels: `role`, `policy`)
+  - `agentcube_active_groups` (gauge)
+- Finalize design document, API reference, and troubleshooting guide.
+
+---
+
+## What Stays Unchanged
+
+The following functions and components are called as-is with zero modifications:
+
+| Component | Location | Used By |
+|-----------|----------|---------|
+| `createSandbox()` | `pkg/workloadmanager/handlers.go` | Called per-role inside `createSandboxGroup()` |
+| `rollbackSandboxCreation()` | `pkg/workloadmanager/handlers.go` | Called in deferred rollback |
+| `buildSandboxByAgentRuntime()` | `pkg/workloadmanager/workload_builder.go` | Called per-role to build sandbox spec |
+| `WatchSandboxOnce()` / `UnWatchSandbox()` | `pkg/workloadmanager/sandbox_controller.go` | Called per-role for readiness watching |
+| All `AgentRuntime` + `CodeInterpreter` creation flows | Various | Not touched |
+| All existing store methods | `pkg/store/` | Not touched |
+| GC idle-timeout + TTL logic for standalone sandboxes | `pkg/workloadmanager/garbage_collection.go` | Not touched |
+| Router proxy logic for `AgentRuntime` + `CodeInterpreter` | `pkg/router/` | Not touched |
+
+---
+
+## Alternatives Considered
+
+### Inline pod spec in `MultiAgentRuntime`
+
+An early design embedded a full pod spec within each role definition, similar to how `CodeInterpreter` defines its own container template. This was rejected for three reasons:
+
+1. It duplicates the security context, resource requirements, image configuration, and environment variables already defined and validated in the referenced `AgentRuntime`.
+2. Changes to an agent's configuration require updates in two places (the `AgentRuntime` CRD and every `MultiAgentRuntime` that embeds it), creating a maintenance burden that grows with the number of groups.
+3. Admission validation for pod specs would need to be duplicated in the `MultiAgentRuntime` webhook, diverging from the single source of truth already established by `AgentRuntime`.
+
+The `runtimeRef` approach provides clean separation of concerns: `AgentRuntime` owns the workload definition; `MultiAgentRuntime` owns the topology and lifecycle policy.
+
+### Hardcoded orchestrator/workers split
+
+An alternative used a two-level structure with a single `orchestrator` field and a `workers[]` list, similar to Ray's head-node/worker-node model or Kubeflow's launcher/worker distinction. This was rejected because:
+
+1. It restricts valid topologies to star patterns. DAG pipelines, mesh topologies, and peer-to-peer swarms cannot be expressed.
+2. It conflates "coordinator" (external entrypoint) with "orchestrator" (controls other agents). These are separate concerns: a gateway-style coordinator does not plan or dispatch tasks to workers.
+3. The flat `roles[]` list with optional `dependencies[]` is strictly more expressive. The overhead is a single topological sort (O(V+E)), which is negligible for the group sizes this feature targets (2-10 roles).
+
+### Separate control plane service
+
+A design was considered where a dedicated `multi-agent-controller` deployment managed group lifecycle independently from the workload manager. This was rejected because:
+
+1. It introduces an additional deployment, service, RBAC configuration, and operational surface for cluster administrators.
+2. The workload manager already has the store client, informer cache, dynamic Kubernetes client, and sandbox controller necessary for group management. Replicating these in a separate service introduces duplication and divergence risk.
+3. Inter-service communication between the new controller and the workload manager would add latency, a new failure mode, and the complexity of defining an internal API between the two.
+4. The `MultiAgentRuntimeReconciler` integrates naturally into the existing `controller-runtime` manager already wired in `cmd/workload-manager/main.go`, following the same pattern as `CodeInterpreterReconciler`. No new binary is required.
+
+---
+
+## Future Enhancements
+
+The following items are explicitly out of scope for this proposal but are noted as natural extensions:
+
+### Dynamic role scaling
+
+Allow the `MultiAgentRuntime` spec to be updated after creation to add or remove worker roles. This would require the reconciler to diff the desired state against the group manifest and create/delete sandboxes accordingly. The current design's flat `roles[]` list and group manifest structure are compatible with this extension.
+
+### Cross-namespace groups
+
+Allow roles to reference `AgentRuntime` CRDs in different namespaces. This requires namespace-scoped RBAC checks during group creation and cross-namespace informer watches. The `runtimeRef` field would be extended to `namespace/name` format.
+
+### Group-level metrics and logging
+
+Aggregate per-role Prometheus metrics into group-level dashboards. Correlate logs across all roles in a group using the `GroupSessionID` as a trace identifier. The `GroupSessionID` is already propagated to all sandbox entries in the store, so log correlation is achievable without schema changes.
+
+### Inter-agent communication primitives
+
+Provide optional shared volumes or message queues between roles. This is explicitly a non-goal of the current design (agents communicate via injected endpoints over cluster networking), but it could be layered on top of the group abstraction if demand emerges.
+
+### Integration with AgentCube auth layer
+
+When the authentication proposal (see `docs/design/auth-proposal.md`) is implemented, `MultiAgentRuntime` group creation requests will be subject to the same Keycloak JWT validation and RBAC checks. The `sandbox:invoke` role would be extended to cover group creation, or a new `group:create` role introduced. No changes to the `MultiAgentRuntime` design are needed because the auth middleware sits in front of all workload manager endpoints.
\ No newline at end of file

From 9202c76ea20f582bbb67d1353dfb14db9d597332 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 03:59:19 +0530
Subject: [PATCH 02/20] docs: address feedback on multi-agent design proposal

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 45 ++++++++++++++-------
 1 file changed, 31 insertions(+), 14 deletions(-)
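
The role-name sanitisation and port-resolution rules this patch adds to the proposal can be sketched in Go as below. This is a minimal illustration under stated assumptions, not the AgentCube implementation: the `Port` type and the `depEnvName` and `resolvePort` helpers are hypothetical names, and the choice to check `http` before `default` is an assumption, since the proposal does not rank the two.

```go
package main

import (
	"fmt"
	"strings"
)

// Port is a hypothetical stand-in for one port entry on an AgentRuntime.
type Port struct {
	Name   string
	Number int32
}

// depEnvName upper-cases the role name, replaces every character outside
// [A-Z0-9] with an underscore, and wraps it in the proposed prefix/suffix.
func depEnvName(role string) string {
	clean := strings.Map(func(r rune) rune {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			return r
		}
		return '_'
	}, strings.ToUpper(role))
	return "AGENTCUBE_DEP_" + clean + "_ENDPOINT"
}

// resolvePort prefers a port named "http", then "default" (assumed order),
// and otherwise falls back to the first port, which also covers the
// single-port case.
func resolvePort(ports []Port) (int32, error) {
	if len(ports) == 0 {
		return 0, fmt.Errorf("no ports defined")
	}
	for _, want := range []string{"http", "default"} {
		for _, p := range ports {
			if p.Name == want {
				return p.Number, nil
			}
		}
	}
	return ports[0].Number, nil
}

func main() {
	port, err := resolvePort([]Port{{Name: "metrics", Number: 9090}, {Name: "http", Number: 8080}})
	if err != nil {
		panic(err)
	}
	// Prints AGENTCUBE_DEP_MY_PLANNER_ENDPOINT=10.0.0.4:8080
	fmt.Printf("%s=%s:%d\n", depEnvName("my-planner"), "10.0.0.4", port)
}
```

Keeping both rules as pure functions would let them be unit-tested without a cluster, separately from the pod-template mutation in `injectDependencyEndpoints()`.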

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index 4f186e0e..e23acc0f 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -207,16 +207,22 @@ In both policies, coordinator failure always causes full rollback and an error r
 
 #### Dependency endpoint injection
 
-Before a dependent role's sandbox is created, the verified pod IPs of its dependencies are injected as environment variables into the pod template. The naming convention is:
+Before a dependent role's sandbox is created, the verified pod IPs of its dependencies are injected as environment variables into the pod template. To ensure compatibility with standard shell naming conventions, any hyphens or non-alphanumeric characters in the role name are replaced by underscores. 
+
+The naming convention is:
 
 ```
-AGENTCUBE_DEP_{ROLE_NAME_UPPER}_ENDPOINT = {podIP}:{port}
+AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_ENDPOINT = {podIP}:{port}
 ```
 
-For a role with `dependencies: [planner]`, the pod receives:
+**Port Resolution Rule:**
+* If the dependency's `AgentRuntime` CRD defines a single port, that port is used.
+* If it defines multiple ports, the system first looks for a port named `http` or `default`. If no such port is found, it falls back to the first port in the ports list.
+
+For a role with `dependencies: [my-planner]` (where the planner exposes `8080` as the first port), the dependent pod receives:
 
 ```
-AGENTCUBE_DEP_PLANNER_ENDPOINT = 10.0.0.4:8080
+AGENTCUBE_DEP_MY_PLANNER_ENDPOINT = 10.0.0.4:8080
 ```
 
 Injection happens in-memory inside `createSandboxGroup()` by mutating the pod template before it is passed to `buildSandboxByAgentRuntime()`. The referenced `AgentRuntime` CRD object in the informer cache is never written.
@@ -290,11 +296,12 @@ func (s *Server) createSandboxGroup(
             injectDependencyEndpoints(&sandbox.Spec.PodTemplate, role.Dependencies, created)
         }
 
-        resultChan := s.sandboxController.WatchSandboxOnce(ctx, sandbox.Namespace, sandbox.Name)
-        defer s.sandboxController.UnWatchSandbox(sandbox.Namespace, sandbox.Name)
-
-        // createSandbox is called as-is (no changes to the function).
-        resp, err := s.createSandbox(ctx, dynamicClient, sandbox, nil, sandboxEntry, resultChan)
+        // Watch and create sandbox in a closure to prevent watcher resource accumulation from defer in a loop
+        resp, err := func() (*types.CreateAgentResponse, error) {
+            resultChan := s.sandboxController.WatchSandboxOnce(ctx, sandbox.Namespace, sandbox.Name)
+            defer s.sandboxController.UnWatchSandbox(sandbox.Namespace, sandbox.Name)
+            return s.createSandbox(ctx, dynamicClient, sandbox, nil, sandboxEntry, resultChan)
+        }()
         if err != nil {
             if mar.Spec.StartupPolicy == StartupPolicyBestEffort && !role.IsCoordinator {
                 klog.Warningf("group %s: role %s failed (BestEffort policy): %v", groupSessionID, role.Name, err)
@@ -312,7 +319,7 @@ func (s *Server) createSandboxGroup(
         })
     }
 
-    manifest := buildGroupManifest(groupSessionID, created)
+    manifest := buildGroupManifest(groupSessionID, mar.Spec.Roles, created)
     if err := s.storeClient.SaveAgentGroup(ctx, groupSessionID, manifest); err != nil {
         return nil, fmt.Errorf("save group manifest: %w", err)
     }
@@ -408,6 +415,14 @@ func topoSort(roles []RoleSpec) ([]RoleSpec, error) {
     }
 
     if len(sorted) != len(roles) {
+        // Check for missing dependencies first to provide a better error message.
+        for _, r := range roles {
+            for _, dep := range r.Dependencies {
+                if _, exists := roleMap[dep]; !exists {
+                    return nil, fmt.Errorf("role %s depends on non-existent role %s", r.Name, dep)
+                }
+            }
+        }
         // Identify and name the roles involved in the cycle for the error message.
         var cycled []string
         for name, deg := range inDegree {
@@ -497,14 +512,16 @@ GetAgentGroup(ctx context.Context, groupSessionID string) (*types.AgentGroupMani
 // DeleteAgentGroup removes a group manifest by groupSessionID.
 DeleteAgentGroup(ctx context.Context, groupSessionID string) error
 
-// UpdateAgentGroupRoleStatus atomically updates the status of a specific role
+// UpdateAgentGroupRoleStatus atomically updates the status and endpoint of a specific role
 // within a group manifest. Used by the reconciler during self-healing.
-UpdateAgentGroupRoleStatus(ctx context.Context, groupSessionID, roleName, status string) error
+UpdateAgentGroupRoleStatus(ctx context.Context, groupSessionID, roleName, status, endpoint string) error
 ```
 
-Both `store_redis.go` and `store_valkey.go` implement these methods using the key prefix `agentgroup:`. The serialization format is JSON, consistent with existing `sandbox:` entries.
+Both `store_redis.go` and `store_valkey.go` implement these methods using the key prefix `agentgroup:`. 
 
-> **Note:** `UpdateAgentGroupRoleStatus` performs a read-modify-write on the manifest JSON. Under high concurrency this could race, but group manifests are updated infrequently (only during self-healing) and only by the reconciler, so optimistic concurrency control is not required in the initial implementation.
+> **Store Implementation Note:** To prevent race conditions in a distributed environment during concurrent updates, the store backends (Redis/Valkey) implement group manifests using a **Redis Hash** (`HSET agentgroup:{groupSessionID}`) instead of a raw JSON string.
+> 
+> The hash fields map directly to roles and their metadata (e.g., `HSET agentgroup:{groupSessionID} role:{roleName} <json>`), allowing atomic field-level updates without rewriting the full manifest JSON, avoiding read-modify-write races.
 
 ### CRD Types
 

From 8b3d04f828821308375158d1d0467e6dcfa387bd Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 04:06:35 +0530
Subject: [PATCH 03/20] docs: address additional feedback on multi-agent design
 proposal

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)
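
The naming-collision validation introduced in this patch could look roughly like the sketch below. It assumes the sanitisation rule from the proposal (upper-case, non-alphanumerics replaced by underscores). The helper names `depEnvName` and `validateRoleEnvNames` are hypothetical, and mapping the returned error to a `400 Bad Request` is left to the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// depEnvName applies the proposed sanitisation rule to a role name.
func depEnvName(role string) string {
	clean := strings.Map(func(r rune) rune {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			return r
		}
		return '_'
	}, strings.ToUpper(role))
	return "AGENTCUBE_DEP_" + clean + "_ENDPOINT"
}

// validateRoleEnvNames returns an error for the first pair of roles whose
// sanitised environment variable keys are identical.
func validateRoleEnvNames(roles []string) error {
	seen := make(map[string]string, len(roles))
	for _, role := range roles {
		key := depEnvName(role)
		if prev, ok := seen[key]; ok {
			return fmt.Errorf("roles %q and %q both map to %s", prev, role, key)
		}
		seen[key] = role
	}
	return nil
}

func main() {
	// "my-agent" and "my.agent" sanitise to the same key, so this reports a collision.
	if err := validateRoleEnvNames([]string{"planner", "my-agent", "my.agent"}); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

A single pass over a map keyed by the sanitised name is enough here, because a collision exists exactly when two role names produce the same key.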

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index e23acc0f..c813d4a7 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -219,7 +219,13 @@ AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_ENDPOINT = {podIP}:{port}
 * If the dependency's `AgentRuntime` CRD defines a single port, that port is used.
 * If it defines multiple ports, the system first looks for a port named `http` or `default`. If no such port is found, it falls back to the first port in the ports list.
 
-For a role with `dependencies: [my-planner]` (where the planner exposes `8080` as the first port), the dependent pod receives:
+**Validation against Naming Collisions:**
+* Because multiple role names could map to the same sanitized environment variable (e.g., `my-agent` and `my.agent` both sanitize to `AGENTCUBE_DEP_MY_AGENT_ENDPOINT`), the API server validates the group configuration at request admission time. If any two roles within the group result in the same sanitized environment variable key, the request is rejected with a `400 Bad Request` validation error.
+
+**Injection Scope:**
+* The dependency endpoints are injected into the `Env` list of **all containers** (including primary, sidecar, and init-containers) defined in the pod spec. This ensures that any multi-container runtime configuration can reliably resolve the endpoints.
+
+For a role with `dependencies: [my-planner]` (where the planner exposes `8080` as the first port), the dependent pod's containers receive:
 
 ```
 AGENTCUBE_DEP_MY_PLANNER_ENDPOINT = 10.0.0.4:8080
@@ -305,7 +311,6 @@ func (s *Server) createSandboxGroup(
         if err != nil {
             if mar.Spec.StartupPolicy == StartupPolicyBestEffort && !role.IsCoordinator {
                 klog.Warningf("group %s: role %s failed (BestEffort policy): %v", groupSessionID, role.Name, err)
-                recordRoleFailure(groupSessionID, role.Name)
                 continue
             }
             return nil, fmt.Errorf("role %s: %w", role.Name, err)
@@ -648,6 +653,12 @@ The reconciler watches for `Sandbox` objects whose `GroupSessionID` matches a kn
 - **`Atomic` policy**: the reconciler calls `handleDeleteAgentGroup()` to tear down all remaining sandboxes and delete the group manifest. It sets a `Failed` condition on the `MultiAgentRuntimeStatus`.
 - **`BestEffort` policy**: the reconciler attempts to create a replacement sandbox for the failed role. On success, it calls `UpdateAgentGroupRoleStatus()` with the new endpoint. On repeated failure, it sets a `Degraded` condition.
 
+> [!WARNING]
+> **Stale Environment Variables in BestEffort Groups:** 
+> When a failed worker pod is replaced under the `BestEffort` policy, the new pod receives a new IP address. Because environment variables are immutable once a pod is running, already active dependent pods (such as the coordinator) will retain the stale endpoint in their environment variables.
+> 
+> To prevent communication failures, agents deployed in `BestEffort` groups must not rely solely on injected environment variables. Instead, they should utilize the `/topology` endpoint (`GET /v1/multi-agent-runtime/groups/:groupSessionId/topology`) for dynamic service discovery to retrieve current worker endpoints.
+
 ### Status Conditions
 
 | Condition | Meaning |

From 864d9e135f995004311ccf7c654588af59d16c20 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 04:33:49 +0530
Subject: [PATCH 04/20] docs: address second batch of review comments on
 multi-agent proposal

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 38 ++++++++++++++++++---
 1 file changed, 33 insertions(+), 5 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index c813d4a7..b69e2466 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -216,8 +216,9 @@ AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_ENDPOINT = {podIP}:{port}
 ```
 
 **Port Resolution Rule:**
-* If the dependency's `AgentRuntime` CRD defines a single port, that port is used.
-* If it defines multiple ports, the system first looks for a port named `http` or `default`. If no such port is found, it falls back to the first port in the ports list.
+* If `targetPort` is explicitly defined in the dependency's `RoleSpec` (either as a port name or number), that port is used.
+* If `targetPort` is not specified, and the dependency's `AgentRuntime` CRD defines a single port, that port is used.
+* If `targetPort` is not specified, and it defines multiple ports, the system first looks for a port named `http` or `default`. If no such port is found, it falls back to the first port in the ports list.
 
 **Validation against Naming Collisions:**
 * Because multiple role names could map to the same sanitized environment variable (e.g., `my-agent` and `my.agent` both sanitize to `AGENTCUBE_DEP_MY_AGENT_ENDPOINT`), the API server validates the group configuration at request admission time. If any two roles within the group result in the same sanitized environment variable key, the request is rejected with a `400 Bad Request` validation error.
@@ -298,12 +299,20 @@ func (s *Server) createSandboxGroup(
         sandboxEntry.GroupSessionID = groupSessionID
         sandboxEntry.Role = role.Name
 
+        // Apply group-level SessionTimeout and MaxSessionDuration overrides
+        if mar.Spec.SessionTimeout != nil {
+            sandboxEntry.SessionTimeout = mar.Spec.SessionTimeout
+        }
+        if mar.Spec.MaxSessionDuration != nil {
+            sandbox.Spec.MaxSessionDuration = mar.Spec.MaxSessionDuration
+        }
+
         if len(role.Dependencies) > 0 {
             injectDependencyEndpoints(&sandbox.Spec.PodTemplate, role.Dependencies, created)
         }
 
         // Watch and create sandbox in a closure to prevent watcher resource accumulation from defer in a loop
-        resp, err := func() (*types.CreateAgentResponse, error) {
+        resp, err := func() (*types.CreateSandboxResponse, error) {
             resultChan := s.sandboxController.WatchSandboxOnce(ctx, sandbox.Namespace, sandbox.Name)
             defer s.sandboxController.UnWatchSandbox(sandbox.Namespace, sandbox.Name)
             return s.createSandbox(ctx, dynamicClient, sandbox, nil, sandboxEntry, resultChan)
@@ -336,6 +345,7 @@ func (s *Server) createSandboxGroup(
 
 **Key properties:**
 
+- **Parallel Sandbox Creation:** To prevent HTTP gateway or client timeouts when launching large groups, roles that do not share mutual dependencies (i.e., reside at the same level of the dependency DAG) are created in parallel. Sandbox creation proceeds in "dependency waves": all sandboxes within a wave are launched concurrently, and the server waits for all to be ready before proceeding to the next dependent wave.
 - The deferred rollback calls the existing `rollbackSandboxCreation()` function, without modification, for every sandbox in `created`.
 - Roles are created in topological order. A dependency's endpoint is guaranteed to be in `created` before the dependent role's sandbox is built.
 - `buildSandboxByAgentRuntime()`, `createSandbox()`, `WatchSandboxOnce()`, and `rollbackSandboxCreation()` are all called as-is.
@@ -595,6 +605,11 @@ type RoleSpec struct {
     // Circular dependencies are rejected at request time.
     // +optional
     Dependencies []string `json:"dependencies,omitempty"`
+
+    // TargetPort specifies the name or number of the port in the referenced AgentRuntime 
+    // to be used by dependent roles. If empty, the default Port Resolution Rule applies.
+    // +optional
+    TargetPort string `json:"targetPort,omitempty"`
 }
 
 type StartupPolicyType string
@@ -672,14 +687,27 @@ The reconciler watches for `Sandbox` objects whose `GroupSessionID` matches a kn
 
 ## Garbage Collection
 
-The existing GC in `pkg/workloadmanager/garbage_collection.go` is extended with group awareness. When the GC deletes a sandbox that has a non-empty `GroupSessionID`:
+The existing GC in `pkg/workloadmanager/garbage_collection.go` is extended with group awareness.
+
+### Group-Aware Idle Timeout
+
+Because the Router only proxies external traffic directly to the coordinator, only the coordinator's `LastActivityAt` timestamp in the store is updated during active sessions. Internal worker sandboxes that receive no direct external traffic would otherwise retain static `LastActivityAt` values, causing the GC to prematurely delete them while the coordinator is still active.
+
+To prevent this, the GC evaluates idle timeouts group-wide:
+1. When checking if a sandbox is idle, if its `GroupSessionID` is non-empty, the GC retrieves the group's coordinator sandbox from the store.
+2. The idle duration for **all members of the group** is calculated based on the coordinator's `LastActivityAt` timestamp (or the maximum `LastActivityAt` among all group member sandboxes if the coordinator's timestamp is unavailable).
+3. Individual sandboxes in a group are only deleted for inactivity if the group as a whole is determined to be idle.
+
+### Group Metadata Cleanup
+
+When the GC deletes a sandbox that has a non-empty `GroupSessionID`:
 
 1. It calls `GetAgentGroup()` to retrieve the group manifest.
 2. It removes the deleted role from the manifest.
 3. If no roles remain in the manifest, it calls `DeleteAgentGroup()` to remove the `agentgroup:` key from the store.
 4. If other roles remain, it calls `SaveAgentGroup()` with the updated manifest.
 
-This ensures that group manifests do not accumulate indefinitely in the store after their member sandboxes expire. The existing idle-timeout and TTL logic for individual sandboxes is not modified. Group membership is an additional cleanup concern layered on top of existing GC, not a replacement.
+This ensures that group manifests do not accumulate indefinitely in the store after their member sandboxes expire. Group membership is an additional cleanup concern layered on top of existing GC, not a replacement.
 
 ---
 

From 8e6e65b5060ca957db29dedf9dc9cb5289d3cc98 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 04:37:59 +0530
Subject: [PATCH 05/20] docs: support wave-based parallel startup and
 multi-port injection

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 166 ++++++++++++--------
 1 file changed, 100 insertions(+), 66 deletions(-)
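
The wave-based startup order this patch describes can be derived with a level-by-level variant of Kahn's algorithm. The sketch below is illustrative only: the `role` type and `dependencyWaves` function are hypothetical names rather than the proposal's `topoSort()`, and each wave is sorted alphabetically purely to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// role is a hypothetical, trimmed-down version of RoleSpec.
type role struct {
	Name         string
	Dependencies []string
}

// dependencyWaves groups roles so that every role appears in a later wave
// than all of its dependencies. Roles inside one wave can start in parallel.
func dependencyWaves(roles []role) ([][]string, error) {
	inDegree := make(map[string]int, len(roles))
	dependents := make(map[string][]string)
	for _, r := range roles {
		inDegree[r.Name] = 0
	}
	for _, r := range roles {
		for _, dep := range r.Dependencies {
			if _, ok := inDegree[dep]; !ok {
				return nil, fmt.Errorf("role %s depends on non-existent role %s", r.Name, dep)
			}
			inDegree[r.Name]++
			dependents[dep] = append(dependents[dep], r.Name)
		}
	}

	// The first wave holds every role without dependencies.
	var current []string
	for name, deg := range inDegree {
		if deg == 0 {
			current = append(current, name)
		}
	}

	var waves [][]string
	placed := 0
	for len(current) > 0 {
		sort.Strings(current)
		waves = append(waves, current)
		placed += len(current)

		// Releasing a finished wave may unblock roles for the next one.
		var next []string
		for _, name := range current {
			for _, d := range dependents[name] {
				inDegree[d]--
				if inDegree[d] == 0 {
					next = append(next, d)
				}
			}
		}
		current = next
	}
	if placed != len(inDegree) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return waves, nil
}

func main() {
	waves, err := dependencyWaves([]role{
		{Name: "planner"},
		{Name: "searcher", Dependencies: []string{"planner"}},
		{Name: "coder", Dependencies: []string{"planner"}},
		{Name: "reviewer", Dependencies: []string{"searcher", "coder"}},
	})
	if err != nil {
		panic(err)
	}
	// Prints [[planner] [coder searcher] [reviewer]]
	fmt.Println(waves)
}
```

A caller could then launch each wave concurrently and wait for all of its sandboxes to report ready before starting the next wave, as the "dependency waves" bullet describes.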

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index b69e2466..bfbc341b 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -207,29 +207,37 @@ In both policies, coordinator failure always causes full rollback and an error r
 
 #### Dependency endpoint injection
 
-Before a dependent role's sandbox is created, the verified pod IPs of its dependencies are injected as environment variables into the pod template. To ensure compatibility with standard shell naming conventions, any hyphens or non-alphanumeric characters in the role name are replaced by underscores. 
+Before a dependent role's sandbox is created, the verified pod IPs and ports of its dependencies are injected as environment variables into the pod template. To ensure compatibility with standard shell naming conventions, any hyphens or non-alphanumeric characters in the role name and port names are replaced by underscores. 
 
-The naming convention is:
+The naming conventions are:
 
-```
-AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_ENDPOINT = {podIP}:{port}
-```
+1. **Default Endpoint (Primary service port):**
+   ```
+   AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_ENDPOINT = {podIP}:{port}
+   ```
+   * **Port Resolution Rule:**
+     * If `targetPort` is explicitly defined in the dependency's `RoleSpec` (either as a port name or number), that port is used as the default.
+     * If `targetPort` is not specified, and the dependency's `AgentRuntime` CRD defines a single port, that port is used.
+     * If `targetPort` is not specified, and it defines multiple ports, the system first looks for a port named `http` or `default`. If no such port is found, it falls back to the first port in the ports list.
 
-**Port Resolution Rule:**
-* If `targetPort` is explicitly defined in the dependency's `RoleSpec` (either as a port name or number), that port is used.
-* If `targetPort` is not specified, and the dependency's `AgentRuntime` CRD defines a single port, that port is used.
-* If `targetPort` is not specified, and it defines multiple ports, the system first looks for a port named `http` or `default`. If no such port is found, it falls back to the first port in the ports list.
+2. **Named Port Endpoints (For multi-service agents exposing multiple ports):**
+   If the dependency's `AgentRuntime` CRD defines named ports, an endpoint environment variable is additionally injected for each named port:
+   ```
+   AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_PORT_{PORT_NAME_SANITISED_UPPER}_ENDPOINT = {podIP}:{portNumber}
+   ```
 
 **Validation against Naming Collisions:**
-* Because multiple role names could map to the same sanitized environment variable (e.g., `my-agent` and `my.agent` both sanitize to `AGENTCUBE_DEP_MY_AGENT_ENDPOINT`), the API server validates the group configuration at request admission time. If any two roles within the group result in the same sanitized environment variable key, the request is rejected with a `400 Bad Request` validation error.
+* Because multiple role names or port names could map to the same sanitized environment variable (e.g., `my-agent` and `my.agent` both sanitize to `AGENTCUBE_DEP_MY_AGENT_ENDPOINT`), the API server validates the group configuration at request admission time. If any two roles or named ports within the group result in the same sanitized environment variable key, the request is rejected with a `400 Bad Request` validation error.
 
 **Injection Scope:**
 * The dependency endpoints are injected into the `Env` list of **all containers** (including primary, sidecar, and init-containers) defined in the pod spec. This ensures that any multi-container runtime configuration can reliably resolve the endpoints.
 
-For a role with `dependencies: [my-planner]` (where the planner exposes `8080` as the first port), the dependent pod's containers receive:
+For a role with `dependencies: [my-planner]` (where the planner exposes a port named `api` at `8080` and `metrics` at `9090`), the dependent pod's containers receive:
 
 ```
 AGENTCUBE_DEP_MY_PLANNER_ENDPOINT = 10.0.0.4:8080
+AGENTCUBE_DEP_MY_PLANNER_PORT_API_ENDPOINT = 10.0.0.4:8080
+AGENTCUBE_DEP_MY_PLANNER_PORT_METRICS_ENDPOINT = 10.0.0.4:9090
 ```
 
 Injection happens in-memory inside `createSandboxGroup()` by mutating the pod template before it is passed to `buildSandboxByAgentRuntime()`. The referenced `AgentRuntime` CRD object in the informer cache is never written.
@@ -283,54 +291,72 @@ func (s *Server) createSandboxGroup(
         }
     }()
 
-    orderedRoles, err := topoSort(mar.Spec.Roles)
+    waves, err := topoSort(mar.Spec.Roles)
     if err != nil {
-        return nil, err // descriptive cycle error
+        return nil, err // descriptive cycle or missing dependency error
     }
 
-    for _, role := range orderedRoles {
-        // buildSandboxByAgentRuntime is called as-is (no changes to the function).
-        sandbox, sandboxEntry, err := buildSandboxByAgentRuntime(
-            mar.Namespace, role.RuntimeRef, s.informers,
-        )
-        if err != nil {
-            return nil, fmt.Errorf("role %s: build sandbox: %w", role.Name, err)
-        }
-        sandboxEntry.GroupSessionID = groupSessionID
-        sandboxEntry.Role = role.Name
+    var createdMutex sync.Mutex
 
-        // Apply group-level SessionTimeout and MaxSessionDuration overrides
-        if mar.Spec.SessionTimeout != nil {
-            sandboxEntry.SessionTimeout = mar.Spec.SessionTimeout
-        }
-        if mar.Spec.MaxSessionDuration != nil {
-            sandbox.Spec.MaxSessionDuration = mar.Spec.MaxSessionDuration
-        }
+    for _, wave := range waves {
+        g, waveCtx := errgroup.WithContext(ctx)
 
-        if len(role.Dependencies) > 0 {
-            injectDependencyEndpoints(&sandbox.Spec.PodTemplate, role.Dependencies, created)
-        }
+        for _, r := range wave {
+            role := r // capture loop variable
+            g.Go(func() error {
+                // buildSandboxByAgentRuntime is called as-is (no changes to the function).
+                sandbox, sandboxEntry, err := buildSandboxByAgentRuntime(
+                    mar.Namespace, role.RuntimeRef, s.informers,
+                )
+                if err != nil {
+                    return fmt.Errorf("role %s: build sandbox: %w", role.Name, err)
+                }
+                sandboxEntry.GroupSessionID = groupSessionID
+                sandboxEntry.Role = role.Name
 
-        // Watch and create sandbox in a closure to prevent watcher resource accumulation from defer in a loop
-        resp, err := func() (*types.CreateSandboxResponse, error) {
-            resultChan := s.sandboxController.WatchSandboxOnce(ctx, sandbox.Namespace, sandbox.Name)
-            defer s.sandboxController.UnWatchSandbox(sandbox.Namespace, sandbox.Name)
-            return s.createSandbox(ctx, dynamicClient, sandbox, nil, sandboxEntry, resultChan)
-        }()
-        if err != nil {
-            if mar.Spec.StartupPolicy == StartupPolicyBestEffort && !role.IsCoordinator {
-                klog.Warningf("group %s: role %s failed (BestEffort policy): %v", groupSessionID, role.Name, err)
-                continue
-            }
-            return nil, fmt.Errorf("role %s: %w", role.Name, err)
+                // Apply group-level SessionTimeout and MaxSessionDuration overrides
+                if mar.Spec.SessionTimeout != nil {
+                    sandboxEntry.SessionTimeout = mar.Spec.SessionTimeout
+                }
+                if mar.Spec.MaxSessionDuration != nil {
+                    sandbox.Spec.MaxSessionDuration = mar.Spec.MaxSessionDuration
+                }
+
+                if len(role.Dependencies) > 0 {
+                    createdMutex.Lock()
+                    injectDependencyEndpoints(&sandbox.Spec.PodTemplate, role.Dependencies, created)
+                    createdMutex.Unlock()
+                }
+
+                // Watch and create the sandbox inside a closure so the deferred UnWatchSandbox runs as soon as creation returns, not when the goroutine exits.
+                resp, err := func() (*types.CreateSandboxResponse, error) {
+                    resultChan := s.sandboxController.WatchSandboxOnce(waveCtx, sandbox.Namespace, sandbox.Name)
+                    defer s.sandboxController.UnWatchSandbox(sandbox.Namespace, sandbox.Name)
+                    return s.createSandbox(waveCtx, dynamicClient, sandbox, nil, sandboxEntry, resultChan)
+                }()
+                if err != nil {
+                    if mar.Spec.StartupPolicy == StartupPolicyBestEffort && !role.IsCoordinator {
+                        klog.Warningf("group %s: role %s failed (BestEffort policy): %v", groupSessionID, role.Name, err)
+                        return nil // skip worker failure under BestEffort
+                    }
+                    return fmt.Errorf("role %s: %w", role.Name, err)
+                }
+
+                createdMutex.Lock()
+                created = append(created, createdRole{
+                    name:      role.Name,
+                    resp:      resp,
+                    sandbox:   sandbox,
+                    sessionID: sandboxEntry.SessionID,
+                })
+                createdMutex.Unlock()
+                return nil
+            })
         }
 
-        created = append(created, createdRole{
-            name:      role.Name,
-            resp:      resp,
-            sandbox:   sandbox,
-            sessionID: sandboxEntry.SessionID,
-        })
+        if err := g.Wait(); err != nil {
+            return nil, err
+        }
     }
 
     manifest := buildGroupManifest(groupSessionID, mar.Spec.Roles, created)
@@ -393,7 +419,7 @@ sequenceDiagram
 ### Topological Sort and Cycle Detection
 
 ```go
-func topoSort(roles []RoleSpec) ([]RoleSpec, error) {
+func topoSort(roles []RoleSpec) ([][]RoleSpec, error) {
     inDegree := make(map[string]int)
     adj      := make(map[string][]string)
     roleMap  := make(map[string]RoleSpec)
@@ -409,27 +435,35 @@ func topoSort(roles []RoleSpec) ([]RoleSpec, error) {
         }
     }
 
-    var queue []string
+    var currentQueue []string
     for name, deg := range inDegree {
         if deg == 0 {
-            queue = append(queue, name)
+            currentQueue = append(currentQueue, name)
         }
     }
 
-    var sorted []RoleSpec
-    for len(queue) > 0 {
-        name := queue[0]
-        queue = queue[1:]
-        sorted = append(sorted, roleMap[name])
-        for _, neighbor := range adj[name] {
-            inDegree[neighbor]--
-            if inDegree[neighbor] == 0 {
-                queue = append(queue, neighbor)
+    var waves [][]RoleSpec
+    var totalSorted int
+
+    for len(currentQueue) > 0 {
+        var nextQueue []string
+        var wave []RoleSpec
+
+        for _, name := range currentQueue {
+            wave = append(wave, roleMap[name])
+            totalSorted++
+            for _, neighbor := range adj[name] {
+                inDegree[neighbor]--
+                if inDegree[neighbor] == 0 {
+                    nextQueue = append(nextQueue, neighbor)
+                }
             }
         }
+        waves = append(waves, wave)
+        currentQueue = nextQueue
     }
 
-    if len(sorted) != len(roles) {
+    if totalSorted != len(roles) {
         // Check for missing dependencies first to provide a better error message.
         for _, r := range roles {
             for _, dep := range r.Dependencies {
@@ -448,11 +482,11 @@ func topoSort(roles []RoleSpec) ([]RoleSpec, error) {
         sort.Strings(cycled)
         return nil, fmt.Errorf("dependency cycle detected among roles: %v", cycled)
     }
-    return sorted, nil
+    return waves, nil
 }
 ```
 
-The algorithm is Kahn's BFS-based topological sort, O(V+E). Cycle detection is derived from the invariant that Kahn's algorithm only produces a complete ordering when no cycle exists. If `len(sorted) < len(roles)`, the roles with remaining in-degree are in a cycle. Their names are included in the error message to aid debugging.
+The algorithm is Kahn's BFS-based topological sort, with roles grouped into level-order waves, and runs in O(V+E). Cycle detection follows from the invariant that Kahn's algorithm produces a complete ordering only when no cycle exists. If `totalSorted < len(roles)`, the roles with remaining in-degree are in a cycle, depend on one, or have missing dependencies. Their names are included in the error message to aid debugging.
 
 ---
 

From adae045e1aabeba2811acfcf34bb70fa515d6004 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 04:47:14 +0530
Subject: [PATCH 06/20] docs: resolve timeout fields mapping, BestEffort status
 reporting, and use Headless Services for DNS stability

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 59 +++++++++++++++------
 1 file changed, 44 insertions(+), 15 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index bfbc341b..ec813545 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -207,13 +207,17 @@ In both policies, coordinator failure always causes full rollback and an error r
 
 #### Dependency endpoint injection
 
-Before a dependent role's sandbox is created, the verified pod IPs and ports of its dependencies are injected as environment variables into the pod template. To ensure compatibility with standard shell naming conventions, any hyphens or non-alphanumeric characters in the role name and port names are replaced by underscores. 
+Before a dependent role's sandbox is created, stable DNS-based endpoints of its dependencies are injected as environment variables into the pod template. To ensure stable communication under the `BestEffort` policy where failed worker pods are replaced and get new IP addresses, the Workload Manager automatically creates a Headless Kubernetes Service for each role in the group. 
 
-The naming conventions are:
+* **Service Name:** `{groupSessionID}-{roleNameSanitized}` (where `{roleNameSanitized}` replaces non-DNS-compliant characters with hyphens).
+* **Service Selector:** Matches `SessionIdLabelKey` and `Role` metadata on the sandbox.
+* **Service Lifecycle:** Created during `createSandboxGroup()` and cleaned up automatically via OwnerReferences when the MultiAgentRuntime is deleted.
+
+The environment variables injected into the dependent pod's containers point to the stable Service DNS name:
 
 1. **Default Endpoint (Primary service port):**
    ```
-   AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_ENDPOINT = {podIP}:{port}
+   AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_ENDPOINT = {serviceName}.{namespace}.svc.cluster.local:{port}
    ```
    * **Port Resolution Rule:**
      * If `targetPort` is explicitly defined in the dependency's `RoleSpec` (either as a port name or number), that port is used as the default.
@@ -223,7 +227,7 @@ The naming conventions are:
 2. **Named Port Endpoints (For multi-service agents exposing multiple ports):**
    If the dependency's `AgentRuntime` CRD defines named ports, an endpoint environment variable is additionally injected for each named port:
    ```
-   AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_PORT_{PORT_NAME_SANITISED_UPPER}_ENDPOINT = {podIP}:{portNumber}
+   AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_PORT_{PORT_NAME_SANITISED_UPPER}_ENDPOINT = {serviceName}.{namespace}.svc.cluster.local:{portNumber}
    ```
 
 **Validation against Naming Collisions:**
@@ -232,12 +236,12 @@ The naming conventions are:
 **Injection Scope:**
 * The dependency endpoints are injected into the `Env` list of **all containers** (including primary, sidecar, and init-containers) defined in the pod spec. This ensures that any multi-container runtime configuration can reliably resolve the endpoints.
 
-For a role with `dependencies: [my-planner]` (where the planner exposes a port named `api` at `8080` and `metrics` at `9090`), the dependent pod's containers receive:
+For a role with `dependencies: [my-planner]` in namespace `default` (where `my-planner` maps to service name `grp-xyz-my-planner` and exposes a port named `api` at `8080` and `metrics` at `9090`), the dependent pod's containers receive:
 
 ```
-AGENTCUBE_DEP_MY_PLANNER_ENDPOINT = 10.0.0.4:8080
-AGENTCUBE_DEP_MY_PLANNER_PORT_API_ENDPOINT = 10.0.0.4:8080
-AGENTCUBE_DEP_MY_PLANNER_PORT_METRICS_ENDPOINT = 10.0.0.4:9090
+AGENTCUBE_DEP_MY_PLANNER_ENDPOINT = grp-xyz-my-planner.default.svc.cluster.local:8080
+AGENTCUBE_DEP_MY_PLANNER_PORT_API_ENDPOINT = grp-xyz-my-planner.default.svc.cluster.local:8080
+AGENTCUBE_DEP_MY_PLANNER_PORT_METRICS_ENDPOINT = grp-xyz-my-planner.default.svc.cluster.local:9090
 ```
 
 Injection happens in-memory inside `createSandboxGroup()` by mutating the pod template before it is passed to `buildSandboxByAgentRuntime()`. The referenced `AgentRuntime` CRD object in the informer cache is never written.
@@ -271,6 +275,14 @@ agentgroup:{grp-xxx} -> AgentGroupManifest{
 ### Core Implementation: `createSandboxGroup()`
 
 ```go
+type createdRole struct {
+    name      string
+    resp      *types.CreateSandboxResponse
+    sandbox   *sandboxv1alpha1.Sandbox
+    sessionID string
+    failed    bool
+}
+
 func (s *Server) createSandboxGroup(
     ctx context.Context,
     mar *runtimev1alpha1.MultiAgentRuntime,
@@ -286,6 +298,9 @@ func (s *Server) createSandboxGroup(
             return
         }
         for _, c := range created {
+            if c.failed {
+                continue
+            }
             // rollbackSandboxCreation is called as-is (no changes to the function).
             s.rollbackSandboxCreation(dynamicClient, c.sandbox, nil, c.sessionID)
         }
@@ -314,12 +329,19 @@ func (s *Server) createSandboxGroup(
                 sandboxEntry.GroupSessionID = groupSessionID
                 sandboxEntry.Role = role.Name
 
-                // Apply group-level SessionTimeout and MaxSessionDuration overrides
+                // Apply group-level SessionTimeout override
                 if mar.Spec.SessionTimeout != nil {
-                    sandboxEntry.SessionTimeout = mar.Spec.SessionTimeout
+                    sandboxEntry.IdleTimeout = mar.Spec.SessionTimeout.Duration
+                    if sandbox.Annotations == nil {
+                        sandbox.Annotations = make(map[string]string)
+                    }
+                    sandbox.Annotations[IdleTimeoutAnnotationKey] = mar.Spec.SessionTimeout.Duration.String()
                 }
+
+                // Apply group-level MaxSessionDuration override
                 if mar.Spec.MaxSessionDuration != nil {
-                    sandbox.Spec.MaxSessionDuration = mar.Spec.MaxSessionDuration
+                    shutdownTime := metav1.NewTime(time.Now().Add(mar.Spec.MaxSessionDuration.Duration))
+                    sandbox.Spec.Lifecycle.ShutdownTime = &shutdownTime
                 }
 
                 if len(role.Dependencies) > 0 {
@@ -337,6 +359,12 @@ func (s *Server) createSandboxGroup(
                 if err != nil {
                     if mar.Spec.StartupPolicy == StartupPolicyBestEffort && !role.IsCoordinator {
                         klog.Warningf("group %s: role %s failed (BestEffort policy): %v", groupSessionID, role.Name, err)
+                        createdMutex.Lock()
+                        created = append(created, createdRole{
+                            name:   role.Name,
+                            failed: true,
+                        })
+                        createdMutex.Unlock()
                         return nil // skip worker failure under BestEffort
                     }
                     return fmt.Errorf("role %s: %w", role.Name, err)
@@ -348,6 +376,7 @@ func (s *Server) createSandboxGroup(
                     resp:      resp,
                     sandbox:   sandbox,
                     sessionID: sandboxEntry.SessionID,
+                    failed:    false,
                 })
                 createdMutex.Unlock()
                 return nil
@@ -702,11 +731,11 @@ The reconciler watches for `Sandbox` objects whose `GroupSessionID` matches a kn
 - **`Atomic` policy**: the reconciler calls `handleDeleteAgentGroup()` to tear down all remaining sandboxes and delete the group manifest. It sets a `Failed` condition on the `MultiAgentRuntimeStatus`.
 - **`BestEffort` policy**: the reconciler attempts to create a replacement sandbox for the failed role. On success, it calls `UpdateAgentGroupRoleStatus()` with the new endpoint. On repeated failure, it sets a `Degraded` condition.
 
-> [!WARNING]
-> **Stale Environment Variables in BestEffort Groups:** 
-> When a failed worker pod is replaced under the `BestEffort` policy, the new pod receives a new IP address. Because environment variables are immutable once a pod is running, already active dependent pods (such as the coordinator) will retain the stale endpoint in their environment variables.
+> [!NOTE]
+> **DNS-Based Self-Healing in BestEffort Groups:** 
+> Because environment variables are immutable once a pod is running, replacing a failed worker pod with a new pod under the `BestEffort` policy would render direct Pod IP environment variables stale. 
 > 
-> To prevent communication failures, agents deployed in `BestEffort` groups must not rely solely on injected environment variables. Instead, they should utilize the `/topology` endpoint (`GET /v1/multi-agent-runtime/groups/:groupSessionId/topology`) for dynamic service discovery to retrieve current worker endpoints.
+> With Headless Kubernetes Services, the injected environment variable points to a stable DNS endpoint (e.g., `grp-abc-my-planner.default.svc.cluster.local`). When a worker pod is replaced, the Service's selector matches the replacement pod and the DNS record resolves to its new IP. Active dependent pods (like the coordinator) resolve the same DNS name to the new worker IP without needing environment variable updates or dynamic service discovery polling.
 
 ### Status Conditions
 

From c34fc9e9b2eac5bb573088bbff69eb0f94c63b42 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 04:51:18 +0530
Subject: [PATCH 07/20] docs: address service naming, admission webhook, TTL
 consistency, and Redis Hash layout feedback

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 30 +++++++++++++++++----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index ec813545..ba14d185 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -209,7 +209,7 @@ In both policies, coordinator failure always causes full rollback and an error r
 
 Before a dependent role's sandbox is created, stable DNS-based endpoints of its dependencies are injected as environment variables into the pod template. To ensure stable communication under the `BestEffort` policy where failed worker pods are replaced and get new IP addresses, the Workload Manager automatically creates a Headless Kubernetes Service for each role in the group. 
 
-* **Service Name:** `{groupSessionID}-{roleNameSanitized}` (where `{roleNameSanitized}` replaces non-DNS-compliant characters with hyphens).
+* **Service Name:** `mar-{shortHash}-{roleNameSanitized}` (where `{shortHash}` is the first 8 characters of the SHA-256 hash of the `groupSessionID`, and `{roleNameSanitized}` replaces non-DNS-compliant characters with hyphens). This ensures the service name stays well under the Kubernetes 63-character DNS label limit even with long role names.
 * **Service Selector:** Matches `SessionIdLabelKey` and `Role` metadata on the sandbox.
 * **Service Lifecycle:** Created during `createSandboxGroup()` and cleaned up automatically via OwnerReferences when the MultiAgentRuntime is deleted.
 
@@ -292,6 +292,9 @@ func (s *Server) createSandboxGroup(
     groupSessionID := "grp-" + uuid.New().String()
     var created []createdRole
 
+    // Compute a single baseTime for consistent TTL calculation across all group members
+    baseTime := time.Now()
+
     needGroupRollback := true
     defer func() {
         if !needGroupRollback {
@@ -338,9 +341,9 @@ func (s *Server) createSandboxGroup(
                     sandbox.Annotations[IdleTimeoutAnnotationKey] = mar.Spec.SessionTimeout.Duration.String()
                 }
 
-                // Apply group-level MaxSessionDuration override
+                // Apply group-level MaxSessionDuration override using the shared baseTime
                 if mar.Spec.MaxSessionDuration != nil {
-                    shutdownTime := metav1.NewTime(time.Now().Add(mar.Spec.MaxSessionDuration.Duration))
+                    shutdownTime := metav1.NewTime(baseTime.Add(mar.Spec.MaxSessionDuration.Duration))
                     sandbox.Spec.Lifecycle.ShutdownTime = &shutdownTime
                 }
 
@@ -401,11 +404,15 @@ func (s *Server) createSandboxGroup(
 **Key properties:**
 
 - **Parallel Sandbox Creation:** To prevent HTTP gateway or client timeouts when launching large groups, roles that do not share mutual dependencies (i.e., reside at the same level of the dependency DAG) are created in parallel. Sandbox creation proceeds in "dependency waves": all sandboxes within a wave are launched concurrently, and the server waits for all to be ready before proceeding to the next dependent wave.
+- **Consistent TTL:** A single `baseTime` is captured before the creation loop begins and used for all `ShutdownTime` calculations. This ensures every sandbox in the group shares a synchronized absolute TTL, regardless of how long it takes to create each wave.
 - The deferred rollback calls the existing `rollbackSandboxCreation()` function, without modification, for every sandbox in `created`.
 - Roles are created in topological order. A dependency's endpoint is guaranteed to be in `created` before the dependent role's sandbox is built.
 - `buildSandboxByAgentRuntime()`, `createSandbox()`, `WatchSandboxOnce()`, and `rollbackSandboxCreation()` are all called as-is.
 - The `needGroupRollback` flag is only cleared after `SaveAgentGroup` succeeds. A store failure after all sandboxes are created will roll back the Kubernetes resources, maintaining consistency between the cluster state and the store.
 
+> [!NOTE]
+> **Future Improvement: Reconciler-Based Orchestration.** The current design executes `createSandboxGroup()` synchronously within the API handler. For very large groups where the cumulative creation time may approach HTTP proxy timeouts, a more resilient approach would be to have the API handler persist the `MultiAgentRuntime` CRD with a `Creating` status and delegate the actual sandbox orchestration to the `MultiAgentRuntimeReconciler`. This ensures the system can recover and resume group creation even if the Workload Manager restarts mid-process. This is left as a future optimization since the wave-based parallelism already significantly reduces total startup latency for practical group sizes.
+
 The following sequence diagram illustrates the creation flow for a 3-role group under `Atomic` policy:
 
 ```mermaid
@@ -599,7 +606,15 @@ Both `store_redis.go` and `store_valkey.go` implement these methods using the ke
 
 > **Store Implementation Note:** To prevent race conditions in a distributed environment during concurrent updates, the store backends (Redis/Valkey) implement group manifests using a **Redis Hash** (`HSET agentgroup:{groupSessionID}`) instead of a raw JSON string.
 > 
-> The hash fields map directly to roles and their metadata (e.g., `HSET agentgroup:{groupSessionID} role:{roleName} <json>`), allowing atomic field-level updates without rewriting the full manifest JSON, avoiding read-modify-write races.
+> The hash layout uses a reserved `_metadata` field for top-level group attributes (`GroupSessionID`, `CreatedAt`, `StartupPolicy`, `Namespace`) and individual `role:{roleName}` fields for per-role state:
+> ```
+> HSET agentgroup:{grp-xxx}
+>   _metadata        '{"groupSessionID":"grp-xxx","createdAt":"...","startupPolicy":"Atomic","namespace":"default"}'
+>   role:planner     '{"sessionID":"...","endpoint":"...","status":"ready"}'
+>   role:researcher  '{"sessionID":"...","endpoint":"...","status":"ready"}'
+>   role:coder       '{"sessionID":"...","endpoint":"...","status":"failed"}'
+> ```
+> This allows atomic field-level updates via `HSET` without rewriting the full manifest JSON, avoiding read-modify-write races. `GetAgentGroup()` reads all fields with `HGETALL` and reconstructs the full `AgentGroupManifest` by combining the `_metadata` and `role:*` entries.
 
 ### CRD Types
 
@@ -888,12 +903,17 @@ This feature is fully backward compatible. No existing behavior changes unless t
 Deliverables that satisfy the mentorship expected outcomes on their own.
 
 - Define `MultiAgentRuntime` CRD types with kubebuilder markers; run `make generate` + `make gen-crd`.
+- Implement a **ValidatingAdmissionWebhook** for `MultiAgentRuntime` to enforce configuration invariants at admission time:
+  - Exactly one role must be marked as `isCoordinator`.
+  - No two roles may produce the same sanitized environment variable key (naming collision detection).
+  - `dependencies[]` references must point to roles defined within the same spec.
+  - Role names must be valid DNS label fragments (lowercase alphanumeric and hyphens, max 63 characters).
 - Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet).
 - Add `GroupSessionID` + `Role` to `SandboxInfo`; propagate through `buildSandboxPlaceHolder()` + `buildSandboxInfo()`.
 - Implement all 4 store methods in `store_redis.go` + `store_valkey.go` with full unit test coverage.
 - Add `MultiAgentRuntimeKind` to Router endpoint switch.
 - Extend GC to clean up `agentgroup:` manifest keys when last member sandbox is deleted.
-- Unit tests: `createSandboxGroup()` with atomic rollback on partial failure, store CRUD, coordinator validation, cycle detection.
+- Unit tests: `createSandboxGroup()` with atomic rollback on partial failure, store CRUD, coordinator validation, cycle detection, admission webhook validation.
 - E2E test: kind cluster (same setup as existing E2E), create a 3-role group, verify all sandboxes running, delete group, verify cleanup.
 - User guide: YAML example + `kubectl` workflow.
 

From 58d4e0bb8cadaf43aa4a3c4f3ef5383e5db38fff Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 04:55:58 +0530
Subject: [PATCH 08/20] docs: fix ToC link typo, SANITIZED spelling, DELETE
 path, and creation date

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index ba14d185..d78ecea2 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -2,7 +2,7 @@
 title: Multi-Agent Runtime Design Proposal
 authors:
   - "@Abhinav-kodes"
-creation-date: 2025-05-12
+creation-date: 2026-05-19
 ---
 
 # Multi-Agent Runtime: Design Proposal
@@ -34,7 +34,7 @@ Author: Abhinav Singh
    - [Store Interface Additions](#store-interface-additions)
    - [CRD Types](#crd-types)
    - [SandboxInfo Extensions](#sandboxinfo-extensions)
-7. [Controller: MultiAgentRuntimeReconciler](#controller-multiagentrimereconciler)
+7. [Controller: MultiAgentRuntimeReconciler](#controller-multiagentruntimereconciler)
 8. [Garbage Collection](#garbage-collection)
 9. [Router Integration](#router-integration)
 10. [SDK Integration](#sdk-integration)
@@ -217,7 +217,7 @@ The environment variables injected into the dependent pod's containers point to
 
 1. **Default Endpoint (Primary service port):**
    ```
-   AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_ENDPOINT = {serviceName}.{namespace}.svc.cluster.local:{port}
+   AGENTCUBE_DEP_{ROLE_NAME_SANITIZED_UPPER}_ENDPOINT = {serviceName}.{namespace}.svc.cluster.local:{port}
    ```
    * **Port Resolution Rule:**
      * If `targetPort` is explicitly defined in the dependency's `RoleSpec` (either as a port name or number), that port is used as the default.
@@ -227,7 +227,7 @@ The environment variables injected into the dependent pod's containers point to
 2. **Named Port Endpoints (For multi-service agents exposing multiple ports):**
    If the dependency's `AgentRuntime` CRD defines named ports, an endpoint environment variable is additionally injected for each named port:
    ```
-   AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_PORT_{PORT_NAME_SANITISED_UPPER}_ENDPOINT = {serviceName}.{namespace}.svc.cluster.local:{portNumber}
+   AGENTCUBE_DEP_{ROLE_NAME_SANITIZED_UPPER}_PORT_{PORT_NAME_SANITIZED_UPPER}_ENDPOINT = {serviceName}.{namespace}.svc.cluster.local:{portNumber}
    ```
 
 **Validation against Naming Collisions:**
@@ -533,7 +533,7 @@ The algorithm is Kahn's BFS-based topological sort grouped into level-order wave
 | Method | Path | Description |
 |--------|------|-------------|
 | `POST` | `/v1/multi-agent-runtime` | Create a new agent group. Returns group session ID and coordinator entrypoints. |
-| `DELETE` | `/v1/multi-agent-runtime/sessions/:groupSessionId` | Delete all sandboxes in the group and remove the group manifest from the store. |
+| `DELETE` | `/v1/multi-agent-runtime/groups/:groupSessionId` | Delete all sandboxes in the group and remove the group manifest from the store. |
 | `GET` | `/v1/multi-agent-runtime/groups/:groupSessionId/topology` | Return the group manifest including all role endpoints and statuses. Intended for use by the coordinator at startup to discover worker endpoints. |
 
 ### Request and Response Types

From e55e21c1c5a58c1e3294c3493ec696407104ec7a Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 21:34:41 +0530
Subject: [PATCH 09/20] docs: add Kind field to RoleSpec, fix service name
 truncation, HDEL for GC cleanup, and cache GC coordinator lookup

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 57 +++++++++++++++------
 1 file changed, 40 insertions(+), 17 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index d78ecea2..40fe5ee4 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -209,7 +209,7 @@ In both policies, coordinator failure always causes full rollback and an error r
 
 Before a dependent role's sandbox is created, stable DNS-based endpoints of its dependencies are injected as environment variables into the pod template. To ensure stable communication under the `BestEffort` policy where failed worker pods are replaced and get new IP addresses, the Workload Manager automatically creates a Headless Kubernetes Service for each role in the group. 
 
-* **Service Name:** `mar-{shortHash}-{roleNameSanitized}` (where `{shortHash}` is the first 8 characters of the SHA-256 hash of the `groupSessionID`, and `{roleNameSanitized}` replaces non-DNS-compliant characters with hyphens). This ensures the service name stays well under the Kubernetes 63-character DNS label limit even with long role names.
+* **Service Name:** `mar-{shortHash}-{roleNameSanitized}` (where `{shortHash}` is the first 8 characters of the SHA-256 hash of the `groupSessionID`, and `{roleNameSanitized}` replaces non-DNS-compliant characters with hyphens, **truncated to 50 characters**). With a 13-character prefix (`mar-` + 8-char hash + `-`), truncating the role name to 50 characters guarantees the full service name always fits within the Kubernetes 63-character DNS label limit.
 * **Service Selector:** Matches `SessionIdLabelKey` and `Role` metadata on the sandbox.
 * **Service Lifecycle:** Created during `createSandboxGroup()` and cleaned up automatically via OwnerReferences when the MultiAgentRuntime is deleted.
 
@@ -322,10 +322,21 @@ func (s *Server) createSandboxGroup(
         for _, r := range wave {
             role := r // capture loop variable
             g.Go(func() error {
-                // buildSandboxByAgentRuntime is called as-is (no changes to the function).
-                sandbox, sandboxEntry, err := buildSandboxByAgentRuntime(
-                    mar.Namespace, role.RuntimeRef, s.informers,
-                )
+                // Dispatch to the correct builder based on role Kind.
+                var sandbox *sandboxv1alpha1.Sandbox
+                var sandboxClaim *extensionsv1alpha1.SandboxClaim
+                var sandboxEntry *sandboxEntry
+                var err error
+                if role.Kind == types.CodeInterpreterKind {
+                    sandbox, sandboxClaim, sandboxEntry, err = buildSandboxByCodeInterpreter(
+                        mar.Namespace, role.RuntimeRef, s.informers,
+                    )
+                } else {
+                    // Default: AgentRuntime
+                    sandbox, sandboxEntry, err = buildSandboxByAgentRuntime(
+                        mar.Namespace, role.RuntimeRef, s.informers,
+                    )
+                }
                 if err != nil {
                     return fmt.Errorf("role %s: build sandbox: %w", role.Name, err)
                 }
@@ -357,7 +368,7 @@ func (s *Server) createSandboxGroup(
                 resp, err := func() (*types.CreateSandboxResponse, error) {
                     resultChan := s.sandboxController.WatchSandboxOnce(waveCtx, sandbox.Namespace, sandbox.Name)
                     defer s.sandboxController.UnWatchSandbox(sandbox.Namespace, sandbox.Name)
-                    return s.createSandbox(waveCtx, dynamicClient, sandbox, nil, sandboxEntry, resultChan)
+                    return s.createSandbox(waveCtx, dynamicClient, sandbox, sandboxClaim, sandboxEntry, resultChan)
                 }()
                 if err != nil {
                     if mar.Spec.StartupPolicy == StartupPolicyBestEffort && !role.IsCoordinator {
@@ -407,7 +418,7 @@ func (s *Server) createSandboxGroup(
 - **Consistent TTL:** A single `baseTime` is captured before the creation loop begins and used for all `ShutdownTime` calculations. This ensures every sandbox in the group shares a synchronized absolute TTL, regardless of how long it takes to create each wave.
 - The deferred rollback calls the existing `rollbackSandboxCreation()` function, without modification, for every sandbox in `created`.
 - Roles are created in topological order. A dependency's endpoint is guaranteed to be in `created` before the dependent role's sandbox is built.
-- `buildSandboxByAgentRuntime()`, `createSandbox()`, `WatchSandboxOnce()`, and `rollbackSandboxCreation()` are all called as-is.
+- `buildSandboxByAgentRuntime()`, `buildSandboxByCodeInterpreter()`, `createSandbox()`, `WatchSandboxOnce()`, and `rollbackSandboxCreation()` are all called as-is. The correct builder is selected by the `role.Kind` field.
 - The `needGroupRollback` flag is only cleared after `SaveAgentGroup` succeeds. A store failure after all sandboxes are created will roll back the Kubernetes resources, maintaining consistency between the cluster state and the store.
 
 > [!NOTE]
@@ -597,6 +608,11 @@ GetAgentGroup(ctx context.Context, groupSessionID string) (*types.AgentGroupMani
 // DeleteAgentGroup removes a group manifest by groupSessionID.
 DeleteAgentGroup(ctx context.Context, groupSessionID string) error
 
+// DeleteAgentGroupRole atomically removes a single role entry from the group manifest
+// using HDEL. Preferred over a read-modify-write cycle during GC to avoid race conditions.
+// If the role was the last entry, it also deletes the _metadata field and the key.
+DeleteAgentGroupRole(ctx context.Context, groupSessionID, roleName string) error
+
 // UpdateAgentGroupRoleStatus atomically updates the status and endpoint of a specific role
 // within a group manifest. Used by the reconciler during self-healing.
 UpdateAgentGroupRoleStatus(ctx context.Context, groupSessionID, roleName, status, endpoint string) error
@@ -664,7 +680,14 @@ type RoleSpec struct {
     // +kubebuilder:validation:MinLength=1
     Name string `json:"name"`
 
-    // RuntimeRef is the name of an existing AgentRuntime CRD in the same namespace.
+    // Kind specifies the type of the referenced runtime.
+    // Defaults to "AgentRuntime". Set to "CodeInterpreter" to reference a CodeInterpreter CRD.
+    // +optional
+    // +kubebuilder:default="AgentRuntime"
+    // +kubebuilder:validation:Enum=AgentRuntime;CodeInterpreter
+    Kind string `json:"kind,omitempty"`
+
+    // RuntimeRef is the name of an existing AgentRuntime or CodeInterpreter CRD in the same namespace.
     // +kubebuilder:validation:MinLength=1
     RuntimeRef string `json:"runtimeRef"`
 
@@ -684,8 +707,8 @@ type RoleSpec struct {
     // +optional
     Dependencies []string `json:"dependencies,omitempty"`
 
-    // TargetPort specifies the name or number of the port in the referenced AgentRuntime 
-    // to be used by dependent roles. If empty, the default Port Resolution Rule applies.
+    // TargetPort specifies the name or number of the port in the referenced AgentRuntime
+    // or CodeInterpreter to be used by dependent roles. If empty, the default Port Resolution Rule applies.
     // +optional
     TargetPort string `json:"targetPort,omitempty"`
 }
@@ -772,20 +795,20 @@ The existing GC in `pkg/workloadmanager/garbage_collection.go` is extended with
 Because the Router only proxies external traffic directly to the coordinator, only the coordinator's `LastActivityAt` timestamp in the store is updated during active sessions. Internal worker sandboxes that receive no direct external traffic would otherwise retain static `LastActivityAt` values, causing the GC to prematurely delete them while the coordinator is still active.
 
 To prevent this, the GC evaluates idle timeouts group-wide:
-1. When checking if a sandbox is idle, if its `GroupSessionID` is non-empty, the GC retrieves the group's coordinator sandbox from the store.
+1. When checking if a sandbox is idle, if its `GroupSessionID` is non-empty, the GC retrieves the group manifest once per GC cycle (cached in a `map[string]*AgentGroupManifest` local to the cycle) and looks up the coordinator sandbox from the manifest.
 2. The idle duration for **all members of the group** is calculated based on the coordinator's `LastActivityAt` timestamp (or the maximum `LastActivityAt` among all group member sandboxes if the coordinator's timestamp is unavailable).
 3. Individual sandboxes in a group are only deleted for inactivity if the group as a whole is determined to be idle.
 
+Caching the manifest per group per GC cycle avoids O(N) redundant store lookups where N is the number of worker sandboxes in the group.
+
 ### Group Metadata Cleanup
 
 When the GC deletes a sandbox that has a non-empty `GroupSessionID`:
 
-1. It calls `GetAgentGroup()` to retrieve the group manifest.
-2. It removes the deleted role from the manifest.
-3. If no roles remain in the manifest, it calls `DeleteAgentGroup()` to remove the `agentgroup:` key from the store.
-4. If other roles remain, it calls `SaveAgentGroup()` with the updated manifest.
+1. It calls `DeleteAgentGroupRole(ctx, groupSessionID, roleName)` to atomically remove the role's entry from the Redis Hash using `HDEL`. This avoids a read-modify-write cycle and prevents race conditions when multiple GC goroutines may be cleaning up members of the same group concurrently.
+2. `DeleteAgentGroupRole()` additionally checks if any `role:*` fields remain in the hash after deletion. If none remain, it deletes the `_metadata` field and removes the entire `agentgroup:` key.
 
-This ensures that group manifests do not accumulate indefinitely in the store after their member sandboxes expire. Group membership is an additional cleanup concern layered on top of existing GC, not a replacement.
+This ensures group manifests do not accumulate indefinitely in the store after their member sandboxes expire. The atomic `HDEL` approach also prevents data loss from concurrent GC deletions of sibling roles within the same group.
 
 ---
 
@@ -884,7 +907,7 @@ This feature is fully backward compatible. No existing behavior changes unless t
 | `pkg/workloadmanager/handlers_test.go` | Add group creation and rollback test cases |
 | `pkg/workloadmanager/server.go` | Add 3 new routes under `/v1/multi-agent-runtime` |
 | `pkg/workloadmanager/garbage_collection.go` | Group manifest cleanup when last member sandbox is GC'd |
-| `pkg/store/interface.go` | Add `SaveAgentGroup`, `GetAgentGroup`, `DeleteAgentGroup`, `UpdateAgentGroupRoleStatus` |
+| `pkg/store/interface.go` | Add `SaveAgentGroup`, `GetAgentGroup`, `DeleteAgentGroup`, `DeleteAgentGroupRole`, `UpdateAgentGroupRoleStatus` |
 | `pkg/store/store_redis.go` | Implement all 4 group methods |
 | `pkg/store/store_redis_test.go` | Group CRUD tests |
 | `pkg/store/store_valkey.go` | Implement all 4 group methods |

From e8e5e8aa7a61290620c2af6c822877a1c89f27e7 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 21:59:40 +0530
Subject: [PATCH 10/20] docs: fix service name example, response payload,
 method count, and GC coordinator lookup

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 27 +++++++++++----------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index 40fe5ee4..0863a2b8 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -236,12 +236,12 @@ The environment variables injected into the dependent pod's containers point to
 **Injection Scope:**
 * The dependency endpoints are injected into the `Env` list of **all containers** (including primary, sidecar, and init-containers) defined in the pod spec. This ensures that any multi-container runtime configuration can reliably resolve the endpoints.
 
-For a role with `dependencies: [my-planner]` in namespace `default` (where `my-planner` maps to service name `grp-xyz-my-planner` and exposes a port named `api` at `8080` and `metrics` at `9090`), the dependent pod's containers receive:
+For a role with `dependencies: [my-planner]` in namespace `default` (where `my-planner` maps to service name `mar-abcdef12-my-planner` and exposes a port named `api` at `8080` and `metrics` at `9090`), the dependent pod's containers receive:
 
 ```
-AGENTCUBE_DEP_MY_PLANNER_ENDPOINT = grp-xyz-my-planner.default.svc.cluster.local:8080
-AGENTCUBE_DEP_MY_PLANNER_PORT_API_ENDPOINT = grp-xyz-my-planner.default.svc.cluster.local:8080
-AGENTCUBE_DEP_MY_PLANNER_PORT_METRICS_ENDPOINT = grp-xyz-my-planner.default.svc.cluster.local:9090
+AGENTCUBE_DEP_MY_PLANNER_ENDPOINT = mar-abcdef12-my-planner.default.svc.cluster.local:8080
+AGENTCUBE_DEP_MY_PLANNER_PORT_API_ENDPOINT = mar-abcdef12-my-planner.default.svc.cluster.local:8080
+AGENTCUBE_DEP_MY_PLANNER_PORT_METRICS_ENDPOINT = mar-abcdef12-my-planner.default.svc.cluster.local:9090
 ```
 
 Injection happens in-memory inside `createSandboxGroup()` by mutating the pod template before it is passed to `buildSandboxByAgentRuntime()`. The referenced `AgentRuntime` CRD object in the informer cache is never written.
@@ -460,7 +460,7 @@ sequenceDiagram
 
     WM->>Store: SaveAgentGroup(manifest)
     WM-->>Router: CreateAgentGroupResponse
-    Router-->>Client: 200 OK + groupSessionId
+    Router-->>Client: 200 OK + CreateAgentGroupResponse
 ```
 
 ### Topological Sort and Cycle Detection
@@ -594,7 +594,7 @@ type AgentGroupRole struct {
 
 ### Store Interface Additions
 
-Four new methods are added to the `Store` interface in `pkg/store/interface.go`. All existing methods are unchanged.
+Five new methods are added to the `Store` interface in `pkg/store/interface.go`. All existing methods are unchanged.
 
 ```go
 // SaveAgentGroup persists a group manifest keyed by groupSessionID.
@@ -795,11 +795,12 @@ The existing GC in `pkg/workloadmanager/garbage_collection.go` is extended with
 Because the Router only proxies external traffic directly to the coordinator, only the coordinator's `LastActivityAt` timestamp in the store is updated during active sessions. Internal worker sandboxes that receive no direct external traffic would otherwise retain static `LastActivityAt` values, causing the GC to prematurely delete them while the coordinator is still active.
 
 To prevent this, the GC evaluates idle timeouts group-wide:
-1. When checking if a sandbox is idle, if its `GroupSessionID` is non-empty, the GC retrieves the group manifest once per GC cycle (cached in a `map[string]*AgentGroupManifest` local to the cycle) and looks up the coordinator sandbox from the manifest.
-2. The idle duration for **all members of the group** is calculated based on the coordinator's `LastActivityAt` timestamp (or the maximum `LastActivityAt` among all group member sandboxes if the coordinator's timestamp is unavailable).
-3. Individual sandboxes in a group are only deleted for inactivity if the group as a whole is determined to be idle.
+1. When checking if a sandbox is idle, if its `GroupSessionID` is non-empty, the GC retrieves the group manifest once per GC cycle via `GetAgentGroup()` (result cached in a `map[string]*AgentGroupManifest` local to that cycle). The manifest's `role:*` fields contain the `SessionID` of each member, including the coordinator.
+2. The coordinator's `SandboxInfo` (including `LastActivityAt`) is fetched with a single `GetSandbox(coordinatorSessionID)` call. This result is also cached per group per cycle, so it is only fetched once regardless of how many worker sandboxes belong to that group. If the coordinator's `SandboxInfo` is unavailable (e.g., already evicted), the GC falls back to the maximum `LastActivityAt` among all group member sandboxes whose `SandboxInfo` can be retrieved.
+3. The idle duration for **all members of the group** is calculated based on the resolved coordinator (or fallback) `LastActivityAt` timestamp.
+4. Individual sandboxes in a group are only deleted for inactivity if the group as a whole is determined to be idle.
 
-Caching the manifest per group per GC cycle avoids O(N) redundant store lookups where N is the number of worker sandboxes in the group.
+Caching both the group manifest and the coordinator `SandboxInfo` per group per GC cycle reduces the number of store round trips on the common path to O(1) per group, rather than one lookup per member sandbox (O(N) per group).
 
 ### Group Metadata Cleanup
 
@@ -908,9 +909,9 @@ This feature is fully backward compatible. No existing behavior changes unless t
 | `pkg/workloadmanager/server.go` | Add 3 new routes under `/v1/multi-agent-runtime` |
 | `pkg/workloadmanager/garbage_collection.go` | Group manifest cleanup when last member sandbox is GC'd |
 | `pkg/store/interface.go` | Add `SaveAgentGroup`, `GetAgentGroup`, `DeleteAgentGroup`, `DeleteAgentGroupRole`, `UpdateAgentGroupRoleStatus` |
-| `pkg/store/store_redis.go` | Implement all 4 group methods |
+| `pkg/store/store_redis.go` | Implement all 5 group methods |
 | `pkg/store/store_redis_test.go` | Group CRUD tests |
-| `pkg/store/store_valkey.go` | Implement all 4 group methods |
+| `pkg/store/store_valkey.go` | Implement all 5 group methods |
 | `pkg/store/store_valkey_test.go` | Group CRUD tests |
 | `pkg/router/session_manager.go` | Add `MultiAgentRuntimeKind` case in endpoint switch |
 | `cmd/workload-manager/main.go` | Phase 1: HTTP routes; Phase 4: reconciler wiring |
@@ -933,7 +934,7 @@ Deliverables that satisfy the mentorship expected outcomes on their own.
   - Role names must be valid DNS label fragments (lowercase alphanumeric and hyphens, max 63 characters).
 - Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet).
 - Add `GroupSessionID` + `Role` to `SandboxInfo`; propagate through `buildSandboxPlaceHolder()` + `buildSandboxInfo()`.
-- Implement all 4 store methods in `store_redis.go` + `store_valkey.go` with full unit test coverage.
+- Implement all 5 store methods in `store_redis.go` + `store_valkey.go` with full unit test coverage.
 - Add `MultiAgentRuntimeKind` to Router endpoint switch.
 - Extend GC to clean up `agentgroup:` manifest keys when last member sandbox is deleted.
 - Unit tests: `createSandboxGroup()` with atomic rollback on partial failure, store CRUD, coordinator validation, cycle detection, admission webhook validation.

From 08307ce83078bcfc9a0c3191f39ea9f8afa47de1 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 22:05:34 +0530
Subject: [PATCH 11/20] docs: fix trailing hyphen stripping, service name
 collision check, no-port error handling, and injectDependencyEndpoints
 signature

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index 0863a2b8..b7238ad3 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -209,7 +209,7 @@ In both policies, coordinator failure always causes full rollback and an error r
 
 Before a dependent role's sandbox is created, stable DNS-based endpoints of its dependencies are injected as environment variables into the pod template. To ensure stable communication under the `BestEffort` policy where failed worker pods are replaced and get new IP addresses, the Workload Manager automatically creates a Headless Kubernetes Service for each role in the group. 
 
-* **Service Name:** `mar-{shortHash}-{roleNameSanitized}` (where `{shortHash}` is the first 8 characters of the SHA-256 hash of the `groupSessionID`, and `{roleNameSanitized}` replaces non-DNS-compliant characters with hyphens, **truncated to 50 characters**). With a 13-character prefix (`mar-` + 8-char hash + `-`), truncating the role name to 50 characters guarantees the full service name always fits within the Kubernetes 63-character DNS label limit.
+* **Service Name:** `mar-{shortHash}-{roleNameSanitized}` (where `{shortHash}` is the first 8 characters of the SHA-256 hash of the `groupSessionID`, and `{roleNameSanitized}` replaces non-DNS-compliant characters with hyphens, **truncated to 50 characters, then stripped of any trailing hyphens**). With a 13-character prefix (`mar-` + 8-char hash + `-`), this guarantees the full service name always fits within the Kubernetes 63-character DNS label limit and is a valid DNS label (no trailing hyphens).
 * **Service Selector:** Matches `SessionIdLabelKey` and `Role` metadata on the sandbox.
 * **Service Lifecycle:** Created during `createSandboxGroup()` and cleaned up automatically via OwnerReferences when the MultiAgentRuntime is deleted.
 
@@ -223,6 +223,7 @@ The environment variables injected into the dependent pod's containers point to
      * If `targetPort` is explicitly defined in the dependency's `RoleSpec` (either as a port name or number), that port is used as the default.
      * If `targetPort` is not specified, and the dependency's `AgentRuntime` CRD defines a single port, that port is used.
      * If `targetPort` is not specified, and it defines multiple ports, the system first looks for a port named `http` or `default`. If no such port is found, it falls back to the first port in the ports list.
+     * If `targetPort` is not specified and the dependency's `AgentRuntime` CRD defines **no ports**, no endpoint can be resolved for that dependency, so `injectDependencyEndpoints()` returns an error and creation of the dependent role fails. The `ValidatingAdmissionWebhook` should reject, at admission time, any role that lists a dependency whose referenced `AgentRuntime` exposes no ports, so that this error never reaches the creation path.
 
 2. **Named Port Endpoints (For multi-service agents exposing multiple ports):**
    If the dependency's `AgentRuntime` CRD defines named ports, an endpoint environment variable is additionally injected for each named port:
@@ -232,6 +233,7 @@ The environment variables injected into the dependent pod's containers point to
 
 **Validation against Naming Collisions:**
 * Because multiple role names or port names could map to the same sanitized environment variable (e.g., `my-agent` and `my.agent` both sanitize to `AGENTCUBE_DEP_MY_AGENT_ENDPOINT`), the API server validates the group configuration at request admission time. If any two roles or named ports within the group result in the same sanitized environment variable key, the request is rejected with a `400 Bad Request` validation error.
+* The `ValidatingAdmissionWebhook` also explicitly checks for **Service name collisions after truncation**: after computing `mar-{shortHash}-{roleNameSanitized-truncated-stripped}` for each role, if any two roles in the same group produce an identical service name, the request is rejected. This prevents the edge case where two roles whose names are identical in their first 50 characters would silently share a single Headless Service.
 
 **Injection Scope:**
 * The dependency endpoints are injected into the `Env` list of **all containers** (including primary, sidecar, and init-containers) defined in the pod spec. This ensures that any multi-container runtime configuration can reliably resolve the endpoints.
@@ -360,8 +362,11 @@ func (s *Server) createSandboxGroup(
 
                 if len(role.Dependencies) > 0 {
                     createdMutex.Lock()
-                    injectDependencyEndpoints(&sandbox.Spec.PodTemplate, role.Dependencies, created)
+                    injectErr := injectDependencyEndpoints(&sandbox.Spec.PodTemplate, groupSessionID, role.Dependencies, created)
                     createdMutex.Unlock()
+                    if injectErr != nil {
+                        return fmt.Errorf("role %s: inject dependency endpoints: %w", role.Name, injectErr)
+                    }
                 }
 
                 // Watch and create sandbox in a closure to prevent watcher resource accumulation from defer in a loop

From b02693a20991f5fe9856d76bb90a21cb545564c8 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 22:20:26 +0530
Subject: [PATCH 12/20] docs: service creation, DNS consistency, type
 unification, phase ordering, and CRD status

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 135 +++++++++++++-------
 1 file changed, 88 insertions(+), 47 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index b7238ad3..2842b8e5 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -278,11 +278,12 @@ agentgroup:{grp-xxx} -> AgentGroupManifest{
 
 ```go
 type createdRole struct {
-    name      string
-    resp      *types.CreateSandboxResponse
-    sandbox   *sandboxv1alpha1.Sandbox
-    sessionID string
-    failed    bool
+    name       string
+    resp       *types.CreateSandboxResponse
+    sandbox    *sandboxv1alpha1.Sandbox
+    sessionID  string
+    serviceDNS string // stable DNS name of the role's Headless Service (e.g., mar-abcdef12-planner.ns.svc.cluster.local)
+    failed     bool
 }
 
 func (s *Server) createSandboxGroup(
@@ -298,6 +299,7 @@ func (s *Server) createSandboxGroup(
     baseTime := time.Now()
 
     needGroupRollback := true
+    var createdServices []string // tracks Headless Service names for rollback
     defer func() {
         if !needGroupRollback {
             return
@@ -306,9 +308,16 @@ func (s *Server) createSandboxGroup(
             if c.failed {
                 continue
             }
-            // rollbackSandboxCreation is called as-is (no changes to the function).
             s.rollbackSandboxCreation(dynamicClient, c.sandbox, nil, c.sessionID)
         }
+        // Clean up any Headless Services created during group setup.
+        for _, svcName := range createdServices {
+            if err := s.kubeClient.CoreV1().Services(mar.Namespace).Delete(
+                ctx, svcName, metav1.DeleteOptions{},
+            ); err != nil && !apierrors.IsNotFound(err) {
+                klog.Warningf("group %s: rollback service %s: %v", groupSessionID, svcName, err)
+            }
+        }
     }()
 
     waves, err := topoSort(mar.Spec.Roles)
@@ -360,6 +369,17 @@ func (s *Server) createSandboxGroup(
                     sandbox.Spec.Lifecycle.ShutdownTime = &shutdownTime
                 }
 
+                // Create a Headless Service for this role to provide a stable DNS endpoint.
+                svcName, svcDNS, err := s.createHeadlessServiceForRole(
+                    ctx, mar.Namespace, groupSessionID, role.Name, sandbox,
+                )
+                if err != nil {
+                    return fmt.Errorf("role %s: create headless service: %w", role.Name, err)
+                }
+                createdMutex.Lock()
+                createdServices = append(createdServices, svcName)
+                createdMutex.Unlock()
+
                 if len(role.Dependencies) > 0 {
                     createdMutex.Lock()
                     injectErr := injectDependencyEndpoints(&sandbox.Spec.PodTemplate, groupSessionID, role.Dependencies, created)
@@ -391,11 +411,12 @@ func (s *Server) createSandboxGroup(
 
                 createdMutex.Lock()
                 created = append(created, createdRole{
-                    name:      role.Name,
-                    resp:      resp,
-                    sandbox:   sandbox,
-                    sessionID: sandboxEntry.SessionID,
-                    failed:    false,
+                    name:       role.Name,
+                    resp:       resp,
+                    sandbox:    sandbox,
+                    sessionID:  sandboxEntry.SessionID,
+                    serviceDNS: svcDNS,
+                    failed:     false,
                 })
                 createdMutex.Unlock()
                 return nil
@@ -421,10 +442,11 @@ func (s *Server) createSandboxGroup(
 
 - **Parallel Sandbox Creation:** To prevent HTTP gateway or client timeouts when launching large groups, roles that do not share mutual dependencies (i.e., reside at the same level of the dependency DAG) are created in parallel. Sandbox creation proceeds in "dependency waves": all sandboxes within a wave are launched concurrently, and the server waits for all to be ready before proceeding to the next dependent wave.
 - **Consistent TTL:** A single `baseTime` is captured before the creation loop begins and used for all `ShutdownTime` calculations. This ensures every sandbox in the group shares a synchronized absolute TTL, regardless of how long it takes to create each wave.
-- The deferred rollback calls the existing `rollbackSandboxCreation()` function, without modification, for every sandbox in `created`.
-- Roles are created in topological order. A dependency's endpoint is guaranteed to be in `created` before the dependent role's sandbox is built.
+- **Headless Service per role:** Before each sandbox is created, `createHeadlessServiceForRole()` provisions a Headless Service whose DNS name (`serviceDNS`) is stored in `createdRole`. The `injectDependencyEndpoints()` function reads `serviceDNS` from `created` to construct stable DNS-based environment variables. On rollback, all created Services are explicitly deleted alongside their sandboxes.
+- The deferred rollback calls `rollbackSandboxCreation()` for every sandbox in `created` **and** deletes all Headless Services tracked in `createdServices`, preventing resource leaks on partial failure.
+- Roles are created in topological order. A dependency's `serviceDNS` is guaranteed to be in `created` before the dependent role's sandbox is built.
 - `buildSandboxByAgentRuntime()`, `buildSandboxByCodeInterpreter()`, `createSandbox()`, `WatchSandboxOnce()`, and `rollbackSandboxCreation()` are all called as-is. The correct builder is selected by the `role.Kind` field.
-- The `needGroupRollback` flag is only cleared after `SaveAgentGroup` succeeds. A store failure after all sandboxes are created will roll back the Kubernetes resources, maintaining consistency between the cluster state and the store.
+- The `needGroupRollback` flag is only cleared after `SaveAgentGroup` succeeds. A store failure after all sandboxes are created will roll back both the sandboxes and their Headless Services, keeping the cluster state consistent with the store.
 
 > [!NOTE]
 > **Future Improvement: Reconciler-Based Orchestration.** The current design executes `createSandboxGroup()` synchronously within the API handler. For very large groups where the cumulative creation time may approach HTTP proxy timeouts, a more resilient approach would be to have the API handler persist the `MultiAgentRuntime` CRD with a `Creating` status and delegate the actual sandbox orchestration to the `MultiAgentRuntimeReconciler`. This ensures the system can recover and resume group creation even if the Workload Manager restarts mid-process. This is left as a future optimization since the wave-based parallelism already significantly reduces total startup latency for practical group sizes.
@@ -444,20 +466,23 @@ sequenceDiagram
     WM->>WM: topoSort(roles) -> [planner, researcher, coder]
 
     Note over WM, K8s: Role 1: planner (coordinator)
+    WM->>K8s: Create Headless Service [planner]
     WM->>Store: StoreSandbox(placeholder)
     WM->>K8s: Create Sandbox [planner]
     K8s-->>WM: Sandbox Ready
     WM->>Store: UpdateSandbox(ready)
 
     Note over WM, K8s: Role 2: researcher (depends on planner)
-    WM->>WM: injectDependencyEndpoints(planner IP)
+    WM->>WM: injectDependencyEndpoints(planner Service DNS)
+    WM->>K8s: Create Headless Service [researcher]
     WM->>Store: StoreSandbox(placeholder)
     WM->>K8s: Create Sandbox [researcher]
     K8s-->>WM: Sandbox Ready
     WM->>Store: UpdateSandbox(ready)
 
     Note over WM, K8s: Role 3: coder (depends on planner, researcher)
-    WM->>WM: injectDependencyEndpoints(planner IP, researcher IP)
+    WM->>WM: injectDependencyEndpoints(planner Service DNS, researcher Service DNS)
+    WM->>K8s: Create Headless Service [coder]
     WM->>Store: StoreSandbox(placeholder)
     WM->>K8s: Create Sandbox [coder]
     K8s-->>WM: Sandbox Ready
@@ -481,7 +506,15 @@ func topoSort(roles []RoleSpec) ([][]RoleSpec, error) {
         if _, exists := inDegree[r.Name]; !exists {
             inDegree[r.Name] = 0
         }
+    }
+
+    // Validate all dependency references before running Kahn's algorithm.
+    // This provides clear "missing dependency" errors instead of confusing cycle errors.
+    for _, r := range roles {
         for _, dep := range r.Dependencies {
+            if _, exists := roleMap[dep]; !exists {
+                return nil, fmt.Errorf("role %s depends on non-existent role %s", r.Name, dep)
+            }
             adj[dep] = append(adj[dep], r.Name)
             inDegree[r.Name]++
         }
@@ -516,15 +549,7 @@ func topoSort(roles []RoleSpec) ([][]RoleSpec, error) {
     }
 
     if totalSorted != len(roles) {
-        // Check for missing dependencies first to provide a better error message.
-        for _, r := range roles {
-            for _, dep := range r.Dependencies {
-                if _, exists := roleMap[dep]; !exists {
-                    return nil, fmt.Errorf("role %s depends on non-existent role %s", r.Name, dep)
-                }
-            }
-        }
-        // Identify and name the roles involved in the cycle for the error message.
+        // All dependencies are valid (checked above), so this must be a cycle.
         var cycled []string
         for name, deg := range inDegree {
             if deg > 0 {
@@ -538,7 +563,7 @@ func topoSort(roles []RoleSpec) ([][]RoleSpec, error) {
 }
 ```
 
-The algorithm is Kahn's BFS-based topological sort grouped into level-order waves, O(V+E). Cycle detection is derived from the invariant that Kahn's algorithm only produces a complete ordering when no cycle exists. If `totalSorted < len(roles)`, the roles with remaining in-degree are in a cycle or have missing dependencies. Their names are included in the error message to aid debugging.
+The algorithm is Kahn's BFS-based topological sort grouped into level-order waves, O(V+E). Missing dependencies are validated **upfront** before the sort begins, ensuring clear error messages. Cycle detection is then derived from the invariant that Kahn's algorithm only produces a complete ordering when no cycle exists. If `totalSorted < len(roles)` after the upfront check passes, the remaining roles must form a cycle. Their names are included in the error message to aid debugging.
 
 ---
 
@@ -549,7 +574,7 @@ The algorithm is Kahn's BFS-based topological sort grouped into level-order wave
 | Method | Path | Description |
 |--------|------|-------------|
 | `POST` | `/v1/multi-agent-runtime` | Create a new agent group. Returns group session ID and coordinator entrypoints. |
-| `DELETE` | `/v1/multi-agent-runtime/groups/:groupSessionId` | Delete all sandboxes in the group and remove the group manifest from the store. |
+| `DELETE` | `/v1/multi-agent-runtime/groups/:groupSessionId` | Delete all sandboxes in the group and their associated Headless Services (via OwnerReference cascading), then remove the group manifest from the store. Returns `204 No Content` on success. |
 | `GET` | `/v1/multi-agent-runtime/groups/:groupSessionId/topology` | Return the group manifest including all role endpoints and statuses. Intended for use by the coordinator at startup to discover worker endpoints. |
 
 ### Request and Response Types
@@ -567,33 +592,28 @@ type CreateAgentGroupRequest struct {
 #### Create Group Response
 
 ```go
-type CreateAgentGroupResponse struct {
-    GroupSessionID string                   `json:"groupSessionId"`
-    Roles          []AgentGroupRoleResponse `json:"roles"`
-}
-
-type AgentGroupRoleResponse struct {
+// AgentGroupRoleState is the shared type used across the API response, group manifest,
+// and topology endpoint. A single type prevents structural drift between these surfaces.
+type AgentGroupRoleState struct {
     Name      string `json:"name"`
     SessionID string `json:"sessionId"`
     Endpoint  string `json:"endpoint"`
     Status    string `json:"status"` // "ready" | "failed"
 }
+
+type CreateAgentGroupResponse struct {
+    GroupSessionID string                `json:"groupSessionId"`
+    Roles          []AgentGroupRoleState `json:"roles"`
+}
 ```
 
 #### Group Manifest (stored in Redis/Valkey)
 
 ```go
 type AgentGroupManifest struct {
-    GroupSessionID string           `json:"groupSessionId"`
-    Roles          []AgentGroupRole `json:"roles"`
-    CreatedAt      time.Time        `json:"createdAt"`
-}
-
-type AgentGroupRole struct {
-    Name      string `json:"name"`
-    SessionID string `json:"sessionId"`
-    Endpoint  string `json:"endpoint"`
-    Status    string `json:"status"` // "ready" | "failed"
+    GroupSessionID string                `json:"groupSessionId"`
+    Roles          []AgentGroupRoleState `json:"roles"`
+    CreatedAt      time.Time             `json:"createdAt"`
 }
 ```
 
@@ -671,6 +691,9 @@ type MultiAgentRuntimeSpec struct {
 
     // SessionTimeout is the idle timeout applied to all sandboxes in the group.
     // Defaults to 15m.
+    // NOTE: Although this is a pointer type (*metav1.Duration), the API server applies the
+    // schema default generated from the kubebuilder marker, so the pointer is non-nil on
+    // objects read from the cluster. The nil check in createSandboxGroup() guards programmatic callers.
     // +kubebuilder:default="15m"
     SessionTimeout *metav1.Duration `json:"sessionTimeout,omitempty"`
 
@@ -734,6 +757,21 @@ type MultiAgentRuntimeStatus struct {
 
     // Ready is true when all required roles are running and healthy.
     Ready bool `json:"ready,omitempty"`
+
+    // RoleStatuses tracks per-role operational state. Updated by the reconciler
+    // during self-healing (Phase 4). Complements the store-side group manifest
+    // by providing Kubernetes-native observability via kubectl.
+    // +optional
+    RoleStatuses []RoleStatusEntry `json:"roleStatuses,omitempty"`
+}
+
+type RoleStatusEntry struct {
+    // Name is the role name matching RoleSpec.Name.
+    Name string `json:"name"`
+    // Status is the current operational state: "Ready", "Failed", "Replacing".
+    Status string `json:"status"`
+    // SessionID is the sandbox session ID for this role, if available.
+    SessionID string `json:"sessionId,omitempty"`
 }
 ```
 
@@ -851,7 +889,8 @@ group = client.create_group(
     namespace="default",
 )
 print(f"Group created: {group.group_session_id}")
-print(f"Coordinator endpoint: {group.roles[0].endpoint}")
+coordinator = next(r for r in group.roles if r.name == "planner")
+print(f"Coordinator endpoint: {coordinator.endpoint}")
 
 # Discover worker topology (coordinator calls this at startup)
 topology = client.get_topology(group.group_session_id)
@@ -891,6 +930,9 @@ This feature is fully backward compatible. No existing behavior changes unless t
 | `pkg/apis/runtime/v1alpha1/multiagentruntime_types.go` | CRD types with kubebuilder markers |
 | `pkg/workloadmanager/multiagent_controller.go` | `MultiAgentRuntimeReconciler` |
 | `pkg/workloadmanager/multiagent_controller_test.go` | Reconciler unit tests |
+| `pkg/workloadmanager/multiagent_webhook.go` | `ValidatingAdmissionWebhook` for `MultiAgentRuntime` configuration validation |
+| `pkg/workloadmanager/multiagent_webhook_test.go` | Webhook unit tests (coordinator count, naming collisions, missing deps, DNS label) |
+| `manifests/charts/base/templates/multiagentruntime-webhook.yaml` | Webhook `ValidatingWebhookConfiguration` manifest |
 | `sdk-python/agentcube/multi_agent.py` | `MultiAgentRuntimeClient` for the Python SDK |
 | `sdk-python/examples/multi_agent_usage.py` | End-to-end usage example |
 | `test/e2e/multi_agent_runtime.yaml` | E2E test fixtures |
@@ -937,7 +979,7 @@ Deliverables that satisfy the mentorship expected outcomes on their own.
   - No two roles may produce the same sanitized environment variable key (naming collision detection).
   - `dependencies[]` references must point to roles defined within the same spec.
   - Role names must be valid DNS label fragments (lowercase alphanumeric and hyphens, max 63 characters).
-- Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet).
+- Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet), including `topoSort()`, `injectDependencyEndpoints()`, and Headless Service creation per role.
 - Add `GroupSessionID` + `Role` to `SandboxInfo`; propagate through `buildSandboxPlaceHolder()` + `buildSandboxInfo()`.
 - Implement all 5 store methods in `store_redis.go` + `store_valkey.go` with full unit test coverage.
 - Add `MultiAgentRuntimeKind` to Router endpoint switch.
@@ -953,9 +995,8 @@ Deliverables that satisfy the mentorship expected outcomes on their own.
 - Group creation uses `SandboxClaim` for warm roles, cold `Sandbox` creation for others.
 - Add E2E test comparing cold-start vs warm-start group creation latency.
 
-### Phase 3 - DAG Startup and Topology (Weeks 7-8)
+### Phase 3 - Topology Endpoint and SDK (Weeks 7-8)
 
-- Implement `dependencies[]` field: `topoSort()` + `injectDependencyEndpoints()`.
 - Add `GET /v1/multi-agent-runtime/groups/:groupSessionId/topology` endpoint.
 - Add `get_topology()` to Python SDK `MultiAgentRuntimeClient`.
 - E2E test: verify dependency endpoint env vars are present in dependent pod environment.

From feaa2ad6d2ce2d2d4139509d7651be9eb92cf436 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 22:22:59 +0530
Subject: [PATCH 13/20] docs: clarify service pre-registration ordering and add
 missing manifest fields

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index 2842b8e5..163e65bb 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -370,6 +370,9 @@ func (s *Server) createSandboxGroup(
                 }
 
                 // Create a Headless Service for this role to provide a stable DNS endpoint.
+                // NOTE: The service is intentionally created before injectDependencyEndpoints()
+                // and registered in createdServices immediately, so that if the subsequent
+                // injection step fails, the deferred rollback will still clean up this service.
                 svcName, svcDNS, err := s.createHeadlessServiceForRole(
                     ctx, mar.Namespace, groupSessionID, role.Name, sandbox,
                 )
@@ -612,6 +615,8 @@ type CreateAgentGroupResponse struct {
 ```go
 type AgentGroupManifest struct {
     GroupSessionID string                `json:"groupSessionId"`
+    StartupPolicy  string                `json:"startupPolicy"`
+    Namespace      string                `json:"namespace"`
     Roles          []AgentGroupRoleState `json:"roles"`
     CreatedAt      time.Time             `json:"createdAt"`
 }

From 42f7cf031dd4148f64318ff5528c0b695c68c6ab Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 22:26:12 +0530
Subject: [PATCH 14/20] docs: correct sequence diagram order and sandbox type
 reference

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index 163e65bb..c1ade4ad 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -476,16 +476,16 @@ sequenceDiagram
     WM->>Store: UpdateSandbox(ready)
 
     Note over WM, K8s: Role 2: researcher (depends on planner)
-    WM->>WM: injectDependencyEndpoints(planner Service DNS)
     WM->>K8s: Create Headless Service [researcher]
+    WM->>WM: injectDependencyEndpoints(planner Service DNS)
     WM->>Store: StoreSandbox(placeholder)
     WM->>K8s: Create Sandbox [researcher]
     K8s-->>WM: Sandbox Ready
     WM->>Store: UpdateSandbox(ready)
 
     Note over WM, K8s: Role 3: coder (depends on planner, researcher)
-    WM->>WM: injectDependencyEndpoints(planner Service DNS, researcher Service DNS)
     WM->>K8s: Create Headless Service [coder]
+    WM->>WM: injectDependencyEndpoints(planner Service DNS, researcher Service DNS)
     WM->>Store: StoreSandbox(placeholder)
     WM->>K8s: Create Sandbox [coder]
     K8s-->>WM: Sandbox Ready
@@ -951,7 +951,7 @@ This feature is fully backward compatible. No existing behavior changes unless t
 | `pkg/apis/runtime/v1alpha1/register.go` | Add `MultiAgentRuntimeKind`, `MultiAgentRuntimeListKind`, `MultiAgentRuntimeGroupVersionKind` |
 | `pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go` | Regenerated by `make generate` |
 | `pkg/common/types/types.go` | Add `MultiAgentRuntimeKind` constant |
-| `pkg/common/types/sandbox.go` | Add `GroupSessionID`, `Role` to `SandboxInfo`; add `AgentGroupManifest`, `AgentGroupRole`, group request/response types |
+| `pkg/common/types/sandbox.go` | Add `GroupSessionID`, `Role` to `SandboxInfo`; add `AgentGroupManifest`, `AgentGroupRoleState`, group request/response types |
 | `pkg/api/errors.go` | Add `ErrMultiAgentRuntimeNotFound`; add `multiAgentRuntimeResource` in `workloadResource()` switch |
 | `pkg/workloadmanager/informers.go` | Add `MultiAgentRuntimeGVR`; add informer wiring and cache sync |
 | `pkg/workloadmanager/workload_builder.go` | Add `GroupSessionID`, `Role` fields to `sandboxEntry` struct |

From cf2dc75bf73a13463283797b3b7b779bc4bfd047 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 22:40:44 +0530
Subject: [PATCH 15/20] docs: track SandboxClaim in rollback and clarify
 missing ports error

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 30 +++++++++++----------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index c1ade4ad..3df12cb5 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -223,7 +223,7 @@ The environment variables injected into the dependent pod's containers point to
      * If `targetPort` is explicitly defined in the dependency's `RoleSpec` (either as a port name or number), that port is used as the default.
      * If `targetPort` is not specified, and the dependency's `AgentRuntime` CRD defines a single port, that port is used.
      * If `targetPort` is not specified, and it defines multiple ports, the system first looks for a port named `http` or `default`. If no such port is found, it falls back to the first port in the ports list.
-     * If `targetPort` is not specified and the dependency's `AgentRuntime` CRD defines **no ports**, endpoint injection is skipped for that dependency and `injectDependencyEndpoints()` returns an error. The `ValidatingAdmissionWebhook` should reject at admission time any role that lists a dependency whose referenced `AgentRuntime` exposes no ports, preventing this error from reaching the creation path.
+     * If `targetPort` is not specified and the dependency's `AgentRuntime` CRD defines **no ports**, endpoint injection fails and `injectDependencyEndpoints()` returns an error, which in turn causes the entire group creation to fail (triggering a full rollback under `Atomic` policy). The `ValidatingAdmissionWebhook` should reject at admission time any role that lists a dependency whose referenced `AgentRuntime` exposes no ports, preventing this error from reaching the creation path.
 
 2. **Named Port Endpoints (For multi-service agents exposing multiple ports):**
    If the dependency's `AgentRuntime` CRD defines named ports, an endpoint environment variable is additionally injected for each named port:
@@ -278,12 +278,13 @@ agentgroup:{grp-xxx} -> AgentGroupManifest{
 
 ```go
 type createdRole struct {
-    name       string
-    resp       *types.CreateSandboxResponse
-    sandbox    *sandboxv1alpha1.Sandbox
-    sessionID  string
-    serviceDNS string // stable DNS from the Headless Service (e.g., mar-abc-planner.ns.svc.cluster.local)
-    failed     bool
+    name         string
+    resp         *types.CreateSandboxResponse
+    sandbox      *sandboxv1alpha1.Sandbox
+    sandboxClaim *extensionsv1alpha1.SandboxClaim
+    sessionID    string
+    serviceDNS   string // stable DNS from the Headless Service (e.g., mar-abc-planner.ns.svc.cluster.local)
+    failed       bool
 }
 
 func (s *Server) createSandboxGroup(
@@ -308,7 +309,7 @@ func (s *Server) createSandboxGroup(
             if c.failed {
                 continue
             }
-            s.rollbackSandboxCreation(dynamicClient, c.sandbox, nil, c.sessionID)
+            s.rollbackSandboxCreation(dynamicClient, c.sandbox, c.sandboxClaim, c.sessionID)
         }
         // Clean up any Headless Services created during group setup.
         for _, svcName := range createdServices {
@@ -414,12 +415,13 @@ func (s *Server) createSandboxGroup(
 
                 createdMutex.Lock()
                 created = append(created, createdRole{
-                    name:       role.Name,
-                    resp:       resp,
-                    sandbox:    sandbox,
-                    sessionID:  sandboxEntry.SessionID,
-                    serviceDNS: svcDNS,
-                    failed:     false,
+                    name:         role.Name,
+                    resp:         resp,
+                    sandbox:      sandbox,
+                    sandboxClaim: sandboxClaim,
+                    sessionID:    sandboxEntry.SessionID,
+                    serviceDNS:   svcDNS,
+                    failed:       false,
                 })
                 createdMutex.Unlock()
                 return nil

From eb29b2604c43a53754d5f245198dd5b450c06876 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Wed, 20 May 2026 22:49:25 +0530
Subject: [PATCH 16/20] docs: use stable selector, lua script for gc, and
 snapshot created roles

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index 3df12cb5..0f777b66 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -210,7 +210,7 @@ In both policies, coordinator failure always causes full rollback and an error r
 Before a dependent role's sandbox is created, stable DNS-based endpoints of its dependencies are injected as environment variables into the pod template. To ensure stable communication under the `BestEffort` policy where failed worker pods are replaced and get new IP addresses, the Workload Manager automatically creates a Headless Kubernetes Service for each role in the group. 
 
 * **Service Name:** `mar-{shortHash}-{roleNameSanitized}` (where `{shortHash}` is the first 8 characters of the SHA-256 hash of the `groupSessionID`, and `{roleNameSanitized}` replaces non-DNS-compliant characters with hyphens, **truncated to 50 characters, then stripped of any trailing hyphens**). With a 13-character prefix (`mar-` + 8-char hash + `-`), this guarantees the full service name always fits within the Kubernetes 63-character DNS label limit and is a valid DNS label (no trailing hyphens).
-* **Service Selector:** Matches `SessionIdLabelKey` and `Role` metadata on the sandbox.
+* **Service Selector:** Matches `GroupSessionID` and `Role` metadata on the sandbox. Because both values are unchanged when a pod is replaced, the Service stays stable and keeps targeting replacement pods under the `BestEffort` policy.
 * **Service Lifecycle:** Created during `createSandboxGroup()` and cleaned up automatically via OwnerReferences when the MultiAgentRuntime is deleted.
 
 The environment variables injected into the dependent pod's containers point to the stable Service DNS name:
@@ -331,6 +331,13 @@ func (s *Server) createSandboxGroup(
     for _, wave := range waves {
         g, waveCtx := errgroup.WithContext(ctx)
 
+        // Take a snapshot of dependencies created in previous waves to prevent
+        // mutex contention when injecting endpoints concurrently across the wave.
+        createdMutex.Lock()
+        createdSnapshot := make([]createdRole, len(created))
+        copy(createdSnapshot, created)
+        createdMutex.Unlock()
+
         for _, r := range wave {
             role := r // capture loop variable
             g.Go(func() error {
@@ -385,9 +392,7 @@ func (s *Server) createSandboxGroup(
                 createdMutex.Unlock()
 
                 if len(role.Dependencies) > 0 {
-                    createdMutex.Lock()
-                    injectErr := injectDependencyEndpoints(&sandbox.Spec.PodTemplate, groupSessionID, role.Dependencies, created)
-                    createdMutex.Unlock()
+                    injectErr := injectDependencyEndpoints(&sandbox.Spec.PodTemplate, groupSessionID, role.Dependencies, createdSnapshot)
                     if injectErr != nil {
                         return fmt.Errorf("role %s: inject dependency endpoints: %w", role.Name, injectErr)
                     }
@@ -640,9 +645,10 @@ GetAgentGroup(ctx context.Context, groupSessionID string) (*types.AgentGroupMani
 // DeleteAgentGroup removes a group manifest by groupSessionID.
 DeleteAgentGroup(ctx context.Context, groupSessionID string) error
 
-// DeleteAgentGroupRole atomically removes a single role entry from the group manifest
-// using HDEL. Preferred over a read-modify-write cycle during GC to avoid race conditions.
-// If the role was the last entry, it also deletes the _metadata field and the key.
+// DeleteAgentGroupRole atomically removes a single role entry from the group manifest.
+// To prevent race conditions during concurrent GC, the check-then-delete sequence
+// (removing the role, and deleting the manifest if it was the last role) MUST be
+// implemented using a Redis Lua script or transaction.
 DeleteAgentGroupRole(ctx context.Context, groupSessionID, roleName string) error
 
 // UpdateAgentGroupRoleStatus atomically updates the status and endpoint of a specific role

From 6ab571e9687fca17d1a5c46e3ef6dadca3b3e741 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Thu, 28 May 2026 00:14:03 +0530
Subject: [PATCH 17/20] feat(runtime): Add MultiAgentRuntime API validation
 scaffolding

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 ...entcube.volcano.sh_multiagentruntimes.yaml | 222 ++++++++++++++++++
 pkg/apis/runtime/v1alpha1/multiagent_types.go | 142 +++++++++++
 .../runtime/v1alpha1/multiagent_validation.go |  88 +++++++
 .../v1alpha1/multiagent_validation_test.go    | 175 ++++++++++++++
 .../runtime/v1alpha1/zz_generated.deepcopy.go | 158 +++++++++++++
 5 files changed, 785 insertions(+)
 create mode 100644 manifests/charts/base/crds/runtime.agentcube.volcano.sh_multiagentruntimes.yaml
 create mode 100644 pkg/apis/runtime/v1alpha1/multiagent_types.go
 create mode 100644 pkg/apis/runtime/v1alpha1/multiagent_validation.go
 create mode 100644 pkg/apis/runtime/v1alpha1/multiagent_validation_test.go

diff --git a/manifests/charts/base/crds/runtime.agentcube.volcano.sh_multiagentruntimes.yaml b/manifests/charts/base/crds/runtime.agentcube.volcano.sh_multiagentruntimes.yaml
new file mode 100644
index 00000000..7417a828
--- /dev/null
+++ b/manifests/charts/base/crds/runtime.agentcube.volcano.sh_multiagentruntimes.yaml
@@ -0,0 +1,222 @@
+---
+apiVersion: apiextensions.k8s.io/v1
+kind: CustomResourceDefinition
+metadata:
+  annotations:
+    controller-gen.kubebuilder.io/version: v0.17.2
+  name: multiagentruntimes.runtime.agentcube.volcano.sh
+spec:
+  group: runtime.agentcube.volcano.sh
+  names:
+    kind: MultiAgentRuntime
+    listKind: MultiAgentRuntimeList
+    plural: multiagentruntimes
+    singular: multiagentruntime
+  scope: Namespaced
+  versions:
+  - additionalPrinterColumns:
+    - jsonPath: .status.ready
+      name: Ready
+      type: boolean
+    - jsonPath: .spec.startupPolicy
+      name: Policy
+      type: string
+    - jsonPath: .metadata.creationTimestamp
+      name: Age
+      type: date
+    name: v1alpha1
+    schema:
+      openAPIV3Schema:
+        description: |-
+          MultiAgentRuntime defines a group of collaborating AgentRuntime roles with
+          unified lifecycle management.
+        properties:
+          apiVersion:
+            description: |-
+              APIVersion defines the versioned schema of this representation of an object.
+              Servers should convert recognized schemas to the latest internal value, and
+              may reject unrecognized values.
+              More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
+            type: string
+          kind:
+            description: |-
+              Kind is a string value representing the REST resource this object represents.
+              Servers may infer this from the endpoint the client submits requests to.
+              Cannot be updated.
+              In CamelCase.
+              More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
+            type: string
+          metadata:
+            type: object
+          spec:
+            properties:
+              maxSessionDuration:
+                default: 8h
+                description: |-
+                  MaxSessionDuration is the absolute TTL for all sandboxes in the group.
+                  Defaults to 8h.
+                type: string
+              roles:
+                description: |-
+                  Roles defines the set of agent roles in this group.
+                  At least one role must be present, and exactly one must have IsCoordinator=true.
+                items:
+                  properties:
+                    dependencies:
+                      description: |-
+                        Dependencies lists the names of roles that must be ready before this role is created.
+                        Circular dependencies are rejected at request time.
+                      items:
+                        type: string
+                      type: array
+                    isCoordinator:
+                      description: |-
+                        IsCoordinator marks this role as the external entrypoint for the group.
+                        Exactly one role must be marked as coordinator.
+                      type: boolean
+                    kind:
+                      default: AgentRuntime
+                      description: |-
+                        Kind specifies the type of the referenced runtime.
+                        Defaults to "AgentRuntime". Set to "CodeInterpreter" to reference a CodeInterpreter CRD.
+                      enum:
+                      - AgentRuntime
+                      - CodeInterpreter
+                      type: string
+                    name:
+                      description: Name is the unique identifier for this role within
+                        the group.
+                      minLength: 1
+                      type: string
+                    runtimeRef:
+                      description: RuntimeRef is the name of an existing AgentRuntime
+                        or CodeInterpreter CRD in the same namespace.
+                      minLength: 1
+                      type: string
+                    targetPort:
+                      description: |-
+                        TargetPort specifies the name or number of the port in the referenced AgentRuntime
+                        or CodeInterpreter to be used by dependent roles. If empty, the default Port Resolution Rule applies.
+                      type: string
+                    warmPoolSize:
+                      description: WarmPoolSize specifies the number of pre-warmed
+                        sandboxes for this role.
+                      format: int32
+                      minimum: 0
+                      type: integer
+                  required:
+                  - name
+                  - runtimeRef
+                  type: object
+                minItems: 1
+                type: array
+              sessionTimeout:
+                default: 15m
+                description: |-
+                  SessionTimeout is the idle timeout applied to all sandboxes in the group.
+                  Defaults to 15m.
+                type: string
+              startupPolicy:
+                default: Atomic
+                description: StartupPolicy controls failure behavior during group
+                  creation.
+                enum:
+                - Atomic
+                - BestEffort
+                type: string
+            required:
+            - roles
+            type: object
+          status:
+            properties:
+              conditions:
+                description: |-
+                  Conditions reflect the current state of the MultiAgentRuntime.
+                  Standard conditions: Ready, Degraded, Failed.
+                items:
+                  description: Condition contains details for one aspect of the current
+                    state of this API Resource.
+                  properties:
+                    lastTransitionTime:
+                      description: |-
+                        lastTransitionTime is the last time the condition transitioned from one status to another.
+                        This should be when the underlying condition changed.  If that is not known, then using the time when the API field changed is acceptable.
+                      format: date-time
+                      type: string
+                    message:
+                      description: |-
+                        message is a human readable message indicating details about the transition.
+                        This may be an empty string.
+                      maxLength: 32768
+                      type: string
+                    observedGeneration:
+                      description: |-
+                        observedGeneration represents the .metadata.generation that the condition was set based upon.
+                        For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
+                        with respect to the current state of the instance.
+                      format: int64
+                      minimum: 0
+                      type: integer
+                    reason:
+                      description: |-
+                        reason contains a programmatic identifier indicating the reason for the condition's last transition.
+                        Producers of specific condition types may define expected values and meanings for this field,
+                        and whether the values are considered a guaranteed API.
+                        The value should be a CamelCase string.
+                        This field may not be empty.
+                      maxLength: 1024
+                      minLength: 1
+                      pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
+                      type: string
+                    status:
+                      description: status of the condition, one of True, False, Unknown.
+                      enum:
+                      - "True"
+                      - "False"
+                      - Unknown
+                      type: string
+                    type:
+                      description: type of condition in CamelCase or in foo.example.com/CamelCase.
+                      maxLength: 316
+                      pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
+                      type: string
+                  required:
+                  - lastTransitionTime
+                  - message
+                  - reason
+                  - status
+                  - type
+                  type: object
+                type: array
+              ready:
+                description: Ready is true when all required roles are running and
+                  healthy.
+                type: boolean
+              roleStatuses:
+                description: RoleStatuses tracks per-role operational state.
+                items:
+                  properties:
+                    name:
+                      description: Name is the role name matching RoleSpec.Name.
+                      type: string
+                    sessionId:
+                      description: SessionID is the sandbox session ID for this role,
+                        if available.
+                      type: string
+                    status:
+                      description: 'Status is the current operational state: "Ready",
+                        "Failed", "Replacing".'
+                      type: string
+                  required:
+                  - name
+                  - status
+                  type: object
+                type: array
+            type: object
+        required:
+        - spec
+        type: object
+    served: true
+    storage: true
+    subresources:
+      status: {}
diff --git a/pkg/apis/runtime/v1alpha1/multiagent_types.go b/pkg/apis/runtime/v1alpha1/multiagent_types.go
new file mode 100644
index 00000000..ca17312e
--- /dev/null
+++ b/pkg/apis/runtime/v1alpha1/multiagent_types.go
@@ -0,0 +1,142 @@
+/*
+Copyright The Volcano Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package v1alpha1
+
+import (
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+)
+
+// MultiAgentRuntime defines a group of collaborating AgentRuntime roles with
+// unified lifecycle management.
+//
+// +genclient
+// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
+// +kubebuilder:object:root=true
+// +kubebuilder:subresource:status
+// +kubebuilder:resource:scope=Namespaced
+// +kubebuilder:printcolumn:name="Ready",type="boolean",JSONPath=".status.ready"
+// +kubebuilder:printcolumn:name="Policy",type="string",JSONPath=".spec.startupPolicy"
+// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
+type MultiAgentRuntime struct {
+	metav1.TypeMeta   `json:",inline"`
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+	Spec              MultiAgentRuntimeSpec   `json:"spec"`
+	Status            MultiAgentRuntimeStatus `json:"status,omitempty"`
+}
+
+type MultiAgentRuntimeSpec struct {
+	// StartupPolicy controls failure behavior during group creation.
+	// +kubebuilder:default="Atomic"
+	// +kubebuilder:validation:Enum=Atomic;BestEffort
+	StartupPolicy StartupPolicyType `json:"startupPolicy,omitempty"`
+
+	// Roles defines the set of agent roles in this group.
+	// At least one role must be present, and exactly one must have IsCoordinator=true.
+	// +kubebuilder:validation:MinItems=1
+	Roles []RoleSpec `json:"roles"`
+
+	// SessionTimeout is the idle timeout applied to all sandboxes in the group.
+	// Defaults to 15m.
+	// +kubebuilder:default="15m"
+	SessionTimeout *metav1.Duration `json:"sessionTimeout,omitempty"`
+
+	// MaxSessionDuration is the absolute TTL for all sandboxes in the group.
+	// Defaults to 8h.
+	// +kubebuilder:default="8h"
+	MaxSessionDuration *metav1.Duration `json:"maxSessionDuration,omitempty"`
+}
+
+type RoleSpec struct {
+	// Name is the unique identifier for this role within the group.
+	// +kubebuilder:validation:MinLength=1
+	Name string `json:"name"`
+
+	// Kind specifies the type of the referenced runtime.
+	// Defaults to "AgentRuntime". Set to "CodeInterpreter" to reference a CodeInterpreter CRD.
+	// +optional
+	// +kubebuilder:default="AgentRuntime"
+	// +kubebuilder:validation:Enum=AgentRuntime;CodeInterpreter
+	Kind string `json:"kind,omitempty"`
+
+	// RuntimeRef is the name of an existing AgentRuntime or CodeInterpreter CRD in the same namespace.
+	// +kubebuilder:validation:MinLength=1
+	RuntimeRef string `json:"runtimeRef"`
+
+	// IsCoordinator marks this role as the external entrypoint for the group.
+	// Exactly one role must be marked as coordinator.
+	// +optional
+	IsCoordinator bool `json:"isCoordinator,omitempty"`
+
+	// WarmPoolSize specifies the number of pre-warmed sandboxes for this role.
+	// +optional
+	// +kubebuilder:validation:Minimum=0
+	WarmPoolSize *int32 `json:"warmPoolSize,omitempty"`
+
+	// Dependencies lists the names of roles that must be ready before this role is created.
+	// Circular dependencies are rejected at request time.
+	// +optional
+	Dependencies []string `json:"dependencies,omitempty"`
+
+	// TargetPort specifies the name or number of the port in the referenced AgentRuntime
+	// or CodeInterpreter to be used by dependent roles. If empty, the default Port Resolution Rule applies.
+	// +optional
+	TargetPort string `json:"targetPort,omitempty"`
+}
+
+type StartupPolicyType string
+
+const (
+	// StartupPolicyAtomic rolls back all created sandboxes if any role fails.
+	StartupPolicyAtomic StartupPolicyType = "Atomic"
+	// StartupPolicyBestEffort allows worker failures; coordinator failure still rolls back everything.
+	StartupPolicyBestEffort StartupPolicyType = "BestEffort"
+)
+
+type MultiAgentRuntimeStatus struct {
+	// Conditions reflect the current state of the MultiAgentRuntime.
+	// Standard conditions: Ready, Degraded, Failed.
+	Conditions []metav1.Condition `json:"conditions,omitempty"`
+
+	// Ready is true when all required roles are running and healthy.
+	Ready bool `json:"ready,omitempty"`
+
+	// RoleStatuses tracks per-role operational state.
+	// +optional
+	RoleStatuses []RoleStatusEntry `json:"roleStatuses,omitempty"`
+}
+
+type RoleStatusEntry struct {
+	// Name is the role name matching RoleSpec.Name.
+	Name string `json:"name"`
+	// Status is the current operational state: "Ready", "Failed", "Replacing".
+	Status string `json:"status"`
+	// SessionID is the sandbox session ID for this role, if available.
+	SessionID string `json:"sessionId,omitempty"`
+}
+
+// MultiAgentRuntimeList contains a list of MultiAgentRuntime resources.
+// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
+// +kubebuilder:object:root=true
+type MultiAgentRuntimeList struct {
+	metav1.TypeMeta `json:",inline"`
+	metav1.ListMeta `json:"metadata,omitempty"`
+	Items           []MultiAgentRuntime `json:"items"`
+}
+
+func init() {
+	SchemeBuilder.Register(&MultiAgentRuntime{}, &MultiAgentRuntimeList{})
+}
diff --git a/pkg/apis/runtime/v1alpha1/multiagent_validation.go b/pkg/apis/runtime/v1alpha1/multiagent_validation.go
new file mode 100644
index 00000000..1f50f019
--- /dev/null
+++ b/pkg/apis/runtime/v1alpha1/multiagent_validation.go
@@ -0,0 +1,88 @@
+/*
+Copyright The Volcano Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package v1alpha1
+
+import (
+	"fmt"
+
+	"k8s.io/apimachinery/pkg/util/validation"
+	"k8s.io/apimachinery/pkg/util/validation/field"
+)
+
+// ValidateMultiAgentRuntime validates a MultiAgentRuntime and returns all field errors found.
+func ValidateMultiAgentRuntime(mar *MultiAgentRuntime) field.ErrorList {
+	allErrs := field.ErrorList{}
+	allErrs = append(allErrs, ValidateMultiAgentRuntimeSpec(&mar.Spec, field.NewPath("spec"))...)
+	return allErrs
+}
+
+// ValidateMultiAgentRuntimeSpec checks role names, runtime refs, the coordinator count, and that dependencies name existing roles.
+func ValidateMultiAgentRuntimeSpec(spec *MultiAgentRuntimeSpec, fldPath *field.Path) field.ErrorList {
+	allErrs := field.ErrorList{}
+
+	if len(spec.Roles) == 0 {
+		allErrs = append(allErrs, field.Required(fldPath.Child("roles"), "must provide at least one role"))
+	}
+
+	roleMap := make(map[string]bool)
+	coordinatorCount := 0
+
+	rolesPath := fldPath.Child("roles")
+	for i, role := range spec.Roles {
+		idxPath := rolesPath.Index(i)
+
+		// Validate role name
+		if len(role.Name) == 0 {
+			allErrs = append(allErrs, field.Required(idxPath.Child("name"), "role name is required"))
+		} else {
+			for _, msg := range validation.IsDNS1123Label(role.Name) {
+				allErrs = append(allErrs, field.Invalid(idxPath.Child("name"), role.Name, msg))
+			}
+		}
+
+		// Check for duplicate role names (an empty name is already reported as Required above)
+		if role.Name != "" && roleMap[role.Name] {
+			allErrs = append(allErrs, field.Duplicate(idxPath.Child("name"), role.Name))
+		}
+		roleMap[role.Name] = true
+
+		if role.IsCoordinator {
+			coordinatorCount++
+		}
+
+		// Validate runtime ref
+		if len(role.RuntimeRef) == 0 {
+			allErrs = append(allErrs, field.Required(idxPath.Child("runtimeRef"), "runtime reference is required"))
+		}
+	}
+
+	if coordinatorCount != 1 {
+		allErrs = append(allErrs, field.Invalid(rolesPath, spec.Roles, fmt.Sprintf("exactly one coordinator must be set, but found %d", coordinatorCount)))
+	}
+
+	// Validate dependencies exist
+	for i, role := range spec.Roles {
+		idxPath := rolesPath.Index(i)
+		for j, dep := range role.Dependencies {
+			if !roleMap[dep] {
+				allErrs = append(allErrs, field.Invalid(idxPath.Child("dependencies").Index(j), dep, "dependency refers to a non-existent role"))
+			}
+		}
+	}
+
+	return allErrs
+}
diff --git a/pkg/apis/runtime/v1alpha1/multiagent_validation_test.go b/pkg/apis/runtime/v1alpha1/multiagent_validation_test.go
new file mode 100644
index 00000000..4c91fb06
--- /dev/null
+++ b/pkg/apis/runtime/v1alpha1/multiagent_validation_test.go
@@ -0,0 +1,175 @@
+/*
+Copyright The Volcano Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package v1alpha1
+
+import (
+	"strings"
+	"testing"
+)
+
+func TestValidateMultiAgentRuntimeSpec(t *testing.T) {
+	testCases := []struct {
+		name          string
+		spec          MultiAgentRuntimeSpec
+		expectedError string
+	}{
+		{
+			name: "valid configuration",
+			spec: MultiAgentRuntimeSpec{
+				Roles: []RoleSpec{
+					{
+						Name:          "planner",
+						RuntimeRef:    "planner-runtime",
+						IsCoordinator: true,
+					},
+					{
+						Name:         "worker",
+						RuntimeRef:   "worker-runtime",
+						Dependencies: []string{"planner"},
+					},
+				},
+			},
+			expectedError: "",
+		},
+		{
+			name: "missing roles",
+			spec: MultiAgentRuntimeSpec{
+				Roles: []RoleSpec{},
+			},
+			expectedError: "must provide at least one role",
+		},
+		{
+			name: "missing coordinator",
+			spec: MultiAgentRuntimeSpec{
+				Roles: []RoleSpec{
+					{
+						Name:       "worker1",
+						RuntimeRef: "runtime1",
+					},
+				},
+			},
+			expectedError: "exactly one coordinator must be set, but found 0",
+		},
+		{
+			name: "multiple coordinators",
+			spec: MultiAgentRuntimeSpec{
+				Roles: []RoleSpec{
+					{
+						Name:          "planner1",
+						RuntimeRef:    "runtime1",
+						IsCoordinator: true,
+					},
+					{
+						Name:          "planner2",
+						RuntimeRef:    "runtime2",
+						IsCoordinator: true,
+					},
+				},
+			},
+			expectedError: "exactly one coordinator must be set, but found 2",
+		},
+		{
+			name: "duplicate role names",
+			spec: MultiAgentRuntimeSpec{
+				Roles: []RoleSpec{
+					{
+						Name:          "planner",
+						RuntimeRef:    "runtime1",
+						IsCoordinator: true,
+					},
+					{
+						Name:       "planner",
+						RuntimeRef: "runtime2",
+					},
+				},
+			},
+			expectedError: "Duplicate value",
+		},
+		{
+			name: "invalid role name",
+			spec: MultiAgentRuntimeSpec{
+				Roles: []RoleSpec{
+					{
+						Name:          "invalid_name!",
+						RuntimeRef:    "runtime1",
+						IsCoordinator: true,
+					},
+				},
+			},
+			expectedError: "a lowercase RFC 1123 label must consist of lower case alphanumeric characters",
+		},
+		{
+			name: "non-existent dependency",
+			spec: MultiAgentRuntimeSpec{
+				Roles: []RoleSpec{
+					{
+						Name:          "planner",
+						RuntimeRef:    "runtime1",
+						IsCoordinator: true,
+					},
+					{
+						Name:         "worker",
+						RuntimeRef:   "runtime2",
+						Dependencies: []string{"missing-role"},
+					},
+				},
+			},
+			expectedError: "dependency refers to a non-existent role",
+		},
+		{
+			name: "missing runtime ref",
+			spec: MultiAgentRuntimeSpec{
+				Roles: []RoleSpec{
+					{
+						Name:          "planner",
+						IsCoordinator: true,
+					},
+				},
+			},
+			expectedError: "runtime reference is required",
+		},
+	}
+
+	for _, tc := range testCases {
+		t.Run(tc.name, func(t *testing.T) {
+			errs := ValidateMultiAgentRuntime(&MultiAgentRuntime{Spec: tc.spec})
+			if tc.expectedError == "" {
+				if len(errs) != 0 {
+					t.Errorf("Expected valid spec, got errors: %v", errs)
+				}
+			} else {
+				if len(errs) == 0 {
+					t.Errorf("Expected error containing %q, but got valid", tc.expectedError)
+					return
+				}
+
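+				found := false
+				for _, err := range errs {
+					// The error string includes the field path, the value, and the detail,
+					// so a substring match on the expected detail is sufficient.
+					if strings.Contains(err.Error(), tc.expectedError) {
+						found = true
+						break
+					}
+				}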
+				if !found {
+					t.Errorf("Expected error containing %q, got: %v", tc.expectedError, errs)
+				}
+			}
+		})
+	}
+}
diff --git a/pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go b/pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go
index a7dbb4d4..f6a7eab3 100644
--- a/pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go
+++ b/pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go
@@ -320,6 +320,164 @@ func (in *CodeInterpreterStatus) DeepCopy() *CodeInterpreterStatus {
 	return out
 }
 
+// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
+func (in *MultiAgentRuntime) DeepCopyInto(out *MultiAgentRuntime) {
+	*out = *in
+	out.TypeMeta = in.TypeMeta
+	in.ObjectMeta.DeepCopyInto(&out.ObjectMeta)
+	in.Spec.DeepCopyInto(&out.Spec)
+	in.Status.DeepCopyInto(&out.Status)
+}
+
+// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new MultiAgentRuntime.
+func (in *MultiAgentRuntime) DeepCopy() *MultiAgentRuntime {
+	if in == nil {
+		return nil
+	}
+	out := new(MultiAgentRuntime)
+	in.DeepCopyInto(out)
+	return out
+}
+
+// DeepCopyObject is an autogenerated deepcopy function, copying the receiver, creating a new runtime.Object.
+func (in *MultiAgentRuntime) DeepCopyObject() runtime.Object {
+	if c := in.DeepCopy(); c != nil {
+		return c
+	}
+	return nil
+}
+
+// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
+func (in *MultiAgentRuntimeList) DeepCopyInto(out *MultiAgentRuntimeList) {
+	*out = *in
+	out.TypeMeta = in.TypeMeta
+	in.ListMeta.DeepCopyInto(&out.ListMeta)
+	if in.Items != nil {
+		in, out := &in.Items, &out.Items
+		*out = make([]MultiAgentRuntime, len(*in))
+		for i := range *in {
+			(*in)[i].DeepCopyInto(&(*out)[i])
+		}
+	}
+}
+
+// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new MultiAgentRuntimeList.
+func (in *MultiAgentRuntimeList) DeepCopy() *MultiAgentRuntimeList {
+	if in == nil {
+		return nil
+	}
+	out := new(MultiAgentRuntimeList)
+	in.DeepCopyInto(out)
+	return out
+}
+
+// DeepCopyObject is an autogenerated deepcopy function, copying the receiver, creating a new runtime.Object.
+func (in *MultiAgentRuntimeList) DeepCopyObject() runtime.Object {
+	if c := in.DeepCopy(); c != nil {
+		return c
+	}
+	return nil
+}
+
+// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
+func (in *MultiAgentRuntimeSpec) DeepCopyInto(out *MultiAgentRuntimeSpec) {
+	*out = *in
+	if in.Roles != nil {
+		in, out := &in.Roles, &out.Roles
+		*out = make([]RoleSpec, len(*in))
+		for i := range *in {
+			(*in)[i].DeepCopyInto(&(*out)[i])
+		}
+	}
+	if in.SessionTimeout != nil {
+		in, out := &in.SessionTimeout, &out.SessionTimeout
+		*out = new(v1.Duration)
+		**out = **in
+	}
+	if in.MaxSessionDuration != nil {
+		in, out := &in.MaxSessionDuration, &out.MaxSessionDuration
+		*out = new(v1.Duration)
+		**out = **in
+	}
+}
+
+// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new MultiAgentRuntimeSpec.
+func (in *MultiAgentRuntimeSpec) DeepCopy() *MultiAgentRuntimeSpec {
+	if in == nil {
+		return nil
+	}
+	out := new(MultiAgentRuntimeSpec)
+	in.DeepCopyInto(out)
+	return out
+}
+
+// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
+func (in *MultiAgentRuntimeStatus) DeepCopyInto(out *MultiAgentRuntimeStatus) {
+	*out = *in
+	if in.Conditions != nil {
+		in, out := &in.Conditions, &out.Conditions
+		*out = make([]v1.Condition, len(*in))
+		for i := range *in {
+			(*in)[i].DeepCopyInto(&(*out)[i])
+		}
+	}
+	if in.RoleStatuses != nil {
+		in, out := &in.RoleStatuses, &out.RoleStatuses
+		*out = make([]RoleStatusEntry, len(*in))
+		copy(*out, *in)
+	}
+}
+
+// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new MultiAgentRuntimeStatus.
+func (in *MultiAgentRuntimeStatus) DeepCopy() *MultiAgentRuntimeStatus {
+	if in == nil {
+		return nil
+	}
+	out := new(MultiAgentRuntimeStatus)
+	in.DeepCopyInto(out)
+	return out
+}
+
+// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
+func (in *RoleSpec) DeepCopyInto(out *RoleSpec) {
+	*out = *in
+	if in.WarmPoolSize != nil {
+		in, out := &in.WarmPoolSize, &out.WarmPoolSize
+		*out = new(int32)
+		**out = **in
+	}
+	if in.Dependencies != nil {
+		in, out := &in.Dependencies, &out.Dependencies
+		*out = make([]string, len(*in))
+		copy(*out, *in)
+	}
+}
+
+// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new RoleSpec.
+func (in *RoleSpec) DeepCopy() *RoleSpec {
+	if in == nil {
+		return nil
+	}
+	out := new(RoleSpec)
+	in.DeepCopyInto(out)
+	return out
+}
+
+// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
+func (in *RoleStatusEntry) DeepCopyInto(out *RoleStatusEntry) {
+	*out = *in
+}
+
+// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new RoleStatusEntry.
+func (in *RoleStatusEntry) DeepCopy() *RoleStatusEntry {
+	if in == nil {
+		return nil
+	}
+	out := new(RoleStatusEntry)
+	in.DeepCopyInto(out)
+	return out
+}
+
 // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
 func (in *SandboxTemplate) DeepCopyInto(out *SandboxTemplate) {
 	*out = *in

From 77bdeae7c0f80e791d30bff012c67867aba32ccf Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Thu, 28 May 2026 00:21:13 +0530
Subject: [PATCH 18/20] chore(codegen): Fix lister-gen and update generated
 client code for MultiAgentRuntime

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 .../v1alpha1/fake/fake_multiagentruntime.go   |  52 +++++++++
 .../v1alpha1/fake/fake_runtime_client.go      |   4 +
 .../runtime/v1alpha1/generated_expansion.go   |   2 +
 .../runtime/v1alpha1/multiagentruntime.go     |  70 ++++++++++++
 .../typed/runtime/v1alpha1/runtime_client.go  |   5 +
 .../informers/externalversions/generic.go     |   2 +
 .../runtime/v1alpha1/interface.go             |   7 ++
 .../runtime/v1alpha1/multiagentruntime.go     | 102 ++++++++++++++++++
 .../runtime/v1alpha1/expansion_generated.go   |   8 ++
 .../runtime/v1alpha1/multiagentruntime.go     |  70 ++++++++++++
 hack/update-codegen.sh                        |   2 +
 11 files changed, 324 insertions(+)
 create mode 100644 client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_multiagentruntime.go
 create mode 100644 client-go/clientset/versioned/typed/runtime/v1alpha1/multiagentruntime.go
 create mode 100644 client-go/informers/externalversions/runtime/v1alpha1/multiagentruntime.go
 create mode 100644 client-go/listers/runtime/v1alpha1/multiagentruntime.go

diff --git a/client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_multiagentruntime.go b/client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_multiagentruntime.go
new file mode 100644
index 00000000..de1ec9c1
--- /dev/null
+++ b/client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_multiagentruntime.go
@@ -0,0 +1,52 @@
+/*
+Copyright The Volcano Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+// Code generated by client-gen. DO NOT EDIT.
+
+package fake
+
+import (
+	runtimev1alpha1 "github.com/volcano-sh/agentcube/client-go/clientset/versioned/typed/runtime/v1alpha1"
+	v1alpha1 "github.com/volcano-sh/agentcube/pkg/apis/runtime/v1alpha1"
+	gentype "k8s.io/client-go/gentype"
+)
+
+// fakeMultiAgentRuntimes implements MultiAgentRuntimeInterface
+type fakeMultiAgentRuntimes struct {
+	*gentype.FakeClientWithList[*v1alpha1.MultiAgentRuntime, *v1alpha1.MultiAgentRuntimeList]
+	Fake *FakeRuntimeV1alpha1
+}
+
+func newFakeMultiAgentRuntimes(fake *FakeRuntimeV1alpha1, namespace string) runtimev1alpha1.MultiAgentRuntimeInterface {
+	return &fakeMultiAgentRuntimes{
+		gentype.NewFakeClientWithList[*v1alpha1.MultiAgentRuntime, *v1alpha1.MultiAgentRuntimeList](
+			fake.Fake,
+			namespace,
+			v1alpha1.SchemeGroupVersion.WithResource("multiagentruntimes"),
+			v1alpha1.SchemeGroupVersion.WithKind("MultiAgentRuntime"),
+			func() *v1alpha1.MultiAgentRuntime { return &v1alpha1.MultiAgentRuntime{} },
+			func() *v1alpha1.MultiAgentRuntimeList { return &v1alpha1.MultiAgentRuntimeList{} },
+			func(dst, src *v1alpha1.MultiAgentRuntimeList) { dst.ListMeta = src.ListMeta },
+			func(list *v1alpha1.MultiAgentRuntimeList) []*v1alpha1.MultiAgentRuntime {
+				return gentype.ToPointerSlice(list.Items)
+			},
+			func(list *v1alpha1.MultiAgentRuntimeList, items []*v1alpha1.MultiAgentRuntime) {
+				list.Items = gentype.FromPointerSlice(items)
+			},
+		),
+		fake,
+	}
+}
diff --git a/client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go b/client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go
index 91c3617f..e8da1f38 100644
--- a/client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go
+++ b/client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go
@@ -36,6 +36,10 @@ func (c *FakeRuntimeV1alpha1) CodeInterpreters(namespace string) v1alpha1.CodeIn
 	return newFakeCodeInterpreters(c, namespace)
 }
 
+func (c *FakeRuntimeV1alpha1) MultiAgentRuntimes(namespace string) v1alpha1.MultiAgentRuntimeInterface {
+	return newFakeMultiAgentRuntimes(c, namespace)
+}
+
 // RESTClient returns a RESTClient that is used to communicate
 // with API server by this client implementation.
 func (c *FakeRuntimeV1alpha1) RESTClient() rest.Interface {
diff --git a/client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go b/client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go
index 1ac68773..a6fcc213 100644
--- a/client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go
+++ b/client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go
@@ -21,3 +21,5 @@ package v1alpha1
 type AgentRuntimeExpansion interface{}
 
 type CodeInterpreterExpansion interface{}
+
+type MultiAgentRuntimeExpansion interface{}
diff --git a/client-go/clientset/versioned/typed/runtime/v1alpha1/multiagentruntime.go b/client-go/clientset/versioned/typed/runtime/v1alpha1/multiagentruntime.go
new file mode 100644
index 00000000..eae1117a
--- /dev/null
+++ b/client-go/clientset/versioned/typed/runtime/v1alpha1/multiagentruntime.go
@@ -0,0 +1,70 @@
+/*
+Copyright The Volcano Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+// Code generated by client-gen. DO NOT EDIT.
+
+package v1alpha1
+
+import (
+	context "context"
+
+	scheme "github.com/volcano-sh/agentcube/client-go/clientset/versioned/scheme"
+	runtimev1alpha1 "github.com/volcano-sh/agentcube/pkg/apis/runtime/v1alpha1"
+	v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	types "k8s.io/apimachinery/pkg/types"
+	watch "k8s.io/apimachinery/pkg/watch"
+	gentype "k8s.io/client-go/gentype"
+)
+
+// MultiAgentRuntimesGetter has a method to return a MultiAgentRuntimeInterface.
+// A group's client should implement this interface.
+type MultiAgentRuntimesGetter interface {
+	MultiAgentRuntimes(namespace string) MultiAgentRuntimeInterface
+}
+
+// MultiAgentRuntimeInterface has methods to work with MultiAgentRuntime resources.
+type MultiAgentRuntimeInterface interface {
+	Create(ctx context.Context, multiAgentRuntime *runtimev1alpha1.MultiAgentRuntime, opts v1.CreateOptions) (*runtimev1alpha1.MultiAgentRuntime, error)
+	Update(ctx context.Context, multiAgentRuntime *runtimev1alpha1.MultiAgentRuntime, opts v1.UpdateOptions) (*runtimev1alpha1.MultiAgentRuntime, error)
+	// Add a +genclient:noStatus comment above the type to avoid generating UpdateStatus().
+	UpdateStatus(ctx context.Context, multiAgentRuntime *runtimev1alpha1.MultiAgentRuntime, opts v1.UpdateOptions) (*runtimev1alpha1.MultiAgentRuntime, error)
+	Delete(ctx context.Context, name string, opts v1.DeleteOptions) error
+	DeleteCollection(ctx context.Context, opts v1.DeleteOptions, listOpts v1.ListOptions) error
+	Get(ctx context.Context, name string, opts v1.GetOptions) (*runtimev1alpha1.MultiAgentRuntime, error)
+	List(ctx context.Context, opts v1.ListOptions) (*runtimev1alpha1.MultiAgentRuntimeList, error)
+	Watch(ctx context.Context, opts v1.ListOptions) (watch.Interface, error)
+	Patch(ctx context.Context, name string, pt types.PatchType, data []byte, opts v1.PatchOptions, subresources ...string) (result *runtimev1alpha1.MultiAgentRuntime, err error)
+	MultiAgentRuntimeExpansion
+}
+
+// multiAgentRuntimes implements MultiAgentRuntimeInterface
+type multiAgentRuntimes struct {
+	*gentype.ClientWithList[*runtimev1alpha1.MultiAgentRuntime, *runtimev1alpha1.MultiAgentRuntimeList]
+}
+
+// newMultiAgentRuntimes returns a MultiAgentRuntimes
+func newMultiAgentRuntimes(c *RuntimeV1alpha1Client, namespace string) *multiAgentRuntimes {
+	return &multiAgentRuntimes{
+		gentype.NewClientWithList[*runtimev1alpha1.MultiAgentRuntime, *runtimev1alpha1.MultiAgentRuntimeList](
+			"multiagentruntimes",
+			c.RESTClient(),
+			scheme.ParameterCodec,
+			namespace,
+			func() *runtimev1alpha1.MultiAgentRuntime { return &runtimev1alpha1.MultiAgentRuntime{} },
+			func() *runtimev1alpha1.MultiAgentRuntimeList { return &runtimev1alpha1.MultiAgentRuntimeList{} },
+		),
+	}
+}
diff --git a/client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go b/client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go
index 0da62b40..96c45ba0 100644
--- a/client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go
+++ b/client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go
@@ -30,6 +30,7 @@ type RuntimeV1alpha1Interface interface {
 	RESTClient() rest.Interface
 	AgentRuntimesGetter
 	CodeInterpretersGetter
+	MultiAgentRuntimesGetter
 }
 
 // RuntimeV1alpha1Client is used to interact with features provided by the runtime group.
@@ -45,6 +46,10 @@ func (c *RuntimeV1alpha1Client) CodeInterpreters(namespace string) CodeInterpret
 	return newCodeInterpreters(c, namespace)
 }
 
+func (c *RuntimeV1alpha1Client) MultiAgentRuntimes(namespace string) MultiAgentRuntimeInterface {
+	return newMultiAgentRuntimes(c, namespace)
+}
+
 // NewForConfig creates a new RuntimeV1alpha1Client for the given config.
 // NewForConfig is equivalent to NewForConfigAndClient(c, httpClient),
 // where httpClient was generated with rest.HTTPClientFor(c).
diff --git a/client-go/informers/externalversions/generic.go b/client-go/informers/externalversions/generic.go
index f304a1b7..f557f68d 100644
--- a/client-go/informers/externalversions/generic.go
+++ b/client-go/informers/externalversions/generic.go
@@ -57,6 +57,8 @@ func (f *sharedInformerFactory) ForResource(resource schema.GroupVersionResource
 		return &genericInformer{resource: resource.GroupResource(), informer: f.Runtime().V1alpha1().AgentRuntimes().Informer()}, nil
 	case v1alpha1.SchemeGroupVersion.WithResource("codeinterpreters"):
 		return &genericInformer{resource: resource.GroupResource(), informer: f.Runtime().V1alpha1().CodeInterpreters().Informer()}, nil
+	case v1alpha1.SchemeGroupVersion.WithResource("multiagentruntimes"):
+		return &genericInformer{resource: resource.GroupResource(), informer: f.Runtime().V1alpha1().MultiAgentRuntimes().Informer()}, nil
 
 	}
 
diff --git a/client-go/informers/externalversions/runtime/v1alpha1/interface.go b/client-go/informers/externalversions/runtime/v1alpha1/interface.go
index 959f4abe..89dccb0f 100644
--- a/client-go/informers/externalversions/runtime/v1alpha1/interface.go
+++ b/client-go/informers/externalversions/runtime/v1alpha1/interface.go
@@ -28,6 +28,8 @@ type Interface interface {
 	AgentRuntimes() AgentRuntimeInformer
 	// CodeInterpreters returns a CodeInterpreterInformer.
 	CodeInterpreters() CodeInterpreterInformer
+	// MultiAgentRuntimes returns a MultiAgentRuntimeInformer.
+	MultiAgentRuntimes() MultiAgentRuntimeInformer
 }
 
 type version struct {
@@ -50,3 +52,8 @@ func (v *version) AgentRuntimes() AgentRuntimeInformer {
 func (v *version) CodeInterpreters() CodeInterpreterInformer {
 	return &codeInterpreterInformer{factory: v.factory, namespace: v.namespace, tweakListOptions: v.tweakListOptions}
 }
+
+// MultiAgentRuntimes returns a MultiAgentRuntimeInformer.
+func (v *version) MultiAgentRuntimes() MultiAgentRuntimeInformer {
+	return &multiAgentRuntimeInformer{factory: v.factory, namespace: v.namespace, tweakListOptions: v.tweakListOptions}
+}
diff --git a/client-go/informers/externalversions/runtime/v1alpha1/multiagentruntime.go b/client-go/informers/externalversions/runtime/v1alpha1/multiagentruntime.go
new file mode 100644
index 00000000..b498fb46
--- /dev/null
+++ b/client-go/informers/externalversions/runtime/v1alpha1/multiagentruntime.go
@@ -0,0 +1,102 @@
+/*
+Copyright The Volcano Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+// Code generated by informer-gen. DO NOT EDIT.
+
+package v1alpha1
+
+import (
+	context "context"
+	time "time"
+
+	versioned "github.com/volcano-sh/agentcube/client-go/clientset/versioned"
+	internalinterfaces "github.com/volcano-sh/agentcube/client-go/informers/externalversions/internalinterfaces"
+	runtimev1alpha1 "github.com/volcano-sh/agentcube/client-go/listers/runtime/v1alpha1"
+	apisruntimev1alpha1 "github.com/volcano-sh/agentcube/pkg/apis/runtime/v1alpha1"
+	v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	runtime "k8s.io/apimachinery/pkg/runtime"
+	watch "k8s.io/apimachinery/pkg/watch"
+	cache "k8s.io/client-go/tools/cache"
+)
+
+// MultiAgentRuntimeInformer provides access to a shared informer and lister for
+// MultiAgentRuntimes.
+type MultiAgentRuntimeInformer interface {
+	Informer() cache.SharedIndexInformer
+	Lister() runtimev1alpha1.MultiAgentRuntimeLister
+}
+
+type multiAgentRuntimeInformer struct {
+	factory          internalinterfaces.SharedInformerFactory
+	tweakListOptions internalinterfaces.TweakListOptionsFunc
+	namespace        string
+}
+
+// NewMultiAgentRuntimeInformer constructs a new informer for MultiAgentRuntime type.
+// Always prefer using an informer factory to get a shared informer instead of getting an independent
+// one. This reduces memory footprint and number of connections to the server.
+func NewMultiAgentRuntimeInformer(client versioned.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers) cache.SharedIndexInformer {
+	return NewFilteredMultiAgentRuntimeInformer(client, namespace, resyncPeriod, indexers, nil)
+}
+
+// NewFilteredMultiAgentRuntimeInformer constructs a new informer for MultiAgentRuntime type.
+// Always prefer using an informer factory to get a shared informer instead of getting an independent
+// one. This reduces memory footprint and number of connections to the server.
+func NewFilteredMultiAgentRuntimeInformer(client versioned.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers, tweakListOptions internalinterfaces.TweakListOptionsFunc) cache.SharedIndexInformer {
+	return cache.NewSharedIndexInformer(
+		&cache.ListWatch{
+			ListFunc: func(options v1.ListOptions) (runtime.Object, error) {
+				if tweakListOptions != nil {
+					tweakListOptions(&options)
+				}
+				return client.RuntimeV1alpha1().MultiAgentRuntimes(namespace).List(context.Background(), options)
+			},
+			WatchFunc: func(options v1.ListOptions) (watch.Interface, error) {
+				if tweakListOptions != nil {
+					tweakListOptions(&options)
+				}
+				return client.RuntimeV1alpha1().MultiAgentRuntimes(namespace).Watch(context.Background(), options)
+			},
+			ListWithContextFunc: func(ctx context.Context, options v1.ListOptions) (runtime.Object, error) {
+				if tweakListOptions != nil {
+					tweakListOptions(&options)
+				}
+				return client.RuntimeV1alpha1().MultiAgentRuntimes(namespace).List(ctx, options)
+			},
+			WatchFuncWithContext: func(ctx context.Context, options v1.ListOptions) (watch.Interface, error) {
+				if tweakListOptions != nil {
+					tweakListOptions(&options)
+				}
+				return client.RuntimeV1alpha1().MultiAgentRuntimes(namespace).Watch(ctx, options)
+			},
+		},
+		&apisruntimev1alpha1.MultiAgentRuntime{},
+		resyncPeriod,
+		indexers,
+	)
+}
+
+func (f *multiAgentRuntimeInformer) defaultInformer(client versioned.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
+	return NewFilteredMultiAgentRuntimeInformer(client, f.namespace, resyncPeriod, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, f.tweakListOptions)
+}
+
+func (f *multiAgentRuntimeInformer) Informer() cache.SharedIndexInformer {
+	return f.factory.InformerFor(&apisruntimev1alpha1.MultiAgentRuntime{}, f.defaultInformer)
+}
+
+func (f *multiAgentRuntimeInformer) Lister() runtimev1alpha1.MultiAgentRuntimeLister {
+	return runtimev1alpha1.NewMultiAgentRuntimeLister(f.Informer().GetIndexer())
+}
diff --git a/client-go/listers/runtime/v1alpha1/expansion_generated.go b/client-go/listers/runtime/v1alpha1/expansion_generated.go
index 6e06c618..64cfb8ed 100644
--- a/client-go/listers/runtime/v1alpha1/expansion_generated.go
+++ b/client-go/listers/runtime/v1alpha1/expansion_generated.go
@@ -33,3 +33,11 @@ type CodeInterpreterListerExpansion interface{}
 // CodeInterpreterNamespaceListerExpansion allows custom methods to be added to
 // CodeInterpreterNamespaceLister.
 type CodeInterpreterNamespaceListerExpansion interface{}
+
+// MultiAgentRuntimeListerExpansion allows custom methods to be added to
+// MultiAgentRuntimeLister.
+type MultiAgentRuntimeListerExpansion interface{}
+
+// MultiAgentRuntimeNamespaceListerExpansion allows custom methods to be added to
+// MultiAgentRuntimeNamespaceLister.
+type MultiAgentRuntimeNamespaceListerExpansion interface{}
diff --git a/client-go/listers/runtime/v1alpha1/multiagentruntime.go b/client-go/listers/runtime/v1alpha1/multiagentruntime.go
new file mode 100644
index 00000000..ab8dad70
--- /dev/null
+++ b/client-go/listers/runtime/v1alpha1/multiagentruntime.go
@@ -0,0 +1,70 @@
+/*
+Copyright The Volcano Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+// Code generated by lister-gen. DO NOT EDIT.
+
+package v1alpha1
+
+import (
+	runtimev1alpha1 "github.com/volcano-sh/agentcube/pkg/apis/runtime/v1alpha1"
+	labels "k8s.io/apimachinery/pkg/labels"
+	listers "k8s.io/client-go/listers"
+	cache "k8s.io/client-go/tools/cache"
+)
+
+// MultiAgentRuntimeLister helps list MultiAgentRuntimes.
+// All objects returned here must be treated as read-only.
+type MultiAgentRuntimeLister interface {
+	// List lists all MultiAgentRuntimes in the indexer.
+	// Objects returned here must be treated as read-only.
+	List(selector labels.Selector) (ret []*runtimev1alpha1.MultiAgentRuntime, err error)
+	// MultiAgentRuntimes returns an object that can list and get MultiAgentRuntimes.
+	MultiAgentRuntimes(namespace string) MultiAgentRuntimeNamespaceLister
+	MultiAgentRuntimeListerExpansion
+}
+
+// multiAgentRuntimeLister implements the MultiAgentRuntimeLister interface.
+type multiAgentRuntimeLister struct {
+	listers.ResourceIndexer[*runtimev1alpha1.MultiAgentRuntime]
+}
+
+// NewMultiAgentRuntimeLister returns a new MultiAgentRuntimeLister.
+func NewMultiAgentRuntimeLister(indexer cache.Indexer) MultiAgentRuntimeLister {
+	return &multiAgentRuntimeLister{listers.New[*runtimev1alpha1.MultiAgentRuntime](indexer, runtimev1alpha1.Resource("multiagentruntime").GroupResource())}
+}
+
+// MultiAgentRuntimes returns an object that can list and get MultiAgentRuntimes.
+func (s *multiAgentRuntimeLister) MultiAgentRuntimes(namespace string) MultiAgentRuntimeNamespaceLister {
+	return multiAgentRuntimeNamespaceLister{listers.NewNamespaced[*runtimev1alpha1.MultiAgentRuntime](s.ResourceIndexer, namespace)}
+}
+
+// MultiAgentRuntimeNamespaceLister helps list and get MultiAgentRuntimes.
+// All objects returned here must be treated as read-only.
+type MultiAgentRuntimeNamespaceLister interface {
+	// List lists all MultiAgentRuntimes in the indexer for a given namespace.
+	// Objects returned here must be treated as read-only.
+	List(selector labels.Selector) (ret []*runtimev1alpha1.MultiAgentRuntime, err error)
+	// Get retrieves the MultiAgentRuntime from the indexer for a given namespace and name.
+	// Objects returned here must be treated as read-only.
+	Get(name string) (*runtimev1alpha1.MultiAgentRuntime, error)
+	MultiAgentRuntimeNamespaceListerExpansion
+}
+
+// multiAgentRuntimeNamespaceLister implements the MultiAgentRuntimeNamespaceLister
+// interface.
+type multiAgentRuntimeNamespaceLister struct {
+	listers.ResourceIndexer[*runtimev1alpha1.MultiAgentRuntime]
+}
diff --git a/hack/update-codegen.sh b/hack/update-codegen.sh
index f39bb861..324af68e 100755
--- a/hack/update-codegen.sh
+++ b/hack/update-codegen.sh
@@ -71,9 +71,11 @@ find "${SCRIPT_ROOT}/client-go/listers" -name "*.go" -type f | while read -r fil
   if [[ "$OSTYPE" == "darwin"* ]]; then
     sed -i '' 's/runtimev1alpha1\.Resource("codeinterpreter")/runtimev1alpha1.Resource("codeinterpreter").GroupResource()/g' "$file"
     sed -i '' 's/runtimev1alpha1\.Resource("agentruntime")/runtimev1alpha1.Resource("agentruntime").GroupResource()/g' "$file"
+    sed -i '' 's/runtimev1alpha1\.Resource("multiagentruntime")/runtimev1alpha1.Resource("multiagentruntime").GroupResource()/g' "$file"
   else
     sed -i 's/runtimev1alpha1\.Resource("codeinterpreter")/runtimev1alpha1.Resource("codeinterpreter").GroupResource()/g' "$file"
     sed -i 's/runtimev1alpha1\.Resource("agentruntime")/runtimev1alpha1.Resource("agentruntime").GroupResource()/g' "$file"
+    sed -i 's/runtimev1alpha1\.Resource("multiagentruntime")/runtimev1alpha1.Resource("multiagentruntime").GroupResource()/g' "$file"
   fi
 done
 

From 14f9f727c54a033af3ce60f47def942fd975b5c9 Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Thu, 28 May 2026 00:24:57 +0530
Subject: [PATCH 19/20] docs: Address review feedback on MultiAgentRuntime
 proposal

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 docs/design/multi-agent-runtime-proposal.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)
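Note on the `DeleteAgentGroupRole` wording in this patch: the following is a minimal sketch of the delete-then-check step, assuming the manifest hash holds one `_metadata` field plus one `role:<name>` field per role. The exact field names, the `deleteRoleScript` text, and the `deleteRole` helper are hypothetical and not taken from the implementation. The Lua text is illustrative only and is never executed here; the Go function models the same steps on an in-memory map so the control flow can be checked without a Redis server.

```go
package main

import (
	"fmt"
	"strings"
)

// deleteRoleScript is a hypothetical Lua script: KEYS[1] is the agentgroup key
// and ARGV[1] is the role field to remove. It is shown for illustration only.
const deleteRoleScript = `
redis.call('HDEL', KEYS[1], ARGV[1])
for _, field in ipairs(redis.call('HKEYS', KEYS[1])) do
  if string.sub(field, 1, 5) == 'role:' then
    return 0
  end
end
redis.call('DEL', KEYS[1])
return 1
`

// deleteRole models the script on an in-memory store (key -> hash fields).
// It reports whether the whole group key was removed.
func deleteRole(store map[string]map[string]string, groupKey, roleName string) bool {
	hash := store[groupKey]        // a missing key behaves like an empty hash
	delete(hash, "role:"+roleName) // HDEL
	for field := range hash {
		if strings.HasPrefix(field, "role:") {
			return false // another role is still registered; keep the manifest
		}
	}
	delete(store, groupKey) // DEL drops _metadata together with the key
	return true
}

func main() {
	store := map[string]map[string]string{
		"agentgroup:grp-1": {"_metadata": "{}", "role:planner": "{}", "role:researcher": "{}"},
	}
	fmt.Println(deleteRole(store, "agentgroup:grp-1", "planner"))
	fmt.Println(deleteRole(store, "agentgroup:grp-1", "researcher"))
	fmt.Println(len(store))
}
```

The sketch scans for remaining `role:` fields instead of comparing `HLEN` against zero, because `HLEN` would still count `_metadata` after the last role is removed.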

diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index 0f777b66..313abf19 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -233,7 +233,7 @@ The environment variables injected into the dependent pod's containers point to
 
 **Validation against Naming Collisions:**
 * Because multiple role names or port names could map to the same sanitized environment variable (e.g., `my-agent` and `my.agent` both sanitize to `AGENTCUBE_DEP_MY_AGENT_ENDPOINT`), the API server validates the group configuration at request admission time. If any two roles or named ports within the group result in the same sanitized environment variable key, the request is rejected with a `400 Bad Request` validation error.
-* The `ValidatingAdmissionWebhook` also explicitly checks for **Service name collisions after truncation**: after computing `mar-{shortHash}-{roleNameSanitized-truncated-stripped}` for each role, if any two roles in the same group produce an identical service name, the request is rejected. This prevents the edge case where two roles whose names are identical in their first 50 characters would silently share a single Headless Service.
+* The `ValidatingAdmissionWebhook` also explicitly checks for **role name collisions after truncation**. It cannot compute the final Service name, because the `groupSessionID` (which provides the random hash) is only generated at runtime, so it compares the sanitized and truncated role names (`{roleNameSanitized-truncated-stripped}`) instead. If any two roles in the same group produce an identical sanitized name, the request is rejected. This prevents the edge case where two roles whose names are identical in their first 50 characters would silently share a single Headless Service.
 
 **Injection Scope:**
 * The dependency endpoints are injected into the `Env` list of **all containers** (including primary, sidecar, and init-containers) defined in the pod spec. This ensures that any multi-container runtime configuration can reliably resolve the endpoints.
@@ -646,9 +646,10 @@ GetAgentGroup(ctx context.Context, groupSessionID string) (*types.AgentGroupMani
 DeleteAgentGroup(ctx context.Context, groupSessionID string) error
 
 // DeleteAgentGroupRole atomically removes a single role entry from the group manifest.
-// To prevent race conditions during concurrent GC, the check-then-delete sequence
-// (removing the role, and deleting the manifest if it was the last role) MUST be
-// implemented using a Redis Lua script or transaction.
+// To stay atomic and avoid the race where the manifest is deleted while a concurrent
+// process is adding a role, the implementation should use a Redis Lua script.
+// The script should run HDEL, then check whether any role fields remain (HLEN also
+// counts the _metadata field), and delete the entire key in the same atomic step if none do.
 DeleteAgentGroupRole(ctx context.Context, groupSessionID, roleName string) error
 
 // UpdateAgentGroupRoleStatus atomically updates the status and endpoint of a specific role
@@ -862,8 +863,8 @@ Caching both the group manifest and the coordinator `SandboxInfo` per group per
 
 When the GC deletes a sandbox that has a non-empty `GroupSessionID`:
 
-1. It calls `DeleteAgentGroupRole(ctx, groupSessionID, roleName)` to atomically remove the role's entry from the Redis Hash using `HDEL`. This avoids a read-modify-write cycle and prevents race conditions when multiple GC goroutines may be cleaning up members of the same group concurrently.
-2. `DeleteAgentGroupRole()` additionally checks if any `role:*` fields remain in the hash after deletion. If none remain, it deletes the `_metadata` field and removes the entire `agentgroup:` key.
+1. It calls `DeleteAgentGroupRole(ctx, groupSessionID, roleName)`, which runs a Lua script to atomically remove the role's entry from the Redis Hash. The script executes `HDEL` directly instead of a read-modify-write cycle, which prevents race conditions when multiple GC goroutines clean up members of the same group concurrently.
+2. `DeleteAgentGroupRole()`'s Lua script additionally checks `HLEN` after deletion. Because `HLEN` also counts the `_metadata` field, the group is empty when only `_metadata` remains, not when `HLEN` reaches 0. If no `role:*` fields remain in the hash, the script removes the entire `agentgroup:` key (and with it `_metadata`) in the same atomic script execution.
 
 This ensures group manifests do not accumulate indefinitely in the store after their member sandboxes expire. The atomic `HDEL` approach also prevents data loss from concurrent GC deletions of sibling roles within the same group.
 

From 1603cae4fc72e886c252248425d004897467da4a Mon Sep 17 00:00:00 2001
From: Abhinav Singh <abhinavsingh717073@gmail.com>
Date: Thu, 28 May 2026 00:32:17 +0530
Subject: [PATCH 20/20] fix: Address bugs and inconsistencies reported in PR
 #354

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
---
 .../listers/runtime/v1alpha1/multiagentruntime.go |  2 +-
 docs/design/multi-agent-runtime-proposal.md       | 15 +++++++++++----
 hack/update-codegen.sh                            |  4 ++--
 .../runtime/v1alpha1/multiagent_validation.go     |  3 ++-
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/client-go/listers/runtime/v1alpha1/multiagentruntime.go b/client-go/listers/runtime/v1alpha1/multiagentruntime.go
index ab8dad70..c03bb0e4 100644
--- a/client-go/listers/runtime/v1alpha1/multiagentruntime.go
+++ b/client-go/listers/runtime/v1alpha1/multiagentruntime.go
@@ -43,7 +43,7 @@ type multiAgentRuntimeLister struct {
 
 // NewMultiAgentRuntimeLister returns a new MultiAgentRuntimeLister.
 func NewMultiAgentRuntimeLister(indexer cache.Indexer) MultiAgentRuntimeLister {
-	return &multiAgentRuntimeLister{listers.New[*runtimev1alpha1.MultiAgentRuntime](indexer, runtimev1alpha1.Resource("multiagentruntime").GroupResource())}
+	return &multiAgentRuntimeLister{listers.New[*runtimev1alpha1.MultiAgentRuntime](indexer, runtimev1alpha1.Resource("multiagentruntimes").GroupResource())}
 }
 
 // MultiAgentRuntimes returns an object that can list and get MultiAgentRuntimes.
diff --git a/docs/design/multi-agent-runtime-proposal.md b/docs/design/multi-agent-runtime-proposal.md
index 313abf19..75557918 100644
--- a/docs/design/multi-agent-runtime-proposal.md
+++ b/docs/design/multi-agent-runtime-proposal.md
@@ -264,6 +264,7 @@ A separate group manifest key stores aggregated role metadata:
 agentgroup:{grp-xxx} -> AgentGroupManifest{
     GroupSessionID: "grp-xxx",
     CreatedAt: ...,
+    ExpiresAt: ...,
     Roles: [
         { Name: "planner",    SessionID: "...", Endpoint: "10.0.0.4:8080", Status: "ready" },
         { Name: "researcher", SessionID: "...", Endpoint: "10.0.0.5:8080", Status: "ready" },
@@ -407,6 +408,10 @@ func (s *Server) createSandboxGroup(
                 if err != nil {
                     if mar.Spec.StartupPolicy == StartupPolicyBestEffort && !role.IsCoordinator {
                         klog.Warningf("group %s: role %s failed (BestEffort policy): %v", groupSessionID, role.Name, err)
+                        // Clean up the orphaned Headless Service for the failed worker role
+                        if delErr := s.kubeClient.CoreV1().Services(mar.Namespace).Delete(ctx, svcName, metav1.DeleteOptions{}); delErr != nil && !apierrors.IsNotFound(delErr) {
+                            klog.Warningf("group %s: failed to clean up orphaned service %s: %v", groupSessionID, svcName, delErr)
+                        }
                         createdMutex.Lock()
                         created = append(created, createdRole{
                             name:   role.Name,
@@ -452,7 +457,7 @@ func (s *Server) createSandboxGroup(
 
 - **Parallel Sandbox Creation:** To prevent HTTP gateway or client timeouts when launching large groups, roles that do not share mutual dependencies (i.e., reside at the same level of the dependency DAG) are created in parallel. Sandbox creation proceeds in "dependency waves": all sandboxes within a wave are launched concurrently, and the server waits for all to be ready before proceeding to the next dependent wave.
 - **Consistent TTL:** A single `baseTime` is captured before the creation loop begins and used for all `ShutdownTime` calculations. This ensures every sandbox in the group shares a synchronized absolute TTL, regardless of how long it takes to create each wave.
-- **Headless Service per role:** Before each sandbox is created, `createHeadlessServiceForRole()` provisions a Headless Service whose DNS name (`serviceDNS`) is stored in `createdRole`. The `injectDependencyEndpoints()` function reads `serviceDNS` from `created` to construct stable DNS-based environment variables. On rollback, all created Services are explicitly deleted alongside their sandboxes.
+- **Headless Service per role:** Before each sandbox is created, `createHeadlessServiceForRole()` provisions a Headless Service whose DNS name (`serviceDNS`) is stored in `createdRole`. The `injectDependencyEndpoints()` function reads `serviceDNS` from `created` to construct stable DNS-based environment variables, explicitly skipping any dependency in `createdSnapshot` where `failed == true` to avoid injecting unreachable endpoints under the `BestEffort` policy. On rollback, all created Services are explicitly deleted alongside their sandboxes.
 - The deferred rollback calls `rollbackSandboxCreation()` for every sandbox in `created` **and** deletes all Headless Services tracked in `createdServices`, preventing resource leaks on partial failure.
 - Roles are created in topological order. A dependency's `serviceDNS` is guaranteed to be in `created` before the dependent role's sandbox is built.
 - `buildSandboxByAgentRuntime()`, `buildSandboxByCodeInterpreter()`, `createSandbox()`, `WatchSandboxOnce()`, and `rollbackSandboxCreation()` are all called as-is. The correct builder is selected by the `role.Kind` field.
@@ -531,9 +536,9 @@ func topoSort(roles []RoleSpec) ([][]RoleSpec, error) {
     }
 
     var currentQueue []string
-    for name, deg := range inDegree {
-        if deg == 0 {
-            currentQueue = append(currentQueue, name)
+    for _, r := range roles {
+        if inDegree[r.Name] == 0 {
+            currentQueue = append(currentQueue, r.Name)
         }
     }
 
@@ -554,6 +559,7 @@ func topoSort(roles []RoleSpec) ([][]RoleSpec, error) {
                 }
             }
         }
+        sort.Strings(nextQueue)
         waves = append(waves, wave)
         currentQueue = nextQueue
     }
@@ -626,6 +632,7 @@ type AgentGroupManifest struct {
     Namespace      string                `json:"namespace"`
     Roles          []AgentGroupRoleState `json:"roles"`
     CreatedAt      time.Time             `json:"createdAt"`
+    ExpiresAt      time.Time             `json:"expiresAt"`
 }
 ```
 
diff --git a/hack/update-codegen.sh b/hack/update-codegen.sh
index 324af68e..cf1c22f3 100755
--- a/hack/update-codegen.sh
+++ b/hack/update-codegen.sh
@@ -71,11 +71,11 @@ find "${SCRIPT_ROOT}/client-go/listers" -name "*.go" -type f | while read -r fil
   if [[ "$OSTYPE" == "darwin"* ]]; then
     sed -i '' 's/runtimev1alpha1\.Resource("codeinterpreter")/runtimev1alpha1.Resource("codeinterpreter").GroupResource()/g' "$file"
     sed -i '' 's/runtimev1alpha1\.Resource("agentruntime")/runtimev1alpha1.Resource("agentruntime").GroupResource()/g' "$file"
-    sed -i '' 's/runtimev1alpha1\.Resource("multiagentruntime")/runtimev1alpha1.Resource("multiagentruntime").GroupResource()/g' "$file"
+    sed -i '' 's/runtimev1alpha1\.Resource("multiagentruntime")/runtimev1alpha1.Resource("multiagentruntimes").GroupResource()/g' "$file"
   else
     sed -i 's/runtimev1alpha1\.Resource("codeinterpreter")/runtimev1alpha1.Resource("codeinterpreter").GroupResource()/g' "$file"
     sed -i 's/runtimev1alpha1\.Resource("agentruntime")/runtimev1alpha1.Resource("agentruntime").GroupResource()/g' "$file"
-    sed -i 's/runtimev1alpha1\.Resource("multiagentruntime")/runtimev1alpha1.Resource("multiagentruntime").GroupResource()/g' "$file"
+    sed -i 's/runtimev1alpha1\.Resource("multiagentruntime")/runtimev1alpha1.Resource("multiagentruntimes").GroupResource()/g' "$file"
   fi
 done
 
diff --git a/pkg/apis/runtime/v1alpha1/multiagent_validation.go b/pkg/apis/runtime/v1alpha1/multiagent_validation.go
index 1f50f019..5864250d 100644
--- a/pkg/apis/runtime/v1alpha1/multiagent_validation.go
+++ b/pkg/apis/runtime/v1alpha1/multiagent_validation.go
@@ -57,8 +57,9 @@ func ValidateMultiAgentRuntimeSpec(spec *MultiAgentRuntimeSpec, fldPath *field.P
 		// Check for duplicate role names
 		if roleMap[role.Name] {
 			allErrs = append(allErrs, field.Duplicate(idxPath.Child("name"), role.Name))
+		} else {
+			roleMap[role.Name] = true
 		}
-		roleMap[role.Name] = true
 
 		if role.IsCoordinator {
 			coordinatorCount++