feat(registry): AgentBehavioralProfile — cross-session reliability tracking for A2AAgentRegistry #330

@nanookclaw

Summary

A2AAgentRegistry currently tracks agent availability (health checks, failure counts, RemoteAgentStatus). AgentConnection.is_healthy() answers 'is the agent alive?'

It does not answer: 'is this agent consistently delivering on its task commitments across sessions?'

These are different failure modes. An agent can pass health checks while exhibiting increasing refusal rates, calibration drift, or task completion degradation over time — invisible to the current registry.

Proposed: AgentBehavioralProfile

An optional, strictly additive model attached to AgentConnection that aggregates behavioral signals across sessions:

from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class SessionBehavioralSnapshot:
    """Per-session behavioral measurements."""
    session_id: str
    timestamp: float
    delivery_score: float          # tasks_completed / tasks_requested  (0.0–1.0)
    calibration_delta: float       # |predicted_confidence − actual_success_rate|
    avg_latency_ms: Optional[float] = None
    tool_call_count: Optional[int] = None
    error_count: int = 0

@dataclass
class AgentBehavioralProfile:
    """Cross-session behavioral reliability profile for a remote agent."""
    agent_id: str
    snapshots: list[SessionBehavioralSnapshot] = field(default_factory=list)
    window_sessions: int = 10      # rolling window size

    def record_session(self, snapshot: SessionBehavioralSnapshot) -> None:
        """Append session snapshot; prune to window_sessions."""
        self.snapshots.append(snapshot)
        if len(self.snapshots) > self.window_sessions:
            self.snapshots = self.snapshots[-self.window_sessions:]

    @property
    def delivery_trend(self) -> Optional[float]:
        """Slope of delivery_score over window (positive = improving)."""
        return _slope([s.delivery_score for s in self.snapshots])

    @property
    def calibration_trend(self) -> Optional[float]:
        """Slope of calibration_delta (positive = diverging = bad)."""
        return _slope([s.calibration_delta for s in self.snapshots])

    @property
    def is_degrading(self) -> bool:
        """True if delivery_score slope < −0.05/session over window."""
        t = self.delivery_trend
        return t is not None and t < -0.05

    @property
    def avg_delivery_score(self) -> Optional[float]:
        """Mean delivery_score over the window; None if no sessions are recorded."""
        if not self.snapshots:
            return None
        return sum(s.delivery_score for s in self.snapshots) / len(self.snapshots)

def _slope(values: list[float]) -> Optional[float]:
    """Least-squares slope of values against session index (stdlib only, no numpy)."""
    n = len(values)
    if n < 2:
        return None  # a trend needs at least two sessions
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    # den is positive whenever n >= 2; the guard is purely defensive.
    return num / den if den else 0.0

Integration Points

AgentConnection (transport.py):

behavioral_profile: Optional[AgentBehavioralProfile] = Field(
    None,
    description="Cross-session behavioral reliability profile (populated by A2AAgentRegistry)"
)

The None default keeps the field backward compatible for all existing agents.

A2AAgentRegistry (a2a_registry.py):

  • After each session/task completes, call agent_connection.behavioral_profile.record_session(snapshot), creating the profile first if the field is still None
  • Expose get_reliable_agents(min_delivery=0.8) filter alongside existing get_healthy_agents()
  • Flag is_degrading=True agents in health telemetry
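
The registry-side steps above could look roughly like the sketch below. Everything here is a stand-in: ProfileStub, ConnectionStub, and RegistrySketch are hypothetical names that only mimic the shape of AgentBehavioralProfile, AgentConnection, and A2AAgentRegistry, and excluding agents with no history from get_reliable_agents is an assumption rather than a decided behavior.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProfileStub:
    """Stand-in for AgentBehavioralProfile: keeps only delivery scores."""
    agent_id: str
    scores: list[float] = field(default_factory=list)

    def record_session(self, delivery_score: float) -> None:
        self.scores.append(delivery_score)

    @property
    def avg_delivery_score(self) -> Optional[float]:
        return sum(self.scores) / len(self.scores) if self.scores else None


@dataclass
class ConnectionStub:
    """Stand-in for AgentConnection; behavioral_profile defaults to None."""
    agent_id: str
    behavioral_profile: Optional[ProfileStub] = None


class RegistrySketch:
    """Hypothetical slice of A2AAgentRegistry, not the real class."""

    def __init__(self) -> None:
        self.connections: dict[str, ConnectionStub] = {}

    def record_session(self, agent_id: str, delivery_score: float) -> None:
        conn = self.connections[agent_id]
        # The field defaults to None, so create the profile on first use.
        if conn.behavioral_profile is None:
            conn.behavioral_profile = ProfileStub(agent_id)
        conn.behavioral_profile.record_session(delivery_score)

    def get_reliable_agents(self, min_delivery: float = 0.8) -> list[str]:
        reliable = []
        for conn in self.connections.values():
            profile = conn.behavioral_profile
            avg = profile.avg_delivery_score if profile is not None else None
            # Assumption: agents with no recorded history are excluded.
            if avg is not None and avg >= min_delivery:
                reliable.append(conn.agent_id)
        return reliable


registry = RegistrySketch()
for name in ("agent-a", "agent-b", "agent-c"):
    registry.connections[name] = ConnectionStub(name)

registry.record_session("agent-a", 1.0)
registry.record_session("agent-a", 0.9)
registry.record_session("agent-b", 0.5)
# agent-c never completes a session, so it never gets a profile.

reliable_ids = registry.get_reliable_agents()
print(reliable_ids)
```

Creating the profile lazily keeps the None default meaningful: an agent with no profile is simply one the registry has not yet observed, which is different from one with a poor record.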

Why This Matters

Health checks answer availability. Behavioral profiles answer reliability — whether the agent consistently does what it says it will do. For multi-agent collaboration at scale, these are separate concerns:

  • An agent can be alive but unreliable (passes health checks, low delivery_score)
  • An agent can be reliable but temporarily down (is_degrading=False, currently STALE)

The distinction matters for task routing decisions.
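
To make the routing point concrete, here is a minimal sketch of a selector that treats availability and reliability as independent gates. Candidate and pick_agent are hypothetical names introduced for illustration; a real router would read these values from AgentConnection and its behavioral profile, and the 0.8 threshold is just the default suggested for get_reliable_agents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical routing view of one agent (not a registry type)."""
    agent_id: str
    healthy: bool                   # result of the existing health check
    avg_delivery: Optional[float]   # None if there is no behavioral history
    degrading: bool = False         # mirrors is_degrading


def pick_agent(candidates: list[Candidate],
               min_delivery: float = 0.8) -> Optional[str]:
    """Return the most reliable agent among those currently available."""
    eligible = [
        c for c in candidates
        if c.healthy                            # availability gate
        and c.avg_delivery is not None
        and c.avg_delivery >= min_delivery      # reliability gate
        and not c.degrading
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.avg_delivery).agent_id


candidates = [
    Candidate("alive-but-unreliable", healthy=True, avg_delivery=0.4),
    Candidate("reliable-but-down", healthy=False, avg_delivery=0.95),
    Candidate("alive-and-reliable", healthy=True, avg_delivery=0.9),
]
chosen = pick_agent(candidates)
print(chosen)
```

Neither signal alone selects the right agent here: health checks alone would admit the first candidate, and delivery history alone would prefer the second.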

Reference Implementation

PDR (Probabilistic Delivery Reliability) paper — DOI: 10.5281/zenodo.19348539 — formalizes this three-axis behavioral assessment surface (delivery_score × calibration_delta × adaptation_score) with production measurement methodology.

The AgentBehavioralProfile above implements the delivery_score and calibration_delta axes in ~50 lines of stdlib Python.

Strictly Additive

  • AgentConnection.behavioral_profile defaults to None, so no existing code paths are affected
  • AgentBehavioralProfile is a new file (models/behavioral_profile.py)
  • No new dependencies (standard library only)
  • Existing health check / failure count / STALE logic unchanged
