Summary
A2AAgentRegistry currently tracks agent availability (health checks, failure counts, RemoteAgentStatus). AgentConnection.is_healthy() answers 'is the agent alive?'
It does not answer: 'is this agent consistently delivering on its task commitments across sessions?'
These are different failure modes. An agent can pass health checks while exhibiting increasing refusal rates, calibration drift, or task completion degradation over time — invisible to the current registry.
Proposed: AgentBehavioralProfile
An optional, strictly additive model attached to AgentConnection that aggregates behavioral signals across sessions:
```python
from dataclasses import dataclass, field
from typing import Optional
import time


@dataclass
class SessionBehavioralSnapshot:
    """Per-session behavioral measurements."""
    session_id: str
    timestamp: float
    delivery_score: float      # tasks_completed / tasks_requested (0.0–1.0)
    calibration_delta: float   # |predicted_confidence − actual_success_rate|
    avg_latency_ms: Optional[float] = None
    tool_call_count: Optional[int] = None
    error_count: int = 0


@dataclass
class AgentBehavioralProfile:
    """Cross-session behavioral reliability profile for a remote agent."""
    agent_id: str
    snapshots: list[SessionBehavioralSnapshot] = field(default_factory=list)
    window_sessions: int = 10  # rolling window size

    def record_session(self, snapshot: SessionBehavioralSnapshot) -> None:
        """Append session snapshot; prune to window_sessions."""
        self.snapshots.append(snapshot)
        if len(self.snapshots) > self.window_sessions:
            self.snapshots = self.snapshots[-self.window_sessions:]

    @property
    def delivery_trend(self) -> Optional[float]:
        """Slope of delivery_score over the window (positive = improving)."""
        return _slope([s.delivery_score for s in self.snapshots])

    @property
    def calibration_trend(self) -> Optional[float]:
        """Slope of calibration_delta (positive = diverging = bad)."""
        return _slope([s.calibration_delta for s in self.snapshots])

    @property
    def is_degrading(self) -> bool:
        """True if delivery_score slope < −0.05/session over the window."""
        t = self.delivery_trend
        return t is not None and t < -0.05

    @property
    def avg_delivery_score(self) -> Optional[float]:
        if not self.snapshots:
            return None
        return sum(s.delivery_score for s in self.snapshots) / len(self.snapshots)


def _slope(values: list[float]) -> Optional[float]:
    """Linear regression slope (stdlib only, no numpy)."""
    n = len(values)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den if den else 0.0
```
Integration Points
AgentConnection (transport.py):

```python
behavioral_profile: Optional[AgentBehavioralProfile] = Field(
    None,
    description="Cross-session behavioral reliability profile (populated by A2AAgentRegistry)",
)
```

Default None = backward compatible for all existing agents.
A2AAgentRegistry (a2a_registry.py):
- After each session/task completes, call agent_connection.behavioral_profile.record_session(snapshot)
- Expose a get_reliable_agents(min_delivery=0.8) filter alongside the existing get_healthy_agents()
- Flag is_degrading=True agents in health telemetry
Why This Matters
Health checks answer availability. Behavioral profiles answer reliability — whether the agent consistently does what it says it will do. For multi-agent collaboration at scale, these are separate concerns:
- An agent can be alive but unreliable (passes health checks, low delivery_score)
- An agent can be reliable but temporarily down (is_degrading=False, currently STALE)
The distinction matters for task routing decisions.
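The routing split can be made concrete as a predicate that gates on availability first and reliability second. The function name and the no-data policy below are illustrative assumptions:

```python
from typing import Optional

def should_route(is_healthy: bool, avg_delivery: Optional[float],
                 min_delivery: float = 0.8) -> bool:
    """Availability comes from the existing health check; reliability
    from the behavioral profile's rolling average."""
    if not is_healthy:
        return False   # reliable-but-down (e.g. STALE): health gate wins
    if avg_delivery is None:
        return True    # no behavioral data yet: don't penalize new agents
    return avg_delivery >= min_delivery

# Alive but unreliable: passes health checks, low delivery_score
assert should_route(True, 0.4) is False
# Reliable but temporarily down: good history doesn't override STALE
assert should_route(False, 0.95) is False
```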
Reference Implementation
PDR (Probabilistic Delivery Reliability) paper — DOI: 10.5281/zenodo.19348539 — formalizes this three-axis behavioral assessment surface (delivery_score × calibration_delta × adaptation_score) with production measurement methodology.
The AgentBehavioralProfile above implements the delivery_score and calibration_delta axes in ~50 lines of stdlib Python.
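For the calibration_delta axis, one plausible per-session measurement is to compare the agent's stated confidence against its realized success rate. This assumes each task records a predicted confidence and a success flag; the helper below is a sketch, not the PDR paper's exact estimator:

```python
def calibration_delta(predictions: list[tuple[float, bool]]) -> float:
    """|mean predicted confidence − actual success rate| for one session.
    Input: (predicted_confidence, task_succeeded) pairs."""
    if not predictions:
        return 0.0
    mean_conf = sum(p for p, _ in predictions) / len(predictions)
    success_rate = sum(1 for _, ok in predictions if ok) / len(predictions)
    return abs(mean_conf - success_rate)

# An agent that claims ~90% confidence but succeeds half the time:
tasks = [(0.9, True), (0.9, False), (0.9, True), (0.9, False)]
delta = calibration_delta(tasks)  # |0.9 − 0.5| = 0.4, a large miscalibration
```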
Strictly Additive
- AgentConnection.behavioral_profile defaults to None, so no existing code paths are affected
- AgentBehavioralProfile is a new file (models/behavioral_profile.py)
- No new dependencies (stdlib dataclasses + math only)
- Existing health check / failure count / STALE logic unchanged