Summary
A2AAgentRegistry currently tracks agent availability (health checks, failure counts, RemoteAgentStatus). AgentConnection.is_healthy() answers 'is the agent alive?'
It does not answer: 'is this agent consistently delivering on its task commitments across sessions?'
These are different failure modes. An agent can pass health checks while exhibiting increasing refusal rates, calibration drift, or task completion degradation over time — invisible to the current registry.
Proposed: AgentBehavioralProfile
An optional, strictly additive model attached to AgentConnection that aggregates behavioral signals across sessions:
```python
from dataclasses import dataclass, field
from typing import Optional
import time


@dataclass
class SessionBehavioralSnapshot:
    """Per-session behavioral measurements."""
    session_id: str
    timestamp: float
    delivery_score: float      # tasks_completed / tasks_requested (0.0–1.0)
    calibration_delta: float   # |predicted_confidence − actual_success_rate|
    avg_latency_ms: Optional[float] = None
    tool_call_count: Optional[int] = None
    error_count: int = 0


@dataclass
class AgentBehavioralProfile:
    """Cross-session behavioral reliability profile for a remote agent."""
    agent_id: str
    snapshots: list[SessionBehavioralSnapshot] = field(default_factory=list)
    window_sessions: int = 10  # rolling window size

    def record_session(self, snapshot: SessionBehavioralSnapshot) -> None:
        """Append session snapshot; prune to window_sessions."""
        self.snapshots.append(snapshot)
        if len(self.snapshots) > self.window_sessions:
            self.snapshots = self.snapshots[-self.window_sessions:]

    @property
    def delivery_trend(self) -> Optional[float]:
        """Slope of delivery_score over the window (positive = improving)."""
        return _slope([s.delivery_score for s in self.snapshots])

    @property
    def calibration_trend(self) -> Optional[float]:
        """Slope of calibration_delta (positive = diverging = bad)."""
        return _slope([s.calibration_delta for s in self.snapshots])

    @property
    def is_degrading(self) -> bool:
        """True if delivery_score slope < −0.05/session over the window."""
        t = self.delivery_trend
        return t is not None and t < -0.05

    @property
    def avg_delivery_score(self) -> Optional[float]:
        if not self.snapshots:
            return None
        return sum(s.delivery_score for s in self.snapshots) / len(self.snapshots)


def _slope(values: list[float]) -> Optional[float]:
    """Linear regression slope (stdlib only, no numpy)."""
    n = len(values)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den if den else 0.0
```
Integration Points
AgentConnection (transport.py):

```python
behavioral_profile: Optional[AgentBehavioralProfile] = Field(
    None,
    description="Cross-session behavioral reliability profile (populated by A2AAgentRegistry)",
)
```

Default None = backward compatible for all existing agents.
A2AAgentRegistry (a2a_registry.py):
- After each session/task completes, call agent_connection.behavioral_profile.record_session(snapshot)
- Expose a get_reliable_agents(min_delivery=0.8) filter alongside the existing get_healthy_agents()
- Flag is_degrading=True agents in health telemetry
Why This Matters
Health checks answer availability. Behavioral profiles answer reliability — whether the agent consistently does what it says it will do. For multi-agent collaboration at scale, these are separate concerns:
- An agent can be alive but unreliable (passes health checks, low delivery_score)
- An agent can be reliable but temporarily down (is_degrading=False, currently STALE)
The distinction matters for task routing decisions.
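The routing split can be made concrete as a predicate that gates on availability first and reliability second. The function name and the no-data policy below are illustrative assumptions:

```python
from typing import Optional

def should_route(is_healthy: bool, avg_delivery: Optional[float],
                 min_delivery: float = 0.8) -> bool:
    """Availability comes from the existing health check; reliability
    from the behavioral profile's rolling average."""
    if not is_healthy:
        return False   # reliable-but-down (e.g. STALE): health gate wins
    if avg_delivery is None:
        return True    # no behavioral data yet: don't penalize new agents
    return avg_delivery >= min_delivery

# Alive but unreliable: passes health checks, low delivery_score
assert should_route(True, 0.4) is False
# Reliable but temporarily down: good history doesn't override STALE
assert should_route(False, 0.95) is False
```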
Reference Implementation
PDR (Probabilistic Delivery Reliability) paper — DOI: 10.5281/zenodo.19348539 — formalizes this three-axis behavioral assessment surface (delivery_score × calibration_delta × adaptation_score) with production measurement methodology.
The AgentBehavioralProfile above implements the delivery_score and calibration_delta axes in ~50 lines of stdlib Python.
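For the calibration_delta axis, one plausible per-session measurement is to compare the agent's stated confidence against its realized success rate. This assumes each task records a predicted confidence and a success flag; the helper below is a sketch, not the PDR paper's exact estimator:

```python
def calibration_delta(predictions: list[tuple[float, bool]]) -> float:
    """|mean predicted confidence − actual success rate| for one session.
    Input: (predicted_confidence, task_succeeded) pairs."""
    if not predictions:
        return 0.0
    mean_conf = sum(p for p, _ in predictions) / len(predictions)
    success_rate = sum(1 for _, ok in predictions if ok) / len(predictions)
    return abs(mean_conf - success_rate)

# An agent that claims ~90% confidence but succeeds half the time:
tasks = [(0.9, True), (0.9, False), (0.9, True), (0.9, False)]
delta = calibration_delta(tasks)  # |0.9 − 0.5| = 0.4, a large miscalibration
```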
Strictly Additive
- AgentConnection.behavioral_profile defaults to None, so no existing code paths are affected
- AgentBehavioralProfile is a new file (models/behavioral_profile.py)
- No new dependencies (stdlib dataclasses + math only)
- Existing health check / failure count / STALE logic unchanged