Skip to content

feat(agent): detect availability via session/new probe and assistant-first identity#500

Open
kaizhou-lab wants to merge 144 commits into
mainfrom
feat/agent-connection-testing-phase2
Open

feat(agent): detect availability via session/new probe and assistant-first identity#500
kaizhou-lab wants to merge 144 commits into
mainfrom
feat/agent-connection-testing-phase2

Conversation

@kaizhou-lab

@kaizhou-lab kaizhou-lab commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR started as agent connection testing phase 2, but it now also completes the backend side of the assistant-first identity migration. The combined scope is: accurate agent availability detection, assistant/agent identity unification, cron assistant-first persistence, and a clearer /api/assistants contract.

Agent availability detection

  • Probe real session/new startup instead of inferring availability from static metadata.
  • Report online/offline status with explicit auth-required handling and clear stale needs_auth state after successful startup.
  • Gate aionrs availability on resolved provider configuration instead of backend labels alone.
  • Reap probe process trees / process groups to avoid orphaned wrapper children.
  • Expose management and logo metadata needed by AionUi for repair/status surfaces.

Assistant-first identity and schema cleanup

  • Make assistant identity the public write boundary for conversation, team, channel, cron, guide, and ACP-session flows.
  • Use agent_metadata.id as the canonical concrete agent binding behind assistants.
  • Normalize assistant definitions, snapshots, and runtime resolution around assistant id plus agent_metadata.id instead of ambiguous backend/preset fields.
  • Rename response/runtime backend metadata for ACP agents from generic backend to explicit acp_backend.
  • Remove duplicated nested agent.id from /api/assistants; the top-level agent_id is the only concrete agent binding in the response.
  • Keep nested /api/assistants.agent as runtime metadata only: type, source, and optional acp_backend.

Cron jobs

  • Move cron persistence to assistant-first agent_config.assistant_id.
  • Derive runtime backend, ACP backend, provider/model behavior, name, and avatar from the assistant at execution time.
  • Remove legacy cron_jobs.agent_type / agent_config.backend dependence from current write/read paths.
  • Migrate existing branch data in the existing phase-2 migration path instead of adding a new migration.
  • Preserve model/provider data where it is actual execution configuration, while deriving agent backend identity from assistant_id.

Team, conversation, channel, and ACP session flows

  • Reject legacy backend/custom-agent/preset fields in public assistant-first write DTOs while canonicalizing legacy read paths where needed.
  • Carry assistant identity through team creation, add-agent, snapshots, guide handoff, and MCP tool contracts.
  • Persist assistant snapshots for conversations and use them to resolve runtime backend safely.
  • Use assistant-owned defaults and runtime seeds when creating ACP/aionrs sessions.
  • Keep channel defaults and settings bound to assistants rather than backend labels.

Tests

  • Added/updated API type, assistant service, cron, team, conversation, channel, migration, and e2e coverage for the new identity contract.
  • Added assertions that /api/assistants exposes agent_id plus agent.acp_backend, and no longer exposes nested agent.id or agent.backend.

Testing

  • just push passed after merging latest origin/main.
  • AionCore: migration immutability check, cargo check, clippy/fmt, and workspace nextest passed.
  • Nextest result: 6506 passed, 18 skipped.

Closes #499

zk added 30 commits June 15, 2026 17:57
The availability scheduler runs `try_connect_custom_agent` every 5
minutes for every agent, spawning a CLI subprocess and tearing it
down once the ACP handshake completes (or fails). For wrapper CLIs
that fork a long-lived grandchild — `npm exec openclaw --acp` is the
production case — cleanup was leaking the grandchild because:

* `kill_on_drop(true)` on the tokio Command only signals the direct
  child (the npm exec wrapper), not its grandchild.
* The probe relied on `drop(protocol)` for the success path and on
  no explicit cleanup for the handshake-fail path, so `proc.kill`
  was never called.
* `CliAgentProcess::kill` itself short-circuited and returned Ok the
  moment the leader exited within the grace period — so even when
  callers did invoke it, no group-wide SIGKILL was sent.

Result: dozens of zombie `openclaw-acp` processes accumulated per
day under the 5-minute scheduler.

Fix:

1. `CliAgentProcess::kill` now always issues a group-wide SIGKILL
   after the grace period, even when the leader has already exited.
   `force_kill` already maps ESRCH to success, so the sweep is
   idempotent for already-reaped trees.
2. `try_connect_custom_agent` calls `proc.kill` on every outcome
   (success, ACP failure, handshake timeout) by hoisting the spawn
   out of the inner future and running cleanup unconditionally
   after the timeout race resolves.
3. New regression test `probe_kills_grandchild_left_behind_by_wrapper`
   exercises the exact wrapper-grandchild shape from production and
   asserts the grandchild is reaped before the probe returns.
zk added 30 commits June 18, 2026 17:38
list_management_rows now sets env to Vec::new() instead of copying
meta.env, which contained merged override secrets. The management
row already exposes has_command_override + env_override_key_count;
the UI does not need plaintext values.

The e2e test now asserts the management row's env is empty AND
that the secret value "sk-x" does not appear anywhere in the
management response body.
Added record_session_success to AgentAvailabilityFeedbackPort trait
and AgentAvailabilityService. It persists a session-kind available
snapshot, which clears needs_auth and updates last_success_at.

Turn orchestrator now calls record_agent_session_success when
send_message succeeds, mirroring the existing session failure path.

Test: record_session_success_clears_needs_auth verifies that a
needs_auth state set via session failure is cleared by session success.
Bare assistants for single-engine agents (e.g. Aion CLI) were generated
with an empty agent_backend because their engine identity lives in
agent_type, not the ACP-vendor backend column. An empty preset_agent_type
made the frontend route aionrs assistants as ACP, dropping the top-level
model and persisting a NULL model that later failed warmup with
"Provider '' not found". Reconcile now falls back to agent_type when
backend is empty.

Also fix pre-existing test breakage left from the assistant-first
migration:
- add missing agent_status/team_selectable/deletable to the
  AssistantResponse camelCase rejection fixture
- update channel default-settings e2e to expect the generated aionrs
  bare assistant binding instead of a null assistant
- drop the obsolete agent.select persist integration test (direct agent
  selection is no longer supported; covered by the unknown-action unit test)
- add missing last_check_error_details to the bare assistant test row
…/offline

Align the backend agent status model with the simplified frontend semantics:
a probe only verifies an ACP handshake (reachability), not authorization, so
the misleading "available"/"needs_auth" verdicts are dropped in favor of plain
online/offline.

- agent_discovery: rename AgentManagementStatus and AgentSnapshotCheckStatus
  variants Available/Unavailable/NeedsAuth -> Online/Offline (serde snake_case).
- availability: simplify snapshot persistence — record_session_failure always
  yields offline; drop the success-recording path and auth detection.
- registry: parse "online"/"offline"; derive management status as Offline ->
  Offline else Online.
- turn_orchestrator: always report session_send_failed on send error; no
  success recording.
- assistant/service + tests: migrate status values to online/offline.
…iders

Deepen the agent health probe from `initialize` to `session/new` so it
reflects real usability, not just protocol reachability — `initialize`
returns authMethods even for authorized agents and cannot tell apart
"reachable but not signed in".

- custom_agent_probe: after `initialize`, open a throwaway `session/new`
  (no prompt); classify the outcome as Ok / Auth (ACP auth_required,
  JSON-RPC -32000) / Fail. Applies to both the custom and builtin-managed
  probe paths.
- api-types: add `TryConnectCustomAgentResponse::FailAuth` (tag `fail_auth`).
- availability: map FailAuth → offline + `auth_required` code; gate aionrs
  (built-in agent, no external CLI) availability on having at least one
  enabled model provider, mirroring AssistantService::resolve_default_agent_type
  — offline + `no_provider` otherwise.
- custom: accept test-on-save when the agent is reachable but auth_required
  (a valid agent the user just hasn't logged into yet).
- registry: add guidance for auth_required and no_provider error codes.
- The background scheduler shares run_probe, so periodic checks reflect the
  same session/new-based status.
…ficeAI/AionCore into feat/agent-connection-testing-phase2

* 'feat/agent-connection-testing-phase2' of github.com:iOfficeAI/AionCore:
…ackend

An assistant's agent_status was matched to its agent row by `backend`
only. aionrs (the built-in Rust agent) has a NULL `backend` and is keyed
by `agent_type` ("aionrs"), so every aionrs-backed assistant failed to
resolve a row and was mislabelled Missing/unavailable.

Match the agent row on `backend == effective_backend` OR
`agent_type.serde_name() == effective_backend`, so aionrs assistants
resolve to the real aionrs row and reflect its actual status.

Add a regression test covering an aionrs assistant (row with NULL backend,
agent_type Aionrs, Online) resolving to Online instead of Missing.
…-testing-phase2

# Conflicts:
#	crates/aionui-conversation/src/service.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Detect agent availability: probe session/new, recognize auth_required, gate aionrs by provider

1 participant