feat(agent): detect availability via session/new probe and assistant-first identity by kaizhou-lab · Pull Request #500 · iOfficeAI/AionCore

kaizhou-lab · 2026-06-22T09:27:42Z

Summary

This PR started as agent connection testing phase 2, but it now also completes the backend side of the assistant-first identity migration. The combined scope is: accurate agent availability detection, assistant/agent identity unification, cron assistant-first persistence, and a clearer /api/assistants contract.

Agent availability detection

Probe real session/new startup instead of inferring availability from static metadata.
Report online/offline status with explicit auth-required handling and clear stale needs_auth state after successful startup.
Gate aionrs availability on resolved provider configuration instead of backend labels alone.
Reap probe process trees / process groups to avoid orphaned wrapper children.
Expose management and logo metadata needed by AionUi for repair/status surfaces.

Assistant-first identity and schema cleanup

Make assistant identity the public write boundary for conversation, team, channel, cron, guide, and ACP-session flows.
Use agent_metadata.id as the canonical concrete agent binding behind assistants.
Normalize assistant definitions, snapshots, and runtime resolution around assistant id plus agent_metadata.id instead of ambiguous backend/preset fields.
Rename response/runtime backend metadata for ACP agents from generic backend to explicit acp_backend.
Remove duplicated nested agent.id from /api/assistants; the top-level agent_id is the only concrete agent binding in the response.
Keep nested /api/assistants.agent as runtime metadata only: type, source, and optional acp_backend.
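The response contract described above can be sketched as a pair of plain structs. This is an illustration only: the struct names, the assistant-level `id` field, and the hand-rolled `to_json` helper are assumptions made to keep the example dependency-free, not the actual AionCore types. The nested `agent` object carries only `type`, `source`, and optional `acp_backend`; the concrete binding lives in the top-level `agent_id`.

```rust
/// Runtime metadata nested under `agent`. It deliberately has no `id` and no
/// generic `backend` field (hypothetical struct name).
#[derive(Debug, Clone)]
pub struct AgentRuntimeMeta {
    pub agent_type: String, // emitted as `type`
    pub source: String,
    pub acp_backend: Option<String>,
}

/// One entry of the `/api/assistants` response (hypothetical struct name).
#[derive(Debug, Clone)]
pub struct AssistantResponse {
    pub id: String,       // assistant id (assumed field)
    pub agent_id: String, // the only concrete agent binding
    pub agent: AgentRuntimeMeta,
}

impl AssistantResponse {
    /// Hand-rolled JSON so the sketch needs no crates; values are assumed to
    /// contain no characters that require escaping.
    pub fn to_json(&self) -> String {
        let acp = match &self.agent.acp_backend {
            Some(backend) => format!(",\"acp_backend\":\"{}\"", backend),
            None => String::new(),
        };
        format!(
            "{{\"id\":\"{}\",\"agent_id\":\"{}\",\"agent\":{{\"type\":\"{}\",\"source\":\"{}\"{}}}}}",
            self.id, self.agent_id, self.agent.agent_type, self.agent.source, acp
        )
    }
}
```

A real implementation would presumably derive the serialization rather than format strings by hand; the point here is only which keys exist and which do not.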

Cron jobs

Move cron persistence to assistant-first agent_config.assistant_id.
Derive runtime backend, ACP backend, provider/model behavior, name, and avatar from the assistant at execution time.
Remove legacy cron_jobs.agent_type / agent_config.backend dependence from current write/read paths.
Migrate existing branch data in the existing phase-2 migration path instead of adding a new migration.
Preserve model/provider data where it is actual execution configuration, while deriving agent backend identity from assistant_id.
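A minimal sketch of the execution-time derivation described above, assuming a simple in-memory assistant lookup. All type names, field names, and the error type are hypothetical; only `assistant_id` and the idea of keeping model data as execution configuration come from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified assistant record.
#[derive(Debug, Clone, PartialEq)]
pub struct Assistant {
    pub id: String,
    pub name: String,
    pub avatar: Option<String>,
    pub agent_type: String,
    pub acp_backend: Option<String>,
}

/// What a cron job persists: the assistant identity plus real execution config.
#[derive(Debug, Clone, PartialEq)]
pub struct CronAgentConfig {
    pub assistant_id: String,
    pub model: Option<String>, // kept because it is actual execution configuration
}

/// Everything else is derived when the job runs.
#[derive(Debug, Clone, PartialEq)]
pub struct CronRuntime {
    pub name: String,
    pub avatar: Option<String>,
    pub agent_type: String,
    pub acp_backend: Option<String>,
    pub model: Option<String>,
}

pub fn resolve_cron_runtime(
    cfg: &CronAgentConfig,
    assistants: &HashMap<String, Assistant>,
) -> Result<CronRuntime, String> {
    let assistant = assistants
        .get(&cfg.assistant_id)
        .ok_or_else(|| format!("assistant '{}' not found", cfg.assistant_id))?;
    Ok(CronRuntime {
        name: assistant.name.clone(),
        avatar: assistant.avatar.clone(),
        agent_type: assistant.agent_type.clone(),
        acp_backend: assistant.acp_backend.clone(),
        model: cfg.model.clone(),
    })
}
```

Because nothing about the backend is stored on the job itself, rebinding an assistant to a different agent would change the next run without rewriting the job.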

Team, conversation, channel, and ACP session flows

Reject legacy backend/custom-agent/preset fields in public assistant-first write DTOs while canonicalizing legacy read paths where needed.
Carry assistant identity through team creation, add-agent, snapshots, guide handoff, and MCP tool contracts.
Persist assistant snapshots for conversations and use them to resolve runtime backend safely.
Use assistant-owned defaults and runtime seeds when creating ACP/aionrs sessions.
Keep channel defaults and settings bound to assistants rather than backend labels.

Tests

Added/updated API type, assistant service, cron, team, conversation, channel, migration, and e2e coverage for the new identity contract.
Added assertions that /api/assistants exposes agent_id plus agent.acp_backend, and no longer exposes nested agent.id or agent.backend.

Testing

just push passed after merging latest origin/main.
AionCore: migration immutability check, cargo check, clippy/fmt, and workspace nextest passed.
Nextest result: 6506 passed, 18 skipped.

Closes #499

The availability scheduler runs `try_connect_custom_agent` every 5 minutes for every agent, spawning a CLI subprocess and tearing it down once the ACP handshake completes (or fails). For wrapper CLIs that fork a long-lived grandchild (`npm exec openclaw --acp` is the production case), cleanup was leaking the grandchild because:

* `kill_on_drop(true)` on the tokio Command only signals the direct child (the npm exec wrapper), not its grandchild.
* The probe relied on `drop(protocol)` for the success path and on no explicit cleanup for the handshake-fail path, so `proc.kill` was never called.
* `CliAgentProcess::kill` itself short-circuited and returned Ok the moment the leader exited within the grace period, so even when callers did invoke it, no group-wide SIGKILL was sent.

Result: dozens of zombie `openclaw-acp` processes accumulated per day under the 5-minute scheduler.

Fix:

1. `CliAgentProcess::kill` now always issues a group-wide SIGKILL after the grace period, even when the leader has already exited. `force_kill` already maps ESRCH to success, so the sweep is idempotent for already-reaped trees.
2. `try_connect_custom_agent` calls `proc.kill` on every outcome (success, ACP failure, handshake timeout) by hoisting the spawn out of the inner future and running cleanup unconditionally after the timeout race resolves.
3. New regression test `probe_kills_grandchild_left_behind_by_wrapper` exercises the exact wrapper-grandchild shape from production and asserts the grandchild is reaped before the probe returns.
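The control flow of the fixed `kill` can be modelled without real signals. The sketch below is not the `CliAgentProcess` code: the `ProcessGroup` trait, the `FakeGroup` double, and the initial polite SIGTERM step are assumptions introduced for illustration. What it does reflect from the description is that the group-wide SIGKILL is sent unconditionally after the grace period, and that ESRCH counts as success.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Signal {
    Term,
    Kill,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SignalError {
    /// ESRCH: nothing in the group is left to signal.
    NoSuchProcess,
    Other(i32),
}

/// Hypothetical seam over a spawned process group, so the logic is testable.
pub trait ProcessGroup {
    fn signal_group(&mut self, sig: Signal) -> Result<(), SignalError>;
    /// Returns true if the group leader exited within `grace`.
    fn wait_leader(&mut self, grace: Duration) -> bool;
}

/// Group-wide SIGKILL; ESRCH is mapped to success so repeated sweeps are harmless.
fn force_kill<G: ProcessGroup>(group: &mut G) -> Result<(), SignalError> {
    match group.signal_group(Signal::Kill) {
        Ok(()) | Err(SignalError::NoSuchProcess) => Ok(()),
        Err(other) => Err(other),
    }
}

pub fn kill<G: ProcessGroup>(group: &mut G, grace: Duration) -> Result<(), SignalError> {
    // Assumed polite step: ask the whole group to terminate first.
    match group.signal_group(Signal::Term) {
        Ok(()) | Err(SignalError::NoSuchProcess) => {}
        Err(other) => return Err(other),
    }
    // The old behaviour returned here when the leader exited in time.
    // The result is deliberately ignored: grandchildren may outlive the leader.
    let _leader_exited = group.wait_leader(grace);
    force_kill(group)
}

/// Test double that records every signal it receives.
pub struct FakeGroup {
    pub sent: Vec<Signal>,
    pub leader_exits_in_grace: bool,
    pub already_reaped: bool,
}

impl ProcessGroup for FakeGroup {
    fn signal_group(&mut self, sig: Signal) -> Result<(), SignalError> {
        self.sent.push(sig);
        if self.already_reaped {
            Err(SignalError::NoSuchProcess)
        } else {
            Ok(())
        }
    }

    fn wait_leader(&mut self, _grace: Duration) -> bool {
        self.leader_exits_in_grace
    }
}
```

With a `FakeGroup` whose leader exits within the grace period, `kill` still records a `Signal::Kill`, which is the behaviour the regression test is described as protecting.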

list_management_rows now sets env to Vec::new() instead of copying meta.env, which contained merged override secrets. The management row already exposes has_command_override + env_override_key_count; the UI does not need plaintext values. The e2e test now asserts the management row's env is empty AND that the secret value "sk-x" does not appear anywhere in the management response body.
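The redaction can be illustrated with a small projection function. The `AgentMeta` and `ManagementRow` shapes below are simplified assumptions (including how override keys are tracked); only the field names `env`, `has_command_override`, and `env_override_key_count` come from the description above.

```rust
/// Hypothetical internal metadata; `env` holds merged override values,
/// which may include secrets.
#[derive(Debug, Clone)]
pub struct AgentMeta {
    pub command_override: Option<String>,
    pub env: Vec<(String, String)>,
    pub env_override_keys: Vec<String>,
}

/// What the management API exposes to the UI.
#[derive(Debug, Clone)]
pub struct ManagementRow {
    pub env: Vec<(String, String)>,
    pub has_command_override: bool,
    pub env_override_key_count: usize,
}

pub fn to_management_row(meta: &AgentMeta) -> ManagementRow {
    ManagementRow {
        // Never copy meta.env: the UI only needs to know that overrides exist.
        env: Vec::new(),
        has_command_override: meta.command_override.is_some(),
        env_override_key_count: meta.env_override_keys.len(),
    }
}
```

Exposing a count instead of the values lets the UI show that overrides are configured while keeping plaintext out of the response body.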

Added record_session_success to AgentAvailabilityFeedbackPort trait and AgentAvailabilityService. It persists a session-kind available snapshot, which clears needs_auth and updates last_success_at. Turn orchestrator now calls record_agent_session_success when send_message succeeds, mirroring the existing session failure path. Test: record_session_success_clears_needs_auth verifies that a needs_auth state set via session failure is cleared by session success.

Bare assistants for single-engine agents (e.g. Aion CLI) were generated with an empty agent_backend because their engine identity lives in agent_type, not the ACP-vendor backend column. An empty preset_agent_type made the frontend route aionrs assistants as ACP, dropping the top-level model and persisting a NULL model that later failed warmup with "Provider '' not found". Reconcile now falls back to agent_type when backend is empty.

Also fix pre-existing test breakage left from the assistant-first migration:

- add missing agent_status/team_selectable/deletable to the AssistantResponse camelCase rejection fixture
- update channel default-settings e2e to expect the generated aionrs bare assistant binding instead of a null assistant
- drop the obsolete agent.select persist integration test (direct agent selection is no longer supported; covered by the unknown-action unit test)
- add missing last_check_error_details to the bare assistant test row
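The reconcile fallback amounts to a small rule: prefer the backend column, and use agent_type when it is missing or empty. The function name and signature below are hypothetical, and treating whitespace-only values as empty is an assumption.

```rust
/// Picks the value used for a bare assistant's preset agent type.
/// Falls back to `agent_type` when `backend` is absent or empty.
pub fn effective_preset_agent_type(backend: Option<&str>, agent_type: &str) -> String {
    match backend.map(str::trim) {
        Some(b) if !b.is_empty() => b.to_string(),
        _ => agent_type.to_string(),
    }
}
```

For an aionrs row with no backend, this yields `"aionrs"` rather than an empty string, so the frontend no longer has a reason to treat the assistant as ACP.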

…/offline

Align the backend agent status model with the simplified frontend semantics: a probe only verifies an ACP handshake (reachability), not authorization, so the misleading "available"/"needs_auth" verdicts are dropped in favor of plain online/offline.

- agent_discovery: rename AgentManagementStatus and AgentSnapshotCheckStatus variants Available/Unavailable/NeedsAuth -> Online/Offline (serde snake_case).
- availability: simplify snapshot persistence: record_session_failure always yields offline; drop the success-recording path and auth detection.
- registry: parse "online"/"offline"; derive management status as Offline -> Offline else Online.
- turn_orchestrator: always report session_send_failed on send error; no success recording.
- assistant/service + tests: migrate status values to online/offline.
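A dependency-free sketch of the two-state model follows. The enum names match the description, but the manual `parse`/`as_str` helpers stand in for the serde snake_case derive, and the `Option` input to `derive_management_status` (meaning "no snapshot yet") is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentSnapshotCheckStatus {
    Online,
    Offline,
}

impl AgentSnapshotCheckStatus {
    /// Wire form, matching what a snake_case serialization would produce.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Online => "online",
            Self::Offline => "offline",
        }
    }

    /// Registry-side parsing; anything else is rejected rather than guessed.
    pub fn parse(raw: &str) -> Option<Self> {
        match raw {
            "online" => Some(Self::Online),
            "offline" => Some(Self::Offline),
            _ => None,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentManagementStatus {
    Online,
    Offline,
}

/// "Offline -> Offline else Online": only an explicit offline snapshot
/// marks the agent offline.
pub fn derive_management_status(
    snapshot: Option<AgentSnapshotCheckStatus>,
) -> AgentManagementStatus {
    match snapshot {
        Some(AgentSnapshotCheckStatus::Offline) => AgentManagementStatus::Offline,
        _ => AgentManagementStatus::Online,
    }
}
```

Legacy strings such as "available" or "needs_auth" do not parse in this sketch; how the real registry treats them is not stated here.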

…iders

Deepen the agent health probe from `initialize` to `session/new` so it reflects real usability, not just protocol reachability: `initialize` returns authMethods even for authorized agents and cannot tell apart "reachable but not signed in".

- custom_agent_probe: after `initialize`, open a throwaway `session/new` (no prompt); classify the outcome as Ok / Auth (ACP auth_required, JSON-RPC -32000) / Fail. Applies to both the custom and builtin-managed probe paths.
- api-types: add `TryConnectCustomAgentResponse::FailAuth` (tag `fail_auth`).
- availability: map FailAuth → offline + `auth_required` code; gate aionrs (built-in agent, no external CLI) availability on having at least one enabled model provider, mirroring AssistantService::resolve_default_agent_type; offline + `no_provider` otherwise.
- custom: accept test-on-save when the agent is reachable but auth_required (a valid agent the user just hasn't logged into yet).
- registry: add guidance for auth_required and no_provider error codes.
- The background scheduler shares run_probe, so periodic checks reflect the same session/new-based status.
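The classification and mapping steps can be sketched as pure functions. Everything here is illustrative: `SessionNewResult`, `ProbeOutcome`, `Availability`, and the function names are hypothetical, and leaving the error code empty for generic failures is an assumption, since the description does not name one. The -32000 code and the `auth_required` / `no_provider` strings come from the description.

```rust
/// JSON-RPC error code that ACP agents use for auth_required, per the description.
pub const ACP_AUTH_REQUIRED_CODE: i64 = -32000;

/// Simplified result of the throwaway `session/new` call (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SessionNewResult {
    Created,
    RpcError { code: i64, message: String },
    Transport(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeOutcome {
    Ok,
    FailAuth,
    Fail,
}

pub fn classify(result: &SessionNewResult) -> ProbeOutcome {
    match result {
        SessionNewResult::Created => ProbeOutcome::Ok,
        SessionNewResult::RpcError { code, .. } if *code == ACP_AUTH_REQUIRED_CODE => {
            ProbeOutcome::FailAuth
        }
        SessionNewResult::RpcError { .. } | SessionNewResult::Transport(_) => ProbeOutcome::Fail,
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Availability {
    pub online: bool,
    pub error_code: Option<&'static str>,
}

pub fn availability_from_probe(outcome: ProbeOutcome) -> Availability {
    match outcome {
        ProbeOutcome::Ok => Availability { online: true, error_code: None },
        ProbeOutcome::FailAuth => Availability { online: false, error_code: Some("auth_required") },
        // The code used for generic failures is not stated, so none is set here.
        ProbeOutcome::Fail => Availability { online: false, error_code: None },
    }
}

/// aionrs has no external CLI to probe; it is gated on provider configuration.
pub fn aionrs_availability(enabled_provider_count: usize) -> Availability {
    if enabled_provider_count == 0 {
        Availability { online: false, error_code: Some("no_provider") }
    } else {
        Availability { online: true, error_code: None }
    }
}
```

Keeping `FailAuth` distinct from `Fail` is what allows test-on-save to accept an agent that is reachable but not yet logged in.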

…ackend

An assistant's agent_status was matched to its agent row by `backend` only. aionrs (the built-in Rust agent) has a NULL `backend` and is keyed by `agent_type` ("aionrs"), so every aionrs-backed assistant failed to resolve a row and was mislabelled Missing/unavailable.

Match the agent row on `backend == effective_backend` OR `agent_type.serde_name() == effective_backend`, so aionrs assistants resolve to the real aionrs row and reflect its actual status.

Add a regression test covering an aionrs assistant (row with NULL backend, agent_type Aionrs, Online) resolving to Online instead of Missing.
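The widened match can be sketched as a lookup over simplified rows. `AgentRow`, the two-variant `AgentType`, and `find_agent_row` are hypothetical stand-ins; only the OR condition on `backend` and `agent_type.serde_name()` is taken from the description.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentType {
    Acp,
    Aionrs,
}

impl AgentType {
    /// Serialized name, as a snake_case serde rename would produce.
    pub fn serde_name(self) -> &'static str {
        match self {
            AgentType::Acp => "acp",
            AgentType::Aionrs => "aionrs",
        }
    }
}

/// Simplified agent row; `backend` is NULL (None) for aionrs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AgentRow {
    pub backend: Option<String>,
    pub agent_type: AgentType,
    pub online: bool,
}

/// Matches on the backend column first, then on the agent type's serialized name.
pub fn find_agent_row<'a>(rows: &'a [AgentRow], effective_backend: &str) -> Option<&'a AgentRow> {
    rows.iter().find(|row| {
        row.backend.as_deref() == Some(effective_backend)
            || row.agent_type.serde_name() == effective_backend
    })
}
```

A lookup that compared only `backend` would return `None` for `"aionrs"` here, which corresponds to the Missing mislabel described above.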

zk added 30 commits June 15, 2026 17:57

feat(agent): add connection testing and bare assistant projection

66b487a

Merge remote-tracking branch 'origin/main' into feat/agent-connection…

3c742a9

…-testing-phase2

chore(assistant): remove unused preset id whitelist asset

9ced726

Merge remote-tracking branch 'origin/main' into feat/agent-connection…

645eff8

…-testing-phase2

feat(assistant): prioritize bare assistants on first bootstrap

5f10d94

merge: bring origin/main into feat/agent-connection-testing-phase2

8431999

test(assistant): cover bare assistant projection

6523326

feat(channel): add backend-owned channel settings API

6a040d9

chore: apply auto-fixes (fmt + clippy)

ab35b57

test(api): refresh assistant response fixture

24fb0ce

feat(channel): resolve bindings from assistants

6226104

feat(team): persist assistant identity across team flows

d821d4d

refactor(cron): persist assistant identity in cron config

c176bb8

feat(agent): probe managed builtin acp health

940e287

refactor(agent): drop legacy backend health check route

f14083f

refactor(cron): create conversations with assistant identity

4bdab9b

refactor(channel): normalize assistant bindings on write

7263515

feat(agent): surface management diagnostics guidance

987f8f1

fix(assistant): forbid editing generated assistants

742ba2b

refactor(team): persist assistant identity for new agents

f2e7975

refactor(conversation): inject assistant runtime seeds

86033c2

refactor(cron): strip legacy agent ids on assistant writes

907ace6

refactor(team): prefer assistant ids in mcp tooling

d6c8161

fix(team): resolve spawn backend from assistant ids

afb071e

refactor(team): derive team backends from assistants

6709f85

refactor(conversation): drop redundant preset extra writes

712d5fc

refactor(team): prefer assistant ids in leader prompts

dfd6d18

refactor(channel): create conversations through assistant identities

f71aa96

refactor(team): seed lead prompts from assistants

f5ebc0a

zk added 30 commits June 18, 2026 17:38

Merge branch 'feat/agent-self-repair' into feat/agent-connection-test…

ea7893b

…ing-phase2

fix(team): structure assistant-first errors for i18n

c4072a9

fix(cron): resolve assistant backend from snapshots

86e8d70

chore(merge): merge origin main into agent connection branch

a4f437e

Merge remote-tracking branch 'origin/main' into feat/agent-connection…

9f4d84a

…-testing-phase2

Merge remote-tracking branch 'origin/main' into feat/agent-connection…

8f170cc

…-testing-phase2

Merge branch 'feat/agent-connection-testing-phase2' of github.com:iOf…

5456a25

…ficeAI/AionCore into feat/agent-connection-testing-phase2 * 'feat/agent-connection-testing-phase2' of github.com:iOfficeAI/AionCore:

fix(team): bootstrap TeamRun for assistant-first creation

82f848e

chore: apply auto-fixes (fmt + clippy)

65159ae

fix(ci): stabilize agent availability checks

0eefd92

Merge remote-tracking branch 'origin/main' into feat/agent-connection…

5380220

…-testing-phase2

fix(assistant): unify assistant agent id storage

cee9c09

perf(agent): remove background availability probes

c74e32b

refactor(assistant): normalize assistant and cron identity

237eb92

Merge remote-tracking branch 'origin/main' into feat/agent-connection…

3cb4334

…-testing-phase2 # Conflicts: # crates/aionui-conversation/src/service.rs

chore: apply auto-fixes (fmt + clippy)

1b737a2

test(cron): align workspace e2e fixtures with assistant config

607a409

fix(assistant): include agent id in response projections

643be6b

chore: apply auto-fixes (fmt + clippy)

f50eb38

chore: apply auto-fixes (fmt + clippy)

6af8c1d

refactor(assistant): expose acp backend explicitly

3115845

refactor(assistant): remove duplicate agent id from response

2a13562

chore: apply auto-fixes (fmt + clippy)

69184e1

Merge remote-tracking branch 'origin/main' into feat/agent-connection…

c637666

…-testing-phase2

kaizhou-lab wants to merge 144 commits into main from feat/agent-connection-testing-phase2
