Skip to content

fix(bundles): transition-aware logging for broken connectors (#194)#321

Open
Ovaculos wants to merge 2 commits into
NimbleBrainInc:mainfrom
Ovaculos:fix/transition-aware-bundle-logging
Open

fix(bundles): transition-aware logging for broken connectors (#194)#321
Ovaculos wants to merge 2 commits into
NimbleBrainInc:mainfrom
Ovaculos:fix/transition-aware-bundle-logging

Conversation

@Ovaculos
Copy link
Copy Markdown
Contributor

Summary

  • Replace per-enumeration warn in ToolRegistry.availableTools with edge-triggered warns at BundleLifecycleManager.transition. One warn at the moment a connector enters dead/crashed/reauth_required; one info at recovery to running.
  • Sticky lastBrokenState breadcrumb so multi-step recoveries (notably URL bundle reauth_required → pending_auth → running) still fire the recovery info log.
  • Wire HealthMonitor → lifecycle so stdio subprocess deaths propagate to BundleInstance.state — closes the gap where URL bundles transitioned correctly but stdio bundles stayed stuck at running.
  • Disambiguate same-bundle-in-multiple-workspaces via (serverName, wsId) lookup. McpSource.getWorkspaceId() sources from existing bundleContext.workspaceId (no constructor surface change).

Closes #194.

Behavior change

Scenario Before After
URL bundle stuck dead warn per chat turn (∞) 1 warn at transition, 0 thereafter
Stdio subprocess dies mid-session warn per chat turn, BundleInstance.state stale at running 1 warn at crash detection, state correctly transitions to crashed/dead, UI state pill accurate
URL bundle reconnect (reauth_required → pending_auth → running) silent 1 info log at recovery
Same bundle in ws A + ws B, subprocess crash in A crash detection updates one arbitrary workspace's instance (whichever first in iteration) crash in A updates A's instance, B stays accurate
Healthy enumeration quiet quiet (unchanged)
NB_DEBUG=registry n/a (namespace didn't exist) per-skip debug line for trace-level diagnosis

Design notes

  • Broken-state set = { dead, crashed, reauth_required }. Excludes pending_auth (in-flight OAuth, expected during normal Connect) and not_authenticated (resting state on fresh install / post-Disconnect).
  • Funneltransition() is the single state-write site. recordConnectionStateChange now routes through it so URL and stdio bundles share one logging policy.
  • Sticky-bitBundleInstance.lastBrokenState carries the broken label across non-broken intermediates. Cleared on reaching running (after emitting the info log) and on stopped (explicit operator-end of episode).
  • HealthMonitor coupling — minimal: one optional reportSourceTransition(source, "crashed"|"running"|"dead") callback. The (source → BundleInstance) lookup lives in startServer, not HealthMonitor.

Test plan

  • Unit test/unit/bundle-transition-logging.test.ts — 25 tests: edge semantics, sticky-bit multi-step recovery, broken→broken no-rewarn, operator-stop episode termination, idempotency on no-op
  • Unit test/unit/registry-tool-enumeration-resilience.test.ts — extended: 0 warns across 100 enumerations of broken source (proves registry no longer spams)
  • Integration test/integration/health-monitor-lifecycle.test.ts — 2 tests: full chain HealthMonitor.check → reportSourceTransition → recordCrash → transition → warn; same-serverName-cross-workspace disambiguation
  • bun run verify green: 3162 unit + 644 integration + 24 smoke
  • Manual: install bogus-URL bundle → click Connect → expect single entered 'dead' from 'starting' warn; subsequent chats produce no further warns
  • Manual: corrupt Notion tokens.json → trigger Notion tool call → expect single entered 'reauth_required' warn; reconnect via UI → expect single recovered: 'reauth_required' → 'running' info

Files changed

```
AGENTS.md
src/api/server.ts
src/bundles/connection.ts
src/bundles/lifecycle.ts
src/bundles/types.ts
src/cli/log.ts
src/runtime/runtime.ts
src/tools/health-monitor.ts
src/tools/mcp-source.ts
src/tools/registry.ts
test/integration/health-monitor-lifecycle.test.ts (new)
test/unit/bundle-transition-logging.test.ts (new)
test/unit/registry-tool-enumeration-resilience.test.ts
```

Ovaculos and others added 2 commits May 28, 2026 12:33
Replace per-enumeration warn in ToolRegistry.availableTools with
edge-triggered warn in BundleLifecycleManager.transition. Operator
now gets one warn at the moment a connector enters dead, crashed,
or reauth_required, plus one info on recovery to running.

Sticky lastBrokenState breadcrumb on BundleInstance carries the
broken signal across multi-step recoveries — URL bundles go
reauth_required -> pending_auth -> running, and neither leg is a
direct broken->running edge.

recordConnectionStateChange now funnels through transition() so
URL bundles get the same edge logging path as stdio.

Registry skip path demoted to log.debug("registry", ...) — gated
behind NB_DEBUG=registry for trace-level diagnosis. Refs NimbleBrainInc#194.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire HealthMonitor's crash/recovery/dead detection into
BundleLifecycleManager so stdio subprocess deaths transition
BundleInstance.state through the same funnel as URL bundles.
Without this the user-facing state stayed stuck at "running"
when a subprocess died mid-session; Configure UI showed green
and operators only saw the plain stderr line from
workspace-runtime's boot-start catch.

HealthMonitor exposes a single reportSourceTransition hook
(crashed|running|dead). src/api/server.ts wires it to
recordCrash/Recovery/Dead with a (serverName, wsId) lookup
against runtime.getBundleInstances().

McpSource gains getWorkspaceId() returning bundleContext
.workspaceId (already populated at construction).
Runtime.mcpSources() drops the name-based dedupe — the same
bundle in two workspaces produces two distinct McpSource
processes, both of which need monitoring; collapsing them
would leave one workspace's state pill permanently stale on
a crash.

HealthMonitor's local BundleState renamed HealthRecordState
to stop shadowing src/bundles/types.ts BundleState.

Closes NimbleBrainInc#194.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Throttle availableTools() warning for permanently-stuck sources

1 participant