Fix agent memory leak: InvokeAsync to SendAsync #10

Merged
GordonBeeming merged 1 commit into main from gb/fix-agent-memory-leak on Mar 26, 2026
@GordonBeeming
Owner

Summary

  • Switch all one-way SignalR hub calls from InvokeAsync to SendAsync to eliminate pending invocation state accumulation that caused 85GB memory growth over 21 hours
  • Add CancellationToken propagation to all SignalR methods
  • Bound the Closed event reconnection loop with the stopping token
  • Dispose Process objects immediately on crash detection in HealthCheckAsync

Root Cause

The agent used InvokeAsync (two-way RPC) for all SignalR calls, including one-way methods such as Heartbeat and SessionStatusChanged. Each InvokeAsync allocates a TaskCompletionSource and registers it in an internal pending invocations dictionary. With NativeAOT, this tracking state accumulated as 256 KB VM_ALLOCATE blocks that were never released, reaching about 296,000 blocks (74GB) over 21 hours of runtime.

SendAsync is fire-and-forget -- it serializes and sends without allocating any pending state. Only RegisterAgent (which actually returns AgentRegistrationResult) needs InvokeAsync.
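
A minimal sketch of the call-site change described above, assuming the Microsoft.AspNetCore.SignalR.Client package. The hub method names Heartbeat and RegisterAgent come from this PR; the AgentHubClient wrapper, the AgentRegistrationRequest type, and the shape of AgentRegistrationResult are hypothetical stand-ins, since the actual diff is not reproduced here.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR.Client;

// Hypothetical payload shapes: the real types live in the agent's codebase.
public sealed record AgentRegistrationRequest(string AgentId);
public sealed record AgentRegistrationResult(bool Accepted);

// Hypothetical wrapper around a HubConnection, for illustration only.
public sealed class AgentHubClient
{
    private readonly HubConnection _connection;

    public AgentHubClient(HubConnection connection) => _connection = connection;

    // One-way call: SendAsync does not wait for a response from the hub,
    // so nothing needs to be tracked until a result arrives.
    public Task HeartbeatAsync(CancellationToken cancellationToken) =>
        _connection.SendAsync("Heartbeat", cancellationToken);

    // Two-way call: InvokeAsync<T> waits for the hub method's return value,
    // which is why RegisterAgent keeps using it.
    public Task<AgentRegistrationResult> RegisterAgentAsync(
        AgentRegistrationRequest request,
        CancellationToken cancellationToken) =>
        _connection.InvokeAsync<AgentRegistrationResult>(
            "RegisterAgent", request, cancellationToken);
}
```

A caller such as Worker.cs would pass its stopping token into both methods, so that shutdown cancels in-flight calls.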

Test plan

  • dotnet build -- full solution builds clean
  • dotnet test -- all integration tests pass
  • Monitor agent memory with vmmap --summary $(pgrep claudenest-agent) after deployment to confirm no rapid 256K segment growth

🤖 Generated with Claude Code

…sync

The agent was leaking ~74GB over 21 hours via 296K unreleased 256K
VM_ALLOCATE blocks (GC heap segments). Root cause: all SignalR hub calls
used InvokeAsync (two-way RPC) which allocates pending invocation tracking
state for each call, even for one-way methods like Heartbeat. Combined with
NativeAOT, this pending state accumulated native memory that the GC never
released back to the OS.

Changes:
- Switch all one-way SignalR calls (Heartbeat, SessionStatusChanged,
  DirectoryListing, ReportAllSessions, UpdateStatus) from InvokeAsync to
  SendAsync. Only RegisterAgent (which returns a result) keeps InvokeAsync.
- Add CancellationToken parameter to all SignalR methods and thread
  stoppingToken through from Worker.cs callers.
- Bound the Closed event reconnection loop with the stopping token instead
  of while(true), preventing unbounded retries during shutdown.
- Dispose Process objects immediately when HealthCheck detects a crashed
  session, rather than waiting for the 1-hour cleanup cycle.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: GitButler <gitbutler@gitbutler.com>
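
The bounded reconnection loop from the commit message above could look roughly like the sketch below. It is an illustration under assumptions, not the agent's actual code: ReconnectLoop, RunAsync, the connectAsync delegate, and the backoff parameters are all hypothetical names, and the delegate stands in for whatever call restarts the SignalR connection.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: retries a connect delegate until it succeeds
// or the stopping token is cancelled.
public static class ReconnectLoop
{
    // Returns true once connectAsync succeeds, false if shutdown was requested first.
    public static async Task<bool> RunAsync(
        Func<CancellationToken, Task> connectAsync,
        CancellationToken stoppingToken,
        TimeSpan initialDelay,
        TimeSpan maxDelay)
    {
        var delay = initialDelay;

        // Bounded by the stopping token instead of while (true).
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await connectAsync(stoppingToken);
                return true;
            }
            catch (Exception)
            {
                // Any failure falls through to the backoff wait below.
            }

            try
            {
                // The wait itself is cancellable, so shutdown is not delayed by backoff.
                await Task.Delay(delay, stoppingToken);
            }
            catch (OperationCanceledException)
            {
                break;
            }

            // Double the delay, capped at maxDelay.
            delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
        }

        return false;
    }
}
```

With this shape, a handler attached to the connection's Closed event can await RunAsync with the worker's stopping token and simply return when the result is false.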
@GordonBeeming GordonBeeming marked this pull request as ready for review March 26, 2026 06:20
Copilot AI review requested due to automatic review settings March 26, 2026 06:20
@GordonBeeming GordonBeeming merged commit a3a3565 into main Mar 26, 2026
2 checks passed
@GordonBeeming GordonBeeming deleted the gb/fix-agent-memory-leak branch March 26, 2026 06:20

Copilot AI left a comment


Pull request overview

This PR addresses a reported long-running memory growth issue in the agent by switching one-way SignalR hub calls from InvokeAsync (request/response) to SendAsync (fire-and-forget), and by propagating cancellation throughout the agent’s SignalR and reconnection logic.

Changes:

  • Replace one-way SignalR InvokeAsync calls with SendAsync, keeping InvokeAsync only for RegisterAgent.
  • Propagate CancellationToken through agent-to-hub calls and bound the manual reconnection loop by the stopping token.
  • Dispose Process handles immediately when a crashed session is detected during health checks.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/ClaudeNest.Agent/Worker.cs | Passes the stopping token through to all SignalR send/report operations. |
| src/ClaudeNest.Agent/Services/SignalRConnectionManager.cs | Converts hub calls to SendAsync, adds token propagation, and bounds manual reconnect by stopping token. |
| src/ClaudeNest.Agent/Services/SessionManager.cs | Disposes session Process handle on crash detection during HealthCheckAsync. |


Comment thread src/ClaudeNest.Agent/Services/SignalRConnectionManager.cs
Comment thread src/ClaudeNest.Agent/Services/SessionManager.cs
GordonBeeming added a commit that referenced this pull request Mar 26, 2026
- Catch OperationCanceledException explicitly in the SignalR reconnection
  loop so shutdown cancellation logs cleanly instead of appearing as a
  reconnection failure with noisy backoff warnings.
- Remove Process.Dispose() from HealthCheckAsync crash detection to avoid
  racing with SpawnProcessAsync/MonitorAdoptedProcessAsync that may still
  hold the Process handle. The owning code path remains responsible for
  disposal; the 1-hour cleanup catches any stragglers.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: GitButler <gitbutler@gitbutler.com>
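
One way to separate shutdown cancellation from a genuine reconnection failure, as the first bullet of this follow-up commit describes, is an exception filter on the stopping token. The sketch below is hypothetical: ReconnectAttempt, AttemptOutcome, the connectAsync delegate, and the log callback are illustrative names, not taken from the agent's code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical outcome type for a single reconnection attempt.
public enum AttemptOutcome { Connected, Failed, ShuttingDown }

public static class ReconnectAttempt
{
    public static async Task<AttemptOutcome> TryAsync(
        Func<CancellationToken, Task> connectAsync,
        CancellationToken stoppingToken,
        Action<string> log)
    {
        try
        {
            await connectAsync(stoppingToken);
            return AttemptOutcome.Connected;
        }
        // Matches only when the cancellation came from shutdown, so it is
        // reported as a clean stop rather than a failed attempt.
        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
        {
            log("Reconnection stopped: agent is shutting down.");
            return AttemptOutcome.ShuttingDown;
        }
        // Everything else is a real failure and would trigger backoff in the caller.
        catch (Exception ex)
        {
            log($"Reconnection attempt failed: {ex.Message}");
            return AttemptOutcome.Failed;
        }
    }
}
```

The order of the catch clauses matters: the cancellation handler has to come before the general one, otherwise shutdown would still be logged as a failure.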