Skip to content

feat: Server heartbeats#15

Merged
mnbuhl merged 6 commits intomainfrom
feature/heartbeat-recovery
May 4, 2026
Merged

feat: Server heartbeats#15
mnbuhl merged 6 commits intomainfrom
feature/heartbeat-recovery

Conversation

@mnbuhl
Copy link
Copy Markdown
Owner

@mnbuhl mnbuhl commented May 4, 2026

Summary

Adds process-level heartbeat recovery so Atomizer can detect stale process instances and release their leased Processing jobs across queues without waiting for visibility timeout expiry.

What changed

  • Added heartbeat operations directly to IAtomizerStorage; removed the unused split IAtomizerHeartbeatRecoveryStorage interface.
  • Removed ValidateHeartbeatRecoverySupport because all current drivers support the recovery path.
  • Added active-server heartbeat models and a single hosted heartbeat recovery service.
  • Added heartbeat option defaults/validation and AtomizerRuntimeIdentity delimiter validation.
  • Implemented in-memory heartbeat registry and stale recovery with revived-server no-op behavior.
  • Restored InMemoryStorage.ReleaseLeasedAsync to its original lease-token-index path; stale heartbeat recovery uses a separate predicate helper.
  • Implemented EF active-server entity/configuration plus stale-server recovery using EF-tracked entities inside a transaction.
  • Kept existing provider-specific SQL limited to the hot queue-polling/lease-release paths; heartbeat recovery intentionally avoids extra dialect methods.
  • Reset sample EF migrations and regenerated provider-specific Initial migrations via ./migrations.sh Initial.
  • Added tests for options/identity validation, hosted recovery orchestration, in-memory recovery semantics, EF storage/model support, and existing provider dialect SQL behavior.

Validation

  • ./migrations.sh Initial
  • dotnet build Atomizer.sln -v minimal
  • dotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter "FullyQualifiedName~Heartbeat|FullyQualifiedName~InMemoryStorageTests" -v minimal
  • dotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter "FullyQualifiedName~EntityFrameworkCoreStorageTests|FullyQualifiedName~DialectTests|FullyQualifiedName~Heartbeat" -v minimal
  • DOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln -v minimal

Note: warnings are existing project/runtime/package warnings (net6.0 EOL, xUnit restore compatibility, and known package vulnerability warnings), plus EF migration-generation warnings from the sample project.

mnbuhl and others added 3 commits May 4, 2026 18:34
Atomizer previously relied on visibility timeout and exact-token shutdown release, so crashed process instances could delay reprocessing across queues. This adds a process-level heartbeat registry, a recovery-specific storage contract, and backend-owned stale claim/release semantics for in-memory and supported EF providers while preserving the existing IAtomizerStorage surface.

Constraint: Preserve exact-token QueuePump shutdown release and keep IAtomizerStorage non-breaking
Constraint: Stale recovery must no-op when a server revived before the atomic claim
Rejected: Add heartbeat methods directly to IAtomizerStorage | unnecessary public API break
Rejected: Separate stale discovery removal from job release | race-prone ownership transfer
Confidence: high
Scope-risk: moderate
Directive: Do not loosen EF release-by-instance matching; it must remain anchored to LeaseToken.Delimiter and wildcard-escaped
Tested: dotnet build Atomizer.sln
Tested: dotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter FullyQualifiedName~Heartbeat
Tested: dotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter FullyQualifiedName~DialectTests|FullyQualifiedName~Heartbeat
Tested: DOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln
Not-tested: Plain dotnet test Atomizer.sln without roll-forward because local .NET 6 runtime is not installed
Co-authored-by: OmX <omx@oh-my-codex.dev>
Heartbeat recovery is a rare repair path, so expanding every SQL dialect for a delete plus release-by-prefix made the feature harder to maintain than the saved round trip justified. The EF storage now claims the stale server row and releases matching leases with tracked entities inside one transaction, while normal queue polling keeps its existing provider-specific SQL path.

Constraint: Preserve stale-server claim and matching job release in one EF transaction
Rejected: Provider-specific heartbeat SQL | too much dialect surface for an uncommon recovery path
Confidence: high
Scope-risk: narrow
Directive: Keep heartbeat recovery on the EF tracked-entity path unless profiling shows recovery-path pressure
Tested: dotnet build Atomizer.sln -v minimal
Tested: dotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter FullyQualifiedName~Heartbeat -v minimal
Tested: dotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter FullyQualifiedName~EntityFrameworkCoreStorageTests|FullyQualifiedName~DialectTests|FullyQualifiedName~Heartbeat -v minimal
Tested: DOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln -v minimal
Co-authored-by: OmX <omx@oh-my-codex.dev>
Heartbeat recovery is now part of every supported storage implementation, so the split recovery interface and support-validation hook only added indirection. Folding the heartbeat operations into IAtomizerStorage lets the hosted service use the same resolved storage instance directly. The in-memory release path is restored to its original lease-index-based implementation; heartbeat recovery keeps the separate predicate helper for cross-token stale-instance release. Sample EF migrations were reset and regenerated as provider-specific Initial migrations so new installs include the active-server table and current schedule indexes from the baseline.

Constraint: Current storage drivers all implement heartbeat recovery
Constraint: User requested sample migrations be regenerated via migrations.sh Initial
Rejected: Keep IAtomizerHeartbeatRecoveryStorage | no current consumer resolves it independently
Rejected: Keep ValidateHeartbeatRecoverySupport | all current drivers support the EF-tracked recovery path
Confidence: high
Scope-risk: moderate
Directive: Do not rework InMemoryStorage.ReleaseLeasedAsync without preserving the lease-token index semantics
Tested: ./migrations.sh Initial
Tested: dotnet build Atomizer.sln -v minimal
Tested: dotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter FullyQualifiedName~Heartbeat|FullyQualifiedName~InMemoryStorageTests -v minimal
Tested: dotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter FullyQualifiedName~EntityFrameworkCoreStorageTests|FullyQualifiedName~DialectTests|FullyQualifiedName~Heartbeat -v minimal
Tested: DOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln -v minimal
Co-authored-by: OmX <omx@oh-my-codex.dev>
@mnbuhl mnbuhl changed the title [codex] Recover stale Atomizer process leases feat: Server heartbeats May 4, 2026
@mnbuhl mnbuhl marked this pull request as ready for review May 4, 2026 17:36
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bdff47c09b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

_logger.LogError(ex, "Error during Atomizer heartbeat recovery tick");
}

await Task.Delay(_options.EffectiveStaleSweepInterval, stoppingToken);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Decouple heartbeat cadence from stale sweep interval

ExecuteAsync runs both heartbeat writes and stale-server recovery in one loop, then sleeps for EffectiveStaleSweepInterval; this means setting StaleSweepInterval higher than HeartbeatInterval also slows heartbeats, even though HeartbeatInterval is documented as the heartbeat cadence. In configurations like HeartbeatInterval=30s, StaleSweepInterval=5m, StaleServerTimeout=3m (which currently passes validation), healthy nodes can look stale for minutes and be “recovered,” releasing in-flight jobs prematurely and causing duplicate processing.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

@mnbuhl mnbuhl May 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8835d32. The single BeatAndRecoverAsync loop has been split into two independent loops — RunHeartbeatLoopAsync (delays by HeartbeatInterval) and RunSweepLoopAsync (delays by EffectiveStaleSweepInterval) — running concurrently via Task.WhenAll. A high StaleSweepInterval no longer affects the heartbeat cadence.

{
using var scope = _scopeFactory.CreateScope();
var storage = ResolveStorage(scope.ServiceProvider);
await storage.RemoveHeartbeatAsync(_identity.InstanceId, cancellationToken);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Cancel execute loop before removing local heartbeat

StopAsync removes the heartbeat record before calling base.StopAsync, but BackgroundService cancellation is triggered by base.StopAsync; if a tick is in progress, it can write a new heartbeat after this deletion and before cancellation propagates. That leaves a stale heartbeat row behind on clean shutdown, so other servers may later attempt unnecessary stale-recovery work against a process that already stopped.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8835d32. StopAsync now calls base.StopAsync first, which cancels the stoppingToken and drains both loops to completion before the heartbeat removal is attempted. This eliminates the window where a mid-tick write could re-create the heartbeat row after deletion.

mnbuhl and others added 3 commits May 4, 2026 20:03
The heartbeat PR already had behavior coverage, so this pass stays intentionally small: delete the empty cancellation catch in the hosted-service test, align the storage contract summary with the now-folded heartbeat methods, and remove leftover whitespace around the in-memory heartbeat helper.

Constraint: Preserve the simplified EF-tracked heartbeat recovery and regenerated sample migrations
Rejected: Broader test/helper refactors | unnecessary churn for a cleanup-only pass
Confidence: high
Scope-risk: narrow
Directive: Keep future deslop passes behavior-preserving and avoid touching generated migrations unless regenerating them intentionally
Tested: dotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter FullyQualifiedName~Heartbeat|FullyQualifiedName~InMemoryStorageTests -v minimal
Tested: dotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter (FullyQualifiedName~SqliteStorageTestsExecutor&FullyQualifiedName~Heartbeat)|FullyQualifiedName~DialectTests -v minimal
Tested: dotnet build Atomizer.sln -v minimal
Tested: DOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln -v minimal
Co-authored-by: OmX <omx@oh-my-codex.dev>
…after loop stops

- Split single BeatAndRecoverAsync loop into RunHeartbeatLoopAsync (delays by
  HeartbeatInterval) and RunSweepLoopAsync (delays by EffectiveStaleSweepInterval)
  so a long StaleSweepInterval no longer slows heartbeat writes.
- In StopAsync, call base.StopAsync first so the stoppingToken is cancelled and
  both loops exit before the heartbeat row is deleted, preventing a race where a
  mid-tick write could re-create the row after deletion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merged origin/main into feature/heartbeat-recovery and resolved the generated EF migration timestamp conflicts by replacing the sample migration artifacts with fresh provider-specific Initial migrations from migrations.sh. The storage cleanup conflict keeps the shared cleanup helper and extends it to clear active server heartbeat rows so merged EF tests remain isolated.

Constraint: User requested merging main into feature/heartbeat-recovery and regenerating sample migrations as Initial
Rejected: Keep either branch's timestamped sample migrations | stale generated artifacts would preserve the conflict source
Confidence: high
Scope-risk: moderate
Directive: Regenerate sample EF migrations after model-shape changes rather than hand-editing timestamped migration artifacts
Tested: ./migrations.sh Initial
Tested: csharpier check .
Tested: dotnet build
Tested: dotnet test --no-build passed for net8.0/net10.0 test assemblies
Not-tested: net6.0 test assemblies did not run because Microsoft.NETCore.App 6.0.0 is not installed locally
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Code Coverage

Package Line Rate Health
Atomizer 12%
Atomizer.EntityFrameworkCore 88%
Atomizer 72%
Atomizer 72%
Atomizer 12%
Atomizer.EntityFrameworkCore 88%
Atomizer 72%
Atomizer 12%
Atomizer.EntityFrameworkCore 88%
Summary 59% (7305 / 13164)

@mnbuhl mnbuhl merged commit e65485a into main May 4, 2026
1 check passed
@mnbuhl mnbuhl deleted the feature/heartbeat-recovery branch May 5, 2026 06:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant