Conversation
Atomizer previously relied on visibility timeout and exact-token shutdown release, so crashed process instances could delay reprocessing across queues. This adds a process-level heartbeat registry, a recovery-specific storage contract, and backend-owned stale claim/release semantics for in-memory and supported EF providers while preserving the existing IAtomizerStorage surface. Constraint: Preserve exact-token QueuePump shutdown release and keep IAtomizerStorage non-breaking Constraint: Stale recovery must no-op when a server revived before the atomic claim Rejected: Add heartbeat methods directly to IAtomizerStorage | unnecessary public API break Rejected: Separate stale discovery removal from job release | race-prone ownership transfer Confidence: high Scope-risk: moderate Directive: Do not loosen EF release-by-instance matching; it must remain anchored to LeaseToken.Delimiter and wildcard-escaped Tested: dotnet build Atomizer.sln Tested: dotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter FullyQualifiedName~Heartbeat Tested: dotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter FullyQualifiedName~DialectTests|FullyQualifiedName~Heartbeat Tested: DOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln Not-tested: Plain dotnet test Atomizer.sln without roll-forward because local .NET 6 runtime is not installed Co-authored-by: OmX <omx@oh-my-codex.dev>
Heartbeat recovery is a rare repair path, so expanding every SQL dialect for a delete plus release-by-prefix made the feature harder to maintain than the saved round trip justified. The EF storage now claims the stale server row and releases matching leases with tracked entities inside one transaction, while normal queue polling keeps its existing provider-specific SQL path. Constraint: Preserve stale-server claim and matching job release in one EF transaction Rejected: Provider-specific heartbeat SQL | too much dialect surface for an uncommon recovery path Confidence: high Scope-risk: narrow Directive: Keep heartbeat recovery on the EF tracked-entity path unless profiling shows recovery-path pressure Tested: dotnet build Atomizer.sln -v minimal Tested: dotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter FullyQualifiedName~Heartbeat -v minimal Tested: dotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter FullyQualifiedName~EntityFrameworkCoreStorageTests|FullyQualifiedName~DialectTests|FullyQualifiedName~Heartbeat -v minimal Tested: DOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln -v minimal Co-authored-by: OmX <omx@oh-my-codex.dev>
Heartbeat recovery is now part of every supported storage implementation, so the split recovery interface and support-validation hook only added indirection. Folding the heartbeat operations into IAtomizerStorage lets the hosted service use the same resolved storage instance directly. The in-memory release path is restored to its original lease-index-based implementation; heartbeat recovery keeps the separate predicate helper for cross-token stale-instance release. Sample EF migrations were reset and regenerated as provider-specific Initial migrations so new installs include the active-server table and current schedule indexes from the baseline. Constraint: Current storage drivers all implement heartbeat recovery Constraint: User requested sample migrations be regenerated via migrations.sh Initial Rejected: Keep IAtomizerHeartbeatRecoveryStorage | no current consumer resolves it independently Rejected: Keep ValidateHeartbeatRecoverySupport | all current drivers support the EF-tracked recovery path Confidence: high Scope-risk: moderate Directive: Do not rework InMemoryStorage.ReleaseLeasedAsync without preserving the lease-token index semantics Tested: ./migrations.sh Initial Tested: dotnet build Atomizer.sln -v minimal Tested: dotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter FullyQualifiedName~Heartbeat|FullyQualifiedName~InMemoryStorageTests -v minimal Tested: dotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter FullyQualifiedName~EntityFrameworkCoreStorageTests|FullyQualifiedName~DialectTests|FullyQualifiedName~Heartbeat -v minimal Tested: DOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln -v minimal Co-authored-by: OmX <omx@oh-my-codex.dev>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bdff47c09b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| _logger.LogError(ex, "Error during Atomizer heartbeat recovery tick"); | ||
| } | ||
|
|
||
| await Task.Delay(_options.EffectiveStaleSweepInterval, stoppingToken); |
There was a problem hiding this comment.
Decouple heartbeat cadence from stale sweep interval
ExecuteAsync runs both heartbeat writes and stale-server recovery in one loop, then sleeps for EffectiveStaleSweepInterval; this means setting StaleSweepInterval higher than HeartbeatInterval also slows heartbeats, even though HeartbeatInterval is documented as the heartbeat cadence. In configurations like HeartbeatInterval=30s, StaleSweepInterval=5m, StaleServerTimeout=3m (which currently passes validation), healthy nodes can look stale for minutes and be “recovered,” releasing in-flight jobs prematurely and causing duplicate processing.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 8835d32. The single BeatAndRecoverAsync loop has been split into two independent loops — RunHeartbeatLoopAsync (delays by HeartbeatInterval) and RunSweepLoopAsync (delays by EffectiveStaleSweepInterval) — running concurrently via Task.WhenAll. A high StaleSweepInterval no longer affects the heartbeat cadence.
| { | ||
| using var scope = _scopeFactory.CreateScope(); | ||
| var storage = ResolveStorage(scope.ServiceProvider); | ||
| await storage.RemoveHeartbeatAsync(_identity.InstanceId, cancellationToken); |
There was a problem hiding this comment.
Cancel execute loop before removing local heartbeat
StopAsync removes the heartbeat record before calling base.StopAsync, but BackgroundService cancellation is triggered by base.StopAsync; if a tick is in progress, it can write a new heartbeat after this deletion and before cancellation propagates. That leaves a stale heartbeat row behind on clean shutdown, so other servers may later attempt unnecessary stale-recovery work against a process that already stopped.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 8835d32. StopAsync now calls base.StopAsync first, which cancels the stoppingToken and drains both loops to completion before the heartbeat removal is attempted. This eliminates the window where a mid-tick write could re-create the heartbeat row after deletion.
The heartbeat PR already had behavior coverage, so this pass stays intentionally small: delete the empty cancellation catch in the hosted-service test, align the storage contract summary with the now-folded heartbeat methods, and remove leftover whitespace around the in-memory heartbeat helper. Constraint: Preserve the simplified EF-tracked heartbeat recovery and regenerated sample migrations Rejected: Broader test/helper refactors | unnecessary churn for a cleanup-only pass Confidence: high Scope-risk: narrow Directive: Keep future deslop passes behavior-preserving and avoid touching generated migrations unless regenerating them intentionally Tested: dotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter FullyQualifiedName~Heartbeat|FullyQualifiedName~InMemoryStorageTests -v minimal Tested: dotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter (FullyQualifiedName~SqliteStorageTestsExecutor&FullyQualifiedName~Heartbeat)|FullyQualifiedName~DialectTests -v minimal Tested: dotnet build Atomizer.sln -v minimal Tested: DOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln -v minimal Co-authored-by: OmX <omx@oh-my-codex.dev>
…after loop stops - Split single BeatAndRecoverAsync loop into RunHeartbeatLoopAsync (delays by HeartbeatInterval) and RunSweepLoopAsync (delays by EffectiveStaleSweepInterval) so a long StaleSweepInterval no longer slows heartbeat writes. - In StopAsync, call base.StopAsync first so the stoppingToken is cancelled and both loops exit before the heartbeat row is deleted, preventing a race where a mid-tick write could re-create the row after deletion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merged origin/main into feature/heartbeat-recovery and resolved the generated EF migration timestamp conflicts by replacing the sample migration artifacts with fresh provider-specific Initial migrations from migrations.sh. The storage cleanup conflict keeps the shared cleanup helper and extends it to clear active server heartbeat rows so merged EF tests remain isolated. Constraint: User requested merging main into feature/heartbeat-recovery and regenerating sample migrations as Initial Rejected: Keep either branch's timestamped sample migrations | stale generated artifacts would preserve the conflict source Confidence: high Scope-risk: moderate Directive: Regenerate sample EF migrations after model-shape changes rather than hand-editing timestamped migration artifacts Tested: ./migrations.sh Initial Tested: csharpier check . Tested: dotnet build Tested: dotnet test --no-build passed for net8.0/net10.0 test assemblies Not-tested: net6.0 test assemblies did not run because Microsoft.NETCore.App 6.0.0 is not installed locally
Summary
Adds process-level heartbeat recovery so Atomizer can detect stale process instances and release their leased
Processingjobs across queues without waiting for visibility timeout expiry.What changed
IAtomizerStorage; removed the unused splitIAtomizerHeartbeatRecoveryStorageinterface.ValidateHeartbeatRecoverySupportbecause all current drivers support the recovery path.AtomizerRuntimeIdentitydelimiter validation.InMemoryStorage.ReleaseLeasedAsyncto its original lease-token-index path; stale heartbeat recovery uses a separate predicate helper.Initialmigrations via./migrations.sh Initial.Validation
./migrations.sh Initialdotnet build Atomizer.sln -v minimaldotnet test tests/Atomizer.Tests/Atomizer.Tests.csproj --framework net8.0 --filter "FullyQualifiedName~Heartbeat|FullyQualifiedName~InMemoryStorageTests" -v minimaldotnet test tests/Atomizer.EntityFrameworkCore.Tests/Atomizer.EntityFrameworkCore.Tests.csproj --framework net8.0 --filter "FullyQualifiedName~EntityFrameworkCoreStorageTests|FullyQualifiedName~DialectTests|FullyQualifiedName~Heartbeat" -v minimalDOTNET_ROLL_FORWARD=Major dotnet test Atomizer.sln -v minimalNote: warnings are existing project/runtime/package warnings (net6.0 EOL, xUnit restore compatibility, and known package vulnerability warnings), plus EF migration-generation warnings from the sample project.