Skip to content

fix: return all eligible FIFO partition jobs per sweep instead of one#18

Closed
mnbuhl wants to merge 4 commits intomainfrom
fix/fifo-partition-multi-job-sweep
Closed

fix: return all eligible FIFO partition jobs per sweep instead of one#18
mnbuhl wants to merge 4 commits intomainfrom
fix/fifo-partition-multi-job-sweep

Conversation

@mnbuhl
Copy link
Copy Markdown
Owner

@mnbuhl mnbuhl commented May 4, 2026

Summary

  • Previously GetDueJobsAsync returned at most one job per partition per sweep via a partition_heads CTE that joined on MIN(sequence_number), capping throughput to one job per partition per poll cycle regardless of batch size.
  • Removes the partition_heads CTE from all three SQL dialects (PostgreSQL, SQL Server, MySQL) and the EF Core LINQ fallback, replacing it with a simple blocked_partitions filter.
  • A partition is still excluded entirely if it has a Processing job or a Pending+retrying job — FIFO ordering and blocking semantics are fully preserved. All eligible jobs from unblocked partitions are now returned up to batchSize, ordered by (scheduled_at, sequence_number).

Changes

  • PostgreSqlDialect, SqlServerDialect, MySqlDialect — simplified GetDueJobs SQL
  • EntityFrameworkCoreStorage — simplified LINQ fallback (removed partitionHeads groupBy)
  • InMemoryStorage — removed partitionHeads groupBy, same eligibility logic
  • IAtomizerStorage — updated XML docs to reflect new contract
  • AtomizerStorageContractTests — updated FIFO-07 contract tests; added batch-size cap test
  • InMemoryStorageTests — updated one test that asserted old one-per-partition behaviour

Test plan

  • dotnet test tests/Atomizer.Tests — 125/125 passed
  • dotnet test tests/Atomizer.EntityFrameworkCore.Tests — 136/136 passed (includes SQLite contract tests)
  • csharpier check . — no formatting issues

🤖 Generated with Claude Code

Previously GetDueJobsAsync returned at most one job per partition per
sweep via a partition_heads CTE that joined on MIN(sequence_number).
This capped throughput to one job per partition per poll cycle.

The fix removes the partition_heads CTE and replaces it with a simple
blocked_partitions filter: any partition with a Processing job or a
Pending job with prior attempts is excluded entirely; all other
partitions contribute all their eligible jobs up to the batch size cap.
Ordering by (scheduled_at, sequence_number) preserves FIFO delivery.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6e49d2c36

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/Atomizer/Storage/InMemoryStorage.cs Outdated

var candidates = unpartitioned
.Concat(partitionHeads)
var candidates = eligible
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep one leaseable head per partition

Selecting directly from eligible now allows multiple jobs with the same partition key to be returned in one GetDueJobsAsync sweep. Because QueuePoller.RunAsync leases every returned job and QueuePump can hand them to concurrent workers, a later job in the partition can run before an earlier one has completed or retried (for example after the first job is rescheduled with Attempts > 0), which breaks the documented “processed one at a time” partition semantics and can reorder side effects.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — this was a semantic break. Returning multiple jobs from the same partition per sweep allows a later job to be leased and executed before an earlier one finishes retrying, which violates the 'processed one at a time' FIFO guarantee. Reverted in e1c471b; the PR now only contains the flaky timing test fix.

mnbuhl and others added 3 commits May 4, 2026 23:18
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
500ms slack over a 1s grace period is too tight for CI. The key
invariant is that StopAsync returns well before the 3s worker
finishes; 2.5s proves that without being brittle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Returning multiple jobs from the same partition per sweep breaks the
documented invariant: a later job can be leased and executed before an
earlier one finishes retrying, reordering side effects.

All partition-logic changes are reverted. Only the flaky timing test fix
(grace-period upper bound 1.5s → 2.5s) remains in this PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mnbuhl
Copy link
Copy Markdown
Owner Author

mnbuhl commented May 4, 2026

Current setup will not allow this since JobWorker can run with multiple threads

@mnbuhl mnbuhl closed this May 4, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Code Coverage

Package Line Rate Health
Atomizer 72%
Atomizer 12%
Atomizer.EntityFrameworkCore 89%
Atomizer 12%
Atomizer.EntityFrameworkCore 89%
Atomizer 72%
Atomizer 12%
Atomizer.EntityFrameworkCore 89%
Atomizer 72%
Summary 59% (7371 / 13215)

@mnbuhl mnbuhl deleted the fix/fifo-partition-multi-job-sweep branch May 5, 2026 06:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant