
Add hook HookQueueStateCount + read middleware as hooks and vice versa#1203

Open
brandur wants to merge 1 commit into master from brandur-queue-state-count

Conversation

@brandur
Contributor

@brandur brandur commented Apr 12, 2026

Here, add a new hook called `HookQueueStateCount` which gets invoked to
produce job queue count statistics. We do this by adding a new
maintenance service which, like other maintenance services, runs only on
the leader, so we only have one client performing counts at any given
time. Furthermore, in order not to introduce a potential operational
problem without opt-in from River users, the counts only run if a
`HookQueueStateCount` hook/middleware has been added to the client.

The reason we do all this is to implement a feature requested by one of
our users: for `otelriver` in contrib to be able to emit queue count
metrics, which seems like a pretty reasonable ask for the package to be
able to do, and something that every River user would likely want access
to in their ops charts.

A slight oddity, but one which I think is probably okay, is that the new
hook ideally stays a hook, while the existing `otelriver` middleware is a
middleware. It'd be nice not to have to put `otelriver.Middleware` into
both a client's `Hooks` and `Middleware` configuration, so we modify the
client to allow for hooks that are middleware and middleware that are
hooks. This lets `otelriver.Middleware` continue doing what it was
already doing, but also start producing new counts as a hook.
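The hooks-that-are-middleware idea can be sketched with a small, self-contained model (toy interfaces and names, not River's actual API): a single type satisfies both a hook interface and a middleware interface, and the client type-asserts across whatever was registered, so one registration covers both roles.

```go
package main

import "fmt"

// QueueStateCounts is a stand-in for the per-queue, per-state counts the
// hypothetical HookQueueStateCount hook would receive.
type QueueStateCounts map[string]map[string]int

// queueStateCountHook is a toy version of the new hook kind.
type queueStateCountHook interface {
	QueueStateCount(counts QueueStateCounts)
}

// jobMiddleware is a toy version of an existing middleware kind.
type jobMiddleware interface {
	Work(job string, next func(string) error) error
}

// otelLike plays the role of otelriver.Middleware: it's registered as
// middleware, but also implements the hook interface, so a client that
// checks registrations for both interfaces only needs it configured once.
type otelLike struct{ lastCounts QueueStateCounts }

func (o *otelLike) Work(job string, next func(string) error) error {
	return next(job) // wrap job execution, as middleware would
}

func (o *otelLike) QueueStateCount(counts QueueStateCounts) {
	o.lastCounts = counts // a real implementation would emit metrics here
}

// runHooks mimics the client side: given registered middleware, invoke
// any that also happen to satisfy the hook interface.
func runHooks(middlewares []jobMiddleware, counts QueueStateCounts) {
	for _, m := range middlewares {
		if h, ok := m.(queueStateCountHook); ok {
			h.QueueStateCount(counts)
		}
	}
}

func main() {
	o := &otelLike{}
	runHooks([]jobMiddleware{o}, QueueStateCounts{
		"default": {"available": 3, "running": 1},
	})
	fmt.Println(o.lastCounts["default"]["available"]) // prints "3"
}
```

The design choice this models is the duck-typed scan: the client never needs a combined "hook and middleware" type, it just type-asserts each registered value against every interface it cares about.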

@brandur brandur force-pushed the brandur-queue-state-count branch 3 times, most recently from af432f2 to f202436 Compare April 12, 2026 05:56
brandur added a commit to riverqueue/rivercontrib that referenced this pull request Apr 12, 2026
See [1] for context, but here we implement `HookQueueStateCount` in the
existing OTEL middleware so that it can emit metrics around queue
counts, a pretty handy feature that we have a definitive request for,
but which I'd guess most River users would like to have access to.

[1] riverqueue/river#1203
@brandur
Contributor Author

brandur commented Apr 12, 2026

@codex review

@brandur brandur force-pushed the brandur-queue-state-count branch from f202436 to 109e1fe Compare April 12, 2026 06:09

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2024369e5


@brandur brandur force-pushed the brandur-queue-state-count branch 3 times, most recently from 7266cb8 to b7694fa Compare April 12, 2026 06:44
brandur added a commit to riverqueue/rivercontrib that referenced this pull request Apr 12, 2026
@brandur brandur requested a review from bgentry April 12, 2026 06:53
@brandur brandur force-pushed the brandur-queue-state-count branch from b7694fa to 3d3c21a Compare April 12, 2026 17:31
Contributor

@bgentry bgentry left a comment


It should probably be straightforward to update riverui to work with these changes, but we should make sure before merging, since this breaks an API it uses.


JobCancel(ctx context.Context, params *JobCancelParams) (*rivertype.JobRow, error)
JobCountByAllStates(ctx context.Context, params *JobCountByAllStatesParams) (map[rivertype.JobState]int, error)
JobCountByQueueAndState(ctx context.Context, params *JobCountByQueueAndStateParams) ([]*JobCountByQueueAndStateResult, error)
Contributor


This is a breaking change for the UI. It has a few callers in https://github.com/riverqueue/riverui/blob/fa202780c19504cb8510d0b2c8d92f8ff31d8dc9/handler_api_endpoint.go

I was going to hold off on asking about this, but in light of this break, I'm wondering whether we should give any thought to whether the UI runs its own queries for this purpose or somehow reuses logic in River, so that we don't have multiple redundant queriers. But it's probably going to be too hard to try to share that state among backend nodes and UI frontend nodes.

Comment on lines +90 to +96
queue,
state,
COUNT(*) AS count
FROM /* TEMPLATE: schema */river_job
WHERE queue = ANY(@queue_names::text[])
GROUP BY queue, state
ORDER BY queue, state;
Contributor


Also I believe this might break the prior query's efficient usage of the prioritized fetch index; I'm running an agent w/ benchmarks now to get some more details on whether this is a significant perf regression.

Contributor

@bgentry bgentry Apr 13, 2026


Yeah so I opened #1211 with a perf benchmark showing this is a huge regression. Essentially the old split format was able to effectively leverage the index whereas this variant does not:

GPT 5.4 high explanation:

The new query is inefficient for the current schema because it filters on queue and groups on (queue, state), but the important existing btree on river_job is keyed (state, queue, priority, scheduled_at, id). PostgreSQL btree indexes are most useful when the query constrains the leftmost columns first, and this query no longer does that. Because state is not fixed in the predicate, the planner cannot use the index as a selective access path in the same way, so it tends to fall back to a parallel sequential scan and aggregate after reading a large slice of the table.

The old query shape was more appropriate for the existing index because it split the work into separate branches with constant state predicates, specifically state = 'available' and state = 'running'. That matches the left edge of the (state, queue, ...) index, and once Postgres has a fixed state, it can efficiently narrow by queue within the same index. In practice that gave us index-only or bitmap/index-driven plans instead of scanning the table broadly, which is why the old query performed much better on selective queue lists.

If we want to fix this without adding indexes, the main option is to reshape the new query so it restores a constant-state predicate per branch. Concretely, that means counting each state in its own grouped subquery, then UNION ALLing those results together or joining them back to the requested queue list. That is more verbose than the current grouped query, but it realigns the SQL with the index we already have and gives the planner a much better chance of using it efficiently.

The benchmark is probably useful regardless of how we proceed on this PR but I'll let you decide how to proceed on it ✌️
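The per-branch reshape the quoted explanation suggests might look something like the sketch below (a hypothetical illustration, not the actual proposed fix; it assumes the same `@queue_names` parameter and the two hot states from the old split query, with one constant-state predicate per branch so each branch can match the left edge of the `(state, queue, priority, scheduled_at, id)` index):

```sql
-- Hypothetical reshape: one branch per state, each with a constant
-- state predicate, UNION ALLed back together.
SELECT queue, 'available' AS state, count(*) AS count
FROM river_job
WHERE state = 'available'
  AND queue = ANY(@queue_names::text[])
GROUP BY queue
UNION ALL
SELECT queue, 'running' AS state, count(*) AS count
FROM river_job
WHERE state = 'running'
  AND queue = ANY(@queue_names::text[])
GROUP BY queue
ORDER BY queue, state;
```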

bgentry added a commit that referenced this pull request Apr 13, 2026
This adds a benchmark on top of #1203 to make the queue-state count
query regression easy to reproduce and discuss. The query in that branch
now groups by `(queue, state)` while filtering only on `queue`, which no
longer lines up with the existing `(state, queue, priority,
scheduled_at, id)` index on `river_job`.

The new benchmark seeds a migrated `river_job` table and compares the
current `JobCountByQueueAndState` implementation against the legacy
query shape for a few queue-list sizes. That gives us a durable way to
show the planner behavior and quantify the difference before deciding
whether to reshape the SQL or add another index.
bgentry added a commit that referenced this pull request Apr 13, 2026
This adds a benchmark on top of #1203 to make the queue-state count
query regression easy to reproduce and discuss. It compares the current
`JobCountByQueueAndState` implementation against the legacy query shape
on the same migrated `river_job` schema.

The benchmark stays lightweight by default, but can be scaled locally
with `RIVER_BENCH_QUEUE_STATE_COUNT_NUM_JOBS` to reproduce the planner
regression with a couple hundred thousand rows and quantify the gap.