Add hook HookQueueStateCount + read middleware as hooks and vice versa #1203
See [1] for context, but here we implement `HookQueueStateCount` in the existing OTEL middleware so that it can emit metrics around queue counts, a pretty handy feature that we have a definite request for, and which I'd guess most River users would like access to. [1] riverqueue/river#1203
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f2024369e5
Here, add a new hook called `HookQueueStateCount` which gets invoked to produce job queue count statistics. We do this by adding a new maintenance service which, like other maintenance services, runs only on the leader, so we only have one client performing counts at any given time. Furthermore, in order to not introduce a potential operational problem without opt-in from River users, the counts only run if a `HookQueueStateCount` hook/middleware has been added to the client.

The reason we do all this is to implement a feature requested by one of our users: for `otelriver` in contrib to be able to emit queue count metrics, which seems like a pretty reasonable ask for the package, and something that every River user would likely want access to in their ops charts.

A slight oddity, but one which I think is _probably_ okay, is that the new hook ideally stays a hook, but the existing `otelriver` middleware is a middleware. It'd be nice not to have to put `otelriver.Middleware` into both a client's `Hooks` and `Middleware` configuration, so we modify the client to allow for hooks that are middleware and middleware that are hooks. This lets `otelriver.Middleware` continue doing what it was already doing, but also start producing new counts as a hook.
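To make the hook/middleware duality above concrete, here's a minimal Go sketch. The type and method names (`HookQueueStateCount`, `QueueStateCounts`, `metricsHook`) are assumptions based on the description, not River's actual API; the real interfaces may differ.

```go
package main

import (
	"context"
	"fmt"
	"sort"
)

// QueueStateCounts is a hypothetical shape for the data the hook receives:
// queue name -> job state -> count. River's real type may differ.
type QueueStateCounts map[string]map[string]int

// HookQueueStateCount is the assumed hook interface: implementations are
// invoked periodically with fresh queue/state counts.
type HookQueueStateCount interface {
	QueueStateCount(ctx context.Context, counts QueueStateCounts) error
}

// formatCounts renders counts as metric-style lines, sketching what an
// otelriver-like implementation might emit as gauges.
func formatCounts(counts QueueStateCounts) []string {
	var lines []string
	for queue, states := range counts {
		for state, n := range states {
			lines = append(lines, fmt.Sprintf("river.queue.count{queue=%q,state=%q} = %d", queue, state, n))
		}
	}
	sort.Strings(lines) // map iteration order is random; sort for determinism
	return lines
}

// metricsHook stands in for a middleware-like type that also satisfies the
// hook interface, so one value can go in both Hooks and Middleware.
type metricsHook struct{}

var _ HookQueueStateCount = &metricsHook{}

func (m *metricsHook) QueueStateCount(ctx context.Context, counts QueueStateCounts) error {
	for _, line := range formatCounts(counts) {
		fmt.Println(line)
	}
	return nil
}

func main() {
	var hook HookQueueStateCount = &metricsHook{}
	_ = hook.QueueStateCount(context.Background(), QueueStateCounts{
		"default": {"available": 3, "running": 1},
	})
}
```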
bgentry
left a comment
It should probably be straightforward to update riverui to work with these changes, but we should make sure before merging since this breaks an API it uses.
```go
JobCancel(ctx context.Context, params *JobCancelParams) (*rivertype.JobRow, error)
JobCountByAllStates(ctx context.Context, params *JobCountByAllStatesParams) (map[rivertype.JobState]int, error)
JobCountByQueueAndState(ctx context.Context, params *JobCountByQueueAndStateParams) ([]*JobCountByQueueAndStateResult, error)
```
This is a breaking change for the UI. It has a few callers in https://github.com/riverqueue/riverui/blob/fa202780c19504cb8510d0b2c8d92f8ff31d8dc9/handler_api_endpoint.go
I was going to hold off on asking about this, but in light of this break, I'm wondering whether we should give any thought to whether the UI runs its own queries for this purpose or somehow reuses logic in River so that we don't have multiple redundant queriers. But it's probably going to be too hard to try to share that state between backend nodes and UI frontend nodes.
```sql
SELECT
    queue,
    state,
    COUNT(*) AS count
FROM /* TEMPLATE: schema */river_job
WHERE queue = ANY(@queue_names::text[])
GROUP BY queue, state
ORDER BY queue, state;
```
Also I believe this might break the prior query's efficient usage of the prioritized fetch index; I'm running an agent w/ benchmarks now to get some more details on whether this is a significant perf regression.
Yeah so I opened #1211 with a perf benchmark showing this is a huge regression. Essentially the old split format was able to effectively leverage the index whereas this variant does not:
GPT 5.4 high explanation:
The new query is inefficient for the current schema because it filters on `queue` and groups on `(queue, state)`, but the important existing btree on `river_job` is keyed `(state, queue, priority, scheduled_at, id)`. PostgreSQL btree indexes are most useful when the query constrains the leftmost columns first, and this query no longer does that. Because `state` is not fixed in the predicate, the planner cannot use the index as a selective access path in the same way, so it tends to fall back to a parallel sequential scan and aggregate after reading a large slice of the table.

The old query shape was more appropriate for the existing index because it split the work into separate branches with constant `state` predicates, specifically `state = 'available'` and `state = 'running'`. That matches the left edge of the `(state, queue, ...)` index, and once Postgres has a fixed `state`, it can efficiently narrow by `queue` within the same index. In practice that gave us index-only or bitmap/index-driven plans instead of scanning the table broadly, which is why the old query performed much better on selective queue lists.

If we want to fix this without adding indexes, the main option is to reshape the new query so it restores a constant-`state` predicate per branch. Concretely, that means counting each state in its own grouped subquery, then `UNION ALL`-ing those results together or joining them back to the requested queue list. That is more verbose than the current grouped query, but it realigns the SQL with the index we already have and gives the planner a much better chance of using it efficiently.
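As a rough illustration of that reshape, the per-state `UNION ALL` form might look like the following. This is a sketch only: it shows two state branches, and the column/parameter names are taken from the query above; the real fix would need one branch per state being counted.

```sql
-- Sketch: restore a constant-state predicate per branch so each branch
-- can use the left edge of the (state, queue, ...) index.
SELECT queue, 'available' AS state, COUNT(*) AS count
FROM /* TEMPLATE: schema */river_job
WHERE state = 'available'
  AND queue = ANY(@queue_names::text[])
GROUP BY queue
UNION ALL
SELECT queue, 'running' AS state, COUNT(*) AS count
FROM /* TEMPLATE: schema */river_job
WHERE state = 'running'
  AND queue = ANY(@queue_names::text[])
GROUP BY queue
-- ...one branch per state of interest...
ORDER BY queue, state;
```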
The benchmark is probably useful regardless of how we proceed on this PR but I'll let you decide how to proceed on it ✌️
This adds a benchmark on top of #1203 to make the queue-state count query regression easy to reproduce and discuss. The query in that branch now groups by `(queue, state)` while filtering only on `queue`, which no longer lines up with the existing `(state, queue, priority, scheduled_at, id)` index on `river_job`. The new benchmark seeds a migrated `river_job` table and compares the current `JobCountByQueueAndState` implementation against the legacy query shape for a few queue-list sizes. That gives us a durable way to show the planner behavior and quantify the difference before deciding whether to reshape the SQL or add another index.
This adds a benchmark on top of #1203 to make the queue-state count query regression easy to reproduce and discuss. It compares the current `JobCountByQueueAndState` implementation against the legacy query shape on the same migrated `river_job` schema. The benchmark stays lightweight by default, but can be scaled locally with `RIVER_BENCH_QUEUE_STATE_COUNT_NUM_JOBS` to reproduce the planner regression with a couple hundred thousand rows and quantify the gap.