Add queue + GC Indexes #310

Closed
MOZGIII wants to merge 1 commit into main from mzg/2026-04-02/db-idx-1

Conversation


@MOZGIII MOZGIII commented Apr 2, 2026

This PR adds targeted Postgres indexes to reduce long-term queueing and cleanup slowdown as table sizes grow, without changing runtime behavior.

Runtime Calls Affected

  • Queue dequeue/claim path:
    • Query label: select:queued_instances
    • Method: poll_queued_instances_once
  • Expired lock reclaim path:
    • Query label: update:queued_instances_expired_unlock
    • Method: reclaim_expired_instance_locks
  • Done-instance garbage collection candidate scan:
    • Query label: select:runner_instances_gc_candidates
    • Method: collect_done_instances_impl
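
Given the predicates implied by the query labels above, the indexes are most likely partial B-tree indexes matching each scan. The sketch below is illustrative only; the table and column names are guesses, not the actual schema from the diff:

```sql
-- Hypothetical sketch: partial indexes matching each hot path's predicate.
-- Table/column names are assumptions; adjust to the real schema.

-- Dequeue/claim path (select:queued_instances): scan only unlocked rows.
CREATE INDEX idx_queued_instances_poll
    ON queued_instances (created_at)
    WHERE locked_until IS NULL;

-- Expired lock reclaim (update:queued_instances_expired_unlock):
-- find rows whose lock has timed out.
CREATE INDEX idx_queued_instances_expired_lock
    ON queued_instances (locked_until)
    WHERE locked_until IS NOT NULL;

-- GC candidate scan (select:runner_instances_gc_candidates):
-- scan only instances already in a terminal state.
CREATE INDEX idx_runner_instances_gc
    ON runner_instances (finished_at)
    WHERE status = 'done';
```

Partial indexes keep each index small by covering only the rows the corresponding query touches, which also limits write amplification on the hot insert path.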

Expected Performance Impact

The hypothesis is that GC paths doing seqscans impair the performance of the hot loops by holding locks on the database. This is consistent with our observations in production: database calls take progressively longer over time, and past a certain threshold the system pivots into catastrophically bad performance (most likely driven by the ratio between the long seqscan durations and the GC polling timeouts).

Operating without access to the production database, I was not able to verify at this time that this is the root cause, or even a significant slowdown; the relevant metrics are also not yet in place.

One important check we could do before merging is disabling the GC: if there are no GC runs, there are no GC seqscans, and thus there should be no slowdowns.
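
Another cheap check, assuming access to any environment with representative table sizes, is to ask the planner directly whether the GC candidate scan seqscans (the query text below is an illustrative stand-in for the real `select:runner_instances_gc_candidates` query):

```sql
-- If the plan shows "Seq Scan on runner_instances", the candidate scan
-- reads the whole table; with a matching index it should show an
-- "Index Scan" or "Bitmap Heap Scan" instead.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM runner_instances
WHERE status = 'done'
ORDER BY finished_at
LIMIT 100;
```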

@MOZGIII MOZGIII requested a review from piercefreeman April 2, 2026 06:08

github-actions bot commented Apr 2, 2026

Coverage Report

Python Coverage

| Metric   | Coverage |
|----------|----------|
| Lines    | 71.2%    |
| Branches | 57.3%    |

Rust Coverage

| Metric   | Coverage |
|----------|----------|
| Lines    | 65.5%    |
| Branches | N/A      |

Compared to main branch

@MOZGIII MOZGIII changed the title Add Queue + GC Indexes Add queue + GC Indexes Apr 2, 2026

@piercefreeman piercefreeman left a comment


I don't see any obvious downside to this, but I'm holding off on approving the merge to main until we do a bit of independent benchmarking. At the limit, indexes can slow down writes more than they speed up reads, so we should run some more quantitative benchmarking against our test cluster.

@MOZGIII MOZGIII force-pushed the mzg/2026-04-02/db-idx-1 branch from 8c7852d to 4b99528 on April 4, 2026 07:01

MOZGIII commented Apr 4, 2026

We could actually consider making some of the indexes async, especially for things like GC...

Anyhow, I'll bench this first in the soak test.
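
If "async" here means building the indexes without blocking writes, Postgres supports that via `CREATE INDEX CONCURRENTLY`. A sketch, with a hypothetical index and table name:

```sql
-- Builds the index without holding a long write-blocking lock on the table.
-- Caveats: cannot run inside a transaction block, takes longer than a plain
-- CREATE INDEX, and leaves an INVALID index behind if the build fails.
CREATE INDEX CONCURRENTLY idx_runner_instances_gc
    ON runner_instances (finished_at)
    WHERE status = 'done';
```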


@piercefreeman piercefreeman left a comment


Based on observed prod behavior on a dev branch, does seem to speed things up.


MOZGIII commented Apr 9, 2026

@piercefreeman, to clarify.

I don't think we were running with this one - the improvement is from #320. But that one needs dissection: its Rust-side changes could be irrelevant, while on the DB side it tweaks the indexes and autovacuum. This PR includes the same index addition plus two more, but no autovacuum changes.

My understanding, though, is that the key improvement was the autovacuum tweaks. Not saying the indexes in this PR are not useful - they still seem to be - but they're not what's helping with the TOAST issue in prod.

@piercefreeman

Must have mixed up my non-mainline branches. In that case, let's prefer #320... I want to motivate any indexes that we create, and we at least have some proof that the other one didn't degrade performance. Ideally we'd have a cleaner ablation test where we add things one by one.


MOZGIII commented Apr 9, 2026

Superseded by #320.

@MOZGIII MOZGIII closed this Apr 9, 2026