piercefreeman
left a comment
I don't see any obvious downside to this, but I'm holding off on approving a merge to main until we do a bit of independent benchmarking. At the limit, indexes can slow down writes more than they speed up reads, so we should do some more quantitative benchmarking against our test cluster.
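One quick way to put a number on the write-side cost, assuming access to the test cluster (the table and column names below are illustrative, not the actual schema):

```sql
-- Time a representative write path with and without the new indexes.
-- BUFFERS shows how much extra I/O index maintenance adds per statement.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE queued_instances
SET locked_by = NULL, lock_expires_at = NULL
WHERE lock_expires_at < now();
```

Running this before and after the migration on a comparable data volume gives a rough per-statement delta for the index maintenance overhead.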
Force-pushed from 8c7852d to 4b99528
We actually could consider making some of the indexes async, especially for things like GC... Anyhow, I'll bench this first in the soak test.
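On the async point: Postgres can already build indexes without blocking concurrent writes via `CREATE INDEX CONCURRENTLY`. A sketch, with a hypothetical index name and predicate (not the PR's actual DDL):

```sql
-- Builds the index without holding a write-blocking lock on the table.
-- Note: CONCURRENTLY cannot run inside a transaction block, so the
-- migration step that issues it has to run outside a transaction.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_runner_instances_gc
    ON runner_instances (finished_at)
    WHERE status = 'done';
```

The tradeoff is a slower build and a slight chance of being left invalid on failure, which the migration would need to detect and retry.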
piercefreeman
left a comment
Based on observed prod behavior on a dev branch, does seem to speed things up.
@piercefreeman, to clarify: I don't think we were running with this one - the improvement is from #320, but that one needs dissection. In that other PR the changes on the Rust side could be irrelevant; on the DB side it tweaks the indexes and autovacuum, while this one includes the same index addition plus two more, and no autovacuum changes. My understanding is that the key improvement came from the autovacuum tweaks. Not saying the indexes in this PR aren't useful - they still seem to be - but they're not what's helping with the TOAST issue in prod.
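For reference, the kind of per-table autovacuum tweak being discussed looks like this (the values here are illustrative, not the ones from #320):

```sql
-- Default autovacuum waits for ~20% of a table to be dead tuples,
-- which is far too lax on a large, hot queue table. Lowering the
-- scale factor makes vacuum (and TOAST cleanup) kick in much sooner.
ALTER TABLE queued_instances SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 1000
);
```

The corresponding TOAST table inherits its own settings, so a tuning pass would also want to check `pg_class.reloptions` for the TOAST relation.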
Must have mixed up my non-mainline branches. In that case, let's prefer #320... I want us to motivate any indexes we create, and we at least have some proof that the other one didn't degrade performance. Ideally we'd have a cleaner ablation test where we add things one by one.
Superseded by #320
This PR adds targeted Postgres indexes to reduce long-term queueing and cleanup slowdown as table sizes grow, without changing runtime behavior.
Runtime Calls Affected
- select:queued_instances (poll_queued_instances_once)
- update:queued_instances_expired_unlock (reclaim_expired_instance_locks)
- select:runner_instances_gc_candidates (collect_done_instances_impl)

Expected Performance Impact
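For concreteness, the three paths above would be served by partial indexes along these lines. The index names, columns, and predicates here are illustrative sketches, not the PR's actual migration DDL:

```sql
-- Hot-loop poll: pick up queued work in order.
CREATE INDEX idx_queued_instances_poll
    ON queued_instances (queued_at)
    WHERE status = 'queued';

-- Lock reclamation: find rows whose lock lease has expired.
CREATE INDEX idx_queued_instances_expired_unlock
    ON queued_instances (lock_expires_at)
    WHERE locked_by IS NOT NULL;

-- GC: find finished instances eligible for cleanup.
CREATE INDEX idx_runner_instances_gc_candidates
    ON runner_instances (finished_at)
    WHERE status = 'done';
```

Partial indexes keep the write amplification small because each one only indexes the narrow slice of rows its query path actually scans.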
The thinking is that GC paths doing seqscans could impair the performance of the hot loops by locking the database; this is consistent with our observations in production, where database calls take progressively longer, and past a certain threshold the system pivots into catastrophically bad performance (most likely once the long seqscan durations exceed the GC polling timeouts).
Operating without access to the production database, I wasn't able to verify at this time that this is the root cause, or even a significant slowdown; the metrics for this are also not yet in place.
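Absent those metrics, one quick check against the live database is Postgres's own statistics views (table names assumed from the query paths above):

```sql
-- Tables with many sequential scans relative to index scans are the
-- likely culprits; seq_tup_read shows how many rows those scans chewed
-- through, and n_dead_tup hints at vacuum/TOAST bloat.
SELECT relname, seq_scan, seq_tup_read, idx_scan, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('queued_instances', 'runner_instances')
ORDER BY seq_tup_read DESC;
```

Sampling this before and after the index migration would show directly whether the GC paths stopped falling back to seqscans.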
One important check we could do before merging is disabling the GC: if there are no GC runs, there are no GC seqscans, and thus there should be no slowdowns.