perf: reconcile incident indexes and add covering indexes for db metrics #1412
Merged
Context
On 2026-05-29 the PR #1402 deploy raised prod DB CPU to 100 percent. The cause was `/api/metrics/db` running unindexed aggregations (`getWorkflowStatsFromDb` / `getStepStatsFromDb`) that full-table-scanned `workflow_executions` and `workflow_execution_logs`. Mitigation was a helm + DB-migration rollback, after which an operator created indexes by hand directly on prod to bring CPU down.

Those manual indexes exist on prod but in no migration file, so staging, PR envs, and fresh clones are missing them. This migration reconciles that drift and adds the durable covering indexes.
What this does
Single migration, `0095_keep_669_reconcile_incident_indexes.sql`, in two groups:
- Group A: the indexes that already exist on prod (`idx_workflow_executions_status_duration`, `idx_exec_logs_status_duration`). `IF NOT EXISTS` makes these no-ops on prod and real builds everywhere else.
- Group B: the new covering indexes:
  - `idx_workflow_executions_stats_covering (status, error_type, workflow_id) INCLUDE (duration)`
  - `idx_exec_logs_stats_covering (node_type, status) INCLUDE (duration)`

Operator action required before merge
This migration carries `-- @requires-db-prep`, so merge is blocked until the `db-prepped-staging` label is set. Group B indexes are net-new on every environment including prod and must be built out-of-band with `CREATE INDEX CONCURRENTLY IF NOT EXISTS`, one at a time, outside a transaction, before the deploy runs `db:migrate`. `workflow_execution_logs` is ~1.9 GB on prod; a non-concurrent in-transaction build (drizzle-kit's default) would hold a SHARE lock on the table for the multi-minute build, blocking writes, and risk a repeat incident. Group A already exists on prod, so only Group B needs building there. Runbook is in the migration header.

Validate with `EXPLAIN` that the planner picks the index-only scan before relying on it.

Notes
- 0075).
- `lib/db/schema.ts`.
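
For reference, the out-of-band Group B build and the `EXPLAIN` check described under "Operator action required before merge" could look roughly like the sketch below. It is not the runbook from the migration header. The table names are inferred from the index names, and the aggregation query is a hypothetical stand-in for what `getWorkflowStatsFromDb` runs, so both should be checked against the real code before use.

```sql
-- Sketch only. Table names are assumed from the index names.
-- Run each CREATE INDEX as its own statement in autocommit mode:
-- CONCURRENTLY is rejected inside a transaction block.

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_workflow_executions_stats_covering
    ON workflow_executions (status, error_type, workflow_id)
    INCLUDE (duration);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_exec_logs_stats_covering
    ON workflow_execution_logs (node_type, status)
    INCLUDE (duration);

-- An interrupted concurrent build leaves an INVALID index behind, and
-- IF NOT EXISTS would then skip it on a rerun. This should return no rows.
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Hypothetical aggregation shaped like the covering index; substitute the
-- real query. The plan should show an Index Only Scan on the new index.
EXPLAIN
SELECT status, error_type, workflow_id, avg(duration)
FROM workflow_executions
GROUP BY status, error_type, workflow_id;
```

If the plan still shows a sequential scan, or an index-only scan with many heap fetches, one possible cause is a stale visibility map. Running `VACUUM` on the table before re-checking is worth trying.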