perf: reconcile incident indexes and add covering indexes for db metrics #1412

Merged
suisuss merged 1 commit into staging from perf/KEEP-669-reconcile-indexes on May 29, 2026

suisuss commented May 29, 2026

Context

On 2026-05-29 the PR #1402 deploy raised prod DB CPU to 100 percent. The cause was /api/metrics/db running unindexed aggregations (getWorkflowStatsFromDb / getStepStatsFromDb) that full-table-scanned workflow_executions and workflow_execution_logs. Mitigation was a helm + DB-migration rollback, after which an operator created indexes by hand directly on prod to bring CPU down.

Those manual indexes exist on prod but in no migration file, so staging, PR envs, and fresh clones are missing them. This migration reconciles that drift and adds the durable covering indexes.

What this does

Single migration 0095_keep_669_reconcile_incident_indexes.sql, two groups:

  • Group A - reconciliation. Captures the two hand-built prod indexes verbatim (idx_workflow_executions_status_duration, idx_exec_logs_status_duration). IF NOT EXISTS makes these no-ops on prod and real builds everywhere else.
  • Group B - durable fix. New covering indexes so the metrics aggregations become index-only scans instead of seq scans:
    • idx_workflow_executions_stats_covering (status, error_type, workflow_id) INCLUDE (duration)
    • idx_exec_logs_stats_covering (node_type, status) INCLUDE (duration)
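For reference, the Group B definitions listed above might look roughly like this in PostgreSQL. This is a sketch reconstructed from the description, not a copy of the migration file: it assumes idx_exec_logs_stats_covering is built on workflow_execution_logs, and the Group A definitions are omitted because their column lists are not given here.

```sql
-- Sketch only: index names and column lists come from the PR description.
-- The table for the second index (workflow_execution_logs) is an assumption.

-- Key columns cover the filters and groupings; INCLUDE carries duration in the
-- index leaf pages so the aggregate can be answered without visiting the heap.
CREATE INDEX IF NOT EXISTS idx_workflow_executions_stats_covering
    ON workflow_executions (status, error_type, workflow_id)
    INCLUDE (duration);

CREATE INDEX IF NOT EXISTS idx_exec_logs_stats_covering
    ON workflow_execution_logs (node_type, status)
    INCLUDE (duration);
```

INCLUDE columns are stored in the index but are not part of its search key, so they keep the key narrow while still allowing an index-only scan (PostgreSQL 11 or later).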

Operator action required before merge

This migration carries -- @requires-db-prep, so merge is blocked until the db-prepped-staging label is set. Group B indexes are net-new on every environment, including prod, and must be built out-of-band with CREATE INDEX CONCURRENTLY IF NOT EXISTS, one at a time and outside a transaction, before the deploy runs db:migrate. workflow_execution_logs is ~1.9 GB on prod; a non-concurrent, in-transaction build (drizzle-kit's default) would hold a multi-minute, write-blocking SHARE lock on the table and risk a repeat incident. Group A already exists on prod, so only Group B needs building there. The runbook is in the migration header.
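One pitfall with the out-of-band build: if CREATE INDEX CONCURRENTLY fails or is cancelled, PostgreSQL leaves an invalid index behind, and a later IF NOT EXISTS will skip it without complaint. A minimal check an operator could run after each build, assuming the Group B index names from this PR (it is not taken from the runbook in the migration header):

```sql
-- Sketch only: confirm each concurrently built index finished and is usable.
-- indisvalid = false means the build did not complete and the planner ignores the index.
SELECT c.relname AS index_name,
       i.indisvalid,
       i.indisready
FROM pg_index AS i
JOIN pg_class AS c ON c.oid = i.indexrelid
WHERE c.relname IN (
    'idx_workflow_executions_stats_covering',
    'idx_exec_logs_stats_covering'
);

-- If an index shows indisvalid = false, drop it and rebuild it,
-- again outside a transaction block:
-- DROP INDEX CONCURRENTLY IF EXISTS idx_exec_logs_stats_covering;
```

Running this before db:migrate matters because the migration's IF NOT EXISTS treats an invalid index as already present.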

Validate with EXPLAIN that the planner picks the index-only scan before relying on it.
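A sketch of that validation, using a hypothetical aggregation shaped like the metrics queries. The real SQL behind getWorkflowStatsFromDb is not shown in this PR, and duration is assumed to be a numeric column:

```sql
-- Sketch only: a stand-in for the real metrics query, which is not shown here.
-- Plain EXPLAIN plans the query without executing it, so it is safe on prod.
EXPLAIN
SELECT status,
       error_type,
       count(*)      AS executions,
       avg(duration) AS avg_duration  -- assumes duration is numeric
FROM workflow_executions
GROUP BY status, error_type;

-- On staging, EXPLAIN (ANALYZE, BUFFERS) runs the query for real and also
-- reports "Heap Fetches" for any Index Only Scan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT status,
       error_type,
       count(*)      AS executions,
       avg(duration) AS avg_duration
FROM workflow_executions
GROUP BY status, error_type;
```

The plan should show an Index Only Scan on idx_workflow_executions_stats_covering rather than a Seq Scan. A high Heap Fetches count means the visibility map is stale, and a VACUUM on the table is needed before the index-only scan delivers its benefit.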

Notes

  • Journal timestamp is monotonic; no schema snapshot is generated (matches the hand-written index migration 0075).
  • All four indexed columns verified present in lib/db/schema.ts.

suisuss added the db-prepped-staging label (Operator applied lock-free DDL to staging DB; safe to merge) on May 29, 2026
suisuss merged commit e7ebb70 into staging on May 29, 2026
49 of 50 checks passed
suisuss deleted the perf/KEEP-669-reconcile-indexes branch on May 29, 2026 11:02
@github-actions

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1412
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

  • The PR never had the deploy-pr-environment label
  • The environment was already cleaned up
  • The deployment never completed successfully

Labels

db-prepped-staging Operator applied lock-free DDL to staging DB; safe to merge
