release: To Prod#1410
Merged
Merged
Conversation
…rows KEEP-597. The sidebar workflow picker rendered soft-deleted rows with a strikethrough and "Deleted" badge instead of hiding them, and gave no visual signal for workflows with `enabled = false`. Soft-deleted rows are now filtered out at the sidebar boundary via a new `filterPickerVisible` helper in lib/workflow/soft-delete.ts. The wider /api/workflows payload still returns deleted rows so audit and history surfaces keep working (matches the KEEP-440 contract). Disabled rows stay visible and selectable but render with muted-foreground text and a "Disabled" badge -- no strikethrough. The dead deleted-render branch in WorkflowItem is removed; the picker contract is now centralised in the helper.
Hide the "Report an issue" feedback button in the navigation sidebar and short-circuit POST /api/feedback so it no longer proxies to the external feedback service. Both gate on NEXT_PUBLIC_FEEDBACK_ENABLED, which defaults to off; set it to "true" to re-enable without code changes. The ERC-8004 agent-reputation feedback system (/api/feedback/[id], agentic-wallet/feedback, schema-feedback) is intentionally untouched -- it is unrelated infrastructure that happens to share the "feedback" name. Adds unit coverage proving the route returns 503 without calling the external service when disabled (including non-"true" flag values), and resumes proxying when re-enabled.
…able
Refines KEEP-597. The first cut showed the "Disabled" badge on every
workflow with enabled = false, but the workflows table defaults enabled
to false and the UI only exposes the toggle for Schedule / Webhook /
Event / Block triggers (shouldShowEnableSwitch). That left every newly
created Manual workflow incorrectly tagged "Disabled" with no way for
the user to flip it off.
Adds two small helpers to lib/workflow/store.ts:
getWorkflowTriggerType(nodes) -- pulls the trigger type off the
trigger node without dragging the editor store into view.
shouldShowDisabledBadge({enabled, triggerType}) -- the picker
predicate: enabled === false AND the trigger has an enable switch.
The sidebar derives triggerType once at the visibleWorkflows boundary
and feeds it into WorkflowItem; the badge gates on the new predicate.
Manual rows now render clean regardless of the enabled column.
…ow check The sidebar caches the workflow list in local state and derives the "Disabled" label from each row's `enabled` column. Flipping the enable toggle in the editor header updated only the editor's atom -- the sidebar kept the stale value until a full refetch. Call refetchSidebar() inside updateWorkflowEnabled so the picker repaints right away. While here, drop the small Check icon on the row matching the currently viewed workflow. The bg-muted highlight on the active row already carries that signal and the icon is more noise than help.
…letedAt guard Two review nits from the PR review pass: - getWorkflowTriggerType now folds the legacy "Scheduled" triggerType onto "Schedule" before returning. Three other call sites (executor, metrics, mcp) already do this normalization, which implies real rows in the DB still hold the old spelling -- without normalization here the picker silently mis-classifies them and skips the Disabled badge. - filterPickerVisible now compares deletedAt against null/undefined explicitly (== null) instead of using a truthy check. The Date | string arms of the type would otherwise treat the epoch-zero Date or an empty-string deletedAt as not-deleted. Theoretical for our current API shape, but matches the documented intent of the type signature. Adds one test for the Scheduled-legacy path.
…or-ux-polish chore: KEEP-493 workflow editor UX polish (PR #1370)
chore: disable feedback widget behind a feature flag
…ker-deleted-disabled fix(workflow-picker): hide soft-deleted workflows; label only schedulable disabled
The scrape path runs ~35 aggregate queries against workflow_executions and workflow_execution_logs every 30s per pod. With replicaCount=2 and no caching, the DB sees ~4 full rounds per 2 minutes; staging profiling showed getBillingStatsFromDb averaging 5.9s, getWorkflowStatsFromDb 1.1s (Parallel Seq Scan), and getStepStatsFromDb 0.4s. Wrap updateDbMetrics in a per-process TTL cache (default 60s, overridable via DB_METRICS_CACHE_TTL_MS). Concurrent callers during an in-flight refresh share the same promise; failed refreshes are evicted so the next scrape retries instead of pinning a poisoned entry for the full TTL. The public boundary still swallows errors, preserving the existing "metrics endpoint never 500s" behavior. Tests added in tests/unit/db-metrics-cache.test.ts cover: TTL window hit, TTL expiry, TTL=0 bypass, in-flight dedup, and eviction-on-rejection.
If a refresh runs longer than DB_METRICS_CACHE_TTL_MS (e.g. DB under stress with multi-second seq scans), the old cache check used the start time and would let the next scrape begin a second concurrent refresh, doubling DB load in the exact moment the cache was supposed to protect against it. Track startedAt and completedAt separately. An in-flight entry (completedAt === null) is always shared regardless of TTL; only once the refresh completes does the TTL clock start, and the next refresh fires when (now - completedAt) >= ttl. Use Promise#then's two-arg form so the success-stamp / failure-evict handlers don't fork an orphan rejected promise that surfaces as an unhandled rejection. Adds a test covering the in-flight-past-TTL case.
Two new counters on the apiRegistry (per pod, scraped via
/api/metrics/api):
- keeperhub_db_metrics_cache_lookups_total{result}
result = hit | miss | in_flight_dedup | disabled
Increments once per updateDbMetrics() call. Cache hit rate is
hit / (hit + miss + in_flight_dedup).
- keeperhub_db_metrics_refresh_total{outcome}
outcome = success | error
Increments once per actual DB round-trip. DB load contribution
is sum(rate(refresh_total)); failure rate is
rate(refresh_total{outcome="error"}) / rate(refresh_total).
Without these, the only post-deploy signal for cache effectiveness
was the indirect "DB CPU went down" reading on CloudWatch.
parseInt silently strips trailing non-numeric characters, so DB_METRICS_CACHE_TTL_MS=60s would be parsed as 60 (i.e. 60ms, not 60s) without warning. Use Number() which returns NaN for any trailing junk, so malformed env vars fall back to the documented default instead of silently degrading. Adds a test case covering the '60s' typo.
Date.now() is wall-clock and can step backwards on NTP adjustments
(rare on EKS, not impossible). A backwards step would make the TTL
check artificially small, holding a stale cache entry longer than
intended. performance.now() is process-monotonic and the correct
primitive for an interval check.
Tests updated to use vi.useFakeTimers({ toFake: ['Date', 'performance'] })
so both clocks advance together with vi.advanceTimersByTime().
Exporting a cache-reset function from production code is a smell — even with the __ prefix, any importer can call it. Vitest sets NODE_ENV=test by default, so gate the call so production accidents fail loud (throw) instead of silently clearing mid-flight cache state. Keeps the test ergonomics (no need for vi.resetModules + re-imports per test) while removing the attractive nuisance.
mockClear only resets call history; mockImplementationOnce and mockRejectedValueOnce queued items persist across tests. If a test queues an implementation it doesn't fully consume, the next test inherits it. mockReset clears the implementation queue too. Default impls are factored into DEFAULT_DB_RETURNS so the per-test rebind is a single call.
flushMicrotasks(20) was a magic number chosen empirically because refreshDbMetricsNow() does a dynamic import + Promise.all of 13 helpers. Add a helper, the count silently changes; the test could keep passing for the wrong reason. waitForMockCall polls each microtask turn until the mock has been invoked at least once, then returns. Misuse fails loud with a clear error instead of a confusing downstream assertion failure.
…on test assertion The fire-and-forget raw.then() in updateDbMetrics had no protection against the settle callbacks themselves throwing (e.g. prom-client in a bad state), which would produce an unhandled rejection from the returned promise. Added .catch() to cover that path. Tightened the eviction test assertion from toBeGreaterThanOrEqual(2) to toHaveBeenCalledTimes(2) — the precise claim is that a failed refresh evicts the slot so the next call makes exactly one retry, not an unbounded number.
perf: cache /api/metrics/db scrape to cut DB CPU
perf: reconcile incident indexes and add covering indexes for db metrics
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.