Skip to content

release: To Prod#1410

Merged
OleksandrUA merged 21 commits into
prodfrom
staging
May 29, 2026
Merged

release: To Prod#1410
OleksandrUA merged 21 commits into
prodfrom
staging

Conversation

@suisuss

@suisuss suisuss commented May 29, 2026

Copy link
Copy Markdown

No description provided.

eskp and others added 9 commits May 29, 2026 17:57
…rows

KEEP-597. The sidebar workflow picker rendered soft-deleted rows with a
strikethrough and "Deleted" badge instead of hiding them, and gave no
visual signal for workflows with `enabled = false`.

Soft-deleted rows are now filtered out at the sidebar boundary via a new
`filterPickerVisible` helper in lib/workflow/soft-delete.ts. The wider
/api/workflows payload still returns deleted rows so audit and history
surfaces keep working (matches the KEEP-440 contract). Disabled rows
stay visible and selectable but render with muted-foreground text and a
"Disabled" badge -- no strikethrough.

The dead deleted-render branch in WorkflowItem is removed; the picker
contract is now centralised in the helper.
Hide the "Report an issue" feedback button in the navigation sidebar and
short-circuit POST /api/feedback so it no longer proxies to the external
feedback service. Both gate on NEXT_PUBLIC_FEEDBACK_ENABLED, which defaults
to off; set it to "true" to re-enable without code changes.

The ERC-8004 agent-reputation feedback system (/api/feedback/[id],
agentic-wallet/feedback, schema-feedback) is intentionally untouched -- it
is unrelated infrastructure that happens to share the "feedback" name.

Adds unit coverage proving the route returns 503 without calling the
external service when disabled (including non-"true" flag values), and
resumes proxying when re-enabled.
…able

Refines KEEP-597. The first cut showed the "Disabled" badge on every
workflow with enabled = false, but the workflows table defaults enabled
to false and the UI only exposes the toggle for Schedule / Webhook /
Event / Block triggers (shouldShowEnableSwitch). That left every newly
created Manual workflow incorrectly tagged "Disabled" with no way for
the user to flip it off.

Adds two small helpers to lib/workflow/store.ts:
  getWorkflowTriggerType(nodes) -- pulls the trigger type off the
    trigger node without dragging the editor store into view.
  shouldShowDisabledBadge({enabled, triggerType}) -- the picker
    predicate: enabled === false AND the trigger has an enable switch.

The sidebar derives triggerType once at the visibleWorkflows boundary
and feeds it into WorkflowItem; the badge gates on the new predicate.
Manual rows now render clean regardless of the enabled column.
…ow check

The sidebar caches the workflow list in local state and derives the
"Disabled" label from each row's `enabled` column. Flipping the enable
toggle in the editor header updated only the editor's atom -- the
sidebar kept the stale value until a full refetch. Call refetchSidebar()
inside updateWorkflowEnabled so the picker repaints right away.

While here, drop the small Check icon on the row matching the currently
viewed workflow. The bg-muted highlight on the active row already
carries that signal and the icon is more noise than help.
…letedAt guard

Two review nits from the PR review pass:

- getWorkflowTriggerType now folds the legacy "Scheduled" triggerType
  onto "Schedule" before returning. Three other call sites (executor,
  metrics, mcp) already do this normalization, which implies real rows
  in the DB still hold the old spelling -- without normalization here
  the picker silently mis-classifies them and skips the Disabled badge.

- filterPickerVisible now compares deletedAt against null/undefined
  explicitly (== null) instead of using a truthy check. The Date | string
  arms of the type would otherwise treat the epoch-zero Date or an
  empty-string deletedAt as not-deleted. Theoretical for our current API
  shape, but matches the documented intent of the type signature.

Adds one test for the Scheduled-legacy path.
…or-ux-polish

chore: KEEP-493 workflow editor UX polish (PR #1370)
chore: disable feedback widget behind a feature flag
…ker-deleted-disabled

fix(workflow-picker): hide soft-deleted workflows; label only schedulable disabled
The scrape path runs ~35 aggregate queries against workflow_executions and
workflow_execution_logs every 30s per pod. With replicaCount=2 and no
caching, the DB sees ~4 full rounds per 2 minutes; staging profiling
showed getBillingStatsFromDb averaging 5.9s, getWorkflowStatsFromDb
1.1s (Parallel Seq Scan), and getStepStatsFromDb 0.4s.

Wrap updateDbMetrics in a per-process TTL cache (default 60s, overridable
via DB_METRICS_CACHE_TTL_MS). Concurrent callers during an in-flight
refresh share the same promise; failed refreshes are evicted so the next
scrape retries instead of pinning a poisoned entry for the full TTL. The
public boundary still swallows errors, preserving the existing
"metrics endpoint never 500s" behavior.

Tests added in tests/unit/db-metrics-cache.test.ts cover: TTL window hit,
TTL expiry, TTL=0 bypass, in-flight dedup, and eviction-on-rejection.
suisuss added 9 commits May 29, 2026 19:46
If a refresh runs longer than DB_METRICS_CACHE_TTL_MS (e.g. DB under
stress with multi-second seq scans), the old cache check used the
start time and would let the next scrape begin a second concurrent
refresh, doubling DB load in the exact moment the cache was supposed
to protect against it.

Track startedAt and completedAt separately. An in-flight entry
(completedAt === null) is always shared regardless of TTL; only once
the refresh completes does the TTL clock start, and the next refresh
fires when (now - completedAt) >= ttl. Use Promise#then's two-arg form
so the success-stamp / failure-evict handlers don't fork an orphan
rejected promise that surfaces as an unhandled rejection.

Adds a test covering the in-flight-past-TTL case.
Two new counters on the apiRegistry (per pod, scraped via
/api/metrics/api):

- keeperhub_db_metrics_cache_lookups_total{result}
  result = hit | miss | in_flight_dedup | disabled
  Increments once per updateDbMetrics() call. Cache hit rate is
  hit / (hit + miss + in_flight_dedup).

- keeperhub_db_metrics_refresh_total{outcome}
  outcome = success | error
  Increments once per actual DB round-trip. DB load contribution
  is sum(rate(refresh_total)); failure rate is
  rate(refresh_total{outcome="error"}) / rate(refresh_total).

Without these, the only post-deploy signal for cache effectiveness
was the indirect "DB CPU went down" reading on CloudWatch.
parseInt silently strips trailing non-numeric characters, so
DB_METRICS_CACHE_TTL_MS=60s would be parsed as 60 (i.e. 60ms,
not 60s) without warning. Use Number() which returns NaN for any
trailing junk, so malformed env vars fall back to the documented
default instead of silently degrading.

Adds a test case covering the '60s' typo.
Date.now() is wall-clock and can step backwards on NTP adjustments
(rare on EKS, not impossible). A backwards step would make the TTL
check artificially small, holding a stale cache entry longer than
intended. performance.now() is process-monotonic and the correct
primitive for an interval check.

Tests updated to use vi.useFakeTimers({ toFake: ['Date', 'performance'] })
so both clocks advance together with vi.advanceTimersByTime().
Exporting a cache-reset function from production code is a smell —
even with the __ prefix, any importer can call it. Vitest sets
NODE_ENV=test by default, so gate the call so production accidents
fail loud (throw) instead of silently clearing mid-flight cache state.

Keeps the test ergonomics (no need for vi.resetModules + re-imports
per test) while removing the attractive nuisance.
mockClear only resets call history; mockImplementationOnce and
mockRejectedValueOnce queued items persist across tests. If a test
queues an implementation it doesn't fully consume, the next test
inherits it. mockReset clears the implementation queue too.

Default impls are factored into DEFAULT_DB_RETURNS so the per-test
rebind is a single call.
flushMicrotasks(20) was a magic number chosen empirically because
refreshDbMetricsNow() does a dynamic import + Promise.all of 13
helpers. Add a helper, the count silently changes; the test could
keep passing for the wrong reason.

waitForMockCall polls each microtask turn until the mock has been
invoked at least once, then returns. Misuse fails loud with a clear
error instead of a confusing downstream assertion failure.
…on test assertion

The fire-and-forget raw.then() in updateDbMetrics had no protection
against the settle callbacks themselves throwing (e.g. prom-client in a
bad state), which would produce an unhandled rejection from the returned
promise. Added .catch() to cover that path.

Tightened the eviction test assertion from toBeGreaterThanOrEqual(2) to
toHaveBeenCalledTimes(2) — the precise claim is that a failed refresh
evicts the slot so the next call makes exactly one retry, not an
unbounded number.
perf: cache /api/metrics/db scrape to cut DB CPU
perf: reconcile incident indexes and add covering indexes for db metrics
@suisuss suisuss added the db-prepped-prod Operator applied lock-free DDL to prod DB; safe to merge label May 29, 2026
@OleksandrUA OleksandrUA merged commit 5d3cbdb into prod May 29, 2026
43 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

db-prepped-prod Operator applied lock-free DDL to prod DB; safe to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants