
release: To Prod #1514

Merged
suisuss merged 16 commits into prod from staging on Jun 11, 2026


@suisuss suisuss commented Jun 11, 2026


No description provided.

OleksandrUA and others added 16 commits May 20, 2026 11:26
Draft spec for the four-dashboard rebuild called out in KEEP-573 Phase 1 E.
Splits the current single "KeeperHub" dashboard by audience:

- A. Managed Client SLO (exec + Sky/Ajna account team)
- B. Platform Health (TechOps/DevOps on-call)
- C. Customer Workflows (per-org support debugging)
- D. Growth + Revenue (founders + revenue side)

Each dashboard section: variables, panel list with PromQL/LogQL,
linked alerts, owner suggestion, open questions. Implementation plan
targets the new grafana git-sync path adopted on 2026-05-20, with
files landing under grafana/keeperhub-dashboards/git-sync/.

Open questions for the review at the bottom of the doc.

Closes the Phase 1 E acceptance item on the parent ticket. Owners
finalized in this PR's review.
A bulk error_type reclassification (backfill re-run, classifier rule
change, manual SQL fix) shifts counts between the DB-sourced gauge's
error_type series within a single scrape. PromQL's increase() counts the
gain on one series as new errors and treats the loss on the other as a
counter reset, so the two movements do not cancel. The result is a
phantom positive bump that contaminates SLI panels for one window-length.

Documents the symptom + how to recognise it so future engineers don't
chase a phantom incident, and references the KEEP-592 analysis.
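
A minimal sketch of why the bump appears, under stated assumptions: `increase` below is a simplified model of PromQL's increase() (reset correction only, no extrapolation to the window edges), and the two series and their values are hypothetical. Sixty rows are relabelled between scrapes while the real error total stays flat.

```typescript
// Simplified model of PromQL increase() over the raw samples in one window.
// Assumption: real Prometheus also extrapolates to the window edges; omitted here.
function increase(samples: number[]): number {
  let first: number | null = null;
  let prev: number | null = null;
  let correction = 0;
  for (const value of samples) {
    if (first === null) first = value;
    // A drop is read as a counter reset: the pre-drop value is added back.
    if (prev !== null && value < prev) correction += prev;
    prev = value;
  }
  if (first === null || prev === null) return 0;
  return prev - first + correction;
}

// Hypothetical error_type series; 60 rows move from "timeout" to "rpc"
// between the second and third scrape. No new errors occur.
const timeoutSeries = [100, 100, 40, 40];
const rpcSeries = [50, 50, 110, 110];

const totalBefore = 100 + 50;
const totalAfter = 40 + 110;
const realNewErrors = totalAfter - totalBefore;

// The gain (60) counts as new errors; the loss is read as a reset followed
// by a climb to 40, so it also contributes a positive value.
const reported = increase(timeoutSeries) + increase(rpcSeries);

console.log({ realNewErrors, reported });
```

In this model the summed increase is positive even though the underlying total never changed, which matches the symptom described above: a spike that coincides with a reclassification and lasts for one window-length.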
…MAC cutover

All internal producers (executor, scheduler, events) now sign HMAC, so the legacy X-Service-Key / X-Internal-Token bearer path and the INTERNAL_AUTH_REQUIRE_HMAC flag are dead weight. authenticateInternalService now accepts HMAC only and the scheme type narrows to "hmac". Migrate the two e2e suites to HMAC signing via a shared tests/utils/internal-service-auth helper, and drop the now-unused SERVICE_API_KEY producer export.
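
For readers unfamiliar with the scheme, here is a minimal sketch of HMAC request signing and verification using Node's built-in crypto module. Everything specific is an assumption for illustration: the canonical string layout, the timestamp format, the five-minute skew window and the function names are hypothetical and are not taken from authenticateInternalService or the tests/utils/internal-service-auth helper.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical canonical form: method, path, timestamp and body joined by newlines.
function canonical(method: string, path: string, timestamp: string, body: string): string {
  return [method.toUpperCase(), path, timestamp, body].join("\n");
}

// Producer side: hex-encoded HMAC-SHA256 over the canonical string.
function sign(secret: string, method: string, path: string, timestamp: string, body: string): string {
  return createHmac("sha256", secret).update(canonical(method, path, timestamp, body)).digest("hex");
}

// Consumer side: reject stale timestamps, then compare in constant time.
function verify(
  secret: string,
  signature: string,
  method: string,
  path: string,
  timestamp: string, // assumed to be epoch milliseconds as a string
  body: string,
  nowMs: number,
  maxSkewMs: number = 5 * 60_000, // assumed replay window
): boolean {
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(nowMs - ts) > maxSkewMs) return false;
  const expected = Buffer.from(sign(secret, method, path, timestamp, body), "hex");
  const given = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so check length first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

The signature covers the body and a timestamp, so a captured request cannot be replayed with a different payload or long after it was issued. A static bearer key offers neither property.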
…to HMAC

The HMAC cutover left a few non-production consumers still sending the legacy X-Service-Key bearer. Migrate them so no legacy references remain:

- docker-compose reaper/execution-digest now sign HMAC via the existing reaper.sh/digest-cron.sh (on an openssl-capable image). A new hmac-seed one-shot service pushes the schema and seeds the shared secret (caller *shared*) into internal_service_hmac_secrets so locally-signed requests verify.
- The k6 load test signs via k6/crypto.
- seed-internal-service-hmac.ts now also accepts the db/postgres compose hostnames.
- Docs (scripts/README, load-tests, KEEP-1164 spec) are updated to the HMAC scheme.
…ing deploy values

Removes the dead INTERNAL_AUTH_REQUIRE_HMAC flag and the legacy
service-to-service keys (MCP/EVENTS/SCHEDULER/HUB_SERVICE_API_KEY app
consumers; KEEPERHUB_API_KEY scheduler/event producers) from the prod and
staging Helm values. These were read only by the now-deleted legacy bearer
verifier; producers sign HMAC via INTERNAL_SERVICE_HMAC_SECRET, which is
retained. Their backing SSM params are removed by infrastructure #267, so
these parameterStore references must go first to avoid external-secrets
sync failures.
…al auth

Replaces the legacy bearer keys with INTERNAL_SERVICE_HMAC_SECRET across the
non-prod manifests so internal producers can sign and the reaper cronjob can
call the app:

- pr-environment: app + reaper cronjob + scheduler/block dispatchers, executor
  and event-tracker now resolve INTERNAL_SERVICE_HMAC_SECRET from the staging
  parameterStore (the retained internal-service-hmac-secret), matching how
  pr-env already shares staging params.
- local (minikube/hybrid): dispatchers, runner trigger and digest cronjob now
  use the shared HMAC secret as a kv literal, matching the docker-compose dev
  default.

The pr-env/local app verifies against its DB-backed secret store, so these
producers only work once that store is seeded with the same secret - handled
in the following commit.
NEEDS LIVE TEST - not yet validated against a real PR environment.

The PR-env app verifies internal requests against its DB-backed HMAC store,
which starts empty on each ephemeral database. Append a seed step to the
existing db-migration initContainer (migrator image already runs tsx-based
db:seed, so tsx + scripts + lib are present): it inserts the shared signing
secret under caller *shared* so HMAC-signed producer requests verify.

- Reuses the migrator initContainer (has DATABASE_URL via shared_env); adds
  INTERNAL_SERVICE_HMAC_SECRET (the value to seed) and AGENTIC_WALLET_HMAC_KMS_KEY
  (envelope-encrypts it at rest) to its env.
- ALLOW_REMOTE=1 because the PR-env DB host is not localhost.
- The seed line is failure-tolerant (|| echo) so a seed failure - duplicate
  row on re-deploy, KMS/IAM gap - never blocks app startup. Worst case the
  store stays unseeded and internal dispatch stays non-functional, same as
  before this change; it never regresses pod startup.

Open verification points for the live test: migrator image can resolve @/lib
under tsx; the PR-env IRSA role has kms:Encrypt on the agentic-wallet KMS key.
…vent-tracker

The executor and event-tracker producers sign HMAC outbound and no longer use
the legacy bearer key, but still read KEEPERHUB_API_KEY into config. Remove the
unused reads (executor CONFIG.keeperhubApiKey, the events KEEPERHUB_API_KEY
export), the matching test fixtures/env-stubs, and the stale .env.example /
README references, which now document INTERNAL_SERVICE_HMAC_SECRET.

Verified: pnpm type-check passes; executor api-execute + internal-service-auth
unit suites pass (15/15).
The earlier consumer migration wired INTERNAL_SERVICE_HMAC_SECRET into the
reaper/digest/hmac-seed services but missed the dispatcher, block-dispatcher,
executor and event-tracker dev services - they still carried the dead
KEEPERHUB_API_KEY and no signing secret, so they signed with an empty secret
and the app rejected them (401) in local dev.

- Give every compose producer (incl. test-dispatcher) INTERNAL_SERVICE_HMAC_SECRET.
- Drop the dead legacy keys from the env-service-keys anchor and the app
  services (the app verifies via its DB-backed store, not these env vars).
- Document AGENTIC_WALLET_HMAC_KMS_KEY (the 32-byte base64 AES key the app needs
  to encrypt/decrypt the store) in .env.example; refresh the scheduler env docs.

Verified locally: db:push + seed + producer-sign + store decrypt round-trip and
signature match all pass against a real Postgres.
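
One way to surface this class of misconfiguration earlier is a startup guard, sketched below. It is not part of this change: the function names are hypothetical, and only the two environment variable names and the 32-byte base64 key format come from the commit message above. A producer that fails fast on an empty signing secret would report the problem at boot instead of as a 401 from the app.

```typescript
import { Buffer } from "node:buffer";

type Env = Record<string, string | undefined>;

// Producer side (hypothetical helper): refuse to start with a missing or blank
// signing secret rather than signing requests the app will reject.
function requireSigningSecret(env: Env): string {
  const secret = env["INTERNAL_SERVICE_HMAC_SECRET"] ?? "";
  if (secret.trim() === "") {
    throw new Error("INTERNAL_SERVICE_HMAC_SECRET is missing or empty");
  }
  return secret;
}

// App side (hypothetical helper): the store key must decode from base64
// to exactly 32 bytes.
function requireStoreKey(env: Env): Buffer {
  const key = Buffer.from(env["AGENTIC_WALLET_HMAC_KMS_KEY"] ?? "", "base64");
  if (key.length !== 32) {
    throw new Error(`AGENTIC_WALLET_HMAC_KMS_KEY must decode to 32 bytes, got ${key.length}`);
  }
  return key;
}
```

Passing the environment in as a parameter, instead of reading process.env directly, keeps both checks easy to unit test.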
…ploy

NEEDS LIVE TEST - not yet validated against a real minikube deploy.

Mirrors the pr-env seeding for the full local k8s path (setup-local.sh):
- Add AGENTIC_WALLET_HMAC_KMS_KEY (kv literal) to the app env so the local app
  can encrypt/decrypt its internal_service_hmac_secrets store and verify
  HMAC-signed requests from the local dispatchers.
- Add a fully non-blocking init container (pnpm db:push then seed; both lines
  tolerate failure) so it can never block app startup - worst case the store
  stays unseeded, same as before, no regression.

The primary local path (docker-compose) already seeds via the hmac-seed service;
this covers the minikube/setup-local.sh path.
…tainer

Live test on pr-1511 showed the db-migration initContainer seed ran but failed
("No secret provided on stdin"): its INTERNAL_SERVICE_HMAC_SECRET secretKeyRef
env was empty because the initContainer starts before external-secrets syncs the
value from SSM, and secretKeyRef env is fixed at container start. The reaper then
got HTTP 401 "No active signing secret" (empty store).

- Remove the seed from the db-migration initContainer (it cannot win the ESO
  race; reverts that part of the earlier pr-env change).
- Add a post-app-deploy workflow step: after 'helm --atomic' (app ready), poll
  until the secret is populated, then run a one-shot migrator-image Job that
  seeds the store - signing secret + KMS key via secretKeyRef (now populated),
  DATABASE_URL built from the PR db credentials.

Local minikube keeps its initContainer seed: it uses kv literals (injected
directly, no external-secrets), so there is no sync race there.
The seed script left the DB connection open after a successful insert, so the
process - and the post-deploy hmac-seed Job pod - never exited; the Job stayed
Running and the workflow's kubectl wait hit its 180s timeout. Exit 0 explicitly
on success. Live pr-1511 confirmed the seed and reaper HMAC auth already work
(reaper now returns HTTP 200); this just lets the Job report completion and be
cleaned up by ttlSecondsAfterFinished.

Verified: pnpm type-check passes.
Reduce execution-mode defaults from 25 VUs x 65 wf/VU (1625 workflows)
to 5 VUs x 30 wf/VU (150 workflows) with a slower 30s ramp, and move
the weekly cron from Sat 20:00 UTC to Sun 04:00 UTC, off the busiest
organic traffic window. Defaults updated in dispatch inputs, scheduled
run fallbacks, and the run summary.
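
The arithmetic behind the new defaults can be checked with a short sketch. The interface and constant names are hypothetical; the VU counts, per-VU workflow counts, ramp length and schedule times come from the message above, and the cron strings are the standard five-field equivalents of those times.

```typescript
// Hypothetical shape for the execution-mode defaults described above.
interface ExecutionDefaults {
  vus: number;
  workflowsPerVu: number;
  rampSeconds?: number; // the previous ramp length is not stated in the source
  cron: string; // five-field cron, evaluated in UTC
}

const previousDefaults: ExecutionDefaults = {
  vus: 25,
  workflowsPerVu: 65,
  cron: "0 20 * * 6", // Saturday 20:00 UTC
};

const currentDefaults: ExecutionDefaults = {
  vus: 5,
  workflowsPerVu: 30,
  rampSeconds: 30,
  cron: "0 4 * * 0", // Sunday 04:00 UTC
};

// Total workflows created by one run.
function totalWorkflows(d: ExecutionDefaults): number {
  return d.vus * d.workflowsPerVu;
}

const before = totalWorkflows(previousDefaults);
const after = totalWorkflows(currentDefaults);
console.log(`workflows per run: ${before} -> ${after}`);
```

The new run creates 150 workflows instead of 1625, a little under a tenth of the previous volume.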
…test-10x-prod-avg

ci: recalibrate weekly k6 load test to 10x prod average concurrency
docs: KEEP-573 four-dashboard observability spec for review
…y-internal-auth

refactor: remove legacy bearer internal-service auth after HMAC cutover
@suisuss suisuss merged commit 3331085 into prod Jun 11, 2026
39 of 40 checks passed

2 participants