Draft spec for the four-dashboard rebuild called out in KEEP-573 Phase 1 E. Splits the current single "KeeperHub" dashboard by audience:
- A. Managed Client SLO (exec + Sky/Ajna account team)
- B. Platform Health (TechOps/DevOps on-call)
- C. Customer Workflows (per-org support debugging)
- D. Growth + Revenue (founders + revenue side)
Each dashboard section covers: variables, panel list with PromQL/LogQL, linked alerts, owner suggestion, open questions.
Implementation plan targets the new grafana git-sync path adopted on 2026-05-20, with files landing under grafana/keeperhub-dashboards/git-sync/. Open questions for the review are at the bottom of the doc.
Closes the Phase 1 E acceptance item on the parent ticket. Owners finalized in this PR's review.
Bulk error_type reclassification (backfill re-run, classifier rule change, manual SQL fix) makes the DB-sourced gauge's error_type label-value series move at one scrape. PromQL's increase() reads the gain as new errors while the loss is treated as a counter reset, producing a phantom positive bump that contaminates SLI panels for one window-length. Documents the symptom + how to recognise it so future engineers don't chase a phantom incident, and references the KEEP-592 analysis.
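A minimal sketch of why the bump appears, assuming a simplified reset-aware increase (positive deltas are summed, and any drop is treated as a counter restart from zero; Prometheus's window extrapolation is left out). The series values and the error_type names "timeout" and "rpc_error" are hypothetical, chosen only to illustrate the arithmetic.

```typescript
// Simplified model of a counter-style increase(): sums positive deltas and,
// on any drop, assumes the counter restarted from zero, so the post-drop
// value counts as new increase. Extrapolation is deliberately omitted.
function resetAwareIncrease(samples: number[]): number {
  let total = 0;
  let prev: number | undefined;
  for (const cur of samples) {
    if (prev !== undefined) {
      total += cur >= prev ? cur - prev : cur;
    }
    prev = cur;
  }
  return total;
}

// Hypothetical gauge values over four scrapes. Between the second and third
// scrape, 40 rows are reclassified from "timeout" to "rpc_error".
// No new errors occur at any point.
const timeoutSeries = [100, 100, 60, 60];
const rpcErrorSeries = [10, 10, 50, 50];

const phantomTotal =
  resetAwareIncrease(timeoutSeries) + resetAwareIncrease(rpcErrorSeries);
const actualNewErrors = 0;

console.log({ phantomTotal, actualNewErrors });
```

Under this model both series contribute a positive amount even though the total row count never changed, which is the signature to look for before treating the spike as an incident.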
…MAC cutover
All internal producers (executor, scheduler, events) now sign HMAC, so the legacy X-Service-Key / X-Internal-Token bearer path and the INTERNAL_AUTH_REQUIRE_HMAC flag are dead weight. authenticateInternalService now accepts HMAC only and the scheme type narrows to "hmac". Migrate the two e2e suites to HMAC signing via a shared tests/utils/internal-service-auth helper, and drop the now-unused SERVICE_API_KEY producer export.
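For readers unfamiliar with the scheme, a minimal sketch of HMAC request signing and verification using Node's built-in crypto module. The canonical string layout, the function names, and the example path are assumptions for illustration; the actual fields signed by tests/utils/internal-service-auth and authenticateInternalService are not shown in this description.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical canonical form: the real helper's signed fields and header
// names are not shown in this PR description.
function canonicalRequest(method: string, path: string, timestamp: string, body: string): string {
  return [method.toUpperCase(), path, timestamp, body].join("\n");
}

// Producer side: HMAC-SHA256 over the canonical request, hex-encoded.
function signRequest(secret: string, method: string, path: string, timestamp: string, body: string): string {
  return createHmac("sha256", secret)
    .update(canonicalRequest(method, path, timestamp, body))
    .digest("hex");
}

// Consumer side: recompute the signature and compare in constant time.
// timingSafeEqual throws on unequal lengths, so lengths are checked first.
function verifyRequest(
  secret: string,
  method: string,
  path: string,
  timestamp: string,
  body: string,
  signature: string,
): boolean {
  const expected = Buffer.from(signRequest(secret, method, path, timestamp, body), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

const demoSignature = signRequest("dev-secret", "POST", "/api/internal/example", "1700000000", "{}");
console.log(verifyRequest("dev-secret", "POST", "/api/internal/example", "1700000000", "{}", demoSignature));
```

Including a timestamp in the signed string is a common way to bound replay; whether the real scheme does so is not stated here.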
…to HMAC
The HMAC cutover left a few non-production consumers still sending the legacy X-Service-Key bearer. Migrate them so no legacy references remain:
- docker-compose reaper/execution-digest now sign HMAC via the existing reaper.sh/digest-cron.sh (on an openssl-capable image), with a new hmac-seed one-shot service that pushes the schema and seeds the shared *shared* secret into internal_service_hmac_secrets so locally-signed requests verify.
- The k6 load test signs via k6/crypto.
- seed-internal-service-hmac.ts now also accepts the db/postgres compose hostnames.
- Docs (scripts/README, load-tests, KEEP-1164 spec) updated to the HMAC scheme.
…ing deploy values
Removes the dead INTERNAL_AUTH_REQUIRE_HMAC flag and the legacy service-to-service keys (MCP/EVENTS/SCHEDULER/HUB_SERVICE_API_KEY app consumers; KEEPERHUB_API_KEY scheduler/event producers) from the prod and staging Helm values. These were read only by the now-deleted legacy bearer verifier; producers sign HMAC via INTERNAL_SERVICE_HMAC_SECRET, which is retained. Their backing SSM params are removed by infrastructure #267, so these parameterStore references must go first to avoid external-secrets sync failures.
…al auth
Replaces the legacy bearer keys with INTERNAL_SERVICE_HMAC_SECRET across the non-prod manifests so internal producers can sign and the reaper cronjob can call the app:
- pr-environment: app + reaper cronjob + scheduler/block dispatchers, executor and event-tracker now resolve INTERNAL_SERVICE_HMAC_SECRET from the staging parameterStore (the retained internal-service-hmac-secret), matching how pr-env already shares staging params.
- local (minikube/hybrid): dispatchers, runner trigger and digest cronjob now use the shared HMAC secret as a kv literal, matching the docker-compose dev default.
The pr-env/local app verifies against its DB-backed secret store, so these producers only work once that store is seeded with the same secret - handled in the following commit.
NEEDS LIVE TEST - not yet validated against a real PR environment.
The PR-env app verifies internal requests against its DB-backed HMAC store, which starts empty on each ephemeral database. Append a seed step to the existing db-migration initContainer (migrator image already runs tsx-based db:seed, so tsx + scripts + lib are present): it inserts the shared signing secret under caller *shared* so HMAC-signed producer requests verify.
- Reuses the migrator initContainer (has DATABASE_URL via shared_env); adds INTERNAL_SERVICE_HMAC_SECRET (the value to seed) and AGENTIC_WALLET_HMAC_KMS_KEY (envelope-encrypts it at rest) to its env.
- ALLOW_REMOTE=1 because the PR-env DB host is not localhost.
- The seed line is failure-tolerant (|| echo) so a seed failure (duplicate row on re-deploy, KMS/IAM gap) never blocks app startup. Worst case the store stays unseeded and internal dispatch stays non-functional, same as before this change; it never regresses pod startup.
Open verification points for the live test: migrator image can resolve @/lib under tsx; the PR-env IRSA role has kms:Encrypt on the agentic-wallet KMS key.
…vent-tracker
The executor and event-tracker producers sign HMAC outbound and no longer use the legacy bearer key, but still read KEEPERHUB_API_KEY into config. Remove the unused reads (executor CONFIG.keeperhubApiKey, the events KEEPERHUB_API_KEY export), the matching test fixtures/env-stubs, and the stale .env.example / README references, which now document INTERNAL_SERVICE_HMAC_SECRET.
Verified: pnpm type-check passes; executor api-execute + internal-service-auth unit suites pass (15/15).
The earlier consumer migration wired INTERNAL_SERVICE_HMAC_SECRET into the reaper/digest/hmac-seed services but missed the dispatcher, block-dispatcher, executor and event-tracker dev services - they still carried the dead KEEPERHUB_API_KEY and no signing secret, so they signed with an empty secret and the app rejected them (401) in local dev.
- Give every compose producer (incl. test-dispatcher) INTERNAL_SERVICE_HMAC_SECRET.
- Drop the dead legacy keys from the env-service-keys anchor and the app services (the app verifies via its DB-backed store, not these env vars).
- Document AGENTIC_WALLET_HMAC_KMS_KEY (the 32-byte base64 AES key the app needs to encrypt/decrypt the store) in .env.example; refresh the scheduler env docs.
Verified locally: db:push + seed + producer-sign + store decrypt round-trip and signature match all pass against a real Postgres.
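A minimal sketch of the kind of encrypt/decrypt round-trip described above, assuming AES-256-GCM and an iv + tag + ciphertext envelope. The cipher mode, envelope layout, and function names are assumptions; the message only states that AGENTIC_WALLET_HMAC_KMS_KEY is a 32-byte base64 AES key.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Decode the base64 key and reject anything that is not exactly 32 bytes.
function loadKey(base64Key: string): Buffer {
  const key = Buffer.from(base64Key, "base64");
  if (key.length !== 32) {
    throw new Error(`expected a 32-byte key, got ${key.length} bytes`);
  }
  return key;
}

// Assumed envelope layout: 12-byte IV, 16-byte GCM auth tag, then ciphertext.
function encryptSecret(key: Buffer, plaintext: string): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

function decryptSecret(key: Buffer, envelope: string): string {
  const raw = Buffer.from(envelope, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// Throwaway key for the demo only; a real key would come from the environment.
const demoKey = loadKey(randomBytes(32).toString("base64"));
const stored = encryptSecret(demoKey, "signing-secret");
console.log(decryptSecret(demoKey, stored) === "signing-secret");
```

A key of the wrong length fails fast in loadKey rather than surfacing later as an opaque cipher error, which makes a misconfigured .env easier to diagnose.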
…ploy
NEEDS LIVE TEST - not yet validated against a real minikube deploy.
Mirrors the pr-env seeding for the full local k8s path (setup-local.sh):
- Add AGENTIC_WALLET_HMAC_KMS_KEY (kv literal) to the app env so the local app can encrypt/decrypt its internal_service_hmac_secrets store and verify HMAC-signed requests from the local dispatchers.
- Add a fully non-blocking init container (pnpm db:push then seed; both lines tolerate failure) so it can never block app startup - worst case the store stays unseeded, same as before, no regression.
The primary local path (docker-compose) already seeds via the hmac-seed service; this covers the minikube/setup-local.sh path.
…tainer
Live test on pr-1511 showed the db-migration initContainer seed ran but failed
("No secret provided on stdin"): its INTERNAL_SERVICE_HMAC_SECRET secretKeyRef
env was empty because the initContainer starts before external-secrets syncs the
value from SSM, and secretKeyRef env is fixed at container start. The reaper then
got HTTP 401 "No active signing secret" (empty store).
- Remove the seed from the db-migration initContainer (it cannot win the ESO
race; reverts that part of the earlier pr-env change).
- Add a post-app-deploy workflow step: after 'helm --atomic' (app ready), poll
until the secret is populated, then run a one-shot migrator-image Job that
seeds the store - signing secret + KMS key via secretKeyRef (now populated),
DATABASE_URL built from the PR db credentials.
Local minikube keeps its initContainer seed: it uses kv literals (injected
directly, no external-secrets), so there is no sync race there.
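The poll-then-seed ordering can be sketched as a small wait loop. This is an illustration only: the actual step is a workflow step around kubectl, and the names waitForSecret and readSecret, the interval, and the timeout here are hypothetical.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls readSecret until it returns a non-empty value, or throws once the
// next attempt would fall past the deadline. readSecret stands in for
// whatever reads the synced secret (hypothetical; not the real workflow code).
async function waitForSecret(
  readSecret: () => Promise<string>,
  intervalMs: number,
  timeoutMs: number,
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await readSecret();
    if (value.length > 0) return value;
    if (Date.now() + intervalMs > deadline) {
      throw new Error("secret still empty after timeout");
    }
    await sleep(intervalMs);
  }
}

// Demo: the secret is empty for the first two reads, as it would be while
// external-secrets is still syncing, then becomes available.
let reads = 0;
const fakeRead = async (): Promise<string> => (++reads < 3 ? "" : "synced-value");

waitForSecret(fakeRead, 10, 1000).then((value) => {
  console.log(`secret available after ${reads} reads, length ${value.length}`);
});
```

Only after the wait resolves would the seed Job be created, so its secretKeyRef env is populated at container start.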
The seed script left the DB connection open after a successful insert, so the process - and the post-deploy hmac-seed Job pod - never exited; the Job stayed Running and the workflow's kubectl wait hit its 180s timeout. Exit 0 explicitly on success. Live pr-1511 confirmed the seed and reaper HMAC auth already work (reaper now returns HTTP 200); this just lets the Job report completion and be cleaned up by ttlSecondsAfterFinished. Verified: pnpm type-check passes.
Reduce execution-mode defaults from 25 VUs x 65 wf/VU (1625 workflows) to 5 VUs x 30 wf/VU (150 workflows) with a slower 30s ramp, and move the weekly cron from Sat 20:00 UTC to Sun 04:00 UTC, off the busiest organic traffic window. Defaults updated in dispatch inputs, scheduled run fallbacks, and the run summary.
…test-10x-prod-avg ci: recalibrate weekly k6 load test to 10x prod average concurrency
docs: KEEP-573 four-dashboard observability spec for review
…y-internal-auth refactor: remove legacy bearer internal-service auth after HMAC cutover