Conversation
…bstitutions Legacy-mode assertResolved previously called logSystemError, which paged on-call via Sentry and bumped the generic engine-errors counter. That made the would-have-aborted exposure question unanswerable from a dashboard without alert noise. Switch the legacy branch to logSystemWarn and emit keeperhub_template_resolve_legacy_substitution_total once per call, with action_type and reason labels (single reason or "mixed"). Dashboards can then run sum by (action_type, reason) (rate(...)) to quantify the exposure window before the env var is hard-deprecated.
KEEPERHUB_TEMPLATE_RESOLVE_MODE was never set in any deployment config, Helm value, or .env file. Strict has been the default since KEEP-468 (PR #1187) with no observed regressions, so the staged migration window the env var was protecting is no longer needed. Delete the env var read, the legacy branch in assertResolved, the TemplateResolveMode type, the dedicated exposure counter added in the previous commit (now dead code), and the legacy-mode test cases. Update the user-facing error message and templating doc to drop the opt-out hint. assertResolved now always throws TemplateResolutionError on any unresolved reference. Completes the KEEP-525 scope. PR #1305 stays draft until reviewed.
Replaces the bearer-token-only X-Service-Key scheme with HMAC-SHA256
request authentication bound to method, path, caller, body digest, and
unix timestamp. 300-second replay window matches the agentic-wallet
precedent; no server-side nonce cache for the same latency-cost reason.
Signing secrets move into a new versioned internal_service_hmac_secrets
table modelled on agentic_wallet_hmac_secrets: AES-256-GCM at rest,
AAD-bound to (caller, key_version), 24h rotation grace via expires_at.
v1 reuses AGENTIC_WALLET_HMAC_KMS_KEY (a dedicated key is a documented
follow-up) and stores one shared secret under caller='*shared*' (per-
producer split is a one-row follow-up).
Verifier accepts both schemes during the dual-accept rollout window:
- presence of any X-KH-* header commits to the HMAC path with NO
fallback (prevents an attacker who knows the legacy bearer from
appending a junk signature to slip through the looser path),
- INTERNAL_AUTH_REQUIRE_HMAC=true rejects all legacy bearers and is
the per-environment kill-switch,
- X-Service-Key or X-Internal-Token falls back to the existing
timingSafeEqual compare against the four *_SERVICE_API_KEY env vars
plus KEEPERHUB_API_KEY.
Consolidates the /api/workflows/events bespoke X-Internal-Token check
into authenticateInternalService. The 500-on-misconfigured-env path is
intentionally dropped because it leaks operational state to clients.
All 12 internal-service consumer routes migrated to a raw-body pattern
so the HMAC verifier hashes the same bytes that arrive over the wire.
The workflow execute route was the highest-risk migration because it
consumes the body twice; raw body is captured once at the top and
parsed from the captured string thereafter.
Audit log via logInternalAuthEvent fires once per request with caller,
scheme, route, ip, key_version, latency. Caller field sanitized to the
InternalCaller enum or 'unknown' to bound log cardinality. Logs-only
for v1; Prometheus counter and Sentry breadcrumb deferred.
Drizzle migration 0094 hand-written since drizzle-kit generate is
broken in this repo; journal idx 94 appended with when=1779843900001
(monotonic).
Verified locally:
- pnpm exec biome check: clean on touched files
- pnpm type-check: clean across the whole codebase
- pnpm vitest run tests/unit tests/integration: 6132/6132 pass
- pnpm db:migrate against the dev DB: applies cleanly, table exists
with PK on (caller, key_version)
…sibility Add a pluggable geo-IP resolver with a provider fallback chain and use its coordinates to flag physically implausible logins, plus a country-coarse per-member session view for org owners and admins. - geo-IP strategy chain (ipinfo, ipapi.co, ipwho.is, freeipapi) behind one ResolvedLocation interface; keyless providers keep resolution working with no credentials configured - login-risk now records coordinates in risk_flags_json and flags impossible_travel when the velocity between consecutive sessions exceeds 800 km/h over at least 100 km, catching same-country city hops - org owners and admins can view a member's recent sessions narrowed to country plus the anomaly verdict; IP, city, and coordinates are filtered out server-side and never reach the client - shared risk-flags decoder and session formatters deduped across the self-service and admin views
Make the per-member Sessions trigger and the member Remove control icon-only with accessible labels, and add a title to the member email so truncated addresses stay distinguishable. Frees horizontal room in the members panel so emails are no longer cut to ambiguity in the narrow modal.
Resolve drizzle/meta/_journal.json conflict: staging owns 0094_chain_status and 0095_keep_669_reconcile_incident_indexes; renumber the internal-service HMAC migration to 0096 (when=1779843900003) to preserve monotonic ordering. events/route.ts auto-merge verified clean (non-overlapping auth vs chains.status). type-check passes on the merged tree.
The internal-service HMAC store decrypts seeded secrets with AGENTIC_WALLET_HMAC_KMS_KEY (shared with agentic-wallet for v1). Staging and prod app values set it; the PR-environment app template did not, so a seeded internal_service_hmac_secrets row could never be decrypted and every HMAC request fell through to "No active signing secret". Mirror the staging entry so PR envs can validate the HMAC accept-path once INTERNAL_AUTH_REQUIRE_HMAC is flipped.
Standalone single-replica service that serves the DB-sourced Prometheus gauges (the former /api/metrics/db scrape) off the request-serving pods. Executor-style: build-context = repo root, imports lib/metrics verbatim via tsx, so the exposed gauge families are identical to today's endpoint. - index.ts boots a node:http server (PORT, default 9090) with graceful shutdown - server.ts serves GET /metrics (updateDbMetrics -> getDbMetrics, TTL- and pool-gated in lib/) and GET /health - server.test.ts mocks the metrics module and exercises the HTTP wiring on an ephemeral port (200 + gauges, /health, 500 on refresh failure, 404) Containerization, CI, and deploy wiring follow in later commits (TECH-6484).
Dockerfile metrics-collector stage (executor-style: reuses lib/ + root
node_modules via tsx, shims server-only, serves on :9090) and a
docker-bake.hcl target/group + METRICS_COLLECTOR_ECR_REPO var. Bake config
validated with 'buildx bake --print'.
Image build/push runs once the ECR repo (keeperhub-metrics-collector-{env})
exists and a deploy workflow wires the bake target -- following commits.
…(Stage 3)
- deploy/metrics-collector/{staging,prod}/values.yaml: replicaCount 1, serves
/metrics on :9090, ServiceMonitor scrapes the DB gauges from this one pod
(deterministic, no hashmod). Minimal env -- only DATABASE_URL; no SQS/Turnkey.
- .github/workflows/deploy-metrics-collector.yaml: standalone (events-style)
trigger on staging/prod + dispatch; bakes the metrics-collector target and
helm-deploys via the shared techops-services/common chart, release name
keeperhub-metrics-collector, namespace keeperhub.
Does NOT cut over: the app's db-metrics ServiceMonitor and /api/metrics/db
route stay in place until the collector is verified in staging (Stage 4).
Requires the ECR repo + TFC workspace (terraform drafted in
techops_infrastructure, applied by infra) before the first deploy.
Validated by building the image: the metrics-collector stage now (1) is included in the Docker source stage (COPY keeperhub-metrics-collector/) so the build doesn't fail, and (2) drops the COPY --from=builder generated-file lines -- the metrics import graph references no builder-generated file at runtime (lib/db/schema only uses lib/types/integration via import type, which tsx erases). The stage now depends on source only, so the image builds without the expensive Next builder stage. Confirmed: image builds, prometheus-api + db-metrics import cleanly, /health 200, /metrics runs the real queries.
Replace url.includes(host) substring checks in the geo-IP resolver test with exact hostname comparison, clearing four js/incomplete-url-substring- sanitization high alerts and removing the freeipapi/ipapi.co substring collision workaround.
…mpossible-travel-admin feat(security): impossible-travel detection and per-member session visibility
…-telemetry refactor(template-resolver): remove legacy template-resolution opt-out
…uild-check CI Review feedback on PR #1439 (suisuss): - #3 Startup defined twice: drop the helm command/args override from deploy/metrics-collector/{staging,prod}/values.yaml; rely on the Dockerfile CMD (tsx keeperhub-metrics-collector/index.ts) as the single source of truth. - #4 Nothing builds this stage in CI: add .github/workflows/collector-build-check.yml -- a PR smoke job that builds the metrics-collector target and runs keeperhub-metrics-collector/import-check.ts inside the image, asserting the runtime import graph resolves. Catches a future value-import of a builder-generated module (server.test.ts mocks the module, so it can't). Validated locally: image builds, import-check prints IMPORTS_OK, exit 0. - #1 ServiceMonitor port name: matches the executor pattern; will verify the actual Prometheus target in staging (replied on the PR).
Staging advanced 40 commits and added 0096_keep_683_workflow_gas_columns, colliding with our hand-numbered migration. Renumber the internal-service HMAC migration to 0097 (when=1779843900004) and keep staging's 0096. type-check passes on the merged tree.
…al-service-auth hotfix(security): harden internal service auth with HMAC
…lector-service feat(metrics): dedicated keeperhub-metrics-collector service (TECH-6484)
Marketplace indexers (thespawn) score the agent "not a callable endpoint" because their liveness probe needs an anonymous tools/list, while /mcp is OAuth-gated. Verified the top-rated agents pass by exposing an anonymous, x402-priced MCP surface; KeeperHub couldn't follow without re-exposing the org-scoped tools the recent security work locked down. Add a dedicated anonymous endpoint that serves only public, already-public data and never touches the hardened /mcp path: - app/mcp/public/route.ts: anonymous initialize + tools/list + tools/call, plus a GET health probe. Never reads Authorization; pins a non-empty org sentinel (__anon_public__) and the terminal mcp:public scope; per-IP rate limits (cf-connecting-ip), stricter on tools/call. - oauth-scopes: SCOPE_MCP_PUBLIC + hand-curated PUBLIC_TOOLS allowlist (7 read tools + call_workflow). isToolAllowed treats mcp:public as terminal deny-by-default -- never widens to read/write/admin. The allowlist excludes validate_workflow and prepare_test_pin_data (org-scoped reads that masquerade as read-only). - server: createMcpServer registers ONLY public tools under mcp:public (tools/list reflects registrations, so non-public tools must be dropped at registration, not just call-gated) and skips org-scoped resources. - rate-limit: getClientIp prefers the unspoofable cf-connecting-ip. - .well-known/mcp.json: advertise /mcp/public as the primary (anonymous) endpoint with the public tool list; keep the OAuth /mcp under authenticatedEndpoint. No onchain re-pin needed. call_workflow stays listed-only via the existing /api/mcp/workflows/[slug]/call guards (isListed + reachable; paid -> 402). /mcp is byte-for-byte unchanged. Tests: deny-by-default scope + registration filter (no catalog leak), plus route-level coverage for cross-surface session isolation, per-IP rate-limit buckets, and anon-identity pinning. Dev-verified: anon tools/list returns exactly the 9 public tools, search_workflows executes anonymously, delete_workflow is not found, non-SSE-Accept POST succeeds, /mcp still 401s.
Update the ERC-8004 agent card's agentWallet and platformFeeRecipient, the served /api/agent-registry route, and the docs example from the deprecated 0xC300...362B51 wallet to the current 0xAA70...8Fee address. The old wallet is no longer in use; the public card (and its onchain-pinned copy) and docs still advertised it, so an agent reading agentWallet / platformFeeRecipient directly, or following the doc example, could route funds to a stale address. The active x402 402 payTo is unaffected (it is built per-workflow from creatorWalletAddress, not this address). Propagation note: the onchain ERC-8004 card is IPFS-pinned; this repo change updates the served endpoints, but the pinned card requires a re-pin (pin-agent-card.ts -> new CID, update-agent-uri.ts -> setAgentURI) to propagate to indexers (8004scan/thespawn). Tracked under the security remediation epic.
hotfix(agent): rotate advertised agent wallet to current address
feat(mcp): anonymous /mcp/public marketplace endpoint
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.