The Workflow DevKit runtime has no abort mechanism, so after the cancel endpoint sets status to "cancelled" the runtime continues executing steps and writing "success" logs to the DB. This causes steps to flash green and node borders to stick on their last color.

Add three layers of defense:
- Server-side guards in workflow-logging.ts: logStepStartDb, logStepCompleteDb, updateCurrentStep, and incrementCompletedSteps all bail out when the execution is in a terminal state (cancelled/success/error)
- Cancel endpoint cleanup: mark any in-flight "running" step logs as "error" and protect the internal PATCH route from overwriting a cancelled execution
- Client-side fixes: Runs panel does one final log refresh when an execution transitions to terminal, then stops polling it; toolbar resets nodes to idle on cancel; new runsRefreshTriggerAtom makes the Run row appear immediately after clicking Run instead of waiting for the 2s poll
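The server-side guard described above could look roughly like the sketch below. It is a minimal illustration, not the code in workflow-logging.ts: the `ExecutionRow` shape, the in-memory `executions` map standing in for the database, and the `incrementCompletedStepsGuarded` name are all assumptions made for the example.

```typescript
// Hypothetical status union; the real schema is not shown in this PR.
type ExecutionStatus = "pending" | "running" | "success" | "error" | "cancelled";

// The three terminal states named in the commit message.
const TERMINAL_STATUSES: ReadonlySet<ExecutionStatus> = new Set<ExecutionStatus>([
  "cancelled",
  "success",
  "error",
]);

function isTerminal(status: ExecutionStatus): boolean {
  return TERMINAL_STATUSES.has(status);
}

// In-memory stand-in for the executions table (assumption for illustration).
interface ExecutionRow {
  status: ExecutionStatus;
  completedSteps: number;
}
const executions = new Map<string, ExecutionRow>();

// Returns true if the write happened, false if the guard bailed out.
function incrementCompletedStepsGuarded(executionId: string): boolean {
  const row = executions.get(executionId);
  if (row === undefined || isTerminal(row.status)) {
    // The runtime may still be executing steps after a cancel; ignore its writes.
    return false;
  }
  row.completedSteps += 1;
  return true;
}
```

The same early-return check would sit at the top of each logging helper, so a late step completion cannot move a cancelled run back to a non-terminal state.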
- Rename e2e-tests.yml to e2e-tests-local.yml with explicit label gating
- Remove workflow_run trigger that caused phantom CI runs on every PR push
- Add e2e-vitest-remote and e2e-playwright-remote to deploy-pr-environment.yaml gated by run-e2e-tests-pr-deploy label
- Add e2e-playwright-remote to deploy-keeperhub.yaml for post-deploy verification
- Vitest remote on PR envs uses kubectl port-forward to CloudNativePG and LocalStack
- No vitest-remote on staging/prod (direct DB writes risk corrupting live data)
- Add docs/testing/README.md with workflow architecture and design decisions
Check github.event.label.name to distinguish which label was just added. Adding run-e2e-tests-pr-deploy no longer re-deploys the PR environment. Adding unrelated labels no longer re-runs local e2e tests. On synchronize events (new commits), jobs re-run if their labels are present.
Add TEST_API_KEY-gated admin endpoints so Playwright tests can retrieve OTP codes and invitation IDs from deployed environments without direct DB access. This removes the --grep-invert exclusions for invitation and wallet tests on remote runs.
- keeperhub/lib/admin-auth.ts: timing-safe Bearer auth + @techops.services validation
- /api/admin/test/otp: OTP lookup via Drizzle
- /api/admin/test/invitation: invitation ID lookup via Drizzle
- Test utils auto-switch between API (remote) and direct DB (ephemeral)
- Preflight checks in global-setup.ts validate env vars before tests run
- Rename e2e-tests-local.yml -> e2e-tests-ephemeral.yml (jobs: ephemeral)
- Add TEST_API_KEY to deploy values (SSM) and workflow env
deploy-keeperhub.yaml referenced "E2E Tests Local" but the workflow was renamed to "E2E Tests Ephemeral". workflow_run triggers match by name, so the deploy pipeline would never trigger after ephemeral tests pass.
Mermaid diagrams, workflow files table, and label reference still used the old "local" naming. Updated to match the actual workflow file and job names (ephemeral).
…module The same function was duplicated in auth.ts and invitations.ts. Extracted to admin-fetch.ts and imported from both.
The early return on length mismatch allowed an attacker to infer key length via timing. Hash both values with SHA-256 first so timingSafeEqual always compares fixed-length inputs.
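A minimal sketch of the hash-then-compare approach described above, using Node's built-in crypto module. The helper names `sha256` and `safeCompare` are illustrative assumptions and may not match what keeperhub/lib/admin-auth.ts actually exports.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// SHA-256 always yields a 32-byte digest, regardless of input length.
function sha256(value: string): Buffer {
  return createHash("sha256").update(value, "utf8").digest();
}

// Compare two secrets without an early return on length mismatch.
// timingSafeEqual throws if its inputs differ in length, which is why
// both sides are hashed to a fixed size first.
function safeCompare(provided: string, expected: string): boolean {
  return timingSafeEqual(sha256(provided), sha256(expected));
}
```

Because both digests are the same size, the comparison does the same amount of work whether the provided key is too short, too long, or the right length.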
The @techops.services email check is an authorization constraint, not a validation error. 403 Forbidden is semantically correct.
…onment The new remote test jobs used actions/checkout@v4 while the rest of the file uses v6. Standardized to v6.
- Unit tests for authenticateAdmin and validateTestEmail (9 cases)
- Integration tests for GET /api/admin/test/otp (10 cases)
- Integration tests for GET /api/admin/test/invitation (7 cases)

Covers auth rejection, email domain restriction, DB success/empty/error paths, and OTP value parsing.
Allows the e2e-playwright-remote job to access environment-scoped secrets (TEST_API_KEY) needed for admin test API authentication.
…orward Add AWS/kubectl/DB port-forward steps to the e2e-playwright-remote job so persistent test users are seeded before tests run. Update global-setup to run seed in remote mode when DATABASE_URL is available.
Replace workflow_run triggers with workflow_call reusable workflows coordinated by orchestrator files. This associates workflow runs with their source branch in the GitHub Actions UI.
- Add ci-pipeline.yml orchestrator (e2e-tests -> deploy)
- Add release-pipeline.yml orchestrator (release -> docs-sync)
- Convert e2e-tests-ephemeral, deploy-keeperhub, release, docs-sync to reusable workflows (workflow_call)
- Pass caller_event input to work around github.event_name always being 'workflow_call' in reusable workflows
- Simplify branch/SHA expressions by removing workflow_run fallbacks
- Add type headers (orchestrator/reusable) to all pipeline workflows
signOut() previously did nothing if the user menu wasn't visible, allowing tests to silently proceed while still logged in. Now uses expect assertions that throw on timeout, failing the test immediately.
Linear 500ms retries flake under load. Exponential backoff (500ms -> 1s -> 2s -> 4s cap) with 8 retries gives the server time to process OTP generation asynchronously.
Same fix as OTP polling: exponential backoff (500ms -> 4s cap) instead of linear 500ms retries.
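The backoff schedule from the two polling fixes above (500ms, doubling, capped at 4s) can be sketched as follows. The function names, the `fetchOnce` callback, and the injectable `sleep` parameter are assumptions for illustration; the test utilities in the repository may be structured differently, and "8 retries" is interpreted here as at most eight attempts.

```typescript
// Delay before the next attempt: 500ms, 1s, 2s, then 4s for every later attempt.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 4000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Poll until fetchOnce returns a non-null value or the attempts run out.
async function pollWithBackoff<T>(
  fetchOnce: () => Promise<T | null>,
  maxAttempts = 8,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchOnce();
    if (result !== null) {
      return result;
    }
    await sleep(backoffDelayMs(attempt));
  }
  return null; // Caller decides whether a missing value is a test failure.
}
```

With these defaults the total wait across eight failed attempts is 27.5 seconds, compared with 4 seconds for eight linear 500ms retries.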
Stop forwarding error.message to the client in 500 responses. The detail is still logged server-side via console.error.
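As a rough sketch of the pattern described above: the error detail goes to the server log and the client receives only a generic message. The `toInternalErrorResponse` helper and the plain `{ status, body }` return shape are hypothetical; the actual routes presumably build their responses with the framework's own response type.

```typescript
interface ErrorResponse {
  status: number;
  body: { error: string };
}

// Log the real error server-side, return only a generic message to the client.
function toInternalErrorResponse(error: unknown): ErrorResponse {
  console.error("Admin test route failed:", error);
  return { status: 500, body: { error: "Internal server error" } };
}
```

Keeping the message constant avoids leaking details such as SQL fragments or hostnames that can appear in database driver errors.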
Admin routes now return "Internal server error" instead of forwarding error.message. Update test expectations to match.
deploy-keeperhub.yaml is now a reusable workflow called via workflow_call from ci-pipeline.yml, not triggered by workflow_run.
…ction Deduplicate ~40 lines of setup steps repeated across 6 jobs. New .github/actions/setup-node-pnpm/action.yml with two boolean inputs: install-playwright and discover-plugins.
- Rename "Local" context to "Ephemeral" to match naming convention
- Fix mermaid diagram: replace non-existent check-e2e-label with check-labels
- Update e2e-tests-ephemeral.yml trigger to reflect workflow_call pattern
- Correct remote vitest test count from 114+ to ~130
- Add composite action (setup-node-pnpm) documentation to workflow docs
…ignals
Add data attributes to app components for test automation:
- workflow-canvas: data-ready for canvas load state
- org-switcher: data-state for switching/loading/ready
- accept-invite: data-page-state for hydration state

Replace waitForTimeout/networkidle with element assertions:
- auth.setup: remove 2s sleeps, use domcontentloaded + org-switcher wait
- auth utils: replace networkidle with org-switcher visibility
- workflow utils: use data-ready, waitForURL, element assertions
- invitations: replace retry loop with data-page-state wait
- workflow.test/schedule-trigger.test: remove all waitForTimeout calls
- organization-wallet.test: replace toast race with element assertion

Stabilize playwright config:
- fullyParallel: false, workers: 1 (serial to avoid shared-state conflicts)
- retries: 2 (handles environmental flakiness)
- reporter: github + html in CI, list locally
- Unskip analytics-gas, scheduled-workflow, web3-balance, para-wallet tests
- Remove stop-execution.test.ts and ORG-4 placeholder (unimplemented UI)
- Fix Para Wallet "Create Wallet" ambiguous selector (data-slot scoping)
- Fix analytics-gas scrollIntoView race (wait for attachment first)
- Remove CI sharding from e2e-tests-ephemeral.yml (serial execution)
- Move regex literal to top-level scope (biome lint fix)
- Use consistent button[role="combobox"] selector in signIn()
- Update docs with CI execution model and stability decisions
The deployed PR environment runs NODE_ENV=development, not test/CI, so rate limiting was still active. Add DISABLE_AUTH_RATE_LIMIT env var to PR environment values and check it in the auth config.
Use better-auth customRules to bypass rate limiting when requests include a valid X-Test-API-Key header. Playwright config sends this header automatically when TEST_API_KEY env var is set. This keeps rate limiting active for real users while allowing E2E tests to run without hitting limits in PR, staging, and prod environments.
Add testFetch() and getTestHeaders() to vitest E2E utils for future tests that hit auth endpoints. Also add X-Test-API-Key to Playwright admin-fetch headers.
Fix customRules return type (return currentRule instead of undefined), remove unused biome-ignore suppression, drop unnecessary async.
Remote and ephemeral E2E test jobs are disabled (if: false) across deploy-keeperhub, deploy-pr-environment, and e2e-tests-ephemeral workflows while auth/rate-limit infrastructure is being stabilised.
Remote tests gated by ENABLE_E2E_REMOTE_TESTS, ephemeral tests by ENABLE_E2E_EPHEMERAL_TESTS. Both are GitHub repository variables. Currently ephemeral=true, remote=false.
Condition-based branching workflows (e.g. parallel "Balance < 1 ETH" and "Balance >= 1 ETH") incorrectly show "Error" status when one branch is dead. Root cause: finalSuccess treats every result entry equally, so a condition that fails because it references an unexecuted dead-branch node poisons the entire run.

Three fixes:
- Track condition routing decisions (conditionDecisions map) and exclude nodes on not-taken branches from the finalSuccess calculation.
- Harden replaceTemplateVariable: when a referenced node exists in the graph but was never executed (dead branch), return undefined instead of throwing, so the condition evaluates gracefully to false.
- Add diagnostic logging when finalSuccess is false in a branching workflow to aid production debugging.
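The first fix above can be illustrated with a simplified sketch. Everything here is an assumption made for the example: the adjacency-list `Graph` type, the boolean values in `conditionDecisions` (true meaning the condition's downstream branch was taken), and both function names. The executor's real data structures are not shown in this PR, and this version ignores nodes reachable through more than one branch.

```typescript
// Node id -> ids of directly downstream nodes (hypothetical representation).
type Graph = Record<string, string[]>;

interface NodeResult {
  success: boolean;
}

// Collect every node downstream of a condition whose branch was not taken.
function nodesOnNotTakenBranches(
  graph: Graph,
  conditionDecisions: Map<string, boolean>,
): Set<string> {
  const skipped = new Set<string>();
  const queue: string[] = [];
  conditionDecisions.forEach((taken, conditionId) => {
    if (!taken) {
      queue.push(...(graph[conditionId] ?? []));
    }
  });
  while (queue.length > 0) {
    const nodeId = queue.shift() as string;
    if (skipped.has(nodeId)) {
      continue;
    }
    skipped.add(nodeId);
    queue.push(...(graph[nodeId] ?? []));
  }
  return skipped;
}

// A run succeeds if every node that was actually on a taken path succeeded.
function computeFinalSuccess(
  results: Record<string, NodeResult>,
  graph: Graph,
  conditionDecisions: Map<string, boolean>,
): boolean {
  const skipped = nodesOnNotTakenBranches(graph, conditionDecisions);
  return Object.keys(results)
    .filter((nodeId) => !skipped.has(nodeId))
    .every((nodeId) => results[nodeId]?.success === true);
}
```

In the two-branch balance example, a failed result recorded for a node behind the not-taken condition no longer turns the whole run into "Error".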
The Workflow DevKit's durability layer can throw errors after withStepLogging has already recorded a step as successful. Previously only "exceeded max retries" errors were reconciled (KEEP-1541). This adds a second pass (reconcileSdkFailures) that catches any remaining failed node whose step was recorded as successful, covering SDK errors with different messages that surface during parallel/branching execution (event log corruption, state replay mismatches, unexpected event types).
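A hedged sketch of what such a reconciliation pass might look like. Only the name reconcileSdkFailures comes from the commit message; the `NodeOutcome` and `StepLogStatus` types, the parameters, and the returned list of reconciled node ids are assumptions for illustration.

```typescript
type StepLogStatus = "running" | "success" | "error";

interface NodeOutcome {
  success: boolean;
  error?: string;
}

// Flip any node the SDK reported as failed when its step log already
// recorded a success. Returns the ids that were reconciled.
function reconcileSdkFailures(
  outcomes: Record<string, NodeOutcome>,
  stepLogStatus: Record<string, StepLogStatus>,
): string[] {
  const reconciled: string[] = [];
  for (const nodeId of Object.keys(outcomes)) {
    const outcome = outcomes[nodeId];
    if (outcome !== undefined && !outcome.success && stepLogStatus[nodeId] === "success") {
      // The step itself completed; the error came from the durability layer.
      outcomes[nodeId] = { success: true };
      reconciled.push(nodeId);
    }
  }
  return reconciled;
}
```

The check keys on the recorded step status rather than on the error text, which is what lets it cover SDK errors with different messages.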
…lse-error-status fix: KEEP-1512 condition node branching false error status
…retries-all-steps fix: disable SDK retries on all web3 steps and match error formats
Cancelled step logs were incorrectly marked as "error". Added "cancelled" to the log status type union to accurately reflect user-initiated stops.
Update type annotations in workflow-runs, workflow-store, api-client, and template-helpers to accept "cancelled" for step log status.
Cancelled runs were previously grouped under error. Now they appear as their own status with orange styling in the time series chart, runs table, and status filter dropdown.
feat: Add Stop mode to Run button to be able to cancel runs (workflows with Manual Trigger only)
…ility ci: restructure e2e test workflows and add admin test API
joelorzet approved these changes on Mar 5, 2026
No description provided.