Skip to content

test(e2e): execute real shell assertions; delete dry-run, --validate-only, and the bash runner#4380

Merged
jyaunches merged 42 commits into
mainfrom
feat/e2e-real-execution
Jun 8, 2026
Merged

test(e2e): execute real shell assertions; delete dry-run, --validate-only, and the bash runner#4380
jyaunches merged 42 commits into
mainfrom
feat/e2e-real-execution

Conversation

@jyaunches

@jyaunches jyaunches commented May 28, 2026

Copy link
Copy Markdown
Contributor

Why

Removing --dry-run and plan-mode execution from the E2E runner: they let a coding agent ship work behind false-green test runs. The TS runner now has one execution mode: live. No flag, env var, helper, or branch in any production path bypasses real assertion execution.

Two seams needed closing:

  1. phase.ts:executeStep had no child_process.spawn — every real shell/probe step fell through to a hardcoded failed: unsupported live step, and four placeholder refs (fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton) were the only paths that reported passed.
  2. ~30 shell scripts under validation_suites/, onboarding_assertions/, and nemoclaw_scenarios/ began with if e2e_env_is_dry_run; then exit 0; fi, so when the workflow passed --dry-run the scripts exited 0 before running.

What changed

Orchestrator (TypeScript)

  • phase.ts:executeStep now spawns shell steps via child_process.spawn, with detached process groups so timeouts kill bash + sleep cleanly via negative-pid signal. Probe steps return skipped: "probe not registered" until the registry lands. Pending steps return skipped: "pending: <ref>". Unknown kinds throw. Real evidence captured to .e2e/logs/<step-id>.log. Step-level reliability.timeoutSeconds and retry.{attempts,on} enforced here, not in clients.
  • run.ts: --dry-run, --validate-only deleted. Default invocation is live execution. --list and --plan-only (local debug) survive read-only. --emit-matrix added for the dynamic-matrix workflow (feat(e2e): generate scenario fan-out matrix from typed registry #4359), with the richer payload (runner, label, platform, suites) merged from main.
  • types.ts: RunContext.dryRun deleted.

Workflow

  • e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both --dry-run invocations (Linux + WSL), are gone. Workflows execute live. Summary step retitled from "typed dry-run summary" to "typed scenario run".
  • tools/e2e-scenarios/workflow-boundary.mts validator now rejects --dry-run, --plan-only, --validate-only in the workflow.

Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)

  • runtime/run-scenario.sh: 483 lines of duplicated install/onboard/gateway-check/suite-execution → 5-line fail-fast stub pointing at run.ts.
  • runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep walks typed assertionGroups directly; nothing in TS calls a YAML-walking bash runner.

Shell scripts (the leaves stay, the dry-run skip blocks die)

  • validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**: every if e2e_env_is_dry_run; then ... exit 0; fi and every [[ "${E2E_DRY_RUN:-0}" == "1" ]] short-circuit removed. The real assertion logic that was hiding underneath now runs unconditionally.
  • runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
  • inference_routing.sh: dead _e2e_inference_plan helper deleted.

Legacy YAML resolver retired (rolled in from the planned follow-up PR)

  • Deleted nemoclaw_scenarios/expected-states.yaml and the entire runtime/resolver/ tree (load.ts, plan.ts, schema.ts, coverage.ts, index.ts, validator.ts, expected-failure.ts, js-yaml.d.ts) plus runtime/coverage-report.sh.
  • Six framework tests that exercised only the resolver path were dropped: e2e-scenario-schema, e2e-scenario-resolver, e2e-coverage-report, e2e-scenario-additional-families, e2e-metadata-final-hygiene, e2e-expected-failure (typed equivalent now lives in e2e-negative-matcher.test.ts).
  • e2e-expected-state.test.ts: YAML-mirror parity tests dropped; replaced with id-coverage assertions against listExpectedStates(). The typed registry in scenarios/expected-states.ts is the single source of truth.

Tests

  • DELETED (validated dead code paths): e2e-suite-runner.test.ts, e2e-scenario-first-migration.test.ts, e2e-expected-state-validator.test.ts.
  • REWRITTEN e2e-phase-orchestrators.test.ts: now exercises real shell spawning via temp scripts (pass / fail-with-stderr-tail / timeout / retry-on-classified-transient / missing-ref), real probe skipping with visible reason, and real pending skipping. Replaces the prior placeholder refs with assertions that observe actual subprocess behavior.
  • TRIMMED e2e-lib-helpers.test.ts, e2e-scenario-additional-families.test.ts, e2e-scenario-resolver.test.ts, e2e-context-helper.test.ts: dry-run-mode / run-scenario.sh-spawning tests deleted; tests of real bash semantics survive.

Docs

  • test/e2e-scenario/docs/README.md: one runner, one mode (live), no dry-run, no validate-only. Migration tracking section reconciled with main: "Migration status is tracked outside the repository in GitHub issues and pull requests" (parent issue Implement layered E2E scenario model #3588).

Merge with main

  • Resolved conflicts in run.ts, 00-slack-provider-state.sh, docs/README.md, three modify/delete pairs under framework-tests/ and runtime/resolver/, and e2e-scenarios.yaml.
  • Noted: a small E2E --validate-only/--dry-run plumbing fix that was layered onto the live runner during merge — ~/.local/bin is now prepended to child PATH at the framework boundary (redaction.ts:buildChildEnv) so post-install nemoclaw shims resolve on runners (notably self-hosted GPU and WSL) where it isn't on the default PATH; and snapshot lifecycle assertions in validation_suites/lib/sandbox_lifecycle.sh now use the canonical nemoclaw <sandbox> snapshot <subcommand> argv shape, matching test-snapshot-commands.sh.

Verification

$ npx tsx test/e2e-scenario/scenarios/run.ts --list
hybrid scenario registry
- brev-launchable-cloud-openclaw: ...
- ubuntu-repo-cloud-openclaw: ...
(22 scenarios listed)

$ npx tsx test/e2e-scenario/scenarios/run.ts --emit-matrix
[{"id":"brev-launchable-cloud-openclaw","runner":"...","label":"...","platform":"...","suites":[...]}, ...]

$ E2E_CONTEXT_DIR=/tmp/x npx tsx test/e2e-scenario/scenarios/run.ts \
    --scenarios ubuntu-repo-cloud-openclaw
... (compiled plan output) ...
Phase results:
  environment: skipped (skipped=1)
  onboarding: failed (passed=1 failed=1)
  runtime: failed (failed=34 skipped=5)

$ cat /tmp/x/.e2e/logs/runtime.smoke.cli-available.log
smoke:cli-available
e2e context: missing required key(s): E2E_SCENARIO

$ vitest run test/e2e-scenario/
Test Files  30 passed (30)
     Tests  366 passed (366)

$ npm run typecheck:cli
(clean)

Audited absent in production paths: --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run, fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton, unsupported-live-step, --validate-only, RunContext.dryRun.

Spec gates addressed

  • Phase 6 — orchestrators execute live shell/probe assertion steps.
  • Phase 7 — single TS runtime entrypoint; bash runners deprecated; workflows use the typed runner with no --dry-run.
  • Workflow side of Phase 9--dry-run, --validate-only, and suite_filter gone from active paths.

Known follow-ups (not blocking this PR)

  • Two leftover if [[ -n "${E2E_DRY_RUN:-}" ]] short-circuits remain inside the hermes branch of validation_suites/messaging/slack/00-slack-provider-state.sh. They are dead code now that nothing in production sets E2E_DRY_RUN, but should be removed for consistency with the stated audit.

Summary by CodeRabbit

  • Tests

    • E2E now runs live scenarios by default; many legacy plan-only/dry-run tests removed or refocused. Phase execution uses real shell steps with timeouts, retries, richer evidence and failure reporting.
  • New Features

    • Added probes and lifecycle utilities for CLI/gateway/sandbox/negative/health checks. Typed runner can emit an execution matrix and is the canonical executor. Secret-redaction and child-env minimization added for safer spawn logging.
  • Chores

    • Docs updated for live-only workflow; legacy bash entrypoints deprecated; CI adds a new scenario and tighter artifact uploads.

…only, and the bash runner

The merged hybrid scenario architecture shipped scaffolding that looked like it
ran E2E tests but did not. Two layers were producing fake green:

1. phase.ts:executeStep had no child_process.spawn anywhere. Every real
   shell/probe step fell through to a hardcoded
   { status: "failed", message: "unsupported live step" }, and a handful of
   fake-pass refs (fake-pass, fake-retry-once-pass, fake-always-transient,
   phase-1-skeleton) were the only paths that reported "passed". CI green
   meant the plan compiled, not that any assertion executed.

2. ~30 shell scripts under validation_suites/, onboarding_assertions/,
   nemoclaw_scenarios/install/, and nemoclaw_scenarios/onboard/ began with
   "if e2e_env_is_dry_run; then echo [dry-run] ...; exit 0; fi". Once the
   dry-run flag flowed in (which workflows did pass), every script silently
   exited 0 before its real assertion ran.

This change rips out both layers in one shot. The TS runner has one execution
mode: live. There is no flag, env var, helper, branch, or comment in any
production path that can produce a fake pass.

Orchestrator (TypeScript)
- phase.ts: executeStep now spawns shell steps via child_process.spawn,
  with detached process groups so timeouts kill bash + sleep cleanly. Probe
  steps return skipped "probe not registered" until the registry lands.
  Pending steps return skipped "pending: <ref>". Unknown kinds throw.
  Real evidence is captured to .e2e/logs/<step-id>.log. Step-level
  reliability.timeoutSeconds and retry.{attempts,on} policy are enforced
  here, not in clients.
- run.ts: --dry-run, --validate-only deleted. Default invocation is live
  execution. --list and --plan-only (local debug) survive read-only.
  --emit-matrix added for the dynamic-matrix workflow (PR #4359).
- types.ts: RunContext.dryRun deleted. AssertionResult already supported
  "skipped" status, now actually used.

Workflow
- e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both
  --dry-run invocations (Linux + WSL), are gone. Workflows execute live.
- workflow-boundary.mts validator now requires --dry-run, --plan-only,
  and --validate-only to NOT appear in the workflow.

Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)
- runtime/run-scenario.sh: 483 lines of duplicated install/onboard/
  gateway-check/suite-execution -> 5-line fail-fast stub pointing at
  run.ts. The TS phase orchestrators own this work now.
- runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep
  walks typed assertionGroups directly; nothing in TS calls a YAML-walking
  bash runner.

Shell scripts (the leaves stay, the dry-run skip blocks die)
- validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**:
  every "if e2e_env_is_dry_run; then ... exit 0; fi" and every
  "[[ ${E2E_DRY_RUN:-0} == 1 ]]" short-circuit removed. The real assertion
  logic that was hiding underneath now runs unconditionally.
- runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
- inference_routing.sh: dead _e2e_inference_plan helper (only callable
  from the deleted dry-run paths) deleted.

Tests
- DELETED (validated dead code paths):
    e2e-suite-runner.test.ts             (run-suites.sh behavior)
    e2e-scenario-first-migration.test.ts (run-scenario.sh dry-run plan)
    e2e-expected-state-validator.test.ts (--validate-only mode)
- REWRITTEN:
    e2e-phase-orchestrators.test.ts: now exercises real shell spawning
      via temp scripts (pass/fail/timeout/retry/missing-ref), real probe
      skipping with visible reason, and real pending skipping. The
      previous fake-pass refs in this test were the canonical example of
      the problem.
- TRIMMED:
    e2e-lib-helpers.test.ts: dry-run-mode unit tests deleted; tests of
      real bash semantics survive.
    e2e-scenario-additional-families.test.ts: planOnly-via-bash tests
      deleted; resolveScenario-direct tests survive.
    e2e-scenario-resolver.test.ts: run-scenario.sh --plan-only spawn
      tests deleted; resolver unit tests survive.
    e2e-context-helper.test.ts: dry-run trace test deleted.

Docs
- docs/README.md: updated to state one runner, one mode (live), no
  dry-run, no validate-only. Bash entrypoints documented as deprecated
  fail-fast stubs.

Verification
- run.ts --list          : prints the typed registry (intact)
- run.ts --emit-matrix   : emits JSON matrix for the dynamic-matrix
                           workflow
- run.ts --scenarios <id>: spawns real shell scripts, real exit codes,
                           real failures with real evidence logs. Phase
                           results show passed/failed/skipped honestly.
- All 274 e2e-scenario framework tests pass.
- Audited: no surviving --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run,
  fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton,
  unsupported-live-step, --validate-only, or RunContext.dryRun in any
  production path.

CI for this PR will go red on environments where nemoclaw is not actually
installed and onboarded. That is the point. Red is the first honest signal
in months. Subsequent PRs (probe registry, OnboardingOrchestrator wiring
into the real install/onboard dispatchers, old YAML resolver deletion)
fix the real failures rather than hide them.

Spec gates addressed: Phase 6 (orchestrators execute live shell steps),
Phase 7 (single TS runtime entrypoint, bash runners deprecated), and
the workflow side of Phase 9 (--dry-run / --validate-only / suite_filter
gone from active paths). The old YAML resolver source under
runtime/resolver/ stays for now; its deletion is the next PR.
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Converts the E2E flow to a single TypeScript live runner: adds typed phase actions/results, executes actions and shell steps for real (with logs, timeouts, retries, evidence), deprecates bash runners, removes dry-run short-circuits across suites/install/onboard, updates docs/workflows, and prunes plan-only/dry-run tests.

Changes

Live E2E Execution Framework

Layer / File(s) Summary
Type contracts and phase action result model
test/e2e-scenario/scenarios/types.ts
Adds PhaseAction and PhaseActionResult, changes RunPlanPhase.actions to PhaseAction[], and removes dryRun from RunContext.
Compiler action generation and dispatcher entrypoint
test/e2e-scenario/scenarios/compiler.ts, test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
Generates typed shell actions for environment/onboarding phases with timeouts/evidence, centralizes helper script refs, updates plan rendering, and adds dispatch-action.sh as deterministic shell-fn entrypoint.
Phase orchestrator real execution path
test/e2e-scenario/scenarios/orchestrators/phase.ts
Executes phase.actions before assertions, runs shell steps via bash with repo-relative resolution, per-step logs, detached process-group timeouts, stderr-tail capture, retries, classifier-based failure categories, and updated phase status computation.
Phase orchestrator and compiler tests
test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
Refactors tests to use OS temp contexts and adds real shell execution suites (pass/fail/timeout/retry/missing-script), action-ordering/evidence checks, and compiler action-shape assertions.
CLI updates and bash runner deprecation
test/e2e-scenario/scenarios/run.ts, test/e2e-scenario/runtime/run-scenario.sh, test/e2e-scenario/runtime/run-suites.sh
run.ts drops --dry-run/--validate-only, gains --emit-matrix/--plan-only; legacy bash runners are converted to fail-fast deprecation stubs.
Docs and env helper cleanup
test/e2e-scenario/docs/README.md, test/e2e-scenario/runtime/lib/env.sh
README centers TS runner as the supported entrypoint and live execution model; removes e2e_env_is_dry_run helper.
Install, onboard, probes, and validation suites run live
test/e2e-scenario/nemoclaw_scenarios/*, test/e2e-scenario/validation_suites/**/*.sh, test/e2e-scenario/nemoclaw_scenarios/probes/*
Removes E2E_DRY_RUN short-circuits across install/onboard/assertions so checks run against live outputs; adds lifecycle and probe scripts and updates suite implementations to use sandbox exec wrappers and real verification.
Legacy dry-run and plan-only test pruning
test/e2e-scenario/framework-tests/e2e-*.test.ts
Deletes or narrows tests that relied on deprecated bash plan-only/dry-run flows, migrating coverage to typed resolver/runner tests.
Workflow wiring and artifact guardrails
.github/workflows/e2e-scenarios.yaml, .github/workflows/e2e-scenarios-all.yaml, tools/e2e-scenarios/workflow-boundary.mts
Adds ubuntu-rebuild-openclaw mapping/matrix, removes plan-only resolution steps, runs run.ts without --dry-run, renames dry-run summary text to live wording, and tightens artifact uploads to an explicit .e2e/* subpath allowlist with include-hidden-files: false; enforces forbidden flags at workflow boundary.

Sequence Diagram(s)

sequenceDiagram
  participant GH as GitHub Actions
  participant RunnerTS as run.ts (TypeScript)
  participant Orchestrator as PhaseOrchestrator
  participant Shell as bash script
  GH->>RunnerTS: invoke --scenarios / --emit-matrix
  RunnerTS->>Orchestrator: compile plans, execute phases/actions
  Orchestrator->>Shell: runAction -> spawn bash (logs, timeout)
  Shell-->>Orchestrator: exit code, evidence (logs/stderr tail)
  Orchestrator-->>RunnerTS: aggregate PhaseResult (actions + assertions)
  RunnerTS-->>GH: exit code + artifact emission
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

refactor, area: ci

Suggested reviewers

  • prekshivyas
  • cv

Poem

A rabbit hops through tests and logs,
Old dry-run tunnels fade like fog.
Typed actions march, scripts hum true,
Timeouts caught, evidence in view.
🐇 Hooray — green traces, tidy and new.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/e2e-real-execution

# Conflicts:
#	test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e-scenario/scenarios/orchestrators/phase.ts`:
- Around line 32-34: The branch uses a case-insensitive regex test on ref but
then calls case-sensitive ref.includes("tunnel") / ref.includes("cloudflared"),
causing mixed-case refs to misclassify; update the branching in the same block
(the if handling /provider|inference|chat-completion|cloudflared|tunnel/i) to
compare in a case-insensitive way—e.g., normalize ref to lowerCase once and use
that for the subsequent includes checks (or use case-insensitive regex matches)
so tunnel/cloudflared variants are correctly detected and return
"external-tunnel".

In `@test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh`:
- Line 62: The current substring check [[ "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" ==
*"${E2E_SANDBOX_NAME}"* ]] can produce false positives (e.g., sb1 matching
sb10); change it to perform an exact token/line match of the sandbox name
instead: test SANDBOX_LIFECYCLE_LAST_OUTPUT for the exact E2E_SANDBOX_NAME token
(for example by piping SANDBOX_LIFECYCLE_LAST_OUTPUT to grep -w/-x or using a
word-boundary regex with [[ ... =~ ... ]]) and fail if not found, so the check
around SANDBOX_LIFECYCLE_LAST_OUTPUT and E2E_SANDBOX_NAME only succeeds for
exact sandbox-name matches.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 401b8a23-1471-429d-9905-b413e6bae24c

📥 Commits

Reviewing files that changed from the base of the PR and between 1daf081 and b7acfb7.

📒 Files selected for processing (44)
  • .github/workflows/e2e-scenarios.yaml
  • test/e2e-scenario/docs/README.md
  • test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
  • test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
  • test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
  • test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
  • test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
  • test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
  • test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
  • test/e2e-scenario/runtime/lib/env.sh
  • test/e2e-scenario/runtime/run-scenario.sh
  • test/e2e-scenario/runtime/run-suites.sh
  • test/e2e-scenario/scenarios/orchestrators/phase.ts
  • test/e2e-scenario/scenarios/run.ts
  • test/e2e-scenario/scenarios/types.ts
  • test/e2e-scenario/validation_suites/assert/gateway-alive.sh
  • test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
  • test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
  • test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
  • test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
  • test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
  • test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
  • test/e2e-scenario/validation_suites/lib/inference_routing.sh
  • test/e2e-scenario/validation_suites/lib/messaging_providers.sh
  • test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
  • test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
  • test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
  • test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
  • test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
  • test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
  • test/e2e-scenario/validation_suites/sandbox-exec.sh
  • test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
  • test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
  • tools/e2e-scenarios/workflow-boundary.mts
💤 Files with no reviewable changes (30)
  • test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
  • test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
  • test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
  • test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
  • test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
  • test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
  • test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
  • test/e2e-scenario/validation_suites/lib/messaging_providers.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
  • test/e2e-scenario/scenarios/types.ts
  • test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
  • test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
  • test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
  • test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
  • test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
  • test/e2e-scenario/validation_suites/lib/inference_routing.sh
  • test/e2e-scenario/validation_suites/assert/gateway-alive.sh
  • test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
  • test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
  • test/e2e-scenario/validation_suites/sandbox-exec.sh
  • test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
  • test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
  • test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
  • test/e2e-scenario/runtime/lib/env.sh
  • test/e2e-scenario/framework-tests/e2e-context-helper.test.ts

Comment thread test/e2e-scenario/scenarios/orchestrators/phase.ts
Comment thread test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh Outdated
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: ubuntu-repo-cloud-openclaw, ubuntu-no-docker-preflight-negative, ubuntu-rebuild-openclaw, ubuntu-repo-cloud-openclaw-custom-policies
Optional E2E: wsl-repo-cloud-openclaw, macos-repo-cloud-openclaw, gpu-repo-local-ollama-openclaw, ubuntu-repo-cloud-hermes

Dispatch hint: ubuntu-repo-cloud-openclaw,ubuntu-no-docker-preflight-negative,ubuntu-rebuild-openclaw,ubuntu-repo-cloud-openclaw-custom-policies

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • ubuntu-repo-cloud-openclaw (high): Primary live typed scenario covering repo install, cloud OpenClaw onboarding, gateway/sandbox state validation, smoke, inference, and credentials after the workflow changed from dry-run to live execution.
  • ubuntu-no-docker-preflight-negative (medium): Validates the rewritten negative failure contract, preflight/onboarding failure handling, absent gateway/sandbox probes, and short-circuit behavior without creating forbidden side effects.
  • ubuntu-rebuild-openclaw (high): Newly routed/fanned-out scenario and touched manifest/lifecycle/rebuild-upgrade helpers must be exercised live to validate sandbox rebuild, workspace preservation, policy preservation, and post-rebuild inference.
  • ubuntu-repo-cloud-openclaw-custom-policies (high): Covers custom network policy, credentials/security assertions, gateway/sandbox probes, and baseline onboarding paths touched by the scenario framework and validation-suite changes.

Optional E2E

  • wsl-repo-cloud-openclaw (high): The WSL branch of the scenario workflow also changed from dry-run to live execution and copies artifacts back from WSL; useful platform confidence but not required if Ubuntu live coverage passes.
  • macos-repo-cloud-openclaw (medium): Platform macOS smoke suite was touched; useful to verify optional Docker/platform behavior after runner/orchestrator changes.
  • gpu-repo-local-ollama-openclaw (very-high): Ollama GPU and auth-proxy validation scripts changed; run when GPU capacity is available to confirm local inference routing remains intact.
  • ubuntu-repo-cloud-hermes (high): Hermes health/history suites were touched; optional adjacent coverage for non-OpenClaw assistant runtime behavior.

New E2E recommendations

  • artifact-redaction (high): The workflow now excludes hidden files and narrows uploaded .e2e paths to avoid leaking context.env; add a live or post-run assertion that downloaded artifacts never contain raw provider/API key values or context.env.
    • Suggested test: Add a scenario artifact-safety E2E check that runs a minimal secret-gated scenario and verifies the uploaded artifact contains only redacted logs and no raw .e2e/context.env.
  • workflow-input-validation (medium): resolve-runner no longer invokes typed plan validation before route lookup, so malformed-but-routed scenario IDs or registry drift could reach the live runner later than intended.
    • Suggested test: Add a workflow-boundary E2E/CI check that dispatches e2e-scenarios.yaml with valid, unknown, and mixed-runner scenario inputs and asserts early, clear failure messages.

Dispatch hint

  • Workflow: .github/workflows/e2e-scenarios.yaml
  • jobs input: ubuntu-repo-cloud-openclaw,ubuntu-no-docker-preflight-negative,ubuntu-rebuild-openclaw,ubuntu-repo-cloud-openclaw-custom-policies

@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: e2e-scenarios-all
Optional scenario E2E: None

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • e2e-scenarios-all: Scenario workflow files, scenario runtime/runner code, expected-state metadata, typed scenario orchestration, manifests, probes, and validation suite scripts changed. Per scenario E2E policy, changes to scenario workflows, runtime/runner code, catalog/expected-state metadata, or broad suite surfaces require the all-scenarios fan-out.
    • Dispatch: gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>

Optional scenario E2E

  • None.

Relevant changed files

  • .github/workflows/e2e-scenarios-all.yaml
  • .github/workflows/e2e-scenarios.yaml
  • test/e2e-scenario/docs/README.md
  • test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
  • test/e2e-scenario/framework-tests/e2e-coverage-report.test.ts
  • test/e2e-scenario/framework-tests/e2e-expected-failure.test.ts
  • test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
  • test/e2e-scenario/framework-tests/e2e-expected-state.test.ts
  • test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
  • test/e2e-scenario/framework-tests/e2e-metadata-final-hygiene.test.ts
  • test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts
  • test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
  • test/e2e-scenario/framework-tests/e2e-probes.test.ts
  • test/e2e-scenario/framework-tests/e2e-redaction-parity.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-schema.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts
  • test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
  • test/e2e-scenario/manifests/openclaw-nvidia-rebuild.yaml
  • test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
  • test/e2e-scenario/nemoclaw_scenarios/expected-states.yaml
  • test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
  • test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh
  • test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh
  • test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
  • test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/dispatch.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/gateway-absent.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/gateway-healthy.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
  • test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh
  • test/e2e-scenario/runtime/coverage-report.sh
  • test/e2e-scenario/runtime/lib/env.sh
  • test/e2e-scenario/runtime/resolver/coverage.ts
  • test/e2e-scenario/runtime/resolver/expected-failure.ts
  • test/e2e-scenario/runtime/resolver/index.ts
  • test/e2e-scenario/runtime/resolver/js-yaml.d.ts
  • test/e2e-scenario/runtime/resolver/load.ts
  • test/e2e-scenario/runtime/resolver/plan.ts
  • test/e2e-scenario/runtime/resolver/schema.ts
  • test/e2e-scenario/runtime/resolver/validator.ts
  • test/e2e-scenario/runtime/run-scenario.sh
  • test/e2e-scenario/runtime/run-suites.sh
  • test/e2e-scenario/scenarios/assertions/environment.ts
  • test/e2e-scenario/scenarios/assertions/onboarding.ts
  • test/e2e-scenario/scenarios/assertions/registry.ts
  • test/e2e-scenario/scenarios/assertions/runtime.ts
  • test/e2e-scenario/scenarios/builder.ts
  • test/e2e-scenario/scenarios/compiler.ts
  • test/e2e-scenario/scenarios/expected-states.ts
  • test/e2e-scenario/scenarios/matrix.ts
  • test/e2e-scenario/scenarios/orchestrators/context.ts
  • test/e2e-scenario/scenarios/orchestrators/lifecycle.ts
  • test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts
  • test/e2e-scenario/scenarios/orchestrators/phase.ts
  • test/e2e-scenario/scenarios/orchestrators/redaction.ts
  • test/e2e-scenario/scenarios/orchestrators/runner.ts
  • test/e2e-scenario/scenarios/orchestrators/state-validation.ts
  • test/e2e-scenario/scenarios/probes/builtin.ts
  • test/e2e-scenario/scenarios/probes/diagnostics.ts
  • test/e2e-scenario/scenarios/probes/docs-validation.ts
  • test/e2e-scenario/scenarios/probes/injection-blocked.ts
  • test/e2e-scenario/scenarios/probes/network-policy.ts
  • test/e2e-scenario/scenarios/probes/registry.ts
  • test/e2e-scenario/scenarios/probes/shields-config.ts
  • test/e2e-scenario/scenarios/probes/types.ts
  • test/e2e-scenario/scenarios/probes/util.ts
  • test/e2e-scenario/scenarios/run.ts
  • test/e2e-scenario/scenarios/scenarios/baseline.ts
  • test/e2e-scenario/scenarios/types.ts
  • test/e2e-scenario/validation_suites/assert/gateway-alive.sh
  • test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
  • test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
  • test/e2e-scenario/validation_suites/hermes/01-history-writable.sh
  • test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
  • test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
  • test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
  • test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
  • test/e2e-scenario/validation_suites/lib/inference_routing.sh
  • test/e2e-scenario/validation_suites/lib/messaging_providers.sh
  • test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
  • test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
  • test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
  • test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
  • test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
  • test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
  • test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
  • test/e2e-scenario/validation_suites/sandbox-exec.sh
  • test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
  • test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
  • tools/e2e-scenarios/workflow-boundary.mts

@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 2 needs attention, 10 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 9 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • Live fan-out includes scenarios without onboarding and secret plumbing (test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh:32): The all-scenarios workflow now live-runs every typed registry scenario, but several generated scenarios declare onboarding profiles and required secrets that the workflow, compiler secret allowlist, and shell dispatcher do not expose. Those jobs will either fail at onboarding dispatch or run without the credential boundary declared by the registry.
    • Recommendation: Before broad live fan-out, either implement dispatcher cases plus workflow_call/all-scenarios secret declarations and compiler secretEnv mappings for every generated profile, or filter the generated matrix to scenarios that are fully supported. Add a static contract test covering onboarding dispatch and requiredSecrets end to end.
    • Evidence: baseline.ts declares scenarios requiring OPENAI_COMPATIBLE_API_KEY, BRAVE_API_KEY, TELEGRAM_BOT_TOKEN, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, and SLACK_APP_TOKEN with onboarding ids such as openai-compatible-openclaw and cloud-nvidia-openclaw-telegram. e2e-scenarios.yaml declares only NVIDIA_API_KEY, e2e-scenarios-all.yaml passes only NVIDIA_API_KEY, compiler.ts ONBOARD_PROFILE_SECRET_ENV only covers NVIDIA/local profiles, and onboard/dispatch.sh only routes cloud-openclaw/cloud-hermes/local variants.
  • No-docker negative onboarding truncates the framework-seeded context (test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh:35): The TS runner seeds context.env before phase actions, and dispatch-action.sh explicitly documents that workers must not call e2e_context_init because it truncates the seeded file. The no-docker negative worker still calls e2e_context_init, which can wipe E2E_SCENARIO, E2E_SANDBOX_NAME, and E2E_GATEWAY_URL before state-validation probes and negative side-effect checks need them.
    • Recommendation: Remove e2e_context_init from the worker and rely on the framework-owned context initialization, using e2e_context_set only for additional keys. Add a test that pre-seeds context.env, runs the no-docker worker against a fake failing nemoclaw command, and verifies the seeded keys survive for gateway-absent and sandbox-absent probes.
    • Evidence: dispatch-action.sh says 'do NOT call e2e_context_init here' because it opens with ': > ctx'. cloud-openclaw-no-docker.sh calls e2e_context_init at the start of e2e_onboard_cloud_openclaw_no_docker.

🔎 Worth checking

  • Source-of-truth review needed: Generated all-scenarios workflow versus typed registry: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: e2e-scenarios-all.yaml says no workflow edits are required when adding baseline.ts scenarios, but also declares a ubuntu-rebuild-openclaw job.
  • Source-of-truth review needed: Workflow secret and onboarding execution contract: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: baseline.ts declares non-NVIDIA secrets and onboarding ids not present in e2e-scenarios.yaml secrets, e2e-scenarios-all.yaml secret passing, compiler ONBOARD_PROFILE_SECRET_ENV, or onboard/dispatch.sh.
  • Source-of-truth review needed: Probe subprocess utilities: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: probes/util.ts uses process.env in spawnBash/runSandboxCmd/runHostCmd and writeProbeEvidence serializes raw payloads while phase.ts uses buildChildEnv and pipeRedacted.
  • Source-of-truth review needed: Expected-state registry documentation after YAML resolver deletion: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: README.md lists runtime/resolver and coverage-report.sh; expected-states.ts and types.ts still say expected-states.yaml stays in place and is mirrored.
  • Probe subprocesses bypass the framework env and redaction boundary (test/e2e-scenario/scenarios/probes/util.ts:133): Phase actions and shell steps use buildChildEnv and pipeRedacted, but the new probe utilities spawn host and sandbox commands with process.env and write stdout/stderr tails directly into evidence JSON. In live secret-bearing runs this can pass workflow secrets to probe helper processes and persist secret-shaped output in artifacts or failure messages.
    • Recommendation: Route runSandboxCmd, runHostCmd, diagnosticsProbe, and docsValidationProbe through the same child-env allowlist and redactString handling as PhaseOrchestrator, or implement an equivalent probe-owned boundary. Add a leak test where a fake command prints NVIDIA_API_KEY and token-shaped output, asserting the child does not receive undeclared secrets and evidence/messages are redacted.
    • Evidence: runSandboxCmd calls spawnBash with env: { ...process.env, E2E_CONTEXT_DIR: ctx.contextDir }; runHostCmd defaults to env: opts.env ?? process.env; diagnostics.ts spawns nemoclaw with env: process.env; docs-validation.ts spawns with { ...process.env, CHECK_DOC_LINKS_REMOTE: '0' }; writeProbeEvidence serializes payloads directly.
  • ubuntu-rebuild-openclaw is hard-coded in addition to the generated matrix (.github/workflows/e2e-scenarios-all.yaml:80): The all-scenarios workflow states the typed registry is the fan-out source of truth and that adding a scenario requires no workflow edits, but it adds a separate reusable-workflow job for ubuntu-rebuild-openclaw. That scenario is already in the typed registry, so this duplicates a secret-bearing live run and reintroduces per-scenario workflow drift.
    • Recommendation: Remove the hard-coded job and rely on the generated matrix, or explicitly exclude the scenario from the matrix if it truly needs special handling. Add a workflow-boundary test that rejects per-scenario reusable-workflow jobs outside the generated matrix fan-out.
    • Evidence: e2e-scenarios-all.yaml generates matrix entries from run.ts --emit-matrix and then separately defines ubuntu-rebuild-openclaw. baseline.ts contains ubuntu-rebuild-openclaw and run.ts buildScenarioMatrix uses listScenarios().
  • Docs and source comments still advertise deleted dry-run and resolver paths (test/e2e-scenario/docs/README.md:127): The PR retires dry-run/validate-only and deletes the YAML resolver, but active documentation and comments still list run-scenario.sh --dry-run, --validate-only, run.ts --dry-run, runtime/coverage-report.sh, runtime/resolver, workflow dry-runs, and expected-states.yaml mirror/parity language. This contradicts the single live runner and typed expected-state source-of-truth claims.
    • Recommendation: Update README.md, scenarios/expected-states.ts, and scenarios/types.ts to describe the live TS runner and typed expected-state registry as the active source of truth. Add a static docs/source hygiene test that rejects active references to --dry-run, --validate-only, runtime/resolver, runtime/coverage-report.sh, and deleted expected-states.yaml outside explicitly historical prose.
    • Evidence: README.md 'How to run' still includes run-scenario.sh --dry-run, --validate-only, runtime/coverage-report.sh, runtime/resolver, and run.ts --dry-run. expected-states.ts says expected-states.yaml stays in place for the legacy bash resolver, and types.ts says expected-states.yaml stays alongside with a mirror contract, while those files were deleted.
  • Runner override accepts arbitrary runs-on labels (test/e2e-scenario/scenarios/runner-routing.ts:39): resolveRunnerForScenario treats any non-empty runnerRequirements entry beginning with runs-on: as authoritative runner metadata. The current reusable workflow still ignores matrix.runner for actual execution, but the PR exposes it as workflow-facing routing data; a later consumer could route secret-bearing jobs to unexpected self-hosted labels.
    • Recommendation: Validate runs-on overrides against a narrow allowlist of approved GitHub-hosted and NVIDIA self-hosted labels, or remove arbitrary overrides from workflow-facing matrix output until the trusted routing contract is finalized. Replace any custom-self-hosted acceptance test with a negative test for rejected labels.
    • Evidence: runner-routing.ts returns explicit.slice('runs-on:'.length).trim() after only checking non-empty. The linked feat(e2e): generate scenario fan-out matrix from typed registry #4359 description and tests document arbitrary runs-on:<label> override behavior.
  • Live WSL path installs Node through unpinned curl-to-bash (.github/workflows/e2e-scenarios.yaml:239): The workflow now executes live scenarios and later passes NVIDIA_API_KEY into the WSL run. Before that secret-bearing run, the WSL setup pipes the remote NodeSource installer directly into bash without a pinned digest or verified artifact.
    • Recommendation: Use a pinned or verified Node installation path for WSL, or constrain and document this installer trust boundary. Keep unrelated secrets out of WSL setup and add workflow validation covering the installer boundary.
    • Evidence: The WSL Node step runs curl -fsSL https://deb.nodesource.com/setup\_22.x | bash - followed by apt-get install -y nodejs; the WSL scenario step later passes NVIDIA_API_KEY.
  • Workflow contract tests cover routes but not secrets or onboarding dispatch (test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts:31): The workflow test now checks typed scenario IDs and runner labels against the reusable workflow route map, but it does not verify that every generated live scenario can be dispatched or receives its declared required secrets. That leaves the current live fan-out regressions undetected.
    • Recommendation: Extend static workflow contract coverage so every generated scenario has an onboarding action arg accepted by onboard/dispatch.sh or is explicitly filtered, and every requiredSecrets entry is declared in e2e-scenarios.yaml workflow_call secrets, passed by e2e-scenarios-all.yaml, and included in compiler action secretEnv where needed.
    • Evidence: e2e-scenarios-workflow.test.ts parses the ROUTES map and compares IDs/runner labels. No inspected test checks workflow_call secret declarations, all-scenarios secret passing, compiler ONBOARD_PROFILE_SECRET_ENV, or onboard/dispatch.sh case labels.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Every generated matrix scenario's onboarding profile is accepted by nemoclaw_scenarios/onboard/dispatch.sh or the scenario is explicitly filtered from all-scenarios.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
  • **Runtime validation** — Every ScenarioDefinition.requiredSecrets value is declared in .github/workflows/e2e-scenarios.yaml, passed by e2e-scenarios-all.yaml, and allowlisted in compiler secretEnv for the phase action that consumes it.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
  • **Runtime validation** — cloud-openclaw-no-docker preserves preseeded context.env keys such as E2E_SCENARIO, E2E_SANDBOX_NAME, and E2E_GATEWAY_URL while capturing a fake preflight failure.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
  • **Runtime validation** — A probe subprocess fake command that prints NVIDIA_API_KEY and token-shaped output does not receive undeclared secrets and stores only redacted stdout/stderr/message/evidence.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
  • **Runtime validation** — Workflow boundary rejects per-scenario reusable-workflow jobs outside the generated matrix fan-out.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
  • **Workflow contract tests cover routes but not secrets or onboarding dispatch** — Extend static workflow contract coverage so every generated scenario has an onboarding action arg accepted by onboard/dispatch.sh or is explicitly filtered, and every requiredSecrets entry is declared in e2e-scenarios.yaml workflow_call secrets, passed by e2e-scenarios-all.yaml, and included in compiler action secretEnv where needed.
  • **Acceptance clause:** Removing `--dry-run` and plan-mode execution from the E2E runner: they let a coding agent ship work behind false-green test runs. The TS runner now has **one execution mode: live.** No flag, env var, helper, or branch in any production path bypasses real assertion execution. — add test evidence or identify existing coverage. The workflows now invoke run.ts without --dry-run and workflow-boundary rejects dry-run/validate-only/plan-only in CI. However README.md still documents active dry-run/validate-only paths, and run.ts still supports --plan-only for local debug.
  • **Acceptance clause:** Deleted `nemoclaw_scenarios/expected-states.yaml` and the entire `runtime/resolver/` tree (`load.ts`, `plan.ts`, `schema.ts`, `coverage.ts`, `index.ts`, `validator.ts`, `expected-failure.ts`, `js-yaml.d.ts`) plus `runtime/coverage-report.sh`. — add test evidence or identify existing coverage. The files are deleted in the diff, but README.md, scenarios/expected-states.ts, and scenarios/types.ts still refer to expected-states.yaml, runtime/resolver, and coverage-report.sh as active or transitional surfaces.
Since last review details

Current findings:

  • Source-of-truth review needed: Generated all-scenarios workflow versus typed registry: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: e2e-scenarios-all.yaml says no workflow edits are required when adding baseline.ts scenarios, but also declares a ubuntu-rebuild-openclaw job.
  • Source-of-truth review needed: Workflow secret and onboarding execution contract: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: baseline.ts declares non-NVIDIA secrets and onboarding ids not present in e2e-scenarios.yaml secrets, e2e-scenarios-all.yaml secret passing, compiler ONBOARD_PROFILE_SECRET_ENV, or onboard/dispatch.sh.
  • Source-of-truth review needed: Probe subprocess utilities: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: probes/util.ts uses process.env in spawnBash/runSandboxCmd/runHostCmd and writeProbeEvidence serializes raw payloads while phase.ts uses buildChildEnv and pipeRedacted.
  • Source-of-truth review needed: Expected-state registry documentation after YAML resolver deletion: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: README.md lists runtime/resolver and coverage-report.sh; expected-states.ts and types.ts still say expected-states.yaml stays in place and is mirrored.
  • Live fan-out includes scenarios without onboarding and secret plumbing (test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh:32): The all-scenarios workflow now live-runs every typed registry scenario, but several generated scenarios declare onboarding profiles and required secrets that the workflow, compiler secret allowlist, and shell dispatcher do not expose. Those jobs will either fail at onboarding dispatch or run without the credential boundary declared by the registry.
    • Recommendation: Before broad live fan-out, either implement dispatcher cases plus workflow_call/all-scenarios secret declarations and compiler secretEnv mappings for every generated profile, or filter the generated matrix to scenarios that are fully supported. Add a static contract test covering onboarding dispatch and requiredSecrets end to end.
    • Evidence: baseline.ts declares scenarios requiring OPENAI_COMPATIBLE_API_KEY, BRAVE_API_KEY, TELEGRAM_BOT_TOKEN, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, and SLACK_APP_TOKEN with onboarding ids such as openai-compatible-openclaw and cloud-nvidia-openclaw-telegram. e2e-scenarios.yaml declares only NVIDIA_API_KEY, e2e-scenarios-all.yaml passes only NVIDIA_API_KEY, compiler.ts ONBOARD_PROFILE_SECRET_ENV only covers NVIDIA/local profiles, and onboard/dispatch.sh only routes cloud-openclaw/cloud-hermes/local variants.
  • No-docker negative onboarding truncates the framework-seeded context (test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh:35): The TS runner seeds context.env before phase actions, and dispatch-action.sh explicitly documents that workers must not call e2e_context_init because it truncates the seeded file. The no-docker negative worker still calls e2e_context_init, which can wipe E2E_SCENARIO, E2E_SANDBOX_NAME, and E2E_GATEWAY_URL before state-validation probes and negative side-effect checks need them.
    • Recommendation: Remove e2e_context_init from the worker and rely on the framework-owned context initialization, using e2e_context_set only for additional keys. Add a test that pre-seeds context.env, runs the no-docker worker against a fake failing nemoclaw command, and verifies the seeded keys survive for gateway-absent and sandbox-absent probes.
    • Evidence: dispatch-action.sh says 'do NOT call e2e_context_init here' because it opens with ': > ctx'. cloud-openclaw-no-docker.sh calls e2e_context_init at the start of e2e_onboard_cloud_openclaw_no_docker.
  • Probe subprocesses bypass the framework env and redaction boundary (test/e2e-scenario/scenarios/probes/util.ts:133): Phase actions and shell steps use buildChildEnv and pipeRedacted, but the new probe utilities spawn host and sandbox commands with process.env and write stdout/stderr tails directly into evidence JSON. In live secret-bearing runs this can pass workflow secrets to probe helper processes and persist secret-shaped output in artifacts or failure messages.
    • Recommendation: Route runSandboxCmd, runHostCmd, diagnosticsProbe, and docsValidationProbe through the same child-env allowlist and redactString handling as PhaseOrchestrator, or implement an equivalent probe-owned boundary. Add a leak test where a fake command prints NVIDIA_API_KEY and token-shaped output, asserting the child does not receive undeclared secrets and evidence/messages are redacted.
    • Evidence: runSandboxCmd calls spawnBash with env: { ...process.env, E2E_CONTEXT_DIR: ctx.contextDir }; runHostCmd defaults to env: opts.env ?? process.env; diagnostics.ts spawns nemoclaw with env: process.env; docs-validation.ts spawns with { ...process.env, CHECK_DOC_LINKS_REMOTE: '0' }; writeProbeEvidence serializes payloads directly.
  • ubuntu-rebuild-openclaw is hard-coded in addition to the generated matrix (.github/workflows/e2e-scenarios-all.yaml:80): The all-scenarios workflow states the typed registry is the fan-out source of truth and that adding a scenario requires no workflow edits, but it adds a separate reusable-workflow job for ubuntu-rebuild-openclaw. That scenario is already in the typed registry, so this duplicates a secret-bearing live run and reintroduces per-scenario workflow drift.
    • Recommendation: Remove the hard-coded job and rely on the generated matrix, or explicitly exclude the scenario from the matrix if it truly needs special handling. Add a workflow-boundary test that rejects per-scenario reusable-workflow jobs outside the generated matrix fan-out.
    • Evidence: e2e-scenarios-all.yaml generates matrix entries from run.ts --emit-matrix and then separately defines ubuntu-rebuild-openclaw. baseline.ts contains ubuntu-rebuild-openclaw and run.ts buildScenarioMatrix uses listScenarios().
  • Docs and source comments still advertise deleted dry-run and resolver paths (test/e2e-scenario/docs/README.md:127): The PR retires dry-run/validate-only and deletes the YAML resolver, but active documentation and comments still list run-scenario.sh --dry-run, --validate-only, run.ts --dry-run, runtime/coverage-report.sh, runtime/resolver, workflow dry-runs, and expected-states.yaml mirror/parity language. This contradicts the single live runner and typed expected-state source-of-truth claims.
    • Recommendation: Update README.md, scenarios/expected-states.ts, and scenarios/types.ts to describe the live TS runner and typed expected-state registry as the active source of truth. Add a static docs/source hygiene test that rejects active references to --dry-run, --validate-only, runtime/resolver, runtime/coverage-report.sh, and deleted expected-states.yaml outside explicitly historical prose.
    • Evidence: README.md 'How to run' still includes run-scenario.sh --dry-run, --validate-only, runtime/coverage-report.sh, runtime/resolver, and run.ts --dry-run. expected-states.ts says expected-states.yaml stays in place for the legacy bash resolver, and types.ts says expected-states.yaml stays alongside with a mirror contract, while those files were deleted.
  • Runner override accepts arbitrary runs-on labels (test/e2e-scenario/scenarios/runner-routing.ts:39): resolveRunnerForScenario treats any non-empty runnerRequirements entry beginning with runs-on: as authoritative runner metadata. The current reusable workflow still ignores matrix.runner for actual execution, but the PR exposes it as workflow-facing routing data; a later consumer could route secret-bearing jobs to unexpected self-hosted labels.
    • Recommendation: Validate runs-on overrides against a narrow allowlist of approved GitHub-hosted and NVIDIA self-hosted labels, or remove arbitrary overrides from workflow-facing matrix output until the trusted routing contract is finalized. Replace any custom-self-hosted acceptance test with a negative test for rejected labels.
    • Evidence: runner-routing.ts returns explicit.slice('runs-on:'.length).trim() after only checking non-empty. The linked feat(e2e): generate scenario fan-out matrix from typed registry #4359 description and tests document arbitrary runs-on:<label> override behavior.
  • Live WSL path installs Node through unpinned curl-to-bash (.github/workflows/e2e-scenarios.yaml:239): The workflow now executes live scenarios and later passes NVIDIA_API_KEY into the WSL run. Before that secret-bearing run, the WSL setup pipes the remote NodeSource installer directly into bash without a pinned digest or verified artifact.
    • Recommendation: Use a pinned or verified Node installation path for WSL, or constrain and document this installer trust boundary. Keep unrelated secrets out of WSL setup and add workflow validation covering the installer boundary.
    • Evidence: The WSL Node step runs curl -fsSL https://deb.nodesource.com/setup\_22.x | bash - followed by apt-get install -y nodejs; the WSL scenario step later passes NVIDIA_API_KEY.
  • Workflow contract tests cover routes but not secrets or onboarding dispatch (test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts:31): The workflow test now checks typed scenario IDs and runner labels against the reusable workflow route map, but it does not verify that every generated live scenario can be dispatched or receives its declared required secrets. That leaves the current live fan-out regressions undetected.
    • Recommendation: Extend static workflow contract coverage so every generated scenario has an onboarding action arg accepted by onboard/dispatch.sh or is explicitly filtered, and every requiredSecrets entry is declared in e2e-scenarios.yaml workflow_call secrets, passed by e2e-scenarios-all.yaml, and included in compiler action secretEnv where needed.
    • Evidence: e2e-scenarios-workflow.test.ts parses the ROUTES map and compares IDs/runner labels. No inspected test checks workflow_call secret declarations, all-scenarios secret passing, compiler ONBOARD_PROFILE_SECRET_ENV, or onboard/dispatch.sh case labels.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

jyaunches added 3 commits May 27, 2026 21:21
…shape client check

- PhaseOrchestrator.runShellStep: wait for the log WriteStream to finish
  before resolving so callers (and tests) reading evidence synchronously
  see the actual stdout/stderr instead of an empty file. Race exposed by
  e2e-phase-orchestrators 'shell_step_passes_when_script_exits_zero'.
- e2e-phase-orchestrators: replace client-source toMatch regex (1
  source-shape test, budget=0) with a runtime-shape behavior assertion
  on the HostCliClient observation. Still enforces 'clients do not
  encode pass/fail or retry/timeout semantics' per hybrid-scenario E2E
  architecture spec, without violating source-shape budget.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Closes the spec's reopened Phase 6 gap. The new typed runner now
executes the install and onboarding work that the deleted bash runner
used to perform, but inside EnvironmentOrchestrator and
OnboardingOrchestrator instead of in workflow YAML or a resurrected
bash runner. All canonical scenarios now reach a real, live SUT before
their assertions run.

Architecture (per hybrid-scenario-e2e-architecture spec):

* types.ts: introduce typed PhaseAction (kind=shell-fn|shell, scriptRef,
  fn, arg, timeoutSeconds, evidencePath) and PhaseActionResult. Replace
  the prior actions: string[] free-form labels with PhaseAction[]. Add
  actions[] to PhaseResult so failure-layer attribution stays clear:
  setup failure is recorded distinctly from assertion failure.

* compiler.ts: phaseActions() now emits typed actions for environment
  (context.emit + install.<id>) and onboarding (profile.<id>). Stable
  action ids: environment.context.emit,
  environment.install.<install-id>, onboarding.profile.<profile-id>.
  All install/onboard actions point at the existing dispatcher scripts
  (install/dispatch.sh, onboard/dispatch.sh) - shell remains the
  implementation per spec, invocation is centralized.

* orchestrators/phase.ts: PhaseOrchestrator.run() executes actions
  before assertions. Action failure short-circuits the phase so
  assertions never run against an environment that was never set up.
  Action runner reuses the same spawn/timeout/process-group/log-flush
  machinery as runShellStep. Per-action timeout, no retry (install and
  onboarding must fail loudly).

* nemoclaw_scenarios/dispatch-action.sh: new bash launcher (the only
  new shell file). The install/onboard dispatchers are intentionally
  library-style (function definitions only); this launcher gives them
  a deterministic executable entrypoint that sources runtime/lib/env.sh
  + runtime/lib/context.sh, applies non-interactive env, sources the
  requested dispatcher, and invokes the named function with one arg.
  Replaces the orchestration that the deleted run-scenario.sh used to
  do, but called from the typed orchestrator instead.

* plan-only output: now shows 'Action: <id> (timeout=...) -> <fn> <arg>'
  per phase, before assertion groups. Maintainers can preview the full
  setup+onboard+assert sequence before dispatch.

* framework-tests/e2e-phase-orchestrators.test.ts: add five behavior
  tests covering action-runs-before-assertions, action-failure short-
  circuits-assertions, action timeout via orchestrator policy,
  evidence-log flushed-before-resolve, and compiler emits typed
  install/onboard actions for all 7 canonical scenarios.

What stays out:

* No workflow YAML edits. .github/workflows/e2e-scenarios.yaml still
  invokes only 'npx tsx test/e2e-scenario/scenarios/run.ts --scenarios
  ...'. Workflow YAML stays innocent of install/onboard plumbing.
* No client edits. HostCliClient et al. remain pass/fail/policy free.
* No resolver/YAML-first revival. setup_scenarios/test_plans/suite_filter
  remain unsupported.

Validation gate (Phase 6 reopen note) is the next step: after this
push goes green on PR CI, dispatch e2e-scenarios-all.yaml against
feat/e2e-real-execution and confirm canonical scenarios produce real
phase results with action evidence under .e2e/actions/<id>.log,
instead of <1s 'failed=34 skipped=5'.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/e2e-scenario/scenarios/types.ts (1)

110-131: ⚡ Quick win

Strengthen PhaseAction typing (discriminated union) to enforce per-kind required fields.

Current runtime already fails loudly if a "shell-fn" action is missing fn (the runner passes action.fn ?? "", and dispatch-action.sh errors when the function name isn’t found). Also, current PhaseAction objects are produced in test/e2e-scenario/scenarios/compiler.ts with fn populated for "shell-fn". Still, the type allows invalid combinations to compile, so a discriminated union would prevent that drift for any future producers.

Suggested refactor
-export interface PhaseAction {
-  id: string;
-  phase: PhaseName;
-  description?: string;
-  // "shell-fn" sources the bash dispatcher and invokes the named function.
-  // "shell"    runs an executable script (used for context-emit helper).
-  kind: "shell-fn" | "shell";
-  // Repo-relative path to the script.
-  scriptRef: string;
-  // For "shell-fn": the bash function to invoke after sourcing scriptRef.
-  fn?: string;
-  // Single positional arg passed to the function/script (install method or
-  // onboarding profile id today). Kept as a single string to keep stable
-  // ids predictable; multi-arg variants can extend this later.
-  arg?: string;
-  // Per-action timeout. No retry by default - install/onboard must fail
-  // loudly so the regression is visible. Retry stays a property of
-  // assertion steps, not actions.
-  timeoutSeconds?: number;
-  // Repo-relative evidence log path.
-  evidencePath?: string;
-}
+interface PhaseActionBase {
+  id: string;
+  phase: PhaseName;
+  description?: string;
+  scriptRef: string;
+  timeoutSeconds?: number;
+  evidencePath?: string;
+}
+
+export type PhaseAction =
+  | (PhaseActionBase & {
+      kind: "shell-fn";
+      fn: string;
+      arg?: string;
+    })
+  | (PhaseActionBase & {
+      kind: "shell";
+      arg?: string;
+    });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/scenarios/types.ts` around lines 110 - 131, Change
PhaseAction from a single broad interface into a discriminated union so
TypeScript enforces per-kind fields: define one variant for kind: "shell-fn"
that requires fn: string (plus shared fields like id, phase, scriptRef, arg?,
timeoutSeconds?, evidencePath?, description?) and another variant for kind:
"shell" that omits/marks fn as disallowed/undefined; update any usages (e.g.,
the action objects created in test/e2e-scenario/scenarios/compiler.ts and the
runner that reads action.fn) to satisfy the new types so compilation ensures
"shell-fn" always has fn and "shell" never does.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh`:
- Line 60: Remove the "|| true" suppression so failures in e2e_context_init
surface during test runs: in dispatch-action.sh replace the line that calls the
initializer (the call to e2e_context_init) by invoking e2e_context_init without
"|| true" so that any mkdir or file-write errors creating
${E2E_CONTEXT_DIR}/context.env propagate (this ensures e2e_context_require will
fail immediately instead of masking the error and misattributing missing keys).

In `@test/e2e-scenario/scenarios/compiler.ts`:
- Around line 94-99: The code currently silently returns [] when required phase
action dimensions like installId (from scenario.environment?.install) are
missing; instead throw a hard error to fail-fast: replace the early returns that
yield [] with throwing a descriptive Error (include context such as scenario.id
or scenario.name and which dimension is missing) in the function that generates
phase actions (the branch checking installId / scenario.environment?.install),
and make the identical change in the other similar branch (the second check at
the same pattern) so malformed scenarios surface as hard failures rather than
emitting empty action lists.

---

Nitpick comments:
In `@test/e2e-scenario/scenarios/types.ts`:
- Around line 110-131: Change PhaseAction from a single broad interface into a
discriminated union so TypeScript enforces per-kind fields: define one variant
for kind: "shell-fn" that requires fn: string (plus shared fields like id,
phase, scriptRef, arg?, timeoutSeconds?, evidencePath?, description?) and
another variant for kind: "shell" that omits/marks fn as disallowed/undefined;
update any usages (e.g., the action objects created in
test/e2e-scenario/scenarios/compiler.ts and the runner that reads action.fn) to
satisfy the new types so compilation ensures "shell-fn" always has fn and
"shell" never does.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a237354-36d1-4e41-8990-3b50d8f973f0

📥 Commits

Reviewing files that changed from the base of the PR and between 903f90b and 628870c.

📒 Files selected for processing (5)
  • test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
  • test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
  • test/e2e-scenario/scenarios/compiler.ts
  • test/e2e-scenario/scenarios/orchestrators/phase.ts
  • test/e2e-scenario/scenarios/types.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/e2e-scenario/scenarios/orchestrators/phase.ts
  • test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts

Comment thread test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh Outdated
Comment thread test/e2e-scenario/scenarios/compiler.ts Outdated
jyaunches added 8 commits May 27, 2026 22:19
The first live dispatch of the Phase 6 wiring (run 26550310438) gave us
real action evidence and surfaced three real bugs. All three are fixed
inside the spec's prescribed layers - no workflow YAML, no client, no
old-resolver path.

1. environment.context.emit was a shell action that called the legacy
   emit-context-from-plan.sh helper. That helper expects the OLD
   YAML-resolver plan.json shape (dimensions.platform.profile.os...),
   which the typed compiler does not produce. Drop the shell action;
   add scenarios/orchestrators/context.ts that derives a normalized
   context.env directly from the typed RunPlan and writes it from
   ScenarioRunner.run() before any phase. Spec: context emission is
   framework infrastructure, not a phase action.

2. PhaseOrchestrator.runShellStep was reading context.env from
   ${ctx.contextDir}/.e2e/context.env, but the shell helper writes
   to ${E2E_CONTEXT_DIR}/context.env (top-level). Fix the path so
   shell assertions see seeded keys.

3. ScenarioRunner did not short-circuit across phase boundaries: a
   failed environment ACTION (real setup work) still let onboarding
   and runtime run, producing a misleading 34-failure cascade.
   Runner now consults prior phase results: if any prior action
   failed, downstream phases are synthesized as skipped with a
   message naming the blocking phase+action+message. Assertion-only
   failures still propagate as failures.

Tests added (8 new, 292/292 scenario framework tests green).

Validation gate next: dispatch e2e-scenarios-all.yaml again.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Run 26550936453 surfaced two more real bugs after Phase 6 wiring went
live. Both fixed inside the spec's prescribed layers; nothing leaks
into workflow YAML or clients.

1. dispatch-action.sh called e2e_context_init unconditionally before
   sourcing the install/onboard dispatcher. e2e_context_init opens
   context.env with `: > ctx`, which truncated the file the
   ScenarioRunner had just seeded. All runtime assertions then failed
   with 'e2e context: missing required key(s): E2E_SCENARIO ...'.
   Fix: dispatch-action.sh no longer calls e2e_context_init. The TS
   framework owns context.env initialization; workers may still extend
   it via e2e_context_set.

2. The legacy onboarding.preflight.passed assertion expects an
   onboard.log file at ${E2E_CONTEXT_DIR}/onboard.log. The old bash
   runner used to redirect onboarding output there; the typed
   orchestrator captured it under .e2e/actions/<action-id>.log. Fix:
   add optional aliasPath to PhaseAction; compiler sets aliasPath to
   'onboard.log' for the onboarding profile action; orchestrator
   copies the action evidence log to the alias on success. Best-
   effort - alias copy failures do not fail the action.

Live evidence from run 26550936453 (canonical ubuntu-repo-cloud-openclaw):
- environment.install.repo-current: passed in 14.2s
- onboarding.profile.cloud-openclaw: passed in 302s (real onboarding!)
- onboarding.base.cli-installed: passed
- onboarding.preflight.passed: failed (onboard.log not found) <- fixed
- runtime.* (10 steps): all 'missing key(s)' <- fixed by #1

Tests: 38/38 phase-orchestrator (was 36; +2 alias tests), 294/294
scenario framework. shellcheck clean.

Validation gate next: redispatch e2e-scenarios-all and confirm
runtime steps actually exercise the SUT (real pass/fail, not key
errors).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
The first canonical-scenario run that reached the onboarding-assertion
phase (run 26552550140 / ubuntu-repo-cloud-openclaw) showed that the
legacy onboarding.preflight.passed assertion fails on every successful
run because its regex matches any mention of 'docker' / 'container' /
'daemon' / 'socket' in onboard.log - and a normal nemoclaw onboarding
mentions all of those many times.

The action itself succeeded (exit 0, 263s of real onboarding work);
the assertion is meant to confirm onboard.log does not contain
explicit preflight FAILURE markers. Tighten the regex accordingly:
match phrases like 'preflight failed/error', 'cannot connect to the
docker daemon', 'onboarding aborted', 'FATAL: docker', 'ERROR: docker
daemon' - not bare topic words.

Verified: shellcheck passes; bash -n passes.

Why we stop here on this PR:

This commit lands the last small framework-level fix produced by live
action evidence. The Phase 6 wiring is now fully validated end-to-end:

  Install:         passed (~12s)
  Onboarding:      action passed (~263s real onboarding)
                   base.cli-installed passed
                   preflight.passed will now pass
  Runtime:         9 passed / 25 failed / 5 skipped against live SUT

The remaining 25 runtime failures are real product/test bugs surfaced
by finally executing the suite against a live SUT (sandbox-shell
timeouts, inference 30-60s timeouts, lifecycle.sandbox_operations
exit-1 mismatches, lifecycle.rebuild/upgrade 120s timeouts even after
retries). They are pre-existing and out of scope for 'execute real
shell assertions; delete dry-run, --validate-only, and the bash
runner'. They become productive follow-up issues.

The 5 skipped runtime steps are 'probe not registered' - known per
spec; probe registry lands in a follow-up.

Negative scenarios (ubuntu-no-docker-preflight-negative,
invalid-key-negative, gateway-port-conflict-negative) need expected-
failure semantics and a way to actually simulate docker-missing on
the runner. Out of scope here; tracked as follow-up.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
All three findings are valid mechanical bugs introduced/touched by this
PR. Batched per the safe-batch policy: same-risk, independently
obvious, testable together.

1. orchestrators/phase.ts classifierForRef:
   Outer guard is /i (case-insensitive), but the inner branch used
   case-sensitive ref.includes("tunnel") / ref.includes("cloudflared")
   - mixed-case refs would fall through and misclassify as
   provider-transient. Replace with /tunnel|cloudflared/i.test(ref).

2. scenarios/compiler.ts phaseActions:
   Inline comment said "the scenario is malformed; surface it as a
   hard error" but the code returned []. Hard-fail instead, with a
   message that names the missing dimension. Empty environment is
   still tolerated (skeleton scenarios can carry no setup yet).

3. validation_suites/lib/sandbox_lifecycle.sh:
   Substring match `*${E2E_SANDBOX_NAME}*` would let sb1 falsely
   match sb10. Use awk with a whole-token equality check on column
   one of `nemoclaw list` output.

Tests: 294/294 scenario framework still green. shellcheck + shfmt
clean. No behavior change for canonical scenarios; affected paths
were either dormant (case-mixed classifier) or returning a slightly
stricter outcome (compiler hard-fail, sandbox exact match).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Resolves conflicts in:
- .github/workflows/e2e-scenarios.yaml (WSL run step)
- tools/e2e-scenarios/workflow-boundary.mts (boundary contract test)

main brought PR #4346 ("fix(windows): enable Docker WSL integration
and avoid Ubuntu first-run races"): the WSL run step is now a
PowerShell wrapper that writes a bash script to RUNNER_TEMP, maps it
into WSL via wslpath, and invokes `wsl -d $WSL_DISTRO -- env ... bash
-l $wslTmp`. This handles Docker WSL integration plus a first-run
race on freshly installed distros.

This PR (#4380) removes --dry-run, --plan-only, and --validate-only
from CI.

Resolution:

* e2e-scenarios.yaml: take main's PowerShell wrapper structure
  verbatim, but drop `--dry-run` from the inner `npx tsx ... run.ts`
  invocation. The typed runner is now the single execution path.

* workflow-boundary.mts: assert BOTH the new constraints. The WSL
  step must NOT contain --dry-run / --plan-only / --validate-only
  (this PR's invariant) AND must contain $env:WSL_WORKDIR /
  WriteAllText / `bash -l $wslTmp` (main's robustness invariant).

294/294 scenario framework tests pass. shellcheck / shfmt /
end-of-file / trailing-whitespace clean on touched files.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
The PR body audits two invariants as absent from production paths:
  - E2E_DRY_RUN / e2e_env_is_dry_run
  - phase-1-skeleton
The PR Review Advisor caught counterexamples to both. Fix them now.

E2E_DRY_RUN: validation_suites/messaging/slack/00-slack-provider-state.sh
still branched on ${E2E_DRY_RUN:-} and emitted dry-run pass markers
without reading the OpenClaw config surface or querying runtime
discovery. Removed; the assertion now always reads real config and
queries OpenClaw runtime state. One execution mode.

phase-1-skeleton: scenarios/assertions/{environment,onboarding,runtime}.ts
each defined a 'kind: pending, ref: phase-1-skeleton' step that the
orchestrator silently skips. Of the three:
  - environmentBaseline() was in the registry and every scenario plan;
    environment work is performed by typed PhaseAction entries
    (context.emit + install.<id>) from compiler.phaseActions(), so the
    assertion group is redundant. Removed from registry and
    assertionGroupsForScenario(); deleted the file.
  - onboardingBaseline() and runtimeSmokeSkeleton() were never
    imported. Deleted as dead code.

Verification: rg of E2E_DRY_RUN, e2e_env_is_dry_run, phase-1-skeleton
across scenarios, runtime, validation_suites, onboarding_assertions,
nemoclaw_scenarios, and the scenario workflows returns nothing outside
framework-tests. 294/294 e2e-scenario framework tests pass. bash -n
clean on the touched shell script.

Addresses PR Review Advisor findings:
  - 'Residual E2E_DRY_RUN branch contradicts the one-live-mode contract'
  - 'Skeleton pending refs remain in production scenario plans'

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
…stic

The PR #4380 scenario run produced a SIGTERM cascade across runtime
suite steps (sandbox-shell, inference.*, kimi.*, lifecycle.rebuild.*,
lifecycle.upgrade.*) because the canonical sandbox-exec wrapper had no
inner timeout. When openshell ssh-into-sandbox wedged, the orchestrator
killed the process group at the step cap (30/45/60/120s) and the script
log captured only the header line — no diagnostic, no classifier.

Mirror the legacy test/e2e/ defense-in-depth pattern (outer process
timeout + inner curl --max-time) inside the new framework's canonical
wrapper:

- sandbox-exec.sh now applies a per-call timeout via timeout/gtimeout
  with --kill-after=5s. Default 25s sits below the common 30s step cap;
  callers override via E2E_SANDBOX_EXEC_TIMEOUT_SECONDS for steps with
  larger budgets (chat-completion 50s, sandbox-local 35s, slack 50s,
  gateway-alive fallback 15s). On timeout (124/137) the wrapper emits
  a single-line classified diagnostic so phase.ts captures observable
  evidence instead of a SIGTERM black hole.

- Migrate 8 BARE leaf scripts that called openshell sandbox exec
  directly to route through the wrapper:
    smoke/03-sandbox-shell.sh
    assert/gateway-alive.sh (ollama fallback path)
    inference/cloud/00-models-health.sh
    inference/cloud/01-chat-completion.sh
    inference/cloud/02-inference-local-from-sandbox.sh
    inference/ollama-auth-proxy/00-proxy-reachable.sh
    messaging/slack/00-slack-provider-state.sh

  kimi-compatibility steps already route through inference_routing.sh
  which uses the wrapper, so they inherit the fix transparently.
  rebuild_upgrade.sh and security_policy_credentials.sh continue to use
  their own ad-hoc timeouts; unifying those onto the wrapper is a
  follow-up.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
jyaunches added 4 commits May 28, 2026 07:47
shellcheck flags 'E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=N \\\n cmd' as
SC2034 (apparently unused) because static analysis can't follow that the
wrapper reads it via environment. The pattern is correct (prefix-env
assignment exports the variable to the immediately-following command),
but the warning blocks the 'checks' job.

Add a localized 'shellcheck disable=SC2034' directly above each affected
prefix-env line in the two cloud-inference assertion scripts that
override the default wrapper timeout. No behavior change.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Addresses PR Review Advisor finding #1: 'Secret-bearing child process
output is persisted and uploaded without redaction'.

Designed in the spirit of the test/e2e-scenario architecture: secret
hygiene is FRAMEWORK INFRASTRUCTURE, not a per-script / per-action /
per-workflow concern. Same one-mode discipline that motivates the rest
of this PR (no flag, no env var, no helper bypasses redaction).

New module: test/e2e-scenario/scenarios/orchestrators/redaction.ts
  Sits next to context.ts, owns:
    - redactString(text):       canonical token-shape redaction
    - buildChildEnv(base, opts): minimal allowlisted env + declared
                                 secretEnv passthrough only
    - pipeRedacted(src, log):   single I/O wrapper used by both
                                 runAction and runShellStep
  Pattern set is a framework-local mirror of
  src/lib/security/secret-patterns.ts. The framework deliberately does
  NOT import from src/lib/security/ to keep the framework-vs-product
  boundary clean and avoid cross-tsconfig ESM issues. A new parity
  test (e2e-redaction-parity.test.ts) asserts the two pattern sets
  stay in lockstep so adding a token shape in one place keeps both
  layers honest.

Typed contract extension: PhaseAction.secretEnv?: readonly string[]
  AssertionStep.secretEnv?: readonly string[]
  Compiler decides which secrets cross the framework boundary, the
  same way aliasPath / scriptRef / timeoutSeconds are declared
  metadata. buildChildEnv enforces a 'secret-key shape' contract on
  every entry (must end with API_KEY/TOKEN/SECRET/PASSWORD/CREDENTIAL/
  PASSPHRASE/PRIVATE_KEY/ACCESS_KEY) so the secretEnv channel cannot
  silently allowlist non-secret variables.

compiler.ts: declares onboard secretEnv per profile
  cloud-* profiles -> ['NVIDIA_API_KEY']
  local-ollama-openclaw -> []
  Install actions declare nothing (the installer itself does not
  authenticate to the cloud).

phase.ts: two call sites updated, identical pattern in both
  runAction:    buildChildEnv(...) + pipeRedacted(stdout, log) +
                pipeRedacted(stderr, log, tail<-redactedChunk)
  runShellStep: same shape
  Spawn-error 'message' fields are also wrapped with redactString as
  defense-in-depth.

Workflow upload tightening (.github/workflows/e2e-scenarios.yaml):
  - include-hidden-files: false
    Hidden dotfiles under the workspace can carry raw secrets
    (notably .e2e/context.env, written by e2e_context_set without
    redaction). Diagnostic dumps of context use e2e_context_dump
    which redacts on emit.
  - path: explicit subpath list (.e2e/actions/, .e2e/logs/, result
    JSONs, plan/run-plan, onboard.log) instead of blanket .e2e/.
  workflow-boundary.mts contract test updated to assert the new
  invariant; e2e-scenarios-workflow.test.ts expectations updated.

Bash side: unchanged. test/e2e-scenario/runtime/lib/context.sh::
e2e_context_dump already redacts via _e2e_context_is_sensitive_key.
This module covers the TS-spawned-child path that lacked coverage.

Tests (e2e-phase-orchestrators.test.ts, +5):
  - test_should_not_persist_secret_shaped_child_output_into_evidence
    Real shell child writes nvapi-, ghp_, sk-, xoxb-, Bearer ...
    Asserts evidence log, phase result.json, and result.message
    contain none of the literal tokens but do contain <REDACTED>.
  - test_should_drop_non_allowlisted_parent_env_unless_declared_in_secretEnv
    Sets a sentinel parent-env var, runs child without declaring it,
    asserts child cannot see it. PATH and E2E_PHASE still arrive.
  - test_should_pass_declared_secretEnv_through_to_child
    Declares the var in step.secretEnv, asserts child sees it.
  - test_should_reject_non_secret_shaped_keys_in_secretEnv_at_runtime
    buildChildEnv throws on FOO_VAR; the secret-key-shape contract
    keeps the allowlist boundary honest.
  - test_should_declare_NVIDIA_API_KEY_only_for_cloud_onboarding_actions
    Compiler-level contract: cloud onboard declares NVIDIA_API_KEY,
    local-ollama declares nothing.

Tests (e2e-redaction-parity.test.ts, new file, +2):
  - TOKEN_PREFIX_PATTERNS framework copy matches product source
  - CONTEXT_PATTERNS framework copy matches product source

308/308 e2e-scenario framework tests pass.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
…re pending steps

Addresses PR Review Advisor finding #2: 'Required security probes and
expected-failure checks still skip without failing live runs'.

Same architectural failure mode as findings #5 (E2E_DRY_RUN) and #6
(phase-1-skeleton): the typed framework declared a contract, but a
code path silently produced non-failing skips that contradicted this
PR's 'one mode, no fake green' invariant.

Three suites (security-shields, security-policy, security-injection)
were wired as kind: 'probe' steps; the orchestrator marked them
status: 'skipped' because the probe registry hasn't landed yet. The
expected-failure side-effect validator (runtime.expected-failure.no-
side-effects) was wired as kind: 'pending'. run.ts only sets
process.exitCode = 1 on status === 'failed', so a live run could
omit security shields, network policy, prompt-injection blocking,
AND the negative-scenario forbidden-side-effect contract while still
appearing non-failing.

Fix follows the framework spirit: declare requirement in the typed
metadata, enforce it once at the orchestrator boundary.

Typed contract extension (types.ts):
  AssertionStep.required?: boolean
  When true, a probe/pending step that resolves as 'skipped' is
  reclassified as 'failed' by the phase orchestrator. Defaults to
  false (existing behavior preserved for diagnostics, docs-validation,
  and other non-security probes).

Orchestrator (phase.ts):
  executeStep: probe and pending branches now check step.required
  and return status: 'failed' with a 'required <kind> not <state>'
  message when set. Non-required probes/pending steps continue to
  surface as visible skipped results so unrelated gaps stay
  diagnostic.

Registry (assertions/registry.ts):
  - probeStep / pendingStep helpers grew an options arg with
    required?: boolean (matches the secretEnv pattern: options come
    in via typed objects, not positional args).
  - security-shields, security-policy, security-injection probes
    annotated required: true. These are the suites the run is not
    safe without; failing closed beats fake green.
  - runtime.expected-failure.no-side-effects pending step annotated
    required: true. Negative scenarios cannot honestly pass while
    their core side-effect contract is unimplemented.
  - diagnostics, docs-validation probes intentionally remain
    non-required: they are informational, not safety-critical, and
    will switch to required when the probe registry lands and they
    have real implementations.

run.ts: no change needed. The new failed-status flows through the
existing exit-code path (process.exitCode = 1 when any phase status
is failed), so the workflow correctly reports red for missing
required probes/pending steps.

Tests (e2e-phase-orchestrators.test.ts, +5):
  - test_required_probe_step_that_is_unregistered_fails_the_phase
  - test_non_required_probe_step_continues_to_skip_visibly
  - test_required_pending_step_fails_closed
  - test_security_suite_groups_in_registry_mark_their_steps_as_required
  - test_expected_failure_no_side_effects_step_in_registry_is_required

318/318 e2e-scenario framework tests pass.

Until the probe registry follow-up PR registers actual
implementations for shieldsConfigProbe / networkPolicyProbe /
injectionBlockedProbe / expectedFailureNoSideEffectsProbe, scenarios
that include those suites/expected-failure groups will fail loudly
with 'required probe not registered' or 'required pending step not
implemented'. That is the correct intermediate state: visible red
until real, instead of green-by-default.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Diagnoses surfaced by the timeout wrapper on PR #4380's scenario run
showed every assertion that goes through 'openshell sandbox exec' hangs
in CI (host can curl the gateway and list sandboxes, but openshell-exec
into the sandbox never returns). The legacy test/e2e/ scripts have
always entered the sandbox via 'openshell sandbox ssh-config' + 'ssh -F'
and don't exhibit this hang, so adopt that pattern as the default
transport in the canonical wrapper.

Behavior:

- On first call per (sandbox, script), e2e_sandbox_exec calls
  'openshell sandbox ssh-config <name>' once and caches the result
  under ${E2E_CONTEXT_DIR}/.ssh-config-cache/<name>.cfg.
- Subsequent calls reuse the cached config and invoke
  'ssh -F <cfg> -o ConnectTimeout=10 -o StrictHostKeyChecking=no
   -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR
   openshell-<name> <quoted remote-cmd>'.
- Args are quoted via printf '%q' so payloads with shell
  metacharacters (e.g. JSON bodies for chat-completion) survive
  intact.
- Fallback: if 'openshell sandbox ssh-config' returns non-zero (e.g.
  older openshell builds, runners without an ssh client), the wrapper
  falls back to 'openshell sandbox exec' and emits a stderr breadcrumb.
- Opt-out: set E2E_SANDBOX_EXEC_VIA_OPENSHELL=1 to force the original
  transport (used by the framework unit tests that stub openshell
  but not ssh).

Tests:

- Two new vitest cases:
  * sandbox_exec_should_prefer_ssh_config_transport_when_openshell_offers_one
  * sandbox_exec_should_fall_back_to_openshell_when_ssh_config_unavailable
- Existing exit-code propagation and stdin-quoting tests opt into the
  openshell-direct transport so their stub-openshell expectations
  remain valid.
- 322 framework tests pass.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
@jyaunches jyaunches marked this pull request as draft May 28, 2026 12:14
@copy-pr-bot

copy-pr-bot Bot commented May 28, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

jyaunches added 3 commits May 28, 2026 08:16
…apper

The lifecycle.rebuild.* and lifecycle.upgrade.survivor-reachable
assertions on PR #4380's scenario run still produced raw 120s SIGTERM
cascades because rebuild_upgrade.sh used its own ad-hoc
_rebuild_upgrade_run helper that called 'openshell sandbox exec'
without a per-call timeout or transport choice. The orchestrator's
120s step cap was the only safety net, and on hang the script log
captured no diagnostic.

Migrate the lib onto the canonical wrapper:

- Add _rebuild_upgrade_sandbox_exec(sandbox, cmd...) that:
  * Honors the existing REBUILD_UPGRADE_SANDBOX_CMD test override (with
    the same -n <sandbox> -- <cmd>... contract used by existing fakes)
    so framework unit tests keep working unchanged.
  * Routes production calls through e2e_sandbox_exec, picking up the
    ssh-config-preferred transport, the per-call timeout, and the
    classified diagnostic on hang.
- Replace the seven openshell-sandbox-exec call sites with the new
  helper.
- Default E2E_SANDBOX_EXEC_TIMEOUT_SECONDS to 100s for this lib's
  callers (orchestrator caps these steps at 120s; 100s leaves room for
  the wrapper to emit a diagnostic and exit cleanly before SIGTERM).

The existing rebuild_upgrade_checks_should_allow_command_fakes test
continues to pass, proving the override contract is preserved.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
The lifecycle.sandbox.list-and-status assertion on PR #4380's scenario
run failed with 'missing status field: status'. Root cause: the
assertion required the literal lower-case tokens 'status', 'gateway',
and 'sandbox' in 'nemoclaw <name> status' output, but the production
command (src/lib/actions/sandbox/status.ts) does not print 'Status:'
or 'Gateway:' labels in its normal output. The original assertion only
passed against the framework-test mock string ('status running gateway
healthy sandbox running'), which was synthetic and did not reflect
what the CLI actually emits.

Align the assertion with the production output:

- Require the sandbox name itself to appear (every status output for a
  present sandbox emits 'Sandbox: <name>').
- Require the field labels that the CLI unconditionally prints when a
  sandbox is found: 'Sandbox', 'Model', 'OpenShell'.
- Drop the 'status' and 'gateway' tokens that never appear in real
  output.

This matches the legacy test/e2e/test-full-e2e.sh:230 pattern (verifies
'nemoclaw status' exits 0 and contains substantive content) while
keeping the assertion strict enough to catch CLI regressions that drop
core fields.

Update the framework test mock to emit realistic output ('Sandbox: sb1',
'Model:', 'OpenShell:', 'Policies:') so the existing happy-path test
keeps passing on the new contract.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
The 'ubuntu-no-docker-preflight-negative' scenario on PR #4380 was
structurally registered (manifest, scenario id, expected_failure
metadata) but never executed its negative path. The framework called
the regular 'cloud-openclaw' onboarding worker, docker was actually
running, onboard succeeded, and nothing wrote
${E2E_CONTEXT_DIR}/negative-preflight.log. The
'onboarding.preflight.expected-failed' assertion then failed with
'evidence not found'.

Mirror the legacy
test/e2e/e2e-cloud-experimental/test-port8080-conflict.sh pattern
(set up failure condition -> capture onboard output -> assert
non-zero exit -> grep log) inside the typed framework:

- Add nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh:
  * Installs a 'docker' shim earlier on PATH that exits non-zero with
    'Cannot connect to the Docker daemon ...' so commandExists('docker')
    succeeds while 'docker info' fails - the same failure mode users
    see when Docker is installed but the daemon is not running.
  * Runs 'nemoclaw onboard --non-interactive' redirecting stdout+stderr
    to ${E2E_CONTEXT_DIR}/negative-preflight.log.
  * Asserts nemoclaw exited non-zero (preflight DID fail). If onboard
    unexpectedly succeeds, the action fails so a regression that lets
    docker-missing slide is loud, not silent.
  * Returns 0 on the expected-failure path so the orchestrator marks
    the action passed and the dependent assertion phase runs against
    the captured log.

- Wire the worker into onboard/dispatch.sh under the
  'cloud-openclaw-no-docker' profile id.

- Compiler routes scenarios with environment.runtime='docker-missing'
  to '<base>-no-docker' onboarding profile. Adds the new profile to
  ONBOARD_PROFILE_SECRET_ENV so NVIDIA_API_KEY still flows through
  (matches a real user invocation where the CLI loads creds even when
  preflight aborts).

- Adds compiler_routes_docker_missing_runtime_to_no_docker_onboarding_profile
  test that locks in the routing: negative scenario -> -no-docker
  profile id, evidence path, secret env; positive scenario unaffected.

324 tests pass.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
jyaunches added 2 commits June 8, 2026 11:36
- redaction.ts: prepend $HOME/.local/bin to child PATH so post-install
  shims resolve on runners where it's not on the default PATH
- sandbox_lifecycle.sh: use 'nemoclaw <sandbox> snapshot <sub>' argv
  shape; the prior form parsed 'snapshot' as the sandbox name
Resolutions:
- run.ts: kept HEAD's removal of --dry-run/--validate-only modes; took
  main's richer --emit-matrix payload (runner/label/platform/suites)
  consumed by the dynamic matrix workflow (#4359).
- 00-slack-provider-state.sh: kept HEAD's removal of the openclaw
  dry-run skip block; carried main's 'agent' local var.
- docs/README.md: kept HEAD (live-only runner, deprecated bash
  entrypoints) and folded in main's 'Migration status is tracked
  outside the repository' line so e2e-migration-inventory-lock stays
  green.
- e2e-coverage-report.test.ts, e2e-scenario-resolver.test.ts,
  runtime/resolver/coverage.ts: confirmed deletes (intentional in
  9da75ac, 'retire legacy expected-states.yaml + runtime/resolver').
- e2e-expected-state.test.ts: applied the trim that 9da75ac's
  commit message described but did not include — drop the YAML-mirror
  parity tests that depended on the now-deleted expected-states.yaml,
  replace with id-coverage assertions against listExpectedStates().
- e2e-scenarios.yaml: retitled the summary step from 'typed dry-run
  summary' to 'typed scenario run' so the workflow text matches the
  PR's live-only stance.
@jyaunches jyaunches marked this pull request as ready for review June 8, 2026 15:59
jyaunches added 2 commits June 8, 2026 12:02
#4939 added 'transitional model / target runner' guidance to docs/README.md
after our merge. Resolved by taking that content but reconciling it with
this PR's retirement of nemoclaw_scenarios/expected-states.yaml and the
runtime/resolver/ tree:

- 'Current sources of truth' table: drop the row pointing at expected-states.yaml
  and runtime/resolver; add a row for the typed expected-state registry
  (scenarios/expected-states.ts) as the single source of truth.
- Replaced 'fan-out and dry-run planning' with 'fan-out and live execution'.
- Added one paragraph noting the legacy YAML resolver and expected-states.yaml
  have been retired so the doc and the codebase agree.
CI on the merge commit (863e8eb) flagged two regressions caused by
the merge with main:

1. ShellCheck SC2148: 5 probe scripts under
   nemoclaw_scenarios/probes/ start with the SPDX comment and have no
   shebang. ShellCheck cannot infer the target shell. Added
   '#!/usr/bin/env bash' and ensured executable bit. dispatch.sh
   already had a shebang.

2. markdown-links: docs/README.md still pointed at two paths that the
   YAML-resolver retirement (commit 9da75ac) deleted:
   - ../nemoclaw_scenarios/expected-states.yaml
   - ../runtime/resolver/schema.ts
   Repointed both to the typed registry that replaced them
   (scenarios/expected-states.ts, scenarios/types.ts).

Verified locally: check-docs --only-links now passes;
test/e2e-scenario/ vitest still 366 green;
typecheck:cli clean.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh (1)

39-44: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent grep pattern can cause false negatives.

Line 33 uses a word-boundary regex to prevent partial matches (e.g., "test" won't match "test-2"), but line 40 uses fixed-string substring matching (grep -Fq), which will match partial names. If a sandbox named "${E2E_SANDBOX_NAME}-suffix" exists but "${E2E_SANDBOX_NAME}" doesn't, this probe will incorrectly fail.

🔍 Proposed fix to align openshell check with nemoclaw pattern
 if command -v openshell >/dev/null 2>&1; then
-  if openshell sandbox list 2>/dev/null | grep -Fq "${E2E_SANDBOX_NAME}"; then
+  if openshell sandbox list 2>/dev/null | grep -qE "(^|[[:space:]])${E2E_SANDBOX_NAME}([[:space:]]|$)"; then
     echo "probe sandbox-absent: openshell reports sandbox '${E2E_SANDBOX_NAME}', expected absent" >&2
     exit 1
   fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh` around lines
39 - 44, The probe sandbox-absent uses substring matching (grep -Fq) which can
false-match names like "${E2E_SANDBOX_NAME}-suffix"; update the openshell check
so it matches whole words instead (e.g., replace grep -Fq "${E2E_SANDBOX_NAME}"
with grep -wq -- "${E2E_SANDBOX_NAME}" or an equivalent word-boundary regex like
grep -Eq "\\b${E2E_SANDBOX_NAME}\\b") in the block containing the openshell
sandbox list call and the probe named "sandbox-absent" so the check aligns with
the word-boundary behavior used elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh`:
- Around line 39-44: The probe sandbox-absent uses substring matching (grep -Fq)
which can false-match names like "${E2E_SANDBOX_NAME}-suffix"; update the
openshell check so it matches whole words instead (e.g., replace grep -Fq
"${E2E_SANDBOX_NAME}" with grep -wq -- "${E2E_SANDBOX_NAME}" or an equivalent
word-boundary regex like grep -Eq "\\b${E2E_SANDBOX_NAME}\\b") in the block
containing the openshell sandbox list call and the probe named "sandbox-absent"
so the check aligns with the word-boundary behavior used elsewhere.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dbfe9c5f-8470-47b6-9fc2-8259436800ba

📥 Commits

Reviewing files that changed from the base of the PR and between 863e8eb and 3d8adfa.

📒 Files selected for processing (6)
  • test/e2e-scenario/docs/README.md
  • test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/gateway-absent.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/gateway-healthy.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
✅ Files skipped from review due to trivial changes (2)
  • test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
  • test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh

The PR audit claimed 'Audited absent in production paths: e2e_env_is_dry_run,
E2E_DRY_RUN, ...' but the live scenarios run on commit 3d8adfa revealed
two survivors:

1. validation_suites/hermes/01-history-writable.sh:34
   - Calls 'if e2e_env_is_dry_run; then ... exit 0; fi'
   - The helper was deleted in b7acfb7 -> the whole hermes scenario
     blew up with 'e2e_env_is_dry_run: command not found' on every
     run, masquerading as a hermes regression
   - Surfaced in: ubuntu-repo-cloud-hermes scenario, runtime.hermes.history-writable

2. validation_suites/messaging/slack/00-slack-provider-state.sh:66
   - Wrapped the entire hermes branch in 'if [[ -n "${E2E_DRY_RUN:-}" ]];
     then 3x e2e_pass else <real probe>; fi'
   - Unreachable now (runner never sets E2E_DRY_RUN), but contradicts
     the audit and would skip the real probes if anything ever did

Removed the dry-run wrapper in both files; dedented the slack else-block
inline. bash -n clean; vitest 366 green; full repo grep for
'e2e_env_is_dry_run|E2E_DRY_RUN' (excluding .test.ts) returns zero.

Re-enables: ubuntu-repo-cloud-hermes / runtime.hermes.history-writable

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh`:
- Line 70: Replace direct openshell sandbox invocations with the canonical
e2e_sandbox_exec wrapper to preserve per-call timeout and transport behavior;
specifically, change calls that currently run "openshell sandbox exec --name
\"${sandbox_name}\" -- /opt/hermes/.venv/bin/python -c '...'" to use
e2e_sandbox_exec passing the sandbox identifier and the same command payload
(same python -c string) so timeouts/fallbacks are honored. Apply the same
replacement for the other occurrences mentioned (the calls around the other two
locations) and keep the original command arguments unchanged except for swapping
the invocation to e2e_sandbox_exec.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 060de45a-9ece-4905-ba24-816fa28927ce

📥 Commits

Reviewing files that changed from the base of the PR and between 3d8adfa and 381d654.

📒 Files selected for processing (2)
  • test/e2e-scenario/validation_suites/hermes/01-history-writable.sh
  • test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
💤 Files with no reviewable changes (1)
  • test/e2e-scenario/validation_suites/hermes/01-history-writable.sh

Comment thread test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh Outdated
jyaunches added 2 commits June 8, 2026 13:27
Two batched mechanical fixes for the static-checks pre-commit job on
PR head 381d654:

1. Files were modified by shfmt (-i 2 -ci -bn): 6 .sh files used
   '\\<newline>' line continuations where shfmt's -bn (binary-next-line)
   prefers '&&\n' / '||\n'. Pre-existing on main; the merge dragged
   them into the PR diff and our local commits weren't run through
   pre-commit. Re-formatted in place.

2. check-shebang-scripts-are-executable: 3 scripts had shebangs but
   were tracked as 0644:
   - test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh
   - test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh
   - test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
   chmod +x and re-tracked.

Verified: shfmt -d clean across all PR-touched .sh files; biome format
clean across PR-touched ts/tsx/js/mjs/mts; bash -n clean on the 6
reformatted scripts; vitest test/e2e-scenario/ 366 green.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh (1)

13-13: ⚠️ Potential issue | 🟡 Minor

Update the PR summary: runtime/lib/env.sh is still present and is sourced by the scenario script

  • test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh line 13 still sources . "${LIB_DIR}/env.sh", and test/e2e-scenario/runtime/lib/env.sh exists in the repo.
  • The “removed”/runtime-failure claim doesn’t match the current code; align the summary (e.g., if only specific helpers were removed, state that).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh`
at line 13, The PR summary claims runtime/lib/env.sh was removed but the
scenario script still sources it via the LIB_DIR variable (the line sourcing
"${LIB_DIR}/env.sh" in 02-inference-local-from-sandbox.sh) and the
runtime/lib/env.sh file remains in the repo; update the PR summary to accurately
reflect what was changed (e.g., which specific helpers were removed or
refactored) or remove the source line if you truly intend to stop using
runtime/lib/env.sh—ensure consistency between the commit changes and the summary
by either modifying the script to stop sourcing LIB_DIR/env.sh or amending the
PR description to state that env.sh is still present and which pieces were
removed.
🧹 Nitpick comments (1)
test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh (1)

81-81: ⚖️ Poor tradeoff

Shell injection risk in command construction.

The command string interpolates ${marker_content} into a single-quoted context. If marker_content ever contains a single quote, the quoting breaks. While currently safe (the content is REBUILD_LIFECYCLE_$(date +%s)_${RANDOM}), this pattern was explicitly flagged in PR coordination as a concern: "prevention of plan-artifact-to-shell injection."

Pass variables as positional arguments to avoid quote interpretation:

🔒 Proposed fix using positional arguments
-  if ! E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=30 \
-    e2e_sandbox_exec "${sandbox_name}" -- sh -c \
-    "mkdir -p '$(dirname "${LIFECYCLE_REBUILD_MARKER_PATH}")' && printf '%s' '${marker_content}' > '${LIFECYCLE_REBUILD_MARKER_PATH}'"; then
+  if ! E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=30 \
+    e2e_sandbox_exec "${sandbox_name}" -- sh -c \
+    'mkdir -p "$(dirname "$1")" && printf %s "$2" > "$1"' \
+    sh "${LIFECYCLE_REBUILD_MARKER_PATH}" "${marker_content}"; then

This passes marker_content as $2 to the remote shell, preventing quote interpretation issues.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh` at
line 81, The current shell command interpolates marker_content directly into a
single-quoted string which breaks if marker_content contains a single quote;
update the command used to create LIFECYCLE_REBUILD_MARKER_PATH so
marker_content is passed as a positional argument to the remote shell instead of
being interpolated into the command string. Concretely, change the constructed
command that references LIFECYCLE_REBUILD_MARKER_PATH and marker_content to use
a printf (or echo) that reads from a positional parameter (e.g., $2) and then
invoke the remote shell/exec call with marker_content supplied as that
positional argument so no shell quoting of marker_content is required.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh`:
- Line 13: The PR summary claims runtime/lib/env.sh was removed but the scenario
script still sources it via the LIB_DIR variable (the line sourcing
"${LIB_DIR}/env.sh" in 02-inference-local-from-sandbox.sh) and the
runtime/lib/env.sh file remains in the repo; update the PR summary to accurately
reflect what was changed (e.g., which specific helpers were removed or
refactored) or remove the source line if you truly intend to stop using
runtime/lib/env.sh—ensure consistency between the commit changes and the summary
by either modifying the script to stop sourcing LIB_DIR/env.sh or amending the
PR description to state that env.sh is still present and which pieces were
removed.

---

Nitpick comments:
In `@test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh`:
- Line 81: The current shell command interpolates marker_content directly into a
single-quoted string which breaks if marker_content contains a single quote;
update the command used to create LIFECYCLE_REBUILD_MARKER_PATH so
marker_content is passed as a positional argument to the remote shell instead of
being interpolated into the command string. Concretely, change the constructed
command that references LIFECYCLE_REBUILD_MARKER_PATH and marker_content to use
a printf (or echo) that reads from a positional parameter (e.g., $2) and then
invoke the remote shell/exec call with marker_content supplied as that
positional argument so no shell quoting of marker_content is required.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 19ad2265-3927-4694-80ee-0d4c10fded73

📥 Commits

Reviewing files that changed from the base of the PR and between 381d654 and fc06168.

📒 Files selected for processing (6)
  • test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh
  • test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh
  • test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
  • test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
  • test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
  • test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
💤 Files with no reviewable changes (2)
  • test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
  • test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh

…e-shape budget

The static-checks job runs scripts/find-source-shape-tests.ts --check
with budget ci/source-shape-test-budget.json (maxSourceShapeCases=0).
e2e-redaction-parity.test.ts had 2 source-shape cases that read both
redaction.ts files as text and compared regex literals via string parsing.

Refactored to import the actual RegExp arrays from both modules and
compare them by behavior — fingerprint each via .source + .flags. Same
parity guarantee, no source-text reads, scanner now reports 0 cases.

Required exporting TOKEN_PREFIX_PATTERNS and CONTEXT_PATTERNS from
test/e2e-scenario/scenarios/orchestrators/redaction.ts. The framework
runtime stays decoupled from src/lib/security/ — only the parity test
crosses that boundary, which is the test's whole purpose. Added a comment
on the framework export explaining the test-only contract.

Verified: `npm run source-shape:check` reports 0 cases; vitest
test/e2e-scenario/ 366 green; typecheck:cli clean.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e-scenario/scenarios/orchestrators/redaction.ts (1)

211-221: ⚠️ Potential issue | 🟠 Major

Fix chunk-boundary redaction in pipeRedacted to prevent split tokens from leaking into evidence logs.

test/e2e-scenario/scenarios/orchestrators/redaction.ts (pipeRedacted, lines 211-221) redacts each data chunk independently via redactString(chunk.toString("utf8")) and writes it immediately. With no carry/overlap buffering, a token split across adjacent chunks can fail the regex matches and persist in the evidence log (and into the stderrTail-based message path). The current redaction test covers echo-style output but doesn’t exercise split-token chunk boundaries.

Suggested direction (stream-safe redaction)
+import { StringDecoder } from "node:string_decoder";
...
 export function pipeRedacted(
   src: Readable,
   log: Writable,
   onChunk?: (redactedChunk: string) => void,
 ): void {
-  src.on("data", (chunk: Buffer) => {
-    const redacted = redactString(chunk.toString("utf8"));
-    log.write(redacted);
-    onChunk?.(redacted);
-  });
+  const decoder = new StringDecoder("utf8");
+  let carry = "";
+  const OVERLAP_CHARS = 4096;
+
+  const emitRedacted = (text: string) => {
+    if (!text) return;
+    const redacted = redactString(text);
+    log.write(redacted);
+    onChunk?.(redacted);
+  };
+
+  src.on("data", (chunk: Buffer | string) => {
+    const text = typeof chunk === "string" ? chunk : decoder.write(chunk);
+    const merged = carry + text;
+    const cutoff = Math.max(0, merged.length - OVERLAP_CHARS);
+    emitRedacted(merged.slice(0, cutoff));
+    carry = merged.slice(cutoff);
+  });
+
+  src.on("end", () => {
+    emitRedacted(carry + decoder.end());
+    carry = "";
+  });
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts` around lines 211 -
221, pipeRedacted currently redacts each Buffer chunk independently
(src.on("data") → redactString(...); log.write(...)) which allows sensitive
tokens split across chunk boundaries to slip through; change pipeRedacted to
maintain a small carry buffer between chunks: append incoming chunk to a
leftover string, run redactString on the combined text but retain any trailing
partial token (notably the same token/regex semantics used by redactString) by
slicing off an appropriate suffix into leftover for the next chunk, write only
the fully-redacted portion to log and call onChunk for that portion, and on
"end"/"close"/"error" flush the remaining leftover through redactString and
write it; ensure the same function names (pipeRedacted, redactString) and event
handlers (src.on("data"), src.on("end"/"close"/"error")) are updated.
🧹 Nitpick comments (1)
test/e2e-scenario/scenarios/orchestrators/redaction.ts (1)

52-74: ⚡ Quick win

Exported regex arrays should be immutable to prevent runtime mutation of redaction behavior.

Now that these arrays are exported for tests, any importer can mutate them (push/splice), which can silently alter production redaction behavior in-process.

Hardening patch
-export const TOKEN_PREFIX_PATTERNS: RegExp[] = [
+export const TOKEN_PREFIX_PATTERNS: readonly RegExp[] = Object.freeze([
   ...
-];
+]);

-export const CONTEXT_PATTERNS: RegExp[] = [
+export const CONTEXT_PATTERNS: readonly RegExp[] = Object.freeze([
   ...
-];
+]);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts` around lines 52 - 74,
The exported arrays TOKEN_PREFIX_PATTERNS and CONTEXT_PATTERNS are mutable and
can be altered by importers; change their exports to immutable frozen/read-only
arrays and freeze each RegExp instance. Replace direct array exports with
something like: export const TOKEN_PREFIX_PATTERNS: ReadonlyArray<RegExp> =
Object.freeze([ ...patterns... ].map(r => Object.freeze(r))); and do the same
for CONTEXT_PATTERNS so callers cannot push/splice or mutate the regex objects.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts`:
- Around line 211-221: pipeRedacted currently redacts each Buffer chunk
independently (src.on("data") → redactString(...); log.write(...)) which allows
sensitive tokens split across chunk boundaries to slip through; change
pipeRedacted to maintain a small carry buffer between chunks: append incoming
chunk to a leftover string, run redactString on the combined text but retain any
trailing partial token (notably the same token/regex semantics used by
redactString) by slicing off an appropriate suffix into leftover for the next
chunk, write only the fully-redacted portion to log and call onChunk for that
portion, and on "end"/"close"/"error" flush the remaining leftover through
redactString and write it; ensure the same function names (pipeRedacted,
redactString) and event handlers (src.on("data"), src.on("end"/"close"/"error"))
are updated.

---

Nitpick comments:
In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts`:
- Around line 52-74: The exported arrays TOKEN_PREFIX_PATTERNS and
CONTEXT_PATTERNS are mutable and can be altered by importers; change their
exports to immutable frozen/read-only arrays and freeze each RegExp instance.
Replace direct array exports with something like: export const
TOKEN_PREFIX_PATTERNS: ReadonlyArray<RegExp> = Object.freeze([ ...patterns...
].map(r => Object.freeze(r))); and do the same for CONTEXT_PATTERNS so callers
cannot push/splice or mutate the regex objects.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 50d05560-6afd-4f7b-927c-bafdd8cf0b74

📥 Commits

Reviewing files that changed from the base of the PR and between fc06168 and b41c17b.

📒 Files selected for processing (2)
  • test/e2e-scenario/framework-tests/e2e-redaction-parity.test.ts
  • test/e2e-scenario/scenarios/orchestrators/redaction.ts

Two batched review-thread fixes (Phase 3 of CI loop):

1. CodeRabbit (major) on
   validation_suites/messaging/slack/00-slack-provider-state.sh:
   3 calls used 'openshell sandbox exec --name X -- ...' directly,
   bypassing the canonical e2e_sandbox_exec wrapper. The wrapper
   provides per-call timeout via E2E_SANDBOX_EXEC_TIMEOUT_SECONDS
   plus ssh-config-preferred / openshell-exec-fallback transport. A
   wedged openshell call without the wrapper can stall the suite
   indefinitely in live mode. Replaced all 3 with the wrapper call,
   matching the timeout cap pattern already used at line 36.
   Fixes hermes-platforms-enabled, hermes-allowed-channels-scoped,
   and hermes-gateway-running probes in the slack scenario.

2. CodeQL alert 715 (Shell command built from environment values) on
   test/e2e-scenario/scenarios/probes/util.ts:
   runSandboxCmd interpolated wrapperPath (derived from
   ProbeContext.repoRoot) and the sandbox name + payload argv into
   a bash -c script string. Even with JSON.stringify-style escaping,
   CodeQL flags taint flow from runtime context into shell. Refactored
   to pass all user-controlled values as positional bash parameters:
   spawn('bash', ['-c', script, 'e2e-probe-spawn', wrapperPath,
   fnName, sandboxName, ...args]). Script body references them as
   "$1", "$2", "$3", "${@:4}". No user data is concatenated
   into the script. Smoke-tested the new pattern; runtime behavior
   is identical for callers (single argv vector preserved).

Updated runSandboxCmd doc-comment which described the old
single-quote-into-script approach.

Verified: bash -n + shfmt clean on slack file; vitest 366 green;
typecheck:cli clean; source-shape:check still 0 cases.
Comment thread test/e2e-scenario/scenarios/probes/util.ts Fixed
jyaunches added 3 commits June 8, 2026 13:55
Iteration 5's refactor cleared CodeQL alert 715 (interpolated user data
into shell-command body) but introduced alert 792 on the resulting
spawn("bash", ["-c", script, sentinel, ...bashArgs]) call.

CodeQL's js/shell-command-injection-from-environment rule fires on any
data flow from runtime context into a 'bash -c' invocation, even when
the data reaches the script as positional bash parameters rather than
through interpolation.

Applied the established repo pattern (src/lib/runner.ts:101-102):

  1. Added a NUL-byte rejector on every bashArg (mirrors
     normalizeSpawnArgs / rejectNulByte from runner.ts).
  2. Documented the spawn contract in spawnBash's doc-comment:
     - script body is a string literal at every call site
     - all variable data flows through positional bash params ($1,$2,...)
     - bash treats positional argv as data, not code
     - bash binary is hard-coded; shell:false is implicit
  3. Suppressed alert 792 with an explicit lgtm[] comment that points
     at the contract above.

This follows the documented repo convention: validate inputs, document
why the spawn is safe, suppress the false positive with the rule name
and a short justification.

Verified: typecheck:cli clean; vitest test/e2e-scenario/ 366 green;
smoke-tested the new spawn behavior with a temp wrapper script.
# Conflicts:
#	test/e2e-scenario/docs/README.md
Comment thread test/e2e-scenario/scenarios/probes/util.ts
…awn()

Iter 9 added the lgtm suppression but with a multi-line justification
that pushed the actual marker 3 lines above the spawn() call. CodeQL
treats each '//' comment as a separate node, so my marker on line N
followed by 2 continuation comments on N+1, N+2 was not recognized
as a suppression for the spawn() on N+3. The new alert 793 fired
on the post-iter-9 code at line 113.

Restructured to match src/lib/runner.ts:101-104 exactly: contextual
explanation as plain comments above, then the bare lgtm directive
on the immediately-preceding line. Justification (literal script
body, NUL-validated positional argv, hard-coded bash binary) is
already in spawnBash's doc-comment, which is the documentation site
of record for the contract.

Verified: typecheck:cli clean; vitest 366 green.
@jyaunches jyaunches merged commit 282876a into main Jun 8, 2026
33 checks passed
@jyaunches jyaunches deleted the feat/e2e-real-execution branch June 8, 2026 18:34
jyaunches added a commit that referenced this pull request Jun 9, 2026
…#4996)

## Summary

Bash is not the direction the E2E scenario stack is heading, and the
vitest-runner migration underway makes that even clearer — none of these
bash files have a future role. This PR sweeps out a set of inert
tombstones and orphan helpers under `test/e2e-scenario/runtime/` that
have no production callers, so the migration lands on a smaller surface.
Net: 6 file deletions + minor test trimming. ~250 lines removed. No
production behavior change.

## Changes

**Tombstone bash entrypoints (deleted; bodies were just `exit 2`
stubs):**
- `test/e2e-scenario/runtime/run-scenario.sh`
- `test/e2e-scenario/runtime/run-suites.sh`

**Orphan helper libs (deleted; zero callers in production):**
- `test/e2e-scenario/runtime/lib/artifacts.sh` — only caller was a
self-test in `e2e-lib-helpers.test.ts`.
- `test/e2e-scenario/runtime/lib/cleanup.sh` — was a thin re-export of
`sandbox-teardown.sh`.
- `test/e2e-scenario/runtime/lib/negative.sh` — superseded by
`scenarios/orchestrators/negative-matcher.ts`.
- `test/e2e-scenario/runtime/lib/port-holder.sh` — superseded by typed
probes.

**Test cleanup in
`test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts` (rides
along):**
- Dropped
`artifact_helper_should_collect_known_logs_without_failing_when_missing`
— it was the only caller of `artifacts.sh`.
- Dropped `gateway_helper_should_report_unhealthy_gateway_clearly` — it
sourced `${RUNTIME_LIB}/gateway.sh`, which does not exist on main. The
test was passing only by accident: the missing-file stderr happens to
match `/gateway/i`.
- Dropped the dead `E2E_DRY_RUN: "1"` env arg in
`older_base_image_should_emit_dockerfile_pointing_at_tagged_base`. PR
#4380 deleted the `e2e_env_is_dry_run` short-circuit from
`older-base-image.sh`, so the env var is unread test scaffolding now.

**Comment hygiene:**
- `nemoclaw_scenarios/probes/sandbox-absent.sh` — dropped the "Typed
replacement for the legacy `run-scenario.sh` inline check" note now that
the legacy file is deleted.

**Intentionally untouched:**
- `test/e2e-scenario-advisor.test.ts` still references
`test/e2e-scenario/runtime/run-scenario.sh` as a string fixture. The
advisor never reads the file — it filters changed-file paths by shape —
so the fixture continues to validate the same plumbing. Refreshing those
fixtures is out of scope.
- `test/e2e-scenario/docs/{README,MIGRATION}.md` reference the old bash
entrypoints. Doc updates are out of scope; the docs already mark these
paths as deprecated.

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Notes on verification:
- Targeted vitest run is green: `npx vitest run
test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
test/e2e-scenario-advisor.test.ts test/e2e-scenario/` — 3 test files,
106 tests, 0 failures.
- `npx prek run --all-files` reports the same `Test (CLI)` flakes that
exist on a clean `origin/main` checkout (timing-sensitive tests in
`nemoclaw-start.test.ts`, `doctor-gateway-token.test.ts`,
`e2e-fixture-context.test.ts`,
`runtime-hermes-secret-boundary-behavioural.test.ts`). All pass on
isolated re-run; none reference any of the deleted bash files.
Pre-existing — not caused by this PR. Leaving the verification box
unchecked because the local prek run wasn't fully clean even though the
issue is unrelated.

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: e2e End-to-end tests, nightly failures, or validation infrastructure bug-fix PR fixes a bug or regression v0.0.61 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants