test(e2e): execute real shell assertions; delete dry-run, --validate-only, and the bash runner by jyaunches · Pull Request #4380 · NVIDIA/NemoClaw

jyaunches · 2026-05-28T01:04:14Z

Why

Removing --dry-run and plan-mode execution from the E2E runner: they let a coding agent ship work behind false-green test runs. The TS runner now has one execution mode: live. No flag, env var, helper, or branch in any production path bypasses real assertion execution.

Two seams needed closing:

phase.ts:executeStep had no child_process.spawn — every real shell/probe step fell through to a hardcoded failed: unsupported live step, and four placeholder refs (fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton) were the only paths that reported passed.
~30 shell scripts under validation_suites/, onboarding_assertions/, and nemoclaw_scenarios/ began with if e2e_env_is_dry_run; then exit 0; fi, so when the workflow passed --dry-run the scripts exited 0 before running.

What changed

Orchestrator (TypeScript)

phase.ts:executeStep now spawns shell steps via child_process.spawn, with detached process groups so timeouts kill bash + sleep cleanly via negative-pid signal. Probe steps return skipped: "probe not registered" until the registry lands. Pending steps return skipped: "pending: <ref>". Unknown kinds throw. Real evidence captured to .e2e/logs/<step-id>.log. Step-level reliability.timeoutSeconds and retry.{attempts,on} enforced here, not in clients.
run.ts: --dry-run, --validate-only deleted. Default invocation is live execution. --list and --plan-only (local debug) survive read-only. --emit-matrix added for the dynamic-matrix workflow (feat(e2e): generate scenario fan-out matrix from typed registry #4359), with the richer payload (runner, label, platform, suites) merged from main.
types.ts: RunContext.dryRun deleted.

Workflow

e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both --dry-run invocations (Linux + WSL), are gone. Workflows execute live. Summary step retitled from "typed dry-run summary" to "typed scenario run".
tools/e2e-scenarios/workflow-boundary.mts validator now rejects --dry-run, --plan-only, --validate-only in the workflow.

Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)

runtime/run-scenario.sh: 483 lines of duplicated install/onboard/gateway-check/suite-execution → 5-line fail-fast stub pointing at run.ts.
runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep walks typed assertionGroups directly; nothing in TS calls a YAML-walking bash runner.

Shell scripts (the leaves stay, the dry-run skip blocks die)

validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**: every if e2e_env_is_dry_run; then ... exit 0; fi and every [[ "${E2E_DRY_RUN:-0}" == "1" ]] short-circuit removed. The real assertion logic that was hiding underneath now runs unconditionally.
runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
inference_routing.sh: dead _e2e_inference_plan helper deleted.

Legacy YAML resolver retired (rolled in from the planned follow-up PR)

Deleted nemoclaw_scenarios/expected-states.yaml and the entire runtime/resolver/ tree (load.ts, plan.ts, schema.ts, coverage.ts, index.ts, validator.ts, expected-failure.ts, js-yaml.d.ts) plus runtime/coverage-report.sh.
Six framework tests that exercised only the resolver path were dropped: e2e-scenario-schema, e2e-scenario-resolver, e2e-coverage-report, e2e-scenario-additional-families, e2e-metadata-final-hygiene, e2e-expected-failure (typed equivalent now lives in e2e-negative-matcher.test.ts).
e2e-expected-state.test.ts: YAML-mirror parity tests dropped; replaced with id-coverage assertions against listExpectedStates(). The typed registry in scenarios/expected-states.ts is the single source of truth.

Tests

DELETED (validated dead code paths): e2e-suite-runner.test.ts, e2e-scenario-first-migration.test.ts, e2e-expected-state-validator.test.ts.
REWRITTEN e2e-phase-orchestrators.test.ts: now exercises real shell spawning via temp scripts (pass / fail-with-stderr-tail / timeout / retry-on-classified-transient / missing-ref), real probe skipping with visible reason, and real pending skipping. Replaces the prior placeholder refs with assertions that observe actual subprocess behavior.
TRIMMED e2e-lib-helpers.test.ts, e2e-scenario-additional-families.test.ts, e2e-scenario-resolver.test.ts, e2e-context-helper.test.ts: dry-run-mode / run-scenario.sh-spawning tests deleted; tests of real bash semantics survive.

Docs

test/e2e-scenario/docs/README.md: one runner, one mode (live), no dry-run, no validate-only. Migration tracking section reconciled with main: "Migration status is tracked outside the repository in GitHub issues and pull requests" (parent issue Implement layered E2E scenario model #3588).

Merge with main

Resolved conflicts in run.ts, 00-slack-provider-state.sh, docs/README.md, three modify/delete pairs under framework-tests/ and runtime/resolver/, and e2e-scenarios.yaml.
Noted: a small E2E --validate-only/--dry-run plumbing fix that was layered onto the live runner during merge — ~/.local/bin is now prepended to child PATH at the framework boundary (redaction.ts:buildChildEnv) so post-install nemoclaw shims resolve on runners (notably self-hosted GPU and WSL) where it isn't on the default PATH; and snapshot lifecycle assertions in validation_suites/lib/sandbox_lifecycle.sh now use the canonical nemoclaw <sandbox> snapshot <subcommand> argv shape, matching test-snapshot-commands.sh.

Verification

$ npx tsx test/e2e-scenario/scenarios/run.ts --list
hybrid scenario registry
- brev-launchable-cloud-openclaw: ...
- ubuntu-repo-cloud-openclaw: ...
(22 scenarios listed)

$ npx tsx test/e2e-scenario/scenarios/run.ts --emit-matrix
[{"id":"brev-launchable-cloud-openclaw","runner":"...","label":"...","platform":"...","suites":[...]}, ...]

$ E2E_CONTEXT_DIR=/tmp/x npx tsx test/e2e-scenario/scenarios/run.ts \
    --scenarios ubuntu-repo-cloud-openclaw
... (compiled plan output) ...
Phase results:
  environment: skipped (skipped=1)
  onboarding: failed (passed=1 failed=1)
  runtime: failed (failed=34 skipped=5)

$ cat /tmp/x/.e2e/logs/runtime.smoke.cli-available.log
smoke:cli-available
e2e context: missing required key(s): E2E_SCENARIO

$ vitest run test/e2e-scenario/
Test Files  30 passed (30)
     Tests  366 passed (366)

$ npm run typecheck:cli
(clean)

Audited absent in production paths: --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run, fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton, unsupported-live-step, --validate-only, RunContext.dryRun.

Spec gates addressed

Phase 6 — orchestrators execute live shell/probe assertion steps.
Phase 7 — single TS runtime entrypoint; bash runners deprecated; workflows use the typed runner with no --dry-run.
Workflow side of Phase 9 — --dry-run, --validate-only, and suite_filter gone from active paths.

Known follow-ups (not blocking this PR)

Two leftover if [[ -n "${E2E_DRY_RUN:-}" ]] short-circuits remain inside the hermes branch of validation_suites/messaging/slack/00-slack-provider-state.sh. They are dead code now that nothing in production sets E2E_DRY_RUN, but should be removed for consistency with the stated audit.

Summary by CodeRabbit

Tests
- E2E now runs live scenarios by default; many legacy plan-only/dry-run tests removed or refocused. Phase execution uses real shell steps with timeouts, retries, richer evidence and failure reporting.
New Features
- Added probes and lifecycle utilities for CLI/gateway/sandbox/negative/health checks. Typed runner can emit an execution matrix and is the canonical executor. Secret-redaction and child-env minimization added for safer spawn logging.
Chores
- Docs updated for live-only workflow; legacy bash entrypoints deprecated; CI adds a new scenario and tighter artifact uploads.

…only, and the bash runner The merged hybrid scenario architecture shipped scaffolding that looked like it ran E2E tests but did not. Two layers were producing fake green: 1. phase.ts:executeStep had no child_process.spawn anywhere. Every real shell/probe step fell through to a hardcoded { status: "failed", message: "unsupported live step" }, and a handful of fake-pass refs (fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton) were the only paths that reported "passed". CI green meant the plan compiled, not that any assertion executed. 2. ~30 shell scripts under validation_suites/, onboarding_assertions/, nemoclaw_scenarios/install/, and nemoclaw_scenarios/onboard/ began with "if e2e_env_is_dry_run; then echo [dry-run] ...; exit 0; fi". Once the dry-run flag flowed in (which workflows did pass), every script silently exited 0 before its real assertion ran. This change rips out both layers in one shot. The TS runner has one execution mode: live. There is no flag, env var, helper, branch, or comment in any production path that can produce a fake pass. Orchestrator (TypeScript) - phase.ts: executeStep now spawns shell steps via child_process.spawn, with detached process groups so timeouts kill bash + sleep cleanly. Probe steps return skipped "probe not registered" until the registry lands. Pending steps return skipped "pending: <ref>". Unknown kinds throw. Real evidence is captured to .e2e/logs/<step-id>.log. Step-level reliability.timeoutSeconds and retry.{attempts,on} policy are enforced here, not in clients. - run.ts: --dry-run, --validate-only deleted. Default invocation is live execution. --list and --plan-only (local debug) survive read-only. --emit-matrix added for the dynamic-matrix workflow (PR #4359). - types.ts: RunContext.dryRun deleted. AssertionResult already supported "skipped" status, now actually used. Workflow - e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both --dry-run invocations (Linux + WSL), are gone. Workflows execute live. - workflow-boundary.mts validator now requires --dry-run, --plan-only, and --validate-only to NOT appear in the workflow. Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2) - runtime/run-scenario.sh: 483 lines of duplicated install/onboard/ gateway-check/suite-execution -> 5-line fail-fast stub pointing at run.ts. The TS phase orchestrators own this work now. - runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep walks typed assertionGroups directly; nothing in TS calls a YAML-walking bash runner. Shell scripts (the leaves stay, the dry-run skip blocks die) - validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**: every "if e2e_env_is_dry_run; then ... exit 0; fi" and every "[[ ${E2E_DRY_RUN:-0} == 1 ]]" short-circuit removed. The real assertion logic that was hiding underneath now runs unconditionally. - runtime/lib/env.sh: e2e_env_is_dry_run helper deleted. - inference_routing.sh: dead _e2e_inference_plan helper (only callable from the deleted dry-run paths) deleted. Tests - DELETED (validated dead code paths): e2e-suite-runner.test.ts (run-suites.sh behavior) e2e-scenario-first-migration.test.ts (run-scenario.sh dry-run plan) e2e-expected-state-validator.test.ts (--validate-only mode) - REWRITTEN: e2e-phase-orchestrators.test.ts: now exercises real shell spawning via temp scripts (pass/fail/timeout/retry/missing-ref), real probe skipping with visible reason, and real pending skipping. The previous fake-pass refs in this test were the canonical example of the problem. - TRIMMED: e2e-lib-helpers.test.ts: dry-run-mode unit tests deleted; tests of real bash semantics survive. e2e-scenario-additional-families.test.ts: planOnly-via-bash tests deleted; resolveScenario-direct tests survive. e2e-scenario-resolver.test.ts: run-scenario.sh --plan-only spawn tests deleted; resolver unit tests survive. e2e-context-helper.test.ts: dry-run trace test deleted. Docs - docs/README.md: updated to state one runner, one mode (live), no dry-run, no validate-only. Bash entrypoints documented as deprecated fail-fast stubs. Verification - run.ts --list : prints the typed registry (intact) - run.ts --emit-matrix : emits JSON matrix for the dynamic-matrix workflow - run.ts --scenarios <id>: spawns real shell scripts, real exit codes, real failures with real evidence logs. Phase results show passed/failed/skipped honestly. - All 274 e2e-scenario framework tests pass. - Audited: no surviving --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run, fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton, unsupported-live-step, --validate-only, or RunContext.dryRun in any production path. CI for this PR will go red on environments where nemoclaw is not actually installed and onboarded. That is the point. Red is the first honest signal in months. Subsequent PRs (probe registry, OnboardingOrchestrator wiring into the real install/onboard dispatchers, old YAML resolver deletion) fix the real failures rather than hide them. Spec gates addressed: Phase 6 (orchestrators execute live shell steps), Phase 7 (single TS runtime entrypoint, bash runners deprecated), and the workflow side of Phase 9 (--dry-run / --validate-only / suite_filter gone from active paths). The old YAML resolver source under runtime/resolver/ stays for now; its deletion is the next PR.

coderabbitai · 2026-05-28T01:04:26Z

📝 Walkthrough

Walkthrough

Converts the E2E flow to a single TypeScript live runner: adds typed phase actions/results, executes actions and shell steps for real (with logs, timeouts, retries, evidence), deprecates bash runners, removes dry-run short-circuits across suites/install/onboard, updates docs/workflows, and prunes plan-only/dry-run tests.

Changes

Live E2E Execution Framework

Layer / File(s)	Summary
Type contracts and phase action result model `test/e2e-scenario/scenarios/types.ts`	Adds `PhaseAction` and `PhaseActionResult`, changes `RunPlanPhase.actions` to `PhaseAction[]`, and removes `dryRun` from `RunContext`.
Compiler action generation and dispatcher entrypoint `test/e2e-scenario/scenarios/compiler.ts`, `test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh`	Generates typed shell actions for environment/onboarding phases with timeouts/evidence, centralizes helper script refs, updates plan rendering, and adds `dispatch-action.sh` as deterministic shell-fn entrypoint.
Phase orchestrator real execution path `test/e2e-scenario/scenarios/orchestrators/phase.ts`	Executes `phase.actions` before assertions, runs shell steps via bash with repo-relative resolution, per-step logs, detached process-group timeouts, stderr-tail capture, retries, classifier-based failure categories, and updated phase status computation.
Phase orchestrator and compiler tests `test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts`	Refactors tests to use OS temp contexts and adds real shell execution suites (pass/fail/timeout/retry/missing-script), action-ordering/evidence checks, and compiler action-shape assertions.
CLI updates and bash runner deprecation `test/e2e-scenario/scenarios/run.ts`, `test/e2e-scenario/runtime/run-scenario.sh`, `test/e2e-scenario/runtime/run-suites.sh`	`run.ts` drops `--dry-run`/`--validate-only`, gains `--emit-matrix`/`--plan-only`; legacy bash runners are converted to fail-fast deprecation stubs.
Docs and env helper cleanup `test/e2e-scenario/docs/README.md`, `test/e2e-scenario/runtime/lib/env.sh`	README centers TS runner as the supported entrypoint and live execution model; removes `e2e_env_is_dry_run` helper.
Install, onboard, probes, and validation suites run live `test/e2e-scenario/nemoclaw_scenarios/`, `test/e2e-scenario/validation_suites//.sh`, `test/e2e-scenario/nemoclaw_scenarios/probes/*`	Removes `E2E_DRY_RUN` short-circuits across install/onboard/assertions so checks run against live outputs; adds lifecycle and probe scripts and updates suite implementations to use sandbox exec wrappers and real verification.
Legacy dry-run and plan-only test pruning `test/e2e-scenario/framework-tests/e2e-*.test.ts`	Deletes or narrows tests that relied on deprecated bash plan-only/dry-run flows, migrating coverage to typed resolver/runner tests.
Workflow wiring and artifact guardrails `.github/workflows/e2e-scenarios.yaml`, `.github/workflows/e2e-scenarios-all.yaml`, `tools/e2e-scenarios/workflow-boundary.mts`	Adds `ubuntu-rebuild-openclaw` mapping/matrix, removes plan-only resolution steps, runs `run.ts` without `--dry-run`, renames dry-run summary text to live wording, and tightens artifact uploads to an explicit `.e2e/*` subpath allowlist with `include-hidden-files: false`; enforces forbidden flags at workflow boundary.

Sequence Diagram(s)

sequenceDiagram
  participant GH as GitHub Actions
  participant RunnerTS as run.ts (TypeScript)
  participant Orchestrator as PhaseOrchestrator
  participant Shell as bash script
  GH->>RunnerTS: invoke --scenarios / --emit-matrix
  RunnerTS->>Orchestrator: compile plans, execute phases/actions
  Orchestrator->>Shell: runAction -> spawn bash (logs, timeout)
  Shell-->>Orchestrator: exit code, evidence (logs/stderr tail)
  Orchestrator-->>RunnerTS: aggregate PhaseResult (actions + assertions)
  RunnerTS-->>GH: exit code + artifact emission

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/NemoClaw#4656: Overlapping workflow summary/typed-runner step wording changes.
NVIDIA/NemoClaw#4655: Related scenario-to-runner ROUTES mapping updates and drift-guard tests.
NVIDIA/NemoClaw#4651: Related resolver test changes and coverage migration.

Suggested labels

refactor, area: ci

Suggested reviewers

prekshivyas
cv

Poem

A rabbit hops through tests and logs,
Old dry-run tunnels fade like fog.
Typed actions march, scripts hum true,
Timeouts caught, evidence in view.
🐇 Hooray — green traces, tidy and new.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/e2e-real-execution

# Conflicts: # test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e-scenario/scenarios/orchestrators/phase.ts`:
- Around line 32-34: The branch uses a case-insensitive regex test on ref but
then calls case-sensitive ref.includes("tunnel") / ref.includes("cloudflared"),
causing mixed-case refs to misclassify; update the branching in the same block
(the if handling /provider|inference|chat-completion|cloudflared|tunnel/i) to
compare in a case-insensitive way—e.g., normalize ref to lowerCase once and use
that for the subsequent includes checks (or use case-insensitive regex matches)
so tunnel/cloudflared variants are correctly detected and return
"external-tunnel".

In `@test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh`:
- Line 62: The current substring check [[ "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" ==
*"${E2E_SANDBOX_NAME}"* ]] can produce false positives (e.g., sb1 matching
sb10); change it to perform an exact token/line match of the sandbox name
instead: test SANDBOX_LIFECYCLE_LAST_OUTPUT for the exact E2E_SANDBOX_NAME token
(for example by piping SANDBOX_LIFECYCLE_LAST_OUTPUT to grep -w/-x or using a
word-boundary regex with [[ ... =~ ... ]]) and fail if not found, so the check
around SANDBOX_LIFECYCLE_LAST_OUTPUT and E2E_SANDBOX_NAME only succeeds for
exact sandbox-name matches.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 401b8a23-1471-429d-9905-b413e6bae24c

📥 Commits

Reviewing files that changed from the base of the PR and between 1daf081 and b7acfb7.

📒 Files selected for processing (44)

.github/workflows/e2e-scenarios.yaml
test/e2e-scenario/docs/README.md
test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
test/e2e-scenario/runtime/lib/env.sh
test/e2e-scenario/runtime/run-scenario.sh
test/e2e-scenario/runtime/run-suites.sh
test/e2e-scenario/scenarios/orchestrators/phase.ts
test/e2e-scenario/scenarios/run.ts
test/e2e-scenario/scenarios/types.ts
test/e2e-scenario/validation_suites/assert/gateway-alive.sh
test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
test/e2e-scenario/validation_suites/lib/inference_routing.sh
test/e2e-scenario/validation_suites/lib/messaging_providers.sh
test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
test/e2e-scenario/validation_suites/sandbox-exec.sh
test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
tools/e2e-scenarios/workflow-boundary.mts

💤 Files with no reviewable changes (30)

test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
test/e2e-scenario/validation_suites/lib/messaging_providers.sh
test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
test/e2e-scenario/scenarios/types.ts
test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
test/e2e-scenario/validation_suites/lib/inference_routing.sh
test/e2e-scenario/validation_suites/assert/gateway-alive.sh
test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
test/e2e-scenario/validation_suites/sandbox-exec.sh
test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
test/e2e-scenario/runtime/lib/env.sh
test/e2e-scenario/framework-tests/e2e-context-helper.test.ts

github-actions · 2026-05-28T01:13:18Z

E2E Advisor Recommendation

Required E2E: ubuntu-repo-cloud-openclaw, ubuntu-no-docker-preflight-negative, ubuntu-rebuild-openclaw, ubuntu-repo-cloud-openclaw-custom-policies
Optional E2E: wsl-repo-cloud-openclaw, macos-repo-cloud-openclaw, gpu-repo-local-ollama-openclaw, ubuntu-repo-cloud-hermes

Dispatch hint: ubuntu-repo-cloud-openclaw,ubuntu-no-docker-preflight-negative,ubuntu-rebuild-openclaw,ubuntu-repo-cloud-openclaw-custom-policies

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

ubuntu-repo-cloud-openclaw (high): Primary live typed scenario covering repo install, cloud OpenClaw onboarding, gateway/sandbox state validation, smoke, inference, and credentials after the workflow changed from dry-run to live execution.
ubuntu-no-docker-preflight-negative (medium): Validates the rewritten negative failure contract, preflight/onboarding failure handling, absent gateway/sandbox probes, and short-circuit behavior without creating forbidden side effects.
ubuntu-rebuild-openclaw (high): Newly routed/fanned-out scenario and touched manifest/lifecycle/rebuild-upgrade helpers must be exercised live to validate sandbox rebuild, workspace preservation, policy preservation, and post-rebuild inference.
ubuntu-repo-cloud-openclaw-custom-policies (high): Covers custom network policy, credentials/security assertions, gateway/sandbox probes, and baseline onboarding paths touched by the scenario framework and validation-suite changes.

Optional E2E

wsl-repo-cloud-openclaw (high): The WSL branch of the scenario workflow also changed from dry-run to live execution and copies artifacts back from WSL; useful platform confidence but not required if Ubuntu live coverage passes.
macos-repo-cloud-openclaw (medium): Platform macOS smoke suite was touched; useful to verify optional Docker/platform behavior after runner/orchestrator changes.
gpu-repo-local-ollama-openclaw (very-high): Ollama GPU and auth-proxy validation scripts changed; run when GPU capacity is available to confirm local inference routing remains intact.
ubuntu-repo-cloud-hermes (high): Hermes health/history suites were touched; optional adjacent coverage for non-OpenClaw assistant runtime behavior.

New E2E recommendations

artifact-redaction (high): The workflow now excludes hidden files and narrows uploaded .e2e paths to avoid leaking context.env; add a live or post-run assertion that downloaded artifacts never contain raw provider/API key values or context.env.
- Suggested test: Add a scenario artifact-safety E2E check that runs a minimal secret-gated scenario and verifies the uploaded artifact contains only redacted logs and no raw .e2e/context.env.
workflow-input-validation (medium): resolve-runner no longer invokes typed plan validation before route lookup, so malformed-but-routed scenario IDs or registry drift could reach the live runner later than intended.
- Suggested test: Add a workflow-boundary E2E/CI check that dispatches e2e-scenarios.yaml with valid, unknown, and mixed-runner scenario inputs and asserts early, clear failure messages.

Dispatch hint

Workflow: .github/workflows/e2e-scenarios.yaml
jobs input: ubuntu-repo-cloud-openclaw,ubuntu-no-docker-preflight-negative,ubuntu-rebuild-openclaw,ubuntu-repo-cloud-openclaw-custom-policies

github-actions · 2026-05-28T01:13:19Z

E2E Scenario Advisor Recommendation

Required scenario E2E: e2e-scenarios-all
Optional scenario E2E: None

Dispatch required scenario E2E:

gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

e2e-scenarios-all: Scenario workflow files, scenario runtime/runner code, expected-state metadata, typed scenario orchestration, manifests, probes, and validation suite scripts changed. Per scenario E2E policy, changes to scenario workflows, runtime/runner code, catalog/expected-state metadata, or broad suite surfaces require the all-scenarios fan-out.
- Dispatch: gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>

Optional scenario E2E

None.

Relevant changed files

.github/workflows/e2e-scenarios-all.yaml
.github/workflows/e2e-scenarios.yaml
test/e2e-scenario/docs/README.md
test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
test/e2e-scenario/framework-tests/e2e-coverage-report.test.ts
test/e2e-scenario/framework-tests/e2e-expected-failure.test.ts
test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
test/e2e-scenario/framework-tests/e2e-expected-state.test.ts
test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
test/e2e-scenario/framework-tests/e2e-metadata-final-hygiene.test.ts
test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts
test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
test/e2e-scenario/framework-tests/e2e-probes.test.ts
test/e2e-scenario/framework-tests/e2e-redaction-parity.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-schema.test.ts
test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts
test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
test/e2e-scenario/manifests/openclaw-nvidia-rebuild.yaml
test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
test/e2e-scenario/nemoclaw_scenarios/expected-states.yaml
test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh
test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh
test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh
test/e2e-scenario/nemoclaw_scenarios/probes/dispatch.sh
test/e2e-scenario/nemoclaw_scenarios/probes/gateway-absent.sh
test/e2e-scenario/nemoclaw_scenarios/probes/gateway-healthy.sh
test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh
test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh
test/e2e-scenario/runtime/coverage-report.sh
test/e2e-scenario/runtime/lib/env.sh
test/e2e-scenario/runtime/resolver/coverage.ts
test/e2e-scenario/runtime/resolver/expected-failure.ts
test/e2e-scenario/runtime/resolver/index.ts
test/e2e-scenario/runtime/resolver/js-yaml.d.ts
test/e2e-scenario/runtime/resolver/load.ts
test/e2e-scenario/runtime/resolver/plan.ts
test/e2e-scenario/runtime/resolver/schema.ts
test/e2e-scenario/runtime/resolver/validator.ts
test/e2e-scenario/runtime/run-scenario.sh
test/e2e-scenario/runtime/run-suites.sh
test/e2e-scenario/scenarios/assertions/environment.ts
test/e2e-scenario/scenarios/assertions/onboarding.ts
test/e2e-scenario/scenarios/assertions/registry.ts
test/e2e-scenario/scenarios/assertions/runtime.ts
test/e2e-scenario/scenarios/builder.ts
test/e2e-scenario/scenarios/compiler.ts
test/e2e-scenario/scenarios/expected-states.ts
test/e2e-scenario/scenarios/matrix.ts
test/e2e-scenario/scenarios/orchestrators/context.ts
test/e2e-scenario/scenarios/orchestrators/lifecycle.ts
test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts
test/e2e-scenario/scenarios/orchestrators/phase.ts
test/e2e-scenario/scenarios/orchestrators/redaction.ts
test/e2e-scenario/scenarios/orchestrators/runner.ts
test/e2e-scenario/scenarios/orchestrators/state-validation.ts
test/e2e-scenario/scenarios/probes/builtin.ts
test/e2e-scenario/scenarios/probes/diagnostics.ts
test/e2e-scenario/scenarios/probes/docs-validation.ts
test/e2e-scenario/scenarios/probes/injection-blocked.ts
test/e2e-scenario/scenarios/probes/network-policy.ts
test/e2e-scenario/scenarios/probes/registry.ts
test/e2e-scenario/scenarios/probes/shields-config.ts
test/e2e-scenario/scenarios/probes/types.ts
test/e2e-scenario/scenarios/probes/util.ts
test/e2e-scenario/scenarios/run.ts
test/e2e-scenario/scenarios/scenarios/baseline.ts
test/e2e-scenario/scenarios/types.ts
test/e2e-scenario/validation_suites/assert/gateway-alive.sh
test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
test/e2e-scenario/validation_suites/hermes/01-history-writable.sh
test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
test/e2e-scenario/validation_suites/lib/inference_routing.sh
test/e2e-scenario/validation_suites/lib/messaging_providers.sh
test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
test/e2e-scenario/validation_suites/sandbox-exec.sh
test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
tools/e2e-scenarios/workflow-boundary.mts

github-actions · 2026-05-28T01:17:13Z

PR Review Advisor

Findings: 2 needs attention, 10 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 9 still apply, 0 new items found

Review findings

🛠️ Needs attention

Live fan-out includes scenarios without onboarding and secret plumbing (test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh:32): The all-scenarios workflow now live-runs every typed registry scenario, but several generated scenarios declare onboarding profiles and required secrets that the workflow, compiler secret allowlist, and shell dispatcher do not expose. Those jobs will either fail at onboarding dispatch or run without the credential boundary declared by the registry.
- Recommendation: Before broad live fan-out, either implement dispatcher cases plus workflow_call/all-scenarios secret declarations and compiler secretEnv mappings for every generated profile, or filter the generated matrix to scenarios that are fully supported. Add a static contract test covering onboarding dispatch and requiredSecrets end to end.
- Evidence: baseline.ts declares scenarios requiring OPENAI_COMPATIBLE_API_KEY, BRAVE_API_KEY, TELEGRAM_BOT_TOKEN, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, and SLACK_APP_TOKEN with onboarding ids such as openai-compatible-openclaw and cloud-nvidia-openclaw-telegram. e2e-scenarios.yaml declares only NVIDIA_API_KEY, e2e-scenarios-all.yaml passes only NVIDIA_API_KEY, compiler.ts ONBOARD_PROFILE_SECRET_ENV only covers NVIDIA/local profiles, and onboard/dispatch.sh only routes cloud-openclaw/cloud-hermes/local variants.
No-docker negative onboarding truncates the framework-seeded context (test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh:35): The TS runner seeds context.env before phase actions, and dispatch-action.sh explicitly documents that workers must not call e2e_context_init because it truncates the seeded file. The no-docker negative worker still calls e2e_context_init, which can wipe E2E_SCENARIO, E2E_SANDBOX_NAME, and E2E_GATEWAY_URL before state-validation probes and negative side-effect checks need them.
- Recommendation: Remove e2e_context_init from the worker and rely on the framework-owned context initialization, using e2e_context_set only for additional keys. Add a test that pre-seeds context.env, runs the no-docker worker against a fake failing nemoclaw command, and verifies the seeded keys survive for gateway-absent and sandbox-absent probes.
- Evidence: dispatch-action.sh says 'do NOT call e2e_context_init here' because it opens with ': > ctx'. cloud-openclaw-no-docker.sh calls e2e_context_init at the start of e2e_onboard_cloud_openclaw_no_docker.

🔎 Worth checking

Source-of-truth review needed: Generated all-scenarios workflow versus typed registry: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: e2e-scenarios-all.yaml says no workflow edits are required when adding baseline.ts scenarios, but also declares a ubuntu-rebuild-openclaw job.
Source-of-truth review needed: Workflow secret and onboarding execution contract: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: baseline.ts declares non-NVIDIA secrets and onboarding ids not present in e2e-scenarios.yaml secrets, e2e-scenarios-all.yaml secret passing, compiler ONBOARD_PROFILE_SECRET_ENV, or onboard/dispatch.sh.
Source-of-truth review needed: Probe subprocess utilities: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: probes/util.ts uses process.env in spawnBash/runSandboxCmd/runHostCmd and writeProbeEvidence serializes raw payloads while phase.ts uses buildChildEnv and pipeRedacted.
Source-of-truth review needed: Expected-state registry documentation after YAML resolver deletion: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: README.md lists runtime/resolver and coverage-report.sh; expected-states.ts and types.ts still say expected-states.yaml stays in place and is mirrored.
Probe subprocesses bypass the framework env and redaction boundary (test/e2e-scenario/scenarios/probes/util.ts:133): Phase actions and shell steps use buildChildEnv and pipeRedacted, but the new probe utilities spawn host and sandbox commands with process.env and write stdout/stderr tails directly into evidence JSON. In live secret-bearing runs this can pass workflow secrets to probe helper processes and persist secret-shaped output in artifacts or failure messages.
- Recommendation: Route runSandboxCmd, runHostCmd, diagnosticsProbe, and docsValidationProbe through the same child-env allowlist and redactString handling as PhaseOrchestrator, or implement an equivalent probe-owned boundary. Add a leak test where a fake command prints NVIDIA_API_KEY and token-shaped output, asserting the child does not receive undeclared secrets and evidence/messages are redacted.
- Evidence: runSandboxCmd calls spawnBash with env: { ...process.env, E2E_CONTEXT_DIR: ctx.contextDir }; runHostCmd defaults to env: opts.env ?? process.env; diagnostics.ts spawns nemoclaw with env: process.env; docs-validation.ts spawns with { ...process.env, CHECK_DOC_LINKS_REMOTE: '0' }; writeProbeEvidence serializes payloads directly.
ubuntu-rebuild-openclaw is hard-coded in addition to the generated matrix (.github/workflows/e2e-scenarios-all.yaml:80): The all-scenarios workflow states the typed registry is the fan-out source of truth and that adding a scenario requires no workflow edits, but it adds a separate reusable-workflow job for ubuntu-rebuild-openclaw. That scenario is already in the typed registry, so this duplicates a secret-bearing live run and reintroduces per-scenario workflow drift.
- Recommendation: Remove the hard-coded job and rely on the generated matrix, or explicitly exclude the scenario from the matrix if it truly needs special handling. Add a workflow-boundary test that rejects per-scenario reusable-workflow jobs outside the generated matrix fan-out.
- Evidence: e2e-scenarios-all.yaml generates matrix entries from run.ts --emit-matrix and then separately defines ubuntu-rebuild-openclaw. baseline.ts contains ubuntu-rebuild-openclaw and run.ts buildScenarioMatrix uses listScenarios().
Docs and source comments still advertise deleted dry-run and resolver paths (test/e2e-scenario/docs/README.md:127): The PR retires dry-run/validate-only and deletes the YAML resolver, but active documentation and comments still list run-scenario.sh --dry-run, --validate-only, run.ts --dry-run, runtime/coverage-report.sh, runtime/resolver, workflow dry-runs, and expected-states.yaml mirror/parity language. This contradicts the single live runner and typed expected-state source-of-truth claims.
- Recommendation: Update README.md, scenarios/expected-states.ts, and scenarios/types.ts to describe the live TS runner and typed expected-state registry as the active source of truth. Add a static docs/source hygiene test that rejects active references to --dry-run, --validate-only, runtime/resolver, runtime/coverage-report.sh, and deleted expected-states.yaml outside explicitly historical prose.
- Evidence: README.md 'How to run' still includes run-scenario.sh --dry-run, --validate-only, runtime/coverage-report.sh, runtime/resolver, and run.ts --dry-run. expected-states.ts says expected-states.yaml stays in place for the legacy bash resolver, and types.ts says expected-states.yaml stays alongside with a mirror contract, while those files were deleted.
Runner override accepts arbitrary runs-on labels (test/e2e-scenario/scenarios/runner-routing.ts:39): resolveRunnerForScenario treats any non-empty runnerRequirements entry beginning with runs-on: as authoritative runner metadata. The current reusable workflow still ignores matrix.runner for actual execution, but the PR exposes it as workflow-facing routing data; a later consumer could route secret-bearing jobs to unexpected self-hosted labels.
- Recommendation: Validate runs-on overrides against a narrow allowlist of approved GitHub-hosted and NVIDIA self-hosted labels, or remove arbitrary overrides from workflow-facing matrix output until the trusted routing contract is finalized. Replace any custom-self-hosted acceptance test with a negative test for rejected labels.
- Evidence: runner-routing.ts returns explicit.slice('runs-on:'.length).trim() after only checking non-empty. The linked feat(e2e): generate scenario fan-out matrix from typed registry #4359 description and tests document arbitrary runs-on:<label> override behavior.
Live WSL path installs Node through unpinned curl-to-bash (.github/workflows/e2e-scenarios.yaml:239): The workflow now executes live scenarios and later passes NVIDIA_API_KEY into the WSL run. Before that secret-bearing run, the WSL setup pipes the remote NodeSource installer directly into bash without a pinned digest or verified artifact.
- Recommendation: Use a pinned or verified Node installation path for WSL, or constrain and document this installer trust boundary. Keep unrelated secrets out of WSL setup and add workflow validation covering the installer boundary.
- Evidence: The WSL Node step runs curl -fsSL https://deb.nodesource.com/setup\_22.x | bash - followed by apt-get install -y nodejs; the WSL scenario step later passes NVIDIA_API_KEY.
Workflow contract tests cover routes but not secrets or onboarding dispatch (test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts:31): The workflow test now checks typed scenario IDs and runner labels against the reusable workflow route map, but it does not verify that every generated live scenario can be dispatched or receives its declared required secrets. That leaves the current live fan-out regressions undetected.
- Recommendation: Extend static workflow contract coverage so every generated scenario has an onboarding action arg accepted by onboard/dispatch.sh or is explicitly filtered, and every requiredSecrets entry is declared in e2e-scenarios.yaml workflow_call secrets, passed by e2e-scenarios-all.yaml, and included in compiler action secretEnv where needed.
- Evidence: e2e-scenarios-workflow.test.ts parses the ROUTES map and compares IDs/runner labels. No inspected test checks workflow_call secret declarations, all-scenarios secret passing, compiler ONBOARD_PROFILE_SECRET_ENV, or onboard/dispatch.sh case labels.

🌱 Nice ideas

None.

Consider writing more tests for

**Runtime validation** — Every generated matrix scenario's onboarding profile is accepted by nemoclaw_scenarios/onboard/dispatch.sh or the scenario is explicitly filtered from all-scenarios.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
**Runtime validation** — Every ScenarioDefinition.requiredSecrets value is declared in .github/workflows/e2e-scenarios.yaml, passed by e2e-scenarios-all.yaml, and allowlisted in compiler secretEnv for the phase action that consumes it.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
**Runtime validation** — cloud-openclaw-no-docker preserves preseeded context.env keys such as E2E_SCENARIO, E2E_SANDBOX_NAME, and E2E_GATEWAY_URL while capturing a fake preflight failure.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
**Runtime validation** — A probe subprocess fake command that prints NVIDIA_API_KEY and token-shaped output does not receive undeclared secrets and stores only redacted stdout/stderr/message/evidence.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
**Runtime validation** — Workflow boundary rejects per-scenario reusable-workflow jobs outside the generated matrix fan-out.. This PR turns the scenario workflow into live secret-bearing execution and changes workflow fan-out, spawn/redaction boundaries, sandbox transport, lifecycle actions, and probe utilities. Unit tests cover many local behaviors, but static contract and runtime-path validation are still needed for the live workflow surfaces.
**Workflow contract tests cover routes but not secrets or onboarding dispatch** — Extend static workflow contract coverage so every generated scenario has an onboarding action arg accepted by onboard/dispatch.sh or is explicitly filtered, and every requiredSecrets entry is declared in e2e-scenarios.yaml workflow_call secrets, passed by e2e-scenarios-all.yaml, and included in compiler action secretEnv where needed.
**Acceptance clause:** Removing `--dry-run` and plan-mode execution from the E2E runner: they let a coding agent ship work behind false-green test runs. The TS runner now has **one execution mode: live.** No flag, env var, helper, or branch in any production path bypasses real assertion execution. — add test evidence or identify existing coverage. The workflows now invoke run.ts without --dry-run and workflow-boundary rejects dry-run/validate-only/plan-only in CI. However README.md still documents active dry-run/validate-only paths, and run.ts still supports --plan-only for local debug.
**Acceptance clause:** Deleted `nemoclaw_scenarios/expected-states.yaml` and the entire `runtime/resolver/` tree (`load.ts`, `plan.ts`, `schema.ts`, `coverage.ts`, `index.ts`, `validator.ts`, `expected-failure.ts`, `js-yaml.d.ts`) plus `runtime/coverage-report.sh`. — add test evidence or identify existing coverage. The files are deleted in the diff, but README.md, scenarios/expected-states.ts, and scenarios/types.ts still refer to expected-states.yaml, runtime/resolver, and coverage-report.sh as active or transitional surfaces.

Since last review details

Current findings:

Source-of-truth review needed: Generated all-scenarios workflow versus typed registry: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: e2e-scenarios-all.yaml says no workflow edits are required when adding baseline.ts scenarios, but also declares a ubuntu-rebuild-openclaw job.
Source-of-truth review needed: Workflow secret and onboarding execution contract: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: baseline.ts declares non-NVIDIA secrets and onboarding ids not present in e2e-scenarios.yaml secrets, e2e-scenarios-all.yaml secret passing, compiler ONBOARD_PROFILE_SECRET_ENV, or onboard/dispatch.sh.
Source-of-truth review needed: Probe subprocess utilities: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: probes/util.ts uses process.env in spawnBash/runSandboxCmd/runHostCmd and writeProbeEvidence serializes raw payloads while phase.ts uses buildChildEnv and pipeRedacted.
Source-of-truth review needed: Expected-state registry documentation after YAML resolver deletion: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: README.md lists runtime/resolver and coverage-report.sh; expected-states.ts and types.ts still say expected-states.yaml stays in place and is mirrored.
Live fan-out includes scenarios without onboarding and secret plumbing (test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh:32): The all-scenarios workflow now live-runs every typed registry scenario, but several generated scenarios declare onboarding profiles and required secrets that the workflow, compiler secret allowlist, and shell dispatcher do not expose. Those jobs will either fail at onboarding dispatch or run without the credential boundary declared by the registry.
- Recommendation: Before broad live fan-out, either implement dispatcher cases plus workflow_call/all-scenarios secret declarations and compiler secretEnv mappings for every generated profile, or filter the generated matrix to scenarios that are fully supported. Add a static contract test covering onboarding dispatch and requiredSecrets end to end.
- Evidence: baseline.ts declares scenarios requiring OPENAI_COMPATIBLE_API_KEY, BRAVE_API_KEY, TELEGRAM_BOT_TOKEN, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, and SLACK_APP_TOKEN with onboarding ids such as openai-compatible-openclaw and cloud-nvidia-openclaw-telegram. e2e-scenarios.yaml declares only NVIDIA_API_KEY, e2e-scenarios-all.yaml passes only NVIDIA_API_KEY, compiler.ts ONBOARD_PROFILE_SECRET_ENV only covers NVIDIA/local profiles, and onboard/dispatch.sh only routes cloud-openclaw/cloud-hermes/local variants.
No-docker negative onboarding truncates the framework-seeded context (test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh:35): The TS runner seeds context.env before phase actions, and dispatch-action.sh explicitly documents that workers must not call e2e_context_init because it truncates the seeded file. The no-docker negative worker still calls e2e_context_init, which can wipe E2E_SCENARIO, E2E_SANDBOX_NAME, and E2E_GATEWAY_URL before state-validation probes and negative side-effect checks need them.
- Recommendation: Remove e2e_context_init from the worker and rely on the framework-owned context initialization, using e2e_context_set only for additional keys. Add a test that pre-seeds context.env, runs the no-docker worker against a fake failing nemoclaw command, and verifies the seeded keys survive for gateway-absent and sandbox-absent probes.
- Evidence: dispatch-action.sh says 'do NOT call e2e_context_init here' because it opens with ': > ctx'. cloud-openclaw-no-docker.sh calls e2e_context_init at the start of e2e_onboard_cloud_openclaw_no_docker.
Probe subprocesses bypass the framework env and redaction boundary (test/e2e-scenario/scenarios/probes/util.ts:133): Phase actions and shell steps use buildChildEnv and pipeRedacted, but the new probe utilities spawn host and sandbox commands with process.env and write stdout/stderr tails directly into evidence JSON. In live secret-bearing runs this can pass workflow secrets to probe helper processes and persist secret-shaped output in artifacts or failure messages.
- Recommendation: Route runSandboxCmd, runHostCmd, diagnosticsProbe, and docsValidationProbe through the same child-env allowlist and redactString handling as PhaseOrchestrator, or implement an equivalent probe-owned boundary. Add a leak test where a fake command prints NVIDIA_API_KEY and token-shaped output, asserting the child does not receive undeclared secrets and evidence/messages are redacted.
- Evidence: runSandboxCmd calls spawnBash with env: { ...process.env, E2E_CONTEXT_DIR: ctx.contextDir }; runHostCmd defaults to env: opts.env ?? process.env; diagnostics.ts spawns nemoclaw with env: process.env; docs-validation.ts spawns with { ...process.env, CHECK_DOC_LINKS_REMOTE: '0' }; writeProbeEvidence serializes payloads directly.
ubuntu-rebuild-openclaw is hard-coded in addition to the generated matrix (.github/workflows/e2e-scenarios-all.yaml:80): The all-scenarios workflow states the typed registry is the fan-out source of truth and that adding a scenario requires no workflow edits, but it adds a separate reusable-workflow job for ubuntu-rebuild-openclaw. That scenario is already in the typed registry, so this duplicates a secret-bearing live run and reintroduces per-scenario workflow drift.
- Recommendation: Remove the hard-coded job and rely on the generated matrix, or explicitly exclude the scenario from the matrix if it truly needs special handling. Add a workflow-boundary test that rejects per-scenario reusable-workflow jobs outside the generated matrix fan-out.
- Evidence: e2e-scenarios-all.yaml generates matrix entries from run.ts --emit-matrix and then separately defines ubuntu-rebuild-openclaw. baseline.ts contains ubuntu-rebuild-openclaw and run.ts buildScenarioMatrix uses listScenarios().
Docs and source comments still advertise deleted dry-run and resolver paths (test/e2e-scenario/docs/README.md:127): The PR retires dry-run/validate-only and deletes the YAML resolver, but active documentation and comments still list run-scenario.sh --dry-run, --validate-only, run.ts --dry-run, runtime/coverage-report.sh, runtime/resolver, workflow dry-runs, and expected-states.yaml mirror/parity language. This contradicts the single live runner and typed expected-state source-of-truth claims.
- Recommendation: Update README.md, scenarios/expected-states.ts, and scenarios/types.ts to describe the live TS runner and typed expected-state registry as the active source of truth. Add a static docs/source hygiene test that rejects active references to --dry-run, --validate-only, runtime/resolver, runtime/coverage-report.sh, and deleted expected-states.yaml outside explicitly historical prose.
- Evidence: README.md 'How to run' still includes run-scenario.sh --dry-run, --validate-only, runtime/coverage-report.sh, runtime/resolver, and run.ts --dry-run. expected-states.ts says expected-states.yaml stays in place for the legacy bash resolver, and types.ts says expected-states.yaml stays alongside with a mirror contract, while those files were deleted.
Runner override accepts arbitrary runs-on labels (test/e2e-scenario/scenarios/runner-routing.ts:39): resolveRunnerForScenario treats any non-empty runnerRequirements entry beginning with runs-on: as authoritative runner metadata. The current reusable workflow still ignores matrix.runner for actual execution, but the PR exposes it as workflow-facing routing data; a later consumer could route secret-bearing jobs to unexpected self-hosted labels.
- Recommendation: Validate runs-on overrides against a narrow allowlist of approved GitHub-hosted and NVIDIA self-hosted labels, or remove arbitrary overrides from workflow-facing matrix output until the trusted routing contract is finalized. Replace any custom-self-hosted acceptance test with a negative test for rejected labels.
- Evidence: runner-routing.ts returns explicit.slice('runs-on:'.length).trim() after only checking non-empty. The linked feat(e2e): generate scenario fan-out matrix from typed registry #4359 description and tests document arbitrary runs-on:<label> override behavior.
Live WSL path installs Node through unpinned curl-to-bash (.github/workflows/e2e-scenarios.yaml:239): The workflow now executes live scenarios and later passes NVIDIA_API_KEY into the WSL run. Before that secret-bearing run, the WSL setup pipes the remote NodeSource installer directly into bash without a pinned digest or verified artifact.
- Recommendation: Use a pinned or verified Node installation path for WSL, or constrain and document this installer trust boundary. Keep unrelated secrets out of WSL setup and add workflow validation covering the installer boundary.
- Evidence: The WSL Node step runs curl -fsSL https://deb.nodesource.com/setup\_22.x | bash - followed by apt-get install -y nodejs; the WSL scenario step later passes NVIDIA_API_KEY.
Workflow contract tests cover routes but not secrets or onboarding dispatch (test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts:31): The workflow test now checks typed scenario IDs and runner labels against the reusable workflow route map, but it does not verify that every generated live scenario can be dispatched or receives its declared required secrets. That leaves the current live fan-out regressions undetected.
- Recommendation: Extend static workflow contract coverage so every generated scenario has an onboarding action arg accepted by onboard/dispatch.sh or is explicitly filtered, and every requiredSecrets entry is declared in e2e-scenarios.yaml workflow_call secrets, passed by e2e-scenarios-all.yaml, and included in compiler action secretEnv where needed.
- Evidence: e2e-scenarios-workflow.test.ts parses the ROUTES map and compares IDs/runner labels. No inspected test checks workflow_call secret declarations, all-scenarios secret passing, compiler ONBOARD_PROFILE_SECRET_ENV, or onboard/dispatch.sh case labels.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

…shape client check - PhaseOrchestrator.runShellStep: wait for the log WriteStream to finish before resolving so callers (and tests) reading evidence synchronously see the actual stdout/stderr instead of an empty file. Race exposed by e2e-phase-orchestrators 'shell_step_passes_when_script_exits_zero'. - e2e-phase-orchestrators: replace client-source toMatch regex (1 source-shape test, budget=0) with a runtime-shape behavior assertion on the HostCliClient observation. Still enforces 'clients do not encode pass/fail or retry/timeout semantics' per hybrid-scenario E2E architecture spec, without violating source-shape budget. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Closes the spec's reopened Phase 6 gap. The new typed runner now executes the install and onboarding work that the deleted bash runner used to perform, but inside EnvironmentOrchestrator and OnboardingOrchestrator instead of in workflow YAML or a resurrected bash runner. All canonical scenarios now reach a real, live SUT before their assertions run. Architecture (per hybrid-scenario-e2e-architecture spec): * types.ts: introduce typed PhaseAction (kind=shell-fn|shell, scriptRef, fn, arg, timeoutSeconds, evidencePath) and PhaseActionResult. Replace the prior actions: string[] free-form labels with PhaseAction[]. Add actions[] to PhaseResult so failure-layer attribution stays clear: setup failure is recorded distinctly from assertion failure. * compiler.ts: phaseActions() now emits typed actions for environment (context.emit + install.<id>) and onboarding (profile.<id>). Stable action ids: environment.context.emit, environment.install.<install-id>, onboarding.profile.<profile-id>. All install/onboard actions point at the existing dispatcher scripts (install/dispatch.sh, onboard/dispatch.sh) - shell remains the implementation per spec, invocation is centralized. * orchestrators/phase.ts: PhaseOrchestrator.run() executes actions before assertions. Action failure short-circuits the phase so assertions never run against an environment that was never set up. Action runner reuses the same spawn/timeout/process-group/log-flush machinery as runShellStep. Per-action timeout, no retry (install and onboarding must fail loudly). * nemoclaw_scenarios/dispatch-action.sh: new bash launcher (the only new shell file). The install/onboard dispatchers are intentionally library-style (function definitions only); this launcher gives them a deterministic executable entrypoint that sources runtime/lib/env.sh + runtime/lib/context.sh, applies non-interactive env, sources the requested dispatcher, and invokes the named function with one arg. Replaces the orchestration that the deleted run-scenario.sh used to do, but called from the typed orchestrator instead. * plan-only output: now shows 'Action: <id> (timeout=...) -> <fn> <arg>' per phase, before assertion groups. Maintainers can preview the full setup+onboard+assert sequence before dispatch. * framework-tests/e2e-phase-orchestrators.test.ts: add five behavior tests covering action-runs-before-assertions, action-failure short- circuits-assertions, action timeout via orchestrator policy, evidence-log flushed-before-resolve, and compiler emits typed install/onboard actions for all 7 canonical scenarios. What stays out: * No workflow YAML edits. .github/workflows/e2e-scenarios.yaml still invokes only 'npx tsx test/e2e-scenario/scenarios/run.ts --scenarios ...'. Workflow YAML stays innocent of install/onboard plumbing. * No client edits. HostCliClient et al. remain pass/fail/policy free. * No resolver/YAML-first revival. setup_scenarios/test_plans/suite_filter remain unsupported. Validation gate (Phase 6 reopen note) is the next step: after this push goes green on PR CI, dispatch e2e-scenarios-all.yaml against feat/e2e-real-execution and confirm canonical scenarios produce real phase results with action evidence under .e2e/actions/<id>.log, instead of <1s 'failed=34 skipped=5'. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

test/e2e-scenario/scenarios/types.ts (1)

110-131: ⚡ Quick win

Strengthen PhaseAction typing (discriminated union) to enforce per-kind required fields.

Current runtime already fails loudly if a "shell-fn" action is missing fn (the runner passes action.fn ?? "", and dispatch-action.sh errors when the function name isn’t found). Also, current PhaseAction objects are produced in test/e2e-scenario/scenarios/compiler.ts with fn populated for "shell-fn". Still, the type allows invalid combinations to compile, so a discriminated union would prevent that drift for any future producers.

Suggested refactor

-export interface PhaseAction {
-  id: string;
-  phase: PhaseName;
-  description?: string;
-  // "shell-fn" sources the bash dispatcher and invokes the named function.
-  // "shell"    runs an executable script (used for context-emit helper).
-  kind: "shell-fn" | "shell";
-  // Repo-relative path to the script.
-  scriptRef: string;
-  // For "shell-fn": the bash function to invoke after sourcing scriptRef.
-  fn?: string;
-  // Single positional arg passed to the function/script (install method or
-  // onboarding profile id today). Kept as a single string to keep stable
-  // ids predictable; multi-arg variants can extend this later.
-  arg?: string;
-  // Per-action timeout. No retry by default - install/onboard must fail
-  // loudly so the regression is visible. Retry stays a property of
-  // assertion steps, not actions.
-  timeoutSeconds?: number;
-  // Repo-relative evidence log path.
-  evidencePath?: string;
-}
+interface PhaseActionBase {
+  id: string;
+  phase: PhaseName;
+  description?: string;
+  scriptRef: string;
+  timeoutSeconds?: number;
+  evidencePath?: string;
+}
+
+export type PhaseAction =
+  | (PhaseActionBase & {
+      kind: "shell-fn";
+      fn: string;
+      arg?: string;
+    })
+  | (PhaseActionBase & {
+      kind: "shell";
+      arg?: string;
+    });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/scenarios/types.ts` around lines 110 - 131, Change
PhaseAction from a single broad interface into a discriminated union so
TypeScript enforces per-kind fields: define one variant for kind: "shell-fn"
that requires fn: string (plus shared fields like id, phase, scriptRef, arg?,
timeoutSeconds?, evidencePath?, description?) and another variant for kind:
"shell" that omits/marks fn as disallowed/undefined; update any usages (e.g.,
the action objects created in test/e2e-scenario/scenarios/compiler.ts and the
runner that reads action.fn) to satisfy the new types so compilation ensures
"shell-fn" always has fn and "shell" never does.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh`:
- Line 60: Remove the "|| true" suppression so failures in e2e_context_init
surface during test runs: in dispatch-action.sh replace the line that calls the
initializer (the call to e2e_context_init) by invoking e2e_context_init without
"|| true" so that any mkdir or file-write errors creating
${E2E_CONTEXT_DIR}/context.env propagate (this ensures e2e_context_require will
fail immediately instead of masking the error and misattributing missing keys).

In `@test/e2e-scenario/scenarios/compiler.ts`:
- Around line 94-99: The code currently silently returns [] when required phase
action dimensions like installId (from scenario.environment?.install) are
missing; instead throw a hard error to fail-fast: replace the early returns that
yield [] with throwing a descriptive Error (include context such as scenario.id
or scenario.name and which dimension is missing) in the function that generates
phase actions (the branch checking installId / scenario.environment?.install),
and make the identical change in the other similar branch (the second check at
the same pattern) so malformed scenarios surface as hard failures rather than
emitting empty action lists.

---

Nitpick comments:
In `@test/e2e-scenario/scenarios/types.ts`:
- Around line 110-131: Change PhaseAction from a single broad interface into a
discriminated union so TypeScript enforces per-kind fields: define one variant
for kind: "shell-fn" that requires fn: string (plus shared fields like id,
phase, scriptRef, arg?, timeoutSeconds?, evidencePath?, description?) and
another variant for kind: "shell" that omits/marks fn as disallowed/undefined;
update any usages (e.g., the action objects created in
test/e2e-scenario/scenarios/compiler.ts and the runner that reads action.fn) to
satisfy the new types so compilation ensures "shell-fn" always has fn and
"shell" never does.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a237354-36d1-4e41-8990-3b50d8f973f0

📥 Commits

Reviewing files that changed from the base of the PR and between 903f90b and 628870c.

📒 Files selected for processing (5)

test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
test/e2e-scenario/scenarios/compiler.ts
test/e2e-scenario/scenarios/orchestrators/phase.ts
test/e2e-scenario/scenarios/types.ts

🚧 Files skipped from review as they are similar to previous changes (2)

test/e2e-scenario/scenarios/orchestrators/phase.ts
test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts

The first live dispatch of the Phase 6 wiring (run 26550310438) gave us real action evidence and surfaced three real bugs. All three are fixed inside the spec's prescribed layers - no workflow YAML, no client, no old-resolver path. 1. environment.context.emit was a shell action that called the legacy emit-context-from-plan.sh helper. That helper expects the OLD YAML-resolver plan.json shape (dimensions.platform.profile.os...), which the typed compiler does not produce. Drop the shell action; add scenarios/orchestrators/context.ts that derives a normalized context.env directly from the typed RunPlan and writes it from ScenarioRunner.run() before any phase. Spec: context emission is framework infrastructure, not a phase action. 2. PhaseOrchestrator.runShellStep was reading context.env from ${ctx.contextDir}/.e2e/context.env, but the shell helper writes to ${E2E_CONTEXT_DIR}/context.env (top-level). Fix the path so shell assertions see seeded keys. 3. ScenarioRunner did not short-circuit across phase boundaries: a failed environment ACTION (real setup work) still let onboarding and runtime run, producing a misleading 34-failure cascade. Runner now consults prior phase results: if any prior action failed, downstream phases are synthesized as skipped with a message naming the blocking phase+action+message. Assertion-only failures still propagate as failures. Tests added (8 new, 292/292 scenario framework tests green). Validation gate next: dispatch e2e-scenarios-all.yaml again. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Run 26550936453 surfaced two more real bugs after Phase 6 wiring went live. Both fixed inside the spec's prescribed layers; nothing leaks into workflow YAML or clients. 1. dispatch-action.sh called e2e_context_init unconditionally before sourcing the install/onboard dispatcher. e2e_context_init opens context.env with `: > ctx`, which truncated the file the ScenarioRunner had just seeded. All runtime assertions then failed with 'e2e context: missing required key(s): E2E_SCENARIO ...'. Fix: dispatch-action.sh no longer calls e2e_context_init. The TS framework owns context.env initialization; workers may still extend it via e2e_context_set. 2. The legacy onboarding.preflight.passed assertion expects an onboard.log file at ${E2E_CONTEXT_DIR}/onboard.log. The old bash runner used to redirect onboarding output there; the typed orchestrator captured it under .e2e/actions/<action-id>.log. Fix: add optional aliasPath to PhaseAction; compiler sets aliasPath to 'onboard.log' for the onboarding profile action; orchestrator copies the action evidence log to the alias on success. Best- effort - alias copy failures do not fail the action. Live evidence from run 26550936453 (canonical ubuntu-repo-cloud-openclaw): - environment.install.repo-current: passed in 14.2s - onboarding.profile.cloud-openclaw: passed in 302s (real onboarding!) - onboarding.base.cli-installed: passed - onboarding.preflight.passed: failed (onboard.log not found) <- fixed - runtime.* (10 steps): all 'missing key(s)' <- fixed by #1 Tests: 38/38 phase-orchestrator (was 36; +2 alias tests), 294/294 scenario framework. shellcheck clean. Validation gate next: redispatch e2e-scenarios-all and confirm runtime steps actually exercise the SUT (real pass/fail, not key errors). Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

The first canonical-scenario run that reached the onboarding-assertion phase (run 26552550140 / ubuntu-repo-cloud-openclaw) showed that the legacy onboarding.preflight.passed assertion fails on every successful run because its regex matches any mention of 'docker' / 'container' / 'daemon' / 'socket' in onboard.log - and a normal nemoclaw onboarding mentions all of those many times. The action itself succeeded (exit 0, 263s of real onboarding work); the assertion is meant to confirm onboard.log does not contain explicit preflight FAILURE markers. Tighten the regex accordingly: match phrases like 'preflight failed/error', 'cannot connect to the docker daemon', 'onboarding aborted', 'FATAL: docker', 'ERROR: docker daemon' - not bare topic words. Verified: shellcheck passes; bash -n passes. Why we stop here on this PR: This commit lands the last small framework-level fix produced by live action evidence. The Phase 6 wiring is now fully validated end-to-end: Install: passed (~12s) Onboarding: action passed (~263s real onboarding) base.cli-installed passed preflight.passed will now pass Runtime: 9 passed / 25 failed / 5 skipped against live SUT The remaining 25 runtime failures are real product/test bugs surfaced by finally executing the suite against a live SUT (sandbox-shell timeouts, inference 30-60s timeouts, lifecycle.sandbox_operations exit-1 mismatches, lifecycle.rebuild/upgrade 120s timeouts even after retries). They are pre-existing and out of scope for 'execute real shell assertions; delete dry-run, --validate-only, and the bash runner'. They become productive follow-up issues. The 5 skipped runtime steps are 'probe not registered' - known per spec; probe registry lands in a follow-up. Negative scenarios (ubuntu-no-docker-preflight-negative, invalid-key-negative, gateway-port-conflict-negative) need expected- failure semantics and a way to actually simulate docker-missing on the runner. Out of scope here; tracked as follow-up. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

All three findings are valid mechanical bugs introduced/touched by this PR. Batched per the safe-batch policy: same-risk, independently obvious, testable together. 1. orchestrators/phase.ts classifierForRef: Outer guard is /i (case-insensitive), but the inner branch used case-sensitive ref.includes("tunnel") / ref.includes("cloudflared") - mixed-case refs would fall through and misclassify as provider-transient. Replace with /tunnel|cloudflared/i.test(ref). 2. scenarios/compiler.ts phaseActions: Inline comment said "the scenario is malformed; surface it as a hard error" but the code returned []. Hard-fail instead, with a message that names the missing dimension. Empty environment is still tolerated (skeleton scenarios can carry no setup yet). 3. validation_suites/lib/sandbox_lifecycle.sh: Substring match `*${E2E_SANDBOX_NAME}*` would let sb1 falsely match sb10. Use awk with a whole-token equality check on column one of `nemoclaw list` output. Tests: 294/294 scenario framework still green. shellcheck + shfmt clean. No behavior change for canonical scenarios; affected paths were either dormant (case-mixed classifier) or returning a slightly stricter outcome (compiler hard-fail, sandbox exact match). Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Resolves conflicts in: - .github/workflows/e2e-scenarios.yaml (WSL run step) - tools/e2e-scenarios/workflow-boundary.mts (boundary contract test) main brought PR #4346 ("fix(windows): enable Docker WSL integration and avoid Ubuntu first-run races"): the WSL run step is now a PowerShell wrapper that writes a bash script to RUNNER_TEMP, maps it into WSL via wslpath, and invokes `wsl -d $WSL_DISTRO -- env ... bash -l $wslTmp`. This handles Docker WSL integration plus a first-run race on freshly installed distros. This PR (#4380) removes --dry-run, --plan-only, and --validate-only from CI. Resolution: * e2e-scenarios.yaml: take main's PowerShell wrapper structure verbatim, but drop `--dry-run` from the inner `npx tsx ... run.ts` invocation. The typed runner is now the single execution path. * workflow-boundary.mts: assert BOTH the new constraints. The WSL step must NOT contain --dry-run / --plan-only / --validate-only (this PR's invariant) AND must contain $env:WSL_WORKDIR / WriteAllText / `bash -l $wslTmp` (main's robustness invariant). 294/294 scenario framework tests pass. shellcheck / shfmt / end-of-file / trailing-whitespace clean on touched files. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

The PR body audits two invariants as absent from production paths: - E2E_DRY_RUN / e2e_env_is_dry_run - phase-1-skeleton The PR Review Advisor caught counterexamples to both. Fix them now. E2E_DRY_RUN: validation_suites/messaging/slack/00-slack-provider-state.sh still branched on ${E2E_DRY_RUN:-} and emitted dry-run pass markers without reading the OpenClaw config surface or querying runtime discovery. Removed; the assertion now always reads real config and queries OpenClaw runtime state. One execution mode. phase-1-skeleton: scenarios/assertions/{environment,onboarding,runtime}.ts each defined a 'kind: pending, ref: phase-1-skeleton' step that the orchestrator silently skips. Of the three: - environmentBaseline() was in the registry and every scenario plan; environment work is performed by typed PhaseAction entries (context.emit + install.<id>) from compiler.phaseActions(), so the assertion group is redundant. Removed from registry and assertionGroupsForScenario(); deleted the file. - onboardingBaseline() and runtimeSmokeSkeleton() were never imported. Deleted as dead code. Verification: rg of E2E_DRY_RUN, e2e_env_is_dry_run, phase-1-skeleton across scenarios, runtime, validation_suites, onboarding_assertions, nemoclaw_scenarios, and the scenario workflows returns nothing outside framework-tests. 294/294 e2e-scenario framework tests pass. bash -n clean on the touched shell script. Addresses PR Review Advisor findings: - 'Residual E2E_DRY_RUN branch contradicts the one-live-mode contract' - 'Skeleton pending refs remain in production scenario plans' Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

…stic The PR #4380 scenario run produced a SIGTERM cascade across runtime suite steps (sandbox-shell, inference.*, kimi.*, lifecycle.rebuild.*, lifecycle.upgrade.*) because the canonical sandbox-exec wrapper had no inner timeout. When openshell ssh-into-sandbox wedged, the orchestrator killed the process group at the step cap (30/45/60/120s) and the script log captured only the header line — no diagnostic, no classifier. Mirror the legacy test/e2e/ defense-in-depth pattern (outer process timeout + inner curl --max-time) inside the new framework's canonical wrapper: - sandbox-exec.sh now applies a per-call timeout via timeout/gtimeout with --kill-after=5s. Default 25s sits below the common 30s step cap; callers override via E2E_SANDBOX_EXEC_TIMEOUT_SECONDS for steps with larger budgets (chat-completion 50s, sandbox-local 35s, slack 50s, gateway-alive fallback 15s). On timeout (124/137) the wrapper emits a single-line classified diagnostic so phase.ts captures observable evidence instead of a SIGTERM black hole. - Migrate 8 BARE leaf scripts that called openshell sandbox exec directly to route through the wrapper: smoke/03-sandbox-shell.sh assert/gateway-alive.sh (ollama fallback path) inference/cloud/00-models-health.sh inference/cloud/01-chat-completion.sh inference/cloud/02-inference-local-from-sandbox.sh inference/ollama-auth-proxy/00-proxy-reachable.sh messaging/slack/00-slack-provider-state.sh kimi-compatibility steps already route through inference_routing.sh which uses the wrapper, so they inherit the fix transparently. rebuild_upgrade.sh and security_policy_credentials.sh continue to use their own ad-hoc timeouts; unifying those onto the wrapper is a follow-up. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

shellcheck flags 'E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=N \\\n cmd' as SC2034 (apparently unused) because static analysis can't follow that the wrapper reads it via environment. The pattern is correct (prefix-env assignment exports the variable to the immediately-following command), but the warning blocks the 'checks' job. Add a localized 'shellcheck disable=SC2034' directly above each affected prefix-env line in the two cloud-inference assertion scripts that override the default wrapper timeout. No behavior change. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Addresses PR Review Advisor finding #1: 'Secret-bearing child process output is persisted and uploaded without redaction'. Designed in the spirit of the test/e2e-scenario architecture: secret hygiene is FRAMEWORK INFRASTRUCTURE, not a per-script / per-action / per-workflow concern. Same one-mode discipline that motivates the rest of this PR (no flag, no env var, no helper bypasses redaction). New module: test/e2e-scenario/scenarios/orchestrators/redaction.ts Sits next to context.ts, owns: - redactString(text): canonical token-shape redaction - buildChildEnv(base, opts): minimal allowlisted env + declared secretEnv passthrough only - pipeRedacted(src, log): single I/O wrapper used by both runAction and runShellStep Pattern set is a framework-local mirror of src/lib/security/secret-patterns.ts. The framework deliberately does NOT import from src/lib/security/ to keep the framework-vs-product boundary clean and avoid cross-tsconfig ESM issues. A new parity test (e2e-redaction-parity.test.ts) asserts the two pattern sets stay in lockstep so adding a token shape in one place keeps both layers honest. Typed contract extension: PhaseAction.secretEnv?: readonly string[] AssertionStep.secretEnv?: readonly string[] Compiler decides which secrets cross the framework boundary, the same way aliasPath / scriptRef / timeoutSeconds are declared metadata. buildChildEnv enforces a 'secret-key shape' contract on every entry (must end with API_KEY/TOKEN/SECRET/PASSWORD/CREDENTIAL/ PASSPHRASE/PRIVATE_KEY/ACCESS_KEY) so the secretEnv channel cannot silently allowlist non-secret variables. compiler.ts: declares onboard secretEnv per profile cloud-* profiles -> ['NVIDIA_API_KEY'] local-ollama-openclaw -> [] Install actions declare nothing (the installer itself does not authenticate to the cloud). phase.ts: two call sites updated, identical pattern in both runAction: buildChildEnv(...) + pipeRedacted(stdout, log) + pipeRedacted(stderr, log, tail<-redactedChunk) runShellStep: same shape Spawn-error 'message' fields are also wrapped with redactString as defense-in-depth. Workflow upload tightening (.github/workflows/e2e-scenarios.yaml): - include-hidden-files: false Hidden dotfiles under the workspace can carry raw secrets (notably .e2e/context.env, written by e2e_context_set without redaction). Diagnostic dumps of context use e2e_context_dump which redacts on emit. - path: explicit subpath list (.e2e/actions/, .e2e/logs/, result JSONs, plan/run-plan, onboard.log) instead of blanket .e2e/. workflow-boundary.mts contract test updated to assert the new invariant; e2e-scenarios-workflow.test.ts expectations updated. Bash side: unchanged. test/e2e-scenario/runtime/lib/context.sh:: e2e_context_dump already redacts via _e2e_context_is_sensitive_key. This module covers the TS-spawned-child path that lacked coverage. Tests (e2e-phase-orchestrators.test.ts, +5): - test_should_not_persist_secret_shaped_child_output_into_evidence Real shell child writes nvapi-, ghp_, sk-, xoxb-, Bearer ... Asserts evidence log, phase result.json, and result.message contain none of the literal tokens but do contain <REDACTED>. - test_should_drop_non_allowlisted_parent_env_unless_declared_in_secretEnv Sets a sentinel parent-env var, runs child without declaring it, asserts child cannot see it. PATH and E2E_PHASE still arrive. - test_should_pass_declared_secretEnv_through_to_child Declares the var in step.secretEnv, asserts child sees it. - test_should_reject_non_secret_shaped_keys_in_secretEnv_at_runtime buildChildEnv throws on FOO_VAR; the secret-key-shape contract keeps the allowlist boundary honest. - test_should_declare_NVIDIA_API_KEY_only_for_cloud_onboarding_actions Compiler-level contract: cloud onboard declares NVIDIA_API_KEY, local-ollama declares nothing. Tests (e2e-redaction-parity.test.ts, new file, +2): - TOKEN_PREFIX_PATTERNS framework copy matches product source - CONTEXT_PATTERNS framework copy matches product source 308/308 e2e-scenario framework tests pass. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

…re pending steps Addresses PR Review Advisor finding #2: 'Required security probes and expected-failure checks still skip without failing live runs'. Same architectural failure mode as findings #5 (E2E_DRY_RUN) and #6 (phase-1-skeleton): the typed framework declared a contract, but a code path silently produced non-failing skips that contradicted this PR's 'one mode, no fake green' invariant. Three suites (security-shields, security-policy, security-injection) were wired as kind: 'probe' steps; the orchestrator marked them status: 'skipped' because the probe registry hasn't landed yet. The expected-failure side-effect validator (runtime.expected-failure.no- side-effects) was wired as kind: 'pending'. run.ts only sets process.exitCode = 1 on status === 'failed', so a live run could omit security shields, network policy, prompt-injection blocking, AND the negative-scenario forbidden-side-effect contract while still appearing non-failing. Fix follows the framework spirit: declare requirement in the typed metadata, enforce it once at the orchestrator boundary. Typed contract extension (types.ts): AssertionStep.required?: boolean When true, a probe/pending step that resolves as 'skipped' is reclassified as 'failed' by the phase orchestrator. Defaults to false (existing behavior preserved for diagnostics, docs-validation, and other non-security probes). Orchestrator (phase.ts): executeStep: probe and pending branches now check step.required and return status: 'failed' with a 'required <kind> not <state>' message when set. Non-required probes/pending steps continue to surface as visible skipped results so unrelated gaps stay diagnostic. Registry (assertions/registry.ts): - probeStep / pendingStep helpers grew an options arg with required?: boolean (matches the secretEnv pattern: options come in via typed objects, not positional args). - security-shields, security-policy, security-injection probes annotated required: true. These are the suites the run is not safe without; failing closed beats fake green. - runtime.expected-failure.no-side-effects pending step annotated required: true. Negative scenarios cannot honestly pass while their core side-effect contract is unimplemented. - diagnostics, docs-validation probes intentionally remain non-required: they are informational, not safety-critical, and will switch to required when the probe registry lands and they have real implementations. run.ts: no change needed. The new failed-status flows through the existing exit-code path (process.exitCode = 1 when any phase status is failed), so the workflow correctly reports red for missing required probes/pending steps. Tests (e2e-phase-orchestrators.test.ts, +5): - test_required_probe_step_that_is_unregistered_fails_the_phase - test_non_required_probe_step_continues_to_skip_visibly - test_required_pending_step_fails_closed - test_security_suite_groups_in_registry_mark_their_steps_as_required - test_expected_failure_no_side_effects_step_in_registry_is_required 318/318 e2e-scenario framework tests pass. Until the probe registry follow-up PR registers actual implementations for shieldsConfigProbe / networkPolicyProbe / injectionBlockedProbe / expectedFailureNoSideEffectsProbe, scenarios that include those suites/expected-failure groups will fail loudly with 'required probe not registered' or 'required pending step not implemented'. That is the correct intermediate state: visible red until real, instead of green-by-default. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Diagnoses surfaced by the timeout wrapper on PR #4380's scenario run showed every assertion that goes through 'openshell sandbox exec' hangs in CI (host can curl the gateway and list sandboxes, but openshell-exec into the sandbox never returns). The legacy test/e2e/ scripts have always entered the sandbox via 'openshell sandbox ssh-config' + 'ssh -F' and don't exhibit this hang, so adopt that pattern as the default transport in the canonical wrapper. Behavior: - On first call per (sandbox, script), e2e_sandbox_exec calls 'openshell sandbox ssh-config <name>' once and caches the result under ${E2E_CONTEXT_DIR}/.ssh-config-cache/<name>.cfg. - Subsequent calls reuse the cached config and invoke 'ssh -F <cfg> -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR openshell-<name> <quoted remote-cmd>'. - Args are quoted via printf '%q' so payloads with shell metacharacters (e.g. JSON bodies for chat-completion) survive intact. - Fallback: if 'openshell sandbox ssh-config' returns non-zero (e.g. older openshell builds, runners without an ssh client), the wrapper falls back to 'openshell sandbox exec' and emits a stderr breadcrumb. - Opt-out: set E2E_SANDBOX_EXEC_VIA_OPENSHELL=1 to force the original transport (used by the framework unit tests that stub openshell but not ssh). Tests: - Two new vitest cases: * sandbox_exec_should_prefer_ssh_config_transport_when_openshell_offers_one * sandbox_exec_should_fall_back_to_openshell_when_ssh_config_unavailable - Existing exit-code propagation and stdin-quoting tests opt into the openshell-direct transport so their stub-openshell expectations remain valid. - 322 framework tests pass. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

copy-pr-bot · 2026-05-28T12:14:21Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

…apper The lifecycle.rebuild.* and lifecycle.upgrade.survivor-reachable assertions on PR #4380's scenario run still produced raw 120s SIGTERM cascades because rebuild_upgrade.sh used its own ad-hoc _rebuild_upgrade_run helper that called 'openshell sandbox exec' without a per-call timeout or transport choice. The orchestrator's 120s step cap was the only safety net, and on hang the script log captured no diagnostic. Migrate the lib onto the canonical wrapper: - Add _rebuild_upgrade_sandbox_exec(sandbox, cmd...) that: * Honors the existing REBUILD_UPGRADE_SANDBOX_CMD test override (with the same -n <sandbox> -- <cmd>... contract used by existing fakes) so framework unit tests keep working unchanged. * Routes production calls through e2e_sandbox_exec, picking up the ssh-config-preferred transport, the per-call timeout, and the classified diagnostic on hang. - Replace the seven openshell-sandbox-exec call sites with the new helper. - Default E2E_SANDBOX_EXEC_TIMEOUT_SECONDS to 100s for this lib's callers (orchestrator caps these steps at 120s; 100s leaves room for the wrapper to emit a diagnostic and exit cleanly before SIGTERM). The existing rebuild_upgrade_checks_should_allow_command_fakes test continues to pass, proving the override contract is preserved. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

The lifecycle.sandbox.list-and-status assertion on PR #4380's scenario run failed with 'missing status field: status'. Root cause: the assertion required the literal lower-case tokens 'status', 'gateway', and 'sandbox' in 'nemoclaw <name> status' output, but the production command (src/lib/actions/sandbox/status.ts) does not print 'Status:' or 'Gateway:' labels in its normal output. The original assertion only passed against the framework-test mock string ('status running gateway healthy sandbox running'), which was synthetic and did not reflect what the CLI actually emits. Align the assertion with the production output: - Require the sandbox name itself to appear (every status output for a present sandbox emits 'Sandbox: <name>'). - Require the field labels that the CLI unconditionally prints when a sandbox is found: 'Sandbox', 'Model', 'OpenShell'. - Drop the 'status' and 'gateway' tokens that never appear in real output. This matches the legacy test/e2e/test-full-e2e.sh:230 pattern (verifies 'nemoclaw status' exits 0 and contains substantive content) while keeping the assertion strict enough to catch CLI regressions that drop core fields. Update the framework test mock to emit realistic output ('Sandbox: sb1', 'Model:', 'OpenShell:', 'Policies:') so the existing happy-path test keeps passing on the new contract. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

The 'ubuntu-no-docker-preflight-negative' scenario on PR #4380 was structurally registered (manifest, scenario id, expected_failure metadata) but never executed its negative path. The framework called the regular 'cloud-openclaw' onboarding worker, docker was actually running, onboard succeeded, and nothing wrote ${E2E_CONTEXT_DIR}/negative-preflight.log. The 'onboarding.preflight.expected-failed' assertion then failed with 'evidence not found'. Mirror the legacy test/e2e/e2e-cloud-experimental/test-port8080-conflict.sh pattern (set up failure condition -> capture onboard output -> assert non-zero exit -> grep log) inside the typed framework: - Add nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh: * Installs a 'docker' shim earlier on PATH that exits non-zero with 'Cannot connect to the Docker daemon ...' so commandExists('docker') succeeds while 'docker info' fails - the same failure mode users see when Docker is installed but the daemon is not running. * Runs 'nemoclaw onboard --non-interactive' redirecting stdout+stderr to ${E2E_CONTEXT_DIR}/negative-preflight.log. * Asserts nemoclaw exited non-zero (preflight DID fail). If onboard unexpectedly succeeds, the action fails so a regression that lets docker-missing slide is loud, not silent. * Returns 0 on the expected-failure path so the orchestrator marks the action passed and the dependent assertion phase runs against the captured log. - Wire the worker into onboard/dispatch.sh under the 'cloud-openclaw-no-docker' profile id. - Compiler routes scenarios with environment.runtime='docker-missing' to '<base>-no-docker' onboarding profile. Adds the new profile to ONBOARD_PROFILE_SECRET_ENV so NVIDIA_API_KEY still flows through (matches a real user invocation where the CLI loads creds even when preflight aborts). - Adds compiler_routes_docker_missing_runtime_to_no_docker_onboarding_profile test that locks in the routing: negative scenario -> -no-docker profile id, evidence path, secret env; positive scenario unaffected. 324 tests pass. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

- redaction.ts: prepend $HOME/.local/bin to child PATH so post-install shims resolve on runners where it's not on the default PATH - sandbox_lifecycle.sh: use 'nemoclaw <sandbox> snapshot <sub>' argv shape; the prior form parsed 'snapshot' as the sandbox name

Resolutions: - run.ts: kept HEAD's removal of --dry-run/--validate-only modes; took main's richer --emit-matrix payload (runner/label/platform/suites) consumed by the dynamic matrix workflow (#4359). - 00-slack-provider-state.sh: kept HEAD's removal of the openclaw dry-run skip block; carried main's 'agent' local var. - docs/README.md: kept HEAD (live-only runner, deprecated bash entrypoints) and folded in main's 'Migration status is tracked outside the repository' line so e2e-migration-inventory-lock stays green. - e2e-coverage-report.test.ts, e2e-scenario-resolver.test.ts, runtime/resolver/coverage.ts: confirmed deletes (intentional in 9da75ac, 'retire legacy expected-states.yaml + runtime/resolver'). - e2e-expected-state.test.ts: applied the trim that 9da75ac's commit message described but did not include — drop the YAML-mirror parity tests that depended on the now-deleted expected-states.yaml, replace with id-coverage assertions against listExpectedStates(). - e2e-scenarios.yaml: retitled the summary step from 'typed dry-run summary' to 'typed scenario run' so the workflow text matches the PR's live-only stance.

#4939 added 'transitional model / target runner' guidance to docs/README.md after our merge. Resolved by taking that content but reconciling it with this PR's retirement of nemoclaw_scenarios/expected-states.yaml and the runtime/resolver/ tree: - 'Current sources of truth' table: drop the row pointing at expected-states.yaml and runtime/resolver; add a row for the typed expected-state registry (scenarios/expected-states.ts) as the single source of truth. - Replaced 'fan-out and dry-run planning' with 'fan-out and live execution'. - Added one paragraph noting the legacy YAML resolver and expected-states.yaml have been retired so the doc and the codebase agree.

CI on the merge commit (863e8eb) flagged two regressions caused by the merge with main: 1. ShellCheck SC2148: 5 probe scripts under nemoclaw_scenarios/probes/ start with the SPDX comment and have no shebang. ShellCheck cannot infer the target shell. Added '#!/usr/bin/env bash' and ensured executable bit. dispatch.sh already had a shebang. 2. markdown-links: docs/README.md still pointed at two paths that the YAML-resolver retirement (commit 9da75ac) deleted: - ../nemoclaw_scenarios/expected-states.yaml - ../runtime/resolver/schema.ts Repointed both to the typed registry that replaced them (scenarios/expected-states.ts, scenarios/types.ts). Verified locally: check-docs --only-links now passes; test/e2e-scenario/ vitest still 366 green; typecheck:cli clean.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh (1)
39-44: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent grep pattern can cause false negatives.

Line 33 uses a word-boundary regex to prevent partial matches (e.g., "test" won't match "test-2"), but line 40 uses fixed-string substring matching (grep -Fq), which will match partial names. If a sandbox named "${E2E_SANDBOX_NAME}-suffix" exists but "${E2E_SANDBOX_NAME}" doesn't, this probe will incorrectly fail.
🔍 Proposed fix to align openshell check with nemoclaw pattern
 if command -v openshell >/dev/null 2>&1; then
-  if openshell sandbox list 2>/dev/null | grep -Fq "${E2E_SANDBOX_NAME}"; then
+  if openshell sandbox list 2>/dev/null | grep -qE "(^|[[:space:]])${E2E_SANDBOX_NAME}([[:space:]]|$)"; then
     echo "probe sandbox-absent: openshell reports sandbox '${E2E_SANDBOX_NAME}', expected absent" >&2
     exit 1
   fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh` around lines
39 - 44, The probe sandbox-absent uses substring matching (grep -Fq) which can
false-match names like "${E2E_SANDBOX_NAME}-suffix"; update the openshell check
so it matches whole words instead (e.g., replace grep -Fq "${E2E_SANDBOX_NAME}"
with grep -wq -- "${E2E_SANDBOX_NAME}" or an equivalent word-boundary regex like
grep -Eq "\\b${E2E_SANDBOX_NAME}\\b") in the block containing the openshell
sandbox list call and the probe named "sandbox-absent" so the check aligns with
the word-boundary behavior used elsewhere.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh`:
- Around line 39-44: The probe sandbox-absent uses substring matching (grep -Fq)
which can false-match names like "${E2E_SANDBOX_NAME}-suffix"; update the
openshell check so it matches whole words instead (e.g., replace grep -Fq
"${E2E_SANDBOX_NAME}" with grep -wq -- "${E2E_SANDBOX_NAME}" or an equivalent
word-boundary regex like grep -Eq "\\b${E2E_SANDBOX_NAME}\\b") in the block
containing the openshell sandbox list call and the probe named "sandbox-absent"
so the check aligns with the word-boundary behavior used elsewhere.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dbfe9c5f-8470-47b6-9fc2-8259436800ba

📥 Commits

Reviewing files that changed from the base of the PR and between 863e8eb and 3d8adfa.

📒 Files selected for processing (6)

test/e2e-scenario/docs/README.md
test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh
test/e2e-scenario/nemoclaw_scenarios/probes/gateway-absent.sh
test/e2e-scenario/nemoclaw_scenarios/probes/gateway-healthy.sh
test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh
test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh

✅ Files skipped from review due to trivial changes (2)

test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh

The PR audit claimed 'Audited absent in production paths: e2e_env_is_dry_run, E2E_DRY_RUN, ...' but the live scenarios run on commit 3d8adfa revealed two survivors: 1. validation_suites/hermes/01-history-writable.sh:34 - Calls 'if e2e_env_is_dry_run; then ... exit 0; fi' - The helper was deleted in b7acfb7 -> the whole hermes scenario blew up with 'e2e_env_is_dry_run: command not found' on every run, masquerading as a hermes regression - Surfaced in: ubuntu-repo-cloud-hermes scenario, runtime.hermes.history-writable 2. validation_suites/messaging/slack/00-slack-provider-state.sh:66 - Wrapped the entire hermes branch in 'if [[ -n "${E2E_DRY_RUN:-}" ]]; then 3x e2e_pass else <real probe>; fi' - Unreachable now (runner never sets E2E_DRY_RUN), but contradicts the audit and would skip the real probes if anything ever did Removed the dry-run wrapper in both files; dedented the slack else-block inline. bash -n clean; vitest 366 green; full repo grep for 'e2e_env_is_dry_run|E2E_DRY_RUN' (excluding .test.ts) returns zero. Re-enables: ubuntu-repo-cloud-hermes / runtime.hermes.history-writable

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh`:
- Line 70: Replace direct openshell sandbox invocations with the canonical
e2e_sandbox_exec wrapper to preserve per-call timeout and transport behavior;
specifically, change calls that currently run "openshell sandbox exec --name
\"${sandbox_name}\" -- /opt/hermes/.venv/bin/python -c '...'" to use
e2e_sandbox_exec passing the sandbox identifier and the same command payload
(same python -c string) so timeouts/fallbacks are honored. Apply the same
replacement for the other occurrences mentioned (the calls around the other two
locations) and keep the original command arguments unchanged except for swapping
the invocation to e2e_sandbox_exec.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 060de45a-9ece-4905-ba24-816fa28927ce

📥 Commits

Reviewing files that changed from the base of the PR and between 3d8adfa and 381d654.

📒 Files selected for processing (2)

test/e2e-scenario/validation_suites/hermes/01-history-writable.sh
test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh

💤 Files with no reviewable changes (1)

test/e2e-scenario/validation_suites/hermes/01-history-writable.sh

Two batched mechanical fixes for the static-checks pre-commit job on PR head 381d654: 1. Files were modified by shfmt (-i 2 -ci -bn): 6 .sh files used '\\<newline>' line continuations where shfmt's -bn (binary-next-line) prefers '&&\n' / '||\n'. Pre-existing on main; the merge dragged them into the PR diff and our local commits weren't run through pre-commit. Re-formatted in place. 2. check-shebang-scripts-are-executable: 3 scripts had shebangs but were tracked as 0644: - test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh - test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh - test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh chmod +x and re-tracked. Verified: shfmt -d clean across all PR-touched .sh files; biome format clean across PR-touched ts/tsx/js/mjs/mts; bash -n clean on the 6 reformatted scripts; vitest test/e2e-scenario/ 366 green.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh (1)
13-13: ⚠️ Potential issue | 🟡 Minor

Update the PR summary: runtime/lib/env.sh is still present and is sourced by the scenario script

test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh line 13 still sources . "${LIB_DIR}/env.sh", and test/e2e-scenario/runtime/lib/env.sh exists in the repo.

The “removed”/runtime-failure claim doesn’t match the current code; align the summary (e.g., if only specific helpers were removed, state that).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh`
at line 13, The PR summary claims runtime/lib/env.sh was removed but the
scenario script still sources it via the LIB_DIR variable (the line sourcing
"${LIB_DIR}/env.sh" in 02-inference-local-from-sandbox.sh) and the
runtime/lib/env.sh file remains in the repo; update the PR summary to accurately
reflect what was changed (e.g., which specific helpers were removed or
refactored) or remove the source line if you truly intend to stop using
runtime/lib/env.sh—ensure consistency between the commit changes and the summary
by either modifying the script to stop sourcing LIB_DIR/env.sh or amending the
PR description to state that env.sh is still present and which pieces were
removed.

🧹 Nitpick comments (1)

test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh (1)
81-81: ⚖️ Poor tradeoff

Shell injection risk in command construction.

The command string interpolates ${marker_content} into a single-quoted context. If marker_content ever contains a single quote, the quoting breaks. While currently safe (the content is REBUILD_LIFECYCLE_$(date +%s)_${RANDOM}), this pattern was explicitly flagged in PR coordination as a concern: "prevention of plan-artifact-to-shell injection."

Pass variables as positional arguments to avoid quote interpretation:
🔒 Proposed fix using positional arguments
-  if ! E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=30 \
-    e2e_sandbox_exec "${sandbox_name}" -- sh -c \
-    "mkdir -p '$(dirname "${LIFECYCLE_REBUILD_MARKER_PATH}")' && printf '%s' '${marker_content}' > '${LIFECYCLE_REBUILD_MARKER_PATH}'"; then
+  if ! E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=30 \
+    e2e_sandbox_exec "${sandbox_name}" -- sh -c \
+    'mkdir -p "$(dirname "$1")" && printf %s "$2" > "$1"' \
+    sh "${LIFECYCLE_REBUILD_MARKER_PATH}" "${marker_content}"; then
This passes marker_content as $2 to the remote shell, preventing quote interpretation issues.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh` at
line 81, The current shell command interpolates marker_content directly into a
single-quoted string which breaks if marker_content contains a single quote;
update the command used to create LIFECYCLE_REBUILD_MARKER_PATH so
marker_content is passed as a positional argument to the remote shell instead of
being interpolated into the command string. Concretely, change the constructed
command that references LIFECYCLE_REBUILD_MARKER_PATH and marker_content to use
a printf (or echo) that reads from a positional parameter (e.g., $2) and then
invoke the remote shell/exec call with marker_content supplied as that
positional argument so no shell quoting of marker_content is required.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh`:
- Line 13: The PR summary claims runtime/lib/env.sh was removed but the scenario
script still sources it via the LIB_DIR variable (the line sourcing
"${LIB_DIR}/env.sh" in 02-inference-local-from-sandbox.sh) and the
runtime/lib/env.sh file remains in the repo; update the PR summary to accurately
reflect what was changed (e.g., which specific helpers were removed or
refactored) or remove the source line if you truly intend to stop using
runtime/lib/env.sh—ensure consistency between the commit changes and the summary
by either modifying the script to stop sourcing LIB_DIR/env.sh or amending the
PR description to state that env.sh is still present and which pieces were
removed.

---

Nitpick comments:
In `@test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh`:
- Line 81: The current shell command interpolates marker_content directly into a
single-quoted string which breaks if marker_content contains a single quote;
update the command used to create LIFECYCLE_REBUILD_MARKER_PATH so
marker_content is passed as a positional argument to the remote shell instead of
being interpolated into the command string. Concretely, change the constructed
command that references LIFECYCLE_REBUILD_MARKER_PATH and marker_content to use
a printf (or echo) that reads from a positional parameter (e.g., $2) and then
invoke the remote shell/exec call with marker_content supplied as that
positional argument so no shell quoting of marker_content is required.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 19ad2265-3927-4694-80ee-0d4c10fded73

📥 Commits

Reviewing files that changed from the base of the PR and between 381d654 and fc06168.

📒 Files selected for processing (6)

test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh
test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh
test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh

💤 Files with no reviewable changes (2)

test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh

🚧 Files skipped from review as they are similar to previous changes (1)

test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh

…e-shape budget The static-checks job runs scripts/find-source-shape-tests.ts --check with budget ci/source-shape-test-budget.json (maxSourceShapeCases=0). e2e-redaction-parity.test.ts had 2 source-shape cases that read both redaction.ts files as text and compared regex literals via string parsing. Refactored to import the actual RegExp arrays from both modules and compare them by behavior — fingerprint each via .source + .flags. Same parity guarantee, no source-text reads, scanner now reports 0 cases. Required exporting TOKEN_PREFIX_PATTERNS and CONTEXT_PATTERNS from test/e2e-scenario/scenarios/orchestrators/redaction.ts. The framework runtime stays decoupled from src/lib/security/ — only the parity test crosses that boundary, which is the test's whole purpose. Added a comment on the framework export explaining the test-only contract. Verified: `npm run source-shape:check` reports 0 cases; vitest test/e2e-scenario/ 366 green; typecheck:cli clean.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

test/e2e-scenario/scenarios/orchestrators/redaction.ts (1)

211-221: ⚠️ Potential issue | 🟠 Major

Fix chunk-boundary redaction in pipeRedacted to prevent split tokens from leaking into evidence logs.

test/e2e-scenario/scenarios/orchestrators/redaction.ts (pipeRedacted, lines 211-221) redacts each data chunk independently via redactString(chunk.toString("utf8")) and writes it immediately. With no carry/overlap buffering, a token split across adjacent chunks can fail the regex matches and persist in the evidence log (and into the stderrTail-based message path). The current redaction test covers echo-style output but doesn’t exercise split-token chunk boundaries.

Suggested direction (stream-safe redaction)

+import { StringDecoder } from "node:string_decoder";
...
 export function pipeRedacted(
   src: Readable,
   log: Writable,
   onChunk?: (redactedChunk: string) => void,
 ): void {
-  src.on("data", (chunk: Buffer) => {
-    const redacted = redactString(chunk.toString("utf8"));
-    log.write(redacted);
-    onChunk?.(redacted);
-  });
+  const decoder = new StringDecoder("utf8");
+  let carry = "";
+  const OVERLAP_CHARS = 4096;
+
+  const emitRedacted = (text: string) => {
+    if (!text) return;
+    const redacted = redactString(text);
+    log.write(redacted);
+    onChunk?.(redacted);
+  };
+
+  src.on("data", (chunk: Buffer | string) => {
+    const text = typeof chunk === "string" ? chunk : decoder.write(chunk);
+    const merged = carry + text;
+    const cutoff = Math.max(0, merged.length - OVERLAP_CHARS);
+    emitRedacted(merged.slice(0, cutoff));
+    carry = merged.slice(cutoff);
+  });
+
+  src.on("end", () => {
+    emitRedacted(carry + decoder.end());
+    carry = "";
+  });
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts` around lines 211 -
221, pipeRedacted currently redacts each Buffer chunk independently
(src.on("data") → redactString(...); log.write(...)) which allows sensitive
tokens split across chunk boundaries to slip through; change pipeRedacted to
maintain a small carry buffer between chunks: append incoming chunk to a
leftover string, run redactString on the combined text but retain any trailing
partial token (notably the same token/regex semantics used by redactString) by
slicing off an appropriate suffix into leftover for the next chunk, write only
the fully-redacted portion to log and call onChunk for that portion, and on
"end"/"close"/"error" flush the remaining leftover through redactString and
write it; ensure the same function names (pipeRedacted, redactString) and event
handlers (src.on("data"), src.on("end"/"close"/"error")) are updated.

🧹 Nitpick comments (1)

test/e2e-scenario/scenarios/orchestrators/redaction.ts (1)

52-74: ⚡ Quick win

Exported regex arrays should be immutable to prevent runtime mutation of redaction behavior.

Now that these arrays are exported for tests, any importer can mutate them (push/splice), which can silently alter production redaction behavior in-process.

Hardening patch

-export const TOKEN_PREFIX_PATTERNS: RegExp[] = [
+export const TOKEN_PREFIX_PATTERNS: readonly RegExp[] = Object.freeze([
   ...
-];
+]);

-export const CONTEXT_PATTERNS: RegExp[] = [
+export const CONTEXT_PATTERNS: readonly RegExp[] = Object.freeze([
   ...
-];
+]);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts` around lines 52 - 74,
The exported arrays TOKEN_PREFIX_PATTERNS and CONTEXT_PATTERNS are mutable and
can be altered by importers; change their exports to immutable frozen/read-only
arrays and freeze each RegExp instance. Replace direct array exports with
something like: export const TOKEN_PREFIX_PATTERNS: ReadonlyArray<RegExp> =
Object.freeze([ ...patterns... ].map(r => Object.freeze(r))); and do the same
for CONTEXT_PATTERNS so callers cannot push/splice or mutate the regex objects.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts`:
- Around line 211-221: pipeRedacted currently redacts each Buffer chunk
independently (src.on("data") → redactString(...); log.write(...)) which allows
sensitive tokens split across chunk boundaries to slip through; change
pipeRedacted to maintain a small carry buffer between chunks: append incoming
chunk to a leftover string, run redactString on the combined text but retain any
trailing partial token (notably the same token/regex semantics used by
redactString) by slicing off an appropriate suffix into leftover for the next
chunk, write only the fully-redacted portion to log and call onChunk for that
portion, and on "end"/"close"/"error" flush the remaining leftover through
redactString and write it; ensure the same function names (pipeRedacted,
redactString) and event handlers (src.on("data"), src.on("end"/"close"/"error"))
are updated.

---

Nitpick comments:
In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts`:
- Around line 52-74: The exported arrays TOKEN_PREFIX_PATTERNS and
CONTEXT_PATTERNS are mutable and can be altered by importers; change their
exports to immutable frozen/read-only arrays and freeze each RegExp instance.
Replace direct array exports with something like: export const
TOKEN_PREFIX_PATTERNS: ReadonlyArray<RegExp> = Object.freeze([ ...patterns...
].map(r => Object.freeze(r))); and do the same for CONTEXT_PATTERNS so callers
cannot push/splice or mutate the regex objects.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 50d05560-6afd-4f7b-927c-bafdd8cf0b74

📥 Commits

Reviewing files that changed from the base of the PR and between fc06168 and b41c17b.

📒 Files selected for processing (2)

test/e2e-scenario/framework-tests/e2e-redaction-parity.test.ts
test/e2e-scenario/scenarios/orchestrators/redaction.ts

Two batched review-thread fixes (Phase 3 of CI loop): 1. CodeRabbit (major) on validation_suites/messaging/slack/00-slack-provider-state.sh: 3 calls used 'openshell sandbox exec --name X -- ...' directly, bypassing the canonical e2e_sandbox_exec wrapper. The wrapper provides per-call timeout via E2E_SANDBOX_EXEC_TIMEOUT_SECONDS plus ssh-config-preferred / openshell-exec-fallback transport. A wedged openshell call without the wrapper can stall the suite indefinitely in live mode. Replaced all 3 with the wrapper call, matching the timeout cap pattern already used at line 36. Fixes hermes-platforms-enabled, hermes-allowed-channels-scoped, and hermes-gateway-running probes in the slack scenario. 2. CodeQL alert 715 (Shell command built from environment values) on test/e2e-scenario/scenarios/probes/util.ts: runSandboxCmd interpolated wrapperPath (derived from ProbeContext.repoRoot) and the sandbox name + payload argv into a bash -c script string. Even with JSON.stringify-style escaping, CodeQL flags taint flow from runtime context into shell. Refactored to pass all user-controlled values as positional bash parameters: spawn('bash', ['-c', script, 'e2e-probe-spawn', wrapperPath, fnName, sandboxName, ...args]). Script body references them as "$1", "$2", "$3", "${@:4}". No user data is concatenated into the script. Smoke-tested the new pattern; runtime behavior is identical for callers (single argv vector preserved). Updated runSandboxCmd doc-comment which described the old single-quote-into-script approach. Verified: bash -n + shfmt clean on slack file; vitest 366 green; typecheck:cli clean; source-shape:check still 0 cases.

Iteration 5's refactor cleared CodeQL alert 715 (interpolated user data into shell-command body) but introduced alert 792 on the resulting spawn("bash", ["-c", script, sentinel, ...bashArgs]) call. CodeQL's js/shell-command-injection-from-environment rule fires on any data flow from runtime context into a 'bash -c' invocation, even when the data reaches the script as positional bash parameters rather than through interpolation. Applied the established repo pattern (src/lib/runner.ts:101-102): 1. Added a NUL-byte rejector on every bashArg (mirrors normalizeSpawnArgs / rejectNulByte from runner.ts). 2. Documented the spawn contract in spawnBash's doc-comment: - script body is a string literal at every call site - all variable data flows through positional bash params ($1,$2,...) - bash treats positional argv as data, not code - bash binary is hard-coded; shell:false is implicit 3. Suppressed alert 792 with an explicit lgtm[] comment that points at the contract above. This follows the documented repo convention: validate inputs, document why the spawn is safe, suppress the false positive with the rule name and a short justification. Verified: typecheck:cli clean; vitest test/e2e-scenario/ 366 green; smoke-tested the new spawn behavior with a temp wrapper script.

# Conflicts: # test/e2e-scenario/docs/README.md

…awn() Iter 9 added the lgtm suppression but with a multi-line justification that pushed the actual marker 3 lines above the spawn() call. CodeQL treats each '//' comment as a separate node, so my marker on line N followed by 2 continuation comments on N+1, N+2 was not recognized as a suppression for the spawn() on N+3. The new alert 793 fired on the post-iter-9 code at line 113. Restructured to match src/lib/runner.ts:101-104 exactly: contextual explanation as plain comments above, then the bare lgtm directive on the immediately-preceding line. Justification (literal script body, NUL-validated positional argv, hard-coded bash binary) is already in spawnBash's doc-comment, which is the documentation site of record for the contract. Verified: typecheck:cli clean; vitest 366 green.

…#4996) ## Summary Bash is not the direction the E2E scenario stack is heading, and the vitest-runner migration underway makes that even clearer — none of these bash files have a future role. This PR sweeps out a set of inert tombstones and orphan helpers under `test/e2e-scenario/runtime/` that have no production callers, so the migration lands on a smaller surface. Net: 6 file deletions + minor test trimming. ~250 lines removed. No production behavior change. ## Changes **Tombstone bash entrypoints (deleted; bodies were just `exit 2` stubs):** - `test/e2e-scenario/runtime/run-scenario.sh` - `test/e2e-scenario/runtime/run-suites.sh` **Orphan helper libs (deleted; zero callers in production):** - `test/e2e-scenario/runtime/lib/artifacts.sh` — only caller was a self-test in `e2e-lib-helpers.test.ts`. - `test/e2e-scenario/runtime/lib/cleanup.sh` — was a thin re-export of `sandbox-teardown.sh`. - `test/e2e-scenario/runtime/lib/negative.sh` — superseded by `scenarios/orchestrators/negative-matcher.ts`. - `test/e2e-scenario/runtime/lib/port-holder.sh` — superseded by typed probes. **Test cleanup in `test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts` (rides along):** - Dropped `artifact_helper_should_collect_known_logs_without_failing_when_missing` — it was the only caller of `artifacts.sh`. - Dropped `gateway_helper_should_report_unhealthy_gateway_clearly` — it sourced `${RUNTIME_LIB}/gateway.sh`, which does not exist on main. The test was passing only by accident: the missing-file stderr happens to match `/gateway/i`. - Dropped the dead `E2E_DRY_RUN: "1"` env arg in `older_base_image_should_emit_dockerfile_pointing_at_tagged_base`. PR #4380 deleted the `e2e_env_is_dry_run` short-circuit from `older-base-image.sh`, so the env var is unread test scaffolding now. **Comment hygiene:** - `nemoclaw_scenarios/probes/sandbox-absent.sh` — dropped the "Typed replacement for the legacy `run-scenario.sh` inline check" note now that the legacy file is deleted. **Intentionally untouched:** - `test/e2e-scenario-advisor.test.ts` still references `test/e2e-scenario/runtime/run-scenario.sh` as a string fixture. The advisor never reads the file — it filters changed-file paths by shape — so the fixture continues to validate the same plumbing. Refreshing those fixtures is out of scope. - `test/e2e-scenario/docs/{README,MIGRATION}.md` reference the old bash entrypoints. Doc updates are out of scope; the docs already mark these paths as deprecated. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Notes on verification: - Targeted vitest run is green: `npx vitest run test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts test/e2e-scenario-advisor.test.ts test/e2e-scenario/` — 3 test files, 106 tests, 0 failures. - `npx prek run --all-files` reports the same `Test (CLI)` flakes that exist on a clean `origin/main` checkout (timing-sensitive tests in `nemoclaw-start.test.ts`, `doctor-gateway-token.test.ts`, `e2e-fixture-context.test.ts`, `runtime-hermes-secret-boundary-behavioural.test.ts`). All pass on isolated re-run; none reference any of the deleted bash files. Pre-existing — not caused by this PR. Leaving the verification box unchecked because the local prek run wasn't fully clean even though the issue is unrelated. --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Merge remote-tracking branch 'origin/main' into feat/e2e-real-execution

2716ab0

# Conflicts: # test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread test/e2e-scenario/scenarios/orchestrators/phase.ts

Comment thread test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh Outdated

jyaunches added the v0.0.55 label May 28, 2026

jyaunches added 3 commits May 27, 2026 21:21

fix(e2e): drop extra trailing newline (end-of-file-fixer)

903f90b

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh Outdated

Comment thread test/e2e-scenario/scenarios/compiler.ts Outdated

jyaunches added 8 commits May 27, 2026 22:19

style(e2e): shfmt indentation on preflight-passed continuation lines

1d128a1

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

github-advanced-security AI found potential problems May 28, 2026

View reviewed changes

Comment thread test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh Fixed

Comment thread test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh Fixed

jyaunches added 4 commits May 28, 2026 07:47

jyaunches marked this pull request as draft May 28, 2026 12:14

jyaunches added 3 commits May 28, 2026 08:16

cv mentioned this pull request Jun 8, 2026

Adopt Vitest fixtures as the E2E scenario execution model #4941

Open

7 tasks

jyaunches added 2 commits June 8, 2026 11:36

jyaunches marked this pull request as ready for review June 8, 2026 15:59

jyaunches added 2 commits June 8, 2026 12:02

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh Outdated

jyaunches added 2 commits June 8, 2026 13:27

Merge remote-tracking branch 'origin/main' into feat/e2e-real-execution

c8ae688

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

github-advanced-security AI found potential problems Jun 8, 2026

View reviewed changes

Comment thread test/e2e-scenario/scenarios/probes/util.ts Fixed

jyaunches added 3 commits June 8, 2026 13:55

Merge remote-tracking branch 'origin/main' into feat/e2e-real-execution

415c577

Merge remote-tracking branch 'origin/main' into feat/e2e-real-execution

8c6261d

# Conflicts: # test/e2e-scenario/docs/README.md

github-advanced-security AI found potential problems Jun 8, 2026

View reviewed changes

Comment thread test/e2e-scenario/scenarios/probes/util.ts

jyaunches merged commit 282876a into main Jun 8, 2026
33 checks passed

jyaunches deleted the feat/e2e-real-execution branch June 8, 2026 18:34

This was referenced Jun 8, 2026

fix(e2e): gate scenario fan-out by onboarding+secret support contract #4978

Closed

test(e2e): add scenario fixture context #4965

Merged

cv mentioned this pull request Jun 8, 2026

Consolidate E2E fixture shell probe with scenario spawn infra #4988

Open

This was referenced Jun 8, 2026

Adopt phase fixtures + registry-driven test discovery for Vitest E2E scenarios #4990

Open

chore(e2e): delete deprecated bash entrypoints and orphan helper libs #4996

Merged

jyaunches mentioned this pull request Jun 9, 2026

Vitest E2E coverage: assert NemoClaw's high-value behavioral catalog #4999

Open

Conversation

jyaunches commented May 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What changed

Orchestrator (TypeScript)

Workflow

Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)

Shell scripts (the leaves stay, the dry-run skip blocks die)

Legacy YAML resolver retired (rolled in from the planned follow-up PR)

Tests

Docs

Merge with main

Verification

Spec gates addressed

Known follow-ups (not blocking this PR)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Scenario Advisor Recommendation

E2E Scenario Advisor

Required scenario E2E

Optional scenario E2E

Relevant changed files

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Advisor

🛠️ Needs attention

🔎 Worth checking

🌱 Nice ideas

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

copy-pr-bot Bot commented May 28, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

jyaunches commented May 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading

github-actions Bot commented May 28, 2026 •

edited

Loading

github-actions Bot commented May 28, 2026 •

edited

Loading

github-actions Bot commented May 28, 2026 •

edited

Loading