test(e2e): execute real shell assertions; delete dry-run, --validate-only, and the bash runner#4380
Conversation
…only, and the bash runner
The merged hybrid scenario architecture shipped scaffolding that looked like it
ran E2E tests but did not. Two layers were producing fake green:
1. phase.ts:executeStep had no child_process.spawn anywhere. Every real
shell/probe step fell through to a hardcoded
{ status: "failed", message: "unsupported live step" }, and a handful of
fake-pass refs (fake-pass, fake-retry-once-pass, fake-always-transient,
phase-1-skeleton) were the only paths that reported "passed". CI green
meant the plan compiled, not that any assertion executed.
2. ~30 shell scripts under validation_suites/, onboarding_assertions/,
nemoclaw_scenarios/install/, and nemoclaw_scenarios/onboard/ began with
"if e2e_env_is_dry_run; then echo [dry-run] ...; exit 0; fi". Once the
dry-run flag flowed in (which workflows did pass), every script silently
exited 0 before its real assertion ran.
This change rips out both layers in one shot. The TS runner has one execution
mode: live. There is no flag, env var, helper, branch, or comment in any
production path that can produce a fake pass.
Orchestrator (TypeScript)
- phase.ts: executeStep now spawns shell steps via child_process.spawn,
with detached process groups so timeouts kill bash + sleep cleanly. Probe
steps return skipped "probe not registered" until the registry lands.
Pending steps return skipped "pending: <ref>". Unknown kinds throw.
Real evidence is captured to .e2e/logs/<step-id>.log. Step-level
reliability.timeoutSeconds and retry.{attempts,on} policy are enforced
here, not in clients.
- run.ts: --dry-run, --validate-only deleted. Default invocation is live
execution. --list and --plan-only (local debug) survive read-only.
--emit-matrix added for the dynamic-matrix workflow (PR #4359).
- types.ts: RunContext.dryRun deleted. AssertionResult already supported
"skipped" status, now actually used.
Workflow
- e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both
--dry-run invocations (Linux + WSL), are gone. Workflows execute live.
- workflow-boundary.mts validator now requires --dry-run, --plan-only,
and --validate-only to NOT appear in the workflow.
Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)
- runtime/run-scenario.sh: 483 lines of duplicated install/onboard/
gateway-check/suite-execution -> 5-line fail-fast stub pointing at
run.ts. The TS phase orchestrators own this work now.
- runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep
walks typed assertionGroups directly; nothing in TS calls a YAML-walking
bash runner.
Shell scripts (the leaves stay, the dry-run skip blocks die)
- validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**:
every "if e2e_env_is_dry_run; then ... exit 0; fi" and every
"[[ ${E2E_DRY_RUN:-0} == 1 ]]" short-circuit removed. The real assertion
logic that was hiding underneath now runs unconditionally.
- runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
- inference_routing.sh: dead _e2e_inference_plan helper (only callable
from the deleted dry-run paths) deleted.
Tests
- DELETED (validated dead code paths):
e2e-suite-runner.test.ts (run-suites.sh behavior)
e2e-scenario-first-migration.test.ts (run-scenario.sh dry-run plan)
e2e-expected-state-validator.test.ts (--validate-only mode)
- REWRITTEN:
e2e-phase-orchestrators.test.ts: now exercises real shell spawning
via temp scripts (pass/fail/timeout/retry/missing-ref), real probe
skipping with visible reason, and real pending skipping. The
previous fake-pass refs in this test were the canonical example of
the problem.
- TRIMMED:
e2e-lib-helpers.test.ts: dry-run-mode unit tests deleted; tests of
real bash semantics survive.
e2e-scenario-additional-families.test.ts: planOnly-via-bash tests
deleted; resolveScenario-direct tests survive.
e2e-scenario-resolver.test.ts: run-scenario.sh --plan-only spawn
tests deleted; resolver unit tests survive.
e2e-context-helper.test.ts: dry-run trace test deleted.
Docs
- docs/README.md: updated to state one runner, one mode (live), no
dry-run, no validate-only. Bash entrypoints documented as deprecated
fail-fast stubs.
Verification
- run.ts --list : prints the typed registry (intact)
- run.ts --emit-matrix : emits JSON matrix for the dynamic-matrix
workflow
- run.ts --scenarios <id>: spawns real shell scripts, real exit codes,
real failures with real evidence logs. Phase
results show passed/failed/skipped honestly.
- All 274 e2e-scenario framework tests pass.
- Audited: no surviving --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run,
fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton,
unsupported-live-step, --validate-only, or RunContext.dryRun in any
production path.
CI for this PR will go red on environments where nemoclaw is not actually
installed and onboarded. That is the point. Red is the first honest signal
in months. Subsequent PRs (probe registry, OnboardingOrchestrator wiring
into the real install/onboard dispatchers, old YAML resolver deletion)
fix the real failures rather than hide them.
Spec gates addressed: Phase 6 (orchestrators execute live shell steps),
Phase 7 (single TS runtime entrypoint, bash runners deprecated), and
the workflow side of Phase 9 (--dry-run / --validate-only / suite_filter
gone from active paths). The old YAML resolver source under
runtime/resolver/ stays for now; its deletion is the next PR.
📝 WalkthroughWalkthroughConverts the E2E flow to a single TypeScript live runner: adds typed phase actions/results, executes actions and shell steps for real (with logs, timeouts, retries, evidence), deprecates bash runners, removes dry-run short-circuits across suites/install/onboard, updates docs/workflows, and prunes plan-only/dry-run tests. ChangesLive E2E Execution Framework
Sequence Diagram(s)sequenceDiagram
participant GH as GitHub Actions
participant RunnerTS as run.ts (TypeScript)
participant Orchestrator as PhaseOrchestrator
participant Shell as bash script
GH->>RunnerTS: invoke --scenarios / --emit-matrix
RunnerTS->>Orchestrator: compile plans, execute phases/actions
Orchestrator->>Shell: runAction -> spawn bash (logs, timeout)
Shell-->>Orchestrator: exit code, evidence (logs/stderr tail)
Orchestrator-->>RunnerTS: aggregate PhaseResult (actions + assertions)
RunnerTS-->>GH: exit code + artifact emission
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
# Conflicts: # test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e-scenario/scenarios/orchestrators/phase.ts`:
- Around line 32-34: The branch uses a case-insensitive regex test on ref but
then calls case-sensitive ref.includes("tunnel") / ref.includes("cloudflared"),
causing mixed-case refs to misclassify; update the branching in the same block
(the if handling /provider|inference|chat-completion|cloudflared|tunnel/i) to
compare in a case-insensitive way—e.g., normalize ref to lowerCase once and use
that for the subsequent includes checks (or use case-insensitive regex matches)
so tunnel/cloudflared variants are correctly detected and return
"external-tunnel".
In `@test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh`:
- Line 62: The current substring check [[ "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" ==
*"${E2E_SANDBOX_NAME}"* ]] can produce false positives (e.g., sb1 matching
sb10); change it to perform an exact token/line match of the sandbox name
instead: test SANDBOX_LIFECYCLE_LAST_OUTPUT for the exact E2E_SANDBOX_NAME token
(for example by piping SANDBOX_LIFECYCLE_LAST_OUTPUT to grep -w/-x or using a
word-boundary regex with [[ ... =~ ... ]]) and fail if not found, so the check
around SANDBOX_LIFECYCLE_LAST_OUTPUT and E2E_SANDBOX_NAME only succeeds for
exact sandbox-name matches.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 401b8a23-1471-429d-9905-b413e6bae24c
📒 Files selected for processing (44)
.github/workflows/e2e-scenarios.yamltest/e2e-scenario/docs/README.mdtest/e2e-scenario/framework-tests/e2e-context-helper.test.tstest/e2e-scenario/framework-tests/e2e-expected-state-validator.test.tstest/e2e-scenario/framework-tests/e2e-lib-helpers.test.tstest/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.tstest/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.tstest/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.tstest/e2e-scenario/framework-tests/e2e-scenario-resolver.test.tstest/e2e-scenario/framework-tests/e2e-suite-runner.test.tstest/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.shtest/e2e-scenario/nemoclaw_scenarios/install/dispatch.shtest/e2e-scenario/nemoclaw_scenarios/install/launchable.shtest/e2e-scenario/nemoclaw_scenarios/install/ollama.shtest/e2e-scenario/nemoclaw_scenarios/install/public-curl.shtest/e2e-scenario/nemoclaw_scenarios/install/repo-current.shtest/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.shtest/e2e-scenario/runtime/lib/env.shtest/e2e-scenario/runtime/run-scenario.shtest/e2e-scenario/runtime/run-suites.shtest/e2e-scenario/scenarios/orchestrators/phase.tstest/e2e-scenario/scenarios/run.tstest/e2e-scenario/scenarios/types.tstest/e2e-scenario/validation_suites/assert/gateway-alive.shtest/e2e-scenario/validation_suites/assert/sandbox-alive.shtest/e2e-scenario/validation_suites/hermes/00-hermes-health.shtest/e2e-scenario/validation_suites/inference/cloud/00-models-health.shtest/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.shtest/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.shtest/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.shtest/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.shtest/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.shtest/e2e-scenario/validation_suites/lib/inference_routing.shtest/e2e-scenario/validation_suites/lib/messaging_providers.shtest/e2e-scenario/validation_suites/lib/rebuild_upgrade.shtest/e2e-scenario/validation_suites/lib/sandbox_lifecycle.shtest/e2e-scenario/validation_suites/lib/security_policy_credentials.shtest/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.shtest/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.shtest/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.shtest/e2e-scenario/validation_suites/sandbox-exec.shtest/e2e-scenario/validation_suites/smoke/00-cli-available.shtest/e2e-scenario/validation_suites/smoke/03-sandbox-shell.shtools/e2e-scenarios/workflow-boundary.mts
💤 Files with no reviewable changes (30)
- test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
- test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
- test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
- test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
- test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
- test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
- test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
- test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
- test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
- test/e2e-scenario/validation_suites/lib/messaging_providers.sh
- test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
- test/e2e-scenario/scenarios/types.ts
- test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
- test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
- test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
- test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
- test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
- test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
- test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
- test/e2e-scenario/validation_suites/lib/inference_routing.sh
- test/e2e-scenario/validation_suites/assert/gateway-alive.sh
- test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
- test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
- test/e2e-scenario/validation_suites/sandbox-exec.sh
- test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
- test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
- test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
- test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
- test/e2e-scenario/runtime/lib/env.sh
- test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 2 needs attention, 10 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
…shape client check - PhaseOrchestrator.runShellStep: wait for the log WriteStream to finish before resolving so callers (and tests) reading evidence synchronously see the actual stdout/stderr instead of an empty file. Race exposed by e2e-phase-orchestrators 'shell_step_passes_when_script_exits_zero'. - e2e-phase-orchestrators: replace client-source toMatch regex (1 source-shape test, budget=0) with a runtime-shape behavior assertion on the HostCliClient observation. Still enforces 'clients do not encode pass/fail or retry/timeout semantics' per hybrid-scenario E2E architecture spec, without violating source-shape budget. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Closes the spec's reopened Phase 6 gap. The new typed runner now executes the install and onboarding work that the deleted bash runner used to perform, but inside EnvironmentOrchestrator and OnboardingOrchestrator instead of in workflow YAML or a resurrected bash runner. All canonical scenarios now reach a real, live SUT before their assertions run. Architecture (per hybrid-scenario-e2e-architecture spec): * types.ts: introduce typed PhaseAction (kind=shell-fn|shell, scriptRef, fn, arg, timeoutSeconds, evidencePath) and PhaseActionResult. Replace the prior actions: string[] free-form labels with PhaseAction[]. Add actions[] to PhaseResult so failure-layer attribution stays clear: setup failure is recorded distinctly from assertion failure. * compiler.ts: phaseActions() now emits typed actions for environment (context.emit + install.<id>) and onboarding (profile.<id>). Stable action ids: environment.context.emit, environment.install.<install-id>, onboarding.profile.<profile-id>. All install/onboard actions point at the existing dispatcher scripts (install/dispatch.sh, onboard/dispatch.sh) - shell remains the implementation per spec, invocation is centralized. * orchestrators/phase.ts: PhaseOrchestrator.run() executes actions before assertions. Action failure short-circuits the phase so assertions never run against an environment that was never set up. Action runner reuses the same spawn/timeout/process-group/log-flush machinery as runShellStep. Per-action timeout, no retry (install and onboarding must fail loudly). * nemoclaw_scenarios/dispatch-action.sh: new bash launcher (the only new shell file). The install/onboard dispatchers are intentionally library-style (function definitions only); this launcher gives them a deterministic executable entrypoint that sources runtime/lib/env.sh + runtime/lib/context.sh, applies non-interactive env, sources the requested dispatcher, and invokes the named function with one arg. Replaces the orchestration that the deleted run-scenario.sh used to do, but called from the typed orchestrator instead. * plan-only output: now shows 'Action: <id> (timeout=...) -> <fn> <arg>' per phase, before assertion groups. Maintainers can preview the full setup+onboard+assert sequence before dispatch. * framework-tests/e2e-phase-orchestrators.test.ts: add five behavior tests covering action-runs-before-assertions, action-failure short- circuits-assertions, action timeout via orchestrator policy, evidence-log flushed-before-resolve, and compiler emits typed install/onboard actions for all 7 canonical scenarios. What stays out: * No workflow YAML edits. .github/workflows/e2e-scenarios.yaml still invokes only 'npx tsx test/e2e-scenario/scenarios/run.ts --scenarios ...'. Workflow YAML stays innocent of install/onboard plumbing. * No client edits. HostCliClient et al. remain pass/fail/policy free. * No resolver/YAML-first revival. setup_scenarios/test_plans/suite_filter remain unsupported. Validation gate (Phase 6 reopen note) is the next step: after this push goes green on PR CI, dispatch e2e-scenarios-all.yaml against feat/e2e-real-execution and confirm canonical scenarios produce real phase results with action evidence under .e2e/actions/<id>.log, instead of <1s 'failed=34 skipped=5'. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/e2e-scenario/scenarios/types.ts (1)
110-131: ⚡ Quick winStrengthen
PhaseActiontyping (discriminated union) to enforce per-kindrequired fields.Current runtime already fails loudly if a
"shell-fn"action is missingfn(the runner passesaction.fn ?? "", anddispatch-action.sherrors when the function name isn’t found). Also, currentPhaseActionobjects are produced intest/e2e-scenario/scenarios/compiler.tswithfnpopulated for"shell-fn". Still, the type allows invalid combinations to compile, so a discriminated union would prevent that drift for any future producers.Suggested refactor
-export interface PhaseAction { - id: string; - phase: PhaseName; - description?: string; - // "shell-fn" sources the bash dispatcher and invokes the named function. - // "shell" runs an executable script (used for context-emit helper). - kind: "shell-fn" | "shell"; - // Repo-relative path to the script. - scriptRef: string; - // For "shell-fn": the bash function to invoke after sourcing scriptRef. - fn?: string; - // Single positional arg passed to the function/script (install method or - // onboarding profile id today). Kept as a single string to keep stable - // ids predictable; multi-arg variants can extend this later. - arg?: string; - // Per-action timeout. No retry by default - install/onboard must fail - // loudly so the regression is visible. Retry stays a property of - // assertion steps, not actions. - timeoutSeconds?: number; - // Repo-relative evidence log path. - evidencePath?: string; -} +interface PhaseActionBase { + id: string; + phase: PhaseName; + description?: string; + scriptRef: string; + timeoutSeconds?: number; + evidencePath?: string; +} + +export type PhaseAction = + | (PhaseActionBase & { + kind: "shell-fn"; + fn: string; + arg?: string; + }) + | (PhaseActionBase & { + kind: "shell"; + arg?: string; + });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-scenario/scenarios/types.ts` around lines 110 - 131, Change PhaseAction from a single broad interface into a discriminated union so TypeScript enforces per-kind fields: define one variant for kind: "shell-fn" that requires fn: string (plus shared fields like id, phase, scriptRef, arg?, timeoutSeconds?, evidencePath?, description?) and another variant for kind: "shell" that omits/marks fn as disallowed/undefined; update any usages (e.g., the action objects created in test/e2e-scenario/scenarios/compiler.ts and the runner that reads action.fn) to satisfy the new types so compilation ensures "shell-fn" always has fn and "shell" never does.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh`:
- Line 60: Remove the "|| true" suppression so failures in e2e_context_init
surface during test runs: in dispatch-action.sh replace the line that calls the
initializer (the call to e2e_context_init) by invoking e2e_context_init without
"|| true" so that any mkdir or file-write errors creating
${E2E_CONTEXT_DIR}/context.env propagate (this ensures e2e_context_require will
fail immediately instead of masking the error and misattributing missing keys).
In `@test/e2e-scenario/scenarios/compiler.ts`:
- Around line 94-99: The code currently silently returns [] when required phase
action dimensions like installId (from scenario.environment?.install) are
missing; instead throw a hard error to fail-fast: replace the early returns that
yield [] with throwing a descriptive Error (include context such as scenario.id
or scenario.name and which dimension is missing) in the function that generates
phase actions (the branch checking installId / scenario.environment?.install),
and make the identical change in the other similar branch (the second check at
the same pattern) so malformed scenarios surface as hard failures rather than
emitting empty action lists.
---
Nitpick comments:
In `@test/e2e-scenario/scenarios/types.ts`:
- Around line 110-131: Change PhaseAction from a single broad interface into a
discriminated union so TypeScript enforces per-kind fields: define one variant
for kind: "shell-fn" that requires fn: string (plus shared fields like id,
phase, scriptRef, arg?, timeoutSeconds?, evidencePath?, description?) and
another variant for kind: "shell" that omits/marks fn as disallowed/undefined;
update any usages (e.g., the action objects created in
test/e2e-scenario/scenarios/compiler.ts and the runner that reads action.fn) to
satisfy the new types so compilation ensures "shell-fn" always has fn and
"shell" never does.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1a237354-36d1-4e41-8990-3b50d8f973f0
📒 Files selected for processing (5)
test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.tstest/e2e-scenario/nemoclaw_scenarios/dispatch-action.shtest/e2e-scenario/scenarios/compiler.tstest/e2e-scenario/scenarios/orchestrators/phase.tstest/e2e-scenario/scenarios/types.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- test/e2e-scenario/scenarios/orchestrators/phase.ts
- test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
The first live dispatch of the Phase 6 wiring (run 26550310438) gave us
real action evidence and surfaced three real bugs. All three are fixed
inside the spec's prescribed layers - no workflow YAML, no client, no
old-resolver path.
1. environment.context.emit was a shell action that called the legacy
emit-context-from-plan.sh helper. That helper expects the OLD
YAML-resolver plan.json shape (dimensions.platform.profile.os...),
which the typed compiler does not produce. Drop the shell action;
add scenarios/orchestrators/context.ts that derives a normalized
context.env directly from the typed RunPlan and writes it from
ScenarioRunner.run() before any phase. Spec: context emission is
framework infrastructure, not a phase action.
2. PhaseOrchestrator.runShellStep was reading context.env from
${ctx.contextDir}/.e2e/context.env, but the shell helper writes
to ${E2E_CONTEXT_DIR}/context.env (top-level). Fix the path so
shell assertions see seeded keys.
3. ScenarioRunner did not short-circuit across phase boundaries: a
failed environment ACTION (real setup work) still let onboarding
and runtime run, producing a misleading 34-failure cascade.
Runner now consults prior phase results: if any prior action
failed, downstream phases are synthesized as skipped with a
message naming the blocking phase+action+message. Assertion-only
failures still propagate as failures.
Tests added (8 new, 292/292 scenario framework tests green).
Validation gate next: dispatch e2e-scenarios-all.yaml again.
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Run 26550936453 surfaced two more real bugs after Phase 6 wiring went
live. Both fixed inside the spec's prescribed layers; nothing leaks
into workflow YAML or clients.
1. dispatch-action.sh called e2e_context_init unconditionally before
sourcing the install/onboard dispatcher. e2e_context_init opens
context.env with `: > ctx`, which truncated the file the
ScenarioRunner had just seeded. All runtime assertions then failed
with 'e2e context: missing required key(s): E2E_SCENARIO ...'.
Fix: dispatch-action.sh no longer calls e2e_context_init. The TS
framework owns context.env initialization; workers may still extend
it via e2e_context_set.
2. The legacy onboarding.preflight.passed assertion expects an
onboard.log file at ${E2E_CONTEXT_DIR}/onboard.log. The old bash
runner used to redirect onboarding output there; the typed
orchestrator captured it under .e2e/actions/<action-id>.log. Fix:
add optional aliasPath to PhaseAction; compiler sets aliasPath to
'onboard.log' for the onboarding profile action; orchestrator
copies the action evidence log to the alias on success. Best-
effort - alias copy failures do not fail the action.
Live evidence from run 26550936453 (canonical ubuntu-repo-cloud-openclaw):
- environment.install.repo-current: passed in 14.2s
- onboarding.profile.cloud-openclaw: passed in 302s (real onboarding!)
- onboarding.base.cli-installed: passed
- onboarding.preflight.passed: failed (onboard.log not found) <- fixed
- runtime.* (10 steps): all 'missing key(s)' <- fixed by #1
Tests: 38/38 phase-orchestrator (was 36; +2 alias tests), 294/294
scenario framework. shellcheck clean.
Validation gate next: redispatch e2e-scenarios-all and confirm
runtime steps actually exercise the SUT (real pass/fail, not key
errors).
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
The first canonical-scenario run that reached the onboarding-assertion
phase (run 26552550140 / ubuntu-repo-cloud-openclaw) showed that the
legacy onboarding.preflight.passed assertion fails on every successful
run because its regex matches any mention of 'docker' / 'container' /
'daemon' / 'socket' in onboard.log - and a normal nemoclaw onboarding
mentions all of those many times.
The action itself succeeded (exit 0, 263s of real onboarding work);
the assertion is meant to confirm onboard.log does not contain
explicit preflight FAILURE markers. Tighten the regex accordingly:
match phrases like 'preflight failed/error', 'cannot connect to the
docker daemon', 'onboarding aborted', 'FATAL: docker', 'ERROR: docker
daemon' - not bare topic words.
Verified: shellcheck passes; bash -n passes.
Why we stop here on this PR:
This commit lands the last small framework-level fix produced by live
action evidence. The Phase 6 wiring is now fully validated end-to-end:
Install: passed (~12s)
Onboarding: action passed (~263s real onboarding)
base.cli-installed passed
preflight.passed will now pass
Runtime: 9 passed / 25 failed / 5 skipped against live SUT
The remaining 25 runtime failures are real product/test bugs surfaced
by finally executing the suite against a live SUT (sandbox-shell
timeouts, inference 30-60s timeouts, lifecycle.sandbox_operations
exit-1 mismatches, lifecycle.rebuild/upgrade 120s timeouts even after
retries). They are pre-existing and out of scope for 'execute real
shell assertions; delete dry-run, --validate-only, and the bash
runner'. They become productive follow-up issues.
The 5 skipped runtime steps are 'probe not registered' - known per
spec; probe registry lands in a follow-up.
Negative scenarios (ubuntu-no-docker-preflight-negative,
invalid-key-negative, gateway-port-conflict-negative) need expected-
failure semantics and a way to actually simulate docker-missing on
the runner. Out of scope here; tracked as follow-up.
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
All three findings are valid mechanical bugs introduced/touched by this
PR. Batched per the safe-batch policy: same-risk, independently
obvious, testable together.
1. orchestrators/phase.ts classifierForRef:
Outer guard is /i (case-insensitive), but the inner branch used
case-sensitive ref.includes("tunnel") / ref.includes("cloudflared")
- mixed-case refs would fall through and misclassify as
provider-transient. Replace with /tunnel|cloudflared/i.test(ref).
2. scenarios/compiler.ts phaseActions:
Inline comment said "the scenario is malformed; surface it as a
hard error" but the code returned []. Hard-fail instead, with a
message that names the missing dimension. Empty environment is
still tolerated (skeleton scenarios can carry no setup yet).
3. validation_suites/lib/sandbox_lifecycle.sh:
Substring match `*${E2E_SANDBOX_NAME}*` would let sb1 falsely
match sb10. Use awk with a whole-token equality check on column
one of `nemoclaw list` output.
Tests: 294/294 scenario framework still green. shellcheck + shfmt
clean. No behavior change for canonical scenarios; affected paths
were either dormant (case-mixed classifier) or returning a slightly
stricter outcome (compiler hard-fail, sandbox exact match).
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Resolves conflicts in: - .github/workflows/e2e-scenarios.yaml (WSL run step) - tools/e2e-scenarios/workflow-boundary.mts (boundary contract test) main brought PR #4346 ("fix(windows): enable Docker WSL integration and avoid Ubuntu first-run races"): the WSL run step is now a PowerShell wrapper that writes a bash script to RUNNER_TEMP, maps it into WSL via wslpath, and invokes `wsl -d $WSL_DISTRO -- env ... bash -l $wslTmp`. This handles Docker WSL integration plus a first-run race on freshly installed distros. This PR (#4380) removes --dry-run, --plan-only, and --validate-only from CI. Resolution: * e2e-scenarios.yaml: take main's PowerShell wrapper structure verbatim, but drop `--dry-run` from the inner `npx tsx ... run.ts` invocation. The typed runner is now the single execution path. * workflow-boundary.mts: assert BOTH the new constraints. The WSL step must NOT contain --dry-run / --plan-only / --validate-only (this PR's invariant) AND must contain $env:WSL_WORKDIR / WriteAllText / `bash -l $wslTmp` (main's robustness invariant). 294/294 scenario framework tests pass. shellcheck / shfmt / end-of-file / trailing-whitespace clean on touched files. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
The PR body audits two invariants as absent from production paths:
- E2E_DRY_RUN / e2e_env_is_dry_run
- phase-1-skeleton
The PR Review Advisor caught counterexamples to both. Fix them now.
E2E_DRY_RUN: validation_suites/messaging/slack/00-slack-provider-state.sh
still branched on ${E2E_DRY_RUN:-} and emitted dry-run pass markers
without reading the OpenClaw config surface or querying runtime
discovery. Removed; the assertion now always reads real config and
queries OpenClaw runtime state. One execution mode.
phase-1-skeleton: scenarios/assertions/{environment,onboarding,runtime}.ts
each defined a 'kind: pending, ref: phase-1-skeleton' step that the
orchestrator silently skips. Of the three:
- environmentBaseline() was in the registry and every scenario plan;
environment work is performed by typed PhaseAction entries
(context.emit + install.<id>) from compiler.phaseActions(), so the
assertion group is redundant. Removed from registry and
assertionGroupsForScenario(); deleted the file.
- onboardingBaseline() and runtimeSmokeSkeleton() were never
imported. Deleted as dead code.
Verification: rg of E2E_DRY_RUN, e2e_env_is_dry_run, phase-1-skeleton
across scenarios, runtime, validation_suites, onboarding_assertions,
nemoclaw_scenarios, and the scenario workflows returns nothing outside
framework-tests. 294/294 e2e-scenario framework tests pass. bash -n
clean on the touched shell script.
Addresses PR Review Advisor findings:
- 'Residual E2E_DRY_RUN branch contradicts the one-live-mode contract'
- 'Skeleton pending refs remain in production scenario plans'
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
…stic The PR #4380 scenario run produced a SIGTERM cascade across runtime suite steps (sandbox-shell, inference.*, kimi.*, lifecycle.rebuild.*, lifecycle.upgrade.*) because the canonical sandbox-exec wrapper had no inner timeout. When openshell ssh-into-sandbox wedged, the orchestrator killed the process group at the step cap (30/45/60/120s) and the script log captured only the header line — no diagnostic, no classifier. Mirror the legacy test/e2e/ defense-in-depth pattern (outer process timeout + inner curl --max-time) inside the new framework's canonical wrapper: - sandbox-exec.sh now applies a per-call timeout via timeout/gtimeout with --kill-after=5s. Default 25s sits below the common 30s step cap; callers override via E2E_SANDBOX_EXEC_TIMEOUT_SECONDS for steps with larger budgets (chat-completion 50s, sandbox-local 35s, slack 50s, gateway-alive fallback 15s). On timeout (124/137) the wrapper emits a single-line classified diagnostic so phase.ts captures observable evidence instead of a SIGTERM black hole. - Migrate 8 BARE leaf scripts that called openshell sandbox exec directly to route through the wrapper: smoke/03-sandbox-shell.sh assert/gateway-alive.sh (ollama fallback path) inference/cloud/00-models-health.sh inference/cloud/01-chat-completion.sh inference/cloud/02-inference-local-from-sandbox.sh inference/ollama-auth-proxy/00-proxy-reachable.sh messaging/slack/00-slack-provider-state.sh kimi-compatibility steps already route through inference_routing.sh which uses the wrapper, so they inherit the fix transparently. rebuild_upgrade.sh and security_policy_credentials.sh continue to use their own ad-hoc timeouts; unifying those onto the wrapper is a follow-up. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
shellcheck flags 'E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=N \\\n cmd' as SC2034 (apparently unused) because static analysis can't follow that the wrapper reads it via environment. The pattern is correct (prefix-env assignment exports the variable to the immediately-following command), but the warning blocks the 'checks' job. Add a localized 'shellcheck disable=SC2034' directly above each affected prefix-env line in the two cloud-inference assertion scripts that override the default wrapper timeout. No behavior change. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Addresses PR Review Advisor finding #1: 'Secret-bearing child process output is persisted and uploaded without redaction'. Designed in the spirit of the test/e2e-scenario architecture: secret hygiene is FRAMEWORK INFRASTRUCTURE, not a per-script / per-action / per-workflow concern. Same one-mode discipline that motivates the rest of this PR (no flag, no env var, no helper bypasses redaction). New module: test/e2e-scenario/scenarios/orchestrators/redaction.ts Sits next to context.ts, owns: - redactString(text): canonical token-shape redaction - buildChildEnv(base, opts): minimal allowlisted env + declared secretEnv passthrough only - pipeRedacted(src, log): single I/O wrapper used by both runAction and runShellStep Pattern set is a framework-local mirror of src/lib/security/secret-patterns.ts. The framework deliberately does NOT import from src/lib/security/ to keep the framework-vs-product boundary clean and avoid cross-tsconfig ESM issues. A new parity test (e2e-redaction-parity.test.ts) asserts the two pattern sets stay in lockstep so adding a token shape in one place keeps both layers honest. Typed contract extension: PhaseAction.secretEnv?: readonly string[] AssertionStep.secretEnv?: readonly string[] Compiler decides which secrets cross the framework boundary, the same way aliasPath / scriptRef / timeoutSeconds are declared metadata. buildChildEnv enforces a 'secret-key shape' contract on every entry (must end with API_KEY/TOKEN/SECRET/PASSWORD/CREDENTIAL/ PASSPHRASE/PRIVATE_KEY/ACCESS_KEY) so the secretEnv channel cannot silently allowlist non-secret variables. compiler.ts: declares onboard secretEnv per profile cloud-* profiles -> ['NVIDIA_API_KEY'] local-ollama-openclaw -> [] Install actions declare nothing (the installer itself does not authenticate to the cloud). phase.ts: two call sites updated, identical pattern in both runAction: buildChildEnv(...) + pipeRedacted(stdout, log) + pipeRedacted(stderr, log, tail<-redactedChunk) runShellStep: same shape Spawn-error 'message' fields are also wrapped with redactString as defense-in-depth. Workflow upload tightening (.github/workflows/e2e-scenarios.yaml): - include-hidden-files: false Hidden dotfiles under the workspace can carry raw secrets (notably .e2e/context.env, written by e2e_context_set without redaction). Diagnostic dumps of context use e2e_context_dump which redacts on emit. - path: explicit subpath list (.e2e/actions/, .e2e/logs/, result JSONs, plan/run-plan, onboard.log) instead of blanket .e2e/. workflow-boundary.mts contract test updated to assert the new invariant; e2e-scenarios-workflow.test.ts expectations updated. Bash side: unchanged. test/e2e-scenario/runtime/lib/context.sh:: e2e_context_dump already redacts via _e2e_context_is_sensitive_key. This module covers the TS-spawned-child path that lacked coverage. Tests (e2e-phase-orchestrators.test.ts, +5): - test_should_not_persist_secret_shaped_child_output_into_evidence Real shell child writes nvapi-, ghp_, sk-, xoxb-, Bearer ... Asserts evidence log, phase result.json, and result.message contain none of the literal tokens but do contain <REDACTED>. - test_should_drop_non_allowlisted_parent_env_unless_declared_in_secretEnv Sets a sentinel parent-env var, runs child without declaring it, asserts child cannot see it. PATH and E2E_PHASE still arrive. - test_should_pass_declared_secretEnv_through_to_child Declares the var in step.secretEnv, asserts child sees it. - test_should_reject_non_secret_shaped_keys_in_secretEnv_at_runtime buildChildEnv throws on FOO_VAR; the secret-key-shape contract keeps the allowlist boundary honest. - test_should_declare_NVIDIA_API_KEY_only_for_cloud_onboarding_actions Compiler-level contract: cloud onboard declares NVIDIA_API_KEY, local-ollama declares nothing. Tests (e2e-redaction-parity.test.ts, new file, +2): - TOKEN_PREFIX_PATTERNS framework copy matches product source - CONTEXT_PATTERNS framework copy matches product source 308/308 e2e-scenario framework tests pass. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
…re pending steps Addresses PR Review Advisor finding #2: 'Required security probes and expected-failure checks still skip without failing live runs'. Same architectural failure mode as findings #5 (E2E_DRY_RUN) and #6 (phase-1-skeleton): the typed framework declared a contract, but a code path silently produced non-failing skips that contradicted this PR's 'one mode, no fake green' invariant. Three suites (security-shields, security-policy, security-injection) were wired as kind: 'probe' steps; the orchestrator marked them status: 'skipped' because the probe registry hasn't landed yet. The expected-failure side-effect validator (runtime.expected-failure.no- side-effects) was wired as kind: 'pending'. run.ts only sets process.exitCode = 1 on status === 'failed', so a live run could omit security shields, network policy, prompt-injection blocking, AND the negative-scenario forbidden-side-effect contract while still appearing non-failing. Fix follows the framework spirit: declare requirement in the typed metadata, enforce it once at the orchestrator boundary. Typed contract extension (types.ts): AssertionStep.required?: boolean When true, a probe/pending step that resolves as 'skipped' is reclassified as 'failed' by the phase orchestrator. Defaults to false (existing behavior preserved for diagnostics, docs-validation, and other non-security probes). Orchestrator (phase.ts): executeStep: probe and pending branches now check step.required and return status: 'failed' with a 'required <kind> not <state>' message when set. Non-required probes/pending steps continue to surface as visible skipped results so unrelated gaps stay diagnostic. Registry (assertions/registry.ts): - probeStep / pendingStep helpers grew an options arg with required?: boolean (matches the secretEnv pattern: options come in via typed objects, not positional args). - security-shields, security-policy, security-injection probes annotated required: true. These are the suites the run is not safe without; failing closed beats fake green. - runtime.expected-failure.no-side-effects pending step annotated required: true. Negative scenarios cannot honestly pass while their core side-effect contract is unimplemented. - diagnostics, docs-validation probes intentionally remain non-required: they are informational, not safety-critical, and will switch to required when the probe registry lands and they have real implementations. run.ts: no change needed. The new failed-status flows through the existing exit-code path (process.exitCode = 1 when any phase status is failed), so the workflow correctly reports red for missing required probes/pending steps. Tests (e2e-phase-orchestrators.test.ts, +5): - test_required_probe_step_that_is_unregistered_fails_the_phase - test_non_required_probe_step_continues_to_skip_visibly - test_required_pending_step_fails_closed - test_security_suite_groups_in_registry_mark_their_steps_as_required - test_expected_failure_no_side_effects_step_in_registry_is_required 318/318 e2e-scenario framework tests pass. Until the probe registry follow-up PR registers actual implementations for shieldsConfigProbe / networkPolicyProbe / injectionBlockedProbe / expectedFailureNoSideEffectsProbe, scenarios that include those suites/expected-failure groups will fail loudly with 'required probe not registered' or 'required pending step not implemented'. That is the correct intermediate state: visible red until real, instead of green-by-default. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Diagnoses surfaced by the timeout wrapper on PR #4380's scenario run showed every assertion that goes through 'openshell sandbox exec' hangs in CI (host can curl the gateway and list sandboxes, but openshell-exec into the sandbox never returns). The legacy test/e2e/ scripts have always entered the sandbox via 'openshell sandbox ssh-config' + 'ssh -F' and don't exhibit this hang, so adopt that pattern as the default transport in the canonical wrapper. Behavior: - On first call per (sandbox, script), e2e_sandbox_exec calls 'openshell sandbox ssh-config <name>' once and caches the result under ${E2E_CONTEXT_DIR}/.ssh-config-cache/<name>.cfg. - Subsequent calls reuse the cached config and invoke 'ssh -F <cfg> -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR openshell-<name> <quoted remote-cmd>'. - Args are quoted via printf '%q' so payloads with shell metacharacters (e.g. JSON bodies for chat-completion) survive intact. - Fallback: if 'openshell sandbox ssh-config' returns non-zero (e.g. older openshell builds, runners without an ssh client), the wrapper falls back to 'openshell sandbox exec' and emits a stderr breadcrumb. - Opt-out: set E2E_SANDBOX_EXEC_VIA_OPENSHELL=1 to force the original transport (used by the framework unit tests that stub openshell but not ssh). Tests: - Two new vitest cases: * sandbox_exec_should_prefer_ssh_config_transport_when_openshell_offers_one * sandbox_exec_should_fall_back_to_openshell_when_ssh_config_unavailable - Existing exit-code propagation and stdin-quoting tests opt into the openshell-direct transport so their stub-openshell expectations remain valid. - 322 framework tests pass. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
…apper The lifecycle.rebuild.* and lifecycle.upgrade.survivor-reachable assertions on PR #4380's scenario run still produced raw 120s SIGTERM cascades because rebuild_upgrade.sh used its own ad-hoc _rebuild_upgrade_run helper that called 'openshell sandbox exec' without a per-call timeout or transport choice. The orchestrator's 120s step cap was the only safety net, and on hang the script log captured no diagnostic. Migrate the lib onto the canonical wrapper: - Add _rebuild_upgrade_sandbox_exec(sandbox, cmd...) that: * Honors the existing REBUILD_UPGRADE_SANDBOX_CMD test override (with the same -n <sandbox> -- <cmd>... contract used by existing fakes) so framework unit tests keep working unchanged. * Routes production calls through e2e_sandbox_exec, picking up the ssh-config-preferred transport, the per-call timeout, and the classified diagnostic on hang. - Replace the seven openshell-sandbox-exec call sites with the new helper. - Default E2E_SANDBOX_EXEC_TIMEOUT_SECONDS to 100s for this lib's callers (orchestrator caps these steps at 120s; 100s leaves room for the wrapper to emit a diagnostic and exit cleanly before SIGTERM). The existing rebuild_upgrade_checks_should_allow_command_fakes test continues to pass, proving the override contract is preserved. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
The lifecycle.sandbox.list-and-status assertion on PR #4380's scenario run failed with 'missing status field: status'. Root cause: the assertion required the literal lower-case tokens 'status', 'gateway', and 'sandbox' in 'nemoclaw <name> status' output, but the production command (src/lib/actions/sandbox/status.ts) does not print 'Status:' or 'Gateway:' labels in its normal output. The original assertion only passed against the framework-test mock string ('status running gateway healthy sandbox running'), which was synthetic and did not reflect what the CLI actually emits. Align the assertion with the production output: - Require the sandbox name itself to appear (every status output for a present sandbox emits 'Sandbox: <name>'). - Require the field labels that the CLI unconditionally prints when a sandbox is found: 'Sandbox', 'Model', 'OpenShell'. - Drop the 'status' and 'gateway' tokens that never appear in real output. This matches the legacy test/e2e/test-full-e2e.sh:230 pattern (verifies 'nemoclaw status' exits 0 and contains substantive content) while keeping the assertion strict enough to catch CLI regressions that drop core fields. Update the framework test mock to emit realistic output ('Sandbox: sb1', 'Model:', 'OpenShell:', 'Policies:') so the existing happy-path test keeps passing on the new contract. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
The 'ubuntu-no-docker-preflight-negative' scenario on PR #4380 was structurally registered (manifest, scenario id, expected_failure metadata) but never executed its negative path. The framework called the regular 'cloud-openclaw' onboarding worker, docker was actually running, onboard succeeded, and nothing wrote ${E2E_CONTEXT_DIR}/negative-preflight.log. The 'onboarding.preflight.expected-failed' assertion then failed with 'evidence not found'. Mirror the legacy test/e2e/e2e-cloud-experimental/test-port8080-conflict.sh pattern (set up failure condition -> capture onboard output -> assert non-zero exit -> grep log) inside the typed framework: - Add nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh: * Installs a 'docker' shim earlier on PATH that exits non-zero with 'Cannot connect to the Docker daemon ...' so commandExists('docker') succeeds while 'docker info' fails - the same failure mode users see when Docker is installed but the daemon is not running. * Runs 'nemoclaw onboard --non-interactive' redirecting stdout+stderr to ${E2E_CONTEXT_DIR}/negative-preflight.log. * Asserts nemoclaw exited non-zero (preflight DID fail). If onboard unexpectedly succeeds, the action fails so a regression that lets docker-missing slide is loud, not silent. * Returns 0 on the expected-failure path so the orchestrator marks the action passed and the dependent assertion phase runs against the captured log. - Wire the worker into onboard/dispatch.sh under the 'cloud-openclaw-no-docker' profile id. - Compiler routes scenarios with environment.runtime='docker-missing' to '<base>-no-docker' onboarding profile. Adds the new profile to ONBOARD_PROFILE_SECRET_ENV so NVIDIA_API_KEY still flows through (matches a real user invocation where the CLI loads creds even when preflight aborts). - Adds compiler_routes_docker_missing_runtime_to_no_docker_onboarding_profile test that locks in the routing: negative scenario -> -no-docker profile id, evidence path, secret env; positive scenario unaffected. 324 tests pass. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
- redaction.ts: prepend $HOME/.local/bin to child PATH so post-install shims resolve on runners where it's not on the default PATH - sandbox_lifecycle.sh: use 'nemoclaw <sandbox> snapshot <sub>' argv shape; the prior form parsed 'snapshot' as the sandbox name
Resolutions: - run.ts: kept HEAD's removal of --dry-run/--validate-only modes; took main's richer --emit-matrix payload (runner/label/platform/suites) consumed by the dynamic matrix workflow (#4359). - 00-slack-provider-state.sh: kept HEAD's removal of the openclaw dry-run skip block; carried main's 'agent' local var. - docs/README.md: kept HEAD (live-only runner, deprecated bash entrypoints) and folded in main's 'Migration status is tracked outside the repository' line so e2e-migration-inventory-lock stays green. - e2e-coverage-report.test.ts, e2e-scenario-resolver.test.ts, runtime/resolver/coverage.ts: confirmed deletes (intentional in 9da75ac, 'retire legacy expected-states.yaml + runtime/resolver'). - e2e-expected-state.test.ts: applied the trim that 9da75ac's commit message described but did not include — drop the YAML-mirror parity tests that depended on the now-deleted expected-states.yaml, replace with id-coverage assertions against listExpectedStates(). - e2e-scenarios.yaml: retitled the summary step from 'typed dry-run summary' to 'typed scenario run' so the workflow text matches the PR's live-only stance.
#4939 added 'transitional model / target runner' guidance to docs/README.md after our merge. Resolved by taking that content but reconciling it with this PR's retirement of nemoclaw_scenarios/expected-states.yaml and the runtime/resolver/ tree: - 'Current sources of truth' table: drop the row pointing at expected-states.yaml and runtime/resolver; add a row for the typed expected-state registry (scenarios/expected-states.ts) as the single source of truth. - Replaced 'fan-out and dry-run planning' with 'fan-out and live execution'. - Added one paragraph noting the legacy YAML resolver and expected-states.yaml have been retired so the doc and the codebase agree.
CI on the merge commit (863e8eb) flagged two regressions caused by the merge with main: 1. ShellCheck SC2148: 5 probe scripts under nemoclaw_scenarios/probes/ start with the SPDX comment and have no shebang. ShellCheck cannot infer the target shell. Added '#!/usr/bin/env bash' and ensured executable bit. dispatch.sh already had a shebang. 2. markdown-links: docs/README.md still pointed at two paths that the YAML-resolver retirement (commit 9da75ac) deleted: - ../nemoclaw_scenarios/expected-states.yaml - ../runtime/resolver/schema.ts Repointed both to the typed registry that replaced them (scenarios/expected-states.ts, scenarios/types.ts). Verified locally: check-docs --only-links now passes; test/e2e-scenario/ vitest still 366 green; typecheck:cli clean.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh (1)
39-44:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winInconsistent grep pattern can cause false negatives.
Line 33 uses a word-boundary regex to prevent partial matches (e.g.,
"test"won't match"test-2"), but line 40 uses fixed-string substring matching (grep -Fq), which will match partial names. If a sandbox named"${E2E_SANDBOX_NAME}-suffix"exists but"${E2E_SANDBOX_NAME}"doesn't, this probe will incorrectly fail.🔍 Proposed fix to align openshell check with nemoclaw pattern
if command -v openshell >/dev/null 2>&1; then - if openshell sandbox list 2>/dev/null | grep -Fq "${E2E_SANDBOX_NAME}"; then + if openshell sandbox list 2>/dev/null | grep -qE "(^|[[:space:]])${E2E_SANDBOX_NAME}([[:space:]]|$)"; then echo "probe sandbox-absent: openshell reports sandbox '${E2E_SANDBOX_NAME}', expected absent" >&2 exit 1 fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh` around lines 39 - 44, The probe sandbox-absent uses substring matching (grep -Fq) which can false-match names like "${E2E_SANDBOX_NAME}-suffix"; update the openshell check so it matches whole words instead (e.g., replace grep -Fq "${E2E_SANDBOX_NAME}" with grep -wq -- "${E2E_SANDBOX_NAME}" or an equivalent word-boundary regex like grep -Eq "\\b${E2E_SANDBOX_NAME}\\b") in the block containing the openshell sandbox list call and the probe named "sandbox-absent" so the check aligns with the word-boundary behavior used elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh`:
- Around line 39-44: The probe sandbox-absent uses substring matching (grep -Fq)
which can false-match names like "${E2E_SANDBOX_NAME}-suffix"; update the
openshell check so it matches whole words instead (e.g., replace grep -Fq
"${E2E_SANDBOX_NAME}" with grep -wq -- "${E2E_SANDBOX_NAME}" or an equivalent
word-boundary regex like grep -Eq "\\b${E2E_SANDBOX_NAME}\\b") in the block
containing the openshell sandbox list call and the probe named "sandbox-absent"
so the check aligns with the word-boundary behavior used elsewhere.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dbfe9c5f-8470-47b6-9fc2-8259436800ba
📒 Files selected for processing (6)
test/e2e-scenario/docs/README.mdtest/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.shtest/e2e-scenario/nemoclaw_scenarios/probes/gateway-absent.shtest/e2e-scenario/nemoclaw_scenarios/probes/gateway-healthy.shtest/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.shtest/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
✅ Files skipped from review due to trivial changes (2)
- test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
- test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh
The PR audit claimed 'Audited absent in production paths: e2e_env_is_dry_run, E2E_DRY_RUN, ...' but the live scenarios run on commit 3d8adfa revealed two survivors: 1. validation_suites/hermes/01-history-writable.sh:34 - Calls 'if e2e_env_is_dry_run; then ... exit 0; fi' - The helper was deleted in b7acfb7 -> the whole hermes scenario blew up with 'e2e_env_is_dry_run: command not found' on every run, masquerading as a hermes regression - Surfaced in: ubuntu-repo-cloud-hermes scenario, runtime.hermes.history-writable 2. validation_suites/messaging/slack/00-slack-provider-state.sh:66 - Wrapped the entire hermes branch in 'if [[ -n "${E2E_DRY_RUN:-}" ]]; then 3x e2e_pass else <real probe>; fi' - Unreachable now (runner never sets E2E_DRY_RUN), but contradicts the audit and would skip the real probes if anything ever did Removed the dry-run wrapper in both files; dedented the slack else-block inline. bash -n clean; vitest 366 green; full repo grep for 'e2e_env_is_dry_run|E2E_DRY_RUN' (excluding .test.ts) returns zero. Re-enables: ubuntu-repo-cloud-hermes / runtime.hermes.history-writable
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh`:
- Line 70: Replace direct openshell sandbox invocations with the canonical
e2e_sandbox_exec wrapper to preserve per-call timeout and transport behavior;
specifically, change calls that currently run "openshell sandbox exec --name
\"${sandbox_name}\" -- /opt/hermes/.venv/bin/python -c '...'" to use
e2e_sandbox_exec passing the sandbox identifier and the same command payload
(same python -c string) so timeouts/fallbacks are honored. Apply the same
replacement for the other occurrences mentioned (the calls around the other two
locations) and keep the original command arguments unchanged except for swapping
the invocation to e2e_sandbox_exec.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 060de45a-9ece-4905-ba24-816fa28927ce
📒 Files selected for processing (2)
test/e2e-scenario/validation_suites/hermes/01-history-writable.shtest/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
💤 Files with no reviewable changes (1)
- test/e2e-scenario/validation_suites/hermes/01-history-writable.sh
Two batched mechanical fixes for the static-checks pre-commit job on PR head 381d654: 1. Files were modified by shfmt (-i 2 -ci -bn): 6 .sh files used '\\<newline>' line continuations where shfmt's -bn (binary-next-line) prefers '&&\n' / '||\n'. Pre-existing on main; the merge dragged them into the PR diff and our local commits weren't run through pre-commit. Re-formatted in place. 2. check-shebang-scripts-are-executable: 3 scripts had shebangs but were tracked as 0644: - test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh - test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh - test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh chmod +x and re-tracked. Verified: shfmt -d clean across all PR-touched .sh files; biome format clean across PR-touched ts/tsx/js/mjs/mts; bash -n clean on the 6 reformatted scripts; vitest test/e2e-scenario/ 366 green.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh (1)
13-13:⚠️ Potential issue | 🟡 MinorUpdate the PR summary:
runtime/lib/env.shis still present and is sourced by the scenario script
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.shline 13 still sources. "${LIB_DIR}/env.sh", andtest/e2e-scenario/runtime/lib/env.shexists in the repo.- The “removed”/runtime-failure claim doesn’t match the current code; align the summary (e.g., if only specific helpers were removed, state that).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh` at line 13, The PR summary claims runtime/lib/env.sh was removed but the scenario script still sources it via the LIB_DIR variable (the line sourcing "${LIB_DIR}/env.sh" in 02-inference-local-from-sandbox.sh) and the runtime/lib/env.sh file remains in the repo; update the PR summary to accurately reflect what was changed (e.g., which specific helpers were removed or refactored) or remove the source line if you truly intend to stop using runtime/lib/env.sh—ensure consistency between the commit changes and the summary by either modifying the script to stop sourcing LIB_DIR/env.sh or amending the PR description to state that env.sh is still present and which pieces were removed.
🧹 Nitpick comments (1)
test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh (1)
81-81: ⚖️ Poor tradeoffShell injection risk in command construction.
The command string interpolates
${marker_content}into a single-quoted context. Ifmarker_contentever contains a single quote, the quoting breaks. While currently safe (the content isREBUILD_LIFECYCLE_$(date +%s)_${RANDOM}), this pattern was explicitly flagged in PR coordination as a concern: "prevention of plan-artifact-to-shell injection."Pass variables as positional arguments to avoid quote interpretation:
🔒 Proposed fix using positional arguments
- if ! E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=30 \ - e2e_sandbox_exec "${sandbox_name}" -- sh -c \ - "mkdir -p '$(dirname "${LIFECYCLE_REBUILD_MARKER_PATH}")' && printf '%s' '${marker_content}' > '${LIFECYCLE_REBUILD_MARKER_PATH}'"; then + if ! E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=30 \ + e2e_sandbox_exec "${sandbox_name}" -- sh -c \ + 'mkdir -p "$(dirname "$1")" && printf %s "$2" > "$1"' \ + sh "${LIFECYCLE_REBUILD_MARKER_PATH}" "${marker_content}"; thenThis passes
marker_contentas$2to the remote shell, preventing quote interpretation issues.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh` at line 81, The current shell command interpolates marker_content directly into a single-quoted string which breaks if marker_content contains a single quote; update the command used to create LIFECYCLE_REBUILD_MARKER_PATH so marker_content is passed as a positional argument to the remote shell instead of being interpolated into the command string. Concretely, change the constructed command that references LIFECYCLE_REBUILD_MARKER_PATH and marker_content to use a printf (or echo) that reads from a positional parameter (e.g., $2) and then invoke the remote shell/exec call with marker_content supplied as that positional argument so no shell quoting of marker_content is required.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh`:
- Line 13: The PR summary claims runtime/lib/env.sh was removed but the scenario
script still sources it via the LIB_DIR variable (the line sourcing
"${LIB_DIR}/env.sh" in 02-inference-local-from-sandbox.sh) and the
runtime/lib/env.sh file remains in the repo; update the PR summary to accurately
reflect what was changed (e.g., which specific helpers were removed or
refactored) or remove the source line if you truly intend to stop using
runtime/lib/env.sh—ensure consistency between the commit changes and the summary
by either modifying the script to stop sourcing LIB_DIR/env.sh or amending the
PR description to state that env.sh is still present and which pieces were
removed.
---
Nitpick comments:
In `@test/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.sh`:
- Line 81: The current shell command interpolates marker_content directly into a
single-quoted string which breaks if marker_content contains a single quote;
update the command used to create LIFECYCLE_REBUILD_MARKER_PATH so
marker_content is passed as a positional argument to the remote shell instead of
being interpolated into the command string. Concretely, change the constructed
command that references LIFECYCLE_REBUILD_MARKER_PATH and marker_content to use
a printf (or echo) that reads from a positional parameter (e.g., $2) and then
invoke the remote shell/exec call with marker_content supplied as that
positional argument so no shell quoting of marker_content is required.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 19ad2265-3927-4694-80ee-0d4c10fded73
📒 Files selected for processing (6)
test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.shtest/e2e-scenario/nemoclaw_scenarios/lifecycle/rebuild-current-version.shtest/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.shtest/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.shtest/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.shtest/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
💤 Files with no reviewable changes (2)
- test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
- test/e2e-scenario/nemoclaw_scenarios/lifecycle/dispatch.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
…e-shape budget The static-checks job runs scripts/find-source-shape-tests.ts --check with budget ci/source-shape-test-budget.json (maxSourceShapeCases=0). e2e-redaction-parity.test.ts had 2 source-shape cases that read both redaction.ts files as text and compared regex literals via string parsing. Refactored to import the actual RegExp arrays from both modules and compare them by behavior — fingerprint each via .source + .flags. Same parity guarantee, no source-text reads, scanner now reports 0 cases. Required exporting TOKEN_PREFIX_PATTERNS and CONTEXT_PATTERNS from test/e2e-scenario/scenarios/orchestrators/redaction.ts. The framework runtime stays decoupled from src/lib/security/ — only the parity test crosses that boundary, which is the test's whole purpose. Added a comment on the framework export explaining the test-only contract. Verified: `npm run source-shape:check` reports 0 cases; vitest test/e2e-scenario/ 366 green; typecheck:cli clean.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e-scenario/scenarios/orchestrators/redaction.ts (1)
211-221:⚠️ Potential issue | 🟠 MajorFix chunk-boundary redaction in
pipeRedactedto prevent split tokens from leaking into evidence logs.
test/e2e-scenario/scenarios/orchestrators/redaction.ts(pipeRedacted, lines 211-221) redacts eachdatachunk independently viaredactString(chunk.toString("utf8"))and writes it immediately. With no carry/overlap buffering, a token split across adjacent chunks can fail the regex matches and persist in the evidence log (and into thestderrTail-basedmessagepath). The current redaction test coversecho-style output but doesn’t exercise split-token chunk boundaries.Suggested direction (stream-safe redaction)
+import { StringDecoder } from "node:string_decoder"; ... export function pipeRedacted( src: Readable, log: Writable, onChunk?: (redactedChunk: string) => void, ): void { - src.on("data", (chunk: Buffer) => { - const redacted = redactString(chunk.toString("utf8")); - log.write(redacted); - onChunk?.(redacted); - }); + const decoder = new StringDecoder("utf8"); + let carry = ""; + const OVERLAP_CHARS = 4096; + + const emitRedacted = (text: string) => { + if (!text) return; + const redacted = redactString(text); + log.write(redacted); + onChunk?.(redacted); + }; + + src.on("data", (chunk: Buffer | string) => { + const text = typeof chunk === "string" ? chunk : decoder.write(chunk); + const merged = carry + text; + const cutoff = Math.max(0, merged.length - OVERLAP_CHARS); + emitRedacted(merged.slice(0, cutoff)); + carry = merged.slice(cutoff); + }); + + src.on("end", () => { + emitRedacted(carry + decoder.end()); + carry = ""; + }); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts` around lines 211 - 221, pipeRedacted currently redacts each Buffer chunk independently (src.on("data") → redactString(...); log.write(...)) which allows sensitive tokens split across chunk boundaries to slip through; change pipeRedacted to maintain a small carry buffer between chunks: append incoming chunk to a leftover string, run redactString on the combined text but retain any trailing partial token (notably the same token/regex semantics used by redactString) by slicing off an appropriate suffix into leftover for the next chunk, write only the fully-redacted portion to log and call onChunk for that portion, and on "end"/"close"/"error" flush the remaining leftover through redactString and write it; ensure the same function names (pipeRedacted, redactString) and event handlers (src.on("data"), src.on("end"/"close"/"error")) are updated.
🧹 Nitpick comments (1)
test/e2e-scenario/scenarios/orchestrators/redaction.ts (1)
52-74: ⚡ Quick winExported regex arrays should be immutable to prevent runtime mutation of redaction behavior.
Now that these arrays are exported for tests, any importer can mutate them (
push/splice), which can silently alter production redaction behavior in-process.Hardening patch
-export const TOKEN_PREFIX_PATTERNS: RegExp[] = [ +export const TOKEN_PREFIX_PATTERNS: readonly RegExp[] = Object.freeze([ ... -]; +]); -export const CONTEXT_PATTERNS: RegExp[] = [ +export const CONTEXT_PATTERNS: readonly RegExp[] = Object.freeze([ ... -]; +]);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts` around lines 52 - 74, The exported arrays TOKEN_PREFIX_PATTERNS and CONTEXT_PATTERNS are mutable and can be altered by importers; change their exports to immutable frozen/read-only arrays and freeze each RegExp instance. Replace direct array exports with something like: export const TOKEN_PREFIX_PATTERNS: ReadonlyArray<RegExp> = Object.freeze([ ...patterns... ].map(r => Object.freeze(r))); and do the same for CONTEXT_PATTERNS so callers cannot push/splice or mutate the regex objects.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts`:
- Around line 211-221: pipeRedacted currently redacts each Buffer chunk
independently (src.on("data") → redactString(...); log.write(...)) which allows
sensitive tokens split across chunk boundaries to slip through; change
pipeRedacted to maintain a small carry buffer between chunks: append incoming
chunk to a leftover string, run redactString on the combined text but retain any
trailing partial token (notably the same token/regex semantics used by
redactString) by slicing off an appropriate suffix into leftover for the next
chunk, write only the fully-redacted portion to log and call onChunk for that
portion, and on "end"/"close"/"error" flush the remaining leftover through
redactString and write it; ensure the same function names (pipeRedacted,
redactString) and event handlers (src.on("data"), src.on("end"/"close"/"error"))
are updated.
---
Nitpick comments:
In `@test/e2e-scenario/scenarios/orchestrators/redaction.ts`:
- Around line 52-74: The exported arrays TOKEN_PREFIX_PATTERNS and
CONTEXT_PATTERNS are mutable and can be altered by importers; change their
exports to immutable frozen/read-only arrays and freeze each RegExp instance.
Replace direct array exports with something like: export const
TOKEN_PREFIX_PATTERNS: ReadonlyArray<RegExp> = Object.freeze([ ...patterns...
].map(r => Object.freeze(r))); and do the same for CONTEXT_PATTERNS so callers
cannot push/splice or mutate the regex objects.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 50d05560-6afd-4f7b-927c-bafdd8cf0b74
📒 Files selected for processing (2)
test/e2e-scenario/framework-tests/e2e-redaction-parity.test.tstest/e2e-scenario/scenarios/orchestrators/redaction.ts
Two batched review-thread fixes (Phase 3 of CI loop):
1. CodeRabbit (major) on
validation_suites/messaging/slack/00-slack-provider-state.sh:
3 calls used 'openshell sandbox exec --name X -- ...' directly,
bypassing the canonical e2e_sandbox_exec wrapper. The wrapper
provides per-call timeout via E2E_SANDBOX_EXEC_TIMEOUT_SECONDS
plus ssh-config-preferred / openshell-exec-fallback transport. A
wedged openshell call without the wrapper can stall the suite
indefinitely in live mode. Replaced all 3 with the wrapper call,
matching the timeout cap pattern already used at line 36.
Fixes hermes-platforms-enabled, hermes-allowed-channels-scoped,
and hermes-gateway-running probes in the slack scenario.
2. CodeQL alert 715 (Shell command built from environment values) on
test/e2e-scenario/scenarios/probes/util.ts:
runSandboxCmd interpolated wrapperPath (derived from
ProbeContext.repoRoot) and the sandbox name + payload argv into
a bash -c script string. Even with JSON.stringify-style escaping,
CodeQL flags taint flow from runtime context into shell. Refactored
to pass all user-controlled values as positional bash parameters:
spawn('bash', ['-c', script, 'e2e-probe-spawn', wrapperPath,
fnName, sandboxName, ...args]). Script body references them as
"$1", "$2", "$3", "${@:4}". No user data is concatenated
into the script. Smoke-tested the new pattern; runtime behavior
is identical for callers (single argv vector preserved).
Updated runSandboxCmd doc-comment which described the old
single-quote-into-script approach.
Verified: bash -n + shfmt clean on slack file; vitest 366 green;
typecheck:cli clean; source-shape:check still 0 cases.
Iteration 5's refactor cleared CodeQL alert 715 (interpolated user data
into shell-command body) but introduced alert 792 on the resulting
spawn("bash", ["-c", script, sentinel, ...bashArgs]) call.
CodeQL's js/shell-command-injection-from-environment rule fires on any
data flow from runtime context into a 'bash -c' invocation, even when
the data reaches the script as positional bash parameters rather than
through interpolation.
Applied the established repo pattern (src/lib/runner.ts:101-102):
1. Added a NUL-byte rejector on every bashArg (mirrors
normalizeSpawnArgs / rejectNulByte from runner.ts).
2. Documented the spawn contract in spawnBash's doc-comment:
- script body is a string literal at every call site
- all variable data flows through positional bash params ($1,$2,...)
- bash treats positional argv as data, not code
- bash binary is hard-coded; shell:false is implicit
3. Suppressed alert 792 with an explicit lgtm[] comment that points
at the contract above.
This follows the documented repo convention: validate inputs, document
why the spawn is safe, suppress the false positive with the rule name
and a short justification.
Verified: typecheck:cli clean; vitest test/e2e-scenario/ 366 green;
smoke-tested the new spawn behavior with a temp wrapper script.
# Conflicts: # test/e2e-scenario/docs/README.md
…awn() Iter 9 added the lgtm suppression but with a multi-line justification that pushed the actual marker 3 lines above the spawn() call. CodeQL treats each '//' comment as a separate node, so my marker on line N followed by 2 continuation comments on N+1, N+2 was not recognized as a suppression for the spawn() on N+3. The new alert 793 fired on the post-iter-9 code at line 113. Restructured to match src/lib/runner.ts:101-104 exactly: contextual explanation as plain comments above, then the bare lgtm directive on the immediately-preceding line. Justification (literal script body, NUL-validated positional argv, hard-coded bash binary) is already in spawnBash's doc-comment, which is the documentation site of record for the contract. Verified: typecheck:cli clean; vitest 366 green.
…#4996) ## Summary Bash is not the direction the E2E scenario stack is heading, and the vitest-runner migration underway makes that even clearer — none of these bash files have a future role. This PR sweeps out a set of inert tombstones and orphan helpers under `test/e2e-scenario/runtime/` that have no production callers, so the migration lands on a smaller surface. Net: 6 file deletions + minor test trimming. ~250 lines removed. No production behavior change. ## Changes **Tombstone bash entrypoints (deleted; bodies were just `exit 2` stubs):** - `test/e2e-scenario/runtime/run-scenario.sh` - `test/e2e-scenario/runtime/run-suites.sh` **Orphan helper libs (deleted; zero callers in production):** - `test/e2e-scenario/runtime/lib/artifacts.sh` — only caller was a self-test in `e2e-lib-helpers.test.ts`. - `test/e2e-scenario/runtime/lib/cleanup.sh` — was a thin re-export of `sandbox-teardown.sh`. - `test/e2e-scenario/runtime/lib/negative.sh` — superseded by `scenarios/orchestrators/negative-matcher.ts`. - `test/e2e-scenario/runtime/lib/port-holder.sh` — superseded by typed probes. **Test cleanup in `test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts` (rides along):** - Dropped `artifact_helper_should_collect_known_logs_without_failing_when_missing` — it was the only caller of `artifacts.sh`. - Dropped `gateway_helper_should_report_unhealthy_gateway_clearly` — it sourced `${RUNTIME_LIB}/gateway.sh`, which does not exist on main. The test was passing only by accident: the missing-file stderr happens to match `/gateway/i`. - Dropped the dead `E2E_DRY_RUN: "1"` env arg in `older_base_image_should_emit_dockerfile_pointing_at_tagged_base`. PR #4380 deleted the `e2e_env_is_dry_run` short-circuit from `older-base-image.sh`, so the env var is unread test scaffolding now. **Comment hygiene:** - `nemoclaw_scenarios/probes/sandbox-absent.sh` — dropped the "Typed replacement for the legacy `run-scenario.sh` inline check" note now that the legacy file is deleted. **Intentionally untouched:** - `test/e2e-scenario-advisor.test.ts` still references `test/e2e-scenario/runtime/run-scenario.sh` as a string fixture. The advisor never reads the file — it filters changed-file paths by shape — so the fixture continues to validate the same plumbing. Refreshing those fixtures is out of scope. - `test/e2e-scenario/docs/{README,MIGRATION}.md` reference the old bash entrypoints. Doc updates are out of scope; the docs already mark these paths as deprecated. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Notes on verification: - Targeted vitest run is green: `npx vitest run test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts test/e2e-scenario-advisor.test.ts test/e2e-scenario/` — 3 test files, 106 tests, 0 failures. - `npx prek run --all-files` reports the same `Test (CLI)` flakes that exist on a clean `origin/main` checkout (timing-sensitive tests in `nemoclaw-start.test.ts`, `doctor-gateway-token.test.ts`, `e2e-fixture-context.test.ts`, `runtime-hermes-secret-boundary-behavioural.test.ts`). All pass on isolated re-run; none reference any of the deleted bash files. Pre-existing — not caused by this PR. Leaving the verification box unchecked because the local prek run wasn't fully clean even though the issue is unrelated. --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Why
Removing
--dry-runand plan-mode execution from the E2E runner: they let a coding agent ship work behind false-green test runs. The TS runner now has one execution mode: live. No flag, env var, helper, or branch in any production path bypasses real assertion execution.Two seams needed closing:
phase.ts:executeStephad nochild_process.spawn— every real shell/probe step fell through to a hardcodedfailed: unsupported live step, and four placeholder refs (fake-pass,fake-retry-once-pass,fake-always-transient,phase-1-skeleton) were the only paths that reportedpassed.validation_suites/,onboarding_assertions/, andnemoclaw_scenarios/began withif e2e_env_is_dry_run; then exit 0; fi, so when the workflow passed--dry-runthe scripts exited 0 before running.What changed
Orchestrator (TypeScript)
phase.ts:executeStepnow spawns shell steps viachild_process.spawn, with detached process groups so timeouts killbash + sleepcleanly via negative-pid signal. Probe steps returnskipped: "probe not registered"until the registry lands. Pending steps returnskipped: "pending: <ref>". Unknown kinds throw. Real evidence captured to.e2e/logs/<step-id>.log. Step-levelreliability.timeoutSecondsandretry.{attempts,on}enforced here, not in clients.run.ts:--dry-run,--validate-onlydeleted. Default invocation is live execution.--listand--plan-only(local debug) survive read-only.--emit-matrixadded for the dynamic-matrix workflow (feat(e2e): generate scenario fan-out matrix from typed registry #4359), with the richer payload (runner,label,platform,suites) merged from main.types.ts:RunContext.dryRundeleted.Workflow
e2e-scenarios.yaml: the resolve-runner--plan-onlywarmup, and both--dry-runinvocations (Linux + WSL), are gone. Workflows execute live. Summary step retitled from "typed dry-run summary" to "typed scenario run".tools/e2e-scenarios/workflow-boundary.mtsvalidator now rejects--dry-run,--plan-only,--validate-onlyin the workflow.Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)
runtime/run-scenario.sh: 483 lines of duplicated install/onboard/gateway-check/suite-execution → 5-line fail-fast stub pointing atrun.ts.runtime/run-suites.sh: same treatment.PhaseOrchestrator.runShellStepwalks typedassertionGroupsdirectly; nothing in TS calls a YAML-walking bash runner.Shell scripts (the leaves stay, the dry-run skip blocks die)
validation_suites/**,onboarding_assertions/**,nemoclaw_scenarios/**: everyif e2e_env_is_dry_run; then ... exit 0; fiand every[[ "${E2E_DRY_RUN:-0}" == "1" ]]short-circuit removed. The real assertion logic that was hiding underneath now runs unconditionally.runtime/lib/env.sh:e2e_env_is_dry_runhelper deleted.inference_routing.sh: dead_e2e_inference_planhelper deleted.Legacy YAML resolver retired (rolled in from the planned follow-up PR)
nemoclaw_scenarios/expected-states.yamland the entireruntime/resolver/tree (load.ts,plan.ts,schema.ts,coverage.ts,index.ts,validator.ts,expected-failure.ts,js-yaml.d.ts) plusruntime/coverage-report.sh.e2e-scenario-schema,e2e-scenario-resolver,e2e-coverage-report,e2e-scenario-additional-families,e2e-metadata-final-hygiene,e2e-expected-failure(typed equivalent now lives ine2e-negative-matcher.test.ts).e2e-expected-state.test.ts: YAML-mirror parity tests dropped; replaced with id-coverage assertions againstlistExpectedStates(). The typed registry inscenarios/expected-states.tsis the single source of truth.Tests
e2e-suite-runner.test.ts,e2e-scenario-first-migration.test.ts,e2e-expected-state-validator.test.ts.e2e-phase-orchestrators.test.ts: now exercises real shell spawning via temp scripts (pass / fail-with-stderr-tail / timeout / retry-on-classified-transient / missing-ref), real probe skipping with visible reason, and real pending skipping. Replaces the prior placeholder refs with assertions that observe actual subprocess behavior.e2e-lib-helpers.test.ts,e2e-scenario-additional-families.test.ts,e2e-scenario-resolver.test.ts,e2e-context-helper.test.ts: dry-run-mode /run-scenario.sh-spawning tests deleted; tests of real bash semantics survive.Docs
test/e2e-scenario/docs/README.md: one runner, one mode (live), no dry-run, no validate-only. Migration tracking section reconciled with main: "Migration status is tracked outside the repository in GitHub issues and pull requests" (parent issue Implement layered E2E scenario model #3588).Merge with main
run.ts,00-slack-provider-state.sh,docs/README.md, three modify/delete pairs underframework-tests/andruntime/resolver/, ande2e-scenarios.yaml.--validate-only/--dry-runplumbing fix that was layered onto the live runner during merge —~/.local/binis now prepended to child PATH at the framework boundary (redaction.ts:buildChildEnv) so post-install nemoclaw shims resolve on runners (notably self-hosted GPU and WSL) where it isn't on the default PATH; and snapshot lifecycle assertions invalidation_suites/lib/sandbox_lifecycle.shnow use the canonicalnemoclaw <sandbox> snapshot <subcommand>argv shape, matchingtest-snapshot-commands.sh.Verification
Audited absent in production paths:
--dry-run,dryRun,E2E_DRY_RUN,e2e_env_is_dry_run,fake-pass,fake-retry-once-pass,fake-always-transient,phase-1-skeleton,unsupported-live-step,--validate-only,RunContext.dryRun.Spec gates addressed
--dry-run.--dry-run,--validate-only, andsuite_filtergone from active paths.Known follow-ups (not blocking this PR)
if [[ -n "${E2E_DRY_RUN:-}" ]]short-circuits remain inside the hermes branch ofvalidation_suites/messaging/slack/00-slack-provider-state.sh. They are dead code now that nothing in production setsE2E_DRY_RUN, but should be removed for consistency with the stated audit.Summary by CodeRabbit
Tests
New Features
Chores