12 changes: 4 additions & 8 deletions .github/workflows/peekaboo-e2e.yml
@@ -385,7 +385,7 @@ jobs:
set -euo pipefail
node tools/computer-use-e2e/run-local.mjs verify-report "$FETCH_REPORT_DIR" \
--method ci-static \
--notes "CI workflow inspected the rendered Peekaboo HTML report with static integrity checks and verified the summary, PR #75 baseline parity coverage, evidence video/storyboard, screenshots, visual proof, and artifact sections before publishing."
--notes "CI workflow inspected the rendered Peekaboo HTML report with static integrity checks and verified the summary, Computer Use correspondence map, evidence video/storyboard, screenshots, visual proof, and artifact sections before publishing."

- name: Collect report metadata
id: report-meta
@@ -447,9 +447,6 @@ jobs:
verdict="$(jq -r '([.scenarios[]?.status] // []) as $statuses | if any($statuses[]; . == "fail") then "fail" elif any($statuses[]; . == "inconclusive") then "inconclusive" else "pass" end' "$state_file")"
cu_covered="$(jq -r '[.peekaboo.coverageMap.phaseCoverage[]? as $coverage | select((.scenarios[$coverage.key].status // "") == "pass") | $coverage.correspondsTo[]?] | unique | length' "$state_file")"
cu_required=24
if [[ "$verdict" == "pass" && "$cu_covered" != "$cu_required" ]]; then
verdict="inconclusive"
fi
screenshots="$(jq -r '(.screenshots // []) | length' "$state_file")"
video_frames="$(jq -r '(.video.frames // 0) | if type == "number" then . elif type == "array" then length else 0 end' "$state_file")"
state_secret_scan_passed="$(jq -r '(.peekaboo.secretScan.status // "missing") == "passed"' "$state_file")"
@@ -525,7 +522,7 @@ jobs:
if [[ "${{ steps.report-meta.outputs.has_report }}" == "true" ]]; then
echo "- Verdict: \`${{ steps.report-meta.outputs.verdict }}\`"
echo "- Counts: \`${{ steps.report-meta.outputs.pass }} pass / ${{ steps.report-meta.outputs.fail }} fail / ${{ steps.report-meta.outputs.inconclusive }} inconclusive / ${{ steps.report-meta.outputs.not_required }} not required\`"
echo "- CU coverage: \`${{ steps.report-meta.outputs.cu_covered }}/${{ steps.report-meta.outputs.cu_required }}\`"
echo "- Computer Use correspondence: \`${{ steps.report-meta.outputs.cu_covered }}/${{ steps.report-meta.outputs.cu_required }}\`"
echo "- Evidence: \`${{ steps.report-meta.outputs.screenshots }} screenshots / ${{ steps.report-meta.outputs.video_frames }} video frames\`"
echo "- Artifact path: \`${{ steps.report-meta.outputs.report_dir }}/index.html\`"
if [[ "${{ steps.report-meta.outputs.rsync_status }}" != "0" ]]; then
@@ -668,7 +665,7 @@ jobs:
### nixmac Peekaboo E2E: $status_label

- Result: \`$PASS_COUNT pass / $FAIL_COUNT fail / $INCONCLUSIVE_COUNT inconclusive / $NOT_REQUIRED_COUNT not required\`
- CU parity coverage: \`$CU_COVERED/$CU_REQUIRED\`
- Computer Use correspondence: \`$CU_COVERED/$CU_REQUIRED\`
- Evidence: \`$SCREENSHOTS screenshots / $VIDEO_FRAMES timestamped video frames\`
- Report fetch: \`$(if [[ "${RSYNC_STATUS:-0}" == "0" ]]; then printf 'complete'; else printf 'partial rsync status %s' "$RSYNC_STATUS"; fi)\`
- Secret scan: \`$SECRET_SCAN_PASSED\`
@@ -728,8 +725,7 @@ jobs:
exit 1
fi
if [[ "$cu_covered" != "$cu_required" ]]; then
echo "Peekaboo E2E covered $cu_covered/$cu_required required Computer Use parity keys."
exit 1
echo "Peekaboo E2E mapped $cu_covered/$cu_required Computer Use correspondence keys; missing breadth is reported but no longer fails this complementary lane."
fi
if [[ "$verdict" != "pass" ]]; then
echo "Peekaboo E2E verdict was $verdict."
Original file line number Diff line number Diff line change
@@ -15,8 +15,8 @@
# `nix-overlays.nix` and reference that package here.
#
# Example commands to preview changes:
# $ darwin-rebuild build --flake .#Scotts-MacBook-Pro-2
# $ darwin-rebuild switch --flake .#Scotts-MacBook-Pro-2
# $ darwin-rebuild build --flake .#my-mac
# $ darwin-rebuild switch --flake .#my-mac

environment.systemPackages = with pkgs; [
# Example packages (uncomment or add your own):
85 changes: 85 additions & 0 deletions docs/e2e-dual-lane-strategy.md
@@ -0,0 +1,85 @@
# nixmac E2E Dual-Lane Strategy

## TL;DR

nixmac should keep two desktop E2E lanes:

- **Computer Use Product Proof** is the broad reviewer-facing lane for PR evidence, release/high-risk workflows, and real UI interaction inside the running app.
- **Peekaboo AX/screen-capture E2E** is a complementary Mac proof lane for fast deterministic launch/readiness checks, screenshots, shell-owned fixture/state checks, Nix/system boundaries, MacInCloud health, and focused smoke tests.

Peekaboo should not be treated as a full replacement for Computer Use. It can show where its evidence corresponds to Computer Use coverage, but missing Computer Use breadth is not itself a Peekaboo failure.

## Why Parity Was Not Reached

The gap was not mainly a matter of scenario count. More scenarios would have increased breadth but would not have fixed the core interaction boundary.

Computer Use can operate through a higher-level app-server interaction path that proved broad user workflows in PR #75: launch, settings, history, console, feedback/report dialogs, suggestion cards, typed prompt submission, Review/Summary/Diff/Build boundaries, save/rollback, and discard boundaries.

Peekaboo runs through macOS screen capture, Accessibility (AX), coordinates, keyboard/paste, shell fixtures, and report artifacts. Those are excellent for deterministic Mac evidence, but on MacInCloud they did not expose the same reliable interaction surface for WebKit/React prompt controls.

The decisive checks were:

- An SSH-launched Swift AX probe against the running app reported `AX_TRUSTED=false`, `NODE_COUNT=1`, and zero matches for `evolve-prompt-input`, `Install vim`, `Add Rectangle`, `Describe changes`, `What to change`, or `Configuration change`.
- Peekaboo Bridge itself had Screen Recording and Accessibility permissions and could see the app generally, but the useful prompt/suggestion controls were not addressable through the trusted Peekaboo AX scan in the failure state.
- The focused MacInCloud `macos_core_product_proof` run reached the visible suggestion target, ran coordinate and paste/type fallbacks, then failed because the prompt state did not update: `Suggestion text did not reach the prompt after system input fallback`.
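
A probe result like the one in the first bullet can be turned into a simple go/no-go check before spending time on interaction scenarios. The following is a minimal sketch, not the project's probe tooling: it assumes the probe prints `KEY=VALUE` lines, and the function names and the "more than one node" threshold are hypothetical choices made for illustration.

```javascript
// Sketch only: assumes the probe prints KEY=VALUE lines such as
// AX_TRUSTED=false and NODE_COUNT=1. The output format is an assumption.
function parseProbeOutput(text) {
  const fields = {};
  for (const line of text.split('\n')) {
    const match = /^([A-Z_]+)=(.*)$/.exec(line.trim());
    if (match) fields[match[1]] = match[2];
  }
  return fields;
}

// Hypothetical rule: the AX surface is only worth driving when the process is
// trusted and the tree exposes more than a single root node.
function isAxSurfaceAddressable(fields) {
  const trusted = fields.AX_TRUSTED === 'true';
  const nodeCount = Number.parseInt(fields.NODE_COUNT ?? '0', 10);
  return trusted && nodeCount > 1;
}

const probe = parseProbeOutput('AX_TRUSTED=false\nNODE_COUNT=1\n');
console.log(probe, isAxSurfaceAddressable(probe));
```

With the values quoted above, the check reports that the surface is not addressable, which matches the failure described in the bullets.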

PR #105 also explored an app-owned WebKit eval bridge. That is useful, but it is a different proof class. It can prove React/app-level behavior when explicitly enabled, but it does not prove host pointer/compositor behavior such as pointerdown, mousedown, hover, touch, focus transfer, or real OS click delivery. A bridge-backed result should therefore be labeled separately from a Computer Use pass.
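
One way to keep the two proof classes apart is to tag every evidence step when it is recorded. This is a rough sketch under stated assumptions: the `viaEvalBridge` field, the label strings, and both function names are hypothetical and do not come from the repository.

```javascript
// Hypothetical labels for the two proof classes described above.
const PROOF_CLASS = Object.freeze({
  HOST_INPUT: 'host-input', // real OS pointer/keyboard delivery
  APP_BRIDGE: 'app-bridge', // app-owned WebKit eval bridge (synthetic)
});

// Attach a proof class to a step. `viaEvalBridge` is an assumed field name.
function labelStep(step) {
  const proofClass = step.viaEvalBridge
    ? PROOF_CLASS.APP_BRIDGE
    : PROOF_CLASS.HOST_INPUT;
  return { ...step, proofClass };
}

// Count steps per proof class so a report can show them side by side
// instead of merging them into a single pass count.
function summarizeByProofClass(steps) {
  const summary = { [PROOF_CLASS.HOST_INPUT]: 0, [PROOF_CLASS.APP_BRIDGE]: 0 };
  for (const step of steps.map(labelStep)) {
    summary[step.proofClass] += 1;
  }
  return summary;
}

const summary = summarizeByProofClass([
  { name: 'launch', viaEvalBridge: false },
  { name: 'prompt-submit', viaEvalBridge: true },
]);
console.log(summary);
```

Keeping the counts separate means a bridge-backed step can never be silently read as host pointer/compositor proof.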

The defensible conclusion is narrower than “Peekaboo cannot test nixmac.” Peekaboo can test important nixmac behavior. It just cannot honestly claim full Computer Use parity on the current MacInCloud stack without a materially different trusted driver.

## When To Use Each Lane

Use **Computer Use Product Proof** when the question is:

- Can a reviewer trust that the real user workflow works?
- Did a PR change app UI, app state flow, provider flow, save/rollback, discard, or prompt interaction?
- Do we need broad Product Proof evidence linked from a PR comment?
- Do we need screenshot/video/report evidence from the same lane that drove the interaction?

Use **Peekaboo AX/screen-capture E2E** when the question is:

- Does the app launch and reach a stable shell on a real Mac?
- Is MacInCloud healthy enough for GUI capture and app staging?
- Are screenshots, diagnostics, report structure, and visual proof quality intact?
- Do shell-owned fixtures, Nix install/state boundaries, cleanup, and non-destructive smoke tests behave deterministically?
- Do we need a faster local or remote smoke check before spending Computer Use time?

Use **both** for high-risk desktop work: Peekaboo catches Mac/fixture/report regressions cheaply; Computer Use remains the broad user-workflow proof.

## Reporting Policy

Peekaboo reports should use **Computer Use correspondence**, not “Computer Use parity,” for mapped keys.

- A mapped key means Peekaboo evidence corresponds to part of the Computer Use coverage model.
- An unmapped key means that behavior is outside the current Peekaboo lane, not automatically a failed Peekaboo run.
- A Peekaboo run should fail for failed Peekaboo-owned evidence: app launch, scenario assertions, screenshots, diagnostics, secret scan, cleanup, report generation, or explicit lane-specific checks.
- A Peekaboo run should not fail solely because it does not cover every Computer Use Product Proof key.
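
The policy above can be sketched as a small evaluation function. This is an illustrative sketch, not the workflow's actual logic: the input shape (`ownedChecks`, `mappedKeys`, `computerUseKeys`) and the function name are hypothetical, and the real state file may be structured differently.

```javascript
// Sketch of the reporting policy. All field names here are assumptions.
function evaluatePeekabooRun({ ownedChecks, mappedKeys, computerUseKeys }) {
  const statuses = ownedChecks.map((check) => check.status);

  // Only Peekaboo-owned evidence can fail or downgrade the run.
  let verdict = 'pass';
  if (statuses.includes('fail')) {
    verdict = 'fail';
  } else if (statuses.includes('inconclusive')) {
    verdict = 'inconclusive';
  }

  // Correspondence is reported, but never feeds back into the verdict.
  const mapped = new Set(mappedKeys.filter((key) => computerUseKeys.includes(key)));
  const unmapped = computerUseKeys.filter((key) => !mapped.has(key));

  return {
    verdict,
    correspondence: `${mapped.size}/${computerUseKeys.length}`,
    unmapped,
  };
}

const result = evaluatePeekabooRun({
  ownedChecks: [
    { name: 'app-launch', status: 'pass' },
    { name: 'secret-scan', status: 'pass' },
  ],
  mappedKeys: ['launch', 'settings'],
  computerUseKeys: ['launch', 'settings', 'history', 'console'],
});
console.log(result);
```

In this example two of four keys are mapped, yet the verdict stays `pass` because every Peekaboo-owned check passed; the unmapped keys are surfaced for the reader rather than treated as failures.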

## Repo Hygiene

Current state after the pivot decision:

- PR #75 is the merged Computer Use Product Proof baseline.
- PR #90 introduced the Peekaboo local Product Proof lane.
- PR #101 hardened Peekaboo boot diagnostics.
- PR #105 is an open follow-up experiment that tried to close the MacInCloud interaction gap and still failed the key prompt interaction criterion.

The PR #105 experiment should be preserved as evidence, but not merged as the new direction unless specific pieces are intentionally salvaged in smaller follow-ups. The go-forward policy belongs in a clean branch so reviewers do not have to separate a strategy decision from a large experimental diagnostic stack.

Salvage candidates from PR #105 should be evaluated individually:

- Keep if they improve lane-specific evidence without implying Computer Use parity.
- Keep if they produce clearer visual/report diagnostics.
- Keep bridge-backed behavior only with explicit wording that it proves app/React behavior, not host pointer/compositor parity.
- Drop or defer changes whose only purpose is to force Peekaboo into full Computer Use replacement semantics.

## Revisit Criteria

Revisit “Peekaboo can replace Computer Use” only if the driver changes materially:

- a trusted helper lives inside the already-granted Peekaboo Bridge or equivalent signed/granted process;
- the driver can address WebKit controls reliably in the same states Computer Use covers;
- text/click actions update app state through the same user-observable path the product depends on;
- report evidence distinguishes app-level synthetic proof from real host pointer/compositor proof.

Until then, the right product is a dual-lane system: Computer Use for broad workflow confidence, Peekaboo for fast deterministic Mac proof and complementary evidence.
19 changes: 13 additions & 6 deletions tests/e2e/README.md
@@ -1,8 +1,14 @@
# macos-e2e

GUI test framework for macOS apps. Uses [Peekaboo](https://peekaboo.boo) for accessibility-based automation over SSH, with ffmpeg screen recording.
Complementary macOS E2E runner for deterministic desktop checks. It uses
[Peekaboo](https://peekaboo.boo) for Accessibility-based automation over SSH,
plus ffmpeg screen recording.

> Built for [nixmac](https://github.com/darkmatter/nixmac), designed to be extracted as a standalone tool.
Built for [nixmac](https://github.com/darkmatter/nixmac). This runner is not the
broad Product Proof replacement for the Computer Use lane; it is the
Peekaboo AX/screen-capture lane for launch, readiness, screenshots, shell-owned
fixtures, Nix/system boundaries, and focused smoke tests. See
`../../docs/e2e-dual-lane-strategy.md` for the lane split.

## Quick start

@@ -124,10 +130,11 @@ Then reference it in your scenario: `E2E_ADAPTER="myapp"`

`macos_descriptor_prompt_smoke` is the safe inner-loop scenario used by
`tools/computer-use-e2e/run-local.mjs run-peekaboo`. It launches the real app,
drives the descriptor prompt through Peekaboo accessibility metadata, captures
screenshots/video/logs, and does not install, uninstall, build, save, discard,
or mutate system Nix state. It does write temporary nixmac settings; the
`run-peekaboo` bridge backs up and restores Application Support around the run.
drives the descriptor prompt through Peekaboo accessibility metadata when the
host exposes it, captures screenshots/video/logs, and does not install,
uninstall, build, save, discard, or mutate system Nix state. It does write
temporary nixmac settings; the `run-peekaboo` bridge backs up and restores
Application Support around the run.

`macos_provider_evolve_full_smoke` is the stronger local proof. It owns a local
OpenAI-compatible provider stub, applies a tool-driven config edit, mocks only
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@
# =============================================================================
# Scenario: macos_customization_save_rollback_smoke
#
# Focused parity proof for the untracked macOS customizations badge: Add to
# Focused correspondence proof for the untracked macOS customizations badge: Add to
# config, Build & Test, Save, and History rollback against a disposable repo.
# =============================================================================

2 changes: 1 addition & 1 deletion tests/e2e/scenarios/macos_homebrew_save_rollback_smoke.sh
@@ -2,7 +2,7 @@
# =============================================================================
# Scenario: macos_homebrew_save_rollback_smoke
#
# Focused parity proof for the untracked Homebrew badge: Add to config,
# Focused correspondence proof for the untracked Homebrew badge: Add to config,
# Build & Test, Save, and History rollback against a disposable config repo.
# =============================================================================

4 changes: 3 additions & 1 deletion tools/computer-use-e2e/ARCHITECTURE.md
@@ -7,7 +7,9 @@ stable generic runner API.

The current production path is the Codex app-server Computer Use lane driven by
`run-remote-cua.mjs`. Future drivers are planned work, not current production
behavior.
behavior. The Peekaboo AX/screen-capture runner is a complementary proof lane,
not the production Computer Use driver adapter; its current policy lives in
`docs/e2e-dual-lane-strategy.md`.

## Current Boundary

18 changes: 16 additions & 2 deletions tools/computer-use-e2e/README.md
@@ -17,6 +17,19 @@ Proof report must make uncertainty visible: missing proof, low-signal evidence,
stale coverage, expired waivers, remote-infra blockers, and provider failures
must fail or downgrade the run instead of being hidden behind a passing check.

## Dual-Lane Policy

nixmac uses two complementary desktop E2E lanes. The Computer Use lane is the
broad PR-ready Product Proof lane for user workflows. The Peekaboo AX/screen-capture
lane is a fast deterministic Mac proof lane for launch/readiness, screenshots,
fixture/state checks, Nix/system boundaries, MacInCloud health, and focused
smoke coverage.

Peekaboo reports may map evidence to Computer Use keys, but that map is
correspondence, not a replacement claim. Missing Computer Use breadth is visible
in the report and does not by itself fail a Peekaboo run. See
`docs/e2e-dual-lane-strategy.md` for the engineering rationale and lane policy.

## Remote Computer Use Lane

This is the PR-ready lane. Start Codex app-server on the target Mac and tunnel
@@ -649,8 +662,9 @@ MacInCloud operator notes:
CSS-only solid capture instead.

The remote Codex app-server lane remains the PR/Product Proof production lane.
The Peekaboo lane is isolated local evidence so the team can compare driver
approaches without changing the remote workflow contract.
The Peekaboo lane is complementary local and MacInCloud evidence so the team can
catch Mac/fixture/report regressions without changing the Computer Use workflow
contract.

### Real Provider Local Lane

Original file line number Diff line number Diff line change
@@ -127,14 +127,14 @@ assert.match(proof, /artifacts\/computer-use-local\/\.current-run/, 'workflow mu
assert.match(proof, /remote_artifact_root="\$\{PEEKABOO_REPO_DIR%\/\}\/artifacts\/computer-use-local"/, 'workflow must anchor fetched reports to the expected remote artifact root');
assert.match(proof, /remote_run_dir" != "\$remote_artifact_root"\/\*/, 'workflow must reject current-run paths outside the artifact root before rsync');
assert.match(proof, /remote_run_dir_physical="\$\(ssh[\s\S]*pwd -P/, 'workflow must verify the physical remote report path before rsyncing');
assert.match(proof, /name: Record CI report inspection proof[\s\S]*run-local\.mjs verify-report "\$FETCH_REPORT_DIR"[\s\S]*--method ci-static/, 'workflow must record reportInspection parity proof before metadata is collected');
assert.match(proof, /name: Record CI report inspection proof[\s\S]*run-local\.mjs verify-report "\$FETCH_REPORT_DIR"[\s\S]*--method ci-static/, 'workflow must record reportInspection proof before metadata is collected');
{
const dropKeyIndex = proof.indexOf('name: Drop MacInCloud SSH key before local report processing');
const verifyReportIndex = proof.indexOf('run-local.mjs verify-report "$FETCH_REPORT_DIR"');
assert.notEqual(dropKeyIndex, -1, 'workflow must explicitly remove the MacInCloud SSH key before local report processing');
assert.ok(dropKeyIndex < verifyReportIndex, 'workflow must remove the MacInCloud SSH key before running local report verification code');
}
assert.match(proof, /CI workflow inspected the rendered Peekaboo HTML report[\s\S]*PR #75 baseline parity coverage[\s\S]*evidence video\/storyboard/, 'CI report inspection notes must describe concrete report sections');
assert.match(proof, /CI workflow inspected the rendered Peekaboo HTML report[\s\S]*Computer Use correspondence map[\s\S]*evidence video\/storyboard/, 'CI report inspection notes must describe concrete report sections');
assert.match(proof, /name: Upload Peekaboo report artifact[\s\S]*name: peekaboo-e2e-report/, 'proof job must upload the HTML report artifact');
assert.doesNotMatch(proof, /sudo apt-get install/, 'proof job must not spend PR time installing media packages on the hosted runner');

@@ -169,8 +169,9 @@ assert.match(result, /verdict="\$\{\{ needs\.peekaboo-product-proof\.outputs\.ve
assert.match(result, /publish_result="\$\{\{ needs\.publish-peekaboo-report\.result \}\}"/, 'result job must observe report publishing');
assert.match(result, /secret_scan_passed="\$\{\{ needs\.peekaboo-product-proof\.outputs\.secret_scan_passed \}\}"/, 'result job must observe the report secret scan result');
assert.match(result, /secret scan did not pass; hosted publishing is intentionally blocked/, 'result job must fail clearly when secret scan blocks publishing');
assert.match(result, /cu_covered="\$\{\{ needs\.peekaboo-product-proof\.outputs\.cu_covered \}\}"/, 'result job must observe covered required Computer Use keys');
assert.match(result, /cu_covered" != "\$cu_required"/, 'result job must fail when required Computer Use parity coverage is incomplete');
assert.match(result, /cu_covered="\$\{\{ needs\.peekaboo-product-proof\.outputs\.cu_covered \}\}"/, 'result job must observe mapped Computer Use correspondence keys');
assert.match(result, /missing breadth is reported but no longer fails this complementary lane/, 'result job must report missing Computer Use breadth without failing the complementary Peekaboo lane');
assert.doesNotMatch(result, /required Computer Use parity keys/, 'result job must not describe Peekaboo as a required Computer Use parity lane');
assert.match(result, /verdict" != "pass"/, 'result job must fail non-pass reports');
assert.match(result, /publish job result was \$publish_result/, 'result job must fail PR runs when publishing fails');
