
[codex] Redesign Minimap as a lean navigation graph #1

Draft
himattm wants to merge 4 commits into main from codex/lean-minimap-navigation-graph

himattm (Owner) commented May 28, 2026

Summary

This PR resets Minimap around the narrower product goal we aligned on: an Android-only navigation memory layer for agents. It replaces the heavier proposal/journal-oriented model with a lean graph of semantic places and deterministic UI edges that agents can grow, replay, and validate over time.

Key changes:

  • Introduces the v1 .minimap/graph model with semantic places, place variants, deterministic edge files, explicit viewport compatibility for coordinate taps, and one active Android app profile.
  • Refactors the CLI around the agent-facing commands: init, doctor, whereami, layout, tap, scroll, back, and go.
  • Uses Android layout observations to recognize known places, add variants for changed screens, record new paths, and avoid silently claiming unknown screens as known places.
  • Adds fresh session/layout reuse so an agent can navigate via Minimap and verify from cached observed layout without immediately re-dumping Android UI state.
  • Updates the Codex/Claude agent skill packaging so agents are guided to use Minimap first and raw Android only as fallback evidence.
  • Adds benchmark notes and an active-development change benchmark protocol for known-route replay, graph growth, route repair, and changed screens.
  • Keeps the earlier selector compatibility fix so selector taps work against the real android layout shape.
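
To make the "deterministic edges" and "explicit viewport compatibility for coordinate taps" items concrete, here is a minimal sketch of how route lookup could separate the `no_known_path` and `no_compatible_path` outcomes mentioned in the hardening update further down. Everything in it is an assumption for illustration: the `Viewport`, `Action`, `Edge`, and `Route` types, the exact-match viewport rule, and the breadth-first search are hypothetical and are not taken from `crates/minimap-graph`.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical types; the real schema lives in crates/minimap-schemas.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Viewport { width: u32, height: u32 }

#[derive(Clone, Debug)]
enum Action {
    Selector(String),
    CoordinateTap { x: u32, y: u32, recorded_on: Viewport },
}

#[derive(Clone, Debug)]
struct Edge { from: String, to: String, action: Action }

#[derive(Debug, PartialEq, Eq)]
enum Route {
    Path(Vec<usize>),  // indices into the edge list, in replay order
    NoCompatiblePath,  // a route exists, but not on this viewport
    NoKnownPath,       // the graph has no route at all
}

// Assumption: a coordinate tap is only replayable on the exact viewport
// it was recorded on; selector edges are viewport-independent.
fn usable(edge: &Edge, current: Viewport) -> bool {
    match &edge.action {
        Action::Selector(_) => true,
        Action::CoordinateTap { recorded_on, .. } => *recorded_on == current,
    }
}

// Breadth-first search over the edges accepted by `allow`.
fn search(edges: &[Edge], from: &str, to: &str, allow: impl Fn(&Edge) -> bool) -> Option<Vec<usize>> {
    let mut came_from: HashMap<&str, usize> = HashMap::new();
    let mut seen: HashSet<&str> = HashSet::from([from]);
    let mut queue = VecDeque::from([from]);
    while let Some(place) = queue.pop_front() {
        if place == to {
            // Walk the predecessor links back to the start.
            let mut path = Vec::new();
            let mut cursor = place;
            while let Some(&index) = came_from.get(cursor) {
                path.push(index);
                cursor = edges[index].from.as_str();
            }
            path.reverse();
            return Some(path);
        }
        for (index, edge) in edges.iter().enumerate() {
            if edge.from == place && allow(edge) && seen.insert(edge.to.as_str()) {
                came_from.insert(edge.to.as_str(), index);
                queue.push_back(edge.to.as_str());
            }
        }
    }
    None
}

fn resolve(edges: &[Edge], from: &str, to: &str, current: Viewport) -> Route {
    if let Some(path) = search(edges, from, to, |edge| usable(edge, current)) {
        return Route::Path(path);
    }
    // Retry without the viewport filter to tell the two failures apart.
    if search(edges, from, to, |_| true).is_some() {
        Route::NoCompatiblePath
    } else {
        Route::NoKnownPath
    }
}
```

Under these assumptions, a route whose only link is a coordinate tap recorded on a different screen size is reported as incompatible rather than unknown, which gives the agent a more useful hint about whether to re-record the edge or explore for a new one.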

Review Focus

Please review this as a breaking v1 redesign rather than an incremental compatibility patch.

High-value review areas:

  • CLI contract in crates/minimap-cli/src/main.rs and crates/minimap-cli/tests/cli_contract.rs.
  • Place matching and variant behavior in crates/minimap-core/src/lib.rs.
  • Graph path resolution and viewport compatibility in crates/minimap-graph/src/lib.rs.
  • Repository layout and init behavior in crates/minimap-repo/src/lib.rs.
  • Schema names and serialized graph shape in crates/minimap-schemas/src/lib.rs.
  • Agent instructions in plugins/minimap-claude-code/skills/minimap-app-navigation/SKILL.md.
  • Benchmark interpretation in docs/MINIMAP_BENCHMARK_NOTES.md and docs/MINIMAP_CHANGE_BENCHMARK_PROTOCOL.md.

Known local-only files intentionally excluded from the PR: .claude/ settings/checkpoints.

Validation

Ran locally:

cargo fmt --check
cargo test
cargo clippy --all-targets -- -D warnings
git diff --cached --check

Also ran a controlled change smoke test against the installed Minimap CLI, using fake android layout and fake adb commands. Results:

/private/tmp/minimap-change-bench/runs/20260528-134010/change-smoke-results.json

Smoke coverage:

  • Existing place grew -> known_changed, one place variant, no edge churn.
  • New option/new screen -> needs_label, then new place plus new edge.
  • Known route with changed destination -> go succeeded, destination variant added.
  • Renamed selector -> old route surfaced config_error, repair recorded a replacement edge.
  • Removed option -> old route surfaced config_error, no new edge recorded.

Notes

The real Compose sample changed-app benchmark is not included yet. The current benchmark evidence covers known-path replay on Jetsnack plus deterministic change-case smoke. A follow-up should run the same protocol against a modified Compose sample build before treating the performance numbers as product claims.

Update (2026-06-11): hardening + device targeting

Two commits landed since the original push, closing the CI failure and the validated hardening backlog:

Harden matching tolerance, graph writes, and CLI safety

  • Safety/correctness fixes, each with regression tests: edge_id panic on long selectors; no_compatible_path vs no_known_path reachability; atomic graph writes; pending-transition TTL + dangling-edge guard; cross-device cache bleed (serial-less -> no cache); arg validation before mutation; duplicate-id detection in validate_graph; overlay android:id/button1 false-positive removed; doctor exit code routed through exit_code_for_status; 0600/0700 cache-file perms; default-deny redaction with tightened email/numeric heuristics.
  • Matching tolerance: per-dimension scoring is now a blend of Jaccard + containment so one-sided scroll drift heals without size-mismatched false merges; role histogram replaced with presence-set + min/max count term; normalize_label uses deunicode transliteration with a never-empty fallback; duplicate slugs surface label_mismatch unless --allow-duplicate-label. KNOWN_CHANGED_THRESHOLD tuned to 0.80, band-center of the measured clean gap [0.689, 0.902] on Jetsnack. Sibling detail screens (e.g. two product details) intentionally merge into one item-agnostic place.
  • Fixes the CI failure on the previous push (clippy manual_option_zip, rustfmt drift).
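
The Jaccard + containment blend described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `crates/minimap-core`: the 50/50 `JACCARD_WEIGHT`, the choice of the overlap coefficient (intersection over the smaller set) as the containment term, and the helper names are all hypothetical. Only `KNOWN_CHANGED_THRESHOLD = 0.80` comes from this description; the midpoint of the measured band [0.689, 0.902] is 0.7955.

```rust
use std::collections::HashSet;

/// Threshold quoted in the PR description.
const KNOWN_CHANGED_THRESHOLD: f64 = 0.80;

/// Hypothetical blend weight; the PR does not state the real value.
const JACCARD_WEIGHT: f64 = 0.5;

/// Small helper for building feature sets in examples.
fn set_of(items: &[&str]) -> HashSet<String> {
    items.iter().map(|item| item.to_string()).collect()
}

/// |A ∩ B| / |A ∪ B|; two empty sets count as identical.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// |A ∩ B| / min(|A|, |B|): high when one set sits mostly inside the other.
fn containment(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let smaller = a.len().min(b.len());
    if smaller == 0 {
        // Only two empty sets contain each other.
        return if a.len() == b.len() { 1.0 } else { 0.0 };
    }
    a.intersection(b).count() as f64 / smaller as f64
}

/// Weighted blend of the two measures for one scoring dimension.
fn blended_score(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    JACCARD_WEIGHT * jaccard(a, b) + (1.0 - JACCARD_WEIGHT) * containment(a, b)
}
```

With these assumed weights, a stored set of four features observed again with one extra feature scores 0.5 * 0.8 + 0.5 * 1.0 = 0.9 and clears the threshold, while a one-feature set compared against a ten-feature superset scores 0.5 * 0.1 + 0.5 * 1.0 = 0.55 and does not. That is the intended shape of the trade-off: containment forgives one-sided drift, and the Jaccard term keeps size-mismatched screens from merging.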

Thread device serial through adb and android CLI calls

  • New global --serial flag with ANDROID_SERIAL env fallback; every adb call now carries -s <serial>, android layout carries --device=<serial>, and android screen subcommands get ANDROID_SERIAL on the child process. With a configured serial, cache scoping no longer depends on adb get-serialno succeeding.
  • doctor now detects the multiple-devices-without-serial condition and reports an actionable hint instead of a raw adb failure; with a serial it reports the targeted device.
  • Contract tests use serial-asserting fakes that fail on any serial-less invocation, so the threading is proven end to end.
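
A minimal sketch of the serial threading described above, assuming the flag value and the `ANDROID_SERIAL` value arrive as optional strings. The helper names (`resolve_serial`, `adb_args`, `android_layout_args`) are hypothetical and the real CLI may structure this differently; the `-s <serial>` and `--device=<serial>` argument shapes are the ones stated in this description.

```rust
/// Treat missing, empty, or whitespace-only values as "not set".
fn non_empty(value: Option<&str>) -> Option<&str> {
    value.map(str::trim).filter(|s| !s.is_empty())
}

/// Prefer the explicit --serial value, then fall back to ANDROID_SERIAL.
fn resolve_serial(flag: Option<&str>, env: Option<&str>) -> Option<String> {
    non_empty(flag)
        .or_else(|| non_empty(env))
        .map(str::to_owned)
}

/// Build the argument list for an adb call, prefixing `-s <serial>` when set.
fn adb_args(serial: Option<&str>, command: &[&str]) -> Vec<String> {
    let mut args = Vec::new();
    if let Some(serial) = serial {
        args.push("-s".to_string());
        args.push(serial.to_string());
    }
    args.extend(command.iter().map(|part| part.to_string()));
    args
}

/// Build the argument list for `android layout`, adding `--device=<serial>` when set.
fn android_layout_args(serial: Option<&str>) -> Vec<String> {
    let mut args = vec!["layout".to_string()];
    if let Some(serial) = serial {
        args.push(format!("--device={serial}"));
    }
    args
}
```

In a real binary the environment half would come from something like `std::env::var("ANDROID_SERIAL").ok()`, and the resulting vectors would be passed to `std::process::Command::args`. Keeping the argument construction in pure functions is what makes serial-asserting fakes straightforward to write.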

Validation: cargo fmt --check, cargo clippy --all-targets -- -D warnings, cargo test -> 100 passed, 0 failed. Earlier live e2e on Jetsnack validated the full loop (init -> label -> grow -> re-identify known -> go replay -> viewport-mismatch refusal).

Deferred follow-ups (intentionally out of scope for v1):

  • Overlay detector false-negative on stock AlertDialogs (the android layout CLI emits their buttons as text without the resource-ids the detector scans).
  • Session-cache 30s staleness window: whereami can report a stale place if the screen changed via raw adb in between.
  • 1000ms settle is occasionally too short right after an app relaunch; skipped_edges diagnostics are noisy.
  • No per-command layout-call counter, so the "second pass cheaper" benchmark claim is inferable but not directly measured.
  • The real changed-Compose-app benchmark from the change protocol still needs a run before the performance numbers become product claims.
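
The session-cache staleness item is easier to reason about with a small model of the reuse check. The sketch below is an assumption-laden illustration: `CachedObservation`, `cached_place`, and the exact comparison are hypothetical, and only the 30-second window and the "serial-less -> no cache" rule come from this description.

```rust
use std::time::{Duration, Instant};

/// Staleness window quoted in the PR description.
const SESSION_CACHE_WINDOW: Duration = Duration::from_secs(30);

/// Hypothetical shape of a cached observation.
struct CachedObservation {
    serial: String,
    place: String,
    observed_at: Instant,
}

/// Return the cached place only if it belongs to the targeted device
/// and was observed within the staleness window.
fn cached_place<'a>(
    cache: &'a CachedObservation,
    serial: Option<&str>,
    now: Instant,
) -> Option<&'a str> {
    // Without a known serial there is no safe cache scope, so skip the cache.
    let serial = serial?;
    if cache.serial != serial {
        return None;
    }
    if now.saturating_duration_since(cache.observed_at) > SESSION_CACHE_WINDOW {
        return None;
    }
    Some(cache.place.as_str())
}
```

Nothing in a check of this shape can notice a screen change made through raw adb inside the window, which is why the cached place can be stale for up to 30 seconds until a fresh layout is observed.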

himattm added 4 commits May 21, 2026 18:29
`android layout` emits a flat array of nodes with hyphenated keys
(content-desc, resource-id) and a stringified center "[x,y]".
`resolve_selector_point` was looking for the legacy UIAutomator shape
(camelCase contentDescription/testTag, bounds object) and never
matched against real CLI output. Live smoke confirmed:
`minimap tap --selector content_desc=Settings` always failed with
"Selector not found" against the running emulator.

Extend the resolver to accept both shapes so existing fake-adb test
fixtures keep working and real CLI output is now supported:

- Selector key lookup tries hyphenated first, then camelCase.
- center_of falls back to parsing the "[x,y]" string when no
  bounds object is present.

Two new tests: tap_selector_resolves_real_cli_shape exercises the
end-to-end selector path against a real-CLI-shaped layout fixture;
parse_center_string_parses_bracketed_pair covers the parser edges.

Test count: 49 -> 51 passing, 0 failed, 1 ignored.
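
For reviewers who want the two fallbacks from the commit message in one place, here is a rough sketch. It assumes a node is a flat string-to-string map, which is a simplification; the function names and signatures are illustrative guesses (the commit only names `resolve_selector_point`, `center_of`, and the test functions), so treat this as a model of the behavior rather than the actual resolver.

```rust
use std::collections::HashMap;

/// Parse a stringified center such as "[540,1200]" into (x, y).
/// Returns None for anything that is not a bracketed pair of integers.
fn parse_center_string(raw: &str) -> Option<(i32, i32)> {
    let inner = raw.trim().strip_prefix('[')?.strip_suffix(']')?;
    let (x, y) = inner.split_once(',')?;
    let x = x.trim().parse::<i32>().ok()?;
    let y = y.trim().parse::<i32>().ok()?;
    Some((x, y))
}

/// Look up a selector value, trying the hyphenated key emitted by
/// `android layout` first and the legacy camelCase key second.
fn selector_value<'a>(
    node: &'a HashMap<String, String>,
    hyphenated: &str,
    camel_case: &str,
) -> Option<&'a str> {
    node.get(hyphenated)
        .or_else(|| node.get(camel_case))
        .map(String::as_str)
}
```

A call such as `selector_value(&node, "content-desc", "contentDescription")` then works for both the real CLI shape and the older fixture shape, and a triple like `"[1,2,3]"` is rejected because the second half fails to parse as a single integer.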