wbugitlab1 · Jun 20, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -68,3 +68,38 @@ jobs:
           path: coverage/
           if-no-files-found: error
           retention-days: 7
+
+  restart-retest:
+    name: "Issue #349 restart retest (${{ matrix.os }})"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, macos-latest]
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          persist-credentials: false
+      - uses: actions/setup-node@v6
+        with:
+          node-version: 22
+      - run: corepack enable
+      - run: pnpm install --frozen-lockfile --ignore-scripts
+      - run: pnpm run build
+      - run: node scripts/github/issue-349-restart-retest.mjs
+
+  engine-state-probe:
+    name: "Issue #349 engine state probe (${{ matrix.os }})"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, macos-latest]
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          persist-credentials: false
+      - uses: actions/setup-node@v6
+        with:
+          node-version: 22
+      - run: node scripts/github/issue-349-engine-state-probe.mjs
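The `restart-retest` job above ends by running `scripts/github/issue-349-restart-retest.mjs`, which the plan in this PR describes as starting the built CLI with a temp `HOME`, temp data dir, and `AGENTMEMORY_READY_TIMEOUT_MS=120000`. The script itself is not shown here, so the following is only a minimal sketch of how that isolation might be laid out. `buildLaunchPlan`, its parameters, and the returned field names are hypothetical and not taken from the real script.

```javascript
// Sketch only: buildLaunchPlan and its field names are hypothetical,
// not taken from scripts/github/issue-349-restart-retest.mjs.
function buildLaunchPlan(tempRoot, cliPath, baseEnv = {}) {
  // The CI matrix is ubuntu-latest and macos-latest, so "/" joins are enough here.
  const home = `${tempRoot}/home`;
  const dataDir = `${tempRoot}/data`;
  const cwd = `${tempRoot}/cwd`;
  return {
    // Directories the caller would create before spawning anything.
    dirs: [home, dataDir, cwd],
    cwd,
    // Overriding HOME keeps the run away from the real user's state.
    env: { ...baseEnv, HOME: home, AGENTMEMORY_READY_TIMEOUT_MS: "120000" },
    // Mirrors `node dist/cli.mjs --data-dir <temp-data-dir>` from the plan.
    startArgs: [cliPath, "--data-dir", dataDir],
    // Mirrors `node dist/cli.mjs stop` from the plan.
    stopArgs: [cliPath, "stop"],
  };
}

const plan = buildLaunchPlan("/tmp/issue-349-example", "/repo/dist/cli.mjs", {
  PATH: "/usr/bin",
});
console.log(plan.startArgs.join(" "));
console.log(plan.env.HOME);
```

Keeping the plan as plain data means the start, stop, and restart steps can all reuse the same `env` and data dir, which is the property the retest depends on.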
diff --git a/docs/todos/2026-06-20-issue-349-lost-data-after-restart/arena-synthesis.md b/docs/todos/2026-06-20-issue-349-lost-data-after-restart/arena-synthesis.md
@@ -0,0 +1,102 @@
+# Arena Synthesis: Issue 349
+
+## Rubric
+
+1. Uses current repo evidence for restart, persistence, and stop behavior.
+2. Distinguishes issue #349's laptop restart / v0.9.27 screenshot from issue
+   #338's CLI stop path and Windows residual.
+3. Correctly identifies duplicate, stale, already-fixed, or remaining valid
+   scope and the Human Checkpoint requirement.
+4. Proposes the smallest testable next step without broad persistence or
+   iii-engine boundary changes.
+5. Names inspected sources, commands, files, and residual uncertainty.
+
+## Scores
+
+| Candidate | Repo evidence | Issue distinction | Classification / checkpoint | Next step | Sources / uncertainty | Total |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: |
+| A | 5 | 5 | 5 | 4 | 5 | 24 |
+| B | 5 | 5 | 5 | 5 | 5 | 25 |
+| C | 4 | 5 | 5 | 3 | 3 | 20 |
+
+## Decision
+
+Base: Candidate B.
+
+Candidate B is the strongest base because it cleanly decomposes "restarted
+laptop" into separate possible paths: supported `agentmemory stop`, OS reboot
+while the daemon is running, forceful power loss, or a startup/catalog issue.
+That keeps the conclusion precise: issue #349 overlaps with the #338 data-loss
+family, but it is not implementation-ready and should not be closed or mutated
+without a Human Checkpoint.
+
+Grafts:
+- From Candidate A: current app-level index rebuilds use iii `state::list`; they
+  do not scan raw iii state files. If iii's catalog returns empty after boot,
+  patching around raw state files would cross engine/persistence boundaries and
+  needs approval.
+- From Candidate A: the #338 path/data-dir class is stale on current code, but a
+  literal OS/laptop restart is broader than the CLI stop path.
+- From Candidate C: the compact final framing that #349 is stale or a likely
+  duplicate only under the #338 `agentmemory stop` interpretation, and is not
+  independently valid for implementation without a Human Checkpoint.
+
+Rejected:
+- Closing #349 now as already fixed. The public issue action requires approval,
+  and #349 says "restarted laptop", not confirmed `agentmemory stop`.
+- Implementing now. There is no current-main reproduction and the likely
+  distinct paths cross restart, persistence, iii-engine lifecycle, or startup
+  reconciliation boundaries.
+- Treating #1034 as a persistence change. The diff from the #338 merge to
+  current `origin/main` is iii runtime compatibility diagnostics and task docs.
+- Claiming the CLI stop fix covers arbitrary OS reboot. PR #1033 invokes the
+  checkpoint through `agentmemory stop`; a laptop reboot may bypass that endpoint
+  and rely on worker process signals or platform shutdown ordering.
+
+## Validity Finding
+
+Issue #349 requires a Human Checkpoint.
+
+Current evidence supports **already fixed / stale / likely duplicate only for
+the #338 class**: `agentmemory stop` now checkpoints the worker before native
+signals through `postShutdownFlush()`, `executeResponsiveNativeStop()`,
+`mem::shutdown-flush`, and authenticated `POST /agentmemory/shutdown/flush`.
+
+Current evidence does **not** prove a literal laptop or OS restart is fixed.
+The issue body and upstream source provide no commands, OS, logs, data-dir
+details, or current-version reproduction. The screenshot shows v0.9.27 before
+PR #1033 merged. The worker still has a normal `SIGINT`/`SIGTERM` shutdown path
+for non-CLI process termination, so a non-CLI reboot can bypass the #1033 CLI
+checkpoint.
+
+## Recommended Checkpoint Options
+
+Recommended: keep the issue open and post a clarification/retest comment asking
+for OS, current version, whether `agentmemory stop` was used before reboot,
+whether the issue reproduces on a build containing PR #1033, and whether old
+state files remain under the data directory after restart.
+
+Other options:
+- Close as covered by #338 / PR #1033 if the user accepts the ambiguity and
+  wants to treat the v0.9.27 report as stale or duplicate.
+- Approve a narrow validation task that first builds a reproduction harness for
+  OS/laptop restart behavior separately from the already-fixed CLI stop path.
+
+## Verification
+
+Arena verification completed by reading every candidate report and comparing
+the judge verdict with a parent source inspection:
+- Public issue #349 and upstream #876 contain the same sparse v0.9.27
+  laptop-restart report and no comments.
+- Public issue #338 is closed as completed by PR #1033, merge commit
+  `2ecbe54aa822462c5480beb59ac0f391723dfabd`.
+- Current `origin/main` is
+  `257238ab1c318b2e9ae5efcbe72863b99c41ee35`.
+- `git diff --quiet 2ecbe54aa822462c5480beb59ac0f391723dfabd..origin/main --`
+  scoped to the shutdown, index-persistence, API flush, and relevant test files
+  exited `0`, meaning #1034 did not change those surfaces.
+- `rg` confirms the #1033 shutdown flush path and the remaining worker signal
+  shutdown path.
+
+No implementation tests were run because the current outcome is a read-only
+validity checkpoint, not a code change.
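The totals in the Scores table of `arena-synthesis.md` can be re-derived from the five rubric columns. The sketch below copies the per-criterion scores from the table and sums them. The helper names `totals` and `pickBase` are hypothetical, and the Decision section gives qualitative reasons for choosing Candidate B, so the tally only shows that the numbers are consistent with that choice.

```javascript
// Rubric scores copied from the Scores table; column order follows the table:
// repo evidence, issue distinction, classification, next step, sources.
const scores = {
  A: [5, 5, 5, 4, 5],
  B: [5, 5, 5, 5, 5],
  C: [4, 5, 5, 3, 3],
};

// Sum each candidate's row into a { name: total } object.
function totals(table) {
  return Object.fromEntries(
    Object.entries(table).map(([name, row]) => [
      name,
      row.reduce((sum, value) => sum + value, 0),
    ]),
  );
}

// Highest total wins; ties are not handled because the table has none.
function pickBase(table) {
  return Object.entries(totals(table)).sort((x, y) => y[1] - x[1])[0][0];
}

console.log(totals(scores)); // { A: 24, B: 25, C: 20 }
console.log(pickBase(scores)); // B
```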
diff --git a/docs/todos/2026-06-20-issue-349-lost-data-after-restart/plan.md b/docs/todos/2026-06-20-issue-349-lost-data-after-restart/plan.md
@@ -0,0 +1,180 @@
+# Issue 349 GitHub Restart Retest Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Add a GitHub Actions retest harness that validates the current #338 restart fix on clean GitHub runners for issue #349.
+
+**Architecture:** Keep production code unchanged. Add a small Node.js script under `scripts/github/` that starts the built CLI in an isolated temp `HOME`, writes a sentinel memory through REST, stops via the supported CLI path, restarts, and verifies the sentinel survives via REST search/list endpoints. Wire it into the existing CI workflow as a dedicated job on Ubuntu and macOS so the workflow runs from the normal PR path.
+
+**Tech Stack:** GitHub Actions, Node.js 22, pnpm 11, existing built `dist/cli.mjs`, built-in `fetch`, `node:child_process`, and repository REST endpoints.
+
+---
+
+## Files
+
+- Create: `scripts/github/issue-349-restart-retest.mjs`
+- Modify: `.github/workflows/ci.yml`
+- Modify: `docs/todos/2026-06-20-issue-349-lost-data-after-restart/todo.md`
+
+## Task 1: Add The Retest Harness Script
+
+**Files:**
+- Create: `scripts/github/issue-349-restart-retest.mjs`
+
+- [x] **Step 1: Write the harness script**
+
+Create `scripts/github/issue-349-restart-retest.mjs` with these responsibilities:
+- Create a temp root, temp `HOME`, temp data dir, and temp invocation cwd.
+- Start `node dist/cli.mjs --data-dir <temp-data-dir>` with `HOME` and `AGENTMEMORY_READY_TIMEOUT_MS=120000`.
+- Wait for `/agentmemory/health`.
+- POST a unique sentinel to `/agentmemory/remember`.
+- Verify the sentinel appears via `/agentmemory/search` and `/agentmemory/memories`.
+- POST `/agentmemory/shutdown/flush`.
+- Run `node dist/cli.mjs stop` with the same temp `HOME` and data dir.
+- Restart the server with the same temp `HOME` and data dir.
+- Verify the sentinel still appears via `/agentmemory/search` and `/agentmemory/memories`.
+- Stop the restarted server.
+- Print structured progress lines and fail fast with safe diagnostics if any step fails.
+
+- [x] **Step 2: Run the script against a built `dist/`**
+
+Run after build only:
+
+```bash
+corepack pnpm run build
+node scripts/github/issue-349-restart-retest.mjs
+```
+
+Expected on a clean runner: PASS. Expected locally in this worktree: may fail or be skipped if the default iii ports are already occupied by the user's daemon. Do not stop the user's daemon.
+
+Actual: `node --check scripts/github/issue-349-restart-retest.mjs` passed. The live harness was intentionally not run locally because the user's existing daemon is already listening on the default iii/REST ports; it will run on clean GitHub Actions runners after push.
+
+## Task 2: Wire The Harness Into Existing CI
+
+**Files:**
+- Modify: `.github/workflows/ci.yml`
+
+- [x] **Step 1: Add a dedicated job**
+
+Add a `restart-retest` job after the existing test job:
+- `runs-on: ${{ matrix.os }}`
+- matrix `os: [ubuntu-latest, macos-latest]`
+- checkout with `persist-credentials: false`
+- setup Node 22
+- enable corepack
+- `pnpm install --frozen-lockfile --ignore-scripts`
+- `pnpm run build`
+- `node scripts/github/issue-349-restart-retest.mjs`
+
+Keep it separate from the existing `test` job so failures point directly at issue #349 restart behavior.
+
+- [x] **Step 2: Verify the workflow text**
+
+Run:
+
+```bash
+git diff --check
+```
+
+Expected: no whitespace errors.
+
+Actual: `git diff --check` passed.
+
+## Task 3: Local Verification
+
+**Files:**
+- All touched files
+
+- [x] **Step 1: Run focused tests**
+
+Run:
+
+```bash
+corepack pnpm exec vitest run test/index-persistence.test.ts test/search.test.ts test/shutdown-flush.test.ts test/api-boundary-coverage.test.ts test/cli-stop-port-detection.test.ts test/reconnect-registration.test.ts test/engine-launch.test.ts test/runtime-config.test.ts test/cli-iii-config.test.ts test/consistency.test.ts
+```
+
+Expected: all targeted tests pass.
+
+Actual: passed, 10 test files / 138 tests.
+
+- [x] **Step 2: Run build**
+
+Run:
+
+```bash
+corepack pnpm run build
+```
+
+Expected: build exits 0 and produces `dist/cli.mjs`.
+
+Actual: `corepack pnpm run build` passed and produced `dist/cli.mjs`.
+
+- [x] **Step 3: Run local harness only if safe**
+
+Before running the live harness locally, verify no existing iii/agentmemory process is listening on `49134` or `3111`:
+
+```bash
+lsof -nP -iTCP:49134 -sTCP:LISTEN
+lsof -nP -iTCP:3111 -sTCP:LISTEN
+```
+
+If those ports are occupied, do not run the local live harness. Record the blocker and rely on GitHub Actions clean runners after push.
+
+Actual: ports `49134` and `3111` are occupied by the user's existing daemon, so the live harness was not run locally.
+
+## Task 4: Publish For GitHub Retest
+
+**Files:**
+- Git branch / PR metadata
+
+- [x] **Step 1: Stage and commit task-owned files**
+
+Run:
+
+```bash
+git add .github/workflows/ci.yml scripts/github/issue-349-restart-retest.mjs docs/todos/2026-06-20-issue-349-lost-data-after-restart/todo.md docs/todos/2026-06-20-issue-349-lost-data-after-restart/arena-synthesis.md docs/todos/2026-06-20-issue-349-lost-data-after-restart/plan.md
+git commit -m "test: add issue 349 restart retest harness"
+```
+
+Actual: staged paths were limited to the workflow, GitHub retest script, and issue #349 task notes. `git diff --cached --check` passed and `gitleaks protect --staged --redact` found no leaks across about 30 KB of staged content before commit.
+
+- [x] **Step 2: Push to origin**
+
+Run only after local verification:
+
+```bash
+git push -u origin issue/349-lost-data-after-restart
+```
+
+Actual: pushed branch `issue/349-lost-data-after-restart` to `origin`.
+
+- [x] **Step 3: Create PR against `origin/main`**
+
+Use a PR body that states:
+- This is a retest harness for #349, not a product fix.
+- It compares #349 against #338 / PR #1033.
+- Local targeted tests and build passed.
+- Local live harness was blocked by an existing user daemon on default ports.
+- GitHub Actions clean runners are expected to run the restart harness.
+
+Actual: created PR #1038 against `origin/main`.
+
+- [x] **Step 4: Monitor GitHub Actions**
+
+Fetch PR check status and inspect failed job logs if any. Do not merge until the retest result is understood and the user approves the final issue outcome.
+
+Actual: first GitHub Actions run `27859863679` completed with both normal `test` jobs green and both new `Issue #349 restart retest` jobs red. Logs show the sentinel was visible before stop, `agentmemory stop` reported persistence, and the second start rebuilt the search index with zero entries before failing to find the sentinel.
+
+Follow-up: pushed diagnostic commit `78dd8f48` and reran GitHub Actions as
+`27859962631`. Normal test jobs passed again. Both restart-retest jobs failed
+again, now with explicit evidence that after restart both search and memory list
+lost the sentinel: `search=false memories=false`,
+`search={"format":"full","results":[],"tokens_used":0,"truncated":false}`,
+and `memories={"limit":null,"memories":[],"offset":0,"total":0}`.
+
+## Self-Review
+
+- The plan does not change production TypeScript runtime behavior.
+- The workflow change is isolated to a dedicated CI job for a user-approved GitHub retest.
+- The live harness uses temp `HOME`/data directories and no credentials.
+- The user's local daemon is never stopped or reused by the harness.
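The post-restart check in `plan.md` comes down to whether the sentinel text appears in both the search response and the memory list. A minimal sketch of that comparison follows, using the two empty payloads logged by run `27859962631`. `hasSentinel`, `summarize`, and the example sentinel string are hypothetical, and the sketch assumes the sentinel is plain alphanumerics and hyphens so that JSON escaping cannot hide it.

```javascript
// Sketch only: hasSentinel and summarize are hypothetical helper names.
function hasSentinel(payload, sentinel) {
  // Serialising the whole payload avoids assuming which field holds the text.
  // Assumes the sentinel has no characters that JSON.stringify would escape.
  return JSON.stringify(payload).includes(sentinel);
}

function summarize(searchPayload, memoriesPayload, sentinel) {
  const search = hasSentinel(searchPayload, sentinel);
  const memories = hasSentinel(memoriesPayload, sentinel);
  // The line format matches the diagnostic quoted in the plan.
  return { ok: search && memories, line: `search=${search} memories=${memories}` };
}

// Payloads copied from the follow-up run after restart.
const emptySearch = { format: "full", results: [], tokens_used: 0, truncated: false };
const emptyMemories = { limit: null, memories: [], offset: 0, total: 0 };

const result = summarize(emptySearch, emptyMemories, "issue-349-sentinel-example");
console.log(result.line); // search=false memories=false
```

Reporting the two booleans separately keeps a search-index-only loss distinguishable from a loss of the stored memories themselves; in the logged run both were false.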