35 changes: 35 additions & 0 deletions .github/workflows/ci.yml
@@ -68,3 +68,38 @@ jobs:
          path: coverage/
          if-no-files-found: error
          retention-days: 7

  restart-retest:
    name: "Issue #349 restart retest (${{ matrix.os }})"
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest]
    steps:
      - uses: actions/checkout@v6
        with:
          persist-credentials: false
      - uses: actions/setup-node@v6
        with:
          node-version: 22
      - run: corepack enable
      - run: pnpm install --frozen-lockfile --ignore-scripts
      - run: pnpm run build
      - run: node scripts/github/issue-349-restart-retest.mjs

  engine-state-probe:
    name: "Issue #349 engine state probe (${{ matrix.os }})"
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest]
    steps:
      - uses: actions/checkout@v6
        with:
          persist-credentials: false
      - uses: actions/setup-node@v6
        with:
          node-version: 22
      - run: node scripts/github/issue-349-engine-state-probe.mjs
102 changes: 102 additions & 0 deletions docs/todos/2026-06-20-issue-349-lost-data-after-restart/arena-synthesis.md
@@ -0,0 +1,102 @@
# Arena Synthesis: Issue 349

## Rubric

1. Uses current repo evidence for restart, persistence, and stop behavior.
2. Distinguishes issue #349's laptop restart / v0.9.27 screenshot from issue
#338's CLI stop path and Windows residual.
3. Correctly identifies duplicate, stale, already-fixed, or remaining valid
scope and the Human Checkpoint requirement.
4. Proposes the smallest testable next step without broad persistence or
iii-engine boundary changes.
5. Names inspected sources, commands, files, and residual uncertainty.

## Scores

| Candidate | Repo evidence | Issue distinction | Classification / checkpoint | Next step | Sources / uncertainty | Total |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| A | 5 | 5 | 5 | 4 | 5 | 24 |
| B | 5 | 5 | 5 | 5 | 5 | 25 |
| C | 4 | 5 | 5 | 3 | 3 | 20 |

## Decision

Base: Candidate B.

Candidate B is the strongest base because it cleanly decomposes "restarted
laptop" into separate possible paths: supported `agentmemory stop`, OS reboot
while the daemon is running, forceful power loss, or a startup/catalog issue.
That keeps the conclusion precise: issue #349 overlaps with the #338 data-loss
family, but it is not implementation-ready and should not be closed or mutated
without a Human Checkpoint.

Grafts:
- From Candidate A: current app-level index rebuilds use iii `state::list`; they
do not scan raw iii state files. If iii's catalog returns empty after boot,
patching around raw state files would cross engine/persistence boundaries and
needs approval.
- From Candidate A: the #338 path/data-dir class is stale on current code, but a
literal OS/laptop restart is broader than the CLI stop path.
- From Candidate C: compact final framing: #349 is stale or likely duplicate
only for the #338 `agentmemory stop` interpretation and is not independently
valid for implementation without a Human Checkpoint.

Rejected:
- Closing #349 now as already fixed. The public issue action requires approval,
and #349 says "restarted laptop", not confirmed `agentmemory stop`.
- Implementing now. There is no current-main reproduction and the likely
distinct paths cross restart, persistence, iii-engine lifecycle, or startup
reconciliation boundaries.
- Treating #1034 as a persistence change. The diff from the #338 merge to
current `origin/main` contains only iii runtime compatibility diagnostics and
task docs.
- Claiming the CLI stop fix covers arbitrary OS reboot. PR #1033 invokes the
checkpoint through `agentmemory stop`; a laptop reboot may bypass that endpoint
and rely on worker process signals or platform shutdown ordering.

## Validity Finding

Issue #349 requires a Human Checkpoint.

Current evidence supports **already fixed / stale / likely duplicate only for
the #338 class**: `agentmemory stop` now checkpoints the worker before native
signals through `postShutdownFlush()`, `executeResponsiveNativeStop()`,
`mem::shutdown-flush`, and authenticated `POST /agentmemory/shutdown/flush`.

Current evidence does **not** prove a literal laptop or OS restart is fixed.
The issue body and upstream source provide no commands, OS, logs, data-dir
details, or current-version reproduction. The screenshot shows v0.9.27 before
PR #1033 merged. The worker still has a normal `SIGINT`/`SIGTERM` shutdown path
for non-CLI process termination, so a non-CLI reboot can bypass the #1033 CLI
checkpoint.

## Recommended Checkpoint Options

Recommended: keep the issue open and post a clarification/retest comment asking
for OS, current version, whether `agentmemory stop` was used before reboot,
whether the issue reproduces on a build containing PR #1033, and whether old
state files remain under the data directory after restart.

Other options:
- Close as covered by #338 / PR #1033 if the user accepts the ambiguity and
wants to treat the v0.9.27 report as stale or duplicate.
- Approve a narrow validation task that first builds a reproduction harness for
OS/laptop restart behavior separately from the already-fixed CLI stop path.

## Verification

Arena verification was completed by reading every candidate report and
comparing the judge verdict with a parent source inspection:
- Public issue #349 and upstream #876 contain the same sparse v0.9.27
laptop-restart report and no comments.
- Public issue #338 is closed completed by PR #1033, merge commit
`2ecbe54aa822462c5480beb59ac0f391723dfabd`.
- Current `origin/main` is
`257238ab1c318b2e9ae5efcbe72863b99c41ee35`.
- `git diff --quiet 2ecbe54aa822462c5480beb59ac0f391723dfabd..origin/main --`,
scoped to the shutdown, index-persistence, API flush, and relevant test files,
exited with status `0` (no differences), meaning #1034 did not change those
surfaces.
- `rg` confirms the #1033 shutdown flush path and the remaining worker signal
shutdown path.

No implementation tests were run because the current outcome is a read-only
validity checkpoint, not a code change.
180 changes: 180 additions & 0 deletions docs/todos/2026-06-20-issue-349-lost-data-after-restart/plan.md
@@ -0,0 +1,180 @@
# Issue 349 GitHub Restart Retest Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add a GitHub Actions retest harness that validates the current #338 restart fix in clean GitHub runners for issue #349.

**Architecture:** Keep production code unchanged. Add a small Node.js script under `scripts/github/` that starts the built CLI in an isolated temp `HOME`, writes a sentinel memory through REST, stops via the supported CLI path, restarts, and verifies the sentinel survives via REST search/list endpoints. Wire it into the existing CI workflow as a dedicated job on Ubuntu and macOS so the workflow runs from the normal PR path.

**Tech Stack:** GitHub Actions, Node.js 22, pnpm 11, existing built `dist/cli.mjs`, built-in `fetch`, `node:child_process`, and repository REST endpoints.

---

## Files

- Create: `scripts/github/issue-349-restart-retest.mjs`
- Modify: `.github/workflows/ci.yml`
- Modify: `docs/todos/2026-06-20-issue-349-lost-data-after-restart/todo.md`

## Task 1: Add The Retest Harness Script

**Files:**
- Create: `scripts/github/issue-349-restart-retest.mjs`

- [x] **Step 1: Write the harness script**

Create `scripts/github/issue-349-restart-retest.mjs` with these responsibilities:
- Create a temp root, temp `HOME`, temp data dir, and temp invocation cwd.
- Start `node dist/cli.mjs --data-dir <temp-data-dir>` with `HOME` and `AGENTMEMORY_READY_TIMEOUT_MS=120000`.
- Wait for `/agentmemory/health`.
- POST a unique sentinel to `/agentmemory/remember`.
- Verify the sentinel appears via `/agentmemory/search` and `/agentmemory/memories`.
- POST `/agentmemory/shutdown/flush`.
- Run `node dist/cli.mjs stop` with the same temp `HOME` and data dir.
- Restart the server with the same temp `HOME` and data dir.
- Verify the sentinel still appears via `/agentmemory/search` and `/agentmemory/memories`.
- Stop the restarted server.
- Print structured progress lines and fail fast with safe diagnostics if any step fails.
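
The "wait for `/agentmemory/health`" step above can be sketched as a small polling loop. This is a minimal sketch, not the harness's actual code: the function name `waitForHealth`, its options, and the injectable `fetchImpl` parameter are assumptions introduced here so the loop can be exercised without a live server.

```javascript
// Hypothetical helper (not taken from the harness): poll a health URL until it
// answers with a 2xx status or the deadline passes.
// `fetchImpl` is injectable so the loop can be tested with a fake fetch.
async function waitForHealth(
  url,
  { timeoutMs = 120000, intervalMs = 500, fetchImpl = fetch } = {},
) {
  const deadline = Date.now() + timeoutMs;
  let lastError = 'no attempt made';
  while (Date.now() < deadline) {
    try {
      const res = await fetchImpl(url);
      if (res.ok) return true;
      lastError = `HTTP ${res.status}`;
    } catch (err) {
      // A refused connection is expected while the server is still booting.
      lastError = err.message;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  // Only the last failure reason is reported, never request or response bodies.
  throw new Error(`health check timed out after ${timeoutMs} ms (${lastError})`);
}
```

A caller would pass the REST base URL of the started server plus `/agentmemory/health`. The default timeout mirrors `AGENTMEMORY_READY_TIMEOUT_MS=120000`, but coupling the two values is a choice made for this sketch.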

- [x] **Step 2: Run the script against a built `dist/`**

Run after build only:

```bash
corepack pnpm run build
node scripts/github/issue-349-restart-retest.mjs
```

Expected on a clean runner: PASS. Expected locally in this worktree: may fail or be skipped if the default iii ports are already occupied by the user's daemon. Do not stop the user's daemon.

Actual: `node --check scripts/github/issue-349-restart-retest.mjs` passed. The live harness was intentionally not run locally because the user's existing daemon is already listening on the default iii/REST ports; it will run on clean GitHub Actions runners after push.

## Task 2: Wire The Harness Into Existing CI

**Files:**
- Modify: `.github/workflows/ci.yml`

- [x] **Step 1: Add a dedicated job**

Add a `restart-retest` job after the existing test job:
- `runs-on: ${{ matrix.os }}`
- matrix `os: [ubuntu-latest, macos-latest]`
- checkout with `persist-credentials: false`
- setup Node 22
- enable corepack
- `pnpm install --frozen-lockfile --ignore-scripts`
- `pnpm run build`
- `node scripts/github/issue-349-restart-retest.mjs`

Keep it separate from the existing `test` job so failures point directly at issue #349 restart behavior.

- [x] **Step 2: Verify the workflow text**

Run:

```bash
git diff --check
```

Expected: no whitespace errors.

Actual: `git diff --check` passed.

## Task 3: Local Verification

**Files:**
- All touched files

- [x] **Step 1: Run focused tests**

Run:

```bash
corepack pnpm exec vitest run test/index-persistence.test.ts test/search.test.ts test/shutdown-flush.test.ts test/api-boundary-coverage.test.ts test/cli-stop-port-detection.test.ts test/reconnect-registration.test.ts test/engine-launch.test.ts test/runtime-config.test.ts test/cli-iii-config.test.ts test/consistency.test.ts
```

Expected: all targeted tests pass.

Actual: passed, 10 test files / 138 tests.

- [x] **Step 2: Run build**

Run:

```bash
corepack pnpm run build
```

Expected: build exits 0 and produces `dist/cli.mjs`.

Actual: `corepack pnpm run build` passed and produced `dist/cli.mjs`.

- [x] **Step 3: Run local harness only if safe**

Before running the live harness locally, verify no existing iii/agentmemory process is listening on `49134` or `3111`:

```bash
lsof -nP -iTCP:49134 -sTCP:LISTEN
lsof -nP -iTCP:3111 -sTCP:LISTEN
```

If those ports are occupied, do not run the local live harness. Record the blocker and rely on GitHub Actions clean runners after push.

Actual: ports `49134` and `3111` are occupied by the user's existing daemon, so the live harness was not run locally.
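
The same precheck could also be done from Node, for example as a guard at the top of a local run. The sketch below is an assumption-laden illustration: `isPortListening` is a hypothetical name, and it uses a TCP connect probe rather than `lsof`, so it only reports whether something accepts connections on the given host. The dynamic `import()` is used only so the snippet loads as either CommonJS or ESM; in an `.mjs` script a static `import` would be the usual form.

```javascript
// Hypothetical guard (not part of the harness): report whether anything accepts
// TCP connections on host:port, so a live local daemon is never disturbed.
async function isPortListening(port, host = '127.0.0.1', timeoutMs = 1000) {
  // Dynamic import keeps the sketch loadable as CommonJS or ESM.
  const { connect } = await import('node:net');
  return new Promise((resolve) => {
    const socket = connect({ port, host });
    const finish = (listening) => {
      socket.destroy();
      resolve(listening);
    };
    // No answer within the timeout is treated as "not listening".
    socket.setTimeout(timeoutMs, () => finish(false));
    socket.once('connect', () => finish(true));
    // ECONNREFUSED and similar errors mean nothing is accepting connections.
    socket.once('error', () => finish(false));
  });
}
```

If `isPortListening(49134)` or `isPortListening(3111)` resolves to `true`, the safe choice is to skip the live harness, as in the step above. `lsof` remains the more informative check because it also names the owning process.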

## Task 4: Publish For GitHub Retest

**Files:**
- Git branch / PR metadata

- [x] **Step 1: Stage and commit task-owned files**

Run:

```bash
git add .github/workflows/ci.yml scripts/github/issue-349-restart-retest.mjs docs/todos/2026-06-20-issue-349-lost-data-after-restart/todo.md docs/todos/2026-06-20-issue-349-lost-data-after-restart/arena-synthesis.md docs/todos/2026-06-20-issue-349-lost-data-after-restart/plan.md
git commit -m "test: add issue 349 restart retest harness"
```

Actual: staged paths were limited to the workflow, GitHub retest script, and issue #349 task notes. `git diff --cached --check` passed and `gitleaks protect --staged --redact` found no leaks across about 30 KB of staged content before commit.

- [x] **Step 2: Push to origin**

Run only after local verification:

```bash
git push -u origin issue/349-lost-data-after-restart
```

Actual: pushed branch `issue/349-lost-data-after-restart` to `origin`.

- [x] **Step 3: Create PR against `origin/main`**

Use a PR body that states:
- This is a retest harness for #349, not a product fix.
- It compares #349 against #338 / PR #1033.
- Local targeted tests and build passed.
- Local live harness was blocked by an existing user daemon on default ports.
- GitHub Actions clean runners are expected to run the restart harness.

Actual: created PR #1038 against `origin/main`.

- [x] **Step 4: Monitor GitHub Actions**

Fetch PR check status and inspect failed job logs if any. Do not merge until the retest result is understood and the user approves the final issue outcome.

Actual: first GitHub Actions run `27859863679` completed with both normal `test` jobs green and both new `Issue #349 restart retest` jobs red. Logs show the sentinel was visible before stop, `agentmemory stop` reported persistence, and the second start rebuilt the search index with zero entries before failing to find the sentinel.

Follow-up: pushed diagnostic commit `78dd8f48` and reran GitHub Actions as
`27859962631`. Normal test jobs passed again. Both restart-retest jobs failed
again, now with explicit evidence that after restart both search and memory list
lost the sentinel: `search=false memories=false`,
`search={"format":"full","results":[],"tokens_used":0,"truncated":false}`,
and `memories={"limit":null,"memories":[],"offset":0,"total":0}`.
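
A diagnostic line such as `search=false memories=false` could be derived with a check like the one below. This is a hedged sketch: `containsSentinel` and `describeSurvival` are hypothetical names, and scanning the serialised payload is a simplification chosen here so that no field names inside individual results need to be assumed.

```javascript
// Hypothetical check: does the sentinel text appear anywhere in a JSON payload?
// Serialising the whole payload avoids assuming per-result field names.
// The sentinel should be plain alphanumerics and dashes so that JSON escaping
// cannot change how it appears in the serialised string.
function containsSentinel(payload, sentinel) {
  return JSON.stringify(payload ?? null).includes(sentinel);
}

// Build a one-line diagnostic in the style of `search=false memories=false`.
function describeSurvival(searchPayload, memoriesPayload, sentinel) {
  const search = containsSentinel(searchPayload, sentinel);
  const memories = containsSentinel(memoriesPayload, sentinel);
  return {
    survived: search && memories,
    line: `search=${search} memories=${memories}`,
  };
}
```

Fed the two empty responses quoted above, `describeSurvival` reports `survived: false`, which matches the failure recorded for run `27859962631`.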

## Self-Review

- The plan does not change production TypeScript runtime behavior.
- The workflow change is isolated to a dedicated CI job for a user-approved GitHub retest.
- The live harness uses temp `HOME`/data directories and no credentials.
- The harness never stops or reuses the user's local daemon.
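
The isolation described in this self-review could look roughly like the following. It is a sketch under stated assumptions: `buildIsolatedRun` is a hypothetical helper, the `home`/`data`/`cwd` subdirectory names are invented for illustration, and passing through only `PATH` is one possible way to keep credentials out of the child environment, not necessarily what the script does.

```javascript
// Hypothetical layout helper: derive the isolated paths and child environment
// for one retest run. It only computes values; creating `root` (for example
// with fs.mkdtemp) and the subdirectories is left to the caller.
// Forward-slash joins are sufficient for the Ubuntu and macOS runners.
function buildIsolatedRun(root, baseEnv = {}) {
  const home = `${root}/home`;
  const dataDir = `${root}/data`;
  const cwd = `${root}/cwd`;
  // Allowlist instead of copying baseEnv, so tokens are never inherited.
  const env = {
    PATH: baseEnv.PATH ?? '',
    HOME: home,
    AGENTMEMORY_READY_TIMEOUT_MS: '120000',
  };
  return { home, dataDir, cwd, env };
}
```

The returned `env` and `cwd` would be handed to `node:child_process` when spawning `node dist/cli.mjs --data-dir <temp-data-dir>`, with `dataDir` filling the placeholder.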