4 changes: 3 additions & 1 deletion README.md
@@ -1815,8 +1815,10 @@ The `/agentmemory/health` evaluator uses conservative defaults, and operators ca
| `AGENTMEMORY_HEALTH_MEM_WARN_PCT` | `80` | Heap usage warning threshold percent |
| `AGENTMEMORY_HEALTH_MEM_CRITICAL_PCT` | `95` | Heap usage critical threshold percent |
| `AGENTMEMORY_HEALTH_MEM_RSS_FLOOR_MB` | `512` | Minimum RSS in MB before heap warning/critical alerts fire |
| `AGENTMEMORY_HEALTH_MEM_CRITICAL_RSS_MB` | `4096` | Process RSS in MB that counts as critical memory pressure |
| `AGENTMEMORY_HEALTH_MEM_SYSTEM_FREE_FLOOR_RATIO` | `0.1` | Host free-memory ratio below which high heap usage may become critical; set to `0` to disable this host-memory gate |

-Unset, empty, invalid, zero, or negative values fall back to the defaults above.
+Unset, empty, invalid, zero, or negative values fall back to the defaults above, except `AGENTMEMORY_HEALTH_MEM_SYSTEM_FREE_FLOOR_RATIO=0`, which disables the host free-memory pressure gate.

### Scheduled backups

202 changes: 202 additions & 0 deletions docs/todos/2026-06-18-issue-491-memory-critical-pressure/plan.md
@@ -0,0 +1,202 @@
# Memory Critical Pressure Gate Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make `memory_critical` require real memory pressure instead of only high heap ratio plus the existing RSS floor.

**Architecture:** Keep the health evaluator as the single policy owner in `src/health/thresholds.ts`. Add critical-pressure thresholds to the existing config, collect host memory in `src/health/monitor.ts`, and preserve warn-level memory semantics.

**Tech Stack:** TypeScript ESM, Node built-ins, Vitest, existing health monitor/evaluator modules.

---

## File Structure

- Modify `src/types.ts`: allow optional host memory fields in `HealthSnapshot.memory`.
- Modify `src/health/thresholds.ts`: add critical RSS ceiling and system-free ratio floor config, env parsing, and critical memory gate logic.
- Modify `src/health/monitor.ts`: collect `os.freemem()` and `os.totalmem()` in the snapshot.
- Modify `test/health-thresholds.test.ts`: add red/green coverage for false-positive suppression, process RSS critical pressure, host low-memory critical pressure, env overrides, and invalid fallback.
- Modify `README.md`: document new health threshold environment variables.
- Update `docs/todos/2026-06-18-issue-491-memory-critical-pressure/todo.md`: record evidence, verification, reviews, and final status.

## Task 1: Write Failing Threshold Tests

**Files:**
- Modify: `test/health-thresholds.test.ts`

- [x] **Step 1: Add new env keys to test isolation**

Add `AGENTMEMORY_HEALTH_MEM_CRITICAL_RSS_MB` and `AGENTMEMORY_HEALTH_MEM_SYSTEM_FREE_FLOOR_RATIO` to `HEALTH_ENV_KEYS`.
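
As a rough illustration of why the new keys belong in `HEALTH_ENV_KEYS`, the sketch below shows one way such a list can drive save/clear/restore isolation around each test. The exact contents of the list and the helper names `saveAndClear` and `restore` are assumptions for illustration; the real test file may isolate its environment differently.

```typescript
// Hypothetical contents: only the last two keys are confirmed as additions by this plan.
const HEALTH_ENV_KEYS = [
  "AGENTMEMORY_HEALTH_MEM_WARN_PCT",
  "AGENTMEMORY_HEALTH_MEM_CRITICAL_PCT",
  "AGENTMEMORY_HEALTH_MEM_RSS_FLOOR_MB",
  "AGENTMEMORY_HEALTH_MEM_CRITICAL_RSS_MB",
  "AGENTMEMORY_HEALTH_MEM_SYSTEM_FREE_FLOOR_RATIO",
] as const;

type Env = Record<string, string | undefined>;

// Remember the current values of every health key, then remove them from env.
function saveAndClear(env: Env): Env {
  const saved: Env = {};
  for (const key of HEALTH_ENV_KEYS) {
    saved[key] = env[key];
    delete env[key];
  }
  return saved;
}

// Put back exactly what was there before the test ran.
function restore(env: Env, saved: Env): void {
  for (const key of HEALTH_ENV_KEYS) {
    const value = saved[key];
    if (value === undefined) delete env[key];
    else env[key] = value;
  }
}
```

In a Vitest suite these helpers would typically be called from `beforeEach` and `afterEach` with `process.env`, so an override set by one test cannot leak into the next.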

- [x] **Step 2: Add regression tests for pressure gating**

Add tests in `describe("evaluateHealth memory severity", ...)`:

```ts
it("does not go critical for high heap ratio without real memory pressure", () => {
  const s = snap({
    memory: {
      heapUsed: 970 * 1024 * 1024,
      heapTotal: 1000 * 1024 * 1024,
      rss: 1100 * 1024 * 1024,
      external: 0,
      systemFree: 8 * 1024 * 1024 * 1024,
      systemTotal: 16 * 1024 * 1024 * 1024,
    },
  });
  const { status, alerts, notes } = evaluateWithSnapshotHeapTotal(s);
  expect(status).toBe("degraded");
  expect(alerts.some((a) => a.startsWith("memory_critical_"))).toBe(false);
  expect(alerts.some((a) => a.startsWith("memory_warn_"))).toBe(true);
  expect(notes.some((n) => n.startsWith("memory_heap_tight_"))).toBe(false);
});

it("goes critical when high heap ratio has critical process RSS pressure", () => {
  const s = snap({
    memory: {
      heapUsed: 970 * 1024 * 1024,
      heapTotal: 1000 * 1024 * 1024,
      rss: 5 * 1024 * 1024 * 1024,
      external: 0,
      systemFree: 8 * 1024 * 1024 * 1024,
      systemTotal: 16 * 1024 * 1024 * 1024,
    },
  });
  const { status, alerts } = evaluateWithSnapshotHeapTotal(s);
  expect(status).toBe("critical");
  expect(alerts.some((a) => a.startsWith("memory_critical_"))).toBe(true);
});

it("goes critical when high heap ratio has low host free memory", () => {
  const s = snap({
    memory: {
      heapUsed: 970 * 1024 * 1024,
      heapTotal: 1000 * 1024 * 1024,
      rss: 1100 * 1024 * 1024,
      external: 0,
      systemFree: 1 * 1024 * 1024 * 1024,
      systemTotal: 16 * 1024 * 1024 * 1024,
    },
  });
  const { status, alerts } = evaluateWithSnapshotHeapTotal(s);
  expect(status).toBe("critical");
  expect(alerts.some((a) => a.startsWith("memory_critical_"))).toBe(true);
});
```

- [x] **Step 3: Run red test**

Run: `corepack pnpm exec vitest run test/health-thresholds.test.ts --exclude test/integration.test.ts`

Expected before implementation: false-positive suppression test fails because status is `critical`.

## Task 2: Implement Minimal Pressure Gate

**Files:**
- Modify: `src/types.ts`
- Modify: `src/health/thresholds.ts`
- Modify: `src/health/monitor.ts`

- [x] **Step 1: Extend snapshot memory type**

Add optional `systemFree?: number` and `systemTotal?: number` to `HealthSnapshot.memory` in `src/types.ts`.
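
A minimal sketch of what the extended shape could look like follows. The interface name `HealthSnapshotMemory` and the helper `hasHostMemory` are hypothetical; the four required fields are taken from the test fixtures in Task 1, and the real type in `src/types.ts` may carry more.

```typescript
// Assumed shape of HealthSnapshot.memory; only the two optional fields are new.
interface HealthSnapshotMemory {
  heapUsed: number;
  heapTotal: number;
  rss: number;
  external: number;
  systemFree?: number; // host free memory in bytes, when collected
  systemTotal?: number; // host total memory in bytes, when collected
}

// Hypothetical helper: true only when both host fields are usable.
function hasHostMemory(m: HealthSnapshotMemory): boolean {
  return (
    m.systemFree !== undefined &&
    m.systemTotal !== undefined &&
    m.systemTotal > 0
  );
}
```

Keeping both fields optional means snapshots produced before this change, or by callers that do not collect host memory, still type-check.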

- [x] **Step 2: Extend threshold config**

In `src/health/thresholds.ts`, add `memoryCriticalRssBytes` and `memorySystemFreeFloorRatio` to `ThresholdConfig`, default to `4096 * 1024 * 1024` and `0.1`, and parse env vars `AGENTMEMORY_HEALTH_MEM_CRITICAL_RSS_MB` plus `AGENTMEMORY_HEALTH_MEM_SYSTEM_FREE_FLOOR_RATIO`.
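
The fallback rules from the README can be sketched as below. This is an illustration under assumptions, not the actual parser in `src/health/thresholds.ts`: the helper names `parsePositive`, `parseRatio`, and `loadCriticalPressureConfig` are hypothetical, and the environment is passed in as a plain object so the snippet stays self-contained.

```typescript
type Env = Record<string, string | undefined>;

const MB = 1024 * 1024;
const DEFAULT_CRITICAL_RSS_MB = 4096;
const DEFAULT_SYSTEM_FREE_FLOOR_RATIO = 0.1;

// Unset, empty, invalid, zero, or negative values fall back to the default.
function parsePositive(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// Same rules, except that an explicit 0 is kept: it disables the host-memory gate.
// Assumption: values above 1 are passed through unchanged in this sketch.
function parseRatio(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 ? n : fallback;
}

function loadCriticalPressureConfig(env: Env) {
  return {
    memoryCriticalRssBytes:
      parsePositive(env.AGENTMEMORY_HEALTH_MEM_CRITICAL_RSS_MB, DEFAULT_CRITICAL_RSS_MB) * MB,
    memorySystemFreeFloorRatio: parseRatio(
      env.AGENTMEMORY_HEALTH_MEM_SYSTEM_FREE_FLOOR_RATIO,
      DEFAULT_SYSTEM_FREE_FLOOR_RATIO,
    ),
  };
}
```

Calling `loadCriticalPressureConfig(process.env)` would then yield the two new `ThresholdConfig` fields, with the RSS ceiling already converted from MB to bytes.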

- [x] **Step 3: Gate only critical memory alerts**

Compute:

```ts
const criticalRssPressure = rss >= cfg.memoryCriticalRssBytes;
const systemFreeRatio =
  snapshot.memory.systemTotal && snapshot.memory.systemTotal > 0
    ? (snapshot.memory.systemFree ?? snapshot.memory.systemTotal) /
      snapshot.memory.systemTotal
    : 1;
const systemMemoryPressure =
  cfg.memorySystemFreeFloorRatio > 0 &&
  systemFreeRatio < cfg.memorySystemFreeFloorRatio;
const realMemoryPressure = criticalRssPressure || systemMemoryPressure;
```

Then require `realMemoryPressure` only in the `memory_critical` branch:

```ts
if (memPercent > cfg.memoryCriticalPercent && rssAboveFloor && realMemoryPressure) {
  alerts.push(`memory_critical_${Math.round(memPercent)}%_rss${memMb}mb`);
  critical = true;
} else if (memPercent > cfg.memoryWarnPercent && rssAboveFloor) {
  alerts.push(`memory_warn_${Math.round(memPercent)}%_rss${memMb}mb`);
  degraded = true;
} else if (memPercent > cfg.memoryWarnPercent) {
  notes.push(`memory_heap_tight_${Math.round(memPercent)}%_rss${memMb}mb`);
}
```
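
To check the gate against the three Task 1 scenarios without the rest of the evaluator, the two fragments above can be folded into one standalone function. This is a sketch under stated assumptions: `classifyMemory`, `MemorySeverity`, and `memoryRssFloorBytes` are hypothetical names, `memPercent` is assumed to be `heapUsed / heapTotal * 100`, and `rssAboveFloor` is assumed to mean `rss >= floor`.

```typescript
const MB = 1024 * 1024;
const GB = 1024 * MB;

interface MemorySample {
  heapUsed: number;
  heapTotal: number;
  rss: number;
  systemFree?: number;
  systemTotal?: number;
}

interface GateConfig {
  memoryWarnPercent: number;
  memoryCriticalPercent: number;
  memoryRssFloorBytes: number; // assumed field name for the existing RSS floor
  memoryCriticalRssBytes: number;
  memorySystemFreeFloorRatio: number;
}

// Defaults taken from the README table.
const DEFAULTS: GateConfig = {
  memoryWarnPercent: 80,
  memoryCriticalPercent: 95,
  memoryRssFloorBytes: 512 * MB,
  memoryCriticalRssBytes: 4096 * MB,
  memorySystemFreeFloorRatio: 0.1,
};

type MemorySeverity = "critical" | "warn" | "heap_tight" | "ok";

function classifyMemory(m: MemorySample, cfg: GateConfig = DEFAULTS): MemorySeverity {
  const memPercent = (m.heapUsed / m.heapTotal) * 100; // assumption
  const rssAboveFloor = m.rss >= cfg.memoryRssFloorBytes; // assumption
  const criticalRssPressure = m.rss >= cfg.memoryCriticalRssBytes;
  // Missing host data is treated as "fully free", so it never signals pressure.
  const systemFreeRatio =
    m.systemTotal && m.systemTotal > 0
      ? (m.systemFree ?? m.systemTotal) / m.systemTotal
      : 1;
  const systemMemoryPressure =
    cfg.memorySystemFreeFloorRatio > 0 &&
    systemFreeRatio < cfg.memorySystemFreeFloorRatio;
  const realMemoryPressure = criticalRssPressure || systemMemoryPressure;

  if (memPercent > cfg.memoryCriticalPercent && rssAboveFloor && realMemoryPressure) {
    return "critical";
  }
  if (memPercent > cfg.memoryWarnPercent && rssAboveFloor) return "warn";
  if (memPercent > cfg.memoryWarnPercent) return "heap_tight";
  return "ok";
}

// Scenario 1 from Task 1: 97% heap, 1100 MB RSS, 8 of 16 GB free on the host.
const noPressure = classifyMemory({
  heapUsed: 970 * MB,
  heapTotal: 1000 * MB,
  rss: 1100 * MB,
  systemFree: 8 * GB,
  systemTotal: 16 * GB,
});
console.log(noPressure); // "warn": heap is tight, but neither pressure signal fires
```

With the same heap figures, raising RSS to 5 GB or dropping host free memory to 1 of 16 GB (a ratio of 0.0625) makes the function return `"critical"`, which mirrors the second and third tests.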

- [x] **Step 4: Collect host memory**

In `src/health/monitor.ts`, import `freemem` and `totalmem` from `node:os`, call them once during snapshot collection, and include `systemFree` and `systemTotal` in `snapshot.memory`.
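
A minimal sketch of that collection step is shown below. The function name `collectMemorySnapshot` and the exact set of fields are assumptions; the real monitor likely gathers more than memory.

```typescript
import { freemem, totalmem } from "node:os";

interface MemorySnapshot {
  heapUsed: number;
  heapTotal: number;
  rss: number;
  external: number;
  systemFree: number;
  systemTotal: number;
}

// Hypothetical collector: reads process and host memory once per snapshot.
function collectMemorySnapshot(): MemorySnapshot {
  const usage = process.memoryUsage();
  return {
    heapUsed: usage.heapUsed,
    heapTotal: usage.heapTotal,
    rss: usage.rss,
    external: usage.external,
    systemFree: freemem(), // bytes of free host memory
    systemTotal: totalmem(), // bytes of total host memory
  };
}
```

Reading both host values in the same call keeps the free/total ratio consistent within a single snapshot.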

- [x] **Step 5: Run green targeted test**

Run: `corepack pnpm exec vitest run test/health-thresholds.test.ts --exclude test/integration.test.ts`

Expected: health threshold tests pass.

## Task 3: Cover Env Overrides And Docs

**Files:**
- Modify: `test/health-thresholds.test.ts`
- Modify: `README.md`
- Modify: `docs/todos/2026-06-18-issue-491-memory-critical-pressure/todo.md`

- [x] **Step 1: Add env override assertions**

Extend the existing env override/fallback tests to prove:
- `AGENTMEMORY_HEALTH_MEM_CRITICAL_RSS_MB=2048` makes a 3 GB RSS sample count as critical process pressure.
- `AGENTMEMORY_HEALTH_MEM_SYSTEM_FREE_FLOOR_RATIO=0.25` treats 20% host free memory as pressure.
- Invalid values for both new vars fall back to defaults.
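
The arithmetic behind the first two bullets can be worked through directly. The helpers `rssPressure` and `hostPressure` below are hypothetical stand-ins for the two pressure checks, written only to show why each override flips the outcome relative to the defaults.

```typescript
const MB = 1024 * 1024;
const GB = 1024 * MB;

// Process pressure: RSS at or above the configured ceiling (given in MB).
function rssPressure(rssBytes: number, criticalRssMb: number): boolean {
  return rssBytes >= criticalRssMb * MB;
}

// Host pressure: free/total below the floor; a floor of 0 disables the check.
function hostPressure(freeBytes: number, totalBytes: number, floorRatio: number): boolean {
  return floorRatio > 0 && totalBytes > 0 && freeBytes / totalBytes < floorRatio;
}

// 3 GB is 3072 MB: under the default ceiling of 4096, over an override of 2048.
console.log(rssPressure(3 * GB, 4096), rssPressure(3 * GB, 2048)); // false true

// 20% free is above the default floor of 0.1 but below an override of 0.25.
console.log(hostPressure(0.2 * 16 * GB, 16 * GB, 0.1), hostPressure(0.2 * 16 * GB, 16 * GB, 0.25)); // false true
```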

- [x] **Step 2: Update README**

Add rows under "Health Thresholds":
- `AGENTMEMORY_HEALTH_MEM_CRITICAL_RSS_MB` default `4096`, "Process RSS in MB that counts as critical memory pressure."
- `AGENTMEMORY_HEALTH_MEM_SYSTEM_FREE_FLOOR_RATIO` default `0.1`, "Host free-memory ratio below which high heap usage may become critical; set to `0` to disable this host-memory gate."

- [x] **Step 3: Run targeted checks**

Run:
- `corepack pnpm exec vitest run test/health-thresholds.test.ts test/health-monitor.test.ts --exclude test/integration.test.ts`
- `git diff --check`
- `rg -n "AGENTMEMORY_HEALTH_MEM_(CRITICAL_RSS_MB|SYSTEM_FREE_FLOOR_RATIO)|memory_critical" README.md src test`

Expected: tests pass, diff check exits 0, references are consistent.

## Task 4: Review, Security Gates, Commit, Push, PR, Merge

**Files:**
- All task-owned files above

- [ ] **Step 1: Focused simplification and review**

Inspect the diff for unnecessary abstraction, stale tests, warning behavior regressions, and boundary creep. Preserve the existing health API shape and alert strings except for when `memory_critical` is emitted.

- [x] **Step 2: Final verification**

Run the smallest covering repo-native checks first, then broader checks if feasible:
- `corepack pnpm exec vitest run test/health-thresholds.test.ts test/health-monitor.test.ts --exclude test/integration.test.ts`
- `corepack pnpm run lint`
- `corepack pnpm run build`
- `semgrep scan --config p/default --error --metrics=off src/health/thresholds.ts src/health/monitor.ts src/types.ts test/health-thresholds.test.ts`
- `git diff --check`

- [ ] **Step 3: Secret gate and commit**

Stage only task-owned paths, inspect staged diff, run `gitleaks protect --staged --redact`, then commit with `fix: gate memory critical on real pressure`.

- [ ] **Step 4: GitHub PR flow**

Fetch `origin main`, verify branch diff against refreshed `origin/main`, push `issue/491-memory-critical-pressure` to `origin`, create PR against `main`, monitor CI, merge when checks pass, then verify final target state and archive the source thread if the thread tool is available.