
fix: add per-ID mutex lock for update() to prevent concurrent corruption#143

Open
baizenghu wants to merge 1 commit into CortexReach:main from baizenghu:fix/update-mutex-lock

Conversation

@baizenghu

Problem

When a single agent processes rapid consecutive messages, the auto-capture hook can trigger multiple update() calls on the same memory entry simultaneously. Since update() performs a read-modify-write cycle without any locking, this leads to data corruption:

  1. update() vs update(): Two concurrent updates on the same ID — one overwrites the other's changes (last-write-wins, silently losing data)
  2. update() vs bulkDelete(): An update re-inserts a row that was just deleted (e.g. during a scope purge), leaving orphaned entries
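The first race can be sketched with a toy in-memory store (illustrative only — the Entry shape and update signature here are assumptions, not the project's actual src/store.ts code):

```typescript
// Lost-update sketch: read-modify-write with an await in the middle.
type Entry = { id: string; tags: string[] };
const db = new Map<string, Entry>([["m1", { id: "m1", tags: [] }]]);

async function update(id: string, tag: string): Promise<void> {
  const entry = db.get(id)!;                            // read
  await new Promise(r => setTimeout(r, 10));            // simulated async I/O
  db.set(id, { ...entry, tags: [...entry.tags, tag] }); // write back a stale snapshot
}

// Both callers read the empty tag list before either writes, so one tag is lost.
await Promise.all([update("m1", "a"), update("m1", "b")]);
console.log(db.get("m1")!.tags); // only one tag survives — last write wins
```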

Solution

  • Add a per-ID mutex lock (withUpdateLock) using a Promise-chain pattern (zero external dependencies)
  • Add a global write lock (withBulkDeleteLock) that drains all in-flight updates before executing a bulk delete
  • Lock map entries are automatically cleaned up after each operation to prevent memory leaks

Implementation Details

  • updateLocks: Map<string, Promise<void>> — per-ID serialization via Promise chaining
  • bulkDeleteLock: Promise<void> — global barrier that bulkDelete() waits on before proceeding
  • No external mutex library needed — uses native Promise chains
  • Backward compatible — no API changes
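Using the names the PR describes, the two locks can be sketched roughly as follows (a sketch under stated assumptions, not the actual src/store.ts implementation, which may differ in detail). Note that the per-ID lock must be registered synchronously, before any await, so a concurrent bulk delete always sees it in its snapshot:

```typescript
// Sketch of the locking scheme: per-ID Promise chains plus a global barrier.
class MemoryStore {
  private updateLocks = new Map<string, Promise<void>>();
  private bulkDeleteLock: Promise<void> = Promise.resolve();

  async withUpdateLock<T>(id: string, fn: () => Promise<T>): Promise<T> {
    // Register synchronously — no await before set(), or a concurrent
    // bulkDelete could snapshot the map without this entry.
    const prev = this.updateLocks.get(id) ?? Promise.resolve();
    let release!: () => void;
    const next = new Promise<void>(r => { release = r; });
    this.updateLocks.set(id, next);
    try {
      await this.bulkDeleteLock; // wait out an in-flight bulk delete
      await prev;                // wait for the previous holder of this ID
      return await fn();
    } finally {
      release();                 // wake the next waiter
      // Only the last holder deletes the entry, so the Map never leaks.
      if (this.updateLocks.get(id) === next) this.updateLocks.delete(id);
    }
  }

  async withBulkDeleteLock<T>(fn: () => Promise<T>): Promise<T> {
    let release!: () => void;
    this.bulkDeleteLock = new Promise<void>(r => { release = r; });
    const inFlight = [...this.updateLocks.values()]; // snapshot synchronously
    try {
      await Promise.all(inFlight); // drain all in-flight updates first
      return await fn();
    } finally {
      release();
      this.bulkDeleteLock = Promise.resolve();
    }
  }
}

// Demo: two racing read-modify-write updates on one ID are serialized.
const store = new MemoryStore();
let tags: string[] = [];
const upd = (t: string) =>
  store.withUpdateLock("m1", async () => {
    const snap = [...tags];                   // read
    await new Promise(r => setTimeout(r, 5)); // simulated async I/O
    tags = [...snap, t];                      // write
  });
await Promise.all([upd("a"), upd("b")]);
console.log(tags); // both tags survive, in call order
```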

Testing

Verified in production environment with agents handling rapid consecutive messages. No more orphaned entries or lost updates after applying this fix.

…add data loss

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AliceLJY
Collaborator

Review: fix: add per-ID mutex lock for update() to prevent concurrent delete+add data loss

Verdict: Fix-then-merge (two blockers — see below)


✅ What's working

The core mutex design is correct for JavaScript's single-threaded event loop:

// withUpdateLock — per-ID Promise-chain mutex
const prev = this.updateLocks.get(id) ?? Promise.resolve();
let resolve!: () => void;
const next = new Promise<void>(r => { resolve = r; });
this.updateLocks.set(id, next);
try {
  await prev;       // wait for previous holder
  return await fn();
} finally {
  resolve();        // release next waiter
  if (this.updateLocks.get(id) === next) this.updateLocks.delete(id); // cleanup
}
  • The get → set race is impossible in JS because no preemption occurs between synchronous lines ✅
  • Multi-queue cleanup is correct: only the last holder deletes the Map entry ✅
  • The two-way blocking is correct: bulkDelete waits for all in-flight update() calls, and new update() calls wait for an in-flight bulkDelete ✅
  • No external library needed — standard JS Promise chain pattern ✅
  • Different IDs remain fully concurrent (per-ID granularity, not global) ✅

🔴 Blocker 1: Direct conflict with PR #144

The 102 lines added to src/tools.ts (the entire registerMemoryPurgeScopeTool function) are byte-for-byte identical to PR #144. Both PRs were opened 3 minutes apart from the same author — likely a branching artifact.

Whichever PR merges second will have a merge conflict on src/tools.ts. Please coordinate with the maintainer: remove registerMemoryPurgeScopeTool from either #143 or #144 before merging. The mutex fix in src/store.ts is the unique contribution of this PR and should be kept.

🔴 Blocker 2: Zero test coverage for the mutex logic

A concurrent data-corruption fix with no tests is not safe to ship. If someone later simplifies update() and inadvertently removes the lock, there's no regression guard. Minimum coverage needed in test/update-mutex.test.mjs:

// 1. Two concurrent updates on the same ID — only the last one should win
// 2. bulkDelete blocks during a concurrent update — no interleaving
// 3. Sequential updates still produce correct results (sanity check)

Wire it into package.json test script (same issue as #148, #141, #144, #145).

🟡 Known limitation: prefix vs full UUID lock key mismatch

withUpdateLock(id, fn) keys on the raw id argument. If two concurrent callers use different representations of the same memory — one uses a short prefix ("abc12345") and the other uses the full UUID ("abc12345-0000-...") — they get different lock keys and can still race through the original delete+add window.

This is low-probability in practice, but it's a gap in the stated guarantee. Suggested fix: resolve prefix → full UUID before acquiring the lock, then use the canonical full UUID as the key. Or at minimum, add a code comment documenting this limitation.
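The suggested canonicalization might look like this (findFullId and the ids set are illustrative stand-ins — the project's actual prefix-resolution helper is not shown in this thread):

```typescript
// Resolve a short prefix to the full UUID before taking the lock, so both
// representations of one memory share a single lock key.
const ids = new Set<string>(["abc12345-0000-4000-8000-000000000000"]);

function findFullId(idOrPrefix: string): string {
  for (const full of ids) {
    if (full === idOrPrefix || full.startsWith(idOrPrefix)) return full;
  }
  return idOrPrefix; // unknown ID: fall back to the raw key
}

// Canonical key to pass to withUpdateLock instead of the raw argument.
function lockKey(idOrPrefix: string): string {
  return findFullId(idOrPrefix);
}

console.log(lockKey("abc12345") === lockKey("abc12345-0000-4000-8000-000000000000")); // true
```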

⚪ Non-blocking

  • updateLocks Map has no upper bound on size, but the cleanup logic (delete only when map.get(id) === next) is correct and prevents permanent leaks under normal usage
  • bulkDeleteLock correctly drains pending per-ID locks (await Promise.all(pending)) before proceeding — the ordering proof holds

The mutex design itself is solid and addresses a real data-loss bug. The two blockers are the tools.ts duplication (coordination issue) and missing tests — neither is about the lock logic itself.

@win4r
Collaborator

win4r commented Mar 10, 2026

@claude
