Memory is AgentStudio's long-term recall system. It captures what the user and agents talked about, organizes it into a hierarchical "palace" structure, and surfaces relevant slices on every new turn so the assistant doesn't start each conversation from scratch.
Memory runs automatically. After each conversation reaches a stopping point, the system mines the exchange into structured memories. On the user's next message, it pulls the most relevant memories back in as context. Users can browse, search, and prune their memory palace from the /memory page.
The design is ported from MemPalace — see docs/memory/spec.md for the full data-model contract and docs/memory/plan.md for the build sequence.
Memories are stored in a four-level tree:
| Level | What it represents | Example |
|---|---|---|
| Wing | A subject — a person, project, topic, or agent | "Efoil rebuild", "Derek", "tax research" |
| Room | A time-slice within a wing (typically one conversation) | "2026-04-30 evening session" |
| Closet | A topic discussed during that time-slice | "Battery wiring options" |
| Drawer | One verbatim message or note within that topic | "I tried a 12-AWG silicone wire and it overheated…" |
The hierarchy is pre-built so the model isn't reasoning over a flat blob of past chats — it gets pre-grouped, time-stamped slices that are easy to filter.
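To make the tree concrete, here is a minimal sketch of the four levels as TypeScript types, plus a helper that filters drawers by room start time. The field names (`id`, `subject`, `label`, `startedAt`, `topic`, `content`) and the `drawersInWing` helper are illustrative assumptions; the authoritative schema is the one in docs/memory/spec.md.

```typescript
// Hypothetical shapes for the four palace levels; all field names are assumptions.
interface Drawer { id: string; content: string; }
interface Closet { id: string; topic: string; drawers: Drawer[]; }
interface Room { id: string; label: string; startedAt: Date; closets: Closet[]; }
interface Wing { id: string; subject: string; rooms: Room[]; }

// Collect every drawer in a wing whose room started on or after `since`.
function drawersInWing(wing: Wing, since: Date): Drawer[] {
  const out: Drawer[] = [];
  for (const room of wing.rooms) {
    if (room.startedAt.getTime() < since.getTime()) continue;
    for (const closet of room.closets) {
      for (const drawer of closet.drawers) out.push(drawer);
    }
  }
  return out;
}
```

Because each level is already grouped and time-stamped, a filter like this is a plain tree walk rather than a search over raw chat text.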
Each drawer carries an "AAAK pointer" — a compressed reference like § W-042/R-11/D-007 plus a few semantic tags (@p for people, @l for locations, @e for events, @i for items, @t for time). The pointer lets the model cite a memory by ID instead of regurgitating it, which keeps the assistant's responses tight and traceable.
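As an illustration of citing by ID, the sketch below parses and formats a pointer. The grammar is inferred only from the single example `§ W-042/R-11/D-007`, so the zero-padding widths and the `parsePointer` / `formatPointer` names are assumptions; the tag prefixes are the five listed above.

```typescript
// Assumed pointer grammar, inferred from the one example "§ W-042/R-11/D-007".
interface AaakPointer { wing: number; room: number; drawer: number; }

// Semantic tag prefixes as listed in the text.
const TAG_KINDS = {
  "@p": "person",
  "@l": "location",
  "@e": "event",
  "@i": "item",
  "@t": "time",
} as const;

const POINTER_RE = /^§\s*W-(\d+)\/R-(\d+)\/D-(\d+)$/;

// Returns null when the text does not match the assumed grammar.
function parsePointer(text: string): AaakPointer | null {
  const m = POINTER_RE.exec(text.trim());
  if (m === null) return null;
  return { wing: Number(m[1]), room: Number(m[2]), drawer: Number(m[3]) };
}

// Left-pad a number with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Padding widths (3/2/3) mirror the example and are an assumption.
function formatPointer(p: AaakPointer): string {
  return "§ W-" + pad(p.wing, 3) + "/R-" + pad(p.room, 2) + "/D-" + pad(p.drawer, 3);
}
```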
Alongside the palace, memory tracks entities (people, projects, items) and relations between them (Derek owns efoil, efoil has_battery 12V20Ah). Relations carry a validity window — validFrom and an optional validTo — so swap-outs and changes are recorded as new relations rather than overwrites. Querying the timeline of an entity returns the full history.
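The validity-window behavior can be sketched as an append-only history. This is a minimal in-memory illustration, not the stored schema: the `Relation` shape beyond `validFrom` / `validTo`, and the `supersede` and `timeline` helpers, are hypothetical names.

```typescript
interface Relation {
  subject: string;
  predicate: string;
  object: string;
  validFrom: Date;
  validTo?: Date; // undefined means the relation is still current
}

// Record a change: close the open relation with the same subject and predicate,
// then append the new one. Earlier rows are kept, never replaced.
function supersede(history: Relation[], next: Relation): Relation[] {
  const closed = history.map((r) =>
    r.subject === next.subject && r.predicate === next.predicate && r.validTo === undefined
      ? { ...r, validTo: next.validFrom }
      : r
  );
  return [...closed, next];
}

// Full history for one entity, oldest first.
function timeline(history: Relation[], subject: string): Relation[] {
  return history
    .filter((r) => r.subject === subject)
    .sort((a, b) => a.validFrom.getTime() - b.validFrom.getTime());
}
```

Swapping the efoil's battery, for example, would leave the original `has_battery` row in place with `validTo` set and add a second row with an open window.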
Drawers carry a 1536-dimension embedding (OpenAI text-embedding-3-small). When the user asks something new, recall combines four signals:
- Semantic similarity — cosine match on embeddings via pgvector HNSW.
- Keyword boost — Postgres full-text search over drawer content + AAAK tags.
- Temporal proximity — recent drawers ranked higher than old ones for time-sensitive queries.
- Preference patterns — recurring user choices boost relevance for matching topics.
The top results get formatted into a <memory_context>…</memory_context> block prepended to the system prompt.
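The source does not say how the four signals are combined, so the sketch below assumes a simple weighted sum with made-up weights. `cosine` is the standard cosine similarity; `WEIGHTS`, `hybridScore`, and `topK` are hypothetical names, and each signal is assumed to be normalized to the range 0 to 1 before scoring.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface Signals { semantic: number; keyword: number; temporal: number; preference: number; }

// Hypothetical weights; the real combination rule is not documented here.
const WEIGHTS: Signals = { semantic: 0.6, keyword: 0.2, temporal: 0.1, preference: 0.1 };

function hybridScore(s: Signals): number {
  return (
    WEIGHTS.semantic * s.semantic +
    WEIGHTS.keyword * s.keyword +
    WEIGHTS.temporal * s.temporal +
    WEIGHTS.preference * s.preference
  );
}

// Return the k highest-scoring candidates without mutating the input.
function topK<T extends { signals: Signals }>(candidates: T[], k: number): T[] {
  return candidates
    .slice()
    .sort((x, y) => hybridScore(y.signals) - hybridScore(x.signals))
    .slice(0, k);
}
```

In the real pipeline the semantic score would come from the pgvector index rather than an in-process loop; the loop is only there to show what is being measured.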
When useRerank is enabled in settings, the top 20 candidates are sent to a cheap reader model (default anthropic/claude-haiku-4.5), which promotes the best 5. This trades a small amount of latency for higher precision on ambiguous queries.
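The selection step around the reader model can be sketched as two pure functions. The reader call itself is left out; `readerScores` stands in for whatever per-candidate scores it returns, and that shape, along with the `rerankPool` and `promote` names, is an assumption. Falling back to the original order on a malformed response is a design choice of this sketch, not documented behavior.

```typescript
interface Candidate { id: string; content: string; }

// Take the first `poolSize` candidates (already ordered by hybrid score).
function rerankPool(ranked: Candidate[], poolSize = 20): Candidate[] {
  return ranked.slice(0, poolSize);
}

// Reorder the pool by reader score (higher is better) and keep the best `keep`.
// If the scores do not line up with the pool, fall back to the existing order.
function promote(pool: Candidate[], readerScores: number[], keep = 5): Candidate[] {
  if (readerScores.length !== pool.length) return pool.slice(0, keep);
  return pool
    .map((candidate, i) => ({ candidate, score: readerScores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map((entry) => entry.candidate);
}
```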
- The user finishes a chat exchange (or the run completes naturally).
- The system kicks off `mineConversation(conversationId)` in the background; the user sees their assistant reply immediately and never waits.
- Mining extracts entities and topics via a small LLM call, then writes one drawer per turn into the palace, computing AAAK indexes and embeddings inline.
- An `agent_action` activity event of type `memory_mined` records what landed.
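The "never waits" property in the mining steps above comes down to not awaiting the mining promise on the reply path. A minimal sketch, assuming the miner and the activity sink are injected as functions: `scheduleMining`, the `Miner` and `ActivitySink` types, and the event's field names are all hypothetical.

```typescript
// Stand-in for mineConversation; the result shape is an assumption.
type Miner = (conversationId: string) => Promise<{ drawersWritten: number }>;

// Assumed shape of the activity event described in the steps above.
interface MemoryMinedEvent {
  kind: "agent_action";
  type: "memory_mined";
  conversationId: string;
  drawersWritten: number;
}
type ActivitySink = (event: MemoryMinedEvent) => void;

// Start mining and return immediately. The promise is deliberately not awaited,
// so the caller can finish sending the assistant reply first.
function scheduleMining(
  conversationId: string,
  mine: Miner,
  emit: ActivitySink,
  onError: (error: unknown) => void
): void {
  mine(conversationId)
    .then((result) =>
      emit({
        kind: "agent_action",
        type: "memory_mined",
        conversationId,
        drawersWritten: result.drawersWritten,
      })
    )
    .catch(onError); // mining failures are reported, never thrown into the reply path
}
```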
- The user types a message and submits.
- Before the model is called, `recallForUser(userId, message, { topK })` runs.
- The retrieval pipeline returns the top-K drawers ranked by hybrid score.
- The drawers are rendered into a compact memory context block (`<memory_context>…</memory_context>`) and prepended to the system prompt.
- The model now has the relevant past context and can answer with continuity.
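The rendering step in the recall flow above can be sketched as follows. Only the `<memory_context>` wrapper comes from the text; the one-line-per-drawer layout inside it, and the `renderMemoryContext` / `buildSystemPrompt` names, are assumptions.

```typescript
interface RecalledDrawer { pointer: string; content: string; }

// Render recalled drawers into a context block. Returns an empty string when
// nothing was recalled, so no empty element reaches the model.
function renderMemoryContext(drawers: RecalledDrawer[]): string {
  if (drawers.length === 0) return "";
  const lines = drawers.map((d) => "[" + d.pointer + "] " + d.content);
  return "<memory_context>\n" + lines.join("\n") + "\n</memory_context>";
}

// Prepend the block to the base system prompt, or leave the prompt untouched.
function buildSystemPrompt(basePrompt: string, drawers: RecalledDrawer[]): string {
  const block = renderMemoryContext(drawers);
  return block === "" ? basePrompt : block + "\n\n" + basePrompt;
}
```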
The Memory page shows the palace tree (wings → rooms → closets → drawers), a search box that runs the same retrieval pipeline against arbitrary queries, an AAAK preview for each drawer, and a delete control for surgical pruning.
Users can configure memory behavior under Settings → Memory:
- Enabled: turn auto-mining + recall on or off entirely.
- Top-K: how many drawers to inject per turn (default 5; higher = more context, more tokens).
- Use rerank: pass top-20 through a reader model for higher precision (small latency cost).
- Rerank model: defaults to `anthropic/claude-haiku-4.5`.
- Embedding model: defaults to `openai/text-embedding-3-small` (1536-dim, must match the pgvector column).
- Auto-mine: disable to make mining manual-only.
Per-agent override: agents.config.memory lets you disable recall for specific agents (e.g. the orchestrator) without affecting others.
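A sketch of how these settings and the per-agent override might resolve at runtime. The six keys match the documented toggles and the `topK`, rerank-model, and embedding-model defaults come from the text; the defaults for `enabled`, `useRerank`, and `autoMine`, the `disabled` override field, and all function names are assumptions.

```typescript
interface MemoryConfig {
  enabled: boolean;
  topK: number;
  useRerank: boolean;
  rerankModel: string;
  embeddingModel: string;
  autoMine: boolean;
}

const DEFAULT_MEMORY_CONFIG: MemoryConfig = {
  enabled: true, // assumption: default not stated in the text
  topK: 5,
  useRerank: false, // assumption: default not stated in the text
  rerankModel: "anthropic/claude-haiku-4.5",
  embeddingModel: "openai/text-embedding-3-small",
  autoMine: true, // assumption: default not stated in the text
};

// Hypothetical shape of the per-agent override.
interface AgentMemoryOverride { disabled?: boolean; }

// Layer the user's saved settings over the defaults.
function resolveConfig(userSettings: Partial<MemoryConfig>): MemoryConfig {
  return { ...DEFAULT_MEMORY_CONFIG, ...userSettings };
}

// Recall is skipped when memory is off globally or disabled for this agent.
function shouldRecall(config: MemoryConfig, agent?: AgentMemoryOverride): boolean {
  return config.enabled && agent?.disabled !== true;
}

// Mining ignores the per-agent override; it depends only on user settings.
function shouldMine(config: MemoryConfig): boolean {
  return config.enabled && config.autoMine;
}
```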
- All authenticated users: see + manage their own palace; settings are per-user.
- Agents: read recalled memories for the conversation's owning user; never write across user boundaries.
- Admins: same as users for their own palace; no special cross-user access (memory is private by design).
- Chat domain: the automatic mining hook fires when a chat run reaches `completed`; recall runs in the chat stream entry point before the LLM call.
- Settings domain: memory behavior toggles live in `appSettings.memoryConfig` (enabled / topK / useRerank / rerankModel / embeddingModel / autoMine).
- Activity domain: every mining run emits an `agent_action` event so users can see what got remembered.
- OpenRouter: embeddings + entity-extraction LLM calls + optional rerank model all route through the existing OpenRouter client; cost rolls into the existing per-source breakdown (`memory_embed`, `memory_extract`, `memory_rerank`, `memory_qa`).
- Verbatim-only drawers: drawer content is never paraphrased. AAAK + embeddings are the index; the source text stays exact for auditability.
- Per-user isolation: every drawer/wing/entity is FK'd to a `userId` with cascade-on-delete. There's no shared memory pool.
- Soft staleness on relations: overwriting a relation creates a new row and bumps `validTo` on the old one rather than mutating it; the timeline is preserved.
- Embedding-dimension lock: the pgvector column is `vector(1536)`. Switching embedding models that change dimension requires a migration + reindex; the settings UI restricts choices to compatible models.
- Mining cost cap: each conversation incurs one small LLM call for entity extraction (default `openai/gpt-4o-mini` via OpenRouter) plus one embedding call per turn. These show up in the cost dashboard tagged `source='memory_extract'` and `source='memory_embed'`.
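The embedding-dimension lock can be enforced in application code before a write ever reaches the database. A minimal sketch: `checkEmbedding` and `toVectorLiteral` are hypothetical helpers, and accepting `null` reflects the assumption that a failed embedding call is stored as a null value for later backfill.

```typescript
// Must match the pgvector column, vector(1536).
const EMBEDDING_DIM = 1536;

// Accept a null embedding (embed call failed) or a vector of exactly the
// locked dimension; reject anything else before it is written.
function checkEmbedding(vector: number[] | null): number[] | null {
  if (vector === null) return null;
  if (vector.length !== EMBEDDING_DIM) {
    throw new Error("expected " + EMBEDDING_DIM + " dimensions, got " + vector.length);
  }
  return vector;
}

// pgvector accepts vectors as a bracketed, comma-separated text literal.
function toVectorLiteral(vector: number[]): string {
  return "[" + vector.join(",") + "]";
}
```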
The /scripts/bench/longmemeval/ directory is a complete LongMemEval pipeline that ingests 500 long-context conversation instances from HuggingFace, runs the retrieval and QA stages through this implementation, and scores against the upstream baseline. The npm scripts:
- `bun run bench:longmemeval:download`: fetches `longmemeval_s_cleaned.json`, `longmemeval_oracle.json`, and `longmemeval_m_cleaned.json` into `data/longmemeval/`.
- `bun run bench:longmemeval:ingest`: replays haystack sessions through `mineConversation` per instance.
- `bun run bench:longmemeval:retrieve`: runs `recall` for each question; emits `retrieval_logs/{run_id}.jsonl`.
- `bun run bench:longmemeval:score-retrieval`: computes session-level R@5, R@10, turn-level recall (target ≥96% R@5).
- `bun run bench:longmemeval:qa`: RAG prompt + reader model; emits `generation_logs/{run_id}.jsonl`.
- `bun run bench:longmemeval:score-qa`: GPT-4o judge scoring per upstream `evaluate_qa.py`.
- `bun run bench:longmemeval:full`: end-to-end pipeline on `longmemeval_s`.
- `bun run bench:longmemeval:smoke`: 10-instance subset for fast smoke checks (gated by `RUN_LONGMEMEVAL=1`).
The benchmark uses an isolated test schema scoped per-run so it never pollutes the dev/prod database.
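For readers unfamiliar with the R@5 metric, here is one common way to compute session-level recall@k: a question counts as a hit if any gold session appears among the top k retrieved sessions. Whether the scoring script uses this "any hit" definition is an assumption, and the row shape and function names are hypothetical rather than the actual log format.

```typescript
interface RetrievalRow {
  retrieved: string[]; // session IDs in ranked order
  gold: string[]; // session IDs that contain the answer
}

// 1 if any gold session is in the top k, else 0 ("any hit" definition, assumed).
function recallAtK(retrieved: string[], gold: string[], k: number): number {
  const top = retrieved.slice(0, k);
  return gold.some((g) => top.indexOf(g) !== -1) ? 1 : 0;
}

// Mean recall@k over all questions; 0 for an empty run.
function meanRecallAtK(rows: RetrievalRow[], k: number): number {
  if (rows.length === 0) return 0;
  let hits = 0;
  for (const row of rows) hits += recallAtK(row.retrieved, row.gold, k);
  return hits / rows.length;
}
```

Under this definition, the 96% R@5 target means at least 96 of every 100 questions have a gold session in their top five results.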
- Empty conversation: mining no-ops; no drawers created.
- No matching memories: recall returns nothing and the `<memory_context>` element is omitted entirely, so the model doesn't see "no memories found" filler.
- Agent with `memory.disabled = true`: recall is skipped for that agent's chats but mining still runs (so other agents in the same user's palace benefit).
- Embedding API failure: the drawer is still written but with a null embedding; a backfill job (future work, queued onto the `#17` jobs system) re-embeds rows whose embedding is null.
- Massive conversations (>50 turns): mining batches the entity-extraction call across windows of 8-10 turns to keep the LLM input bounded.
- Duplicate detection: wings/rooms/closets dedupe by slug + alias matching; mining the same conversation twice is idempotent.
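Two of the edge cases above lend themselves to a short sketch: windowing long conversations and slug-based deduplication. The window size of 10 is one value from the stated 8-10 range; the slug rule (lowercase, non-alphanumerics collapsed to hyphens) and both function names are assumptions, and alias matching is not shown.

```typescript
// Split a long conversation into fixed-size windows for entity extraction.
// Conversations at or under the threshold are processed as a single window;
// an empty conversation yields no windows at all.
function windowTurns<T>(turns: T[], threshold = 50, windowSize = 10): T[][] {
  if (turns.length === 0) return [];
  if (turns.length <= threshold) return [turns];
  const windows: T[][] = [];
  for (let i = 0; i < turns.length; i += windowSize) {
    windows.push(turns.slice(i, i + windowSize));
  }
  return windows;
}

// Normalize a wing/room/closet name to a slug so repeated mining maps to the
// same node. The exact normalization rule is an assumption.
function slugify(name: string): string {
  return name
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}
```

With this simple chunking the final window can be shorter than 8 turns; a production version might merge a short tail into the previous window.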