69 changes: 69 additions & 0 deletions demo/gist-memory/README.md
@@ -0,0 +1,69 @@
# Gist Memory Demo — Buried Preference Recall

This demo shows how purpose-directed gisting enables an AI agent to recall user preferences that were briefly mentioned during an unrelated conversation — a task that generic summarization and standard RAG both fail at.

## The Problem

When a user buries a preference in a long, topically unrelated conversation, generic approaches fail:
- **Topic-preserving summarization** discards the preference as low-salience noise
- **Standard RAG** dilutes the preference signal in full-passage embeddings dominated by the conversation's main topic

Purpose-directed gisting solves this by compressing conversations specifically to foreground user attributes.
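The contrast can be sketched as two compression objectives. This is illustrative only; the actual Kaizen gist prompt lives in `kaizen/llm/gist/gist.py` and may differ:

```python
# Illustrative contrast of compression objectives (not the real Kaizen prompt).
GENERIC_SUMMARY_PROMPT = "Summarize the following conversation:\n\n{transcript}"

# A purpose-directed prompt names the target attributes up front, so the
# compression keeps them regardless of how little of the transcript they occupy.
GIST_PROMPT = (
    "Compress the conversation below into a short gist. Always preserve:\n"
    "- stated user preferences (tools, languages, workflows)\n"
    "- the user's domain and working context\n"
    "Drop topical detail that does not describe the user.\n\n"
    "{transcript}"
)

def build_gist_prompt(transcript: str) -> str:
    return GIST_PROMPT.format(transcript=transcript)
```

A generic summarizer optimizes for topical coverage; the gist prompt optimizes for the attributes the memory system exists to recall.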

## Setup

### Option A: Kaizen Lite (Claude Code Plugin)

```bash
# Install the plugin
claude --plugin-dir /path/to/kaizen/platform-integrations/claude/plugins/kaizen-lite
```

### Option B: Full Kaizen (MCP Server)

```bash
# Start the MCP server
uv run fastmcp run kaizen/frontend/mcp/mcp_server.py --transport sse --port 8201
```

## Demo Script

### Session 1: Preference Embedding

Have a multi-turn conversation about an unrelated technical topic. Bury a preference in one of the messages.

See [session1_script.md](session1_script.md) for the full conversation script.

**Key message (message 5 of 12):**
> "That makes sense about the CNI plugin architecture. By the way, I strongly prefer Python over R for all my data analysis work — I find pandas much more intuitive than tidyverse. Anyway, back to the networking question — how does Cilium handle network policy enforcement?"

The preference ("Python over R", "pandas over tidyverse") is <5% of the total conversation content.

**At end of session:**
- **Lite path:** Run `/kaizen:gist`
- **MCP path:** Call `store_gist` with the conversation JSON
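The conversation JSON for the MCP path can be built like this. The `role`/`content` message shape matches what `store_gists` reads in `kaizen_client.py`; the exact wire format expected by your MCP client may differ:

```python
import json

# Messages in the familiar role/content shape.
messages = [
    {"role": "user", "content": "How does the CNI plugin architecture work?"},
    {"role": "assistant", "content": "[CNI explanation...]"},
    {"role": "user", "content": "By the way, I strongly prefer Python over R..."},
]

# Payload fields follow the curl example in session1_script.md.
payload = {
    "conversation_data": json.dumps(messages),
    "conversation_id": "demo-session-1",
}
```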

**Expected gist output:**
```text
user prefers Python over R for data analysis; finds pandas more intuitive than tidyverse; works with Kubernetes networking (Cilium, CNI plugins)
```
Comment on lines +47 to +49
⚠️ Potential issue | 🟡 Minor

Add a language specifier to the fenced code block.

The code block is missing a language identifier, which triggers a markdownlint warning (MD040). Since this shows plain-text output, use `text` or `plaintext`.

📝 Suggested fix
-```
+```text
 user prefers Python over R for data analysis; finds pandas more intuitive than tidyverse; works with Kubernetes networking (Cilium, CNI plugins)


Note how the gist foregrounds the Python/pandas preference despite it being a tiny fraction of the conversation.

### Session 2: Preference Recall

Start a new session and ask:

> "I need to start a new data analysis project working with network telemetry data. What language and tools would you recommend I use?"

**With gist memory:** Claude recommends Python and pandas, citing your stated preference.

**Without gist memory:** Claude gives a generic recommendation (likely mentioning both Python and R, or asking about your preference).

See [session2_script.md](session2_script.md) for the verification prompts.

## What to Look For

1. **Gist content:** Does the gist capture the Python/pandas preference despite it being buried?
2. **Recall accuracy:** In Session 2, does the agent correctly apply the preference?
3. **A/B contrast:** Run Session 2 without gist memory to see the failure mode.
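The checklist above can be turned into a quick sanity check on the gist text. This helper and its phrase lists are hypothetical, not part of Kaizen:

```python
# Hypothetical scoring helper for the demo's checklist; the phrase lists
# below are assumptions, not anything Kaizen ships.
def check_gist(gist: str) -> dict[str, bool]:
    g = gist.lower()
    return {
        "preference_captured": "python" in g and "pandas" in g,
        "context_captured": "kubernetes" in g or "cilium" in g,
    }

example = (
    "user prefers Python over R for data analysis; finds pandas more "
    "intuitive than tidyverse; works with Kubernetes networking"
)
print(check_gist(example))
```

Running both A/B conditions through a check like this makes the contrast concrete instead of impressionistic.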
62 changes: 62 additions & 0 deletions demo/gist-memory/session1_script.md
@@ -0,0 +1,62 @@
# Session 1: Preference Embedding

Use these messages in order. The buried preference is in **Message 5**.

---
Comment on lines +1 to +5
⚠️ Potential issue | 🟡 Minor

Fix inconsistent message number reference.

Line 3 states the buried preference is in "Message 4", but the actual buried preference appears in "Message 5 (User)" at line 19.

📝 Proposed fix
 # Session 1: Preference Embedding
 
-Use these messages in order. The buried preference is in **Message 4**.
+Use these messages in order. The buried preference is in **Message 5**.
 
 ---


**Message 1 (User):**
> I'm trying to understand how Kubernetes handles pod-to-pod networking across nodes. Can you explain how the CNI plugin architecture works?

**Message 2 (Assistant):**
> [Detailed explanation of CNI plugin architecture, pod networking, veth pairs, bridge interfaces, etc.]

**Message 3 (User):**
> That's helpful. How does this differ between Calico and Cilium? I've heard Cilium uses eBPF instead of iptables.

**Message 4 (Assistant):**
> [Explanation comparing Calico's iptables-based approach vs Cilium's eBPF dataplane, performance characteristics, etc.]

**Message 5 (User) — THE BURIED PREFERENCE:**
> That makes sense about the CNI plugin architecture. By the way, I strongly prefer Python over R for all my data analysis work — I find pandas much more intuitive than tidyverse. Anyway, back to the networking question — how does Cilium handle network policy enforcement at the kernel level?

**Message 6 (Assistant):**
> [Explanation of Cilium's eBPF-based network policy enforcement, kernel-level packet filtering, etc.]

**Message 7 (User):**
> What about service mesh integration? Does Cilium replace the need for something like Istio?

**Message 8 (Assistant):**
> [Discussion of Cilium service mesh capabilities vs Istio, sidecar-free model, etc.]

**Message 9 (User):**
> I'm also curious about network observability. What tools do you recommend for monitoring pod-to-pod traffic patterns in a large cluster?

**Message 10 (Assistant):**
> [Recommendations for Hubble, Pixie, Grafana with Cilium metrics, etc.]

**Message 11 (User):**
> Great, this has been really helpful. One last question — how do I troubleshoot DNS resolution failures in pods? I've been seeing intermittent CoreDNS timeouts.

**Message 12 (Assistant):**
> [DNS troubleshooting guidance for CoreDNS, ndots settings, etc.]

---

## After the conversation

**Kaizen Lite:** Run `/kaizen:gist`

**Full Kaizen (MCP):**
```bash
# Store the conversation as a gist
curl -X POST http://localhost:8201/tools/store_gist \
-H "Content-Type: application/json" \
-d '{"conversation_data": "<JSON of messages above>", "conversation_id": "demo-session-1"}'
```

## Expected Gist Output

The gist should surface the buried preference:
```text
user prefers Python over R for data analysis; finds pandas more intuitive than tidyverse; works with Kubernetes networking; troubleshooting CoreDNS; large cluster environment
```
Comment on lines +57 to +62
⚠️ Potential issue | 🟡 Minor

Add language specifier to the expected output code block.

The fenced code block is missing a language specifier, which triggers a Markdown lint warning.

📝 Proposed fix
 ## Expected Gist Output
 
 The gist should surface the buried preference:
-```
+```text
 user prefers Python over R for data analysis; finds pandas more intuitive than tidyverse; works with Kubernetes networking; troubleshooting CoreDNS; large cluster environment


<!-- This is an auto-generated comment by CodeRabbit -->

45 changes: 45 additions & 0 deletions demo/gist-memory/session2_script.md
@@ -0,0 +1,45 @@
# Session 2: Preference Recall Verification

Start a **new session** (no conversation history from Session 1). The gist from Session 1 should be automatically injected via the recall hook.

---

## Primary Verification Prompt

> I need to start a new data analysis project working with network telemetry data. What language and tools would you recommend I use?

### Expected Response WITH Gist Memory

The agent should recommend **Python and pandas**, referencing your known preference. Example:

> "Based on your preference for Python and pandas, I'd recommend using Python with pandas for the data analysis..."

### Expected Response WITHOUT Gist Memory

The agent gives a **generic recommendation** — likely mentioning both Python and R as options, or asking about your preference:

> "For network telemetry data analysis, popular options include Python (with pandas/numpy) or R (with tidyverse). Which do you prefer?"

---

## Additional Verification Prompts

These test whether the gist captured other signals:

**Prompt 2:**
> What's my background — do you know what kind of infrastructure I work with?

Expected (with gist): Mentions Kubernetes, container networking, cluster operations.

**Prompt 3:**
> If I need to do some quick data wrangling, which library should I reach for?

Expected (with gist): Recommends pandas specifically (not tidyverse or dplyr).

---

## Running the A/B Comparison

1. **With gist memory:** Ensure the gist entity from Session 1 exists in `.kaizen/entities/gist/` (Lite) or in the MCP backend
2. **Without gist memory:** Temporarily rename/remove the gist entity, or use a clean project directory
3. Run each verification prompt in both conditions and compare responses
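The toggle in step 2 can be scripted. This sketch assumes the `.kaizen/entities/gist/` layout from step 1 and simply parks the directory under a `.disabled` name; it demonstrates against a scratch directory rather than a real project:

```python
from pathlib import Path
import tempfile

# Sketch of the A/B toggle for Kaizen Lite (assumed layout: .kaizen/entities/gist/).
def set_gist_memory(root: Path, enabled: bool) -> None:
    live = root / ".kaizen" / "entities" / "gist"
    parked = live.with_name("gist.disabled")
    if enabled and parked.exists():
        parked.rename(live)
    elif not enabled and live.exists():
        live.rename(parked)

# Demonstrate in a throwaway directory, not a real project:
root = Path(tempfile.mkdtemp())
(root / ".kaizen" / "entities" / "gist").mkdir(parents=True)
set_gist_memory(root, enabled=False)  # "without gist memory" condition
set_gist_memory(root, enabled=True)   # "with gist memory" condition
```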
2 changes: 2 additions & 0 deletions kaizen/config/kaizen.py
@@ -8,6 +8,8 @@ class KaizenConfig(BaseSettings):
namespace_id: str = "kaizen"
settings: BaseSettings | None = None
clustering_threshold: float = 0.80
gist_context_budget: int = 64000
gist_trigger_interval: int = 5


# to reload settings call kaizen_config.__init__()
1 change: 1 addition & 0 deletions kaizen/config/llm.py
@@ -25,6 +25,7 @@ class LLMSettings(BaseSettings):
tips_model: str = Field(default_factory=_default_model_name)
conflict_resolution_model: str = Field(default_factory=_default_model_name)
fact_extraction_model: str = Field(default_factory=_default_model_name)
gist_model: str = Field(default_factory=_default_model_name)
categorization_mode: Literal["predefined", "dynamic", "hybrid"] = "predefined"
allow_dynamic_categories: bool = False
confirm_new_categories: bool = False
134 changes: 134 additions & 0 deletions kaizen/frontend/client/kaizen_client.py
@@ -1,9 +1,11 @@
import logging
import uuid
from typing import Any

from kaizen.backend.base import BaseEntityBackend
from kaizen.config.kaizen import KaizenConfig
from kaizen.llm.fact_extraction.fact_extraction import ExtractedFact, extract_facts_from_messages
from kaizen.llm.gist.gist import generate_gist
from kaizen.schema.conflict_resolution import EntityUpdate
from kaizen.schema.core import Entity, Namespace, RecordedEntity
from kaizen.schema.exceptions import NamespaceAlreadyExistsException, NamespaceNotFoundException
@@ -295,3 +297,135 @@ def retrieve_user_facts(
)

return categorized_preferences

# ── Gist memory ──────────────────────────────────────────────────

def store_gists(
self,
namespace_id: str,
messages: list[dict],
conversation_id: str | None = None,
metadata: dict[str, Any] | None = None,
) -> list[EntityUpdate]:
"""Generate purpose-directed gists from conversation messages and store them.

Implements rolling consolidation: deletes any existing gists for the same
conversation_id before storing new ones, so the latest gist always reflects
the full session.

Note: the search-delete-insert sequence is not atomic. Concurrent calls with
the same conversation_id can interleave and leave duplicate gists; this is
accepted for the current single-user MCP context.
"""
if not messages:
return []

conversation_id = conversation_id or str(uuid.uuid4())
self.ensure_namespace(namespace_id)

# Delete existing gists for this conversation (rolling replacement)
existing = self.search_entities(
namespace_id=namespace_id,
query=None,
filters={"type": "gist", "metadata.conversation_id": conversation_id},
limit=100,
)
for entity in existing:
try:
self.delete_entity_by_id(namespace_id, entity.id)
except Exception:
logger.warning("Failed to delete old gist %s during rolling replacement", entity.id, exc_info=True)

# Generate gists
result = generate_gist(messages, conversation_id=conversation_id)

if not result.gists:
return []

# Store gist entities
base_metadata: dict[str, Any] = dict(metadata or {})
base_metadata["conversation_id"] = conversation_id
base_metadata["message_count"] = result.message_count

gist_entities = []
for i, gist_text in enumerate(result.gists):
gist_metadata = dict(base_metadata)
gist_metadata["chunk_index"] = i
gist_metadata["chunk_count"] = result.chunk_count
gist_entities.append(Entity(type="gist", content=gist_text, metadata=gist_metadata))

updates = self.update_entities(namespace_id, gist_entities, enable_conflict_resolution=False)

# Store original messages as gist_source for durable retrieval
source_entities = []
for i, msg in enumerate(messages):
content = msg.get("content", "")
if isinstance(content, list):
content = str(content)
source_entities.append(
Entity(
type="gist_source",
content=content,
metadata={
"conversation_id": conversation_id,
"message_index": i,
"role": msg.get("role", "unknown"),
},
)
)

if source_entities:
# Delete existing sources for this conversation first
existing_sources = self.search_entities(
namespace_id=namespace_id,
query=None,
filters={"type": "gist_source", "metadata.conversation_id": conversation_id},
limit=1000,
)
for entity in existing_sources:
try:
self.delete_entity_by_id(namespace_id, entity.id)
except Exception:
logger.warning("Failed to delete old gist_source %s", entity.id, exc_info=True)

self.update_entities(namespace_id, source_entities, enable_conflict_resolution=False)

return updates
Comment on lines +303 to +389
⚠️ Potential issue | 🟡 Minor


Document the non-atomic nature of the rolling replacement logic in store_gists.

The search-delete-insert sequence is not atomic. While the backend uses per-operation locking, concurrent calls with the same conversation_id can interleave, potentially resulting in duplicate gists. For example:

1. Thread A searches and finds gist X
2. Thread B searches and finds gist X
3. Thread A deletes X, inserts Y
4. Thread B deletes nothing (X already gone), inserts Z

Result: both Y and Z coexist for the same conversation.

This is likely acceptable for the current single-user MCP tool context, but should be documented in the docstring or with a code comment explaining the assumption and acceptable degradation mode.



def retrieve_gists(
self,
namespace_id: str,
query: str,
limit: int = 10,
) -> list[RecordedEntity]:
"""Retrieve gists relevant to a query via semantic search."""
if not self.namespace_exists(namespace_id):
return []
return self.search_entities(
namespace_id=namespace_id,
query=query,
filters={"type": "gist"},
limit=limit,
)

def retrieve_gist_with_source(
self,
namespace_id: str,
query: str,
limit: int = 3,
) -> list[dict[str, Any]]:
"""Retrieve gists with their original source messages.

Returns a list of dicts, each with 'gist' (RecordedEntity) and
'source_messages' (list[RecordedEntity]) keys.
"""
gists = self.retrieve_gists(namespace_id, query=query, limit=limit)
results = []
for gist in gists:
conversation_id = (gist.metadata or {}).get("conversation_id")
source_messages: list[RecordedEntity] = []
if conversation_id:
source_messages = self.search_entities(
namespace_id=namespace_id,
query=None,
filters={"type": "gist_source", "metadata.conversation_id": conversation_id},
limit=100,
)
results.append({"gist": gist, "source_messages": source_messages})
return results
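A consumer of `retrieve_gist_with_source` might flatten the results into recall context for injection. The entity stand-in and formatting below are illustrative; only the `gist`/`source_messages` dict shape comes from the method above:

```python
from dataclasses import dataclass

# Minimal stand-in for RecordedEntity; fields beyond `content` are omitted
# because this sketch only shows the formatting step.
@dataclass
class FakeEntity:
    content: str

def format_recall_context(results: list[dict]) -> str:
    # Each result dict carries a "gist" entity and its "source_messages".
    lines = []
    for r in results:
        lines.append(f"- {r['gist'].content}")
        for src in r["source_messages"][:2]:  # cap quoted sources per gist
            lines.append(f"    source: {src.content}")
    return "\n".join(lines)

results = [{
    "gist": FakeEntity("user prefers Python over R for data analysis"),
    "source_messages": [FakeEntity("...I strongly prefer Python over R...")],
}]
print(format_recall_context(results))
```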