2 changes: 2 additions & 0 deletions .gitignore
@@ -13,6 +13,8 @@ dist/
plugin/scripts/*.map
plugin/scripts/*.d.mts
data/
!benchmark/longmemeval/data/
!benchmark/longmemeval/data/smoke-ids.txt
!eval/data/
!eval/data/**
data-*/
2 changes: 2 additions & 0 deletions README.md
@@ -286,6 +286,8 @@ Latest release notes: [CHANGELOG.md](CHANGELOG.md).

**Reproduce locally:** [`eval/README.md`](eval/README.md) — adapter-pluggable harness for LongMemEval `_s` (public 500-Q) + `coding-agent-life-v1` (in-house 15-session corpus). Grep / vector / agentmemory adapters score side-by-side, NDJSON output, published scorecards land in [`docs/benchmarks/`](docs/benchmarks/).

**QA harness foundation:** [`benchmark/longmemeval/`](benchmark/longmemeval/) contains the hermetic issue #313 harness scaffold for future end-to-end LongMemEval-S reader/judge runs, statistical tests, manifests, and reproducibility checks. The published numbers above remain retrieval-only until an approved provider-backed run publishes QA results.

**Pairs with [codegraph](https://github.com/colbymchenry/codegraph), [Understand Anything](https://github.com/Lum1104/Understand-Anything), and [Graphify](https://github.com/safishamsi/graphify).** Code-graph indexing, multi-agent build pipelines, and broader knowledge graphs across docs / PDFs / images / videos. agentmemory remembers the work; those three projects light up the rest of the context layer. Recipes + question-routing table: [`docs/recipes/pairings.md`](docs/recipes/pairings.md).

---
5 changes: 5 additions & 0 deletions benchmark/README.md
@@ -11,6 +11,11 @@ Two kinds of numbers live in this directory:
throughput against a running daemon. This is the file you want when
somebody asks "what's p99 at 100k memories under concurrency 100?".

`longmemeval/` is the hermetic foundation for the issue #313
LongMemEval-S QA/statistics harness. It does not run provider-backed
reader or judge models yet; use `corepack pnpm run bench:longmemeval:check`
to validate the local harness structure.

## load-100k.ts

Hand-rolled, dependency-free load harness. Issues real HTTP against a
21 changes: 21 additions & 0 deletions benchmark/longmemeval/Makefile
@@ -0,0 +1,21 @@
ROOT := ../..

.PHONY: check smoke-ids data reproduce reproduce-full

check:
	cd $(ROOT) && corepack pnpm run bench:longmemeval:check

smoke-ids:
	cd $(ROOT) && corepack pnpm run bench:longmemeval:check

data:
	@echo "LongMemEval data download is approval-required and intentionally not part of the hermetic check."
	@exit 2

reproduce:
	@echo "Provider-backed LongMemEval smoke reproduction requires explicit approval, dataset availability, and model credentials."
	@exit 2

reproduce-full:
	@echo "Full LongMemEval-S reproduction requires explicit approval, dataset availability, model credentials, and runtime budget."
	@exit 2
41 changes: 41 additions & 0 deletions benchmark/longmemeval/README.md
@@ -0,0 +1,41 @@
# LongMemEval-S Harness

This directory is the foundation for the issue #313 LongMemEval-S QA harness.
It is intentionally hermetic in this first slice: no dataset download, no
submodule, no provider calls, no new dependencies, and no historical baseline
claim.

The existing LongMemEval numbers in `benchmark/LONGMEMEVAL.md` are
retrieval-only diagnostics. This harness adds the reproducibility and
statistics structure needed for a later approved end-to-end run with reader and
judge models.

## Local Check

```sh
corepack pnpm run bench:longmemeval:check
make -C benchmark/longmemeval check
```

The check validates:

- the tiny checked-in fixture shape,
- the 50 smoke IDs against committed retrieval result provenance,
- the six issue #313 system definitions,
- prompt hashing and manifest redaction,
- markdown table rendering.

## Deferred Full-Scope Work

These targets are present to document the intended workflow, but they fail
closed until maintainers approve the needed boundaries:

```sh
make -C benchmark/longmemeval data
make -C benchmark/longmemeval reproduce
make -C benchmark/longmemeval reproduce-full
```

Approval is required before adding a LongMemEval submodule, downloading the
real dataset in automation, calling reader or judge models, wiring a real PR CI
benchmark gate, or publishing a `v0.9.24` QA baseline.
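
The Local Check list in this README mentions prompt hashing, but `manifest.ts` is not part of the visible diff. As a rough sketch only, hashing a prompt could look something like the following. The name `hashPrompt` and the line-ending normalization are assumptions made for illustration, not the harness's actual code.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: the real hashing logic lives in manifest.ts, which is not shown here.
function hashPrompt(text: string): string {
  // Assumption: normalize CRLF to LF so a Windows checkout hashes the same as a Unix one.
  const normalized = text.replace(/\r\n/g, "\n");
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Prompt text copied from prompts/reader.md and prompts/judge.md in this change.
const promptHashes = {
  reader: hashPrompt(
    "You are the LongMemEval-S reader. Answer the question using only the retrieved context.",
  ),
  judge: hashPrompt(
    "You are the LongMemEval-S judge. Score whether the answer matches the reference answer.",
  ),
};

console.log(promptHashes);
```

Recording a digest rather than the prompt text would keep a manifest small while still making any prompt edit visible as a changed hash.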
50 changes: 50 additions & 0 deletions benchmark/longmemeval/data/smoke-ids.txt
@@ -0,0 +1,50 @@
e47becba
118b2229
51a45a95
58bf7951
1e043500
c5e8278d
6ade9755
6f9b354f
58ef2f1c
f8c5f88b
5d3d2817
7527f7e2
c960da58
3b6f954b
726462e0
94f70d80
66f24dbb
ad7109d1
af8d2e46
dccbc061
c8c3f81d
8ebdbe50
6b168ec8
75499fd8
21436231
95bcc1c8
0862e8bf
853b0a1d
a06e4cfe
37d43f65
b86304ba
d52b4f67
25e5aa4f
caf9ead2
8550ddae
60d45044
3f1e9474
86b68151
577d4d32
ec81a493
15745da0
e01b8e2f
bc8a6e93
ccb36322
001be529
b320f3f8
19b5f2b3
4fd1909e
545bd2b5
8a137a7f
1 change: 1 addition & 0 deletions benchmark/longmemeval/prompts/judge.md
@@ -0,0 +1 @@
You are the LongMemEval-S judge. Score whether the answer matches the reference answer.
1 change: 1 addition & 0 deletions benchmark/longmemeval/prompts/reader.md
@@ -0,0 +1 @@
You are the LongMemEval-S reader. Answer the question using only the retrieved context.
116 changes: 116 additions & 0 deletions benchmark/longmemeval/src/check.ts
@@ -0,0 +1,116 @@
import { readFileSync } from "node:fs";
import { fileURLToPath } from "node:url";
import { loadLongMemEvalRows, selectSmokeRows } from "./data.js";
import { buildManifest } from "./manifest.js";
import { renderResultsTables } from "./render-tables.js";
import { getLongMemEvalSystems } from "./systems.js";

export interface LongMemEvalCheckResult {
  ok: true;
  fixtureRows: number;
  systems: number;
  smokeIds: number;
}

const fixturePath = fileURLToPath(
  new URL("../../../test/fixtures/longmemeval/mini.json", import.meta.url),
);
const smokeIdsPath = fileURLToPath(new URL("../data/smoke-ids.txt", import.meta.url));
const hybridResultsPath = fileURLToPath(
  new URL("../../data/longmemeval_results_hybrid.json", import.meta.url),
);

function readSmokeIds(): string[] {
  return readFileSync(smokeIdsPath, "utf8")
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);
}

function validateSmokeIds(smokeIds: string[]): void {
  if (smokeIds.length !== 50) {
    throw new Error(`expected 50 smoke ids, got ${smokeIds.length}`);
  }
  if (new Set(smokeIds).size !== smokeIds.length) {
    throw new Error("smoke ids must be unique");
  }
  const hybridResults = JSON.parse(readFileSync(hybridResultsPath, "utf8")) as {
    per_question?: Array<{ question_id?: string }>;
  };
  const knownIds = new Set(
    (hybridResults.per_question ?? []).map((row) => row.question_id).filter(Boolean),
  );
  const missing = smokeIds.filter((id) => !knownIds.has(id));
  if (missing.length > 0) {
    throw new Error(`smoke ids missing from checked-in hybrid results: ${missing.join(", ")}`);
  }
}

export async function runLongMemEvalCheck(): Promise<LongMemEvalCheckResult> {
  const fixtureRows = loadLongMemEvalRows(fixturePath);
  selectSmokeRows(fixtureRows, ["q1", "q2", "q3"]);

  const smokeIds = readSmokeIds();
  validateSmokeIds(smokeIds);

  const systems = getLongMemEvalSystems();
  const manifest = buildManifest({
    runId: "check",
    createdAt: "2026-06-19T00:00:00.000Z",
    commitSha: "local-check",
    packageVersion: "0.0.0-check",
    dataset: { name: "fixture", sha256: "fixture" },
    prompts: { reader: "reader", judge: "judge" },
    models: {
      reader: { provider: "mock", model: "mock-reader", temperature: 0 },
      judge: { provider: "mock", model: "mock-judge", temperature: 0 },
    },
    systems,
  });
  if (manifest.systems.length !== 6) {
    throw new Error(`expected 6 systems in manifest, got ${manifest.systems.length}`);
  }

  const markdown = renderResultsTables({
    categories: {
      fixture: {
        "agentmemory-baseline": {
          n: 1,
          correct: 1,
          accuracy: 1,
          ci: { low: 1, high: 1 },
        },
      },
    },
    hypotheses: [
      {
        id: "check",
        comparison: "fixture",
        rawPValue: 1,
        adjustedPValue: 1,
        claimed: false,
      },
    ],
  });
  if (!markdown.includes("directional")) {
    throw new Error("rendered table should include directional claim label");
  }

  return {
    ok: true,
    fixtureRows: fixtureRows.length,
    systems: systems.length,
    smokeIds: smokeIds.length,
  };
}

if (process.argv[1] && fileURLToPath(import.meta.url) === process.argv[1]) {
  runLongMemEvalCheck()
    .then((result) => {
      console.log(JSON.stringify(result, null, 2));
    })
    .catch((err: unknown) => {
      console.error(err instanceof Error ? err.message : String(err));
      process.exit(1);
    });
}
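
To make the smoke-ID rules in `check.ts` easier to reason about without touching the filesystem, here is a small in-memory sketch of the same parsing and validation steps. The names `parseSmokeIds` and `findSmokeIdProblems` are hypothetical, and unlike `validateSmokeIds` this variant collects every problem instead of throwing on the first one.

```typescript
// Hypothetical in-memory variant of the smoke-ID checks; not part of the PR.
function parseSmokeIds(text: string): string[] {
  // Same approach as readSmokeIds: split on LF or CRLF, trim, drop blank lines.
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);
}

function findSmokeIdProblems(
  ids: string[],
  knownIds: Set<string>,
  expectedCount: number,
): string[] {
  const problems: string[] = [];
  if (ids.length !== expectedCount) {
    problems.push(`expected ${expectedCount} smoke ids, got ${ids.length}`);
  }
  if (new Set(ids).size !== ids.length) {
    problems.push("smoke ids must be unique");
  }
  const missing = ids.filter((id) => !knownIds.has(id));
  if (missing.length > 0) {
    problems.push(`smoke ids missing from known results: ${missing.join(", ")}`);
  }
  return problems;
}

// Mixed line endings, a blank line, and stray whitespace are all tolerated by the parser.
const ids = parseSmokeIds("e47becba\r\n118b2229\n\n  51a45a95  \n");
// Only two of the three IDs are "known" here, so one problem is reported.
const problems = findSmokeIdProblems(ids, new Set(["e47becba", "118b2229"]), 3);

console.log(ids);
console.log(problems);
```

Returning a list of problems is a design choice for this illustration only; the fail-fast `throw` in the PR is arguably simpler for a check script whose only job is to exit non-zero.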
128 changes: 128 additions & 0 deletions benchmark/longmemeval/src/data.ts
@@ -0,0 +1,128 @@
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";
import type {
  ChecksumResult,
  LongMemEvalRawRow,
  LongMemEvalRow,
  LongMemEvalTurn,
} from "./types.js";

function requireString(value: unknown, field: string): string {
  if (typeof value !== "string") {
    throw new Error(`${field} must be a string`);
  }
  return value;
}

function requireStringArray(value: unknown, field: string): string[] {
  if (!Array.isArray(value) || value.some((item) => typeof item !== "string")) {
    throw new Error(`${field} must be an array of strings`);
  }
  return value;
}

function validateTurn(value: unknown, index: number): LongMemEvalTurn {
  if (typeof value !== "object" || value === null) {
    throw new Error(`turn ${index} must be an object`);
  }
  const record = value as Record<string, unknown>;
  return {
    role: requireString(record.role, `turn ${index} role`),
    content: requireString(record.content, `turn ${index} content`),
  };
}

function validateRawRow(value: unknown): LongMemEvalRawRow {
  if (typeof value !== "object" || value === null) {
    throw new Error("LongMemEval row must be an object");
  }
  const record = value as Record<string, unknown>;
  const questionId = requireString(record.question_id, "question_id");
  const haystackSessionIds = requireStringArray(
    record.haystack_session_ids,
    "haystack_session_ids",
  );
  if (!Array.isArray(record.haystack_sessions)) {
    throw new Error("haystack_sessions must be an array");
  }
  if (haystackSessionIds.length !== record.haystack_sessions.length) {
    throw new Error(
      `LongMemEval row ${questionId}: haystack_session_ids (${haystackSessionIds.length}) and haystack_sessions (${record.haystack_sessions.length}) length mismatch`,
    );
  }
  const haystackSessions = record.haystack_sessions.map((session, sessionIndex) => {
    if (!Array.isArray(session)) {
      throw new Error(`haystack_sessions ${sessionIndex} must be an array`);
    }
    return session.map((turn, turnIndex) => validateTurn(turn, turnIndex));
  });

  const raw: LongMemEvalRawRow = {
    question_id: questionId,
    question_type: requireString(record.question_type, "question_type"),
    question: requireString(record.question, "question"),
    answer_session_ids: requireStringArray(record.answer_session_ids, "answer_session_ids"),
    haystack_session_ids: haystackSessionIds,
    haystack_sessions: haystackSessions,
  };
  if (record.answer !== undefined) raw.answer = requireString(record.answer, "answer");
  return raw;
}

function flattenSession(turns: LongMemEvalTurn[]): string {
  return turns.map((turn) => `[${turn.role}] ${turn.content}`).join("\n\n");
}

export function loadLongMemEvalRows(path: string): LongMemEvalRow[] {
  const raw = JSON.parse(readFileSync(path, "utf8")) as unknown;
  if (!Array.isArray(raw)) {
    throw new Error("expected LongMemEval JSON array");
  }
  return raw.map((item) => {
    const row = validateRawRow(item);
    return {
      id: row.question_id,
      type: row.question_type,
      question: row.question,
      answer: row.answer,
      answerSessionIds: row.answer_session_ids,
      haystack: row.haystack_session_ids.map((id, index) => ({
        id,
        turns: row.haystack_sessions[index],
        content: flattenSession(row.haystack_sessions[index]),
      })),
    };
  });
}

export function selectSmokeRows(
  rows: LongMemEvalRow[],
  questionIds: string[],
): LongMemEvalRow[] {
  const byId = new Map(rows.map((row) => [row.id, row]));
  const missing = questionIds.filter((id) => !byId.has(id));
  if (missing.length > 0) {
    throw new Error(`missing smoke question ids: ${missing.join(", ")}`);
  }
  return questionIds.map((id) => byId.get(id)!);
}

export function sha256File(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

export function verifyChecksumLine(line: string): ChecksumResult {
  const match = line.match(/^([a-fA-F0-9]+)\s+(.+)$/);
  if (!match) {
    throw new Error("checksum line must be '<sha256> <path>'");
  }
  const expected = match[1].toLowerCase();
  const path = match[2];
  const actual = sha256File(path);
  return {
    ok: actual === expected,
    expected,
    actual,
    path,
  };
}
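
As a usage illustration for the checksum helpers at the end of `data.ts`, the sketch below restates them locally (so it runs on its own) and exercises them against a throwaway temp file. The names `sha256Hex` and `checkLine`, the temp-directory prefix, and the sample file contents are all made up for this example; the regular expression mirrors the one in `verifyChecksumLine`.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Local restatement of sha256File so the example is self-contained.
function sha256Hex(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

// Local restatement of verifyChecksumLine: parse "<hex> <path>" and compare digests.
function checkLine(line: string): { ok: boolean; expected: string; actual: string; path: string } {
  const match = line.match(/^([a-fA-F0-9]+)\s+(.+)$/);
  if (!match) {
    throw new Error("checksum line must be '<sha256> <path>'");
  }
  const expected = match[1].toLowerCase();
  const path = match[2];
  const actual = sha256Hex(path);
  return { ok: actual === expected, expected, actual, path };
}

// Write a tiny file into a fresh temp directory (hypothetical stand-in for a dataset file).
const dir = mkdtempSync(join(tmpdir(), "lme-checksum-"));
const file = join(dir, "sample.json");
writeFileSync(file, "[]\n");

// An uppercase digest still verifies because the expected value is lowercased first.
const good = checkLine(`${sha256Hex(file).toUpperCase()}  ${file}`);
// A well-formed but wrong digest parses fine and reports ok: false.
const bad = checkLine(`${"0".repeat(64)}  ${file}`);

rmSync(dir, { recursive: true, force: true });

console.log({ good: good.ok, bad: bad.ok });
```

Note that the pattern accepts a hex string of any length, so a truncated digest is reported as a mismatch (`ok: false`) rather than as a malformed line. Whether that distinction matters for the eventual data-download step is a question for the maintainers.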