wbugitlab1 · Jun 20, 2026 · Jun 19, 2026
diff --git a/.gitignore b/.gitignore
@@ -13,6 +13,8 @@ dist/
 plugin/scripts/*.map
 plugin/scripts/*.d.mts
 data/
+!benchmark/longmemeval/data/
+!benchmark/longmemeval/data/smoke-ids.txt
 !eval/data/
 !eval/data/**
 data-*/

diff --git a/README.md b/README.md
@@ -286,6 +286,8 @@ Latest release notes: [CHANGELOG.md](CHANGELOG.md).
 
 **Reproduce locally:** [`eval/README.md`](eval/README.md) — adapter-pluggable harness for LongMemEval `_s` (public 500-Q) + `coding-agent-life-v1` (in-house 15-session corpus). Grep / vector / agentmemory adapters score side-by-side, NDJSON output, published scorecards land in [`docs/benchmarks/`](docs/benchmarks/).
 
+**QA harness foundation:** [`benchmark/longmemeval/`](benchmark/longmemeval/) contains the hermetic issue #313 harness scaffold for future end-to-end LongMemEval-S reader/judge runs, statistical tests, manifests, and reproducibility checks. The published numbers above remain retrieval-only until an approved provider-backed run publishes QA results.
+
 **Pairs with [codegraph](https://github.com/colbymchenry/codegraph), [Understand Anything](https://github.com/Lum1104/Understand-Anything), and [Graphify](https://github.com/safishamsi/graphify).** Code-graph indexing, multi-agent build pipelines, and broader knowledge graphs across docs / PDFs / images / videos. agentmemory remembers the work; those three projects light up the rest of the context layer. Recipes + question-routing table: [`docs/recipes/pairings.md`](docs/recipes/pairings.md).
 
 ---

diff --git a/benchmark/README.md b/benchmark/README.md
@@ -11,6 +11,11 @@ Two kinds of numbers live in this directory:
    throughput against a running daemon. This is the file you want when
    somebody asks "what's p99 at 100k memories under concurrency 100?".
 
+`longmemeval/` is the hermetic foundation for the issue #313
+LongMemEval-S QA/statistics harness. It does not run provider-backed
+reader or judge models yet; use `corepack pnpm run bench:longmemeval:check`
+to validate the local harness structure.
+
 ## load-100k.ts
 
 Hand-rolled, dependency-free load harness. Issues real HTTP against a

diff --git a/benchmark/longmemeval/Makefile b/benchmark/longmemeval/Makefile
@@ -0,0 +1,21 @@
+ROOT := ../..
+
+.PHONY: check smoke-ids data reproduce reproduce-full
+
+check:
+	cd $(ROOT) && corepack pnpm run bench:longmemeval:check
+
+smoke-ids:
+	cd $(ROOT) && corepack pnpm run bench:longmemeval:check
+
+data:
+	@echo "LongMemEval data download is approval-required and intentionally not part of the hermetic check."
+	@exit 2
+
+reproduce:
+	@echo "Provider-backed LongMemEval smoke reproduction requires explicit approval, dataset availability, and model credentials."
+	@exit 2
+
+reproduce-full:
+	@echo "Full LongMemEval-S reproduction requires explicit approval, dataset availability, model credentials, and runtime budget."
+	@exit 2
diff --git a/benchmark/longmemeval/README.md b/benchmark/longmemeval/README.md
@@ -0,0 +1,41 @@
+# LongMemEval-S Harness
+
+This directory is the foundation for the issue #313 LongMemEval-S QA harness.
+It is intentionally hermetic in this first slice: no dataset download, no
+submodule, no provider calls, no new dependencies, and no historical baseline
+claim.
+
+The existing LongMemEval numbers in `benchmark/LONGMEMEVAL.md` are
+retrieval-only diagnostics. This harness adds the reproducibility and
+statistics structure needed for a later approved end-to-end run with reader and
+judge models.
+
+## Local Check
+
+```sh
+corepack pnpm run bench:longmemeval:check
+make -C benchmark/longmemeval check
+```
+
+The check validates:
+
+- the tiny checked-in fixture shape,
+- the 50 smoke IDs (exact count, uniqueness, and presence in the committed hybrid retrieval results),
+- the six issue #313 system definitions,
+- prompt hashing and manifest redaction,
+- markdown table rendering.
+
+## Deferred Full-Scope Work
+
+These targets are present to document the intended workflow, but they fail
+closed until maintainers approve the needed boundaries:
+
+```sh
+make -C benchmark/longmemeval data
+make -C benchmark/longmemeval reproduce
+make -C benchmark/longmemeval reproduce-full
+```
+
+Approval is required before adding a LongMemEval submodule, downloading the
+real dataset in automation, calling reader or judge models, wiring a real PR CI
+benchmark gate, or publishing a `v0.9.24` QA baseline.
diff --git a/benchmark/longmemeval/data/smoke-ids.txt b/benchmark/longmemeval/data/smoke-ids.txt
@@ -0,0 +1,50 @@
+e47becba
+118b2229
+51a45a95
+58bf7951
+1e043500
+c5e8278d
+6ade9755
+6f9b354f
+58ef2f1c
+f8c5f88b
+5d3d2817
+7527f7e2
+c960da58
+3b6f954b
+726462e0
+94f70d80
+66f24dbb
+ad7109d1
+af8d2e46
+dccbc061
+c8c3f81d
+8ebdbe50
+6b168ec8
+75499fd8
+21436231
+95bcc1c8
+0862e8bf
+853b0a1d
+a06e4cfe
+37d43f65
+b86304ba
+d52b4f67
+25e5aa4f
+caf9ead2
+8550ddae
+60d45044
+3f1e9474
+86b68151
+577d4d32
+ec81a493
+15745da0
+e01b8e2f
+bc8a6e93
+ccb36322
+001be529
+b320f3f8
+19b5f2b3
+4fd1909e
+545bd2b5
+8a137a7f
diff --git a/benchmark/longmemeval/prompts/judge.md b/benchmark/longmemeval/prompts/judge.md
@@ -0,0 +1 @@
+You are the LongMemEval-S judge. Score whether the answer matches the reference answer.
diff --git a/benchmark/longmemeval/prompts/reader.md b/benchmark/longmemeval/prompts/reader.md
@@ -0,0 +1 @@
+You are the LongMemEval-S reader. Answer the question using only the retrieved context.
diff --git a/benchmark/longmemeval/src/check.ts b/benchmark/longmemeval/src/check.ts
@@ -0,0 +1,116 @@
+import { readFileSync } from "node:fs";
+import { fileURLToPath } from "node:url";
+import { loadLongMemEvalRows, selectSmokeRows } from "./data.js";
+import { buildManifest } from "./manifest.js";
+import { renderResultsTables } from "./render-tables.js";
+import { getLongMemEvalSystems } from "./systems.js";
+
+export interface LongMemEvalCheckResult {
+  ok: true;
+  fixtureRows: number;
+  systems: number;
+  smokeIds: number;
+}
+
+const fixturePath = fileURLToPath(
+  new URL("../../../test/fixtures/longmemeval/mini.json", import.meta.url),
+);
+const smokeIdsPath = fileURLToPath(new URL("../data/smoke-ids.txt", import.meta.url));
+const hybridResultsPath = fileURLToPath(
+  new URL("../../data/longmemeval_results_hybrid.json", import.meta.url),
+);
+
+function readSmokeIds(): string[] {
+  return readFileSync(smokeIdsPath, "utf8")
+    .split(/\r?\n/)
+    .map((line) => line.trim())
+    .filter(Boolean);
+}
+
+function validateSmokeIds(smokeIds: string[]): void {
+  if (smokeIds.length !== 50) {
+    throw new Error(`expected 50 smoke ids, got ${smokeIds.length}`);
+  }
+  if (new Set(smokeIds).size !== smokeIds.length) {
+    throw new Error("smoke ids must be unique");
+  }
+  const hybridResults = JSON.parse(readFileSync(hybridResultsPath, "utf8")) as {
+    per_question?: Array<{ question_id?: string }>;
+  };
+  const knownIds = new Set(
+    (hybridResults.per_question ?? []).map((row) => row.question_id).filter(Boolean),
+  );
+  const missing = smokeIds.filter((id) => !knownIds.has(id));
+  if (missing.length > 0) {
+    throw new Error(`smoke ids missing from checked-in hybrid results: ${missing.join(", ")}`);
+  }
+}
+
+export async function runLongMemEvalCheck(): Promise<LongMemEvalCheckResult> {
+  const fixtureRows = loadLongMemEvalRows(fixturePath);
+  selectSmokeRows(fixtureRows, ["q1", "q2", "q3"]);
+
+  const smokeIds = readSmokeIds();
+  validateSmokeIds(smokeIds);
+
+  const systems = getLongMemEvalSystems();
+  const manifest = buildManifest({
+    runId: "check",
+    createdAt: "2026-06-19T00:00:00.000Z",
+    commitSha: "local-check",
+    packageVersion: "0.0.0-check",
+    dataset: { name: "fixture", sha256: "fixture" },
+    prompts: { reader: "reader", judge: "judge" },
+    models: {
+      reader: { provider: "mock", model: "mock-reader", temperature: 0 },
+      judge: { provider: "mock", model: "mock-judge", temperature: 0 },
+    },
+    systems,
+  });
+  if (manifest.systems.length !== 6) {
+    throw new Error(`expected 6 systems in manifest, got ${manifest.systems.length}`);
+  }
+
+  const markdown = renderResultsTables({
+    categories: {
+      fixture: {
+        "agentmemory-baseline": {
+          n: 1,
+          correct: 1,
+          accuracy: 1,
+          ci: { low: 1, high: 1 },
+        },
+      },
+    },
+    hypotheses: [
+      {
+        id: "check",
+        comparison: "fixture",
+        rawPValue: 1,
+        adjustedPValue: 1,
+        claimed: false,
+      },
+    ],
+  });
+  if (!markdown.includes("directional")) {
+    throw new Error("rendered table should include directional claim label");
+  }
+
+  return {
+    ok: true,
+    fixtureRows: fixtureRows.length,
+    systems: systems.length,
+    smokeIds: smokeIds.length,
+  };
+}
+
+if (process.argv[1] && fileURLToPath(import.meta.url) === process.argv[1]) {
+  runLongMemEvalCheck()
+    .then((result) => {
+      console.log(JSON.stringify(result, null, 2));
+    })
+    .catch((err: unknown) => {
+      console.error(err instanceof Error ? err.message : String(err));
+      process.exit(1);
+    });
+}
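Review note: `validateSmokeIds` in `check.ts` performs three checks: an exact count, uniqueness, and membership in the committed hybrid results. The sketch below restates that logic against in-memory inputs so it can be tried without the repository files. The helper names `parseSmokeIds` and `findSmokeIdProblems`, the `expectedCount` parameter, and the sample IDs are illustrative assumptions and not part of this PR; unlike the PR code, the sketch collects problems instead of throwing on the first one.

```typescript
// Hypothetical helpers (not in the PR): same checks as validateSmokeIds,
// but run on in-memory data and returning a list of problems.
interface HybridResults {
  per_question?: Array<{ question_id?: string }>;
}

// Mirrors readSmokeIds: split on LF or CRLF, trim, drop blank lines.
function parseSmokeIds(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);
}

function findSmokeIdProblems(
  smokeIds: string[],
  results: HybridResults,
  expectedCount: number,
): string[] {
  const problems: string[] = [];
  if (smokeIds.length !== expectedCount) {
    problems.push(`expected ${expectedCount} smoke ids, got ${smokeIds.length}`);
  }
  if (new Set(smokeIds).size !== smokeIds.length) {
    problems.push("smoke ids must be unique");
  }
  // Rows without a question_id are ignored, as in the PR code.
  const knownIds = new Set(
    (results.per_question ?? []).map((row) => row.question_id).filter(Boolean),
  );
  const missing = smokeIds.filter((id) => !knownIds.has(id));
  if (missing.length > 0) {
    problems.push(`missing from results: ${missing.join(", ")}`);
  }
  return problems;
}

// Made-up sample IDs; the mixed line endings and padding are deliberate.
const ids = parseSmokeIds("aaaa0001\r\naaaa0002\n\n  aaaa0003  \n");
const results: HybridResults = {
  per_question: [{ question_id: "aaaa0001" }, { question_id: "aaaa0002" }, {}],
};
const problems = findSmokeIdProblems(ids, results, 3);
console.log(problems); // one entry: "missing from results: aaaa0003"
```

Returning a list makes it easier to report every problem in one pass. The throwing style used in the PR is the simpler choice for a pass/fail check.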
diff --git a/benchmark/longmemeval/src/data.ts b/benchmark/longmemeval/src/data.ts
@@ -0,0 +1,128 @@
+import { createHash } from "node:crypto";
+import { readFileSync } from "node:fs";
+import type {
+  ChecksumResult,
+  LongMemEvalRawRow,
+  LongMemEvalRow,
+  LongMemEvalTurn,
+} from "./types.js";
+
+function requireString(value: unknown, field: string): string {
+  if (typeof value !== "string") {
+    throw new Error(`${field} must be a string`);
+  }
+  return value;
+}
+
+function requireStringArray(value: unknown, field: string): string[] {
+  if (!Array.isArray(value) || value.some((item) => typeof item !== "string")) {
+    throw new Error(`${field} must be an array of strings`);
+  }
+  return value;
+}
+
+function validateTurn(value: unknown, index: number): LongMemEvalTurn {
+  if (typeof value !== "object" || value === null) {
+    throw new Error(`turn ${index} must be an object`);
+  }
+  const record = value as Record<string, unknown>;
+  return {
+    role: requireString(record.role, `turn ${index} role`),
+    content: requireString(record.content, `turn ${index} content`),
+  };
+}
+
+function validateRawRow(value: unknown): LongMemEvalRawRow {
+  if (typeof value !== "object" || value === null) {
+    throw new Error("LongMemEval row must be an object");
+  }
+  const record = value as Record<string, unknown>;
+  const questionId = requireString(record.question_id, "question_id");
+  const haystackSessionIds = requireStringArray(
+    record.haystack_session_ids,
+    "haystack_session_ids",
+  );
+  if (!Array.isArray(record.haystack_sessions)) {
+    throw new Error("haystack_sessions must be an array");
+  }
+  if (haystackSessionIds.length !== record.haystack_sessions.length) {
+    throw new Error(
+      `LongMemEval row ${questionId}: haystack_session_ids (${haystackSessionIds.length}) and haystack_sessions (${record.haystack_sessions.length}) length mismatch`,
+    );
+  }
+  const haystackSessions = record.haystack_sessions.map((session, sessionIndex) => {
+    if (!Array.isArray(session)) {
+      throw new Error(`haystack_sessions ${sessionIndex} must be an array`);
+    }
+    return session.map((turn, turnIndex) => validateTurn(turn, turnIndex));
+  });
+
+  const raw: LongMemEvalRawRow = {
+    question_id: questionId,
+    question_type: requireString(record.question_type, "question_type"),
+    question: requireString(record.question, "question"),
+    answer_session_ids: requireStringArray(record.answer_session_ids, "answer_session_ids"),
+    haystack_session_ids: haystackSessionIds,
+    haystack_sessions: haystackSessions,
+  };
+  if (record.answer !== undefined) raw.answer = requireString(record.answer, "answer");
+  return raw;
+}
+
+function flattenSession(turns: LongMemEvalTurn[]): string {
+  return turns.map((turn) => `[${turn.role}] ${turn.content}`).join("\n\n");
+}
+
+export function loadLongMemEvalRows(path: string): LongMemEvalRow[] {
+  const raw = JSON.parse(readFileSync(path, "utf8")) as unknown;
+  if (!Array.isArray(raw)) {
+    throw new Error("expected LongMemEval JSON array");
+  }
+  return raw.map((item) => {
+    const row = validateRawRow(item);
+    return {
+      id: row.question_id,
+      type: row.question_type,
+      question: row.question,
+      answer: row.answer,
+      answerSessionIds: row.answer_session_ids,
+      haystack: row.haystack_session_ids.map((id, index) => ({
+        id,
+        turns: row.haystack_sessions[index],
+        content: flattenSession(row.haystack_sessions[index]),
+      })),
+    };
+  });
+}
+
+export function selectSmokeRows(
+  rows: LongMemEvalRow[],
+  questionIds: string[],
+): LongMemEvalRow[] {
+  const byId = new Map(rows.map((row) => [row.id, row]));
+  const missing = questionIds.filter((id) => !byId.has(id));
+  if (missing.length > 0) {
+    throw new Error(`missing smoke question ids: ${missing.join(", ")}`);
+  }
+  return questionIds.map((id) => byId.get(id)!);
+}
+
+export function sha256File(path: string): string {
+  return createHash("sha256").update(readFileSync(path)).digest("hex");
+}
+
+export function verifyChecksumLine(line: string): ChecksumResult {
+  const match = line.match(/^([a-fA-F0-9]+)\s+(.+)$/);
+  if (!match) {
+    throw new Error("checksum line must be '<sha256>  <path>'");
+  }
+  const expected = match[1].toLowerCase();
+  const path = match[2];
+  const actual = sha256File(path);
+  return {
+    ok: actual === expected,
+    expected,
+    actual,
+    path,
+  };
+}
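Review note: `verifyChecksumLine` expects a `<sha256>  <path>` line and compares the stated digest with the digest of the file at that path. The following is a minimal sketch of how it could be exercised. The two functions are inlined copies of the PR logic so the block runs on its own, and the temporary directory and file name are assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Inlined copies of sha256File / verifyChecksumLine from data.ts.
function sha256File(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

function verifyChecksumLine(line: string) {
  const match = line.match(/^([a-fA-F0-9]+)\s+(.+)$/);
  if (!match) {
    throw new Error("checksum line must be '<sha256>  <path>'");
  }
  const expected = match[1].toLowerCase();
  const path = match[2];
  const actual = sha256File(path);
  return { ok: actual === expected, expected, actual, path };
}

// Illustrative setup: a throwaway file in a temp directory.
const dir = mkdtempSync(join(tmpdir(), "lme-checksum-"));
const file = join(dir, "sample.json");
writeFileSync(file, "[]\n");

// An upper-case digest is accepted because the expected value is lower-cased.
const good = verifyChecksumLine(`${sha256File(file).toUpperCase()}  ${file}`);
// A wrong digest is reported through `ok: false` rather than an exception.
const bad = verifyChecksumLine(`${"0".repeat(64)}  ${file}`);

console.log(good.ok, bad.ok); // prints: true false
rmSync(dir, { recursive: true, force: true });
```

Two behaviors follow from the code as written. The regular expression accepts a hex string of any length, so a truncated digest yields `ok: false` instead of a format error. The path is passed to `readFileSync` unchanged, so a relative path resolves against the current working directory.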