feat: add hermetic LongMemEval harness foundation #1024

Merged: wbugitlab1 merged 7 commits into main from issue/313-longmemeval-harness on Jun 20, 2026
Conversation

@wbugitlab1

Owner

Summary

  • Add a hermetic benchmark/longmemeval/ foundation for issue #313 (feat(bench): LongMemEval-S harness with statistical rigor + CI gate), with fixture data validation, six system definitions, manifest hashing/redaction, statistical utilities, markdown table rendering, and a local check entrypoint.
  • Add deterministic LongMemEval harness tests and the bench:longmemeval:check script.
  • Document that provider-backed reader/judge runs, real dataset download/submodule policy, historical QA baselines, and real CI benchmark gates remain approval-required future work.
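To illustrate what "manifest hashing/redaction" can look like, here is a minimal sketch, not the harness's actual code. The `Manifest` type, the `SENSITIVE_KEY` pattern, and the function names `redactAndSort` and `hashManifest` are all hypothetical; the PR does not show the real field names or redaction rules. The sketch assumes secret-looking values are replaced before hashing and that object keys are sorted so the digest does not depend on key order.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest shape; the real harness fields are not shown in this PR.
type Manifest = Record<string, unknown>;

// Assumed rule: values under keys that look secret-bearing are redacted.
const SENSITIVE_KEY = /(key|token|secret|password)/i;

// Recursively redact sensitive values and sort object keys for stable output.
function redactAndSort(value: unknown, key = ""): unknown {
  if (key !== "" && SENSITIVE_KEY.test(key)) return "[REDACTED]";
  if (Array.isArray(value)) return value.map((item) => redactAndSort(item));
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => [k, redactAndSort(v, k)] as const);
    return Object.fromEntries(entries);
  }
  return value;
}

// SHA-256 over the canonical (redacted, key-sorted) JSON serialization.
function hashManifest(manifest: Manifest): string {
  const canonical = JSON.stringify(redactAndSort(manifest));
  return createHash("sha256").update(canonical).digest("hex");
}
```

Under these assumptions, two manifests that differ only in key order or in redacted values produce the same digest, so the hash can be recorded without leaking or depending on credentials.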

Refs #313
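The title of issue #313 asks for "statistical rigor", and the summary mentions statistical utilities without saying which statistics they compute. As one plausible example only, the sketch below computes a Wilson score interval for an accuracy estimate. The `Interval` type and the `wilsonInterval` name are hypothetical and are not taken from the harness.

```typescript
// Hypothetical result type for a two-sided confidence interval.
interface Interval {
  low: number;
  high: number;
}

// Wilson score interval for a binomial proportion (correct / total).
// z = 1.96 corresponds to roughly 95% two-sided coverage.
function wilsonInterval(correct: number, total: number, z = 1.96): Interval {
  if (!Number.isInteger(total) || total <= 0) {
    throw new RangeError("total must be a positive integer");
  }
  if (correct < 0 || correct > total) {
    throw new RangeError("correct must be between 0 and total");
  }
  const p = correct / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  // Clamp to [0, 1] to absorb floating-point error at the extremes.
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}
```

Unlike the plain normal approximation, this interval stays inside [0, 1] and remains usable when a system answers nearly all or nearly none of a small question set correctly.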

Verification

  • corepack pnpm exec vitest run test/longmemeval-harness.test.ts test/eval-adapters.test.ts test/quality-gates.test.ts passed: 3 files / 43 tests.
  • corepack pnpm run bench:longmemeval:check passed: { ok: true, fixtureRows: 3, systems: 6, smokeIds: 50 }.
  • corepack pnpm run lint passed.
  • corepack pnpm test passed after base merge: 212 files / 2920 tests. One earlier full-suite post-merge run hit a transient test/codex-sdk-provider.test.ts 2000ms timeout; the isolated file then passed and the full rerun passed.
  • semgrep scan --config p/default --error --metrics=off . passed: 0 findings.
  • gitleaks protect --staged --redact passed: no leaks. It initially caught two synthetic redaction-test literals, which were changed to runtime-composed fakes before the passing run.
  • OSV was not run because this change does not alter dependencies, lockfiles, container images, vendored code, or third-party package surfaces.
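The gitleaks note above mentions replacing synthetic redaction-test literals with runtime-composed fakes. A minimal sketch of that idea follows, assuming a hypothetical `fakeSecret` helper and a hypothetical `redact` function with a made-up `sk_` token pattern; the harness's real helpers and rules are not shown in this PR.

```typescript
// Hypothetical helper: build a secret-shaped string at runtime so that no
// complete token literal appears in the source file.
function fakeSecret(prefix: string, length: number, fill = "x"): string {
  return [prefix, fill.repeat(length)].join("_");
}

// Hypothetical redactor under test; the pattern is illustrative only.
function redact(text: string): string {
  return text.replace(/\bsk_[A-Za-z0-9]{16,}\b/g, "[REDACTED]");
}

// The test input is composed at runtime instead of being pasted as a literal.
const token = fakeSecret("sk", 24);
const line = `Authorization: Bearer ${token}`;
const redactedLine = redact(line);
```

The redaction logic is still exercised against a realistically shaped value. Whether a particular scanner rule still flags the source depends on that scanner's configuration.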

@wbugitlab1 wbugitlab1 merged commit be1b009 into main Jun 20, 2026
2 checks passed