Add agent benchmarks, fix various issues to improve benchmarks #271

Merged
ahyatt merged 75 commits into develop from agent-bench
Apr 14, 2026

ahyatt (Owner) commented Mar 14, 2026

No description provided.

ahyatt added 30 commits March 8, 2026 15:10
- Indent body text to align past heading stars for clearer hierarchy
- Stable sort order using note ID as tiebreaker, preventing reordering
  on state changes when creation times are tied
- Fix ekg-org-view-up-heading: bind current-id (was a free variable)
- Remove unused win-start variable in ekg-org-view--refresh
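
The tiebreaker described in the second bullet can be illustrated with a small comparator. This is a minimal sketch rather than the code from the commit: the plist keys `:id' and `:created' and both function names are hypothetical.

```elisp
;; Hypothetical note representation: plists with :id and :created.
(defun my-note-before-p (a b)
  "Return non-nil if note A sorts before note B.
Compare creation times first; fall back to the ID when they tie."
  (let ((ta (plist-get a :created))
        (tb (plist-get b :created)))
    (if (= ta tb)
        (< (plist-get a :id) (plist-get b :id))
      (< ta tb))))

(defun my-sort-notes (notes)
  "Return a sorted copy of NOTES without modifying the original list."
  (sort (copy-sequence notes) #'my-note-before-p))

(mapcar (lambda (n) (plist-get n :id))
        (my-sort-notes '((:id 3 :created 100)
                         (:id 1 :created 100)
                         (:id 2 :created 50))))
;; => (2 1 3)
```

Because the ID never changes, two notes with the same creation time keep the same relative order however often the list is re-sorted.
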
Use shadow-inheriting ekg-org-view-body face for task body text,
making it visually distinct from heading text.

ekg-agent-org--tool-add-item now assigns an explicit :org/sort-order
when creating tasks, placing them at the end of existing siblings.
This ensures consistent ordering regardless of ID generation or
creation-time ties, and prevents conflicts when mixing agent-created
and UI-created tasks under the same parent.
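
One way to place a new task at the end of its siblings is to take the largest existing sort order and add one. The helper below is a sketch under that assumption; the function name is hypothetical and the commit may compute the value differently.

```elisp
(defun my-next-sort-order (sibling-orders)
  "Return a sort order that places a new task after SIBLING-ORDERS.
SIBLING-ORDERS is a list of the numeric sort orders already in use
under the parent; an empty list yields 0."
  (if sibling-orders
      (1+ (apply #'max sibling-orders))
    0))

(my-next-sort-order '(0 1 5)) ;; => 6
(my-next-sort-order nil)      ;; => 0
```
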
ekg-org-view-delete and ekg-org-view-archive now operate on the
task and all its descendants, preventing orphaned children.
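
Operating on a task together with its descendants amounts to a depth-first walk of the task tree. The sketch below assumes a caller-supplied function that returns the direct children of an ID; that function and all names here are hypothetical, not the ekg-org API.

```elisp
(defun my-task-with-descendants (id children-fn)
  "Return ID followed by the IDs of all its descendants, depth first.
CHILDREN-FN is called with an ID and returns its direct child IDs."
  (cons id
        (apply #'append
               (mapcar (lambda (child)
                         (my-task-with-descendants child children-fn))
                       (funcall children-fn id)))))

;; A toy tree: 1 has children 2 and 3, and 2 has child 4.
(let ((tree '((1 . (2 3)) (2 . (4)) (3 . nil) (4 . nil))))
  (my-task-with-descendants 1 (lambda (id) (cdr (assq id tree)))))
;; => (1 2 4 3)
```

Deleting or archiving every ID in the returned list leaves no child without a parent.
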
When pressing 'c' on a heading, the insert placeholder now defaults
to the first-child slot of that heading rather than the nearest slot
by buffer position.  Previously, pressing 'c' on 'Project Beta'
would show the placeholder as a child of 'Project Alpha' (the
nearest position above), which was confusing and led to tasks being
created under the wrong parent.

Add ekg-org-test-view-insert-initial-position to verify the fix,
and ekg-org-test-view-top-level-spacing to confirm blank lines
separate top-level groups after task creation.

ekg-org-view--refresh was calling ekg-org-view--mount which ends
with switch-to-buffer, stealing the selected window.  When saving
a note triggered the ekg-note-save-hook refresh, quit-window in
ekg-edit-finalize would then operate on the org-view window
instead of the edit buffer window, so the edit buffer stayed open.

Fix by using vui-rerender on the existing instance instead of
remounting.  This re-renders in place without touching windows.

Introduces a benchmark framework that runs predefined tasks in a fresh
Emacs subprocess (via llm-test infrastructure) and measures the
ekg-agent's effectiveness along three axes:

- Task completion: did the agent accomplish the goal?
- Skill adherence: did the agent follow instructions stored in ekg?
- Memory discipline: did the agent store/retrieve knowledge in ekg?

Metrics collected per task: pass/fail for each axis, iteration count,
tools used, and wall-clock time.

Includes 7 benchmark tasks:
- coding-with-skills: write elisp following skill note conventions
- memory-retrieval: use previously stored decisions for a follow-up task
- elisp-debugging: find and fix a bug in an elisp file (emacs-y)
- buffer-transform: read CSV data and produce org-mode report (emacs-y)
- knowledge-synthesis: combine info from multiple notes to answer a question
- store-findings: investigate code and document findings as ekg notes
- follow-complex-skill: follow a multi-step checklist from a skill note

Entry points:
- ekg-agent-bench-run: run all benchmarks
- ekg-agent-bench-run-one: run a single task by name
- ekg-agent-bench-register-ert-tests: register as ERT tests
- run-bench.el: batch script for CI

- Write init errors to a temp file via condition-case, then check it
  after the daemon starts (the daemon starts even when init errors
  because server vars are set first)
- Show the actual error and subprocess load-path in the error message
- Add ekg-agent-bench-diagnose command for interactive troubleshooting
- Add vui and llm-prompt to required libraries list
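
The first bullet's pattern, catching an init error and leaving it in a file for the parent process to read, might look like the sketch below. The function name and file prefix are hypothetical; the real harness builds its init forms differently.

```elisp
(defun my-run-init-recording-errors (init-fn error-file)
  "Call INIT-FN; if it signals, write the error to ERROR-FILE.
Return t on success and nil on failure, so startup can continue
and a parent process can inspect ERROR-FILE afterwards."
  (condition-case err
      (progn (funcall init-fn) t)
    (error
     ;; A string as the first argument makes `write-region' write it.
     (write-region (format "%S\n" err) nil error-file nil 'silent)
     nil)))

(let ((file (make-temp-file "init-error-")))
  (my-run-init-recording-errors (lambda () (error "Boom")) file)
  (with-temp-buffer
    (insert-file-contents file)
    (buffer-string)))
;; => "(error \"Boom\")\n"
```
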
The host Emacs (e.g. /Applications/Emacs.app) and the subprocess
daemon (e.g. homebrew emacs-mac 29.1) may be different versions.
Built-in .elc files from one can't be loaded by the other, causing
'Invalid (or missing) doc string' errors.

The subprocess already has its own built-in lisp on load-path, so
we only need to pass user-installed package directories.

The subprocess must run the same Emacs version as the host to avoid
.elc version mismatches and missing built-in packages. Use
`invocation-name' and `invocation-directory' to resolve the exact
binary path instead of relying on llm-test-emacs-executable (which
defaults to 'emacs' and may resolve to a different installation).
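
Resolving the host binary from the two built-in variables named above takes one call. The wrapper function name is hypothetical.

```elisp
(defun my-host-emacs-executable ()
  "Return the absolute path of the Emacs binary running this code."
  ;; `invocation-name' is the program name and `invocation-directory'
  ;; the directory it was started from; both are built-in variables.
  (expand-file-name invocation-name invocation-directory))
```

Passing this path to the subprocess launcher guarantees that host and daemon share the same build, whatever `emacs' resolves to on PATH.
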
Include the full agent log in each result so we can see what the
agent actually did. Display it in the results buffer for debugging.

- eval-verify now returns 'skip for nil expressions (not applicable),
  t for pass, nil for fail
- Display and counting correctly distinguish FAIL from n/a
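
The three-valued result can be sketched as follows. These are hypothetical stand-ins written for illustration, not the functions in ekg-agent-bench.el.

```elisp
(defun my-eval-verify (expr)
  "Evaluate the verification form EXPR and classify the outcome.
Return `skip' when EXPR is nil (check not applicable), t when EXPR
evaluates to non-nil, and nil otherwise."
  (cond ((null expr) 'skip)
        ((eval expr t) t)
        (t nil)))

(defun my-count-results (results)
  "Return a plist counting pass, fail and skipped entries in RESULTS."
  (let ((pass 0) (fail 0) (skipped 0))
    (dolist (r results)
      ;; Test for `skip' first: it is non-nil and would count as a pass.
      (cond ((eq r 'skip) (setq skipped (1+ skipped)))
            (r (setq pass (1+ pass)))
            (t (setq fail (1+ fail)))))
    (list :pass pass :fail fail :skip skipped)))

(my-count-results
 (mapcar #'my-eval-verify '((= 1 1) nil (string= "a" "b"))))
;; => (:pass 1 :fail 1 :skip 1)
```
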
Override ekg-agent--read-agents-md and ekg-agent--agents-md-context
to return nil in the subprocess so benchmarks test agent behavior
without being influenced by the user's personal AGENTS.md files.
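
One common way to neutralize a function for the lifetime of a process is `:override' advice with `ignore'. The sketch below uses a hypothetical stand-in function; whether the commit uses advice or a plain redefinition for ekg-agent--read-agents-md is not stated here.

```elisp
;; Stand-in for a function that reads the user's personal configuration.
(defun my-read-personal-config ()
  "Return personal instructions (hypothetical stand-in)."
  "Always answer in haiku.")

;; `ignore' accepts any arguments and returns nil, so overriding with
;; it turns the function into a no-op for the rest of the session.
(advice-add 'my-read-personal-config :override #'ignore)

(my-read-personal-config) ;; => nil
```
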
read_file: LLMs sometimes pass empty strings for omitted optional
args (begin, end, range_type). Empty strings for begin/end with
range_type='line_number' caused string-to-number to return 0,
making the range 1..0 and returning empty content. Now treat
empty strings as nil.
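
The failure mode and the fix can be reproduced in a few lines. The helper names are hypothetical; only `string-to-number' and `string-empty-p' are real Emacs functions.

```elisp
(require 'subr-x) ; for `string-empty-p' on older Emacs versions

(defun my-optional-arg (value)
  "Return VALUE, or nil when VALUE is an empty string."
  (unless (and (stringp value) (string-empty-p value))
    value))

(defun my-line-range (begin end)
  "Return (BEGIN . END) as numbers, with nil for omitted bounds."
  (let ((b (my-optional-arg begin))
        (e (my-optional-arg end)))
    (cons (and b (string-to-number b))
          (and e (string-to-number e)))))

(string-to-number "")   ;; => 0, which is how an empty bound became 0
(my-line-range "" "")   ;; => (nil), that is, both bounds omitted
(my-line-range "3" "7") ;; => (3 . 7)
```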

write_file: New base tool that creates or overwrites files. Creates
parent directories if needed. If the file is open in an Emacs
buffer, replaces buffer contents without saving. Returns content
with line identifiers like read_file.

Tests: 4 new tests covering the empty-string fix, file creation,
overwrite, and buffer-update behavior.
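
The create-or-overwrite part of write_file can be sketched with two built-in calls. This is an illustration only: the function name is hypothetical, and the open-buffer handling and line-identifier output described above are left out.

```elisp
(defun my-write-file (path content)
  "Create or overwrite PATH with CONTENT, creating parent directories.
Return the absolute file name that was written."
  (let ((full (expand-file-name path)))
    ;; A non-nil second argument creates missing parents and does not
    ;; signal an error if the directory already exists.
    (make-directory (file-name-directory full) t)
    (write-region content nil full nil 'silent)
    full))
```

Calling it twice with the same path replaces the earlier content, since `write-region' overwrites unless its APPEND argument is non-nil.
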
New benchmark files testing four key agent behaviors:

1. task-progress-notes.yaml: Agent should create agent-task notes
   and append progress entries during multi-step work. Includes a
   restart scenario where a prior task note exists.

2. org-task-management.yaml: Agent should create ekg-org tasks with
   subtasks and mark them done. Tests both creating from scratch
   and continuing an existing org task. Sets org tools as extra tools.

3. auto-create-skills.yaml: Agent should store learnings as prompt-
   tagged skill notes. Tests discovery of performance issues (nth
   in loop) and project conventions (error-first returns).

4. use-existing-skills.yaml: Agent should find and follow error
   handling conventions from a skill note when writing new code.

Total benchmarks: 14 tasks across 11 groups.

The 60-second wait for ekg-agent-org-plan-task to produce subtasks
was too tight for CI, where OpenRouter latency and runner CPU are
slower than on local hardware.  The step completes in ~22s locally
but exceeded 60s in GitHub Actions, causing the test agent to fail
cleanly on a timing issue rather than a behavior issue.  Bump the
window to 120s, still well under the job's 10-minute cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The CI llm-test job surfaces "LLM provider was nil" errors from
ekg-agent-note-response, suggesting the init-forms that set
ekg-llm-provider aren't taking effect in the daemon — but the
same test passes locally.  Add a diagnostic file at
/tmp/ekg-llm-test-init-diag.log written from the daemon init
chain that records (a) whether the env var is visible, (b)
whether the provider setq succeeded, and (c) any eval error from
the provider expression.  Upload the file as part of the CI
debug-logs artifact on failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The diagnostic block added earlier to debug the daemon provider
setup is only useful when something is failing.  Gate the whole
block (including the side-effect force-require of llm-openai)
behind EKG_LLM_TEST_DIAG so normal runs take the lean path.
This also lets us test whether the force-require was inadvertently
what fixed CI, or whether the test was genuinely green on its own.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Errors in the test daemon's run-at-time timer callbacks are
silent — if ekg-agent-org-plan-task signals before creating
its log buffer, no trace is left.  Add advice that wraps
ekg-agent--iterate and ekg-agent--prompt-id, writing any error
to /tmp/ekg-llm-test-init-diag.log before re-signaling.  Also
re-enable the force-require of llm-openai at daemon init (which
previously resolved a "void-function make-llm-openrouter" error
in CI) and remove the EKG_LLM_TEST_DIAG gate so the diagnostics
always run in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
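
Advice that logs an error and then re-signals it, as the commit message above describes, might be written like this. It is a sketch with hypothetical names, not the advice from the commit; the log file lives in a global `defvar', so the advice does not depend on any variable bound only while the init forms run.

```elisp
(defvar my-error-log-file
  (expand-file-name "my-error.log" temporary-file-directory)
  "File that receives errors caught by `my-log-errors-around'.")

(defun my-log-errors-around (orig-fn &rest args)
  "Call ORIG-FN with ARGS, logging any error before re-signaling it."
  (condition-case err
      (apply orig-fn args)
    (error
     ;; The fourth argument t appends instead of overwriting.
     (write-region (format "%S\n" err) nil my-error-log-file t 'silent)
     (signal (car err) (cdr err)))))

;; Hypothetical function standing in for code run from a timer.
(defun my-flaky-step ()
  "Always fail, to exercise the advice."
  (error "Step failed"))

(advice-add 'my-flaky-step :around #'my-log-errors-around)
```

Re-signaling with `signal' keeps the original error symbol and data, so callers see the same failure they would have seen without the advice.
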
The advice we added around ekg-agent--iterate, --prompt-id, and
friends referred to 'diag-file' and 'fn-name', which were let-bound
in the init-forms block.  The init-file that llm-test writes for
the daemon is loaded without lexical-binding, so those bindings
were dynamic and the advice functions did not capture them: the
names were in scope while the init-forms progn ran, but void by
the time a timer callback fired the advice.

The net effect was that EVERY invocation of ekg-agent--prompt-id
(called from ekg-agent--iterate iteration 0) signaled
  (void-variable diag-file)
inside the run-command timer.  The error was silently swallowed
by the timer's default handler and no agent log buffer was ever
created, so it looked like the plan-task command never ran.

Remove the advice.  The provider setup diagnostics for the init
phase itself stay, because they run inside the dynamic scope where
diag-file is still bound.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
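
The scoping problem described in the commit message above can be reproduced without a daemon by evaluating the same form under both binding modes. This is a sketch; the variable and function names are hypothetical.

```elisp
(defun my-make-reporter (lexical)
  "Return a function that reads a let-bound variable after the let exits.
LEXICAL is passed to `eval': t selects lexical binding, nil dynamic."
  (eval '(let ((my-demo-diag-file "/tmp/diag.log"))
           (lambda () my-demo-diag-file))
        lexical))

(defun my-call-reporter (fn)
  "Call FN, returning its value or (void SYMBOL) if a variable is unbound."
  (condition-case err
      (funcall fn)
    (void-variable (list 'void (cadr err)))))

(my-call-reporter (my-make-reporter t))   ;; => "/tmp/diag.log"
(my-call-reporter (my-make-reporter nil)) ;; => (void my-demo-diag-file)
```

Under lexical binding the lambda is a closure that carries the value with it; under dynamic binding the binding disappears when the `let' exits, which matches the void-variable error seen in the timer callback.
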
In daemon mode without a visible frame, (current-message) returns
nil once the command that produced the message finishes — so the
llm-test frame-state capture's `message` field was always null,
and the test agent never saw mode hints like
  "Insert mode: n/p move, R/L demote/promote, RET confirm, C-g cancel"
that ekg-org-view-create displays after `c'.

Intercept `message' in the llm-test init forms to save the most
recent non-empty message in a global, and patch `current-message'
to fall back to the saved value if it's fresh (within 30 seconds).
This lets the frame-state capture surface transient messages
across the emacsclient boundary.
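
The interception described above can be approximated with two pieces of advice. This is a sketch with hypothetical names, not the init forms from the commit; it records the formatted text on every `message' call and lets `current-message' fall back to it for 30 seconds.

```elisp
(require 'subr-x) ; for `string-empty-p' on older Emacs versions

(defvar my-last-message nil
  "Most recent non-empty text passed to `message'.")
(defvar my-last-message-time 0
  "Value of `float-time' when `my-last-message' was recorded.")

(defun my-record-message (format-string &rest args)
  "Remember the text that `message' is about to display."
  (when format-string ; (message nil) only clears the echo area
    (let ((text (apply #'format-message format-string args)))
      (unless (string-empty-p text)
        (setq my-last-message text
              my-last-message-time (float-time))))))

(defun my-current-message-fallback (orig-fn)
  "Return the live echo-area message, or a saved one under 30 seconds old."
  (or (funcall orig-fn)
      (and my-last-message
           (< (- (float-time) my-last-message-time) 30)
           my-last-message)))

(advice-add 'message :before #'my-record-message)
(advice-add 'current-message :around #'my-current-message-fallback)
```

The freshness check keeps a long-stale message from being reported as current.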

Also expand the ekg-org-view test description to document both
branches of `ekg-org-view-create' (empty vs non-empty view) and
spell out the insert-mode workflow so gpt-5.4-mini stops typing
into a read-only placeholder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a data-driven benchmark suite and LLM-driven integration tests for ekg-agent, while also improving agent robustness (status update discipline + transient LLM retry) and fixing a DB timestamp edge case.

Changes:

  • Add an ekg-agent-bench runner with .eld benchmark specs plus a batch entrypoint (run-bench.el).
  • Add llm-test-based integration test harness and YAML specs; wire into CI as a dedicated job.
  • Improve ekg-agent reliability via status-update reminders and transient LLM error retries; fix ekg note creation-time invariant.
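
As a rough illustration of the retry idea in the last bullet, a synchronous wrapper might look like the sketch below. It is not the wrapper added to ekg-agent.el: the name is hypothetical, it retries on any error rather than only on transient LLM errors, and it does not wait between attempts.

```elisp
(defun my-call-with-retries (fn max-attempts)
  "Call FN, retrying on error up to MAX-ATTEMPTS total attempts.
Return FN's value, or re-signal the last error once attempts run out."
  (let ((attempt 1) result done)
    (while (not done)
      (condition-case err
          (setq result (funcall fn)
                done t)
        (error
         (if (>= attempt max-attempts)
             (signal (car err) (cdr err))
           (setq attempt (1+ attempt))))))
    result))

;; Fails twice, then succeeds on the third call.
(let ((calls 0))
  (my-call-with-retries
   (lambda ()
     (setq calls (1+ calls))
     (if (< calls 3) (error "Transient failure") calls))
   5))
;; => 3
```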

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| run-bench.el | Batch script to run benchmarks locally / in CI-like environments. |
| ekg-agent-bench.el | New benchmark harness: subprocess management, metric extraction, ERT registration. |
| benchmarks/use-existing-skills.eld | Benchmark: agent applies an existing “skill” note to code output. |
| benchmarks/task-progress-notes.eld | Benchmarks: journaling progress notes; continuity across restart. |
| benchmarks/subagent-memory-handoff.eld | Benchmark: subagent delegation + context handoff via notes. |
| benchmarks/store-findings.eld | Benchmark: investigate a file and store findings as notes. |
| benchmarks/status-update-gap.eld | Benchmark: enforce explicit summarize_state cadence (gap metric). |
| benchmarks/periodic-state-checkpoint.eld | Benchmark: periodic progress notes across multi-file work. |
| benchmarks/org-task-management.eld | Benchmarks: create/continue org tasks and mark DONE as work completes. |
| benchmarks/note-response.eld | Benchmarks: note-response agent appends/replaces content in edit buffer. |
| benchmarks/memory-retrieval.eld | Benchmark: retrieve prior decisions/conventions and apply them. |
| benchmarks/memory-contradiction.eld | Benchmark: detect contradiction with stored convention; ask user. |
| benchmarks/memory-continuity.eld | Benchmark: continue work without redoing already-completed files. |
| benchmarks/knowledge-synthesis.eld | Benchmark: synthesize answer from multiple notes. |
| benchmarks/follow-complex-skill.eld | Benchmark: follow a multi-step “checklist” skill note. |
| benchmarks/file-resource-notes.eld | Benchmark: follow file-specific resource-note conventions. |
| benchmarks/elisp-debugging.eld | Benchmark: debug/fix elisp bug and verify via tool execution. |
| benchmarks/coding-with-skills.eld | Benchmark: generate code following basic “skill” conventions. |
| benchmarks/buffer-transform.eld | Benchmark: transform CSV into an org report file. |
| benchmarks/auto-create-skills.eld | Benchmarks: detect pattern and store reusable prompt-tagged skills. |
| llm-tests/ekg-org-view.yaml | LLM-driven UI workflow test for ekg-org-view task creation/planning. |
| llm-tests/ekg-basics.yaml | LLM-driven smoke tests for note capture/list/search/drafts flows. |
| llm-tests/agent-response.yaml | LLM-driven test for agent response insertion behavior. |
| ekg-agent-llm-test.el | Registers YAML specs as ERT tests; sets up subprocess init + diagnostics. |
| ekg-agent.el | Adds status reminders + status timestamp tracking; adds transient LLM retry wrapper; improves error logging/termination behavior. |
| ekg-agent-test.el | Adds ERT coverage for status reminders and file tool edge cases + note append behavior. |
| ekg.el | Ensures creation-time is populated when missing before persisting time-tracked. |
| doc/ekg.org | Documents new status reminder behavior/customization. |
| agent_tools/skill.md | Markdown table formatting fix for org task tag documentation. |
| Eldev | Adds llm-test and adjusts vui dependency declaration. |
| .github/workflows/ci.yaml | Adds llm-test job to CI to run LLM-driven integration tests. |
| .gitignore | Ignores generated *.log artifacts (e.g., llm-test debug logs). |
| AGENTS.md | Adds packaging/distribution constraint note (no dependency on specific Emacs setup). |

Comment thread ekg-agent-bench.el
Comment thread ekg-agent-llm-test.el
Comment thread ekg-agent-llm-test.el Outdated
Comment thread run-bench.el Outdated

- ekg-agent-bench.el: explicit `(require 'seq)' since the file
  uses `seq-filter' inside elisp strings evaluated in a subprocess.
  `seq' is preloaded so this never failed in practice, but the
  explicit require silences the lint and documents the dependency.
- ekg-agent-llm-test.el: explicit `(require 'subr-x)' for
  `string-empty-p', and `(require 'cl-lib)' as the first form in
  the daemon init-forms so `cl-flet' macro-expansion no longer
  relies on llm-test's helper having required cl-lib first.
- run-bench.el: drop the stale `LLM_TEST_PATH' line from the
  header — the script never read it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The test agent kept calling `run-command "switch-to-buffer"' to
get back to *ekg-org-tasks* after the planning step.  That command
is interactive and opens a minibuffer prompt for the buffer name;
the agent never typed the name, just kept re-invoking the command
and getting more stuck.

Tell the agent to use `run-command "ekg-org-view"' instead (we
documented earlier that ekg-org-view switches to the tasks
buffer), explicitly forbid switch-to-buffer, and spell out the
exact insert-mode RET-then-type sequence so it doesn't try to
type into the read-only placeholder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ahyatt merged commit ac409f0 into develop Apr 14, 2026
3 checks passed
ahyatt deleted the agent-bench branch April 14, 2026 04:12

2 participants