Add agent benchmarks, fix various issues to improve benchmarks by ahyatt · Pull Request #271 · ahyatt/ekg

ahyatt · 2026-03-14T05:19:27Z

No description provided.

- Indent body text to align past heading stars for clearer hierarchy
- Stable sort order using note ID as tiebreaker, preventing reordering on state changes when creation times are tied
- Fix ekg-org-view-up-heading: bind current-id (was a free variable)
- Remove unused win-start variable in ekg-org-view--refresh

Use shadow-inheriting ekg-org-view-body face for task body text, making it visually distinct from heading text.

ekg-agent-org--tool-add-item now assigns an explicit :org/sort-order when creating tasks, placing them at the end of existing siblings. This ensures consistent ordering regardless of ID generation or creation-time ties, and prevents conflicts when mixing agent-created and UI-created tasks under the same parent.
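As a rough illustration of "placing them at the end of existing siblings", the sketch below computes an order value one past the largest existing sibling order. The function name `my/next-sort-order` and the choice of integer orders starting at 0 are assumptions made for this example; the PR does not show how `:org/sort-order` values are actually computed.

```elisp
(require 'seq)

;; Hypothetical helper, not part of ekg.
(defun my/next-sort-order (sibling-orders)
  "Return a sort order that sorts after every entry in SIBLING-ORDERS.
SIBLING-ORDERS is a list whose elements are numbers, or nil for
siblings that have no explicit order yet."
  (let ((known (seq-filter #'numberp sibling-orders)))
    (if known
        (1+ (apply #'max known))
      0)))

(my/next-sort-order '(0 1 nil 5)) ; => 6
(my/next-sort-order '())          ; => 0
```

Because the value depends only on the siblings' explicit orders, it does not matter whether an earlier sibling was created by the agent or through the UI.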

ekg-org-view-delete and ekg-org-view-archive now operate on the task and all its descendants, preventing orphaned children.
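The idea of operating on a task together with all of its descendants can be sketched as a depth-first walk. `my/collect-subtree` and its CHILDREN-FN argument are hypothetical names for this example only; how ekg actually looks up a task's children is not shown in the PR.

```elisp
;; Hypothetical helper, not part of ekg.
(defun my/collect-subtree (id children-fn)
  "Return ID followed by all of its descendants, depth first.
CHILDREN-FN is called with an id and returns a list of child ids."
  (cons id
        (mapcan (lambda (child)
                  (my/collect-subtree child children-fn))
                (funcall children-fn id))))

;; Toy tree stored as an alist of (ID . CHILDREN).
(let ((tree '((1 2 3) (2 4) (3) (4))))
  (my/collect-subtree 1 (lambda (id) (cdr (assq id tree)))))
;; => (1 2 4 3)
```

Deleting or archiving every id in the returned list leaves no child pointing at a parent that no longer exists.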

When pressing 'c' on a heading, the insert placeholder now defaults to the first-child slot of that heading rather than the nearest slot by buffer position. Previously, pressing 'c' on 'Project Beta' would show the placeholder as a child of 'Project Alpha' (the nearest position above), which was confusing and led to tasks being created under the wrong parent. Add ekg-org-test-view-insert-initial-position to verify the fix, and ekg-org-test-view-top-level-spacing to confirm blank lines separate top-level groups after task creation.

ekg-org-view--refresh was calling ekg-org-view--mount which ends with switch-to-buffer, stealing the selected window. When saving a note triggered the ekg-note-save-hook refresh, quit-window in ekg-edit-finalize would then operate on the org-view window instead of the edit buffer window, so the edit buffer stayed open. Fix by using vui-rerender on the existing instance instead of remounting. This re-renders in place without touching windows.

Introduces a benchmark framework that runs predefined tasks in a fresh Emacs subprocess (via llm-test infrastructure) and measures the ekg-agent's effectiveness along three axes:

- Task completion: did the agent accomplish the goal?
- Skill adherence: did the agent follow instructions stored in ekg?
- Memory discipline: did the agent store/retrieve knowledge in ekg?

Metrics collected per task: pass/fail for each axis, iteration count, tools used, and wall-clock time.

Includes 7 benchmark tasks:

- coding-with-skills: write elisp following skill note conventions
- memory-retrieval: use previously stored decisions for a follow-up task
- elisp-debugging: find and fix a bug in an elisp file (emacs-y)
- buffer-transform: read CSV data and produce org-mode report (emacs-y)
- knowledge-synthesis: combine info from multiple notes to answer a question
- store-findings: investigate code and document findings as ekg notes
- follow-complex-skill: follow a multi-step checklist from a skill note

Entry points:

- ekg-agent-bench-run: run all benchmarks
- ekg-agent-bench-run-one: run a single task by name
- ekg-agent-bench-register-ert-tests: register as ERT tests
- run-bench.el: batch script for CI

- Write init errors to a temp file via condition-case, then check it after the daemon starts (the daemon starts even when init errors because server vars are set first)
- Show the actual error and subprocess load-path in the error message
- Add ekg-agent-bench-diagnose command for interactive troubleshooting
- Add vui and llm-prompt to required libraries list
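A minimal sketch of the "write init errors to a temp file via condition-case" pattern follows. The function name `my/load-recording-errors` and the file format are assumptions for illustration; the harness's real init forms are not shown in the PR.

```elisp
;; Hypothetical helper, not part of ekg-agent-bench.el.
(defun my/load-recording-errors (libraries error-file)
  "Require each symbol in LIBRARIES, writing any error to ERROR-FILE.
Return t if everything loaded, nil otherwise."
  (condition-case err
      (progn
        (dolist (lib libraries)
          (require lib))
        t)
    (error
     ;; Record the error and the load-path so the parent process can
     ;; report both after the daemon has started.
     (with-temp-file error-file
       (insert (format "init error: %S\nload-path: %S\n" err load-path)))
     nil)))
```

The parent process can then check whether ERROR-FILE is non-empty once the daemon is reachable, instead of assuming that a running daemon means a successful init.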

The host Emacs (e.g. /Applications/Emacs.app) and the subprocess daemon (e.g. homebrew emacs-mac 29.1) may be different versions. Built-in .elc files from one can't be loaded by the other, causing 'Invalid (or missing) doc string' errors. The subprocess already has its own built-in lisp on load-path, so we only need to pass user-installed package directories.

The subprocess must run the same Emacs version as the host to avoid .elc version mismatches and missing built-in packages. Use `invocation-name' and `invocation-directory' to resolve the exact binary path instead of relying on llm-test-emacs-executable (which defaults to 'emacs' and may resolve to a different installation).
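For reference, resolving the running binary from these two variables can look like the sketch below. The wrapper name `my/host-emacs-binary` is hypothetical. Note that `invocation-directory` is documented to be nil when the directory is unknown, in which case `expand-file-name` falls back to `default-directory`.

```elisp
;; Hypothetical helper, not part of ekg-agent-bench.el.
(defun my/host-emacs-binary ()
  "Return the absolute path of the Emacs binary running this code."
  (expand-file-name invocation-name invocation-directory))
```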

Include the full agent log in each result so we can see what the agent actually did. Display it in the results buffer for debugging.

- eval-verify now returns 'skip for nil expressions (not applicable), t for pass, nil for fail
- Display and counting correctly distinguish FAIL from n/a
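The three-way result can be sketched as follows. These are hypothetical stand-ins (`my/eval-verify`, `my/count-results`), not the harness's actual functions. The detail worth noticing is that `skip` is itself non-nil, so counting code has to test for it before testing for truthiness.

```elisp
;; Hypothetical stand-ins for illustration only.
(defun my/eval-verify (expr)
  "Evaluate verification EXPR and return `skip', t, or nil.
A nil EXPR means the check does not apply to this task."
  (cond ((null expr) 'skip)
        ((eval expr t) t)
        (t nil)))

(defun my/count-results (results)
  "Return a list (PASSED FAILED SKIPPED) for RESULTS."
  (let ((pass 0) (fail 0) (skip 0))
    (dolist (r results)
      (cond ((eq r 'skip) (setq skip (1+ skip))) ; check before truthiness
            (r (setq pass (1+ pass)))
            (t (setq fail (1+ fail)))))
    (list pass fail skip)))

(my/count-results (mapcar #'my/eval-verify '((= 1 1) (= 1 2) nil)))
;; => (1 1 1)
```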

Override ekg-agent--read-agents-md and ekg-agent--agents-md-context to return nil in the subprocess so benchmarks test agent behavior without being influenced by the user's personal AGENTS.md files.
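One way to express such an override is `:override` advice with the built-in `ignore`, which accepts any arguments and returns nil. This is a sketch under that assumption: the PR does not say which mechanism the harness uses, and the wrapper name `my/isolate-from-agents-md` is hypothetical.

```elisp
;; Hypothetical helper; call it after ekg-agent has been loaded.
(defun my/isolate-from-agents-md ()
  "Make the AGENTS.md readers return nil regardless of their arguments."
  (dolist (fn '(ekg-agent--read-agents-md ekg-agent--agents-md-context))
    (advice-add fn :override #'ignore)))
```

Using advice rather than redefining the functions keeps the change reversible with `advice-remove`.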

read_file: LLMs sometimes pass empty strings for omitted optional args (begin, end, range_type). Empty strings for begin/end with range_type='line_number' caused string-to-number to return 0, making the range 1..0 and returning empty content. Now treat empty strings as nil.

write_file: New base tool that creates or overwrites files. Creates parent directories if needed. If the file is open in an Emacs buffer, replaces buffer contents without saving. Returns content with line identifiers like read_file.

Tests: 4 new tests covering the empty-string fix, file creation, overwrite, and buffer-update behavior.
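The empty-string problem is easy to reproduce: `(string-to-number "")` returns 0 rather than signaling an error. A minimal sketch of the normalization, using hypothetical helper names (`my/blank-to-nil`, `my/parse-line-bound`) that are not taken from the PR:

```elisp
;; Hypothetical helpers for illustration only.
(defun my/blank-to-nil (value)
  "Return nil if VALUE is an empty string, otherwise VALUE unchanged."
  (if (equal value "") nil value))

(defun my/parse-line-bound (value)
  "Convert optional line-number argument VALUE to a number or nil.
VALUE may be nil, a number, or a string as sent by the LLM."
  (let ((v (my/blank-to-nil value)))
    (cond ((null v) nil)
          ((stringp v) (string-to-number v))
          (t v))))

(string-to-number "")      ; => 0, the source of the bogus 1..0 range
(my/parse-line-bound "")   ; => nil, treated as "argument omitted"
(my/parse-line-bound "12") ; => 12
```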

New benchmark files testing four key agent behaviors:

1. task-progress-notes.yaml: Agent should create agent-task notes and append progress entries during multi-step work. Includes a restart scenario where a prior task note exists.
2. org-task-management.yaml: Agent should create ekg-org tasks with subtasks and mark them done. Tests both creating from scratch and continuing an existing org task. Sets org tools as extra tools.
3. auto-create-skills.yaml: Agent should store learnings as prompt-tagged skill notes. Tests discovery of performance issues (nth in loop) and project conventions (error-first returns).
4. use-existing-skills.yaml: Agent should find and follow error handling conventions from a skill note when writing new code.

Total benchmarks: 14 tasks across 11 groups.

The 60-second wait for ekg-agent-org-plan-task to produce subtasks was too tight for CI, where OpenRouter latency is higher and runner CPUs are slower than on local hardware. The step completes in ~22s locally but exceeded 60s in GitHub Actions, causing the test agent to fail cleanly on a timing issue rather than a behavior issue. Bump the window to 120s, still well under the job's 10-minute cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The CI llm-test job surfaces "LLM provider was nil" errors from ekg-agent-note-response, suggesting the init-forms that set ekg-llm-provider aren't taking effect in the daemon, but the same test passes locally. Add a diagnostic file at /tmp/ekg-llm-test-init-diag.log written from the daemon init chain that records (a) whether the env var is visible, (b) whether the provider setq succeeded, and (c) any eval error from the provider expression. Upload the file as part of the CI debug-logs artifact on failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The diagnostic block added earlier to debug the daemon provider setup is only useful when something is failing. Gate the whole block (including the side-effect force-require of llm-openai) behind EKG_LLM_TEST_DIAG so normal runs take the lean path. This also lets us test whether the force-require was inadvertently what fixed CI, or whether the test was genuinely green on its own.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Errors in the test daemon's run-at-time timer callbacks are silent: if ekg-agent-org-plan-task signals before creating its log buffer, no trace is left. Add advice that wraps ekg-agent--iterate and ekg-agent--prompt-id, writing any error to /tmp/ekg-llm-test-init-diag.log before re-signaling. Also re-enable the force-require of llm-openai at daemon init (which previously resolved a "void-function make-llm-openrouter" error in CI) and remove the EKG_LLM_TEST_DIAG gate so the diagnostics always run in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The advice we added around ekg-agent--iterate, --prompt-id, and friends referred to 'diag-file' and 'fn-name', variables bound in the init-forms' let block. The init file that llm-test writes for the daemon is loaded without lexical-binding, so those names were dynamically scoped: they were bound while the init-forms progn ran, but void by the time a timer callback fired the advice. The net effect was that EVERY invocation of ekg-agent--prompt-id (called from ekg-agent--iterate iteration 0) signaled (void-variable diag-file) inside the run-command timer. The error was silently swallowed by the timer's default handler and no agent log buffer was ever created, so it looked like the plan-task command never ran. Remove the advice. Provider setup diagnostics for the init phase itself stay because they run inside the dynamic scope where diag-file is still bound.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
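The scoping failure can be reproduced in isolation with `eval`, whose second argument selects dynamic (nil) or lexical (t) binding. The names `my/capture-form` and `my/call-later` are hypothetical and the form is a simplified stand-in for the real init forms, which are not shown in the PR.

```elisp
;; Simplified stand-in: a function that refers to a let-bound variable
;; and is only called after the `let' has exited.
(defconst my/capture-form
  '(let ((diag-file "/tmp/diag.log"))
     (lambda () diag-file))
  "A form returning a function that refers to a let-bound variable.")

(defun my/call-later (lexical)
  "Evaluate `my/capture-form' with binding mode LEXICAL, then call the result.
Return the function's value, or a list describing a void-variable error."
  (let ((fn (eval my/capture-form lexical)))
    (condition-case err
        (funcall fn)
      (void-variable (list 'void-variable (cadr err))))))

(my/call-later nil) ; dynamic binding => (void-variable diag-file)
(my/call-later t)   ; lexical binding => "/tmp/diag.log"
```

Under dynamic binding the `let' binding is undone as soon as the form finishes, so a later call finds the variable void. Under lexical binding the lambda becomes a closure that carries the value with it.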

In daemon mode without a visible frame, (current-message) returns nil once the command that produced the message finishes, so the llm-test frame-state capture's `message` field was always null, and the test agent never saw mode hints like "Insert mode: n/p move, R/L demote/promote, RET confirm, C-g cancel" that ekg-org-view-create displays after `c'. Intercept `message' in the llm-test init forms to save the most recent non-empty message in a global, and patch `current-message' to fall back to the saved value if it's fresh (within 30 seconds). This lets the frame-state capture surface transient messages across the emacsclient boundary.

Also expand the ekg-org-view test description to document both branches of `ekg-org-view-create' (empty vs non-empty view) and spell out the insert-mode workflow so gpt-5.4-mini stops typing into a read-only placeholder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
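The save-and-fall-back approach can be sketched with two pieces of advice. All `my/` names are hypothetical, and the actual init forms may be structured differently. One known limitation of this sketch: advice on `message' only sees calls made from Lisp, not messages emitted directly from C code.

```elisp
(defvar my/last-message nil
  "Most recent non-empty string returned by `message'.")
(defvar my/last-message-time 0
  "Value of `float-time' when `my/last-message' was recorded.")
(defvar my/message-max-age 30
  "Seconds for which a saved message is considered fresh.")

(defun my/record-message (result)
  "Remember RESULT if it is a non-empty string, then return it unchanged."
  (when (and (stringp result) (not (string= result "")))
    (setq my/last-message result
          my/last-message-time (float-time)))
  result)

(defun my/current-message-with-fallback (orig)
  "Call ORIG; if it returns nil, fall back to a fresh saved message."
  (or (funcall orig)
      (and my/last-message
           (< (- (float-time) my/last-message-time) my/message-max-age)
           my/last-message)))

(advice-add 'message :filter-return #'my/record-message)
(advice-add 'current-message :around #'my/current-message-with-fallback)
```

The freshness window keeps a stale hint from an earlier command from being reported as if it were still on screen.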

Copilot

Pull request overview

This PR introduces a data-driven benchmark suite and LLM-driven integration tests for ekg-agent, while also improving agent robustness (status update discipline + transient LLM retry) and fixing a DB timestamp edge case.

Changes:

- Add an ekg-agent-bench runner with .eld benchmark specs plus a batch entrypoint (run-bench.el).
- Add llm-test-based integration test harness and YAML specs; wire into CI as a dedicated job.
- Improve ekg-agent reliability via status-update reminders and transient LLM error retries; fix ekg note creation-time invariant.

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| run-bench.el | Batch script to run benchmarks locally / in CI-like environments. |
| ekg-agent-bench.el | New benchmark harness: subprocess management, metric extraction, ERT registration. |
| benchmarks/use-existing-skills.eld | Benchmark: agent applies an existing “skill” note to code output. |
| benchmarks/task-progress-notes.eld | Benchmarks: journaling progress notes; continuity across restart. |
| benchmarks/subagent-memory-handoff.eld | Benchmark: subagent delegation + context handoff via notes. |
| benchmarks/store-findings.eld | Benchmark: investigate a file and store findings as notes. |
| benchmarks/status-update-gap.eld | Benchmark: enforce explicit `summarize_state` cadence (gap metric). |
| benchmarks/periodic-state-checkpoint.eld | Benchmark: periodic progress notes across multi-file work. |
| benchmarks/org-task-management.eld | Benchmarks: create/continue org tasks and mark DONE as work completes. |
| benchmarks/note-response.eld | Benchmarks: note-response agent appends/replaces content in edit buffer. |
| benchmarks/memory-retrieval.eld | Benchmark: retrieve prior decisions/conventions and apply them. |
| benchmarks/memory-contradiction.eld | Benchmark: detect contradiction with stored convention; ask user. |
| benchmarks/memory-continuity.eld | Benchmark: continue work without redoing already-completed files. |
| benchmarks/knowledge-synthesis.eld | Benchmark: synthesize answer from multiple notes. |
| benchmarks/follow-complex-skill.eld | Benchmark: follow a multi-step “checklist” skill note. |
| benchmarks/file-resource-notes.eld | Benchmark: follow file-specific resource-note conventions. |
| benchmarks/elisp-debugging.eld | Benchmark: debug/fix elisp bug and verify via tool execution. |
| benchmarks/coding-with-skills.eld | Benchmark: generate code following basic “skill” conventions. |
| benchmarks/buffer-transform.eld | Benchmark: transform CSV into an org report file. |
| benchmarks/auto-create-skills.eld | Benchmarks: detect pattern and store reusable prompt-tagged skills. |
| llm-tests/ekg-org-view.yaml | LLM-driven UI workflow test for `ekg-org-view` task creation/planning. |
| llm-tests/ekg-basics.yaml | LLM-driven smoke tests for note capture/list/search/drafts flows. |
| llm-tests/agent-response.yaml | LLM-driven test for agent response insertion behavior. |
| ekg-agent-llm-test.el | Registers YAML specs as ERT tests; sets up subprocess init + diagnostics. |
| ekg-agent.el | Adds status reminders + status timestamp tracking; adds transient LLM retry wrapper; improves error logging/termination behavior. |
| ekg-agent-test.el | Adds ERT coverage for status reminders and file tool edge cases + note append behavior. |
| ekg.el | Ensures `creation-time` is populated when missing before persisting `time-tracked`. |
| doc/ekg.org | Documents new status reminder behavior/customization. |
| agent_tools/skill.md | Markdown table formatting fix for org task tag documentation. |
| Eldev | Adds `llm-test` and adjusts `vui` dependency declaration. |
| .github/workflows/ci.yaml | Adds `llm-test` job to CI to run LLM-driven integration tests. |
| .gitignore | Ignores generated `*.log` artifacts (e.g., llm-test debug logs). |
| AGENTS.md | Adds packaging/distribution constraint note (no dependency on specific Emacs setup). |

- ekg-agent-bench.el: explicit `(require 'seq)` since the file uses `seq-filter' inside elisp strings evaluated in a subprocess. `seq' is preloaded so this never failed in practice, but the explicit require silences the lint and documents the dependency.
- ekg-agent-llm-test.el: explicit `(require 'subr-x)' for `string-empty-p', and `(require 'cl-lib)' as the first form in the daemon init-forms so `cl-flet' macro-expansion no longer relies on llm-test's helper having required cl-lib first.
- run-bench.el: drop the stale `LLM_TEST_PATH' line from the header, since the script never read it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The test agent kept calling `run-command "switch-to-buffer"' to get back to *ekg-org-tasks* after the planning step. That command is interactive and opens a minibuffer prompt for the buffer name; the agent never typed the name, just kept re-invoking the command and getting more stuck. Tell the agent to use `run-command "ekg-org-view"' instead (we documented earlier that ekg-org-view switches to the tasks buffer), explicitly forbid switch-to-buffer, and spell out the exact insert-mode RET-then-type sequence so it doesn't try to type into the read-only placeholder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ahyatt added 30 commits March 8, 2026 15:10

Make sure org children and parents don't break or appear in header

b6a7d5b

Fix issue with org entries appending instead of replacing

f687ca4

Added vui as a dependency, created ekg-org-view

1a375eb

Ensure we always set up the note types

d253c88

Merge branch 'develop' into org-new-view

ac29f52

Fix lint errors

7205cd6

Add and fix tests for org-view

1f8c408

Add line highlighting, viewport stability

f44bf2b

Enable agent (and other things) to act on ekg-org-view notes

bf9225b

Fix error with saving virtual reversed properties

88a568e

Add back highlighting of current line

6a8f959

Add new ekg org skills, properties, ensure we have a current node

d0de698

Add documentation

fdfc48a

Add distinct face for body text in org-view

bf83916

Recursively delete/archive children in org-view

db7ca9c

Better UI for task creation

d4312c5

Use futur instead of async for ekg-agent

0436773

Add agent log capture to benchmark results

794f83a

Fix pass/fail/skip distinction in verify results

07b2a87

Isolate subprocess from user AGENTS.md

42e8f40

ahyatt had a problem deploying to llm-test April 13, 2026 21:18 — with GitHub Actions Failure

ahyatt had a problem deploying to llm-test April 13, 2026 21:47 — with GitHub Actions Failure

ahyatt had a problem deploying to llm-test April 13, 2026 21:58 — with GitHub Actions Failure

Add more init diagnostics (locate llm-openai, fboundp probes)

bf02fbd

ahyatt temporarily deployed to llm-test April 13, 2026 22:04 — with GitHub Actions Inactive

ahyatt had a problem deploying to llm-test April 13, 2026 22:14 — with GitHub Actions Failure

ahyatt had a problem deploying to llm-test April 13, 2026 22:53 — with GitHub Actions Failure

Advise more entry points in the plan-task call chain

177a9ac

ahyatt had a problem deploying to llm-test April 13, 2026 22:58 — with GitHub Actions Failure

Log advice attachment status and ENTER/EXIT with buffer context

b65e9f0

ahyatt had a problem deploying to llm-test April 13, 2026 23:05 — with GitHub Actions Failure

ahyatt had a problem deploying to llm-test April 13, 2026 23:23 — with GitHub Actions Failure

ahyatt temporarily deployed to llm-test April 14, 2026 00:28 — with GitHub Actions Inactive

ahyatt requested a review from Copilot April 14, 2026 00:58

Copilot started reviewing on behalf of ahyatt April 14, 2026 00:58

Copilot AI reviewed Apr 14, 2026

Comment thread ekg-agent-bench.el

Comment thread ekg-agent-llm-test.el

Comment thread ekg-agent-llm-test.el Outdated

Comment thread run-bench.el Outdated

ahyatt had a problem deploying to llm-test April 14, 2026 01:28 — with GitHub Actions Failure

ahyatt temporarily deployed to llm-test April 14, 2026 03:26 — with GitHub Actions Inactive

Use n/p navigation for step 4 instead of eval-elisp

4a649e4

ahyatt temporarily deployed to llm-test April 14, 2026 04:03 — with GitHub Actions Inactive

ahyatt merged commit ac409f0 into develop Apr 14, 2026
3 checks passed

ahyatt deleted the agent-bench branch April 14, 2026 04:12

ahyatt merged 75 commits into develop from agent-bench
2 participants