Add agent benchmarks, fix various issues to improve benchmarks #271
Merged
- Indent body text to align past heading stars for clearer hierarchy
- Stable sort order using note ID as tiebreaker, preventing reordering on state changes when creation times are tied
- Fix ekg-org-view-up-heading: bind current-id (was a free variable)
- Remove unused win-start variable in ekg-org-view--refresh
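The tiebreaker idea can be sketched as a comparison predicate. This is a minimal illustration, not the code from the PR: the plist shape (`:id`, `:creation-time`) and the `my/` function names are hypothetical stand-ins for however ekg-org-view represents tasks.

```elisp
;; Hypothetical task representation: a plist with numeric :id and
;; :creation-time.  The real ekg data model is not shown in this PR text.
(defun my/task-before-p (a b)
  "Return non-nil if task A should sort before task B.
Compare creation times first and break ties with the note ID."
  (let ((ta (plist-get a :creation-time))
        (tb (plist-get b :creation-time)))
    (if (= ta tb)
        (< (plist-get a :id) (plist-get b :id))
      (< ta tb))))

(defun my/sort-tasks (tasks)
  "Return a sorted copy of TASKS, leaving the argument untouched."
  (sort (copy-sequence tasks) #'my/task-before-p))
```

Because the predicate never treats two distinct tasks as equal, the result does not depend on the order in which the tasks were fetched.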
Use shadow-inheriting ekg-org-view-body face for task body text, making it visually distinct from heading text.
ekg-agent-org--tool-add-item now assigns an explicit :org/sort-order when creating tasks, placing them at the end of existing siblings. This ensures consistent ordering regardless of ID generation or creation-time ties, and prevents conflicts when mixing agent-created and UI-created tasks under the same parent.
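One way to place a new task after its siblings is to take the largest existing sort order and add one. The sketch below assumes numeric sort orders starting at 0; the actual numbering scheme behind `:org/sort-order` is not stated here, and `my/next-sort-order` is a hypothetical helper.

```elisp
(defun my/next-sort-order (sibling-orders)
  "Return a sort order that places a new task after SIBLING-ORDERS.
SIBLING-ORDERS is a list of numbers and may be empty."
  (if sibling-orders
      (1+ (apply #'max sibling-orders))
    0))
```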
ekg-org-view-delete and ekg-org-view-archive now operate on the task and all its descendants, preventing orphaned children.
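Operating on a task together with its descendants amounts to a depth-first walk of the task tree. A minimal sketch, assuming a caller-supplied function that returns the child IDs of a task (the real lookup in ekg-org is not shown in this PR text):

```elisp
(defun my/task-with-descendants (id children-fn)
  "Return ID followed by all of its descendants, depth first.
CHILDREN-FN is a hypothetical accessor: called with a task ID, it
returns a list of child IDs."
  (cons id
        (mapcan (lambda (child)
                  (my/task-with-descendants child children-fn))
                (funcall children-fn id))))
```

A delete or archive command could then apply its operation to every ID in the returned list, so no child is left without a parent.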
When pressing 'c' on a heading, the insert placeholder now defaults to the first-child slot of that heading rather than the nearest slot by buffer position. Previously, pressing 'c' on 'Project Beta' would show the placeholder as a child of 'Project Alpha' (the nearest position above), which was confusing and led to tasks being created under the wrong parent. Add ekg-org-test-view-insert-initial-position to verify the fix, and ekg-org-test-view-top-level-spacing to confirm blank lines separate top-level groups after task creation.
ekg-org-view--refresh was calling ekg-org-view--mount which ends with switch-to-buffer, stealing the selected window. When saving a note triggered the ekg-note-save-hook refresh, quit-window in ekg-edit-finalize would then operate on the org-view window instead of the edit buffer window, so the edit buffer stayed open. Fix by using vui-rerender on the existing instance instead of remounting. This re-renders in place without touching windows.
Introduces a benchmark framework that runs predefined tasks in a fresh Emacs subprocess (via llm-test infrastructure) and measures the ekg-agent's effectiveness along three axes:
- Task completion: did the agent accomplish the goal?
- Skill adherence: did the agent follow instructions stored in ekg?
- Memory discipline: did the agent store/retrieve knowledge in ekg?

Metrics collected per task: pass/fail for each axis, iteration count, tools used, and wall-clock time.

Includes 7 benchmark tasks:
- coding-with-skills: write elisp following skill note conventions
- memory-retrieval: use previously stored decisions for a follow-up task
- elisp-debugging: find and fix a bug in an elisp file (emacs-y)
- buffer-transform: read CSV data and produce org-mode report (emacs-y)
- knowledge-synthesis: combine info from multiple notes to answer a question
- store-findings: investigate code and document findings as ekg notes
- follow-complex-skill: follow a multi-step checklist from a skill note

Entry points:
- ekg-agent-bench-run: run all benchmarks
- ekg-agent-bench-run-one: run a single task by name
- ekg-agent-bench-register-ert-tests: register as ERT tests
- run-bench.el: batch script for CI
- Write init errors to a temp file via condition-case, then check it after the daemon starts (the daemon starts even when init errors because server vars are set first)
- Show the actual error and subprocess load-path in the error message
- Add ekg-agent-bench-diagnose command for interactive troubleshooting
- Add vui and llm-prompt to required libraries list
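The first bullet describes a common pattern: wrap the init code in `condition-case` and leave a file behind for the parent process to inspect. The helper below is a sketch under that reading; its name, arguments, and report format are hypothetical, not taken from ekg-agent-bench.el.

```elisp
(defun my/run-init-recording-errors (init-fn error-file)
  "Call INIT-FN; on error, write the error and `load-path' to ERROR-FILE.
Return t on success and nil on failure."
  (condition-case err
      (progn (funcall init-fn) t)
    (error
     ;; Record both the error and the load-path, since a missing
     ;; library is a likely cause of init failures in a subprocess.
     (with-temp-file error-file
       (insert (format "init error: %S\nload-path: %S\n" err load-path)))
     nil)))
```

The parent can then check whether the file is non-empty after the daemon reports that it is up.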
The host Emacs (e.g. /Applications/Emacs.app) and the subprocess daemon (e.g. homebrew emacs-mac 29.1) may be different versions. Built-in .elc files from one can't be loaded by the other, causing 'Invalid (or missing) doc string' errors. The subprocess already has its own built-in lisp on load-path, so we only need to pass user-installed package directories.
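Passing only user-installed directories can be sketched as filtering out every `load-path` entry that lives under the host's built-in lisp tree. How the PR identifies that tree is not stated, so the root is a parameter here, and both function names are hypothetical.

```elisp
(require 'seq)

(defun my/inside-dir-p (dir root)
  "Return non-nil if DIR is ROOT or lies below it (string comparison only)."
  (string-prefix-p (file-name-as-directory (expand-file-name root))
                   (file-name-as-directory (expand-file-name dir))))

(defun my/user-package-dirs (dirs builtin-root)
  "Return the entries of DIRS that are not under BUILTIN-ROOT."
  (seq-remove (lambda (dir) (my/inside-dir-p dir builtin-root)) dirs))
```

The comparison is purely textual and does not resolve symlinks, which may or may not matter for a given installation.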
The subprocess must run the same Emacs version as the host to avoid .elc version mismatches and missing built-in packages. Use `invocation-name' and `invocation-directory' to resolve the exact binary path instead of relying on llm-test-emacs-executable (which defaults to 'emacs' and may resolve to a different installation).
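Resolving the running binary from these two built-in variables is a one-liner. The function name is a hypothetical wrapper; `invocation-name` and `invocation-directory` are standard Emacs variables.

```elisp
(defun my/host-emacs-executable ()
  "Return the absolute path of the Emacs binary running this code."
  (expand-file-name invocation-name invocation-directory))
```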
Include the full agent log in each result so we can see what the agent actually did. Display it in the results buffer for debugging.
- eval-verify now returns 'skip for nil expressions (not applicable), t for pass, nil for fail
- Display and counting correctly distinguish FAIL from n/a
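The three-valued result can be illustrated as follows. This is a simplified sketch: the real eval-verify presumably evaluates the expression in the benchmark subprocess, whereas this version evaluates it locally, and both function names are hypothetical.

```elisp
(require 'cl-lib)

(defun my/eval-verify (expr)
  "Evaluate verification form EXPR in the current Emacs.
Return `skip' when EXPR is nil, t when it evaluates to non-nil,
and nil otherwise."
  (cond ((null expr) 'skip)
        ((eval expr t) t)
        (t nil)))

(defun my/count-results (results)
  "Return a plist counting passed, failed and skipped entries in RESULTS."
  (list :pass (cl-count t results)
        :fail (cl-count nil results)
        :skip (cl-count 'skip results)))
```

Keeping `skip` distinct from nil is what lets a summary report n/a separately from FAIL.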
Override ekg-agent--read-agents-md and ekg-agent--agents-md-context to return nil in the subprocess so benchmarks test agent behavior without being influenced by the user's personal AGENTS.md files.
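One common way to force a function to return nil is `:override` advice with the built-in `ignore`, which accepts any arguments. The sketch uses a hypothetical stand-in function rather than the real ekg-agent functions, and the PR may implement the override differently.

```elisp
;; Hypothetical stand-in for a function that reads AGENTS.md.
(defun my/read-agents-md ()
  "Return the contents of a pretend AGENTS.md file."
  "personal instructions")

;; `ignore' takes any arguments and returns nil, so the advised
;; function yields nil regardless of how it is called.
(advice-add 'my/read-agents-md :override #'ignore)
```

Calling `(advice-remove 'my/read-agents-md #'ignore)` restores the original behavior.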
read_file: LLMs sometimes pass empty strings for omitted optional args (begin, end, range_type). Empty strings for begin/end with range_type='line_number' caused string-to-number to return 0, making the range 1..0 and returning empty content. Now treat empty strings as nil.

write_file: New base tool that creates or overwrites files. Creates parent directories if needed. If the file is open in an Emacs buffer, replaces buffer contents without saving. Returns content with line identifiers like read_file.

Tests: 4 new tests covering the empty-string fix, file creation, overwrite, and buffer-update behavior.
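The empty-string pitfall is easy to reproduce: `(string-to-number "")` evaluates to 0. A sketch of the normalization, with hypothetical helper names that are not the ones used in ekg-agent.el:

```elisp
(defun my/blank-to-nil (value)
  "Return nil if VALUE is nil or an empty string, otherwise VALUE."
  (if (and (stringp value) (string= value "")) nil value))

(defun my/parse-line-arg (value default)
  "Convert optional tool argument VALUE to a number.
Return DEFAULT when VALUE is nil or an empty string."
  (let ((v (my/blank-to-nil value)))
    (cond ((null v) default)
          ((stringp v) (string-to-number v))
          (t v))))
```

With this in place an omitted `begin` or `end` falls back to its default instead of silently becoming line 0.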
New benchmark files testing four key agent behaviors:
1. task-progress-notes.yaml: Agent should create agent-task notes and append progress entries during multi-step work. Includes a restart scenario where a prior task note exists.
2. org-task-management.yaml: Agent should create ekg-org tasks with subtasks and mark them done. Tests both creating from scratch and continuing an existing org task. Sets org tools as extra tools.
3. auto-create-skills.yaml: Agent should store learnings as prompt-tagged skill notes. Tests discovery of performance issues (nth in loop) and project conventions (error-first returns).
4. use-existing-skills.yaml: Agent should find and follow error handling conventions from a skill note when writing new code.

Total benchmarks: 14 tasks across 11 groups.
The 60-second wait for ekg-agent-org-plan-task to produce subtasks was too tight for CI, where OpenRouter latency and runner CPU are slower than on local hardware. The step completes in ~22s locally but exceeded 60s in GitHub Actions, causing the test agent to fail cleanly on a timing issue rather than a behavior issue. Bump the window to 120s, still well under the job's 10-minute cap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI llm-test job surfaces "LLM provider was nil" errors from ekg-agent-note-response, suggesting the init-forms that set ekg-llm-provider aren't taking effect in the daemon — but the same test passes locally. Add a diagnostic file at /tmp/ekg-llm-test-init-diag.log written from the daemon init chain that records (a) whether the env var is visible, (b) whether the provider setq succeeded, and (c) any eval error from the provider expression. Upload the file as part of the CI debug-logs artifact on failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The diagnostic block added earlier to debug the daemon provider setup is only useful when something is failing. Gate the whole block (including the side-effect force-require of llm-openai) behind EKG_LLM_TEST_DIAG so normal runs take the lean path. This also lets us test whether the force-require was inadvertently what fixed CI, or whether the test was genuinely green on its own. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Errors in the test daemon's run-at-time timer callbacks are silent — if ekg-agent-org-plan-task signals before creating its log buffer, no trace is left. Add advice that wraps ekg-agent--iterate and ekg-agent--prompt-id, writing any error to /tmp/ekg-llm-test-init-diag.log before re-signaling. Also re-enable the force-require of llm-openai at daemon init (which previously resolved a "void-function make-llm-openrouter" error in CI) and remove the EKG_LLM_TEST_DIAG gate so the diagnostics always run in CI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The advice we added around ekg-agent--iterate, --prompt-id, and friends was using dynamic-scoped references to 'diag-file' and 'fn-name' from the init-forms' lexical block. The init-file that llm-test writes for the daemon is loaded without lexical-binding, so those names resolved in dynamic scope — they were in scope while the init-forms progn ran, but void by the time a timer callback fired the advice. The net effect was that EVERY invocation of ekg-agent--prompt-id (called from ekg-agent--iterate iteration 0) signaled (void-variable diag-file) inside the run-command timer. The error was silently swallowed by the timer's default handler and no agent log buffer was ever created, so it looked like the plan-task command never ran. Removing the advice. Provider setup diagnostics for the init phase itself stay because they run inside the dynamic scope where diag-file is still bound. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
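The scoping difference can be reproduced in isolation with `eval`, whose second argument selects lexical or dynamic binding. This is a standalone demonstration, not the advice code from the PR; only the variable name `diag-file` is borrowed from the description, and it assumes `diag-file` has no global value.

```elisp
;; Lexical binding: the lambda captures the value of `diag-file'.
(defvar my/lexical-fn
  (eval '(let ((diag-file "/tmp/example.log"))
           (lambda () diag-file))
        t))

;; Dynamic binding: nothing is captured; `diag-file' is looked up
;; when the lambda is called, after the `let' has already exited.
(defvar my/dynamic-fn
  (eval '(let ((diag-file "/tmp/example.log"))
           (lambda () diag-file))
        nil))

(defun my/call-reporting-void (fn)
  "Call FN and return its value, or the symbol `void-variable' if it
signals that error."
  (condition-case err
      (funcall fn)
    (void-variable (car err))))
```

Calling `my/call-reporting-void` on `my/lexical-fn` returns the path string, while calling it on `my/dynamic-fn` returns `void-variable`, which mirrors what happened when the timer fired the advice.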
In daemon mode without a visible frame, (current-message) returns nil once the command that produced the message finishes — so the llm-test frame-state capture's `message` field was always null, and the test agent never saw mode hints like "Insert mode: n/p move, R/L demote/promote, RET confirm, C-g cancel" that ekg-org-view-create displays after `c'. Intercept `message' in the llm-test init forms to save the most recent non-empty message in a global, and patch `current-message' to fall back to the saved value if it's fresh (within 30 seconds). This lets the frame-state capture surface transient messages across the emacsclient boundary. Also expand the ekg-org-view test description to document both branches of `ekg-org-view-create' (empty vs non-empty view) and spell out the insert-mode workflow so gpt-5.4-mini stops typing into a read-only placeholder. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
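A rough sketch of the interception described above, using advice. The variable and function names are hypothetical, the PR's init forms may be structured differently, and advice on `message` only sees calls made from Lisp, not messages emitted directly by C code.

```elisp
(defvar my/last-message nil
  "Most recent non-empty text passed to `message'.")
(defvar my/last-message-time nil
  "Value of `float-time' when `my/last-message' was recorded.")

(defun my/record-message (format-string &rest args)
  "Remember the text that `message' is about to display."
  (when (stringp format-string)
    (let ((text (apply #'format-message format-string args)))
      (unless (string= text "")
        (setq my/last-message text
              my/last-message-time (float-time))))))

(defun my/current-message-fallback (orig-fn)
  "Return the real current message, else a saved one under 30 seconds old."
  (or (funcall orig-fn)
      (and my/last-message
           (< (- (float-time) my/last-message-time) 30)
           my/last-message)))

(advice-add 'message :before #'my/record-message)
(advice-add 'current-message :around #'my/current-message-fallback)
```

The 30-second window matches the freshness limit mentioned in the description; anything older is treated as stale and dropped.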
Pull request overview
This PR introduces a data-driven benchmark suite and LLM-driven integration tests for ekg-agent, while also improving agent robustness (status update discipline + transient LLM retry) and fixing a DB timestamp edge case.
Changes:
- Add an `ekg-agent-bench` runner with `.eld` benchmark specs plus a batch entrypoint (`run-bench.el`).
- Add `llm-test`-based integration test harness and YAML specs; wire into CI as a dedicated job.
- Improve `ekg-agent` reliability via status-update reminders and transient LLM error retries; fix `ekg` note `creation-time` invariant.
Reviewed changes
Copilot reviewed 32 out of 33 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| run-bench.el | Batch script to run benchmarks locally / in CI-like environments. |
| ekg-agent-bench.el | New benchmark harness: subprocess management, metric extraction, ERT registration. |
| benchmarks/use-existing-skills.eld | Benchmark: agent applies an existing “skill” note to code output. |
| benchmarks/task-progress-notes.eld | Benchmarks: journaling progress notes; continuity across restart. |
| benchmarks/subagent-memory-handoff.eld | Benchmark: subagent delegation + context handoff via notes. |
| benchmarks/store-findings.eld | Benchmark: investigate a file and store findings as notes. |
| benchmarks/status-update-gap.eld | Benchmark: enforce explicit summarize_state cadence (gap metric). |
| benchmarks/periodic-state-checkpoint.eld | Benchmark: periodic progress notes across multi-file work. |
| benchmarks/org-task-management.eld | Benchmarks: create/continue org tasks and mark DONE as work completes. |
| benchmarks/note-response.eld | Benchmarks: note-response agent appends/replaces content in edit buffer. |
| benchmarks/memory-retrieval.eld | Benchmark: retrieve prior decisions/conventions and apply them. |
| benchmarks/memory-contradiction.eld | Benchmark: detect contradiction with stored convention; ask user. |
| benchmarks/memory-continuity.eld | Benchmark: continue work without redoing already-completed files. |
| benchmarks/knowledge-synthesis.eld | Benchmark: synthesize answer from multiple notes. |
| benchmarks/follow-complex-skill.eld | Benchmark: follow a multi-step “checklist” skill note. |
| benchmarks/file-resource-notes.eld | Benchmark: follow file-specific resource-note conventions. |
| benchmarks/elisp-debugging.eld | Benchmark: debug/fix elisp bug and verify via tool execution. |
| benchmarks/coding-with-skills.eld | Benchmark: generate code following basic “skill” conventions. |
| benchmarks/buffer-transform.eld | Benchmark: transform CSV into an org report file. |
| benchmarks/auto-create-skills.eld | Benchmarks: detect pattern and store reusable prompt-tagged skills. |
| llm-tests/ekg-org-view.yaml | LLM-driven UI workflow test for ekg-org-view task creation/planning. |
| llm-tests/ekg-basics.yaml | LLM-driven smoke tests for note capture/list/search/drafts flows. |
| llm-tests/agent-response.yaml | LLM-driven test for agent response insertion behavior. |
| ekg-agent-llm-test.el | Registers YAML specs as ERT tests; sets up subprocess init + diagnostics. |
| ekg-agent.el | Adds status reminders + status timestamp tracking; adds transient LLM retry wrapper; improves error logging/termination behavior. |
| ekg-agent-test.el | Adds ERT coverage for status reminders and file tool edge cases + note append behavior. |
| ekg.el | Ensures creation-time is populated when missing before persisting time-tracked. |
| doc/ekg.org | Documents new status reminder behavior/customization. |
| agent_tools/skill.md | Markdown table formatting fix for org task tag documentation. |
| Eldev | Adds llm-test and adjusts vui dependency declaration. |
| .github/workflows/ci.yaml | Adds llm-test job to CI to run LLM-driven integration tests. |
| .gitignore | Ignores generated *.log artifacts (e.g., llm-test debug logs). |
| AGENTS.md | Adds packaging/distribution constraint note (no dependency on specific Emacs setup). |
- ekg-agent-bench.el: explicit `(require 'seq)` since the file uses `seq-filter' inside elisp strings evaluated in a subprocess. `seq' is preloaded so this never failed in practice, but the explicit require silences the lint and documents the dependency.
- ekg-agent-llm-test.el: explicit `(require 'subr-x)' for `string-empty-p', and `(require 'cl-lib)' as the first form in the daemon init-forms so `cl-flet' macro-expansion no longer relies on llm-test's helper having required cl-lib first.
- run-bench.el: drop the stale `LLM_TEST_PATH' line from the header; the script never read it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test agent kept calling `run-command "switch-to-buffer"' to get back to *ekg-org-tasks* after the planning step. That command is interactive and opens a minibuffer prompt for the buffer name; the agent never typed the name, just kept re-invoking the command and getting more stuck. Tell the agent to use `run-command "ekg-org-view"' instead (we documented earlier that ekg-org-view switches to the tasks buffer), explicitly forbid switch-to-buffer, and spell out the exact insert-mode RET-then-type sequence so it doesn't try to type into the read-only placeholder. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>