feat: auto-push eval results to configurable git repo #826

@christso

Summary

Add a configuration option to automatically push eval result artifacts to a git repository after each eval run. The agent opens a draft PR and never auto-merges it, so a human still reviews and merges.

Motivation

Currently eval results live in .agentv/results/runs/ locally and are lost unless manually committed. For reproducibility and historical comparison, results should be automatically pushed to a dedicated repo (e.g., EntityProcess/agentv-evals).

Design

Configuration

In .agentv/config.yaml:

results:
  export:
    repo: EntityProcess/agentv-evals     # GitHub repo path
    path: autopilot-dev/runs             # Directory within the repo
    auto_push: true                      # Enable auto-push after each run
    branch_prefix: eval-results          # Branch naming prefix
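
A minimal sketch of how this section might be read once the YAML has been parsed into a dict. The names `ExportConfig` and `load_export_config` are hypothetical, and the defaults (`auto_push` off, `branch_prefix` of `eval-results`) are assumptions rather than part of the proposal.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ExportConfig:
    """Hypothetical in-memory form of the results.export section."""
    repo: str                             # GitHub repo path, e.g. owner/name
    path: str                             # directory within the repo
    auto_push: bool = False               # assumed default: export disabled
    branch_prefix: str = "eval-results"   # assumed default prefix


def load_export_config(config: Mapping[str, Any]) -> Optional[ExportConfig]:
    """Return the export settings, or None when the section is absent."""
    export = (config.get("results") or {}).get("export")
    if not export:
        return None
    return ExportConfig(
        repo=export["repo"],
        path=export["path"],
        auto_push=bool(export.get("auto_push", False)),
        branch_prefix=export.get("branch_prefix", "eval-results"),
    )


# This dict mirrors the YAML example above after parsing.
parsed = {"results": {"export": {
    "repo": "EntityProcess/agentv-evals",
    "path": "autopilot-dev/runs",
    "auto_push": True,
    "branch_prefix": "eval-results",
}}}
cfg = load_export_config(parsed)
```

Returning None for a missing section keeps the feature opt-in: callers can skip the export step without special-casing older config files.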

Clone / Cache Strategy

On first use, the target repo is cloned to ~/.agentv/cache/results-repo/. Subsequent runs reuse this cached clone and only fetch. A broader data_dir config option can be added later as a separate concern.
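
The clone-or-fetch decision could look roughly like the sketch below. `sync_commands` is a hypothetical helper; using `gh repo clone` for the first run and `git fetch origin` afterwards is one possible choice, not something the proposal fixes.

```python
from pathlib import Path
from typing import List

# Cache location named in this issue.
CACHE_DIR = Path.home() / ".agentv" / "cache" / "results-repo"


def sync_commands(repo: str, cache_dir: Path = CACHE_DIR) -> List[List[str]]:
    """Return the commands that bring the cached clone up to date."""
    if (cache_dir / ".git").is_dir():
        # A cached clone already exists: fetch only.
        return [["git", "-C", str(cache_dir), "fetch", "origin"]]
    # First run: clone through gh so its stored credentials are reused.
    return [["gh", "repo", "clone", repo, str(cache_dir)]]
```

The function only builds argument lists, so the caller decides how to execute them and how to report failures.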

Authentication

Assumes the gh and git CLIs are already authenticated. If they are not, show a meaningful error message (e.g., Run 'gh auth login' to authenticate). Do not fail the eval run: warn and skip the export step.
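
One way to get the warn-and-skip behavior is a pre-flight check before any export work starts. In this sketch `export_blocker` and `maybe_export` are hypothetical names; it relies on `gh auth status` exiting non-zero when no account is logged in, and it does not probe git credentials separately, so a git auth failure would still surface at push time.

```python
import shutil
import subprocess
import sys
from typing import Callable, Optional, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    return subprocess.run(cmd, capture_output=True).returncode


def export_blocker(
    run: Callable[[Sequence[str]], int] = _run,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a warning message if export should be skipped, else None."""
    for tool in ("git", "gh"):
        if which(tool) is None:
            return f"'{tool}' not found on PATH; skipping results export."
    # `gh auth status` exits non-zero when no account is logged in.
    if run(["gh", "auth", "status"]) != 0:
        return "GitHub CLI is not authenticated. Run 'gh auth login' to authenticate."
    return None


def maybe_export(do_export: Callable[[], None], **checks) -> bool:
    """Export if possible; otherwise warn on stderr and carry on."""
    problem = export_blocker(**checks)
    if problem is not None:
        print(f"warning: {problem}", file=sys.stderr)
        return False  # the eval run itself is not failed
    do_export()
    return True
```

The `run` and `which` parameters exist only so the check can be exercised without touching the real CLIs.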

Artifacts

The entire runs/<run-id>/ directory is pushed. No filtering.

Workflow

After agentv eval run or agentv pipeline completes, if auto_push is enabled:

  1. Fetch into the cached clone of the target repo (or clone it on the first run)
  2. Create a branch: <branch_prefix>/<experiment>-<eval-file>-<timestamp> (e.g., eval-results/autopilot-dev-ad-explore-2026-03-29T01-15-06)
  3. Copy the entire runs/<run-id>/ directory to the configured path
  4. Commit with a structured message that includes the eval summary (pass/fail counts, mean score)
  5. Push the branch and create a draft PR with the results summary in the body
  6. A human reviews and merges the PR
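
The branch, copy, commit, push, and draft-PR steps above can be sketched as follows, assuming the cached clone has already been fetched. `branch_name` and `export_run` are hypothetical names, and branching from `origin/HEAD` is an assumption about where the results branch should start. The git and gh flags used here are standard ones.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List

Runner = Callable[[List[str]], None]


def _check_call(cmd: List[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


def branch_name(prefix: str, experiment: str, eval_file: str, timestamp: str) -> str:
    """Build <branch_prefix>/<experiment>-<eval-file>-<timestamp>."""
    return f"{prefix}/{experiment}-{eval_file}-{timestamp}"


def export_run(run_dir: Path, clone_dir: Path, repo: str, dest_path: str,
               branch: str, title: str, body: str,
               run: Runner = _check_call) -> None:
    """Publish one run directory as a draft PR (clone assumed up to date)."""
    git = ["git", "-C", str(clone_dir)]
    # Assumption: start the results branch from the remote default branch.
    run(git + ["checkout", "--no-track", "-b", branch, "origin/HEAD"])
    # Copy the whole run directory, unfiltered, under the configured path.
    target = clone_dir / dest_path / run_dir.name
    shutil.copytree(run_dir, target, dirs_exist_ok=True)
    run(git + ["add", "--", dest_path])
    run(git + ["commit", "-m", title])
    run(git + ["push", "--set-upstream", "origin", branch])
    # Draft PR only; merging is left to a human reviewer.
    run(["gh", "pr", "create", "--draft", "--repo", repo, "--head", branch,
         "--title", title, "--body", body])
```

For example, `branch_name("eval-results", "autopilot-dev", "ad-explore", "2026-03-29T01-15-06")` yields the branch shown in the example above.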

PR Granularity

One PR per run invocation. A single agentv eval run or agentv pipeline execution produces one PR that bundles all evals from that run. The PR body contains a summary table per eval.

PR Format

feat(results): ad-explore claude-cli — 3/3 PASS (1.000)

## Results
| Test | Score | Status |
|---|---|---|
| discovers-existing-implementation | 1.000 | PASS |
| finds-all-consumers | 1.000 | PASS |
| structured-summary | 1.000 | PASS |

Run: 2026-03-29T01-15-06-826Z
Target: claude-cli
Eval: evals/autopilot-dev/ad-explore.eval.yaml

For bundled runs (multiple evals), repeat the results table per eval.
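
A sketch of how the title and body above might be rendered. All names here (`TestResult`, `EvalResult`, `pr_title`, `pr_body`) are hypothetical. Two things are assumptions: the title always reads "passed/total PASS", and a failing test is labeled FAIL, since the issue only shows an all-pass example. How a bundled run is titled is not specified, so only the per-eval title is sketched.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TestResult:
    name: str
    score: float
    passed: bool


@dataclass
class EvalResult:
    name: str        # e.g. "ad-explore"
    target: str      # e.g. "claude-cli"
    eval_file: str   # path to the eval YAML
    run_id: str
    tests: List[TestResult]


def pr_title(ev: EvalResult) -> str:
    """Title in the style of the example: counts plus mean score."""
    total = len(ev.tests)
    passed = sum(1 for t in ev.tests if t.passed)
    mean = sum(t.score for t in ev.tests) / total if total else 0.0
    # \u2014 is the dash used in the example title.
    return f"feat(results): {ev.name} {ev.target} \u2014 {passed}/{total} PASS ({mean:.3f})"


def eval_section(ev: EvalResult) -> str:
    """One results table followed by the run metadata lines."""
    lines = ["## Results", "| Test | Score | Status |", "|---|---|---|"]
    for t in ev.tests:
        status = "PASS" if t.passed else "FAIL"  # FAIL label is an assumption
        lines.append(f"| {t.name} | {t.score:.3f} | {status} |")
    lines += ["", f"Run: {ev.run_id}", f"Target: {ev.target}", f"Eval: {ev.eval_file}"]
    return "\n".join(lines)


def pr_body(evals: List[EvalResult]) -> str:
    """Repeat the results section once per eval in the run."""
    return "\n\n".join(eval_section(ev) for ev in evals)
```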

Size Warning

Warn (don't error) if total artifact size exceeds 10 MB.
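
A possible shape for the size check, returning a message instead of raising so the export still proceeds. `dir_size` and `size_warning` are hypothetical names, and treating the threshold as 10 * 1024 * 1024 bytes is an assumption about what the limit means.

```python
from pathlib import Path
from typing import Optional

MIB = 1024 * 1024
SIZE_WARN_BYTES = 10 * MIB  # assumption: the limit is counted in binary megabytes


def dir_size(root: Path) -> int:
    """Total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def size_warning(root: Path, limit: int = SIZE_WARN_BYTES) -> Optional[str]:
    """Return a warning string when the run directory exceeds the limit."""
    total = dir_size(root)
    if total > limit:
        return (f"run artifacts total {total / MIB:.1f} MB, "
                f"above the {limit / MIB:.0f} MB threshold; pushing anyway")
    return None
```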

Acceptance Signals

  • .agentv/config.yaml supports results.export section
  • After eval run, artifacts are pushed to configured repo as a draft PR
  • One PR per run invocation (bundles all evals in that run)
  • PR includes structured results summary
  • Human must merge — no auto-merge
  • Works with agentv eval run and agentv pipeline commands
  • Graceful fallback if repo is not accessible or auth fails (warning, not error)
  • Warn if artifact size exceeds 10 MB

Non-Goals

Related

Labels: core, in-progress

Status: Done
