E2E run-and-eval agent — the Board #206

@akashgit

Problem

The current build→eval→decide flow has the Evaluator and smoke test as separate, sequential steps orchestrated by the CEO. There's no integrated agent that can run the project end-to-end, evaluate it holistically, and re-trigger the Builder when things fail — acting as a quality board that gates what ships.

Current gaps:

  • Smoke test configuration is manual and fragile (it must be set in factory.md, and the CEO can forget to set it)
  • E2E results are a binary precheck gate, not a scored dimension
  • No single agent owns the "does this actually work?" question — it's split across Evaluator, Reviewer, and CEO
  • Builder doesn't get structured E2E feedback to iterate on — just a pass/fail from precheck
  • No artifact collection from E2E runs (logs, outputs, screenshots)
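
To make the "scored dimension" and artifact gaps concrete, here is a minimal sketch of what a structured E2E result could look like. All names (`E2ECheck`, `E2EReport`, `weight`) are hypothetical and not part of the current codebase; the weighted pass ratio is just one possible scoring rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class E2ECheck:
    """One exercised path, e.g. "health endpoint responds" (hypothetical shape)."""
    name: str
    passed: bool
    detail: str = ""      # what failed and where, for the Builder
    weight: float = 1.0   # assumed: relative importance of this check


@dataclass
class E2EReport:
    checks: List[E2ECheck] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)  # paths to logs, outputs, screenshots

    def score(self) -> float:
        """Weighted fraction of passing checks, in [0, 1]; 0.0 when there are no checks."""
        total = sum(c.weight for c in self.checks)
        if total == 0:
            return 0.0
        return sum(c.weight for c in self.checks if c.passed) / total

    def failures(self) -> List[E2ECheck]:
        """Checks the Builder would need to address."""
        return [c for c in self.checks if not c.passed]
```

A report like this could replace the binary precheck result: the CEO reads `score()`, the Builder reads `failures()`, and `artifacts` points at the collected evidence.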

What's needed

A dedicated E2E agent (or expanded Evaluator role) that:

  1. Runs the project end-to-end — starts it, exercises key paths, collects outputs
  2. Evaluates holistically — beyond unit tests and lint, assesses "does this thing actually work as intended?"
  3. Works bidirectionally with Builder — when E2E fails, provides structured feedback (what failed, where, suggested fix direction) and can re-trigger the Builder for targeted fixes
  4. Reports to CEO as a board member — CEO gets a structured E2E verdict (not just pass/fail) with evidence, and this factors into the keep/revert decision
  5. Auto-detects smoke test — infers how to run the project from its type (Flask → health check, CLI → --help + sample run, etc.) instead of requiring manual config
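
Item 5 could start from something like the sketch below. It is a heuristic under stated assumptions, not a proposed implementation: the function names, the marker strings, and the `/health` path are all hypothetical, and a real detector would parse the manifests instead of substring-matching them.

```python
from pathlib import Path
from typing import List, Optional

# Hypothetical markers: dependency names that hint at the project type.
_MARKERS = {
    "flask": ("flask",),
    "cli": ("click", "typer"),
}


def detect_project_type(root: Path) -> Optional[str]:
    """Guess the project type from dependency manifests; None if nothing matches."""
    for name in ("requirements.txt", "pyproject.toml"):
        manifest = root / name
        if not manifest.is_file():
            continue
        text = manifest.read_text(encoding="utf-8", errors="ignore").lower()
        for project_type, markers in _MARKERS.items():
            if any(marker in text for marker in markers):
                return project_type
    return None


def smoke_plan(project_type: Optional[str]) -> List[str]:
    """Map a detected type to smoke-test steps, described as plain text."""
    if project_type == "flask":
        # The /health route is an assumption; the project may expose another path.
        return ["start the app on a free port", "GET /health and expect HTTP 200"]
    if project_type == "cli":
        return [
            "run the entry point with --help and expect exit code 0",
            "run one sample invocation and capture stdout",
        ]
    # Unknown type: fall back to the manual config in factory.md.
    return []
```

Returning an empty plan for unknown types keeps the manual factory.md setting as a fallback instead of guessing.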

Think of it as the "board of directors" for each experiment — the CEO proposes, the board evaluates whether it actually works.

Key interaction

CEO → Builder (implement) → E2E Agent (run + evaluate) 
                                  ↓ fail
                            Builder (fix) → E2E Agent (re-evaluate)
                                  ↓ pass
                            CEO (keep/revert decision with E2E evidence)
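
The loop in the diagram above could be sketched as follows. `run_board`, `Verdict`, and the `max_fix_rounds` cap are assumptions for illustration; the issue does not specify a retry limit, but some bound seems necessary so that a failing experiment eventually reaches the CEO.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    passed: bool
    feedback: List[str]   # outstanding failures, empty on pass
    attempts: int         # number of E2E evaluations that were run


def run_board(
    build: Callable[[List[str]], None],
    evaluate: Callable[[], List[str]],
    max_fix_rounds: int = 2,
) -> Verdict:
    """Builder implements, E2E agent evaluates, Builder fixes until pass or cap."""
    build([])  # initial implementation, no feedback yet
    feedback: List[str] = []
    for attempt in range(1, max_fix_rounds + 2):
        feedback = evaluate()  # an empty list means every check passed
        if not feedback:
            return Verdict(True, [], attempt)
        if attempt <= max_fix_rounds:
            build(feedback)    # targeted fix driven by structured feedback
    # Cap reached: hand the remaining failures to the CEO as evidence.
    return Verdict(False, feedback, max_fix_rounds + 1)
```

In this sketch the CEO only ever sees a `Verdict`, so the keep/revert decision can use the remaining feedback and the attempt count instead of a bare pass/fail.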
