Problem
The current build→eval→decide flow has the Evaluator and smoke test as separate, sequential steps orchestrated by the CEO. There's no integrated agent that can run the project end-to-end, evaluate it holistically, and re-trigger the Builder when things fail — acting as a quality board that gates what ships.
Current gaps:
- Smoke test configuration is manual and fragile (must be set in factory.md; the CEO can forget)
- E2E results are a binary precheck gate, not a scored dimension
- No single agent owns the "does this actually work?" question — it's split across Evaluator, Reviewer, and CEO
- Builder doesn't get structured E2E feedback to iterate on — just a pass/fail from precheck
- No artifact collection from E2E runs (logs, outputs, screenshots)
What's needed
A dedicated E2E agent (or expanded Evaluator role) that:
- Runs the project end-to-end — starts it, exercises key paths, collects outputs
- Evaluates holistically — beyond unit tests and lint, assesses "does this thing actually work as intended?"
- Works bidirectionally with Builder — when E2E fails, provides structured feedback (what failed, where, suggested fix direction) and can re-trigger the Builder for targeted fixes
- Reports to CEO as a board member — CEO gets a structured E2E verdict (not just pass/fail) with evidence, and this factors into the keep/revert decision
- Auto-detects smoke test — infers how to run the project from its type (Flask → health check, CLI → `--help` + sample run, etc.) instead of requiring manual config
Think of it as the "board of directors" for each experiment — the CEO proposes, the board evaluates whether it actually works.
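A structured verdict (rather than pass/fail) could look like the sketch below. The field names are hypothetical, chosen only to show the shape of what the Builder and CEO would each consume:

```python
from dataclasses import dataclass, field

@dataclass
class E2EVerdict:
    """Illustrative verdict schema — field names are assumptions, not a fixed API."""
    passed: bool
    score: float                                         # 0.0–1.0 scored dimension, not a binary gate
    failures: list[str] = field(default_factory=list)    # what failed, and where
    fix_hints: list[str] = field(default_factory=list)   # suggested fix direction for the Builder
    artifacts: list[str] = field(default_factory=list)   # paths to collected logs/outputs/screenshots
```

The Builder iterates on `failures` and `fix_hints`; the CEO weighs `score` and `artifacts` as evidence in the keep/revert decision.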
Key interaction
CEO → Builder (implement) → E2E Agent (run + evaluate)
                                ↓ fail
              Builder (fix) → E2E Agent (re-evaluate)
                                ↓ pass
          CEO (keep/revert decision with E2E evidence)
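The loop in the diagram above can be sketched as an orchestration function. The agent objects and their methods (`implement`, `evaluate`, `fix`, `decide`) are hypothetical, assuming `evaluate()` returns a structured verdict with a `passed` flag:

```python
def run_experiment(builder, e2e_agent, ceo, max_fix_rounds: int = 3):
    """Build, evaluate end-to-end, re-trigger the Builder on failure,
    then hand the final verdict to the CEO (sketch, not a fixed API)."""
    builder.implement()
    verdict = e2e_agent.evaluate()
    rounds = 0
    while not verdict.passed and rounds < max_fix_rounds:
        builder.fix(verdict)                # targeted fixes from structured feedback
        verdict = e2e_agent.evaluate()      # re-run the project, re-score it
        rounds += 1
    return ceo.decide(verdict)              # keep/revert with E2E evidence attached
```

Bounding the fix rounds keeps a persistently failing experiment from looping forever; after the cap, the CEO still gets the failing verdict and can revert.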