Problem
The current build→eval→decide flow has the Evaluator and smoke test as separate, sequential steps orchestrated by the CEO. There's no integrated agent that can run the project end-to-end, evaluate it holistically, and re-trigger the Builder when things fail — acting as a quality board that gates what ships.
Current gaps:
- Smoke test configuration is manual and fragile (must be set in factory.md; the CEO can forget)
- E2E results are a binary precheck gate, not a scored dimension
- No single agent owns the "does this actually work?" question — it's split across Evaluator, Reviewer, and CEO
- Builder doesn't get structured E2E feedback to iterate on — just a pass/fail from precheck
- No artifact collection from E2E runs (logs, outputs, screenshots)
What's needed
A dedicated E2E agent (or expanded Evaluator role) that:
- Runs the project end-to-end — starts it, exercises key paths, collects outputs
- Evaluates holistically — beyond unit tests and lint, assesses "does this thing actually work as intended?"
- Works bidirectionally with Builder — when E2E fails, provides structured feedback (what failed, where, suggested fix direction) and can re-trigger the Builder for targeted fixes
- Reports to CEO as a board member — CEO gets a structured E2E verdict (not just pass/fail) with evidence, and this factors into the keep/revert decision
- Auto-detects smoke test — infers how to run the project from its type (Flask → health check, CLI → `--help` + sample run, etc.) instead of requiring manual config
Think of it as the "board of directors" for each experiment — the CEO proposes, the board evaluates whether it actually works.
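A structured verdict (rather than pass/fail) could look like the sketch below. The field names are hypothetical, chosen only to show the shape of what the Builder and CEO would each consume:

```python
from dataclasses import dataclass, field

@dataclass
class E2EVerdict:
    """Illustrative verdict schema — field names are assumptions, not a fixed API."""
    passed: bool
    score: float                                         # 0.0–1.0 scored dimension, not a binary gate
    failures: list[str] = field(default_factory=list)    # what failed, and where
    fix_hints: list[str] = field(default_factory=list)   # suggested fix direction for the Builder
    artifacts: list[str] = field(default_factory=list)   # paths to collected logs/outputs/screenshots
```

The Builder iterates on `failures` and `fix_hints`; the CEO weighs `score` and `artifacts` as evidence in the keep/revert decision.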
Key interaction
CEO → Builder (implement) → E2E Agent (run + evaluate)
                                ↓ fail
              Builder (fix) → E2E Agent (re-evaluate)
                                ↓ pass
          CEO (keep/revert decision with E2E evidence)
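The loop in the diagram above can be sketched as an orchestration function. The agent objects and their methods (`implement`, `evaluate`, `fix`, `decide`) are hypothetical, assuming `evaluate()` returns a structured verdict with a `passed` flag:

```python
def run_experiment(builder, e2e_agent, ceo, max_fix_rounds: int = 3):
    """Build, evaluate end-to-end, re-trigger the Builder on failure,
    then hand the final verdict to the CEO (sketch, not a fixed API)."""
    builder.implement()
    verdict = e2e_agent.evaluate()
    rounds = 0
    while not verdict.passed and rounds < max_fix_rounds:
        builder.fix(verdict)                # targeted fixes from structured feedback
        verdict = e2e_agent.evaluate()      # re-run the project, re-score it
        rounds += 1
    return ceo.decide(verdict)              # keep/revert with E2E evidence attached
```

Bounding the fix rounds keeps a persistently failing experiment from looping forever; after the cap, the CEO still gets the failing verdict and can revert.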