
Conversation

@kfinkels

Summary

  • Implemented comprehensive evaluation infrastructure using PromptFoo to test spec and plan template quality
  • Added custom annotation tool with FastHTML-based UI for manual spec evaluation
  • Integrated automated error analysis for iterative prompt refinement
  • Achieved 90% evaluation pass rate through systematic prompt improvements
  • Added GitHub Actions workflow for continuous evaluation on pull requests
  • Created dataset of 17 real specs and 2 real plans for testing
  • Made LLM model configurable via environment variables
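
On the last point, model selection is expected to be read from environment variables roughly along these lines (a minimal sketch; the variable names and defaults below are assumptions, not necessarily the ones this PR uses):

```python
import os

# Illustrative only; the actual env var names and defaults may differ.
LLM_PROVIDER = os.environ.get("LLM_PROVIDER", "anthropic")
LLM_MODEL = os.environ.get("LLM_MODEL", "claude-sonnet-4-5")

def provider_id() -> str:
    """Build a PromptFoo-style provider id, e.g. 'anthropic:claude-sonnet-4-5'."""
    return f"{LLM_PROVIDER}:{LLM_MODEL}"
```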

Key Features

  • PromptFoo Integration: Automated evaluation framework with custom graders for assessing spec/plan quality
  • Annotation Tool: Interactive web UI for reviewing and annotating evaluation results
  • Error Analysis: Automated analysis of failures with actionable improvement recommendations
  • GitHub Actions: CI/CD integration for running evaluations on PRs
  • Comprehensive Documentation: Setup guides, workflows, and quick reference materials
  • Model Flexibility: Configurable LLM provider and model selection

Test plan

  • Run evaluation locally using evals/scripts/run-promptfoo-eval.sh
  • Verify GitHub Actions workflow executes successfully on PR
  • Test annotation tool with evals/scripts/run-annotation-tool.sh
  • Review automated error analysis output
  • Confirm evaluation scores meet minimum thresholds (60% for spec, 75% for plan)
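
For reference, PromptFoo custom graders written in Python expose a `get_assert(output, context)` hook. A minimal threshold-style sketch is shown below; the section names, scoring, and threshold are placeholders, not the actual graders under `evals/`:

```python
# Hypothetical PromptFoo Python assertion. PromptFoo calls get_assert(output, context)
# and accepts a dict with pass/score/reason; the checks here are placeholders.

REQUIRED_SECTIONS = ["## User Scenarios", "## Requirements", "## Success Criteria"]

def get_assert(output: str, context: dict) -> dict:
    found = [s for s in REQUIRED_SECTIONS if s in output]
    score = len(found) / len(REQUIRED_SECTIONS)
    threshold = 0.6  # e.g. 0.6 for spec outputs, 0.75 for plan outputs
    return {
        "pass": score >= threshold,
        "score": score,
        "reason": f"{len(found)}/{len(REQUIRED_SECTIONS)} required sections present",
    }
```

Returning a pass/score/reason dict is what lets the CI step compare scores against the 60%/75% thresholds noted above.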

kanfil and others added 30 commits October 1, 2025 16:47
…stitution handling

- Added logic to setup-plan.ps1 to handle constitution and team directives file paths, ensuring they are set in the environment.
- Implemented sync_team_ai_directives function in specify_cli to clone or update the team-ai-directives repository.
- Updated init command in specify_cli to accept a team-ai-directives repository URL and sync it during project initialization.
- Enhanced command templates (implement.md, levelup.md, plan.md, specify.md, tasks.md) to incorporate checks for constitution and team directives.
- Created new levelup command to capture learnings and draft knowledge assets post-implementation.
- Improved task generation to include execution modes (SYNC/ASYNC) based on the implementation plan (see the sketch after this list).
- Added tests for new functionality, including syncing team directives and validating outputs from setup and levelup scripts.
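
A rough illustration of the SYNC/ASYNC tagging idea; the heuristic and field names below are assumptions for illustration, not the template's actual logic:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    description: str
    depends_on: list[str] = field(default_factory=list)
    mode: str = "SYNC"

def assign_execution_modes(tasks: list[Task]) -> list[Task]:
    # Illustrative heuristic: tasks with no plan dependencies can run ASYNC
    # (in parallel); anything that depends on earlier work stays SYNC.
    for task in tasks:
        task.mode = "ASYNC" if not task.depends_on else "SYNC"
    return tasks
```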
# Conflicts:
#	.github/workflows/scripts/create-github-release.sh
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
This reverts commit 952c676.
kanfil and others added 30 commits October 25, 2025 22:05
…optimized testing, GitHub issues integration, and code quality automation
…ized testing infrastructure, GitHub issues integration, and code quality automation
… placeholders

- Update /specify command template to include context population instructions
- Modify create-new-feature.sh to intelligently populate context.md fields
- Add mode-aware context population (build vs spec modes) (see the sketch after this commit note)
- Update PowerShell equivalent script
- Fix bash syntax error in check-prerequisites.sh
- Ensure context.md passes validation without [NEEDS INPUT] markers

Closes the context.md population bug that was blocking the basic workflow.
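
The real change lives in create-new-feature.sh and its PowerShell counterpart; a loose Python rendering of the mode-aware idea, with guessed marker names, might look like:

```python
PLACEHOLDER = "[NEEDS INPUT]"

def populate_context(template: str, feature_description: str, mode: str) -> str:
    # Guessed marker names; the real script edits context.md directly in bash/PowerShell.
    defaults = {
        "[FEATURE DESCRIPTION]": feature_description,
        "[MODE]": mode,  # "build" or "spec"
        # Spec mode gets a neutral default instead of [NEEDS INPUT] so validation passes.
        "[IMPLEMENTATION NOTES]": "n/a (spec mode)" if mode == "spec" else "TBD during /plan",
    }
    for marker, value in defaults.items():
        template = template.replace(marker, value)
    if PLACEHOLDER in template:
        raise ValueError("context.md still contains [NEEDS INPUT] markers")
    return template
```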
- Updated quickstart guide to clarify the new automation scripts in Bash and PowerShell, including step-by-step instructions for project initialization and specification creation.
- Revised upgrade documentation to improve clarity on handling existing directories and agent setup.
- Refactored Bash and PowerShell scripts for creating new features to streamline branch number retrieval and improve error handling.
- Added support for new agents (Qoder CLI and IBM Bob) in context update scripts and CLI initialization.
- Improved checklist and requirement templates to ensure clarity and completeness in specifications.
- Enhanced agent configuration to include new agents with appropriate metadata.
- Added cautionary notes to the task-to-issues template to prevent issue creation in incorrect repositories.
… of spec and plan template outputs using PromptFoo with Claude Sonnet 4.5.
  - Create run-auto-error-analysis.sh script for automated spec evaluation
  - Add run-automated-error-analysis.py with Claude-powered categorization (sketch below)
  - Evaluate specs with binary pass/fail and failure categorization
  - Generate detailed CSV reports and summary files
  - Update .gitignore to exclude analysis results
  - Document automated and manual error analysis workflows in README
  - Mark Week 1 (Error Analysis Foundation) as completed in workplan

  Provides two error analysis options:
  1. Automated (Claude API) - fast, batch evaluation
  2. Manual (Jupyter) - deep investigation and exploration
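
A minimal sketch of the automated (Claude API) path using the Anthropic Python SDK; the categories, model id, and CSV columns below are illustrative guesses rather than what run-automated-error-analysis.py actually uses:

```python
# Rough sketch of the Claude-powered categorization step described above.
import csv
import anthropic  # pip install anthropic; expects ANTHROPIC_API_KEY in the environment

CATEGORIES = ["missing requirement", "ambiguous scope", "template deviation", "other"]

client = anthropic.Anthropic()

def categorize_failure(spec_excerpt: str, failure_reason: str) -> str:
    prompt = (
        "Classify this spec evaluation failure into exactly one category "
        f"from {CATEGORIES}.\n\nFailure: {failure_reason}\n\nSpec excerpt:\n{spec_excerpt}"
    )
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # model id is an assumption based on the PR description
        max_tokens=64,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip()

def write_report(rows: list[dict], path: str = "analysis-results.csv") -> None:
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["spec", "verdict", "category", "reason"])
        writer.writeheader()
        writer.writerows(rows)
```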
Implement a keyboard-driven web interface for reviewing generated specs, providing a 10x faster review workflow. Includes auto-save, progress tracking, and JSON export capabilities. Update documentation with a complete annotation tool guide and usage instructions.
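
For orientation, the general shape of such a FastHTML review loop might look like the sketch below; the routes, input file, and plain-link navigation are simplifying assumptions, and the real tool adds keyboard shortcuts, auto-save, progress tracking, and JSON export:

```python
# Heavily simplified sketch in the spirit of the annotation tool, not its actual code.
import json
from pathlib import Path
from fasthtml.common import *  # pip install python-fasthtml

RESULTS = json.loads(Path("eval-results.json").read_text())  # assumed input file
annotations: dict[int, str] = {}

app, rt = fast_app()

@rt("/")
def index(idx: int = 0):
    spec = RESULTS[idx]
    return Titled(
        f"Spec {idx + 1} of {len(RESULTS)}",
        Pre(spec["output"]),
        # The real tool is keyboard-driven; plain links keep the sketch small.
        A("Pass", href=f"/annotate?idx={idx}&verdict=pass"),
        A("Fail", href=f"/annotate?idx={idx}&verdict=fail"),
    )

@rt("/annotate")
def annotate(idx: int, verdict: str):
    annotations[idx] = verdict  # auto-save would persist this to disk
    return RedirectResponse(f"/?idx={min(idx + 1, len(RESULTS) - 1)}", status_code=303)

serve()
```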