
Conversation

@kfinkels

Summary

  • Implemented comprehensive evaluation infrastructure using PromptFoo to test spec and plan template quality
  • Added custom annotation tool with FastHTML-based UI for manual spec evaluation
  • Integrated automated error analysis for iterative prompt refinement
  • Achieved 90% evaluation pass rate through systematic prompt improvements
  • Added GitHub Actions workflow for continuous evaluation on pull requests
  • Created dataset of 17 real specs and 2 real plans for testing
  • Made LLM model configurable via environment variables
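
On the last point, model selection is expected to be read from environment variables roughly along these lines (a minimal sketch; the variable names and defaults below are assumptions, not necessarily the ones this PR uses):

```python
import os

# Illustrative only; the actual env var names and defaults may differ.
LLM_PROVIDER = os.environ.get("LLM_PROVIDER", "anthropic")
LLM_MODEL = os.environ.get("LLM_MODEL", "claude-sonnet-4-5")

def provider_id() -> str:
    """Build a PromptFoo-style provider id, e.g. 'anthropic:claude-sonnet-4-5'."""
    return f"{LLM_PROVIDER}:{LLM_MODEL}"
```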

Key Features

  • PromptFoo Integration: Automated evaluation framework with custom graders for assessing spec/plan quality
  • Annotation Tool: Interactive web UI for reviewing and annotating evaluation results
  • Error Analysis: Automated analysis of failures with actionable improvement recommendations
  • GitHub Actions: CI/CD integration for running evaluations on PRs
  • Comprehensive Documentation: Setup guides, workflows, and quick reference materials
  • Model Flexibility: Configurable LLM provider and model selection

Test plan

  • Run evaluation locally using evals/scripts/run-promptfoo-eval.sh
  • Verify GitHub Actions workflow executes successfully on PR
  • Test annotation tool with evals/scripts/run-annotation-tool.sh
  • Review automated error analysis output
  • Confirm evaluation scores meet minimum thresholds (60% for spec, 75% for plan)
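
For reference, PromptFoo custom graders written in Python expose a `get_assert(output, context)` hook. A minimal threshold-style sketch is shown below; the section names, scoring, and threshold are placeholders, not the actual graders under `evals/`:

```python
# Hypothetical PromptFoo Python assertion. PromptFoo calls get_assert(output, context)
# and accepts a dict with pass/score/reason; the checks here are placeholders.

REQUIRED_SECTIONS = ["## User Scenarios", "## Requirements", "## Success Criteria"]

def get_assert(output: str, context: dict) -> dict:
    found = [s for s in REQUIRED_SECTIONS if s in output]
    score = len(found) / len(REQUIRED_SECTIONS)
    threshold = 0.6  # e.g. 0.6 for spec outputs, 0.75 for plan outputs
    return {
        "pass": score >= threshold,
        "score": score,
        "reason": f"{len(found)}/{len(REQUIRED_SECTIONS)} required sections present",
    }
```

Returning a pass/score/reason dict is what lets the CI step compare scores against the 60%/75% thresholds noted above.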

kanfil and others added 30 commits October 1, 2025 16:47
…stitution handling

- Added logic to setup-plan.ps1 to handle constitution and team directives file paths, ensuring they are set in the environment.
- Implemented sync_team_ai_directives function in specify_cli to clone or update the team-ai-directives repository.
- Updated init command in specify_cli to accept a team-ai-directives repository URL and sync it during project initialization.
- Enhanced command templates (implement.md, levelup.md, plan.md, specify.md, tasks.md) to incorporate checks for constitution and team directives.
- Created new levelup command to capture learnings and draft knowledge assets post-implementation.
- Improved task generation to include execution modes (SYNC/ASYNC) based on the implementation plan (see the sketch after this list).
- Added tests for new functionality, including syncing team directives and validating outputs from setup and levelup scripts.
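
A rough illustration of the SYNC/ASYNC tagging idea; the heuristic and field names below are assumptions for illustration, not the template's actual logic:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    description: str
    depends_on: list[str] = field(default_factory=list)
    mode: str = "SYNC"

def assign_execution_modes(tasks: list[Task]) -> list[Task]:
    # Illustrative heuristic: tasks with no plan dependencies can run ASYNC
    # (in parallel); anything that depends on earlier work stays SYNC.
    for task in tasks:
        task.mode = "ASYNC" if not task.depends_on else "SYNC"
    return tasks
```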
# Conflicts:
#	.github/workflows/scripts/create-github-release.sh
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
This reverts commit 952c676.
kanfil and others added 30 commits October 25, 2025 22:05
…optimized testing, GitHub issues integration, and code quality automation
…ized testing infrastructure, GitHub issues integration, and code quality automation
… placeholders

- Update /specify command template to include context population instructions
- Modify create-new-feature.sh to intelligently populate context.md fields
- Add mode-aware context population (build vs spec modes) (see the sketch after this commit note)
- Update PowerShell equivalent script
- Fix bash syntax error in check-prerequisites.sh
- Ensure context.md passes validation without [NEEDS INPUT] markers

Closes the context.md population bug that was blocking the basic workflow.
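
The real change lives in create-new-feature.sh and its PowerShell counterpart; a loose Python rendering of the mode-aware idea, with guessed marker names, might look like:

```python
PLACEHOLDER = "[NEEDS INPUT]"

def populate_context(template: str, feature_description: str, mode: str) -> str:
    # Guessed marker names; the real script edits context.md directly in bash/PowerShell.
    defaults = {
        "[FEATURE DESCRIPTION]": feature_description,
        "[MODE]": mode,  # "build" or "spec"
        # Spec mode gets a neutral default instead of [NEEDS INPUT] so validation passes.
        "[IMPLEMENTATION NOTES]": "n/a (spec mode)" if mode == "spec" else "TBD during /plan",
    }
    for marker, value in defaults.items():
        template = template.replace(marker, value)
    if PLACEHOLDER in template:
        raise ValueError("context.md still contains [NEEDS INPUT] markers")
    return template
```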
- Updated quickstart guide to clarify the new automation scripts in Bash and PowerShell, including step-by-step instructions for project initialization and specification creation.
- Revised upgrade documentation to improve clarity on handling existing directories and agent setup.
- Refactored Bash and PowerShell scripts for creating new features to streamline branch number retrieval and improve error handling.
- Added support for new agents (Qoder CLI and IBM Bob) in context update scripts and CLI initialization.
- Improved checklist and requirement templates to ensure clarity and completeness in specifications.
- Enhanced agent configuration to include new agents with appropriate metadata.
- Added cautionary notes to the task-to-issues template to prevent issue creation in incorrect repositories.
… of spec and plan template outputs using PromptFoo with Claude Sonnet 4.5.
  - Create run-auto-error-analysis.sh script for automated spec evaluation
  - Add run-automated-error-analysis.py with Claude-powered categorization (sketch below)
  - Evaluate specs with binary pass/fail and failure categorization
  - Generate detailed CSV reports and summary files
  - Update .gitignore to exclude analysis results
  - Document automated and manual error analysis workflows in README
  - Mark Week 1 (Error Analysis Foundation) as completed in workplan

  Provides two error analysis options:
  1. Automated (Claude API) - fast, batch evaluation
  2. Manual (Jupyter) - deep investigation and exploration
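
A minimal sketch of the automated (Claude API) path using the Anthropic Python SDK; the categories, model id, and CSV columns below are illustrative guesses rather than what run-automated-error-analysis.py actually uses:

```python
# Rough sketch of the Claude-powered categorization step described above.
import csv
import anthropic  # pip install anthropic; expects ANTHROPIC_API_KEY in the environment

CATEGORIES = ["missing requirement", "ambiguous scope", "template deviation", "other"]

client = anthropic.Anthropic()

def categorize_failure(spec_excerpt: str, failure_reason: str) -> str:
    prompt = (
        "Classify this spec evaluation failure into exactly one category "
        f"from {CATEGORIES}.\n\nFailure: {failure_reason}\n\nSpec excerpt:\n{spec_excerpt}"
    )
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # model id is an assumption based on the PR description
        max_tokens=64,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip()

def write_report(rows: list[dict], path: str = "analysis-results.csv") -> None:
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["spec", "verdict", "category", "reason"])
        writer.writeheader()
        writer.writerows(rows)
```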
Implement a keyboard-driven web interface for reviewing generated specs, providing a 10x faster review workflow. Includes auto-save, progress tracking, and JSON export capabilities. Update documentation with a complete annotation tool guide and usage instructions.
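
For orientation, the general shape of such a FastHTML review loop might look like the sketch below; the routes, input file, and plain-link navigation are simplifying assumptions, and the real tool adds keyboard shortcuts, auto-save, progress tracking, and JSON export:

```python
# Heavily simplified sketch in the spirit of the annotation tool, not its actual code.
import json
from pathlib import Path
from fasthtml.common import *  # pip install python-fasthtml

RESULTS = json.loads(Path("eval-results.json").read_text())  # assumed input file
annotations: dict[int, str] = {}

app, rt = fast_app()

@rt("/")
def index(idx: int = 0):
    spec = RESULTS[idx]
    return Titled(
        f"Spec {idx + 1} of {len(RESULTS)}",
        Pre(spec["output"]),
        # The real tool is keyboard-driven; plain links keep the sketch small.
        A("Pass", href=f"/annotate?idx={idx}&verdict=pass"),
        A("Fail", href=f"/annotate?idx={idx}&verdict=fail"),
    )

@rt("/annotate")
def annotate(idx: int, verdict: str):
    annotations[idx] = verdict  # auto-save would persist this to disk
    return RedirectResponse(f"/?idx={min(idx + 1, len(RESULTS) - 1)}", status_code=303)

serve()
```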