feat: creation workflow quality guidance system #7
Merged
…omptlet

Replaced ~308 lines of fragile regex-based policy extraction with a schema-driven reasoning-agent promptlet architecture. This architectural improvement makes the system more reliable and leverages agent reasoning capabilities instead of brittle pattern matching.

Background: Started by investigating a hanging integration test (test_very_large_data_handling, where 600KB of text caused regex backtracking). The investigation revealed that the regex extraction approach was architecturally wrong: agents should reason about policies from the schema, not rely on pattern matching to extract them.

Architectural change:
- Removed 7 regex extraction methods (~308 lines total):
  * _suggest_policy_from_alternatives
  * _suggest_import_policies
  * _suggest_pattern_policies
  * _suggest_architecture_policies
  * _suggest_config_policies
  * _suggest_rationales
  * _normalize_library_name
- Replaced _generate_policy_guidance() with a promptlet providing:
  * agent_task with role, objective, and 5 reasoning steps
  * policy_capabilities (full schema documentation)
  * example_workflow (concrete scenario showing the decision → policy mapping)
  * guidance (dos/don'ts for constraint extraction)

This implements Task 2 (Policy Construction) of the two-step reasoning flow. Task 1 (Decision Creation) guidance is documented in a new DEC backlog task.

Test updates:
- Removed 14 regex-based integration tests (TestPolicySuggestionLogic class)
- Removed 3 regex-based unit tests (TestPolicySuggestion class)
- Added 3 new promptlet-validation tests
- All 161 tests passing in 4.93s (was hanging before)
- Coverage increased from 21% to 49%

Performance:
- test_very_large_data_handling: passes in 0.40s (was hanging indefinitely)
- Full test suite: 4.93s (130 unit + 31 integration)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implement complete decision quality guidance to strengthen ADR creation
workflow. This establishes the foundation for high-quality architectural
decisions that enable effective policy extraction.
## What Changed
**New Decision Guidance Module** (`decision_guidance.py`):
- Comprehensive promptlet with ADR structure explanation
- Quality criteria (specific, actionable, complete, policy-ready, balanced)
- Good vs bad examples for database, frontend, and generic decisions
- Anti-patterns with fixes (vague, one-sided, missing context, etc.)
- Connection between Task 1 (decision quality) and Task 2 (policy extraction)
- Dos/don'ts and workflow guidance
**Enhanced Creation Workflow** (`creation.py`):
- New `_assess_decision_quality()` method with scoring system (0-100)
- Detects 6 quality dimensions:
1. Specificity (generic terms vs specific tech names)
2. Balance (pros AND cons documented)
3. Context quality (explains WHY)
4. Explicit constraints (for policy extraction)
5. Alternatives documentation (enables disallow policies)
6. Decision completeness
- Returns quality_feedback with:
- Quality score and grade (A-F)
- Issues found with severity and suggestions
- Strengths recognized
- Prioritized recommendations
- Context-aware next steps
- Improved validation error messages with examples
**Enhanced MCP Tool** (`server.py`):
- Expanded `adr_create` docstring with inline guidance:
- ADR structure (Context/Decision/Consequences/Alternatives)
- Quality guidelines (be specific, document trade-offs, explain WHY)
- Explicit constraint language examples
- Response contents explanation
**Comprehensive Tests**:
- 14 unit tests for decision guidance module
- 12 integration tests for quality assessment
- All existing tests still pass (11 in test_workflow_creation.py)
- 80% code coverage for creation.py
## Why This Matters
**Foundation Enhancement**: Decision quality directly impacts:
- Policy extraction effectiveness (Task 2)
- Agent understanding of constraints
- Future decision reasoning
- Automated enforcement reliability
**Two-Step Creation Flow**:
1. Task 1 (NEW): Guide agents to write high-quality decisions
2. Task 2 (existing): Extract enforceable policies from decisions
Good Task 1 output makes Task 2 trivial. Example:
- Bad: "Use a modern framework"
- Good: "Use FastAPI. Don't use Flask or Django."
→ Enables: {'imports': {'disallow': ['flask', 'django']}}
**Agent Experience**:
- Before: Minimal guidance, vague validation errors
- After: Inline structure guide, quality scoring, actionable feedback
## Implementation Details
**Scoring System**:
- Start at 100, deduct for issues:
- Vague terms: -15
- One-sided consequences: -25
- Weak context: -20
- No explicit constraints: -15
- Missing alternatives: -15
- Too brief: -10
- Grades: A (90+), B (75+), C (60+), D (40+), F (<40)
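The deductions and grade boundaries above can be sketched as a small scoring helper. The penalty values and letter cutoffs come from this description; the function and issue-key names are assumptions, not the actual `_assess_decision_quality()` internals.

```python
# Sketch of the scoring scheme described above. Deduction values and grade
# boundaries come from the PR text; function and key names are assumptions.
DEDUCTIONS = {
    "vague_terms": 15,
    "one_sided_consequences": 25,
    "weak_context": 20,
    "no_explicit_constraints": 15,
    "missing_alternatives": 15,
    "too_brief": 10,
}

def quality_score(issues: list[str]) -> int:
    """Start at 100 and deduct a fixed penalty for each detected issue."""
    return max(0, 100 - sum(DEDUCTIONS[issue] for issue in issues))

def grade(score: int) -> str:
    """Map a 0-100 score to the letter grades listed above."""
    for cutoff, letter in ((90, "A"), (75, "B"), (60, "C"), (40, "D")):
        if score >= cutoff:
            return letter
    return "F"
```

Under this scheme a decision with only vague terms still earns a B (85), while one-sided consequences plus weak context alone drop it to a D (55).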
**Quality Checks**:
- Pattern matching for generic terms
- Keyword detection for balance (pros AND cons)
- Regex for explicit constraints ("don't use", "must have")
- Length checks for completeness
- Alternatives presence validation
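The explicit-constraint check, for instance, could be a small regex scan like the sketch below. Only the phrases "don't use" and "must have" appear in the text above; the exact patterns in creation.py are assumptions.

```python
import re

# Sketch of the explicit-constraint detector described above. The two phrases
# come from the PR text; any further patterns would be assumptions.
CONSTRAINT_PATTERNS = [
    re.compile(r"\bdon'?t\s+use\b", re.IGNORECASE),
    re.compile(r"\bmust\s+have\b", re.IGNORECASE),
]

def has_explicit_constraints(text: str) -> bool:
    """Return True if the decision text states an enforceable constraint."""
    return any(pattern.search(text) for pattern in CONSTRAINT_PATTERNS)
```

This is the kind of check that stays cheap and deterministic, in contrast to the removed policy-extraction regexes: it only flags whether constraint language is present, never tries to parse the constraint itself.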
**Feedback Structure**:
```
{
  "quality_score": 85,
  "grade": "B",
  "issues": [{category, severity, issue, suggestion, example_fix}],
  "strengths": ["..."],
  "recommendations": ["..."],
  "next_steps": ["..."]
}
```
## Testing
All tests pass (26 new + 11 existing = 37 total):
- Unit tests verify guidance structure completeness
- Integration tests validate quality assessment accuracy
- Edge cases covered (vague, one-sided, missing context, etc.)
- Existing creation workflow tests unaffected
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…tion loop

BREAKING CHANGE: The quality gate now blocks ADR creation if the score is < 75.

Previously, the quality assessment ran AFTER creating the ADR file, which led to file pollution when agents needed to revise low-quality decisions. The new flow enables a clean correction loop:
1. Agent submits an ADR creation request
2. Quality gate runs deterministic checks (BEFORE file I/O)
3. If score < 75: return REQUIRES_ACTION with feedback, no file created
4. Agent revises and resubmits (correction loop)
5. Only create the ADR file once quality passes the threshold

## Changes

### Core Workflow (`adr_kit/workflows/creation.py`)
- Add `_quick_quality_gate()` method for pre-validation quality checks
- Refactor `execute()` to run the quality gate BEFORE `_generate_adr_id()`
- Return `WorkflowStatus.REQUIRES_ACTION` when quality < threshold
- Add `skip_quality_gate` parameter to `CreationInput` for test override

### Enum Extension (`adr_kit/workflows/base.py`)
- Add `WorkflowStatus.REQUIRES_ACTION` status for quality gate failures

### Test Updates
- Update test_decision_quality_assessment.py: expect `success=False` + `REQUIRES_ACTION`
- Add `skip_quality_gate=True` to test fixtures that use minimal inputs
- Improve the `sample_creation_input` fixture to be high-quality (passes the gate)

## Quality Threshold
- B grade (75/100) minimum required
- Scoring: Specificity (15), Balance (25), Context (20), Constraints (15), Alternatives (15), Completeness (10)

## Backward Compatibility
- Tests can set `skip_quality_gate=True` to bypass validation
- When the quality gate is skipped, a placeholder feedback structure is returned

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
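The correction loop above can be sketched as follows. Only `REQUIRES_ACTION`, `skip_quality_gate`, and the 75-point threshold come from this commit; `WorkflowResult`, the callables, and everything else are simplified stand-ins for the real workflow classes.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace

# Illustrative sketch of the gate-before-file-I/O flow. Only REQUIRES_ACTION,
# skip_quality_gate, and the threshold of 75 come from the commit message;
# the rest is a simplified stand-in for the real adr-kit workflow classes.
@dataclass
class WorkflowResult:
    success: bool
    status: str                      # "REQUIRES_ACTION" or "COMPLETED"
    feedback: dict = field(default_factory=dict)

def execute(input_data, quality_gate, create_adr_file, threshold: int = 75):
    score, feedback = quality_gate(input_data)
    if not input_data.skip_quality_gate and score < threshold:
        # Nothing touches disk: the agent revises and resubmits.
        return WorkflowResult(False, "REQUIRES_ACTION", feedback)
    create_adr_file(input_data)      # file I/O happens only after the gate passes
    return WorkflowResult(True, "COMPLETED", feedback)

# Example input: a draft that has not opted out of the gate.
draft = SimpleNamespace(skip_quality_gate=False)
```

A below-threshold submission returns `REQUIRES_ACTION` with the feedback attached and no file created, which is exactly what makes the loop pollution-free.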
Added skip_quality_gate=True to integration tests that use minimal ADR content for testing error scenarios, edge cases, and workflow integration rather than testing decision quality.

**Tests Fixed (14 failing → passing):**

1. test_comprehensive_scenarios.py (3 tests):
   - test_disk_full_simulation: Testing disk I/O errors
   - test_malformed_input_data: Testing malformed policy handling
   - test_unicode_and_encoding_handling: Testing Unicode support

2. test_workflow_creation.py (8 tests):
   - test_conflict_detection: Testing conflict detection logic
   - test_policy_integration: Testing policy block handling
   - test_very_long_title_handling: Testing title length limits
   - test_special_characters_in_title: Testing filename sanitization
   - test_semantic_similarity_detection: Testing similarity matching
   - test_incremental_id_generation: Testing ID generation
   - Second ADR in test_incremental_id_generation

3. test_mcp_workflow_integration.py (4 tests):
   - test_mcp_create_integration: Testing MCP request translation
   - test_mcp_approve_integration: Testing approval workflow
   - test_mcp_supersede_integration: Testing supersede workflow
   - test_end_to_end_workflow_chain: Testing analyze → create → approve

**Why This Fix:**
These tests validate workflow mechanics (error handling, ID generation, MCP integration, etc.), not decision quality. The quality gate would block them from reaching the code paths they are designed to test.

**Note:** The sample_creation_input fixture already has high-quality content and passes the quality gate without the skip_quality_gate flag.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed an issue where the Consequences section was written empty due to circular content parsing in the _generate_madr_content method.

**Problem:**
- _build_adr_structure built content with formatted sections
- _generate_madr_content tried to parse those sections back using the adr.context/adr.decision/adr.consequences properties
- These properties use ParsedContent, which re-parses the content
- This circular parsing was failing, resulting in empty sections

**Solution:**
- Simplified _generate_madr_content to use adr.content directly
- Removed the redundant parsing and rebuilding logic
- Content is now built once in _build_adr_structure and used as-is

**Testing:**
- All 187 tests passing (144 unit + 43 integration)
- test_successful_adr_creation now passes with full content

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
README audit session 3, addressing accumulated drift from implementation.

Accuracy fixes:
- Corrected MCP Tools table layer assignments (5 tools had wrong layers)
- Fixed FAQ layer references to use the correct layer names
- Terminology: "update ADR" → "supersede" (ADRs are immutable records); "creates ADR" → "proposes ADR" (adr_create proposes with status: proposed)
- Added migration awareness to the supersede examples
- Updated the policy schema to match the current Pydantic models (architecture replaces boundaries; config_enforcement with typescript/python)

Structural improvements:
- Replaced the ASCII flow diagram with a Mermaid flowchart color-coded by layer
- Restructured Quick Start: universal flow first, collapsible brownfield details via an HTML details tag
- Moved Current Capabilities next to What's Coming for coherence
- Consolidated "Writing ADRs for Constraint Extraction" + "ADR Format" into a single "How ADRs Get Their Policies" section reflecting the agent-driven two-step creation flow (quality gate + policy guidance)

Dedup:
- Removed 3 redundant sections (Example Complete Lifecycle, Example Conversations, Discovering Implicit Decisions)
- Trimmed the Layer 1 Deep Dive to unique content (supersede flow, quality gate)
- Removed the pattern-matching fallback documentation

README reduced from 989 to ~790 lines.
MCP integration tests failed because CreateADRRequest and SupersedeADRRequest didn't expose skip_quality_gate, causing the quality gate to reject minimal test inputs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary

Changes:
- `decision_guidance.py`: quality assessment framework with scoring and actionable feedback

Test plan:
- `make test-unit` passes
- `make test-integration` passes
- `make test-all` passes
- `uv run adr-kit mcp-server` starts successfully

🤖 Generated with Claude Code