feat(v3): AI testing, graph hardening, and benchmark infrastructure #95
Merged
Conversation
…ge types

- Remove node types never created by Build(): Package, Service, GeneratedArtifact, ExternalService, Fixture, Helper.
- Remove edge types never created: TestUsesFixture, TestUsesHelper, FixtureImportsSource, HelperImportsSource, Validates, TestExercises, DependsOnService.
- Update all consumers: coverage single-pass loop, duplicate fingerprinting, stability clustering, graph queries, and tests.
- Schema after hardening: 20 node types, 15 edge types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove packages never imported by any production code:

- internal/reasoning/ (~1,300 lines) — superseded by domain-specific reasoning in depgraph, impact, stability, matrix
- internal/assertion/ (~400 lines) — superseded by quality detectors
- internal/clustering/ (~300 lines) — superseded by stability clustering
- internal/envdepth/ (~350 lines) — superseded by matrix analysis
- internal/failure/ (~400 lines) — superseded by health detectors
- internal/suppression/ (~350 lines) — feature deferred

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add SurfacePrompt and SurfaceDataset kinds to CodeSurfaceKind. Detect prompt surfaces by naming convention (*Prompt*, *Template*, *PROMPT*) and dataset surfaces (*Dataset*, *Dataloader*, *TrainingData*, *EvalData*) in both JS/TS and Python extractors. 5 new tests covering JS and Python prompt/dataset detection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
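The naming-convention check described above can be sketched roughly as follows. This is an illustrative reduction, not Terrain's actual extractor: the `CodeSurfaceKind` values and the `classifySurface` helper are hypothetical names, and the real JS/TS and Python extractors match identifiers in parsed source rather than bare strings.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical surface kinds; the real CodeSurfaceKind enum may differ.
type CodeSurfaceKind string

const (
	SurfacePrompt  CodeSurfaceKind = "prompt"
	SurfaceDataset CodeSurfaceKind = "dataset"
	SurfaceNone    CodeSurfaceKind = ""
)

// classifySurface applies the naming conventions from the commit:
// *Prompt*, *Template*, *PROMPT* → prompt surface;
// *Dataset*, *Dataloader*, *TrainingData*, *EvalData* → dataset surface.
// Lowercasing makes the match case-insensitive, covering PROMPT etc.
func classifySurface(name string) CodeSurfaceKind {
	lower := strings.ToLower(name)
	switch {
	case strings.Contains(lower, "prompt"), strings.Contains(lower, "template"):
		return SurfacePrompt
	case strings.Contains(lower, "dataset"),
		strings.Contains(lower, "dataloader"),
		strings.Contains(lower, "trainingdata"),
		strings.Contains(lower, "evaldata"):
		return SurfaceDataset
	}
	return SurfaceNone
}

func main() {
	fmt.Println(classifySurface("SYSTEM_PROMPT"))   // prompt
	fmt.Println(classifySurface("evalDatasetPath")) // dataset
}
```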
- ImpactedScenario type and findImpactedScenarios() — maps changed prompt/dataset surfaces to impacted eval scenarios
- ScenarioExplanation and ExplainScenario() — explains why a scenario is impacted with changed surface list
- scenarioDuplicationFindings() — detects >50% surface overlap between scenario pairs in insights
- Scenario loading from .terrain/terrain.yaml with ToScenarios()
- LoadTerrainConfig checks .terrain/ subdirectory first
- 7 new tests for AI impact path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
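A minimal sketch of the >50% overlap rule behind the duplication finding, under the assumption that each scenario carries a list of surface identifiers and that overlap is measured against the smaller set. The function name and the exact overlap definition are assumptions, not Terrain's internals.

```go
package main

import "fmt"

// surfaceOverlap returns the fraction of the smaller surface set that
// also appears in the other set. Two scenarios whose overlap exceeds
// 0.5 would be flagged as potential duplicates in this sketch.
func surfaceOverlap(a, b []string) float64 {
	set := make(map[string]bool, len(a))
	for _, s := range a {
		set[s] = true
	}
	shared := 0
	for _, s := range b {
		if set[s] {
			shared++
		}
	}
	smaller := len(a)
	if len(b) < smaller {
		smaller = len(b)
	}
	if smaller == 0 {
		return 0
	}
	return float64(shared) / float64(smaller)
}

func main() {
	a := []string{"prompt.ts", "dataset.json", "rubric.md"}
	b := []string{"prompt.ts", "dataset.json"}
	fmt.Println(surfaceOverlap(a, b) > 0.5) // true
}
```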
New package internal/gauntlet/ for ingesting Gauntlet execution results:

- Artifact model (scenarios, metrics, baselines, regressions)
- Ingest() with validation (version, provider, non-empty scenarios)
- ApplyToSnapshot() — matches scenarios, generates evalFailure and evalRegression signals
- 10 tests covering parsing, matching, signal generation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
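The validation described for Ingest() (version, provider, non-empty scenarios) could look roughly like the sketch below. The Artifact field set and error messages are assumptions; the real internal/gauntlet schema is richer (metrics, baselines, regressions).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Artifact is a cut-down stand-in for the real Gauntlet artifact model.
type Artifact struct {
	Version   string            `json:"version"`
	Provider  string            `json:"provider"`
	Scenarios []json.RawMessage `json:"scenarios"`
}

// Ingest parses an artifact and enforces the three checks named in the
// commit: version present, provider present, scenarios non-empty.
func Ingest(data []byte) (*Artifact, error) {
	var a Artifact
	if err := json.Unmarshal(data, &a); err != nil {
		return nil, fmt.Errorf("gauntlet artifact: %w", err)
	}
	switch {
	case a.Version == "":
		return nil, fmt.Errorf("gauntlet artifact: missing version")
	case a.Provider == "":
		return nil, fmt.Errorf("gauntlet artifact: missing provider")
	case len(a.Scenarios) == 0:
		return nil, fmt.Errorf("gauntlet artifact: no scenarios")
	}
	return &a, nil
}

func main() {
	good := []byte(`{"version":"1","provider":"promptfoo","scenarios":[{}]}`)
	a, err := Ingest(good)
	fmt.Println(a != nil, err == nil) // true true
}
```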
… code

New package internal/aidetect/ — users never need to write YAML scenarios.

Framework detection (12 frameworks): promptfoo, deepeval, ragas, langchain, langsmith, openai, anthropic, llamaindex, huggingface, vertexai, aws-bedrock. Detection methods: config files, dependency manifests, source imports.

Auto-scenario derivation from:
- Eval test files importing prompt/dataset surfaces
- Test files importing AI framework libraries
- Promptfoo config test cases

9 tests covering detection and derivation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
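One of the three detection methods, dependency-manifest scanning, can be sketched as below. The `frameworkDeps` table and `detectFromPackageJSON` are hypothetical; the real internal/aidetect package also inspects config files and source imports, and its matching rules for each framework are not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// frameworkDeps maps npm dependency names to framework IDs.
// Illustrative subset only; the real table covers all detected frameworks.
var frameworkDeps = map[string]string{
	"promptfoo": "promptfoo",
	"deepeval":  "deepeval",
	"langchain": "langchain",
	"openai":    "openai",
}

// detectFromPackageJSON reports which known AI frameworks appear in a
// package.json's dependencies or devDependencies.
func detectFromPackageJSON(data []byte) []string {
	var m struct {
		Dependencies    map[string]string `json:"dependencies"`
		DevDependencies map[string]string `json:"devDependencies"`
	}
	if err := json.Unmarshal(data, &m); err != nil {
		return nil
	}
	var found []string
	for dep, fw := range frameworkDeps {
		if _, ok := m.Dependencies[dep]; ok {
			found = append(found, fw)
		} else if _, ok := m.DevDependencies[dep]; ok {
			found = append(found, fw)
		}
	}
	return found
}

func main() {
	pkg := []byte(`{"devDependencies":{"promptfoo":"^0.80.0"}}`)
	fmt.Println(detectFromPackageJSON(pkg)) // [promptfoo]
}
```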
- terrain ai list: shows detected frameworks, auto-derived scenarios, prompt/dataset surfaces, eval files, model files
- terrain ai doctor: 5-point diagnostic with framework detection
- terrain ai run: detects eval framework, builds and executes command (promptfoo, deepeval, pytest, vitest delegation)
- terrain ai record: saves scenario state as baseline snapshot
- terrain ai baseline: shows baseline contents and comparison guidance

Pipeline wired with:
- AI framework detection at Step 2c
- Auto-scenario derivation (merged with manual YAML, no duplicates)
- Gauntlet artifact ingestion at Step 4c
- --gauntlet flag on terrain analyze

7 AI workflow integration tests + 5 CLI command tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename 'Tests Detected' to 'Validation Inventory' with new fields: CodeSurfaceCount, ScenarioCount, PromptCount, DatasetCount. Add stability hints section: shows runtime data guidance when no runtime artifacts provided, skip hints when skips detected. 5 edge case scenario tests (few tests, heavy manual, AI-heavy, flaky signals, report structure). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Graph struct gains typeIndex, familyIndex (populated during AddNode), neighborCache, outDegreeCache (populated lazily after Seal).

- NodesByType/NodesByFamily: O(n) scan+sort → O(1) index lookup (12ns)
- Nodes(): O(n log n) → O(1) cache hit (1.8ns)
- Neighbors(): O(e log e) → O(1) cache hit (23ns)

AnalyzeCoverage: merged dual Incoming() loops into single pass, removed dead pre-built edge index. AnalyzeImpact: uses OutDegree() cache instead of precomputing all. 12 benchmarks added covering build, query, and analysis operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
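The typeIndex optimization amounts to maintaining a per-type slice at insertion time so lookups avoid scanning all nodes. A minimal sketch, with illustrative types (the real Graph, Node, and index shapes in Terrain differ):

```go
package main

import "fmt"

type Node struct {
	ID   string
	Type string
}

type Graph struct {
	nodes     []Node
	typeIndex map[string][]Node // maintained during AddNode
}

func NewGraph() *Graph {
	return &Graph{typeIndex: make(map[string][]Node)}
}

// AddNode appends the node and updates the per-type index, so the
// index stays in insertion order with no post-hoc sort needed.
func (g *Graph) AddNode(n Node) {
	g.nodes = append(g.nodes, n)
	g.typeIndex[n.Type] = append(g.typeIndex[n.Type], n)
}

// NodesByType is a single map lookup instead of an O(n) scan+sort.
func (g *Graph) NodesByType(t string) []Node {
	return g.typeIndex[t]
}

func main() {
	g := NewGraph()
	g.AddNode(Node{"a", "SourceFile"})
	g.AddNode(Node{"b", "TestCase"})
	g.AddNode(Node{"c", "SourceFile"})
	fmt.Println(len(g.NodesByType("SourceFile"))) // 2
}
```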
Runtime artifact ingestion: worker pool for multi-file parsing. Coverage directory ingestion: worker pool with up to 8 workers. Duplicate scoring: parallel pair scoring for >100 candidates.

All use per-index result arrays for deterministic ordering. 4 determinism tests added (duplicates, coverage, impact, fanout).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
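The per-index determinism pattern is worth showing: each worker writes its result into a pre-sized slot for its input index, so output order matches input order regardless of goroutine scheduling. A sketch with a stand-in `parseAll` (uppercasing substitutes for the real parsing work):

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// parseAll fans work out to a pool of workers but writes each result
// into results[i], so ordering is deterministic and no mutex or
// append race is possible on the output slice.
func parseAll(files []string, workers int) []string {
	results := make([]string, len(files)) // one slot per input index
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = strings.ToUpper(files[i]) // stand-in for parsing
			}
		}()
	}
	for i := range files {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	out := parseAll([]string{"a.json", "b.json", "c.json"}, 8)
	fmt.Println(out) // [A.JSON B.JSON C.JSON]
}
```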
5-step progress reporting via ProgressFunc callback:
[1/5] Scanning repository
[2/5] Building graph
[3/5] Inferring validations
[4/5] Computing insights
[5/5] Writing report

TTY detection via os.Stderr.Stat(). Suppressed in JSON mode and non-interactive (pipe/redirect) environments. 4 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add benchmark smoke tests to go-test job: validates 4 canonical commands across 3 fixtures with JSON structure assertions.

Remove develop branch trigger from ci.yml, terrain-pr.yml, codeql.yml (no develop branch exists on remote). Update benchmark assessment sections for current output format. Fix explain fallback to use 'selection' when no test ID found. Write benchmark-results.json and benchmark-report.md as canonical names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delete bin/hamlet.js (deprecated JS CLI alias). Remove hamlet binary detection from main.go (already committed). Remove 'hamlet' bin entry from package.json. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated:
- terrain-overview.md: current state with AI capabilities, 20 node types, 5 reasoning pipelines, benchmark count
- 00-overview.md: architecture with inference pipeline, AI validation, expanded 'How They Relate' table
- 22-reasoning-engine.md: 5 production reasoning pipelines
- feature-matrix.md: all AI features with implemented/partial/future status
- README.md: AI command table, updated architecture package list
- impact-report.md: AI repo example with scenario impact
- cli-spec.md: terrain ai commands documented

New:
- behavior-inference.md: code surface and behavior inference pipeline
- persona-matrix.md: per-persona feature coverage with hero workflows
- ai-user-journey.md: full PR → impact → ai run → explain workflow
- build-test-release.md: CI/CD pipeline, test taxonomy, release flow
- gauntlet.md: Gauntlet integration guide
- cleanup-report.md: deprecated functionality removal report
- analyze-backend.md, analyze-ai.md: domain-specific examples
- truth-validation.md: truth validation harness documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New package internal/truthcheck/ and cmd/terrain-truthcheck/ for validating Terrain output against documented ground truth specs. 7 category checkers: impact, coverage, redundancy, fanout, stability, AI, environment. Each computes precision, recall, and F1 score.

Usage: terrain-truthcheck --root <repo> --truth <spec.yaml> [--output <dir>]
Outputs: report.json (structured) + report.md (human-readable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
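The precision/recall/F1 math each category checker reports is standard; only the checker plumbing around it is Terrain-specific. A self-contained sketch of the scoring:

```go
package main

import "fmt"

// prf1 computes precision, recall, and F1 from confusion counts:
//   precision = TP / (TP + FP)
//   recall    = TP / (TP + FN)
//   F1        = 2PR / (P + R)
// Guards avoid division by zero when a class is empty.
func prf1(truePos, falsePos, falseNeg int) (p, r, f1 float64) {
	if truePos+falsePos > 0 {
		p = float64(truePos) / float64(truePos+falsePos)
	}
	if truePos+falseNeg > 0 {
		r = float64(truePos) / float64(truePos+falseNeg)
	}
	if p+r > 0 {
		f1 = 2 * p * r / (p + r)
	}
	return p, r, f1
}

func main() {
	// e.g. 9 correctly flagged findings, 1 spurious, 1 missed
	p, r, f1 := prf1(9, 1, 1)
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f1)
}
```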
- terrain-world: 7 domains, 6 layers, 12 intentional problems (100% F1)
- saas-control-plane: B2B SaaS, 7 domains, auth+billing+AI (100% F1)
- python-ml-observatory: Python ML, 6 domains, pytest evals (100% F1)
- legacy-omnichannel: Mixed JS/TS, 7 domains, framework overlap (94% F1)
- healthcare-ehr: EHR, 7 domains, patient+triage workflows (100% F1)
- fintech-trading: Trading, 7 domains, risk+orders+AI (100% F1)
- edtech-lms: LMS, 5 domains + AI tutor (100% F1)
- logistics-platform: Shipping, 5 domains + AI routing (100% F1)
- social-media-api: Social, 5 domains + AI content (100% F1)
- iot-device-hub: IoT, 5 domains + AI anomaly (100% F1)
- devops-pipeline: CI/CD, 5 domains + AI insights (100% F1)
- gaming-backend: Gaming, 5 domains + AI matchmaking (100% F1)
- real-estate-crm: CRM, 5 domains + AI valuation (100% F1)
- food-delivery: Delivery, 5 domains + AI recommendations (100% F1)

Each fixture includes truth spec, README, .terrain config. benchmarks/repos.json updated with all new entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Terrain — Change Analysis
Posture: [RISK] WEAKLY_PROTECTED

Findings

Recommended Tests
418 test(s) with exact coverage of 162 impacted unit(s). 63 impacted unit(s) have no covering tests in the selected set.
...and 403 more test(s).

Affected Owners

Targeted Test Results
Terrain selected 418 test(s) instead of the full suite.

Generated by Terrain — test system intelligence platform
Remove committed benchmark output (generated, should be gitignored):
- benchmarks/output/cli-benchmark-assessment.json
- benchmarks/output/cli-benchmark-summary.md

Remove stale committed files:
- fixtures/demos/ (internal Go testdata duplicates)
- ci/README.md (empty placeholder)

Rewrite .gitignore:
- Remove 80+ lines of Node.js boilerplate
- Add Go binary patterns (/terrain, /terrain-bench, /terrain-truthcheck)
- Add benchmark output patterns (all generated files)
- Add truthcheck output directory
- Add stale directory patterns (hamlet/, fixtures/demos/, ci/)

Rewrite .npmignore for defense-in-depth:
- Add Go source exclusions (internal/, cmd/, go.mod, go.sum)
- Add test/fixture/benchmark exclusions
- Add extension/ and examples/ exclusions
- Note: package.json "files" allowlist is the primary gate

npm pack: 99 files, 246KB (unchanged, verified clean).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Major feature release implementing AI test validation, graph schema hardening, and comprehensive benchmark infrastructure.
AI Testing (users never write YAML)
- `terrain ai list`, `ai doctor`, `ai run`, `ai record`, `ai baseline`
- `terrain impact` surfaces impacted scenarios when prompt/dataset files change
- `terrain explain <scenario-id>` traces why a scenario was selected
- `--gauntlet` flag for eval artifact ingestion

Graph & Reasoning Hardening
Benchmark Infrastructure
- Truth validation harness (`terrain-truthcheck`) with precision/recall scoring

UX Improvements
Cleanup
- `develop` branch CI triggers removed

Test plan
`go test ./internal/... ./cmd/... -count=1 -race` — 38 packages, 0 failures

🤖 Generated with Claude Code