feat(v3): AI testing, graph hardening, and benchmark infrastructure by pmclSF · Pull Request #95 · pmclSF/terrain

pmclSF · 2026-03-16T05:02:50Z

Summary

Major feature release implementing AI test validation, graph schema hardening, and comprehensive benchmark infrastructure.

AI Testing (users never write YAML)

Auto-detection of 12 eval frameworks, including promptfoo, deepeval, ragas, langchain, langsmith, openai, anthropic, llamaindex, huggingface, vertexai, aws-bedrock
Auto-derivation of scenarios from eval test files and AI framework imports
Prompt/dataset inference via naming conventions in JS/TS and Python
5 CLI commands: terrain ai list, ai doctor, ai run, ai record, ai baseline
Scenario impact: terrain impact surfaces impacted scenarios when prompt/dataset files change
Scenario explain: terrain explain <scenario-id> traces why a scenario was selected
Gauntlet integration: --gauntlet flag for eval artifact ingestion
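
The scenario impact idea above can be sketched as a set lookup: each scenario lists the prompt/dataset files it depends on, and a scenario is impacted when any of them appears in the changed-file list. This is only an illustration under stated assumptions; the `Scenario`, `Impact`, and `FindImpacted` names are hypothetical and are not taken from the Terrain codebase.

```go
package main

import (
	"fmt"
	"sort"
)

// Scenario is a hypothetical eval scenario with the surface files it depends on.
type Scenario struct {
	ID       string
	Surfaces []string // prompt/dataset file paths (assumed representation)
}

// Impact records which changed surfaces caused a scenario to be selected.
type Impact struct {
	ScenarioID string
	Changed    []string
}

// FindImpacted returns the scenarios that depend on at least one changed file,
// sorted by scenario ID so the output is stable across runs.
func FindImpacted(changed []string, scenarios []Scenario) []Impact {
	changedSet := make(map[string]bool, len(changed))
	for _, c := range changed {
		changedSet[c] = true
	}
	var out []Impact
	for _, sc := range scenarios {
		var hits []string
		for _, s := range sc.Surfaces {
			if changedSet[s] {
				hits = append(hits, s)
			}
		}
		if len(hits) > 0 {
			sort.Strings(hits)
			out = append(out, Impact{ScenarioID: sc.ID, Changed: hits})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ScenarioID < out[j].ScenarioID })
	return out
}

func main() {
	scenarios := []Scenario{
		{ID: "summarize", Surfaces: []string{"prompts/summary.ts", "data/articles.json"}},
		{ID: "triage", Surfaces: []string{"prompts/triage.ts"}},
	}
	for _, im := range FindImpacted([]string{"prompts/summary.ts"}, scenarios) {
		fmt.Println(im.ScenarioID, im.Changed)
	}
}
```

The `Changed` list doubles as the explanation for why a scenario was selected, which is roughly the information `terrain explain <scenario-id>` is described as surfacing.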

Graph & Reasoning Hardening

Removed 6 unused node types, 7 unused edge types from graph schema
Removed 6 orphaned packages (~3,100 lines of dead code)
Added type/family indexes and query caching (NodesByType: 12ns, Nodes: 1.8ns)
Parallelized runtime/coverage ingestion and duplicate scoring
12 graph benchmarks added

Benchmark Infrastructure

Truth validation harness (terrain-truthcheck) with precision/recall scoring
14 ground-truth fixture repos across healthcare, fintech, edtech, logistics, social, IoT, DevOps, gaming, real estate, food delivery
All 14 fixtures pass across 6 truth categories: 13 at 100% F1 and legacy-omnichannel at 94% F1

UX Improvements

Step-based CLI progress output for TTY mode
Validation inventory (prompts, datasets, scenarios counted)
Stability hints when no runtime data available
Benchmark smoke tests in CI

Cleanup

Hamlet-era deprecated aliases removed
Stale develop branch CI triggers removed
Documentation updated across 25+ files

Test plan

go test ./internal/... ./cmd/... -count=1 -race — 38 packages, 0 failures
All 14 truth-check fixtures pass (100% F1 for 13, 94% for legacy-omnichannel)
Golden snapshot tests pass
Benchmark smoke tests verified locally

🤖 Generated with Claude Code

…ge types

Remove node types never created by Build(): Package, Service, GeneratedArtifact, ExternalService, Fixture, Helper.
Remove edge types never created: TestUsesFixture, TestUsesHelper, FixtureImportsSource, HelperImportsSource, Validates, TestExercises, DependsOnService.
Update all consumers: coverage single-pass loop, duplicate fingerprinting, stability clustering, graph queries, and tests.
Schema after hardening: 20 node types, 15 edge types.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove packages never imported by any production code:
- internal/reasoning/ (~1,300 lines): superseded by domain-specific reasoning in depgraph, impact, stability, matrix
- internal/assertion/ (~400 lines): superseded by quality detectors
- internal/clustering/ (~300 lines): superseded by stability clustering
- internal/envdepth/ (~350 lines): superseded by matrix analysis
- internal/failure/ (~400 lines): superseded by health detectors
- internal/suppression/ (~350 lines): feature deferred
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add SurfacePrompt and SurfaceDataset kinds to CodeSurfaceKind. Detect prompt surfaces by naming convention (*Prompt*, *Template*, *PROMPT*) and dataset surfaces (*Dataset*, *Dataloader*, *TrainingData*, *EvalData*) in both JS/TS and Python extractors. 5 new tests covering JS and Python prompt/dataset detection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
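
A minimal sketch of naming-convention classification, using the marker substrings listed in the commit message. The `SurfaceKind` type, `ClassifySurface`, case-sensitive matching, and checking prompt markers before dataset markers are all assumptions made for illustration; the real extractors may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// SurfaceKind is a hypothetical stand-in for the CodeSurfaceKind mentioned in the PR.
type SurfaceKind string

const (
	SurfaceOther   SurfaceKind = "other"
	SurfacePrompt  SurfaceKind = "prompt"
	SurfaceDataset SurfaceKind = "dataset"
)

// Marker substrings taken from the commit message (*Prompt*, *Template*, ...).
var (
	promptMarkers  = []string{"Prompt", "Template", "PROMPT"}
	datasetMarkers = []string{"Dataset", "Dataloader", "TrainingData", "EvalData"}
)

// containsAny reports whether name contains any marker (case-sensitive).
func containsAny(name string, markers []string) bool {
	for _, m := range markers {
		if strings.Contains(name, m) {
			return true
		}
	}
	return false
}

// ClassifySurface maps a symbol name to a surface kind.
// Assumption: prompt markers win when a name matches both families.
func ClassifySurface(name string) SurfaceKind {
	switch {
	case containsAny(name, promptMarkers):
		return SurfacePrompt
	case containsAny(name, datasetMarkers):
		return SurfaceDataset
	default:
		return SurfaceOther
	}
}

func main() {
	for _, n := range []string{"buildSystemPrompt", "SYSTEM_PROMPT", "EvalDataLoader", "parseConfig"} {
		fmt.Printf("%s -> %s\n", n, ClassifySurface(n))
	}
}
```

Because matching here is case-sensitive, a snake_case Python name such as `load_dataset` would not match `Dataset`; whether the actual Python extractor normalizes case is not stated in the source.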

- ImpactedScenario type and findImpactedScenarios(): maps changed prompt/dataset surfaces to impacted eval scenarios
- ScenarioExplanation and ExplainScenario(): explains why a scenario is impacted, with the list of changed surfaces
- scenarioDuplicationFindings(): detects >50% surface overlap between scenario pairs in insights
- Scenario loading from .terrain/terrain.yaml with ToScenarios()
- LoadTerrainConfig checks the .terrain/ subdirectory first
- 7 new tests for the AI impact path
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
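
The ">50% surface overlap" check can be illustrated as follows. The commit message does not say how overlap is measured, so this sketch assumes Jaccard similarity (shared surfaces divided by the union of both sets); `Scenario`, `Overlap`, and `DuplicatePairs` are hypothetical names.

```go
package main

import "fmt"

// Scenario is a hypothetical scenario with the surfaces it exercises.
type Scenario struct {
	ID       string
	Surfaces []string
}

// toSet deduplicates a slice into a set.
func toSet(items []string) map[string]struct{} {
	set := make(map[string]struct{}, len(items))
	for _, it := range items {
		set[it] = struct{}{}
	}
	return set
}

// Overlap returns the Jaccard similarity of two surface lists:
// |A ∩ B| / |A ∪ B|. Two empty lists are treated as having no overlap.
func Overlap(a, b []string) float64 {
	setA, setB := toSet(a), toSet(b)
	shared := 0
	for s := range setB {
		if _, ok := setA[s]; ok {
			shared++
		}
	}
	union := len(setA) + len(setB) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

// DuplicatePairs returns ID pairs whose overlap is strictly above threshold.
func DuplicatePairs(scenarios []Scenario, threshold float64) [][2]string {
	var pairs [][2]string
	for i := 0; i < len(scenarios); i++ {
		for j := i + 1; j < len(scenarios); j++ {
			if Overlap(scenarios[i].Surfaces, scenarios[j].Surfaces) > threshold {
				pairs = append(pairs, [2]string{scenarios[i].ID, scenarios[j].ID})
			}
		}
	}
	return pairs
}

func main() {
	scenarios := []Scenario{
		{ID: "s1", Surfaces: []string{"p1", "p2", "d1"}},
		{ID: "s2", Surfaces: []string{"p1", "p2", "d1", "d2"}},
		{ID: "s3", Surfaces: []string{"p9"}},
	}
	fmt.Println(DuplicatePairs(scenarios, 0.5))
}
```

A different denominator (for example, the size of the smaller set) would flag more pairs; the choice here is purely illustrative.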

New package internal/gauntlet/ for ingesting Gauntlet execution results:
- Artifact model (scenarios, metrics, baselines, regressions)
- Ingest() with validation (version, provider, non-empty scenarios)
- ApplyToSnapshot(): matches scenarios, generates evalFailure and evalRegression signals
- 10 tests covering parsing, matching, signal generation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
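
The three validation rules named above (version, provider, non-empty scenarios) might look like the sketch below. The JSON field names and the `Artifact`/`ScenarioResult` shapes are assumptions; the actual Gauntlet artifact schema is not shown in this PR description.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ScenarioResult is a hypothetical per-scenario record in the artifact.
type ScenarioResult struct {
	ID     string `json:"id"`
	Passed bool   `json:"passed"`
}

// Artifact is a hypothetical, reduced artifact model; field names are assumed.
type Artifact struct {
	Version   string           `json:"version"`
	Provider  string           `json:"provider"`
	Scenarios []ScenarioResult `json:"scenarios"`
}

// Ingest parses an artifact and rejects it when the version or provider
// is missing or when it contains no scenarios.
func Ingest(data []byte) (*Artifact, error) {
	var a Artifact
	if err := json.Unmarshal(data, &a); err != nil {
		return nil, fmt.Errorf("parse artifact: %w", err)
	}
	switch {
	case a.Version == "":
		return nil, errors.New("artifact has no version")
	case a.Provider == "":
		return nil, errors.New("artifact has no provider")
	case len(a.Scenarios) == 0:
		return nil, errors.New("artifact has no scenarios")
	}
	return &a, nil
}

func main() {
	raw := []byte(`{"version":"1","provider":"gauntlet","scenarios":[{"id":"s1","passed":false}]}`)
	a, err := Ingest(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("accepted %d scenario(s) from %s\n", len(a.Scenarios), a.Provider)
}
```

Validating at ingestion time keeps later steps such as signal generation from having to handle half-formed artifacts.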

… code

New package internal/aidetect/: users never need to write YAML scenarios.
Framework detection (12 frameworks), including: promptfoo, deepeval, ragas, langchain, langsmith, openai, anthropic, llamaindex, huggingface, vertexai, aws-bedrock
Detection methods: config files, dependency manifests, source imports.
Auto-scenario derivation from:
- Eval test files importing prompt/dataset surfaces
- Test files importing AI framework libraries
- Promptfoo config test cases
9 tests covering detection and derivation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
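
As one illustration of the "dependency manifests" detection method, the sketch below scans a package.json for known npm package names. The signature table is a small, hand-picked assumption and `DetectFromPackageJSON` is a hypothetical name; the real internal/aidetect/ package may use different signatures and also inspects config files and source imports.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// npmSignatures maps npm package names to framework labels.
// This table is illustrative only and far from complete.
var npmSignatures = map[string]string{
	"promptfoo":         "promptfoo",
	"langchain":         "langchain",
	"openai":            "openai",
	"@anthropic-ai/sdk": "anthropic",
}

// DetectFromPackageJSON returns the sorted, de-duplicated framework labels
// found in the dependencies and devDependencies of a package.json document.
func DetectFromPackageJSON(data []byte) ([]string, error) {
	var manifest struct {
		Dependencies    map[string]string `json:"dependencies"`
		DevDependencies map[string]string `json:"devDependencies"`
	}
	if err := json.Unmarshal(data, &manifest); err != nil {
		return nil, fmt.Errorf("parse package.json: %w", err)
	}
	found := make(map[string]bool)
	for _, deps := range []map[string]string{manifest.Dependencies, manifest.DevDependencies} {
		for name := range deps {
			if fw, ok := npmSignatures[name]; ok {
				found[fw] = true
			}
		}
	}
	out := make([]string, 0, len(found))
	for fw := range found {
		out = append(out, fw)
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	manifest := []byte(`{
		"dependencies": {"openai": "^4.0.0", "lodash": "^4.17.0"},
		"devDependencies": {"promptfoo": "^0.50.0"}
	}`)
	frameworks, err := DetectFromPackageJSON(manifest)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(frameworks)
}
```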

- terrain ai list: shows detected frameworks, auto-derived scenarios, prompt/dataset surfaces, eval files, model files
- terrain ai doctor: 5-point diagnostic with framework detection
- terrain ai run: detects eval framework, builds and executes command (promptfoo, deepeval, pytest, vitest delegation)
- terrain ai record: saves scenario state as baseline snapshot
- terrain ai baseline: shows baseline contents and comparison guidance

Pipeline wired with:
- AI framework detection at Step 2c
- Auto-scenario derivation (merged with manual YAML, no duplicates)
- Gauntlet artifact ingestion at Step 4c
- --gauntlet flag on terrain analyze

7 AI workflow integration tests + 5 CLI command tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Rename 'Tests Detected' to 'Validation Inventory' with new fields: CodeSurfaceCount, ScenarioCount, PromptCount, DatasetCount. Add stability hints section: shows runtime data guidance when no runtime artifacts provided, skip hints when skips detected. 5 edge case scenario tests (few tests, heavy manual, AI-heavy, flaky signals, report structure). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Graph struct gains typeIndex, familyIndex (populated during AddNode), neighborCache, outDegreeCache (populated lazily after Seal). NodesByType/NodesByFamily: O(n) scan+sort → O(1) index lookup (12ns). Nodes(): O(n log n) → O(1) cache hit (1.8ns). Neighbors(): O(e log e) → O(1) cache hit (23ns). AnalyzeCoverage: merged dual Incoming() loops into single pass, removed dead pre-built edge index. AnalyzeImpact: uses OutDegree() cache instead of precomputing all. 12 benchmarks added covering build, query, and analysis operations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
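
The indexing and caching pattern described above can be sketched in a reduced form: a type index maintained on every AddNode, and a sorted node list computed once after Seal and then reused. This is a simplified illustration rather than the Terrain Graph type; the field and method names only loosely follow the commit message, and the sketch is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a minimal graph node with an ID and a type label.
type Node struct {
	ID   string
	Type string
}

// Graph keeps a type index that is updated on every AddNode and a
// sorted node cache that is filled lazily once the graph is sealed.
type Graph struct {
	nodes       map[string]Node
	typeIndex   map[string][]string // node type -> node IDs in insertion order
	sealed      bool
	sortedCache []Node // nil until the first Nodes() call after Seal
}

// NewGraph returns an empty, unsealed graph.
func NewGraph() *Graph {
	return &Graph{nodes: map[string]Node{}, typeIndex: map[string][]string{}}
}

// AddNode inserts a node and updates the type index. Duplicate IDs are ignored.
func (g *Graph) AddNode(n Node) {
	if g.sealed {
		panic("graph is sealed")
	}
	if _, exists := g.nodes[n.ID]; exists {
		return
	}
	g.nodes[n.ID] = n
	g.typeIndex[n.Type] = append(g.typeIndex[n.Type], n.ID)
}

// Seal marks the graph immutable, which makes caching query results safe.
func (g *Graph) Seal() { g.sealed = true }

// NodesByType is a single map lookup instead of a scan over all nodes.
func (g *Graph) NodesByType(t string) []string { return g.typeIndex[t] }

// Nodes returns all nodes sorted by ID. After Seal the sorted slice is
// computed once and returned from the cache on later calls.
func (g *Graph) Nodes() []Node {
	if g.sealed && g.sortedCache != nil {
		return g.sortedCache
	}
	out := make([]Node, 0, len(g.nodes))
	for _, n := range g.nodes {
		out = append(out, n)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	if g.sealed {
		g.sortedCache = out
	}
	return out
}

func main() {
	g := NewGraph()
	g.AddNode(Node{ID: "t2", Type: "test"})
	g.AddNode(Node{ID: "s1", Type: "source"})
	g.AddNode(Node{ID: "t1", Type: "test"})
	g.Seal()
	fmt.Println(g.NodesByType("test"), len(g.Nodes()))
}
```

Returning the cached slice directly is what makes the hit cheap, but it also means callers must treat the result as read-only.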

Runtime artifact ingestion: worker pool for multi-file parsing. Coverage directory ingestion: worker pool with up to 8 workers. Duplicate scoring: parallel pair scoring for >100 candidates. All use per-index result arrays for deterministic ordering. 4 determinism tests added (duplicates, coverage, impact, fanout). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
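
The "per-index result arrays" technique can be shown with a small worker pool: each job carries its input index and writes its result into that slot, so output order matches input order regardless of which worker finishes first. `ParallelMap` is a hypothetical helper written for this sketch, not a function from the PR.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// ParallelMap applies fn to every input using a fixed number of workers.
// Each worker writes only to results[i] for the indexes it receives, so
// no two goroutines touch the same element and the output order is
// identical to the input order.
func ParallelMap(inputs []string, workers int, fn func(string) string) []string {
	results := make([]string, len(inputs))
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = fn(inputs[i])
			}
		}()
	}
	for i := range inputs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	files := []string{"coverage/a.out", "coverage/b.out", "coverage/c.out"}
	// strings.ToUpper stands in for a real per-file parser.
	fmt.Println(ParallelMap(files, 8, strings.ToUpper))
}
```

Writing to distinct slice elements from different goroutines is race-free in Go, which is why no mutex is needed around `results`.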

5-step progress reporting via ProgressFunc callback:
[1/5] Scanning repository
[2/5] Building graph
[3/5] Inferring validations
[4/5] Computing insights
[5/5] Writing report
TTY detection via os.Stderr.Stat(). Suppressed in JSON mode and non-interactive (pipe/redirect) environments. 4 tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
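
A minimal sketch of the progress mechanism, assuming the TTY check tests the character-device bit of the file mode returned by Stat(). The `ProgressFunc` signature, `FormatStep`, `NewProgress`, and the `jsonMode` variable are assumptions for illustration; only the step labels and the use of os.Stderr.Stat() come from the commit message.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// ProgressFunc is a hypothetical callback signature for step reporting.
type ProgressFunc func(step, total int, label string)

// isTerminal reports whether f looks like an interactive terminal by
// checking the character-device bit. This is a heuristic: other
// character devices such as /dev/null also pass the check.
func isTerminal(f *os.File) bool {
	fi, err := f.Stat()
	if err != nil {
		return false
	}
	return fi.Mode()&os.ModeCharDevice != 0
}

// FormatStep renders one progress line, e.g. "[2/5] Building graph".
func FormatStep(step, total int, label string) string {
	return fmt.Sprintf("[%d/%d] %s", step, total, label)
}

// NewProgress returns a reporter that writes to w, or a no-op reporter
// when progress output is disabled (JSON mode, pipes, redirects).
func NewProgress(w io.Writer, enabled bool) ProgressFunc {
	if !enabled {
		return func(int, int, string) {}
	}
	return func(step, total int, label string) {
		fmt.Fprintln(w, FormatStep(step, total, label))
	}
}

func main() {
	jsonMode := false // would come from a CLI flag in a real command
	progress := NewProgress(os.Stderr, isTerminal(os.Stderr) && !jsonMode)
	steps := []string{
		"Scanning repository", "Building graph", "Inferring validations",
		"Computing insights", "Writing report",
	}
	for i, label := range steps {
		progress(i+1, len(steps), label)
	}
}
```

Writing progress to stderr rather than stdout keeps machine-readable output on stdout clean even when progress is enabled.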

Add benchmark smoke tests to go-test job: validates 4 canonical commands across 3 fixtures with JSON structure assertions. Remove develop branch trigger from ci.yml, terrain-pr.yml, codeql.yml (no develop branch exists on remote). Update benchmark assessment sections for current output format. Fix explain fallback to use 'selection' when no test ID found. Write benchmark-results.json and benchmark-report.md as canonical names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Delete bin/hamlet.js (deprecated JS CLI alias). Remove hamlet binary detection from main.go (already committed). Remove 'hamlet' bin entry from package.json. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Updated:
- terrain-overview.md: current state with AI capabilities, 20 node types, 5 reasoning pipelines, benchmark count
- 00-overview.md: architecture with inference pipeline, AI validation, expanded 'How They Relate' table
- 22-reasoning-engine.md: 5 production reasoning pipelines
- feature-matrix.md: all AI features with implemented/partial/future status
- README.md: AI command table, updated architecture package list
- impact-report.md: AI repo example with scenario impact
- cli-spec.md: terrain ai commands documented

New:
- behavior-inference.md: code surface and behavior inference pipeline
- persona-matrix.md: per-persona feature coverage with hero workflows
- ai-user-journey.md: full PR → impact → ai run → explain workflow
- build-test-release.md: CI/CD pipeline, test taxonomy, release flow
- gauntlet.md: Gauntlet integration guide
- cleanup-report.md: deprecated functionality removal report
- analyze-backend.md, analyze-ai.md: domain-specific examples
- truth-validation.md: truth validation harness documentation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New package internal/truthcheck/ and cmd/terrain-truthcheck/ for validating Terrain output against documented ground truth specs. 7 category checkers: impact, coverage, redundancy, fanout, stability, AI, environment. Each computes precision, recall, and F1 score. Usage: terrain-truthcheck --root <repo> --truth <spec.yaml> [--output <dir>] Outputs: report.json (structured) + report.md (human-readable). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
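
For reference, precision, recall, and F1 over an expected set and an actual set can be computed as in the sketch below. `Score` and `ScoreSets` are hypothetical names, and treating two empty sets as a perfect score is an assumption; the source does not describe how the checkers handle empty categories.

```go
package main

import "fmt"

// Score holds the three metrics for one truth category.
type Score struct {
	Precision, Recall, F1 float64
}

// toSet deduplicates a slice into a set.
func toSet(items []string) map[string]struct{} {
	set := make(map[string]struct{}, len(items))
	for _, it := range items {
		set[it] = struct{}{}
	}
	return set
}

// ScoreSets compares what the tool reported (actual) with the ground
// truth (expected).
//   precision = true positives / reported items
//   recall    = true positives / expected items
//   F1        = harmonic mean of precision and recall
// Assumption: when both sets are empty the category scores 1.0.
func ScoreSets(expected, actual []string) Score {
	exp, act := toSet(expected), toSet(actual)
	if len(exp) == 0 && len(act) == 0 {
		return Score{Precision: 1, Recall: 1, F1: 1}
	}
	truePositives := 0
	for item := range act {
		if _, ok := exp[item]; ok {
			truePositives++
		}
	}
	var s Score
	if len(act) > 0 {
		s.Precision = float64(truePositives) / float64(len(act))
	}
	if len(exp) > 0 {
		s.Recall = float64(truePositives) / float64(len(exp))
	}
	if s.Precision+s.Recall > 0 {
		s.F1 = 2 * s.Precision * s.Recall / (s.Precision + s.Recall)
	}
	return s
}

func main() {
	expected := []string{"auth_test", "billing_test", "orders_test", "risk_test"}
	actual := []string{"auth_test", "billing_test", "orders_test", "legacy_test"}
	fmt.Printf("%+v\n", ScoreSets(expected, actual))
}
```

A fixture reaches 100% F1 only when the reported set equals the expected set exactly, since any extra item lowers precision and any missing item lowers recall.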

- terrain-world: 7 domains, 6 layers, 12 intentional problems (100% F1)
- saas-control-plane: B2B SaaS, 7 domains, auth+billing+AI (100% F1)
- python-ml-observatory: Python ML, 6 domains, pytest evals (100% F1)
- legacy-omnichannel: Mixed JS/TS, 7 domains, framework overlap (94% F1)
- healthcare-ehr: EHR, 7 domains, patient+triage workflows (100% F1)
- fintech-trading: Trading, 7 domains, risk+orders+AI (100% F1)
- edtech-lms: LMS, 5 domains + AI tutor (100% F1)
- logistics-platform: Shipping, 5 domains + AI routing (100% F1)
- social-media-api: Social, 5 domains + AI content (100% F1)
- iot-device-hub: IoT, 5 domains + AI anomaly (100% F1)
- devops-pipeline: CI/CD, 5 domains + AI insights (100% F1)
- gaming-backend: Gaming, 5 domains + AI matchmaking (100% F1)
- real-estate-crm: CRM, 5 domains + AI valuation (100% F1)
- food-delivery: Delivery, 5 domains + AI recommendations (100% F1)

Each fixture includes truth spec, README, .terrain config. benchmarks/repos.json updated with all new entries.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-16T05:03:41Z

Terrain — Change Analysis

Posture: [RISK] WEAKLY_PROTECTED

Metric	Count
Changed files	510
Changed source files	53
Changed test files	413
Impacted code units	225
Protection gaps	62

Findings

[MED] protection_gap bin/hamlet.js: hamlet.js has no observed test coverage.
- Action: Add unit tests for hamlet.js.
[HIGH] protection_gap cmd/terrain-truthcheck/main.go: Exported function TruthCheckReport has no observed test coverage.
- Action: Add unit tests for exported function TruthCheckReport — this is public API surface.
[MED] protection_gap cmd/terrain/main.go: main.go has no observed test coverage.
- Action: Add unit tests for main.go.
[MED] protection_gap cmd/terrain/progress.go: progress.go has no observed test coverage.
- Action: Add unit tests for progress.go.
[HIGH] protection_gap internal/analysis/code_surface.go: Exported function ExtractSurfaces has no observed test coverage.
- Action: Add unit tests for exported function ExtractSurfaces — this is public API surface.
[HIGH] protection_gap internal/analysis/code_surface.go: Exported function Language has no observed test coverage.
- Action: Add unit tests for exported function Language — this is public API surface.
... and 95 more finding(s)

Recommended Tests

418 test(s) with exact coverage of 162 impacted unit(s). 63 impacted unit(s) have no covering tests in the selected set.

Test	Confidence	Why
`cmd/terrain/ai_workflow_test.go`	exact	exact coverage of `DataCompleteness`, `DefaultEngineVersion`, `PipelineOptions`, `PipelineResult` + 40 more (shared across 8 tests)
`cmd/terrain/main_test.go`	exact	exact coverage of `BuildSurfaceID`, `CodeSurface`, `CodeSurfaceKind` (shared across 69 tests)
`cmd/terrain/progress_test.go`	exact	test file directly changed
`cmd/terrain/snapshot_test.go`	exact	exact coverage of `Build`, `BuildInput`, `BuildSnapshotProfileData`, `CIOptimizationSummary` + 79 more (shared across 3 tests)
`internal/aidetect/detect_test.go`	exact	exact coverage of `Detect`, `DetectResult`, `Framework`, `FrameworkSignature` + 2 more (+ 3 shared)
`internal/analysis/analyzer_test.go`	exact	exact coverage of `InferCodeSurfaces`, `SurfaceExtractor` (shared across 9 tests)
`internal/analysis/behavior_surface_test.go`	exact	exact coverage of `InferCodeSurfaces`, `SurfaceExtractor`, `BuildSurfaceID`, `CodeSurface` + 1 more (shared across 9 tests)
`internal/analysis/code_surface_test.go`	exact	exact coverage of `InferCodeSurfaces`, `SurfaceExtractor`, `BuildSurfaceID`, `CodeSurface` + 1 more (shared across 9 tests)
`internal/analysis/content_analysis_test.go`	exact	exact coverage of `InferCodeSurfaces`, `SurfaceExtractor`, `BuildSurfaceID`, `CodeSurface` + 1 more (shared across 9 tests)
`internal/analysis/framework_detection_test.go`	exact	exact coverage of `InferCodeSurfaces`, `SurfaceExtractor` (shared across 9 tests)
`internal/analysis/import_graph_benchmark_test.go`	exact	exact coverage of `InferCodeSurfaces`, `SurfaceExtractor`, `BuildSurfaceID`, `CodeSurface` + 1 more (shared across 9 tests)
`internal/analysis/import_graph_test.go`	exact	exact coverage of `InferCodeSurfaces`, `SurfaceExtractor`, `BuildSurfaceID`, `CodeSurface` + 1 more (shared across 9 tests)
`internal/analysis/language_test.go`	exact	exact coverage of `InferCodeSurfaces`, `SurfaceExtractor` (shared across 9 tests)
`internal/analyze/analyze_golden_test.go`	exact	exact coverage of `Build`, `BuildInput`, `BuildSnapshotProfileData`, `CIOptimizationSummary` + 18 more (shared across 3 tests)
`internal/analyze/analyze_test.go`	exact	exact coverage of `Build`, `BuildInput`, `BuildSnapshotProfileData`, `CIOptimizationSummary` + 18 more (shared across 3 tests)

...and 403 more test(s).

Affected Owners

pmclachlansf

Generated by Terrain — test system intelligence platform

Targeted Test Results

Terrain selected 418 test(s) instead of the full suite.

Go tests: passed
JS tests: passed

Remove committed benchmark output (generated, should be gitignored):
- benchmarks/output/cli-benchmark-assessment.json
- benchmarks/output/cli-benchmark-summary.md

Remove stale committed files:
- fixtures/demos/ (internal Go testdata duplicates)
- ci/README.md (empty placeholder)

Rewrite .gitignore:
- Remove 80+ lines of Node.js boilerplate
- Add Go binary patterns (/terrain, /terrain-bench, /terrain-truthcheck)
- Add benchmark output patterns (all generated files)
- Add truthcheck output directory
- Add stale directory patterns (hamlet/, fixtures/demos/, ci/)

Rewrite .npmignore for defense-in-depth:
- Add Go source exclusions (internal/, cmd/, go.mod, go.sum)
- Add test/fixture/benchmark exclusions
- Add extension/ and examples/ exclusions
- Note: package.json "files" allowlist is the primary gate

npm pack: 99 files, 246KB (unchanged, verified clean).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pmclSF and others added 16 commits March 15, 2026 21:47

chore: remove hamlet-era deprecated aliases

ef64fe3

pmclSF merged commit 1f51e0b into main Mar 16, 2026
10 checks passed

pmclSF deleted the feat/v3-ai-testing-and-hardening branch March 16, 2026 05:14

pmclSF merged 17 commits into main from feat/v3-ai-testing-and-hardening