feat(v3): AI testing, graph hardening, and benchmark infrastructure #95

Merged: pmclSF merged 17 commits into main from feat/v3-ai-testing-and-hardening on Mar 16, 2026

@pmclSF pmclSF commented Mar 16, 2026

Summary

Major feature release implementing AI test validation, graph schema hardening, and comprehensive benchmark infrastructure.

AI Testing (users never write YAML)

  • Auto-detection of 12 eval frameworks, including promptfoo, deepeval, ragas, langchain, langsmith, openai, anthropic, llamaindex, huggingface, vertexai, aws-bedrock
  • Auto-derivation of scenarios from eval test files and AI framework imports
  • Prompt/dataset inference via naming conventions in JS/TS and Python
  • 5 CLI commands: terrain ai list, ai doctor, ai run, ai record, ai baseline
  • Scenario impact: terrain impact surfaces impacted scenarios when prompt/dataset files change
  • Scenario explain: terrain explain <scenario-id> traces why a scenario was selected
  • Gauntlet integration: --gauntlet flag for eval artifact ingestion

Graph & Reasoning Hardening

  • Removed 6 unused node types, 7 unused edge types from graph schema
  • Removed 6 orphaned packages (~3,100 lines of dead code)
  • Added type/family indexes and query caching (NodesByType: 12ns, Nodes: 1.8ns)
  • Parallelized runtime/coverage ingestion and duplicate scoring
  • 12 graph benchmarks added

Benchmark Infrastructure

  • Truth validation harness (terrain-truthcheck) with precision/recall scoring
  • 14 ground-truth fixture repos across healthcare, fintech, edtech, logistics, social, IoT, DevOps, gaming, real estate, food delivery
  • 13 of 14 fixtures pass at 100% F1 across 6 truth categories; legacy-omnichannel reaches 94%

UX Improvements

  • Step-based CLI progress output for TTY mode
  • Validation inventory (prompts, datasets, scenarios counted)
  • Stability hints when no runtime data available
  • Benchmark smoke tests in CI

Cleanup

  • Hamlet-era deprecated aliases removed
  • Stale develop branch CI triggers removed
  • Documentation updated across 25+ files

Test plan

  • go test ./internal/... ./cmd/... -count=1 -race — 38 packages, 0 failures
  • All 14 truth-check fixtures pass (100% F1 for 13, 94% for legacy-omnichannel)
  • Golden snapshot tests pass
  • Benchmark smoke tests verified locally

🤖 Generated with Claude Code

pmclSF and others added 16 commits March 15, 2026 21:47
…ge types

Remove node types never created by Build(): Package, Service,
GeneratedArtifact, ExternalService, Fixture, Helper.

Remove edge types never created: TestUsesFixture, TestUsesHelper,
FixtureImportsSource, HelperImportsSource, Validates, TestExercises,
DependsOnService.

Update all consumers: coverage single-pass loop, duplicate fingerprinting,
stability clustering, graph queries, and tests.

Schema after hardening: 20 node types, 15 edge types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove packages never imported by any production code:
- internal/reasoning/ (~1,300 lines) — superseded by domain-specific
  reasoning in depgraph, impact, stability, matrix
- internal/assertion/ (~400 lines) — superseded by quality detectors
- internal/clustering/ (~300 lines) — superseded by stability clustering
- internal/envdepth/ (~350 lines) — superseded by matrix analysis
- internal/failure/ (~400 lines) — superseded by health detectors
- internal/suppression/ (~350 lines) — feature deferred

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add SurfacePrompt and SurfaceDataset kinds to CodeSurfaceKind.
Detect prompt surfaces by naming convention (*Prompt*, *Template*, *PROMPT*)
and dataset surfaces (*Dataset*, *Dataloader*, *TrainingData*, *EvalData*)
in both JS/TS and Python extractors.

5 new tests covering JS and Python prompt/dataset detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
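
The naming-convention detection described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the extractor code from this PR: the `SurfaceKind` type, `classifySurface`, and the prompt-before-dataset precedence are assumptions, and only the marker substrings come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// SurfaceKind is a hypothetical stand-in for the PR's CodeSurfaceKind.
type SurfaceKind string

const (
	SurfaceOther   SurfaceKind = "other"
	SurfacePrompt  SurfaceKind = "prompt"
	SurfaceDataset SurfaceKind = "dataset"
)

// Marker substrings taken from the commit message. Matching is
// case-sensitive, which is why both "Prompt" and "PROMPT" are listed.
var (
	promptMarkers  = []string{"Prompt", "Template", "PROMPT"}
	datasetMarkers = []string{"Dataset", "Dataloader", "TrainingData", "EvalData"}
)

// containsAny reports whether name contains at least one marker.
func containsAny(name string, markers []string) bool {
	for _, m := range markers {
		if strings.Contains(name, m) {
			return true
		}
	}
	return false
}

// classifySurface checks prompt markers first; that precedence is an
// assumption of this sketch, not something the commit states.
func classifySurface(name string) SurfaceKind {
	switch {
	case containsAny(name, promptMarkers):
		return SurfacePrompt
	case containsAny(name, datasetMarkers):
		return SurfaceDataset
	default:
		return SurfaceOther
	}
}

func main() {
	for _, n := range []string{"buildSystemPrompt", "EvalDataset", "SYSTEM_PROMPT", "parseConfig"} {
		fmt.Printf("%s -> %s\n", n, classifySurface(n))
	}
}
```

Substring matching like this is cheap but can over-match (for example, an HTML `Template` helper), so a real extractor would likely combine it with other signals.
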
- ImpactedScenario type and findImpactedScenarios() — maps changed
  prompt/dataset surfaces to impacted eval scenarios
- ScenarioExplanation and ExplainScenario() — explains why a scenario
  is impacted with changed surface list
- scenarioDuplicationFindings() — detects >50% surface overlap between
  scenario pairs in insights
- Scenario loading from .terrain/terrain.yaml with ToScenarios()
- LoadTerrainConfig checks .terrain/ subdirectory first
- 7 new tests for AI impact path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
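
To make the ">50% surface overlap" rule for scenario pairs concrete, here is a small sketch. The commit does not say how overlap is measured; this version assumes a Jaccard ratio (shared surfaces divided by the union of both sets), and the function names are hypothetical.

```go
package main

import "fmt"

// toSet converts a slice of surface IDs into a set.
func toSet(ids []string) map[string]bool {
	s := make(map[string]bool, len(ids))
	for _, id := range ids {
		s[id] = true
	}
	return s
}

// overlapRatio returns |A ∩ B| / |A ∪ B|. Using Jaccard here is an
// assumption; the PR only states a ">50% surface overlap" threshold.
func overlapRatio(a, b []string) float64 {
	setA, setB := toSet(a), toSet(b)
	shared := 0
	for id := range setA {
		if setB[id] {
			shared++
		}
	}
	union := len(setA) + len(setB) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

// isDuplicateCandidate applies the strict >50% threshold.
func isDuplicateCandidate(a, b []string) bool {
	return overlapRatio(a, b) > 0.5
}

func main() {
	a := []string{"prompt:system", "prompt:summary", "dataset:eval"}
	b := []string{"prompt:system", "prompt:summary", "dataset:eval", "dataset:train"}
	c := []string{"prompt:system", "prompt:summary", "dataset:train"}

	fmt.Println(overlapRatio(a, b), isDuplicateCandidate(a, b)) // 3 shared of 4 total
	fmt.Println(overlapRatio(a, c), isDuplicateCandidate(a, c)) // 2 shared of 4 total
}
```

With a strict threshold, a pair at exactly 50% overlap is not flagged.
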
New package internal/gauntlet/ for ingesting Gauntlet execution results:
- Artifact model (scenarios, metrics, baselines, regressions)
- Ingest() with validation (version, provider, non-empty scenarios)
- ApplyToSnapshot() — matches scenarios, generates evalFailure and
  evalRegression signals
- 10 tests covering parsing, matching, signal generation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
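
The three validation checks named for `Ingest()` (version, provider, non-empty scenarios) might look something like the sketch below. The JSON field names and struct shapes are hypothetical; the actual Gauntlet artifact schema is not shown in this PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Artifact is a hypothetical, trimmed-down artifact shape.
// Field names are assumptions for illustration only.
type Artifact struct {
	Version   string     `json:"version"`
	Provider  string     `json:"provider"`
	Scenarios []Scenario `json:"scenarios"`
}

// Scenario is a hypothetical per-scenario result.
type Scenario struct {
	ID     string `json:"id"`
	Passed bool   `json:"passed"`
}

// ingest parses raw JSON and rejects artifacts that lack a version,
// a provider, or at least one scenario.
func ingest(data []byte) (*Artifact, error) {
	var a Artifact
	if err := json.Unmarshal(data, &a); err != nil {
		return nil, fmt.Errorf("parse artifact: %w", err)
	}
	switch {
	case a.Version == "":
		return nil, errors.New("artifact is missing a version")
	case a.Provider == "":
		return nil, errors.New("artifact is missing a provider")
	case len(a.Scenarios) == 0:
		return nil, errors.New("artifact contains no scenarios")
	}
	return &a, nil
}

func main() {
	good := []byte(`{"version":"1","provider":"gauntlet","scenarios":[{"id":"s1","passed":true}]}`)
	if a, err := ingest(good); err == nil {
		fmt.Println("accepted scenarios:", len(a.Scenarios))
	}

	bad := []byte(`{"version":"1","provider":"gauntlet","scenarios":[]}`)
	if _, err := ingest(bad); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Validating at ingestion time keeps later steps, such as matching scenarios against a snapshot, free of nil and empty-input checks.
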
… code

New package internal/aidetect/ — users never need to write YAML scenarios.

Framework detection (12 frameworks), including:
  promptfoo, deepeval, ragas, langchain, langsmith, openai, anthropic,
  llamaindex, huggingface, vertexai, aws-bedrock

Detection methods: config files, dependency manifests, source imports.

Auto-scenario derivation from:
  - Eval test files importing prompt/dataset surfaces
  - Test files importing AI framework libraries
  - Promptfoo config test cases

9 tests covering detection and derivation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
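
Of the three detection methods listed above, the dependency-manifest check is the simplest to illustrate. The sketch below scans a `package.json` for known package names. The signature table and function name are assumptions made for this example and cover only three of the frameworks; they are not the contents of `internal/aidetect/`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// npmSignatures maps an npm package name to a framework label.
// This table is illustrative and intentionally incomplete.
var npmSignatures = map[string]string{
	"promptfoo": "promptfoo",
	"langchain": "langchain",
	"openai":    "openai",
}

// manifest holds the two package.json sections inspected here.
type manifest struct {
	Dependencies    map[string]string `json:"dependencies"`
	DevDependencies map[string]string `json:"devDependencies"`
}

// detectFromPackageJSON returns the sorted, de-duplicated framework
// labels whose packages appear in either dependency section.
func detectFromPackageJSON(data []byte) ([]string, error) {
	var m manifest
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, fmt.Errorf("parse package.json: %w", err)
	}
	found := map[string]bool{}
	for _, deps := range []map[string]string{m.Dependencies, m.DevDependencies} {
		for pkg := range deps {
			if fw, ok := npmSignatures[pkg]; ok {
				found[fw] = true
			}
		}
	}
	out := make([]string, 0, len(found))
	for fw := range found {
		out = append(out, fw)
	}
	sort.Strings(out) // map iteration order is random; sort for stable output
	return out, nil
}

func main() {
	pkg := []byte(`{
		"dependencies": {"openai": "^4.0.0", "express": "^4.18.0"},
		"devDependencies": {"promptfoo": "^0.50.0"}
	}`)
	frameworks, err := detectFromPackageJSON(pkg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(frameworks)
}
```

Sorting the result keeps the output deterministic, which matters if detection results feed snapshot or golden tests.
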
- terrain ai list: shows detected frameworks, auto-derived scenarios,
  prompt/dataset surfaces, eval files, model files
- terrain ai doctor: 5-point diagnostic with framework detection
- terrain ai run: detects eval framework, builds and executes command
  (promptfoo, deepeval, pytest, vitest delegation)
- terrain ai record: saves scenario state as baseline snapshot
- terrain ai baseline: shows baseline contents and comparison guidance

Pipeline wired with:
- AI framework detection at Step 2c
- Auto-scenario derivation (merged with manual YAML, no duplicates)
- Gauntlet artifact ingestion at Step 4c
- --gauntlet flag on terrain analyze

7 AI workflow integration tests + 5 CLI command tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename 'Tests Detected' to 'Validation Inventory' with new fields:
CodeSurfaceCount, ScenarioCount, PromptCount, DatasetCount.

Add stability hints section: shows runtime data guidance when no
runtime artifacts provided, skip hints when skips detected.

5 edge case scenario tests (few tests, heavy manual, AI-heavy,
flaky signals, report structure).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Graph struct gains typeIndex, familyIndex (populated during AddNode),
neighborCache, outDegreeCache (populated lazily after Seal).

NodesByType/NodesByFamily: O(n) scan+sort → O(1) index lookup (12ns).
Nodes(): O(n log n) → O(1) cache hit (1.8ns).
Neighbors(): O(e log e) → O(1) cache hit (23ns).

AnalyzeCoverage: merged dual Incoming() loops into single pass,
removed dead pre-built edge index.
AnalyzeImpact: uses OutDegree() cache instead of precomputing all.

12 benchmarks added covering build, query, and analysis operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
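
The index-and-cache approach in the commit above can be illustrated with a stripped-down graph. This is a sketch under assumptions: the field and method names mirror the commit message, but the bodies are invented for illustration, and unlike a production implementation this version is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a minimal hypothetical node.
type Node struct {
	ID   string
	Type string
}

// Graph keeps a per-type index that is updated on every AddNode and a
// sorted ID cache that is filled lazily after Seal.
type Graph struct {
	nodes     map[string]Node
	typeIndex map[string][]string // node type -> node IDs
	sealed    bool
	sortedIDs []string // cache for Nodes(), valid only once sealed
}

func NewGraph() *Graph {
	return &Graph{nodes: map[string]Node{}, typeIndex: map[string][]string{}}
}

// AddNode records the node and updates the type index in the same step.
// Duplicate IDs and writes after Seal are ignored in this sketch.
func (g *Graph) AddNode(n Node) {
	if g.sealed {
		return
	}
	if _, exists := g.nodes[n.ID]; exists {
		return
	}
	g.nodes[n.ID] = n
	g.typeIndex[n.Type] = append(g.typeIndex[n.Type], n.ID)
}

// Seal sorts each index bucket once so later lookups need no sorting.
func (g *Graph) Seal() {
	for _, ids := range g.typeIndex {
		sort.Strings(ids)
	}
	g.sealed = true
}

// NodesByType is a single map lookup instead of a scan plus sort.
func (g *Graph) NodesByType(t string) []string {
	return g.typeIndex[t]
}

// Nodes sorts on first use after Seal, then serves the cached slice.
func (g *Graph) Nodes() []string {
	if g.sealed && g.sortedIDs != nil {
		return g.sortedIDs
	}
	ids := make([]string, 0, len(g.nodes))
	for id := range g.nodes {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	if g.sealed {
		g.sortedIDs = ids
	}
	return ids
}

func main() {
	g := NewGraph()
	g.AddNode(Node{ID: "test:b", Type: "test"})
	g.AddNode(Node{ID: "src:a", Type: "source"})
	g.AddNode(Node{ID: "test:a", Type: "test"})
	g.Seal()

	fmt.Println(g.NodesByType("test"))
	fmt.Println(g.Nodes())
}
```

The trade-off is extra memory and a one-time sort in exchange for constant-time reads, which pays off when the graph is built once and queried many times. Callers must treat the returned slices as read-only, since they alias the cache.
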
Runtime artifact ingestion: worker pool for multi-file parsing.
Coverage directory ingestion: worker pool with up to 8 workers.
Duplicate scoring: parallel pair scoring for >100 candidates.

All use per-index result arrays for deterministic ordering.
4 determinism tests added (duplicates, coverage, impact, fanout).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5-step progress reporting via ProgressFunc callback:
  [1/5] Scanning repository
  [2/5] Building graph
  [3/5] Inferring validations
  [4/5] Computing insights
  [5/5] Writing report

TTY detection via os.Stderr.Stat(). Suppressed in JSON mode and
non-interactive (pipe/redirect) environments. 4 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add benchmark smoke tests to go-test job: validates 4 canonical
commands across 3 fixtures with JSON structure assertions.

Remove develop branch trigger from ci.yml, terrain-pr.yml, codeql.yml
(no develop branch exists on remote).

Update benchmark assessment sections for current output format.
Fix explain fallback to use 'selection' when no test ID found.
Write benchmark-results.json and benchmark-report.md as canonical names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delete bin/hamlet.js (deprecated JS CLI alias).
Remove hamlet binary detection from main.go (already committed).
Remove 'hamlet' bin entry from package.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated:
- terrain-overview.md: current state with AI capabilities, 20 node types,
  5 reasoning pipelines, benchmark count
- 00-overview.md: architecture with inference pipeline, AI validation,
  expanded 'How They Relate' table
- 22-reasoning-engine.md: 5 production reasoning pipelines
- feature-matrix.md: all AI features with implemented/partial/future status
- README.md: AI command table, updated architecture package list
- impact-report.md: AI repo example with scenario impact
- cli-spec.md: terrain ai commands documented

New:
- behavior-inference.md: code surface and behavior inference pipeline
- persona-matrix.md: per-persona feature coverage with hero workflows
- ai-user-journey.md: full PR → impact → ai run → explain workflow
- build-test-release.md: CI/CD pipeline, test taxonomy, release flow
- gauntlet.md: Gauntlet integration guide
- cleanup-report.md: deprecated functionality removal report
- analyze-backend.md, analyze-ai.md: domain-specific examples
- truth-validation.md: truth validation harness documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New package internal/truthcheck/ and cmd/terrain-truthcheck/ for
validating Terrain output against documented ground truth specs.

7 category checkers: impact, coverage, redundancy, fanout, stability,
AI, environment. Each computes precision, recall, and F1 score.

Usage: terrain-truthcheck --root <repo> --truth <spec.yaml> [--output <dir>]

Outputs: report.json (structured) + report.md (human-readable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
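
For reference, the per-category precision, recall, and F1 computation can be sketched as below. It compares a set of expected IDs from a truth spec with the IDs a tool actually reported. The `Score` type and `score` function are hypothetical names; the formulas themselves are the standard definitions.

```go
package main

import "fmt"

// Score holds the three metrics for one truth category.
type Score struct {
	Precision float64
	Recall    float64
	F1        float64
}

// toSet converts a slice of IDs into a set.
func toSet(ids []string) map[string]bool {
	s := make(map[string]bool, len(ids))
	for _, id := range ids {
		s[id] = true
	}
	return s
}

// score compares expected (ground truth) against actual (tool output).
//   precision = true positives / reported items
//   recall    = true positives / expected items
//   F1        = harmonic mean of precision and recall
// Empty denominators yield 0 rather than NaN.
func score(expected, actual []string) Score {
	want, got := toSet(expected), toSet(actual)
	truePositives := 0
	for id := range got {
		if want[id] {
			truePositives++
		}
	}
	var s Score
	if len(got) > 0 {
		s.Precision = float64(truePositives) / float64(len(got))
	}
	if len(want) > 0 {
		s.Recall = float64(truePositives) / float64(len(want))
	}
	if s.Precision+s.Recall > 0 {
		s.F1 = 2 * s.Precision * s.Recall / (s.Precision + s.Recall)
	}
	return s
}

func main() {
	expected := []string{"test_a", "test_b", "test_c", "test_d"}
	actual := []string{"test_a", "test_b", "test_c", "test_x"}
	s := score(expected, actual)
	// 3 of 4 reported items are correct, and 3 of 4 expected items were found.
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", s.Precision, s.Recall, s.F1)
}
```

A fixture scores 100% F1 only when the reported set matches the truth set exactly, so a single missed or spurious item is enough to drop below it.
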
terrain-world: 7 domains, 6 layers, 12 intentional problems (100% F1)
saas-control-plane: B2B SaaS, 7 domains, auth+billing+AI (100% F1)
python-ml-observatory: Python ML, 6 domains, pytest evals (100% F1)
legacy-omnichannel: Mixed JS/TS, 7 domains, framework overlap (94% F1)
healthcare-ehr: EHR, 7 domains, patient+triage workflows (100% F1)
fintech-trading: Trading, 7 domains, risk+orders+AI (100% F1)
edtech-lms: LMS, 5 domains + AI tutor (100% F1)
logistics-platform: Shipping, 5 domains + AI routing (100% F1)
social-media-api: Social, 5 domains + AI content (100% F1)
iot-device-hub: IoT, 5 domains + AI anomaly (100% F1)
devops-pipeline: CI/CD, 5 domains + AI insights (100% F1)
gaming-backend: Gaming, 5 domains + AI matchmaking (100% F1)
real-estate-crm: CRM, 5 domains + AI valuation (100% F1)
food-delivery: Delivery, 5 domains + AI recommendations (100% F1)

Each fixture includes truth spec, README, .terrain config.
benchmarks/repos.json updated with all new entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot commented Mar 16, 2026

Terrain — Change Analysis

Posture: [RISK] WEAKLY_PROTECTED

| Metric | Count |
| --- | --- |
| Changed files | 510 |
| Changed source files | 53 |
| Changed test files | 413 |
| Impacted code units | 225 |
| Protection gaps | 62 |

Findings

  • [MED] protection_gap bin/hamlet.js: hamlet.js has no observed test coverage.
    • Action: Add unit tests for hamlet.js.
  • [HIGH] protection_gap cmd/terrain-truthcheck/main.go: Exported function TruthCheckReport has no observed test coverage.
    • Action: Add unit tests for exported function TruthCheckReport — this is public API surface.
  • [MED] protection_gap cmd/terrain/main.go: main.go has no observed test coverage.
    • Action: Add unit tests for main.go.
  • [MED] protection_gap cmd/terrain/progress.go: progress.go has no observed test coverage.
    • Action: Add unit tests for progress.go.
  • [HIGH] protection_gap internal/analysis/code_surface.go: Exported function ExtractSurfaces has no observed test coverage.
    • Action: Add unit tests for exported function ExtractSurfaces — this is public API surface.
  • [HIGH] protection_gap internal/analysis/code_surface.go: Exported function Language has no observed test coverage.
    • Action: Add unit tests for exported function Language — this is public API surface.
  • [HIGH] protection_gap internal/analysis/code_surface.go: Exported function ExtractSurfaces has no observed test coverage.
    • Action: Add unit tests for exported function ExtractSurfaces — this is public API surface.
  • [HIGH] protection_gap internal/analysis/code_surface.go: Exported function Language has no observed test coverage.
    • Action: Add unit tests for exported function Language — this is public API surface.
  • [HIGH] protection_gap internal/analysis/code_surface.go: Exported function ExtractSurfaces has no observed test coverage.
    • Action: Add unit tests for exported function ExtractSurfaces — this is public API surface.
  • [HIGH] protection_gap internal/analysis/code_surface.go: Exported function Language has no observed test coverage.
    • Action: Add unit tests for exported function Language — this is public API surface.
  • ... and 95 more finding(s)

Recommended Tests

418 test(s) with exact coverage of 162 impacted unit(s). 63 impacted unit(s) have no covering tests in the selected set.

| Test | Confidence | Why |
| --- | --- | --- |
| cmd/terrain/ai_workflow_test.go | exact | exact coverage of DataCompleteness, DefaultEngineVersion, PipelineOptions, PipelineResult + 40 more (shared across 8 tests) |
| cmd/terrain/main_test.go | exact | exact coverage of BuildSurfaceID, CodeSurface, CodeSurfaceKind (shared across 69 tests) |
| cmd/terrain/progress_test.go | exact | test file directly changed |
| cmd/terrain/snapshot_test.go | exact | exact coverage of Build, BuildInput, BuildSnapshotProfileData, CIOptimizationSummary + 79 more (shared across 3 tests) |
| internal/aidetect/detect_test.go | exact | exact coverage of Detect, DetectResult, Framework, FrameworkSignature + 2 more (+ 3 shared) |
| internal/analysis/analyzer_test.go | exact | exact coverage of InferCodeSurfaces, SurfaceExtractor (shared across 9 tests) |
| internal/analysis/behavior_surface_test.go | exact | exact coverage of InferCodeSurfaces, SurfaceExtractor, BuildSurfaceID, CodeSurface + 1 more (shared across 9 tests) |
| internal/analysis/code_surface_test.go | exact | exact coverage of InferCodeSurfaces, SurfaceExtractor, BuildSurfaceID, CodeSurface + 1 more (shared across 9 tests) |
| internal/analysis/content_analysis_test.go | exact | exact coverage of InferCodeSurfaces, SurfaceExtractor, BuildSurfaceID, CodeSurface + 1 more (shared across 9 tests) |
| internal/analysis/framework_detection_test.go | exact | exact coverage of InferCodeSurfaces, SurfaceExtractor (shared across 9 tests) |
| internal/analysis/import_graph_benchmark_test.go | exact | exact coverage of InferCodeSurfaces, SurfaceExtractor, BuildSurfaceID, CodeSurface + 1 more (shared across 9 tests) |
| internal/analysis/import_graph_test.go | exact | exact coverage of InferCodeSurfaces, SurfaceExtractor, BuildSurfaceID, CodeSurface + 1 more (shared across 9 tests) |
| internal/analysis/language_test.go | exact | exact coverage of InferCodeSurfaces, SurfaceExtractor (shared across 9 tests) |
| internal/analyze/analyze_golden_test.go | exact | exact coverage of Build, BuildInput, BuildSnapshotProfileData, CIOptimizationSummary + 18 more (shared across 3 tests) |
| internal/analyze/analyze_test.go | exact | exact coverage of Build, BuildInput, BuildSnapshotProfileData, CIOptimizationSummary + 18 more (shared across 3 tests) |

...and 403 more test(s).

Affected Owners

  • pmclachlansf

Generated by Terrain — test system intelligence platform

Targeted Test Results

Terrain selected 418 test(s) instead of the full suite.

  • Go tests: passed
  • JS tests: passed

Remove committed benchmark output (generated, should be gitignored):
- benchmarks/output/cli-benchmark-assessment.json
- benchmarks/output/cli-benchmark-summary.md

Remove stale committed files:
- fixtures/demos/ (internal Go testdata duplicates)
- ci/README.md (empty placeholder)

Rewrite .gitignore:
- Remove 80+ lines of Node.js boilerplate
- Add Go binary patterns (/terrain, /terrain-bench, /terrain-truthcheck)
- Add benchmark output patterns (all generated files)
- Add truthcheck output directory
- Add stale directory patterns (hamlet/, fixtures/demos/, ci/)

Rewrite .npmignore for defense-in-depth:
- Add Go source exclusions (internal/, cmd/, go.mod, go.sum)
- Add test/fixture/benchmark exclusions
- Add extension/ and examples/ exclusions
- Note: package.json "files" allowlist is the primary gate

npm pack: 99 files, 246KB (unchanged, verified clean).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pmclSF pmclSF merged commit 1f51e0b into main Mar 16, 2026
10 checks passed
@pmclSF pmclSF deleted the feat/v3-ai-testing-and-hardening branch March 16, 2026 05:14