Add eval coverage for dotnet-test/test-gap-analysis#829

Merged: Evangelink merged 2 commits into main from evangelink-eval-test-gap-analysis on Jun 25, 2026

Conversation

@Evangelink

Member

Extends tests/dotnet-test/test-gap-analysis/eval.yaml to cover eight previously uncovered SKILL.md teaching points, adding two scenarios with supporting fixtures.

Now-covered points

Validation

  • Trivial code (simple getters, auto-properties) is excluded from analysis
  • Findings are prioritized by risk, not just listed in source order
  • Report includes strengths (killed mutations) alongside gaps
  • Mutation categories are correctly labeled

Common Pitfalls

  • Analyzing trivial code
  • Ignoring call chains
  • Over-counting mutations in generated code
  • Forgetting Rust's ? operator

What changed

  • Scenario 5 (report-quality) — new C# Billing fixture with trivial auto-properties/getters, an auto-generated InvoiceProcessor.g.cs, and private helpers reached via a call chain. Rubric items verify the analysis excludes trivial code, skips generated code, traces call chains, prioritizes by risk, reports strengths alongside gaps, and labels mutation categories.
  • Scenario 6 (rust-error-propagation) — new Rust fixture using the ? operator with no test on the error path. Rubric items verify the ? propagation is flagged as an Exception/Panic mutation point.
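The kind of gap Scenario 6 targets can be sketched as follows. This is a hypothetical illustration, not the contents of the fixture's src/lib.rs: the names `parse_quantity` and `line_total_cents` are assumptions made up for this example. The `?` operator is an implicit early return on `Err`, so a test suite that only exercises the `Ok` path never observes that branch.

```rust
use std::num::ParseIntError;

/// Hypothetical helper (not from the PR's fixture): parses a quantity string.
fn parse_quantity(raw: &str) -> Result<u32, ParseIntError> {
    raw.trim().parse::<u32>()
}

/// Computes a line total in cents.
/// The `?` below returns early with the error if parsing fails, which makes
/// it an error-path mutation point a gap analysis would want to see tested.
pub fn line_total_cents(
    raw_quantity: &str,
    unit_price_cents: u32,
) -> Result<u32, ParseIntError> {
    let quantity = parse_quantity(raw_quantity)?; // error path: untested below
    Ok(quantity * unit_price_cents)
}

#[cfg(test)]
mod tests {
    use super::*;

    // Only the happy path is covered; nothing asserts on the `Err` branch,
    // so a mutation that swallows or alters the error would survive.
    #[test]
    fn computes_total_for_valid_quantity() {
        assert_eq!(line_total_cents("3", 250), Ok(750));
    }
}
```

Under these assumptions, a single extra assertion such as `assert!(line_total_cents("abc", 250).is_err());` would be enough to observe the propagated error.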

Verification

  • Measure-SkillCoverage.ps1: coverage is now 100% (24/24), up from 66.7%; the uncovered list is empty and there are no regressions.
  • SkillValidator check --plugin ./plugins/dotnet-test: ✅ all checks passed (only pre-existing token-size warnings).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 25, 2026 08:15
@github-actions

Contributor

Skill Coverage Report

| Plugin | Skill | Covered | Coverage |
| --- | --- | --- | --- |
| dotnet-test | test-gap-analysis | 24/24 | 100% |

Copilot AI left a comment

Contributor

Pull request overview

This PR extends the dotnet-test/test-gap-analysis evaluation suite by adding two new scenarios (C# “report quality” and Rust ? error propagation) plus supporting fixtures, to cover previously uncovered SKILL.md teaching points around prioritization, trivial/generated code handling, call-chain tracing, strengths reporting, and Rust error propagation.

Changes:

  • Added Scenario 5 to eval.yaml with a new C# Billing fixture (includes trivial members, a generated .g.cs file, and helper call chains) and rubric/assertions targeting report quality.
  • Added Scenario 6 to eval.yaml with a new Rust fixture demonstrating unobserved ? error propagation and rubric/assertions to classify it as an Exception/Panic mutation point.
  • Introduced new fixture projects/files under fixtures/report-quality and fixtures/rust-error-propagation.

| File | Description |
| --- | --- |
| tests/dotnet-test/test-gap-analysis/fixtures/rust-error-propagation/src/lib.rs | Rust library + tests to demonstrate unobserved ?-based error propagation. |
| tests/dotnet-test/test-gap-analysis/fixtures/rust-error-propagation/Cargo.toml | Minimal Cargo manifest for the Rust fixture. |
| tests/dotnet-test/test-gap-analysis/fixtures/report-quality/Billing/InvoiceProcessor.g.cs | Auto-generated partial type stub to validate generated-code exclusion. |
| tests/dotnet-test/test-gap-analysis/fixtures/report-quality/Billing/InvoiceProcessor.cs | Billing business-logic fixture used to drive risk-prioritized gap analysis. |
| tests/dotnet-test/test-gap-analysis/fixtures/report-quality/Billing/Billing.csproj | C# project file for the Billing fixture. |
| tests/dotnet-test/test-gap-analysis/fixtures/report-quality/Billing.Tests/InvoiceProcessorTests.cs | MSTest fixture tests intentionally leaving gaps to be reported. |
| tests/dotnet-test/test-gap-analysis/fixtures/report-quality/Billing.Tests/Billing.Tests.csproj | Test project file referencing Billing + MSTest. |
| tests/dotnet-test/test-gap-analysis/eval.yaml | Adds two new evaluation scenarios and associated rubrics/assertions. |

Copilot's findings

  • Files reviewed: 8/8 changed files
  • Comments generated: 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Evangelink

Member Author

/evaluate

@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| test-gap-analysis | Find boundary mutation gaps in tiered discount and shipping logic | 5.0/5 → 5.0/5 | ✅ test-gap-analysis; tools: skill | ✅ 0.15 | |
| test-gap-analysis | Find logic and null-check mutation gaps in access control code | 4.7/5 → 5.0/5 🟢 | ✅ test-gap-analysis; tools: skill, bash / ✅ test-gap-analysis; tools: skill | ✅ 0.15 | [1] |
| test-gap-analysis | Acknowledge well-tested code with few surviving mutations | 3.7/5 → 3.3/5 🔴 | ✅ test-gap-analysis; tools: skill | ✅ 0.15 | [2] |
| test-gap-analysis | Decline request to write new tests from scratch | 1.7/5 ⏰ → 2.3/5 ⏰ 🟢 | ℹ️ not activated (expected) | ✅ 0.15 | [3] |
| test-gap-analysis | Produce a risk-prioritized report that excludes trivial and generated code | 3.3/5 → 4.7/5 🟢 | ✅ test-gap-analysis; tools: skill | ✅ 0.15 | |
| test-gap-analysis | Flag the Rust ? operator propagation as an unobserved mutation point | 4.3/5 → 5.0/5 🟢 | ✅ test-gap-analysis; tools: skill | ✅ 0.15 | [4] |

[1] ⚠️ High run-to-run variance (CV=230%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=276%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=705%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=77%) — consider re-running with --runs 5
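For readers unfamiliar with the CV figure in these footnotes: the coefficient of variation is conventionally the standard deviation divided by the mean. How the validator computes it, and over which per-run values, is not stated in this thread, so the sketch below is an assumption (sample standard deviation over the absolute mean, with a hypothetical function name), not the tool's implementation.

```rust
/// Coefficient of variation as a percentage: sample standard deviation
/// divided by the absolute mean. Returns None when it is undefined
/// (fewer than two values, or a mean of exactly zero).
/// Assumption: this is an illustrative formula, not the validator's own code.
fn coefficient_of_variation_pct(values: &[f64]) -> Option<f64> {
    let n = values.len();
    if n < 2 {
        return None;
    }
    let mean = values.iter().sum::<f64>() / n as f64;
    if mean == 0.0 {
        return None;
    }
    // Sample variance uses the n - 1 denominator.
    let variance = values
        .iter()
        .map(|v| (v - mean).powi(2))
        .sum::<f64>()
        / (n as f64 - 1.0);
    Some(variance.sqrt() / mean.abs() * 100.0)
}
```

Because the mean is in the denominator, values that scatter around zero produce very large percentages, which is one way figures well above 100% can arise from a small number of runs.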

⏰ timeout: one or more runs hit the 120s scenario timeout limit; scoring may be affected because model execution was aborted before it could produce its full output (increase the limit via timeout in eval.yaml).

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 829 in dotnet/skills, download eval artifacts with gh run download 28157475694 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/4f1b8ebcda7952b5b0be18221902a1813361c5a4/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@github-actions bot added the waiting-on-review label (PR state label) Jun 25, 2026
@github-actions

Contributor

✅ Evaluation passed for 4f1b8eb. cc @dotnet/dotnet-testing — please review.

@Evangelink enabled auto-merge (squash) June 25, 2026 10:38
@Evangelink merged commit 91d68cb into main Jun 25, 2026 (34 of 36 checks passed)
@Evangelink deleted the evangelink-eval-test-gap-analysis branch June 25, 2026 11:26