Add eval coverage for dotnet-test/grade-tests by Evangelink · Pull Request #825 · dotnet/skills

Evangelink · 2026-06-25T08:14:16Z

Extends tests/dotnet-test/grade-tests/eval.yaml with two new scenarios (plus supporting fixtures) so the previously-uncovered teaching points in plugins/dotnet-test/skills/grade-tests/SKILL.md are now exercised.

Now-covered points

[Validation] Every grade is justified by at least one observable signal (L307)
[Pitfall] Inflating deductions to justify the grade (L331)
[Pitfall] Flagging Go/Rust table-driven loops as conditional logic (L335)
[Pitfall] Penalizing tests when production code is unavailable (L337)
[Pitfall] Spilling a 500-row table into a PR comment (L339)

What was added

Scenario 4 — Go table-driven tests (fixtures/go-table-driven/): grades idiomatic for ... range / t.Run subtests; rubric asserts the loop is NOT misread as conditional logic, every grade rests on an observable signal, and deductions are not inflated.
Scenario 5 — production code unavailable (fixtures/production-unavailable/): grades tests whose code under test (Payments.Core) is absent; rubric asserts behavioral concerns are marked Unverified (not deducted) and the report stays compact rather than spilling a giant 500-row table into the PR comment.

Prompts are natural (no skill references); rubric items are outcome-focused and independently evaluable.

Verification

Measure-SkillCoverage.ps1 -PluginName dotnet-test -SkillName grade-tests → 100% (28/28), uncovered empty, no regressions.
SkillValidator check --plugin ./plugins/dotnet-test → ✅ all checks passed (YAML parses; remaining warnings are pre-existing token-size notes).

No SKILL.md or other skills were modified.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-25T08:14:51Z

Skill Coverage Report

	Plugin	Skill	Covered	Coverage
✅	`dotnet-test`	`grade-tests`	28/28	100%

Copilot

Pull request overview

This PR expands the dotnet-test/grade-tests evaluation suite by adding two new scenarios (Go table-driven tests; production code unavailable) plus new fixtures to exercise previously uncovered guidance in the grade-tests skill.

Changes:

Added Scenario 4 to grade idiomatic Go table-driven tests and ensure loops/assertion patterns aren’t misclassified as branching.
Added Scenario 5 to grade tests when the production code is missing, expecting “Unverified” (not deductions) and compact PR-comment output.
Added supporting Go and C# fixture workspaces for the new scenarios.

Show a summary per file

File	Description
tests/dotnet-test/grade-tests/eval.yaml	Adds two new eval scenarios and their assertions/rubrics to cover additional grade-tests teaching points.
tests/dotnet-test/grade-tests/fixtures/go-table-driven/go.mod	Introduces a minimal Go module for the table-driven test fixture.
tests/dotnet-test/grade-tests/fixtures/go-table-driven/calculator.go	Adds simple Go production code to be graded against by the Go scenario.
tests/dotnet-test/grade-tests/fixtures/go-table-driven/calculator_test.go	Adds Go tests of varying quality (A/C/F) including a table-driven subtest loop.
tests/dotnet-test/grade-tests/fixtures/production-unavailable/Payments.Tests/Payments.Tests.csproj	Adds a .NET test project fixture explicitly lacking the production project reference.
tests/dotnet-test/grade-tests/fixtures/production-unavailable/Payments.Tests/PaymentGatewayTests.cs	Adds MSTest methods of varying quality for the “production unavailable” grading scenario.

Copilot's findings

Files reviewed: 6/6 changed files
Comments generated: 2

Address PR review: verify TestParse_NoError=C in scenario 4, and Charge_NegativeAmount=A / Refund_ExistingCharge=C in scenario 5 so the scenarios cannot pass on misgraded output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-06-25T08:32:31Z

/evaluate

github-actions · 2026-06-25T08:42:07Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
grade-tests	Grade a curated list of MSTest tests with mixed quality	4.3/5 → 5.0/5 🟢	✅ grade-tests; tools: skill, glob / ✅ grade-tests; tools: skill, glob, bash, task, read_agent	✅ 0.20	❌ [1]
grade-tests	Grade pytest test methods using the same rubric	5.0/5 → 4.0/5 🔴	✅ grade-tests; tools: skill, glob, bash / ✅ grade-tests; tools: skill, glob	✅ 0.20	❌ [2]
grade-tests	Decline to grade the entire workspace when no test list is provided	1.0/5 → 5.0/5 🟢	✅ grade-tests; tools: skill / ✅ grade-tests; test-analysis-extensions; tools: skill, bash, glob	✅ 0.20	❌ [3]
grade-tests	Grade Go table-driven tests without misreading the loop as branching	4.0/5 → 4.0/5	✅ grade-tests; tools: skill / ✅ grade-tests; tools: skill, glob, bash	✅ 0.20	❌ [4]
grade-tests	Grade tests when the production code under test is unavailable	4.0/5 → 4.0/5	✅ grade-tests; tools: skill, glob, bash / ✅ grade-tests; tools: skill, bash, glob	✅ 0.20	❌ [5]

[1] ⚠️ High run-to-run variance (CV=318%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.2% due to: tokens (40505 → 156555), tool calls (3 → 8), time (51.0s → 156.7s)
[2] (Plugin) Quality unchanged but weighted score is -8.6% due to: tokens (26417 → 122036), tool calls (2 → 6), time (28.0s → 112.7s)
[3] (Plugin) Quality unchanged but weighted score is -12.7% due to: errors (0 → 1), tokens (52678 → 105055), time (71.9s → 118.6s), tool calls (7 → 10)
[4] ⚠️ High run-to-run variance (CV=133%) — consider re-running with --runs 5
[5] (Plugin) Quality unchanged but weighted score is -9.2% due to: tokens (26558 → 131127), tool calls (2 → 7), time (47.0s → 110.8s)

⏰ timeout — run(s) hit the (120s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 825 in dotnet/skills, download eval artifacts with gh run download 28157445404 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/6b6633b6c01c3ca134e0a53af5e52365ca699461/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

github-actions · 2026-06-25T09:29:19Z

👋 @Evangelink — this PR has 1 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

github-actions · 2026-06-25T11:15:05Z

✅ Evaluation passed for 6b6633b. cc @dotnet/dotnet-testing — please review.

Add eval coverage for dotnet-test/grade-tests

3102cd7

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 25, 2026 08:14

Copilot started reviewing on behalf of Evangelink June 25, 2026 08:14

Copilot AI reviewed Jun 25, 2026

Comment thread tests/dotnet-test/grade-tests/eval.yaml

Comment thread tests/dotnet-test/grade-tests/eval.yaml

github-actions Bot added the waiting-on-author label (PR state label) Jun 25, 2026

Evangelink enabled auto-merge (squash) June 25, 2026 10:39

github-actions Bot added the waiting-on-review label and removed the waiting-on-author label (PR state labels) Jun 25, 2026

YuliiaKovalova approved these changes Jun 25, 2026

Evangelink merged commit 29f09cc into main Jun 25, 2026
34 of 36 checks passed

Evangelink deleted the evangelink-eval-coverage-grade-tests branch June 25, 2026 11:26

Evangelink merged 2 commits into main from evangelink-eval-coverage-grade-tests

3 participants
