Add eval coverage for dotnet-test/grade-tests#825

Merged: Evangelink merged 2 commits into main from evangelink-eval-coverage-grade-tests on Jun 25, 2026

@Evangelink

Member

Extends tests/dotnet-test/grade-tests/eval.yaml with two new scenarios (plus supporting fixtures) so the previously-uncovered teaching points in plugins/dotnet-test/skills/grade-tests/SKILL.md are now exercised.

Now-covered points

  • [Validation] Every grade is justified by at least one observable signal (L307)
  • [Pitfall] Inflating deductions to justify the grade (L331)
  • [Pitfall] Flagging Go/Rust table-driven loops as conditional logic (L335)
  • [Pitfall] Penalizing tests when production code is unavailable (L337)
  • [Pitfall] Spilling a 500-row table into a PR comment (L339)

What was added

  • Scenario 4 — Go table-driven tests (fixtures/go-table-driven/): grades idiomatic for ... range / t.Run subtests; rubric asserts the loop is NOT misread as conditional logic, every grade rests on an observable signal, and deductions are not inflated.
  • Scenario 5 — production code unavailable (fixtures/production-unavailable/): grades tests whose code under test (Payments.Core) is absent; rubric asserts behavioral concerns are marked Unverified (not deducted) and the report stays compact rather than spilling a giant 500-row table into the PR comment.
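
To make the Scenario 4 pitfall concrete, here is a minimal sketch of the table-driven shape in Go. It is not the fixture's calculator_test.go: `Add`, `addCase`, `addCases`, and `runAddCases` are hypothetical names, and the loop is written as a plain function so the program runs on its own. In a real `_test.go` file the loop body would usually be wrapped in `t.Run(tc.name, func(t *testing.T) { ... })` and failures reported with `t.Errorf`.

```go
package main

import "fmt"

// Add is a hypothetical function under test (not the fixture's calculator.go).
func Add(a, b int) int { return a + b }

// addCase is one row of the test table.
type addCase struct {
	name string
	a, b int
	want int
}

// addCases is the table. Adding a row adds a case without adding a branch.
var addCases = []addCase{
	{"positive operands", 2, 3, 5},
	{"negative operand", -4, 1, -3},
	{"zero identity", 0, 7, 7},
}

// runAddCases mirrors the body of a table-driven test and returns one
// message per failing row.
func runAddCases() []string {
	var failures []string
	for _, tc := range addCases { // iteration over data, not conditional logic
		if got := Add(tc.a, tc.b); got != tc.want { // the assertion itself
			failures = append(failures,
				fmt.Sprintf("%s: Add(%d, %d) = %d, want %d", tc.name, tc.a, tc.b, got, tc.want))
		}
	}
	return failures
}

func main() {
	failures := runAddCases()
	fmt.Printf("%d cases, %d failures\n", len(addCases), len(failures))
}
```

The `for ... range` only walks the rows, and the single `if` is the assertion. Neither selects between alternative test behaviors, which is why a grader should not count them as branching inside the test.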

Prompts are natural (no skill references); rubric items are outcome-focused and independently evaluable.

Verification

  • Measure-SkillCoverage.ps1 -PluginName dotnet-test -SkillName grade-tests → 100% (28/28), uncovered list empty, no regressions.
  • SkillValidator check --plugin ./plugins/dotnet-test → ✅ all checks passed (YAML parses; remaining warnings are pre-existing token-size notes).

No SKILL.md or other skills were modified.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 25, 2026 08:14
@github-actions

Contributor

Skill Coverage Report

| Plugin | Skill | Covered | Coverage |
| --- | --- | --- | --- |
| dotnet-test | grade-tests | 28/28 | 100% |

Copilot AI left a comment

Contributor

Pull request overview

This PR expands the dotnet-test/grade-tests evaluation suite by adding two new scenarios (Go table-driven tests; production code unavailable) plus new fixtures to exercise previously uncovered guidance in the grade-tests skill.

Changes:

  • Added Scenario 4 to grade idiomatic Go table-driven tests and ensure loops/assertion patterns aren’t misclassified as branching.
  • Added Scenario 5 to grade tests when the production code is missing, expecting “Unverified” (not deductions) and compact PR-comment output.
  • Added supporting Go and C# fixture workspaces for the new scenarios.
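
The "Unverified, not deducted" expectation in Scenario 5 can be pictured with a small sketch. This is only an illustration of the rule under stated assumptions: the grade-tests skill is prose guidance, not code, and `Concern`, `Outcome`, `assess`, and the penalty values are hypothetical names and numbers introduced here.

```go
package main

import "fmt"

// Concern is a hypothetical record of one issue a grader noticed in a test.
type Concern struct {
	Summary             string
	NeedsProductionCode bool // true when confirming it requires reading the code under test
	Penalty             int  // illustrative weight, not a value from the skill
}

// Outcome separates confirmed deductions from concerns that could not be checked.
type Outcome struct {
	Deduction  int
	Unverified []string
}

// assess applies a penalty only when the concern can actually be confirmed.
// Concerns that depend on missing production code are listed as unverified.
func assess(concerns []Concern, productionAvailable bool) Outcome {
	var out Outcome
	for _, c := range concerns {
		if c.NeedsProductionCode && !productionAvailable {
			out.Unverified = append(out.Unverified, c.Summary)
			continue
		}
		out.Deduction += c.Penalty
	}
	return out
}

func main() {
	concerns := []Concern{
		{"expected value may not match the gateway's real behavior", true, 2},
		{"test method contains no assertions", false, 5},
	}
	out := assess(concerns, false) // production code is absent
	fmt.Printf("deduction=%d unverified=%d\n", out.Deduction, len(out.Unverified))
}
```

With the production project absent, only the signal observable in the test file itself (no assertions) costs points. The behavioral concern is reported but does not lower the grade.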
Show a summary per file

| File | Description |
| --- | --- |
| tests/dotnet-test/grade-tests/eval.yaml | Adds two new eval scenarios and their assertions/rubrics to cover additional grade-tests teaching points. |
| tests/dotnet-test/grade-tests/fixtures/go-table-driven/go.mod | Introduces a minimal Go module for the table-driven test fixture. |
| tests/dotnet-test/grade-tests/fixtures/go-table-driven/calculator.go | Adds simple Go production code to be graded against by the Go scenario. |
| tests/dotnet-test/grade-tests/fixtures/go-table-driven/calculator_test.go | Adds Go tests of varying quality (A/C/F) including a table-driven subtest loop. |
| tests/dotnet-test/grade-tests/fixtures/production-unavailable/Payments.Tests/Payments.Tests.csproj | Adds a .NET test project fixture explicitly lacking the production project reference. |
| tests/dotnet-test/grade-tests/fixtures/production-unavailable/Payments.Tests/PaymentGatewayTests.cs | Adds MSTest methods of varying quality for the “production unavailable” grading scenario. |

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 2

Comment thread tests/dotnet-test/grade-tests/eval.yaml
Comment thread tests/dotnet-test/grade-tests/eval.yaml
Address PR review: verify TestParse_NoError=C in scenario 4, and Charge_NegativeAmount=A / Refund_ExistingCharge=C in scenario 5 so the scenarios cannot pass on misgraded output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Evangelink

Member Author

/evaluate

@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| grade-tests | Grade a curated list of MSTest tests with mixed quality | 4.3/5 → 5.0/5 🟢 | ✅ grade-tests; tools: skill, glob / ✅ grade-tests; tools: skill, glob, bash, task, read_agent | ✅ 0.20 | [1] |
| grade-tests | Grade pytest test methods using the same rubric | 5.0/5 → 4.0/5 🔴 | ✅ grade-tests; tools: skill, glob, bash / ✅ grade-tests; tools: skill, glob | ✅ 0.20 | [2] |
| grade-tests | Decline to grade the entire workspace when no test list is provided | 1.0/5 → 5.0/5 🟢 | ✅ grade-tests; tools: skill / ✅ grade-tests; test-analysis-extensions; tools: skill, bash, glob | ✅ 0.20 | [3] |
| grade-tests | Grade Go table-driven tests without misreading the loop as branching | 4.0/5 → 4.0/5 | ✅ grade-tests; tools: skill / ✅ grade-tests; tools: skill, glob, bash | ✅ 0.20 | [4] |
| grade-tests | Grade tests when the production code under test is unavailable | 4.0/5 → 4.0/5 | ✅ grade-tests; tools: skill, glob, bash / ✅ grade-tests; tools: skill, bash, glob | ✅ 0.20 | [5] |

[1] ⚠️ High run-to-run variance (CV=318%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.2% due to: tokens (40505 → 156555), tool calls (3 → 8), time (51.0s → 156.7s)
[2] (Plugin) Quality unchanged but weighted score is -8.6% due to: tokens (26417 → 122036), tool calls (2 → 6), time (28.0s → 112.7s)
[3] (Plugin) Quality unchanged but weighted score is -12.7% due to: errors (0 → 1), tokens (52678 → 105055), time (71.9s → 118.6s), tool calls (7 → 10)
[4] ⚠️ High run-to-run variance (CV=133%) — consider re-running with --runs 5
[5] (Plugin) Quality unchanged but weighted score is -9.2% due to: tokens (26558 → 131127), tool calls (2 → 7), time (47.0s → 110.8s)
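
For readers unfamiliar with the CV figures in notes [1] and [4]: the coefficient of variation is the standard deviation divided by the absolute mean, expressed as a percentage. The sketch below shows that arithmetic in Go. Which metric the validator feeds in, and whether it uses the sample or population standard deviation, are not stated in this PR, so the sample form (n - 1) and the input values are assumptions; `coefficientOfVariation` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// coefficientOfVariation returns the sample standard deviation divided by the
// absolute mean, as a percentage. ok is false when there are fewer than two
// values or the mean is zero, since the ratio is undefined in those cases.
func coefficientOfVariation(xs []float64) (cv float64, ok bool) {
	if len(xs) < 2 {
		return 0, false
	}
	n := float64(len(xs))

	var sum float64
	for _, x := range xs {
		sum += x
	}
	mean := sum / n
	if mean == 0 {
		return 0, false
	}

	var sumSquares float64
	for _, x := range xs {
		d := x - mean
		sumSquares += d * d
	}
	std := math.Sqrt(sumSquares / (n - 1)) // sample standard deviation (assumed)

	return 100 * std / math.Abs(mean), true
}

func main() {
	// Hypothetical per-run values; not taken from this PR's eval artifacts.
	runs := []float64{2, 4, 4, 4, 5, 5, 7, 9}
	if cv, ok := coefficientOfVariation(runs); ok {
		fmt.Printf("CV = %.1f%%\n", cv)
	}
}
```

A CV well above 100%, as in the notes above, means the spread across runs is larger than the mean itself. This typically happens when the averaged quantity is close to zero, and re-running with more runs is the usual way to get a steadier estimate.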

timeout — run(s) hit the (120s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 825 in dotnet/skills, download eval artifacts with gh run download 28157445404 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/6b6633b6c01c3ca134e0a53af5e52365ca699461/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

github-actions bot added the waiting-on-author label (PR state label) on Jun 25, 2026
@github-actions

Contributor

👋 @Evangelink — this PR has 1 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

Evangelink enabled auto-merge (squash) on June 25, 2026 at 10:39
github-actions bot added the waiting-on-review label and removed the waiting-on-author label (both PR state labels) on Jun 25, 2026
@github-actions

Contributor

✅ Evaluation passed for 6b6633b. cc @dotnet/dotnet-testing — please review.

Evangelink merged commit 29f09cc into main on Jun 25, 2026
34 of 36 checks passed
Evangelink deleted the evangelink-eval-coverage-grade-tests branch on June 25, 2026 at 11:26
Labels

waiting-on-review PR state label

3 participants