Add eval coverage for dotnet-test/writing-mstest-tests #828

Merged: Evangelink merged 2 commits into main from evangelink-eval-coverage-writing-mstest-tests on Jun 25, 2026

@Evangelink

Member

Summary

Adds eval-test coverage for six previously-uncovered modern MSTest assertion APIs taught in plugins/dotnet-test/skills/writing-mstest-tests/SKILL.md.

The existing "Write tests with collection, null, and reference assertions" scenario (Goal 7) already ships the ideal ServiceRegistry fixture — a class whose members naturally elicit each assertion — but lacked deterministic assertions tying the generated test code to these tokens. This PR enriches that scenario's prompt, adds deterministic file_contains assertions, and sharpens the rubric.

Now-covered CodePattern points

  • Assert.IsNull (SKILL.md ~L154) — resolving an unregistered service returns null
  • Assert.AreSame (~L154) — resolving a registered service returns the same instance
  • Assert.Contains (~L178) — GetAll() includes a registered service
  • Assert.DoesNotContain (~L178) — service absent after Remove
  • Assert.IsEmpty (~L178) — GetAll() empty before registration
  • Assert.IsNotEmpty (~L178) — GetAll() not empty after registration
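
The six tokens above are what the deterministic checks look for in the generated test file. The following is a minimal sketch of that idea, assuming the check is a plain case-sensitive substring test. The `ServiceRegistry` member signatures in the sample C# and the `missing_tokens` helper are hypothetical, chosen only for illustration; none of this is taken from the fixture or from eval.yaml.

```python
# Hypothetical C# test body. The ServiceRegistry signatures are assumed,
# not copied from the real fixture.
GENERATED_TEST = """
var registry = new ServiceRegistry();
Assert.IsEmpty(registry.GetAll());
Assert.IsNull(registry.Resolve("logger"));

var logger = new object();
registry.Register("logger", logger);
Assert.IsNotEmpty(registry.GetAll());
Assert.AreSame(logger, registry.Resolve("logger"));
Assert.Contains(logger, registry.GetAll());

registry.Remove("logger");
Assert.DoesNotContain(logger, registry.GetAll());
"""

# The six assertion APIs listed in the PR description.
REQUIRED_TOKENS = [
    "Assert.IsNull",
    "Assert.AreSame",
    "Assert.Contains",
    "Assert.DoesNotContain",
    "Assert.IsEmpty",
    "Assert.IsNotEmpty",
]


def missing_tokens(source: str, tokens: list[str]) -> list[str]:
    """Return the tokens that do not occur in source (case-sensitive)."""
    return [token for token in tokens if token not in source]
```

Calling `missing_tokens(GENERATED_TEST, REQUIRED_TOKENS)` on the sample returns an empty list, which mirrors the "uncovered": [] outcome a fully covered scenario would report.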

Verification

  • eng/skill-coverage/Measure-SkillCoverage.ps1 -PluginName dotnet-test -SkillName writing-mstest-tests -Format Json → "uncovered": [] (all six targets covered, no regressions).
  • dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- check --plugin ./plugins/dotnet-test → ✅ All checks passed (only pre-existing token-budget warnings).

Only eval.yaml changed; no SKILL.md or other skills touched.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Enrich the ServiceRegistry collection/null/reference scenario with deterministic file_contains assertions for the modern MSTest assertion APIs Assert.IsNull, Assert.AreSame, Assert.Contains, Assert.DoesNotContain, Assert.IsEmpty, and Assert.IsNotEmpty.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 25, 2026 08:15
@github-actions

Contributor

Skill Coverage Report

| Plugin | Skill | Covered | Coverage |
| --- | --- | --- | --- |
| dotnet-test | writing-mstest-tests | 45/45 | 100% |

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the dotnet-test plugin’s evaluation scenario for writing-mstest-tests to deterministically cover several modern MSTest assertion APIs by refining the scenario prompt, adding file_contains checks for specific assertions, and tightening the rubric accordingly.

Changes:

  • Refines the Goal 7 prompt to explicitly request behaviors that map to the targeted MSTest assertions.
  • Adds file_contains assertions to require the generated test file to include specific Assert.* APIs.
  • Updates the rubric to explicitly call out the required assertion APIs and expected usage.

Summary per file:

| File | Description |
| --- | --- |
| tests/dotnet-test/writing-mstest-tests/eval.yaml | Tightens the Goal 7 scenario prompt/rubric and adds deterministic file assertions to enforce coverage of modern MSTest assertion APIs. |

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 1

Comment thread tests/dotnet-test/writing-mstest-tests/eval.yaml
…nario

Address PR review: file_contains is a case-sensitive substring check, so the Assert.Contains and Assert.DoesNotContain value checks would also pass for CollectionAssert.* and StringAssert.* helpers. Add file_not_contains guards to keep the scenario a deterministic check for the modern Assert.* API.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
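
The substring collision described in this commit message can be reproduced in a few lines. This is a sketch under stated assumptions: the two helper functions only imitate the case-sensitive substring semantics described above, and the guard values `CollectionAssert.` and `StringAssert.` are illustrative guesses, since the actual values added to eval.yaml are not shown on this page.

```python
def file_contains(source: str, value: str) -> bool:
    """Assumed semantics: case-sensitive substring match."""
    return value in source


def file_not_contains(source: str, value: str) -> bool:
    """Assumed semantics: passes only when value is absent."""
    return value not in source


def passes_guarded_check(source: str) -> bool:
    """Require the modern API and reject the legacy helper classes."""
    return (
        file_contains(source, "Assert.Contains")
        and file_not_contains(source, "CollectionAssert.")
        and file_not_contains(source, "StringAssert.")
    )


# Illustrative one-line test bodies (hypothetical variable names).
LEGACY = "CollectionAssert.Contains(services, logger);"
MODERN = "Assert.Contains(logger, services);"

# "Assert.Contains" is a suffix of "CollectionAssert.Contains", so the
# unguarded check accepts the legacy call as a false positive.
unguarded_accepts_legacy = file_contains(LEGACY, "Assert.Contains")
```

With the guards in place, `passes_guarded_check(LEGACY)` fails while `passes_guarded_check(MODERN)` still passes, which is the behavior the follow-up commit aims for.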
@Evangelink

Member Author

/evaluate

@github-actions

Contributor

❌ Evaluation failed. View workflow run

@github-actions (bot) added the waiting-on-author label (PR state label) on Jun 25, 2026
@github-actions

Contributor

👋 @Evangelink — this PR has 1 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

@github-actions (bot) added the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) and removed the waiting-on-author label (PR state label) on Jun 25, 2026
@Evangelink enabled auto-merge (squash) on June 25, 2026 at 11:24
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| writing-mstest-tests | Write unit tests for a service class | 2.7/5 → 3.0/5 🟢 | ✅ writing-mstest-tests; tools: skill | 🟡 0.41 | [1] |
| writing-mstest-tests | Write data-driven tests for a calculator | 1.7/5 → 1.3/5 🔴 | ✅ writing-mstest-tests; tools: skill, glob / ✅ writing-mstest-tests; tools: skill, edit | 🟡 0.41 | [2] |
| writing-mstest-tests | Write async tests with cancellation | 3.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill, report_intent / ⚠️ NOT ACTIVATED | 🟡 0.41 | [3] |
| writing-mstest-tests | Fix swapped Assert.AreEqual arguments | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | 🟡 0.41 | [4] |
| writing-mstest-tests | Modernize legacy test patterns | 3.7/5 → 2.0/5 🔴 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.41 | [5] |
| writing-mstest-tests | Replace ExpectedException with Assert.Throws | 2.3/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.41 | [6] |
| writing-mstest-tests | Use proper collection assertions | 2.3/5 → 1.3/5 🔴 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.41 | [7] |
| writing-mstest-tests | Use proper type assertions instead of casts | 1.0/5 → 1.7/5 🟢 | ⚠️ NOT ACTIVATED | 🟡 0.41 | [8] |
| writing-mstest-tests | Set up test lifecycle correctly | 2.3/5 → 4.7/5 🟢 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.41 | [9] |
| writing-mstest-tests | Use DynamicData with ValueTuples over object arrays | 3.0/5 → 3.0/5 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.41 | [10] |
| writing-mstest-tests | Use string assertions for format validation | 4.0/5 → 3.7/5 ⏰ 🔴 | ✅ writing-mstest-tests; tools: skill, create / ⚠️ NOT ACTIVATED | 🟡 0.41 | [11] |
| writing-mstest-tests | Use comparison assertions for boundary testing | 2.0/5 → 4.7/5 🟢 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.41 | [12] |
| writing-mstest-tests | Write tests with collection, null, and reference assertions | 3.0/5 → 4.0/5 🟢 | ✅ writing-mstest-tests; tools: glob, skill / ⚠️ NOT ACTIVATED | 🟡 0.41 | [13] |
| writing-mstest-tests | Configure conditional execution, retry, and cleanup | 2.3/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.41 | [14] |
| writing-mstest-tests | Configure test parallelization and MSTest.Sdk project | 3.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill | 🟡 0.41 | [15] |

[1] ⚠️ High run-to-run variance (CV=376%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -44.4% due to: judgment, quality
[2] ⚠️ High run-to-run variance (CV=670%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=859%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=126%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -17.1% due to: judgment, quality
[5] ⚠️ High run-to-run variance (CV=275%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=110%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -2.4% due to: tokens (12605 → 17233), time (5.4s → 6.7s)
[7] ⚠️ High run-to-run variance (CV=101%) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=90%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=145%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -10.6% due to: judgment, quality
[10] ⚠️ High run-to-run variance (CV=82%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=193%) — consider re-running with --runs 5
[12] (Plugin) Quality improved but weighted score is -22.3% due to: judgment, quality, tokens (14074 → 18034)
[13] ⚠️ High run-to-run variance (CV=164%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.0% due to: tokens (218138 → 293429)
[14] ⚠️ High run-to-run variance (CV=719%) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=193%) — consider re-running with --runs 5
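
CV values of several hundred percent, as in the footnotes above, can look surprising. As a rough illustration, assuming CV here means the conventional coefficient of variation (sample standard deviation divided by the absolute mean), such values appear whenever per-run scores scatter around a mean close to zero. The validator's actual formula and the quantity it measures are not shown in this report, and the sample numbers below are invented for illustration.

```python
import math
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation as a percentage of the absolute mean.

    Assumed definition; the skill validator may compute CV differently.
    """
    mean = statistics.fmean(samples)
    if mean == 0:
        # The ratio is undefined for a zero mean; report it as unbounded.
        return math.inf
    return statistics.stdev(samples) / abs(mean) * 100


# Invented per-run scores: stable runs versus runs that flip sign.
STABLE_RUNS = [1.0, 1.0, 1.0]
NOISY_RUNS = [0.5, -0.4, 0.2]

stable_cv = coefficient_of_variation(STABLE_RUNS)
noisy_cv = coefficient_of_variation(NOISY_RUNS)
```

Under this definition the stable runs give a CV of zero, while the noisy runs land well above 100% because their mean is small relative to their spread. That is consistent with the report's advice to re-run with --runs 5 before reading much into a single comparison.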

⏰ timeout: run(s) hit the 180s scenario timeout limit; scoring may be affected because model execution was aborted before it could produce its full output (increase the limit via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 828 in dotnet/skills, download eval artifacts with gh run download 28166183426 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/81c1b2094c13d81619588b3ca975876826755f1e/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@Evangelink Evangelink merged commit 26f01b5 into main Jun 25, 2026
34 of 36 checks passed
@Evangelink Evangelink deleted the evangelink-eval-coverage-writing-mstest-tests branch June 25, 2026 12:00