skill-validator: restore 15K aggregate cap as the real Copilot CLI skill-menu budget #803

Merged
Evangelink merged 12 commits into dotnet:main from Evangelink:restore-15k-skill-budget
Jun 25, 2026
@Evangelink

Member

Why

The skill-validator's per-plugin aggregate description cap (SkillProfiler.MaxAggregateDescriptionLength) had been raised 15,000 → 20,000 → 22,000, justified by a code comment asserting that 15K was "a local repo policy, NOT a documented Copilot/agentskills constraint."

That assertion is wrong. The GitHub Copilot CLI renders the model-facing <available_skills> menu under a hard 15,000-character budget (the agent SDK's SKILL_CHAR_BUDGET, default 15e3, confirmed in CLI 1.0.36 and 1.0.61). Skills are listed alphabetically by name and emitted with their full <description> only until the budget is exhausted; every skill past the cut-off collapses to a bare name with no description and can no longer be reliably model-activated.

Raising the validator cap didn't add headroom; it masked silent menu truncation. This is the root cause behind the dotnet-test plugin-arm activation failures (e.g. run-tests, test-*): they sit alphabetically late, fell into the name-only overflow, and never activated in plugin eval runs even though they activate fine in isolation. Description tuning can't fix that, because the description is never shown.
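
To make the truncation behaviour concrete, here is a minimal Python sketch of the menu rendering described above. It is an illustration only, not the CLI's code: the `Skill` type, the `render_menu` function, and the cost accounting (name length plus description length, ignoring markup) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SKILL_CHAR_BUDGET = 15_000  # default value cited in this PR description


@dataclass
class Skill:
    name: str
    description: str


def render_menu(
    skills: List[Skill], budget: int = SKILL_CHAR_BUDGET
) -> List[Tuple[str, Optional[str]]]:
    """Return (name, description-or-None) pairs in alphabetical order.

    Assumed accounting: a full entry costs len(name) + len(description).
    Once one entry does not fit, it and every later entry are name-only.
    """
    used = 0
    overflowed = False
    menu = []
    for skill in sorted(skills, key=lambda s: s.name):
        cost = len(skill.name) + len(skill.description)
        if not overflowed and used + cost <= budget:
            used += cost
            menu.append((skill.name, skill.description))
        else:
            overflowed = True  # the cut-off has been reached
            menu.append((skill.name, None))
    return menu


if __name__ == "__main__":
    demo = [
        Skill("run-tests", "r" * 60),
        Skill("assertion-quality", "a" * 60),
        Skill("crap-score", "c" * 60),
    ]
    # With a tiny budget, the alphabetically last skill loses its description.
    for name, description in render_menu(demo, budget=150):
        print(name, "(name only)" if description is None else "(full entry)")
```

Under these assumptions, a skill such as run-tests ends up name-only purely because of where it sorts, no matter how its description is worded.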

What

  • SkillProfiler.MaxAggregateDescriptionLength: 22,000 → 15,000, with the comment rewritten to document the real Copilot CLI budget (and correct the prior claim).
  • Aggregate now excludes disable-model-invocation: true skills. The CLI drops those from the menu entirely, so they don't consume the budget. This makes the cap satisfiable by hiding reference / agent-orchestrated primitives rather than only by trimming descriptions.
  • InvestigatingResults.md: documents plugin-arm-only non-activation caused by skill-menu budget overflow, and how to fix it.
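
The aggregate rule in the list above can be sketched roughly as follows. This is a hedged Python illustration, not the validator's C# code: the dictionary shape, the function names `visible_aggregate` and `check_plugin`, and the treatment of a total exactly equal to the cap are all assumptions.

```python
MAX_AGGREGATE_DESCRIPTION_LENGTH = 15_000  # cap restored by this PR


def visible_aggregate(skills):
    """Sum description lengths, skipping skills hidden from the menu.

    Each skill is a dict with a 'description' string and an optional
    'disable-model-invocation' boolean (hypothetical shape for this sketch).
    """
    return sum(
        len(skill["description"])
        for skill in skills
        if not skill.get("disable-model-invocation", False)
    )


def check_plugin(skills, cap=MAX_AGGREGATE_DESCRIPTION_LENGTH):
    """Return (passes, total). Treating total == cap as a pass is an assumption."""
    total = visible_aggregate(skills)
    return total <= cap, total


if __name__ == "__main__":
    plugin = [
        {"description": "a" * 100},
        {"description": "b" * 50, "disable-model-invocation": True},
    ]
    print(check_plugin(plugin, cap=120))
```

Flagging a skill as disable-model-invocation: true lowers the total by exactly that skill's description length, which matches the verification note further down.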

⚠️ Sequencing

dotnet-test currently aggregates ~20.7K chars (the only plugin over 15K), so skill-check will fail for it until it is slimmed below the cap — via disable-model-invocation on reference/primitive skills (see #800) plus description trims. This PR should merge once dotnet-test is ≤ 15K visible. All other plugins are already under the cap (next largest: dotnet-msbuild at ~14.5K).

Verification

  • skill-validator builds clean (0 warnings).
  • Confirmed the cap is enforced and that disable-model-invocation skills are excluded from the aggregate (flagging two reference skills dropped the reported total by exactly their description lengths).

…opilot CLI skill-menu budget

The per-plugin aggregate description cap had been raised 15,000 -> 20,000
-> 22,000 under the belief that 15K was 'a local repo policy, NOT a
documented Copilot constraint'. That belief was wrong: the GitHub Copilot
CLI renders the model-facing <available_skills> menu under a hard 15,000-
char budget (the agent SDK's SKILL_CHAR_BUDGET, default 15e3, confirmed in
CLI 1.0.36 and 1.0.61). Skills are listed alphabetically and emitted with
their full <description> only until the budget is exhausted; every skill
past the cut-off collapses to a bare name with no description and can no
longer be reliably model-activated. Raising the validator cap merely
masked this silent menu truncation — e.g. dotnet-test's run-tests and
test-* skills stopped activating in plugin eval runs because they fell
into the name-only overflow.

Changes:
- SkillProfiler.MaxAggregateDescriptionLength: 22,000 -> 15,000, with the
  comment rewritten to document the real Copilot CLI budget (and correct
  the prior 'not a documented constraint' claim).
- CheckCommand aggregate now excludes skills marked
  'disable-model-invocation: true' — the CLI drops those from the menu, so
  they do not consume the budget. This makes the cap satisfiable by hiding
  reference / agent-orchestrated primitives rather than only by trimming.
- InvestigatingResults.md: document plugin-arm-only non-activation caused
  by skill-menu budget overflow, and how to fix it.

Note: dotnet-test currently exceeds 15K and must be slimmed below it
(via disable-model-invocation on reference/primitive skills plus
description trims) before this cap can go green repo-wide.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 22, 2026 15:37
@github-actions

Contributor

Note

This PR is from a fork and modifies infrastructure files (eng/ or .github/).

Changes to infrastructure typically need to be submitted from a branch in dotnet/skills (not a fork) so that CI workflows run with the correct permissions and secrets.

Please consider recreating this PR from an upstream branch. If you don't have push access to dotnet/skills, ask a maintainer to push your branch for you.

Copilot AI left a comment

Contributor

Pull request overview

Aligns skill-validator’s per-plugin aggregate description cap with the Copilot CLI’s effective 15,000-character skill-menu budget, and updates validation/docs to prevent silent <available_skills> truncation from masking plugin-arm non-activation.

Changes:

  • Restores SkillProfiler.MaxAggregateDescriptionLength to 15,000 and rewrites the rationale/commentary to reflect the Copilot CLI menu budget behavior.
  • Updates check to exclude skills with disable-model-invocation: true from the aggregate description total (matching CLI menu behavior).
  • Documents “plugin-arm-only non-activation due to menu overflow” troubleshooting steps in InvestigatingResults.md.
Summary per file:

| File | Description |
| --- | --- |
| eng/skill-validator/src/docs/InvestigatingResults.md | Adds guidance for diagnosing plugin-only non-activation caused by Copilot CLI skill-menu budget overflow and suggests mitigations. |
| eng/skill-validator/src/Check/SkillProfiler.cs | Lowers the aggregate description cap to 15,000 and documents it as a Copilot CLI budget constraint. |
| eng/skill-validator/src/Check/CheckCommand.cs | Excludes disable-model-invocation: true skills from the aggregate description calculation during plugin checks. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 2

Comment thread eng/skill-validator/src/Check/CheckCommand.cs Outdated
Comment thread eng/skill-validator/src/Check/SkillProfiler.cs Outdated
@AbhitejJohn

Collaborator

@Evangelink : Can we re-create this from a branch in the repo please?

…ion check

Address review: replace Regex.IsMatch(pattern-string) with a
[GeneratedRegex] partial method (AOT-friendly, no per-call cache lookup),
matching FrontmatterParser's style. Runs once per skill during checks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
github-actions bot added the waiting-on-author label (PR state label) on Jun 22, 2026
@github-actions

Contributor

👋 @Evangelink — this PR has 2 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

github-actions bot added the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) and removed the waiting-on-author label (PR state label) on Jun 22, 2026
github-actions Bot added a commit that referenced this pull request Jun 22, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| system-text-json-net11 | Serialize JSON in .NET 11 with PascalCase property names | 4.0/5 → 5.0/5 🟢 | ✅ system-text-json-net11; tools: skill | ✅ 0.06 | [1] |
| system-text-json-net11 | Type-safe JsonTypeInfo access without exceptions in .NET 11 | 3.0/5 → 5.0/5 🟢 | ✅ system-text-json-net11; tools: skill, edit, view | ✅ 0.06 | |
| system-text-json-net11 | Non-activation: camelCase JSON serialization on .NET 8 | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.06 | [2] |
| optimizing-ef-core-queries | Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | 🟡 0.22 | |

[1] (Isolated) Quality improved but weighted score is -2.1% due to: tokens (65782 → 85343), tool calls (5 → 6)
[2] (Isolated) Quality unchanged but weighted score is -16.2% due to: judgment, tokens (51994 → 80828), tool calls (4 → 6)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 803 in dotnet/skills, download eval artifacts with gh run download 27975446091 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/78e9dda103845334f0ab7d390467ad30e744f360/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

github-actions bot added the ready-to-merge label (PR state label) and removed the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) on Jun 22, 2026
@github-actions

Contributor

✅ Approved by @AbhitejJohn. cc @dotnet/skills-merge-approvers — ready to merge.

Copilot AI review requested due to automatic review settings June 23, 2026 06:59

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 1

Comment thread eng/skill-validator/src/Check/CheckCommand.cs Outdated
github-actions bot added the waiting-on-author label (PR state label) and removed the ready-to-merge label (PR state label) on Jun 23, 2026
…ck-scalar false positives

The regex-based check matched any line in the frontmatter, so a block-scalar description that merely mentioned 'disable-model-invocation: true' on its own line was wrongly treated as disabling model invocation. Parse the frontmatter with the existing YAML deserializer (which correctly handles block scalars) by adding a DisableModelInvocation field to SkillFrontmatter, and drop the regex entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
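
The false positive described in this commit message can be reproduced with a short Python sketch. Everything here is illustrative: the regex is a guess at a line-based check of this kind (the actual pattern is not shown in the conversation), and `top_level_says_disabled` is a deliberately simplified stand-in, not a YAML parser. The commit itself switches to the existing YAML deserializer.

```python
import re

# Hypothetical line-based pattern: matches the flag on any line, indented or not.
LINE_PATTERN = re.compile(r"^\s*disable-model-invocation:\s*true\s*$", re.MULTILINE)

# Frontmatter whose block-scalar description merely mentions the flag.
FRONTMATTER = """\
name: example-skill
description: |
  Explains the flag. To hide a skill, add:
  disable-model-invocation: true
"""


def regex_says_disabled(frontmatter: str) -> bool:
    """Line-based check: cannot tell a real key from text inside a block scalar."""
    return LINE_PATTERN.search(frontmatter) is not None


def top_level_says_disabled(frontmatter: str) -> bool:
    """Simplified structural check: only unindented 'key: value' lines count.

    Indented lines (such as block-scalar content) are skipped. A real YAML
    parser handles many more cases than this sketch does.
    """
    for line in frontmatter.splitlines():
        if line[:1].isspace() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key.strip() == "disable-model-invocation":
            return value.strip().lower() == "true"
    return False


if __name__ == "__main__":
    print("regex check:     ", regex_says_disabled(FRONTMATTER))
    print("structural check:", top_level_says_disabled(FRONTMATTER))
```

The line-based check reports the skill as disabled even though the flag only appears inside the description text, while the structural check does not.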
@Evangelink

Member Author

/evaluate

@Evangelink Evangelink enabled auto-merge (squash) June 23, 2026 10:11
github-actions Bot added a commit that referenced this pull request Jun 23, 2026

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 2

Comment thread eng/skill-validator/src/Check/SkillProfiler.cs Outdated
Comment thread eng/skill-validator/tests/Check/CheckCommandTests.cs
…killMenuLength

The constant now caps the fully-rendered <skill>-menu size (name + description + location + markup) rather than the sum of raw description lengths, so the old name was misleading and risked someone reverting to summing Description.Length. Rename it (and update related comments/docs) to reflect what is actually enforced. Also fix the test-name grammar DescriptionsSummingToLimit_Fail -> _Fails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
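
The difference between the two measurements named in this commit message can be shown with a small Python sketch. The markup in `render_entry` is hypothetical: the conversation says only that the rendered size includes name, description, location, and markup, and does not show the exact tags, so the element names below are assumptions.

```python
def raw_description_total(skills):
    """Old-style measurement: sum of raw description lengths only."""
    return sum(len(skill["description"]) for skill in skills)


def render_entry(skill):
    """Render one menu entry. The tag layout is an assumption for illustration."""
    return (
        "<skill>"
        f"<name>{skill['name']}</name>"
        f"<description>{skill['description']}</description>"
        f"<location>{skill['location']}</location>"
        "</skill>"
    )


def rendered_menu_length(skills):
    """New-style measurement: total size of the fully rendered entries."""
    return sum(len(render_entry(skill)) for skill in skills)


if __name__ == "__main__":
    skills = [
        {"name": "run-tests", "description": "d" * 200, "location": "skills/run-tests"},
        {"name": "crap-score", "description": "d" * 300, "location": "skills/crap-score"},
    ]
    print("raw descriptions:", raw_description_total(skills))
    print("rendered menu:   ", rendered_menu_length(skills))
```

Because every entry carries overhead beyond its description, a set of descriptions that sums to the limit already exceeds it once rendered, which is consistent with the renamed test DescriptionsSummingToLimit_Fails.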
github-actions bot added the waiting-on-author label (PR state label) and removed the waiting-on-review label (PR state label) on Jun 24, 2026
github-actions bot added the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) and removed the waiting-on-author label (PR state label) on Jun 24, 2026
github-actions Bot added a commit that referenced this pull request Jun 24, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| writing-mstest-tests | Write unit tests for a service class | 5.0/5 → 5.0/5 | ✅ writing-mstest-tests; tools: skill, glob | 🟡 0.31 | |
| writing-mstest-tests | Write data-driven tests for a calculator | 3.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill, report_intent, view / ⚠️ NOT ACTIVATED | 🟡 0.31 | |
| writing-mstest-tests | Write async tests with cancellation | 3.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | |
| writing-mstest-tests | Fix swapped Assert.AreEqual arguments | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | 🟡 0.31 | [1] |
| writing-mstest-tests | Modernize legacy test patterns | 5.0/5 → 5.0/5 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | [2] |
| writing-mstest-tests | Replace ExpectedException with Assert.Throws | 3.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | |
| writing-mstest-tests | Use proper collection assertions | 3.0/5 → 2.0/5 🔴 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | |
| writing-mstest-tests | Use proper type assertions instead of casts | 4.0/5 → 4.0/5 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | |
| writing-mstest-tests | Set up test lifecycle correctly | 2.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | |
| writing-mstest-tests | Use DynamicData with ValueTuples over object arrays | 3.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | [3] |
| writing-mstest-tests | Use string assertions for format validation | 3.0/5 → 4.0/5 🟢 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | |
| writing-mstest-tests | Use comparison assertions for boundary testing | 2.0/5 → 4.0/5 🟢 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | [4] |
| writing-mstest-tests | Write tests with collection, null, and reference assertions | 4.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: glob, skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | |
| writing-mstest-tests | Configure conditional execution, retry, and cleanup | 3.0/5 → 4.0/5 🟢 | ✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED | 🟡 0.31 | |
| writing-mstest-tests | Configure test parallelization and MSTest.Sdk project | 3.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill | 🟡 0.31 | |
| test-tagging | Tag an untagged MSTest test suite | 4.0/5 → 4.0/5 | ✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob | ✅ 0.20 | |
| test-tagging | Tag an untagged xUnit test suite | 3.0/5 → 5.0/5 🟢 | ✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob | ✅ 0.20 | |
| test-tagging | Tag an untagged NUnit test suite | 4.0/5 → 5.0/5 🟢 | ✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob | ✅ 0.20 | |
| test-tagging | Audit test distribution without modifying files | 5.0/5 → 4.0/5 🔴 | ⚠️ NOT ACTIVATED / ✅ test-tagging; tools: skill | ✅ 0.20 | |
| test-tagging | Decline request to write new tests | 4.0/5 → 4.0/5 | ℹ️ not activated (expected) | ✅ 0.20 | [5] |
| test-tagging | Tag a partially-tagged MSTest suite without duplicating existing traits | 4.0/5 → 4.0/5 | ✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob | ✅ 0.20 | [6] |
| test-tagging | Accurately classify NUnit tests with misleading method names | 5.0/5 → 5.0/5 | ✅ test-tagging; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.20 | [7] |
| test-tagging | Tag MSTest tests and verify the project still builds | 4.0/5 → 4.0/5 | ✅ test-tagging; tools: skill | ✅ 0.20 | |
| mtp-hot-reload | Suggest hot reload for failing test in MTP project (SDK 9) | 1.0/5 → 5.0/5 🟢 | ✅ mtp-hot-reload; tools: skill | ✅ 0.11 | |
| mtp-hot-reload | Suggest hot reload for failing test in MTP project (SDK 10) | 1.0/5 → 3.0/5 🟢 | ✅ mtp-hot-reload; tools: skill | ✅ 0.11 | |
| mtp-hot-reload | Enable hot reload when package already installed | 2.0/5 → 5.0/5 🟢 | ✅ mtp-hot-reload; tools: skill | ✅ 0.11 | |
| mtp-hot-reload | Suggest launchSettings.json configuration for hot reload | 1.0/5 → 4.0/5 🟢 | ✅ mtp-hot-reload; tools: skill, bash, create | ✅ 0.11 | |
| mtp-hot-reload | Use dotnet run not dotnet test for hot reload | 3.0/5 → 4.0/5 🟢 | ✅ mtp-hot-reload; tools: skill | ✅ 0.11 | |
| mtp-hot-reload | Negative: VSTest project cannot use MTP hot reload | 2.0/5 → 2.0/5 | ✅ mtp-hot-reload; tools: skill, edit, bash, create / ✅ mtp-hot-reload; tools: skill, bash, create | ✅ 0.11 | |
| mtp-hot-reload | Run specific failing test with hot reload filter | 1.0/5 → 5.0/5 🟢 | ✅ mtp-hot-reload; tools: skill | ✅ 0.11 | |
| assertion-quality | Identify low assertion diversity in equality-dominated test suite | 4.0/5 → 5.0/5 🟢 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill | 🟡 0.23 | |
| assertion-quality | Flag assertion-free tests and trivial-only assertions | 4.0/5 → 4.0/5 | ✅ assertion-quality; tools: skill, glob | 🟡 0.23 | [8] |
| assertion-quality | Recognize well-diversified assertion usage | 5.0/5 → 4.0/5 🔴 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill | 🟡 0.23 | [9] |
| assertion-quality | Identify self-referential assertions in identity and round-trip tests | 5.0/5 → 3.0/5 🔴 | ✅ assertion-quality; tools: skill, glob | 🟡 0.23 | |
| assertion-quality | Decline request to write new tests from scratch | 5.0/5 → 4.0/5 🔴 | ℹ️ not activated (expected) | 🟡 0.23 | |
| assertion-quality | Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite | 4.0/5 → 5.0/5 🟢 | ✅ assertion-quality; tools: skill, glob | 🟡 0.23 | |
| test-smell-detection | Detect multiple test smells in order processing test suite | 4.0/5 → 5.0/5 🟢 | ✅ test-smell-detection; tools: skill | 🟡 0.43 | |
| test-smell-detection | Recognize well-written tests with no significant smells | 4.0/5 → 5.0/5 🟢 | ✅ test-smell-detection; tools: skill / ✅ test-smell-detection; tools: skill, glob | 🟡 0.43 | |
| test-smell-detection | Recognize integration tests and avoid false positives for external resources | 5.0/5 → 5.0/5 | ✅ test-smell-detection; tools: skill | 🟡 0.43 | [10] |
| test-smell-detection | Decline request to write new tests from scratch | 5.0/5 → 4.0/5 🔴 | ℹ️ not activated (expected) | 🟡 0.43 | |
| test-smell-detection | Polyglot: detect canonical test smells in a JUnit/Java Catalog suite | 5.0/5 → 5.0/5 | ✅ test-smell-detection; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.43 | [11] |
| test-gap-analysis | Find boundary mutation gaps in tiered discount and shipping logic | 5.0/5 → 5.0/5 | ✅ test-gap-analysis; tools: skill | ✅ 0.06 | [12] |
| test-gap-analysis | Find logic and null-check mutation gaps in access control code | 4.0/5 → 5.0/5 🟢 | ✅ test-gap-analysis; tools: skill | ✅ 0.06 | |
| test-gap-analysis | Acknowledge well-tested code with few surviving mutations | 4.0/5 → 4.0/5 | ✅ test-gap-analysis; tools: skill | ✅ 0.06 | [13] |
| test-gap-analysis | Decline request to write new tests from scratch | 4.0/5 → 2.0/5 ⏰ 🔴 | ℹ️ not activated (expected) | ✅ 0.06 | |
| migrate-static-to-wrapper | Migrate DateTime.UtcNow to TimeProvider in a service class | 5.0/5 → 5.0/5 | ✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.08 | [14] |
| migrate-static-to-wrapper | Migrate only in scoped files, leaving others untouched | 5.0/5 → 5.0/5 | ✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.08 | [15] |
| migrate-static-to-wrapper | Decline migration when wrapper does not exist yet | 2.0/5 → 5.0/5 🟢 | ✅ migrate-static-to-wrapper; tools: skill | ✅ 0.08 | |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill | ✅ 0.20 | [16] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; tools: skill, glob | ✅ 0.20 | [17] |
| test-anti-patterns | Detect duplicated tests and magic values | 4.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill | ✅ 0.20 | [18] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 4.0/5 → 4.0/5 | ✅ test-anti-patterns; tools: skill | ✅ 0.20 | [19] |
| test-anti-patterns | Detect coverage-touching pattern across a service facade | 5.0/5 → 4.0/5 🔴 | ✅ test-anti-patterns; tools: skill | ✅ 0.20 | |
| test-anti-patterns | Detect self-referential assertions in round-trip and identity tests | 4.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.20 | [20] |
| test-anti-patterns | Polyglot: detect anti-patterns in a Python pytest suite | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.20 | |
| crap-score | Calculate CRAP score for a single method with partial coverage | 3.0/5 → 4.0/5 🟢 | ✅ crap-score; tools: skill, glob / ✅ crap-score; tools: skill | ✅ 0.07 | |
| crap-score | Identify riskiest methods across a file | 5.0/5 → 5.0/5 | ✅ crap-score; tools: skill | ✅ 0.07 | |
| crap-score | Generate coverage then compute CRAP score | 4.0/5 → 4.0/5 | ✅ crap-score; tools: skill | ✅ 0.07 | |
| detect-static-dependencies | Identify static dependencies in a multi-class project | 4.0/5 → 5.0/5 🟢 | ✅ detect-static-dependencies; tools: skill | ✅ 0.08 | |
| detect-static-dependencies | Detect time-related statics and recommend TimeProvider | 4.0/5 → 5.0/5 🟢 | ✅ detect-static-dependencies; tools: skill, glob / ✅ detect-static-dependencies; tools: skill | ✅ 0.08 | [21] |
| detect-static-dependencies | Decline scan for non-C# project | 5.0/5 → 5.0/5 | ✅ detect-static-dependencies; tools: skill / ℹ️ not activated (expected) | ✅ 0.08 | [22] |
| detect-static-dependencies | Verify structured report includes file count, categories, and top patterns | 5.0/5 → 5.0/5 | ✅ detect-static-dependencies; tools: skill, glob / ✅ detect-static-dependencies; tools: skill | ✅ 0.08 | [23] |
| detect-static-dependencies | Exclude obj and bin directories from the scan | 5.0/5 → 5.0/5 | ✅ detect-static-dependencies; tools: skill, glob / ✅ detect-static-dependencies; tools: skill | ✅ 0.08 | [24] |
| detect-static-dependencies | Detect statics inside lambda expressions and LINQ queries | 4.0/5 → 4.0/5 | ✅ detect-static-dependencies; tools: skill, glob | ✅ 0.08 | |
| generate-testability-wrappers | Generate TimeProvider adoption for DateTime.UtcNow | 4.0/5 → 5.0/5 🟢 | ✅ generate-testability-wrappers; tools: skill, report_intent / ✅ generate-testability-wrappers; tools: skill, report_intent, glob, view, edit, bash | ✅ 0.10 | [25] |
| generate-testability-wrappers | Generate custom Environment wrapper | 3.0/5 → 5.0/5 🟢 | ✅ generate-testability-wrappers; tools: skill | ✅ 0.10 | |
| generate-testability-wrappers | Recommend System.IO.Abstractions for file system calls | 4.0/5 → 5.0/5 🟢 | ✅ generate-testability-wrappers; tools: skill / ✅ generate-testability-wrappers; tools: report_intent, skill | ✅ 0.10 | |
| generate-testability-wrappers | Decline wrapper generation for already-abstracted code | 2.0/5 → 5.0/5 🟢 | ✅ generate-testability-wrappers; tools: skill / ℹ️ not activated (expected) | ✅ 0.10 | |
| run-tests | Run tests in a VSTest MSTest project | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED | ✅ 0.14 | [26] |
| run-tests | Run tests with trx reporting on MTP project (SDK 9) | 2.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: report_intent, skill, view | ✅ 0.14 | |
| run-tests | Run tests with blame-hang on MTP project (SDK 10) | 2.0/5 → 4.0/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Run tests on a specific TFM with TRX in a multi-TFM MTP project (SDK 9) | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.14 | [27] |
| run-tests | Filter MSTest tests by category on VSTest | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED / ✅ run-tests; tools: skill | ✅ 0.14 | [28] |
| run-tests | Filter NUnit tests by class name on VSTest | 3.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Filter xUnit v3 tests by class on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view | ✅ 0.14 | |
| run-tests | Filter xUnit v3 tests by trait on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Filter xUnit v3 tests by class pattern and trait using query filter language | 1.0/5 → 1.0/5 ⏰ | ✅ run-tests; tools: report_intent, skill, view, bash / ⚠️ NOT ACTIVATED | ✅ 0.14 | [29] |
| run-tests | Filter TUnit tests by class using treenode-filter | 1.0/5 → 4.0/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Combine multiple filter criteria on VSTest MSTest | 3.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | MTP project on SDK 9 must use -- separator for args | 1.0/5 → 3.0/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | MTP project on SDK 10 passes args directly | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Detect test platform from Directory.Build.props | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Negative test: do not use MTP syntax for a VSTest project | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| dotnet-test-frameworks | Cross-framework assertion equivalence mapping | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | [30] |
| dotnet-test-frameworks | Identify TUnit framework and its unique attributes | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | [31] |
| dotnet-test-frameworks | Replace try-catch with framework-native exception assertions | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | [32] |
| dotnet-test-frameworks | Skip annotations across all four frameworks | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | [33] |
| dotnet-test-frameworks | Convert NUnit lifecycle methods to xUnit equivalents | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | [34] |
| dotnet-test-frameworks | Identify integration tests by markers and code patterns | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | [35] |
| dotnet-test-frameworks | Convert cross-framework assertions to TUnit syntax | 2.0/5 → 1.0/5 🔴 | ℹ️ not activated (expected) | ✅ 0.12 | [36] |
| dotnet-test-frameworks | Diagnose silently-passing TUnit test with missing await | 4.0/5 → 4.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | [37] |
| dotnet-test-frameworks | Refactor TUnit try/catch to native exception assertion | 2.0/5 → 2.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | [38] |
| dotnet-test-frameworks | TUnit lifecycle hooks at test, class, assembly, and session scope | 4.0/5 → 4.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | [39] |
| dotnet-test-frameworks | TUnit skip mechanisms — attribute, assembly-wide, and dynamic | 3.0/5 → 3.0/5 | ℹ️ not activated (expected) | ✅ 0.12 | |
| grade-tests | Grade a curated list of MSTest tests with mixed quality | 4.0/5 → 5.0/5 🟢 | ✅ grade-tests; tools: skill, glob | 🟡 0.21 | |
| grade-tests | Grade pytest test methods using the same rubric | 5.0/5 → 5.0/5 | ✅ grade-tests; tools: skill / ✅ grade-tests; tools: skill, bash, glob | 🟡 0.21 | |
| grade-tests | Decline to grade the entire workspace when no test list is provided | 1.0/5 → 1.0/5 | ✅ grade-tests; tools: skill, glob / ✅ grade-tests; tools: skill | 🟡 0.21 | [40] |
| coverage-analysis | Project-wide coverage analysis with existing Cobertura data | 3.0/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, create | ✅ 0.10 | |
| coverage-analysis | Run coverage from scratch without existing data | 4.0/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, glob, read_bash, stop_bash, create / ✅ coverage-analysis; tools: skill, glob, create | ✅ 0.10 | |
| coverage-analysis | Coverage plateau diagnosis | 4.0/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, bash, create | ✅ 0.10 | |
| code-testing-agent | Generate tests for ContosoUniversity ASP.NET Core MVC app | 3.0/5 → 3.0/5 | ✅ code-testing-agent; tools: skill | 🟡 0.21 | [41] |
| code-testing-agent | Generate pytest tests for the Flask tasks API (Python polyglot) | 5.0/5 → 4.0/5 🔴 | ✅ code-testing-agent; tools: skill | 🟡 0.21 | |
| code-testing-agent | Generate Vitest tests for the shopping-cart library (TypeScript polyglot) | 4.0/5 → 5.0/5 🟢 | ✅ code-testing-agent; tools: skill | 🟡 0.21 | |
| code-testing-agent | Does not revert a gutted-looking workspace (workspace integrity) | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | 🟡 0.21 | [42] |
| optimizing-ef-core-queries | Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete | 5.0/5 → 4.0/5 🔴 | ✅ optimizing-ef-core-queries; tools: report_intent, skill | 🟡 0.27 | |

[1] (Plugin) Quality unchanged but weighted score is -2.9% due to: tokens (12709 → 17382), time (6.3s → 8.9s)
[2] (Plugin) Quality unchanged but weighted score is -6.8% due to: quality, tokens (161501 → 213972), tool calls (11 → 15)
[3] (Plugin) Quality unchanged but weighted score is -2.0% due to: tokens (12816 → 17376)
[4] (Plugin) Quality unchanged but weighted score is -21.0% due to: judgment, quality, tokens (13788 → 17868)
[5] (Isolated) Quality unchanged but weighted score is -0.2% due to: efficiency metrics
[6] (Plugin) Quality unchanged but weighted score is -5.5% due to: tokens (139842 → 425155), time (60.2s → 111.5s), tool calls (21 → 30)
[7] (Isolated) Quality unchanged but weighted score is -3.1% due to: tokens (114966 → 204093), time (55.3s → 76.0s), tool calls (19 → 23)
[8] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (26571 → 85903), tool calls (2 → 5), time (20.8s → 41.9s)
[9] (Plugin) Quality unchanged but weighted score is -1.0% due to: tokens (40609 → 85681), time (21.6s → 37.4s), tool calls (4 → 5)
[10] (Isolated) Quality unchanged but weighted score is -9.5% due to: tokens (40900 → 100100), tool calls (4 → 8), time (32.8s → 59.6s)
[11] (Plugin) Quality unchanged but weighted score is -1.8% due to: tokens (28144 → 37272)
[12] (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (41240 → 87606), time (23.9s → 49.9s), tool calls (5 → 7)
[13] (Plugin) Quality dropped but weighted score is +7.9% due to: completion (✗ → ✓), tokens (233474 → 110596), tool calls (21 → 7), time (102.2s → 76.8s)
[14] (Isolated) Quality unchanged but weighted score is -20.3% due to: judgment, tokens (53521 → 90318), quality, time (20.9s → 32.4s)
[15] (Isolated) Quality unchanged but weighted score is -19.0% due to: judgment, quality, tokens (68476 → 90944), time (26.3s → 33.8s), tool calls (8 → 10)
[16] (Plugin) Quality unchanged but weighted score is -8.5% due to: tokens (27791 → 65695), quality, tool calls (3 → 4)
[17] (Plugin) Quality unchanged but weighted score is -11.5% due to: tokens (40267 → 113966), quality, tool calls (3 → 7), time (28.5s → 43.4s)
[18] (Isolated) Quality improved but weighted score is -13.5% due to: judgment, tokens (40323 → 71088), tool calls (3 → 4)
[19] (Isolated) Quality unchanged but weighted score is -18.0% due to: completion (✓ → ✗), tokens (26398 → 51388), tool calls (2 → 3), time (14.9s → 21.0s)
[20] (Isolated) Quality improved but weighted score is -2.3% due to: tokens (42081 → 73087), tool calls (4 → 5)
[21] (Plugin) Quality unchanged but weighted score is -9.2% due to: tokens (39968 → 99609), tool calls (4 → 10), time (21.4s → 36.1s)
[22] (Isolated) Quality unchanged but weighted score is -6.7% due to: tokens (38629 → 74487), time (16.7s → 26.6s), tool calls (4 → 5)
[23] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (55800 → 100952), time (46.8s → 58.7s)
[24] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (52645 → 91698), time (24.9s → 33.2s)
[25] (Isolated) Quality improved but weighted score is -9.2% due to: tokens (12939 → 44171), tool calls (0 → 2), time (10.8s → 18.3s)
[26] (Plugin) Quality unchanged but weighted score is -1.7% due to: tokens (25319 → 34450)
[27] (Plugin) Quality unchanged but weighted score is -2.4% due to: tokens (25473 → 34665), time (10.1s → 12.5s)
[28] (Plugin) Quality unchanged but weighted score is -7.9% due to: tokens (25442 → 62594), tool calls (2 → 4)
[29] (Plugin) Quality unchanged but weighted score is -1.2% due to: tokens (12693 → 17210)
[30] (Plugin) Quality unchanged but weighted score is -2.3% due to: tokens (12836 → 17444)
[31] (Plugin) Quality unchanged but weighted score is -1.5% due to: tokens (13069 → 17617)
[32] (Plugin) Quality unchanged but weighted score is -2.2% due to: tokens (13107 → 17696)
[33] (Plugin) Quality unchanged but weighted score is -1.1% due to: tokens (12697 → 17169)
[34] (Isolated) Quality unchanged but weighted score is -0.2% due to: efficiency metrics
[35] (Plugin) Quality unchanged but weighted score is -1.5% due to: tokens (13334 → 17918)
[36] (Plugin) Quality unchanged but weighted score is -9.5% due to: tokens (12747 → 34598), tool calls (0 → 1), time (5.8s → 10.3s)
[37] (Isolated) Quality unchanged but weighted score is -0.1% due to: efficiency metrics
[38] (Isolated) Quality unchanged but weighted score is -0.6% due to: time (6.4s → 7.8s)
[39] (Plugin) Quality unchanged but weighted score is -2.6% due to: tokens (13154 → 17964), time (12.4s → 16.2s)
[40] (Isolated) Quality unchanged but weighted score is -2.6% due to: time (35.4s → 57.9s), tokens (82593 → 103252)
[41] (Isolated) Quality unchanged but weighted score is -15.3% due to: judgment, quality, tokens (663213 → 805576)
[42] (Isolated) Quality unchanged but weighted score is -15.7% due to: judgment, quality

timeout — run(s) hit the (120s, 180s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 803 in dotnet/skills, download eval artifacts with gh run download 28088953671 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/4765b90605280fea89478f211b55e516d1412265/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@github-actions (bot) added the ready-to-merge label (PR state label) and removed the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) on Jun 24, 2026
Copilot AI review requested due to automatic review settings June 25, 2026 10:12

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 2

Comment thread eng/skill-validator/src/Check/SkillProfiler.cs Outdated
Comment thread eng/skill-validator/src/Check/CheckCommand.cs Outdated
@github-actions (bot) added the waiting-on-author label (PR state label) and removed the ready-to-merge label (PR state label) on Jun 25, 2026
Evangelink and others added 3 commits June 25, 2026 14:01
…ptions under 15K rendered menu budget

Merge the polyglot find-untested-sources-polyglot skill into find-untested-sources
as a single model-invocable skill documenting both engines (Roslyn for C#/.NET,
tree-sitter for polyglot), restoring discoverability that was lost when both were
hidden via disable-model-invocation to fit the budget.

Trim keyword-stuffed descriptions on coverage-analysis, test-anti-patterns,
test-tagging, assertion-quality, grade-tests, and migrate-static-to-wrapper.

Rendered skill-menu size for dotnet-test drops to 14,722 chars (278 under the
15,000 cap enforced by skill-validator on PR dotnet#803).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…x stale cap comment

Address review feedback:
- Plumb DisableModelInvocation through SkillInfo (populated once in
  SkillDiscovery.DiscoverSkillAt) instead of re-deserializing each skill's
  YAML frontmatter in CheckCommand. Removes the per-skill double parse.
- Correct the SkillProfiler comment that still called the constant an
  'aggregate description size cap'; it enforces a rendered skill-menu budget.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
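
The "parse once, carry the flag" change described in this commit can be illustrated with a minimal sketch. It is written in Python rather than the validator's C#, and every name in it (`SkillInfo`, `parse_frontmatter`, `discover_skill`, `model_invocable`) is a hypothetical stand-in, not the repository's actual API. The frontmatter reader handles only flat `key: value` lines and is not a YAML parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillInfo:
    """Hypothetical record built once at discovery time."""
    name: str
    description: str
    disable_model_invocation: bool = False


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read flat 'key: value' pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def discover_skill(text: str) -> SkillInfo:
    """Parse the frontmatter a single time and keep the flag on the record."""
    fields = parse_frontmatter(text)
    flag = fields.get("disable-model-invocation", "false").lower() == "true"
    return SkillInfo(
        name=fields.get("name", ""),
        description=fields.get("description", ""),
        disable_model_invocation=flag,
    )


def model_invocable(skills: list[SkillInfo]) -> list[SkillInfo]:
    """Later checks filter on the stored flag instead of re-reading files."""
    return [s for s in skills if not s.disable_model_invocation]
```

With this shape, a later check only needs the list of `SkillInfo` records, so the skill file is read and parsed once per skill.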
Copilot AI review requested due to automatic review settings June 25, 2026 12:06
@Evangelink

Member Author

Pushed two commits (72e07c31f, 9ecec6aa4):

1. dotnet-test slimmed under the cap — resolves the ⚠️ Sequencing blocker. dotnet-test now renders at 14,722 chars (278 under the 15,000 budget). Done by:

  • Consolidating find-untested-sources + find-untested-sources-polyglot into a single model-invocable find-untested-sources skill documenting both engines (Roslyn for C#/.NET, tree-sitter for polyglot). This restores discoverability for "which files have no tests?" — both skills were previously hidden via disable-model-invocation purely to fit the budget.
  • Trimming keyword-stuffed descriptions on coverage-analysis, test-anti-patterns, test-tagging, assertion-quality, grade-tests, and migrate-static-to-wrapper (all routing keywords preserved).

2. Addressed the two open review comments — plumbed DisableModelInvocation through SkillInfo (no more per-skill double YAML parse) and corrected the stale aggregate description size cap comment in SkillProfiler.

skill-validator builds clean (0 warnings), all 590 tests pass, and check --plugin dotnet-test now exits 0.
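
As a rough illustration of the budget arithmetic in this comment, here is a minimal Python sketch of a menu-size check. The 15,000-character budget and the exclusion of `disable-model-invocation` skills come from this PR; the `Skill` type, the entry format in `render_entry`, and the inclusive comparison in `fits_budget` are assumptions made for the example, since the exact rendering used by the CLI and the validator is not shown in this thread.

```python
from dataclasses import dataclass

SKILL_MENU_BUDGET = 15_000  # characters, as stated in the PR description


@dataclass(frozen=True)
class Skill:
    """Hypothetical skill record for this sketch."""
    name: str
    description: str
    disable_model_invocation: bool = False


def render_entry(skill: Skill) -> str:
    # Assumed entry layout; the real menu markup may differ.
    return (
        f"<skill><name>{skill.name}</name>"
        f"<description>{skill.description}</description></skill>"
    )


def rendered_menu_size(skills: list[Skill]) -> int:
    """Sum rendered lengths, skipping skills hidden from the model."""
    return sum(
        len(render_entry(s)) for s in skills if not s.disable_model_invocation
    )


def headroom(size: int, budget: int = SKILL_MENU_BUDGET) -> int:
    """Characters left before the budget is reached (negative if over)."""
    return budget - size


def fits_budget(size: int, budget: int = SKILL_MENU_BUDGET) -> bool:
    # Assumption: a menu exactly at the budget still passes.
    return size <= budget
```

For the figure reported above, `headroom(14_722)` evaluates to 278, which matches the "278 under the 15,000 budget" statement.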

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 15/16 changed files
  • Comments generated: 0 new

@Evangelink

Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 25, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| migrate-static-to-wrapper | Migrate DateTime.UtcNow to TimeProvider in a service class | 4.7/5 → 5.0/5 🟢 | ✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.08 | [1] |
| migrate-static-to-wrapper | Migrate only in scoped files, leaving others untouched | 5.0/5 → 5.0/5 | ✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.08 | [2] |
| migrate-static-to-wrapper | Update test doubles when migrating a service to TimeProvider | 5.0/5 → 5.0/5 | ✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.08 | [3] |
| migrate-static-to-wrapper | Handle a genuinely static class that cannot take constructor injection | 2.7/5 → 2.0/5 🔴 | ⚠️ NOT ACTIVATED | ✅ 0.08 | [4] |
| migrate-static-to-wrapper | Decline migration when wrapper does not exist yet | 4.7/5 → 4.7/5 | ✅ migrate-static-to-wrapper; tools: skill | ✅ 0.08 | [5] |
| coverage-analysis | Project-wide coverage analysis with existing Cobertura data | 2.3/5 → 4.7/5 🟢 | ✅ coverage-analysis; tools: skill, view, create, read_bash, stop_bash, bash / ✅ coverage-analysis; tools: skill, view, create, bash | ✅ 0.08 | |
| coverage-analysis | Run coverage from scratch without existing data | 4.0/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, create / ✅ coverage-analysis; tools: skill, glob, create | ✅ 0.08 | |
| coverage-analysis | Coverage plateau diagnosis | 3.0/5 → 4.7/5 🟢 | ✅ coverage-analysis; tools: skill, create, view, read_bash, stop_bash / ✅ coverage-analysis; tools: skill, create, view | ✅ 0.08 | |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill | ✅ 0.20 | [6] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 5.0/5 → 4.7/5 🔴 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.20 | [7] |
| test-anti-patterns | Detect duplicated tests and magic values | 4.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill | ✅ 0.20 | [8] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 4.0/5 → 4.3/5 🟢 | ✅ test-anti-patterns; tools: skill | ✅ 0.20 | [9] |
| test-anti-patterns | Detect coverage-touching pattern across a service facade | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill | ✅ 0.20 | [10] |
| test-anti-patterns | Detect self-referential assertions in round-trip and identity tests | 5.0/5 → 4.7/5 🔴 | ✅ test-anti-patterns; tools: skill | ✅ 0.20 | |
| test-anti-patterns | Polyglot: detect anti-patterns in a Python pytest suite | 4.3/5 → 5.0/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.20 | [11] |
| test-tagging | Tag an untagged MSTest test suite | 4.0/5 → 4.7/5 🟢 | ✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob | 🟡 0.24 | |
| test-tagging | Tag an untagged xUnit test suite | 3.0/5 → 4.7/5 🟢 | ✅ test-tagging; tools: skill | 🟡 0.24 | |
| test-tagging | Tag an untagged NUnit test suite | 4.0/5 → 4.7/5 🟢 | ✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob | 🟡 0.24 | |
| test-tagging | Audit test distribution without modifying files | 5.0/5 → 3.0/5 🔴 | ✅ test-tagging; tools: skill | 🟡 0.24 | |
| test-tagging | Decline request to write new tests | 4.0/5 → 4.0/5 | ℹ️ not activated (expected) | 🟡 0.24 | [12] |
| test-tagging | Tag a partially-tagged MSTest suite without duplicating existing traits | 4.0/5 → 4.0/5 | ✅ test-tagging; tools: skill | 🟡 0.24 | [13] |
| test-tagging | Accurately classify NUnit tests with misleading method names | 3.7/5 → 4.7/5 🟢 | ✅ test-tagging; tools: skill, glob / ⚠️ NOT ACTIVATED | 🟡 0.24 | |
| test-tagging | Tag MSTest tests and verify the project still builds | 5.0/5 → 4.0/5 🔴 | ✅ test-tagging; tools: skill | 🟡 0.24 | [14] |
| test-tagging | Report traits for a Go test suite without modifying source | 3.3/5 → 4.3/5 🟢 | ✅ test-tagging; tools: skill | 🟡 0.24 | [15] |
| assertion-quality | Identify low assertion diversity in equality-dominated test suite | 3.0/5 → 5.0/5 🟢 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill | 🟡 0.26 | |
| assertion-quality | Flag assertion-free tests and trivial-only assertions | 4.3/5 → 4.0/5 🔴 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill | 🟡 0.26 | [16] |
| assertion-quality | Recognize well-diversified assertion usage | 4.0/5 → 4.3/5 🟢 | ✅ assertion-quality; tools: skill, glob | 🟡 0.26 | |
| assertion-quality | Identify self-referential assertions in identity and round-trip tests | 4.7/5 → 4.7/5 | ✅ assertion-quality; tools: skill, glob | 🟡 0.26 | [17] |
| assertion-quality | Decline request to write new tests from scratch | 4.0/5 → 4.0/5 | ℹ️ not activated (expected) | 🟡 0.26 | [18] |
| assertion-quality | Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite | 5.0/5 → 5.0/5 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill, bash, glob | 🟡 0.26 | [19] |
| grade-tests | Grade a curated list of MSTest tests with mixed quality | 3.3/5 → 4.3/5 🟢 | ✅ grade-tests; tools: skill, glob, bash / ✅ grade-tests; tools: skill, bash, glob | 🟡 0.25 | [20] |
| grade-tests | Grade pytest test methods using the same rubric | 5.0/5 → 4.0/5 🔴 | ✅ grade-tests; tools: skill, bash, glob / ✅ grade-tests; tools: skill, glob, bash | 🟡 0.25 | [21] |
| grade-tests | Decline to grade the entire workspace when no test list is provided | 1.0/5 → 3.7/5 🟢 | ✅ grade-tests; tools: skill, glob / ✅ grade-tests; test-analysis-extensions; tools: skill | 🟡 0.25 | [22] |
| grade-tests | Grade Go table-driven tests without misreading the loop as branching | 4.0/5 → 4.3/5 🟢 | ✅ grade-tests; tools: skill, glob / ✅ grade-tests; tools: skill, glob, bash | 🟡 0.25 | [23] |
| grade-tests | Grade tests when the production code under test is unavailable | 4.0/5 → 4.0/5 | ✅ grade-tests; tools: skill, bash, glob | 🟡 0.25 | [24] |

[1] (Isolated) Quality improved but weighted score is -16.9% due to: judgment, tokens (63523 → 90348), quality, time (41.9s → 50.7s)
[2] ⚠️ High run-to-run variance (CV=67%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -3.4% due to: tokens (68462 → 95982), time (42.5s → 55.6s), tool calls (8 → 10)
[3] ⚠️ High run-to-run variance (CV=56%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=91%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=419%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=74%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=118%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.9% due to: tokens (40534 → 55003)
[8] ⚠️ High run-to-run variance (CV=638%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -12.8% due to: judgment, tokens (40346 → 64825), tool calls (3 → 4)
[9] ⚠️ High run-to-run variance (CV=75%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -4.5% due to: tokens (26404 → 63362), time (22.4s → 35.2s), tool calls (2 → 3)
[10] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (28438 → 66697), quality, tool calls (3 → 4), time (35.1s → 44.6s)
[11] ⚠️ High run-to-run variance (CV=67%) — consider re-running with --runs 5
[12] (Plugin) Quality unchanged but weighted score is -7.6% due to: tokens (40876 → 166232), tool calls (3 → 10), time (31.7s → 84.3s)
[13] ⚠️ High run-to-run variance (CV=346%) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=157%) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=463%) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=102%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -8.2% due to: tokens (26539 → 61526), time (21.8s → 39.0s), tool calls (2 → 3)
[17] ⚠️ High run-to-run variance (CV=246%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.3% due to: tokens (28090 → 93589), tool calls (2 → 4), time (39.3s → 53.6s)
[18] ⚠️ High run-to-run variance (CV=163%) — consider re-running with --runs 5
[19] (Plugin) Quality unchanged but weighted score is -7.7% due to: tokens (27080 → 88360), tool calls (3 → 6), time (18.9s → 44.0s)
[20] ⚠️ High run-to-run variance (CV=55%) — consider re-running with --runs 5
[21] (Plugin) Quality unchanged but weighted score is -7.4% due to: tokens (26374 → 141873), tool calls (2 → 7), time (17.3s → 54.4s)
[22] ⚠️ High run-to-run variance (CV=316%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -8.0% due to: tokens (52652 → 116941), time (30.1s → 53.0s), tool calls (7 → 10)
[23] ⚠️ High run-to-run variance (CV=474%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -11.0% due to: judgment, tokens (26851 → 64514), quality, tool calls (3 → 6), time (22.9s → 60.0s)
[24] ⚠️ High run-to-run variance (CV=55%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -24.4% due to: judgment, tokens (26572 → 71217), quality, tool calls (2 → 5), time (21.4s → 40.0s)
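
The "CV" figures in the footnotes above most likely refer to the coefficient of variation (standard deviation divided by the mean), though the report does not spell this out. The sketch below shows that calculation under that assumption. The sample standard deviation, the `needs_more_runs` helper, and its 0.5 default threshold are illustrative choices and are not taken from skill-validator.

```python
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation divided by the absolute mean."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / abs(mean)


def needs_more_runs(values: list[float], threshold: float = 0.5) -> bool:
    # Hypothetical threshold; the validator's real cut-off is not stated here.
    return coefficient_of_variation(values) > threshold
```

Because the mean sits in the denominator, per-run values that average close to zero can produce very large percentages even when the absolute spread is small, which is one plausible reading of the values above 400%.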

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 803 in dotnet/skills, download eval artifacts with gh run download 28169350954 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/9ecec6aa4e5c184d6805b23559bc7ccb9705b51f/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

@Evangelink merged commit 592a530 into dotnet:main on Jun 25, 2026
41 checks passed

Labels

waiting-on-author PR state label

4 participants