skill-validator: restore 15K aggregate cap as the real Copilot CLI skill-menu budget by Evangelink · Pull Request #803 · dotnet/skills

Evangelink · 2026-06-22T15:37:06Z

Why

The skill-validator's per-plugin aggregate description cap (SkillProfiler.MaxAggregateDescriptionLength) had been raised 15,000 → 20,000 → 22,000, justified by a code comment asserting that 15K was "a local repo policy, NOT a documented Copilot/agentskills constraint."

That assertion is wrong. The GitHub Copilot CLI renders the model-facing <available_skills> menu under a hard 15,000-character budget (the agent SDK's SKILL_CHAR_BUDGET, default 15e3, confirmed in CLI 1.0.36 and 1.0.61). Skills are listed alphabetically by name and emitted with their full <description> only until the budget is exhausted; every skill past the cut-off collapses to a bare name with no description and can no longer be reliably model-activated.

Raising the validator cap didn't add headroom; it masked silent menu truncation. This is the root cause behind the dotnet-test plugin-arm activation failures (e.g. run-tests, test-*): these skills sit alphabetically late, fell into the name-only overflow, and never activated in plugin eval runs even though they activate fine in isolation. Description tuning can't fix that, because the description is never shown.
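
The truncation behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the CLI's actual code: the `Skill` type, the `render_menu` function, and the exact `<skill>` markup are hypothetical, and it assumes the budget is charged only for full entries.

```python
from dataclasses import dataclass

SKILL_CHAR_BUDGET = 15_000  # default value cited in the PR description


@dataclass
class Skill:
    name: str
    description: str


def render_menu(skills, budget=SKILL_CHAR_BUDGET):
    """Return (menu_text, names_rendered_without_description).

    Hypothetical markup and accounting; the real CLI may differ.
    """
    lines, used, name_only = [], 0, []
    overflowed = False
    for skill in sorted(skills, key=lambda s: s.name):  # alphabetical by name
        full = (f"<skill><name>{skill.name}</name>"
                f"<description>{skill.description}</description></skill>")
        if not overflowed and used + len(full) <= budget:
            lines.append(full)
            used += len(full)
        else:
            # Once the budget is exhausted, every later skill is name-only.
            overflowed = True
            lines.append(f"<skill><name>{skill.name}</name></skill>")
            name_only.append(skill.name)
    return "\n".join(lines), name_only


skills = [
    Skill("run-tests", "r" * 60),
    Skill("assertion-quality", "a" * 60),
    Skill("coverage-analysis", "c" * 60),
]
_, hidden = render_menu(skills, budget=300)
print(hidden)  # ['run-tests']
```

With a deliberately small budget, the alphabetically last skill is the one that loses its description, which mirrors why run-tests and test-* were the skills affected.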

What

SkillProfiler.MaxAggregateDescriptionLength: 22,000 → 15,000, with the comment rewritten to document the real Copilot CLI budget (and correct the prior claim).
Aggregate now excludes disable-model-invocation: true skills. The CLI drops those from the menu entirely, so they don't consume the budget. This makes the cap satisfiable by hiding reference / agent-orchestrated primitives rather than only by trimming descriptions.
InvestigatingResults.md: documents plugin-arm-only non-activation caused by skill-menu budget overflow, and how to fix it.
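
As a rough sketch of the check described in the list above (not the validator's actual C# code), the aggregate sums description lengths while skipping skills flagged with disable-model-invocation. The `SkillInfo` shape and the function names here are hypothetical stand-ins.

```python
from dataclasses import dataclass

MAX_AGGREGATE_DESCRIPTION_LENGTH = 15_000  # cap restored by this PR


@dataclass
class SkillInfo:
    name: str
    description: str
    disable_model_invocation: bool = False


def aggregate_description_length(skills):
    """Sum description lengths of skills that remain visible in the menu."""
    return sum(len(s.description) for s in skills
               if not s.disable_model_invocation)


def check_plugin(skills, cap=MAX_AGGREGATE_DESCRIPTION_LENGTH):
    """Return (passes, total) for one plugin's skills."""
    total = aggregate_description_length(skills)
    return total <= cap, total


plugin = [
    SkillInfo("coverage-analysis", "x" * 9_000),
    SkillInfo("reference-primitive", "y" * 8_000, disable_model_invocation=True),
    SkillInfo("run-tests", "z" * 5_000),
]
print(check_plugin(plugin))  # (True, 14000)
```

Flagging the reference skill removes exactly its description length from the total, which is the behaviour the Verification section reports.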

⚠️ Sequencing

dotnet-test currently aggregates ~20.7K chars (the only plugin over 15K), so skill-check will fail for it until it is slimmed below the cap — via disable-model-invocation on reference/primitive skills (see #800) plus description trims. This PR should merge once dotnet-test is ≤ 15K visible. All other plugins are already under the cap (next largest: dotnet-msbuild at ~14.5K).

Verification

skill-validator builds clean (0 warnings).
Confirmed the cap is enforced and that disable-model-invocation skills are excluded from the aggregate (flagging two reference skills dropped the reported total by exactly their description lengths).

…opilot CLI skill-menu budget

The per-plugin aggregate description cap had been raised 15,000 -> 20,000 -> 22,000 under the belief that 15K was 'a local repo policy, NOT a documented Copilot constraint'. That belief was wrong: the GitHub Copilot CLI renders the model-facing <available_skills> menu under a hard 15,000-char budget (the agent SDK's SKILL_CHAR_BUDGET, default 15e3, confirmed in CLI 1.0.36 and 1.0.61). Skills are listed alphabetically and emitted with their full <description> only until the budget is exhausted; every skill past the cut-off collapses to a bare name with no description and can no longer be reliably model-activated.

Raising the validator cap merely masked this silent menu truncation: e.g. dotnet-test's run-tests and test-* skills stopped activating in plugin eval runs because they fell into the name-only overflow.

Changes:
- SkillProfiler.MaxAggregateDescriptionLength: 22,000 -> 15,000, with the comment rewritten to document the real Copilot CLI budget (and correct the prior 'not a documented constraint' claim).
- CheckCommand aggregate now excludes skills marked 'disable-model-invocation: true'; the CLI drops those from the menu, so they do not consume the budget. This makes the cap satisfiable by hiding reference / agent-orchestrated primitives rather than only by trimming.
- InvestigatingResults.md: document plugin-arm-only non-activation caused by skill-menu budget overflow, and how to fix it.

Note: dotnet-test currently exceeds 15K and must be slimmed below it (via disable-model-invocation on reference/primitive skills plus description trims) before this cap can go green repo-wide.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-22T15:37:17Z

Note

This PR is from a fork and modifies infrastructure files (eng/ or .github/).

Changes to infrastructure typically need to be submitted from a branch in dotnet/skills (not a fork) so that CI workflows run with the correct permissions and secrets.

Please consider recreating this PR from an upstream branch. If you don't have push access to dotnet/skills, ask a maintainer to push your branch for you.

Copilot

Pull request overview

Aligns skill-validator’s per-plugin aggregate description cap with the Copilot CLI’s effective 15,000-character skill-menu budget, and updates validation/docs to prevent silent <available_skills> truncation from masking plugin-arm non-activation.

Changes:

Restores SkillProfiler.MaxAggregateDescriptionLength to 15,000 and rewrites the rationale/commentary to reflect the Copilot CLI menu budget behavior.
Updates check to exclude skills with disable-model-invocation: true from the aggregate description total (matching CLI menu behavior).
Documents “plugin-arm-only non-activation due to menu overflow” troubleshooting steps in InvestigatingResults.md.

File	Description
eng/skill-validator/src/docs/InvestigatingResults.md	Adds guidance for diagnosing plugin-only non-activation caused by Copilot CLI skill-menu budget overflow and suggests mitigations.
eng/skill-validator/src/Check/SkillProfiler.cs	Lowers the aggregate description cap to 15,000 and documents it as a Copilot CLI budget constraint.
eng/skill-validator/src/Check/CheckCommand.cs	Excludes `disable-model-invocation: true` skills from the aggregate description calculation during plugin checks.

Copilot's findings

Files reviewed: 3/3 changed files
Comments generated: 2

AbhitejJohn · 2026-06-22T16:16:25Z

@Evangelink : Can we re-create this from a branch in the repo please?

…ion check

Address review: replace Regex.IsMatch(pattern-string) with a [GeneratedRegex] partial method (AOT-friendly, no per-call cache lookup), matching FrontmatterParser's style. Runs once per skill during checks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-22T17:00:01Z

👋 @Evangelink — this PR has 2 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

github-actions · 2026-06-22T18:44:19Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
system-text-json-net11	Serialize JSON in .NET 11 with PascalCase property names	4.0/5 → 5.0/5 🟢	✅ system-text-json-net11; tools: skill	✅ 0.06	❌ [1]
system-text-json-net11	Type-safe JsonTypeInfo access without exceptions in .NET 11	3.0/5 → 5.0/5 🟢	✅ system-text-json-net11; tools: skill, edit, view	✅ 0.06	✅
system-text-json-net11	Non-activation: camelCase JSON serialization on .NET 8	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.06	❌ [2]
optimizing-ef-core-queries	Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	🟡 0.22	❌

[1] (Isolated) Quality improved but weighted score is -2.1% due to: tokens (65782 → 85343), tool calls (5 → 6)
[2] (Isolated) Quality unchanged but weighted score is -16.2% due to: judgment, tokens (51994 → 80828), tool calls (4 → 6)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 803 in dotnet/skills, download eval artifacts with gh run download 27975446091 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/78e9dda103845334f0ab7d390467ad30e744f360/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

github-actions · 2026-06-22T22:00:33Z

✅ Approved by @AbhitejJohn. cc @dotnet/skills-merge-approvers — ready to merge.

Copilot

Copilot's findings

Files reviewed: 3/3 changed files
Comments generated: 1

…ck-scalar false positives

The regex-based check matched any line in the frontmatter, so a block-scalar description that merely mentioned 'disable-model-invocation: true' on its own line was wrongly treated as disabling model invocation. Parse the frontmatter with the existing YAML deserializer (which correctly handles block scalars) by adding a DisableModelInvocation field to SkillFrontmatter, and drop the regex entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
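
The false positive described in this commit can be reproduced with a small sketch. The regex below is a hypothetical stand-in for the removed pattern, and `top_level_says_disabled` is a deliberately simplified structure-aware check (it treats only unindented lines as mapping keys), not the YAML deserializer the commit actually uses; it does not handle quoted values or trailing comments.

```python
import re

# Frontmatter whose block-scalar description merely mentions the flag.
FRONTMATTER = """\
name: reference-skill
description: |
  Explains the flag. A skill opts out by setting
  disable-model-invocation: true
  in its frontmatter.
"""

# Hypothetical line-oriented pattern: matches the flag on any line.
LINE_REGEX = re.compile(r"^\s*disable-model-invocation:\s*true\s*$",
                        re.MULTILINE)


def regex_says_disabled(frontmatter):
    return bool(LINE_REGEX.search(frontmatter))


def top_level_says_disabled(frontmatter):
    """Only unindented 'key: value' lines count as top-level keys;
    indented lines belong to a block scalar or a nested value."""
    for line in frontmatter.splitlines():
        if line[:1].isspace() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key.strip() == "disable-model-invocation":
            return value.strip().lower() == "true"
    return False


print(regex_says_disabled(FRONTMATTER))      # True (false positive)
print(top_level_says_disabled(FRONTMATTER))  # False
```

A real YAML parser, as adopted in the commit, is the more robust choice because it also covers quoting, comments, and nested mappings that this simplified scan ignores.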

Evangelink · 2026-06-23T09:44:17Z

/evaluate

Copilot

Copilot's findings

Files reviewed: 6/6 changed files
Comments generated: 2

…killMenuLength

The constant now caps the fully-rendered <skill>-menu size (name + description + location + markup) rather than the sum of raw description lengths, so the old name was misleading and risked someone reverting to summing Description.Length. Rename it (and update related comments/docs) to reflect what is actually enforced. Also fix the test-name grammar DescriptionsSummingToLimit_Fail -> _Fails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
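
To illustrate why the distinction in this commit matters, the sketch below compares the sum of raw description lengths with a rendered-menu size that also counts name, location, and markup. The markup shape, the `MenuSkill` type, and the sample paths are assumptions for illustration only; the validator's real rendering may differ.

```python
from dataclasses import dataclass

MENU_BUDGET = 15_000


@dataclass
class MenuSkill:
    name: str
    description: str
    location: str


def raw_description_length(skills):
    """The old measure: descriptions only."""
    return sum(len(s.description) for s in skills)


def rendered_menu_length(skills):
    """The new measure: name + description + location + (assumed) markup."""
    return sum(
        len(f"<skill><name>{s.name}</name>"
            f"<description>{s.description}</description>"
            f"<location>{s.location}</location></skill>")
        for s in skills
    )


# 100 hypothetical skills with 140-character descriptions each.
demo = [
    MenuSkill(f"skill-{i:03d}", "d" * 140,
              f"plugins/demo/skills/skill-{i:03d}/SKILL.md")
    for i in range(100)
]
print(raw_description_length(demo) <= MENU_BUDGET)  # True
print(rendered_menu_length(demo) <= MENU_BUDGET)    # False
```

A plugin can therefore pass a description-only sum while its rendered menu still overflows the budget, which is the gap the renamed constant is meant to close.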

github-actions · 2026-06-24T10:10:47Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
writing-mstest-tests	Write unit tests for a service class	5.0/5 → 5.0/5	✅ writing-mstest-tests; tools: skill, glob	🟡 0.31	❌
writing-mstest-tests	Write data-driven tests for a calculator	3.0/5 → 5.0/5 🟢	✅ writing-mstest-tests; tools: skill, report_intent, view / ⚠️ NOT ACTIVATED	🟡 0.31	✅
writing-mstest-tests	Write async tests with cancellation	3.0/5 → 5.0/5 🟢	✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.31	✅
writing-mstest-tests	Fix swapped Assert.AreEqual arguments	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	🟡 0.31	❌ [1]
writing-mstest-tests	Modernize legacy test patterns	5.0/5 → 5.0/5	✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.31	❌ [2]
writing-mstest-tests	Replace ExpectedException with Assert.Throws	3.0/5 → 5.0/5 🟢	✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED	🟡 0.31	✅
writing-mstest-tests	Use proper collection assertions	3.0/5 → 2.0/5 🔴	✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.31	❌
writing-mstest-tests	Use proper type assertions instead of casts	4.0/5 → 4.0/5	✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.31	✅
writing-mstest-tests	Set up test lifecycle correctly	2.0/5 → 5.0/5 🟢	✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.31	✅
writing-mstest-tests	Use DynamicData with ValueTuples over object arrays	3.0/5 → 5.0/5 🟢	✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.31	❌ [3]
writing-mstest-tests	Use string assertions for format validation	3.0/5 → 4.0/5 🟢	✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.31	❌
writing-mstest-tests	Use comparison assertions for boundary testing	2.0/5 → 4.0/5 🟢	✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.31	❌ [4]
writing-mstest-tests	Write tests with collection, null, and reference assertions	4.0/5 → 5.0/5 🟢	✅ writing-mstest-tests; tools: glob, skill / ⚠️ NOT ACTIVATED	🟡 0.31	✅
writing-mstest-tests	Configure conditional execution, retry, and cleanup	3.0/5 → 4.0/5 🟢	✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED	🟡 0.31	✅
writing-mstest-tests	Configure test parallelization and MSTest.Sdk project	3.0/5 → 5.0/5 🟢	✅ writing-mstest-tests; tools: skill	🟡 0.31	✅
test-tagging	Tag an untagged MSTest test suite	4.0/5 → 4.0/5	✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob	✅ 0.20	✅
test-tagging	Tag an untagged xUnit test suite	3.0/5 → 5.0/5 🟢	✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob	✅ 0.20	✅
test-tagging	Tag an untagged NUnit test suite	4.0/5 → 5.0/5 🟢	✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob	✅ 0.20	✅
test-tagging	Audit test distribution without modifying files	5.0/5 → 4.0/5 🔴	⚠️ NOT ACTIVATED / ✅ test-tagging; tools: skill	✅ 0.20	❌
test-tagging	Decline request to write new tests	4.0/5 → 4.0/5	ℹ️ not activated (expected)	✅ 0.20	❌ [5]
test-tagging	Tag a partially-tagged MSTest suite without duplicating existing traits	4.0/5 → 4.0/5	✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob	✅ 0.20	❌ [6]
test-tagging	Accurately classify NUnit tests with misleading method names	5.0/5 → 5.0/5	✅ test-tagging; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.20	❌ [7]
test-tagging	Tag MSTest tests and verify the project still builds	4.0/5 → 4.0/5	✅ test-tagging; tools: skill	✅ 0.20	✅
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 9)	1.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.11	✅
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 10)	1.0/5 → 3.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.11	✅
mtp-hot-reload	Enable hot reload when package already installed	2.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.11	✅
mtp-hot-reload	Suggest launchSettings.json configuration for hot reload	1.0/5 → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create	✅ 0.11	✅
mtp-hot-reload	Use dotnet run not dotnet test for hot reload	3.0/5 → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.11	✅
mtp-hot-reload	Negative: VSTest project cannot use MTP hot reload	2.0/5 → 2.0/5	✅ mtp-hot-reload; tools: skill, edit, bash, create / ✅ mtp-hot-reload; tools: skill, bash, create	✅ 0.11	❌
mtp-hot-reload	Run specific failing test with hot reload filter	1.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.11	✅
assertion-quality	Identify low assertion diversity in equality-dominated test suite	4.0/5 → 5.0/5 🟢	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill	🟡 0.23	✅
assertion-quality	Flag assertion-free tests and trivial-only assertions	4.0/5 → 4.0/5	✅ assertion-quality; tools: skill, glob	🟡 0.23	❌ [8]
assertion-quality	Recognize well-diversified assertion usage	5.0/5 → 4.0/5 🔴	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill	🟡 0.23	❌ [9]
assertion-quality	Identify self-referential assertions in identity and round-trip tests	5.0/5 → 3.0/5 🔴	✅ assertion-quality; tools: skill, glob	🟡 0.23	❌
assertion-quality	Decline request to write new tests from scratch	5.0/5 → 4.0/5 🔴	ℹ️ not activated (expected)	🟡 0.23	❌
assertion-quality	Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite	4.0/5 → 5.0/5 🟢	✅ assertion-quality; tools: skill, glob	🟡 0.23	✅
test-smell-detection	Detect multiple test smells in order processing test suite	4.0/5 → 5.0/5 🟢	✅ test-smell-detection; tools: skill	🟡 0.43	✅
test-smell-detection	Recognize well-written tests with no significant smells	4.0/5 → 5.0/5 🟢	✅ test-smell-detection; tools: skill / ✅ test-smell-detection; tools: skill, glob	🟡 0.43	✅
test-smell-detection	Recognize integration tests and avoid false positives for external resources	5.0/5 → 5.0/5	✅ test-smell-detection; tools: skill	🟡 0.43	❌ [10]
test-smell-detection	Decline request to write new tests from scratch	5.0/5 → 4.0/5 🔴	ℹ️ not activated (expected)	🟡 0.43	❌
test-smell-detection	Polyglot: detect canonical test smells in a JUnit/Java Catalog suite	5.0/5 → 5.0/5	✅ test-smell-detection; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.43	❌ [11]
test-gap-analysis	Find boundary mutation gaps in tiered discount and shipping logic	5.0/5 → 5.0/5	✅ test-gap-analysis; tools: skill	✅ 0.06	❌ [12]
test-gap-analysis	Find logic and null-check mutation gaps in access control code	4.0/5 → 5.0/5 🟢	✅ test-gap-analysis; tools: skill	✅ 0.06	✅
test-gap-analysis	Acknowledge well-tested code with few surviving mutations	4.0/5 → 4.0/5	✅ test-gap-analysis; tools: skill	✅ 0.06	✅ [13]
test-gap-analysis	Decline request to write new tests from scratch	4.0/5 → 2.0/5 ⏰ 🔴	ℹ️ not activated (expected)	✅ 0.06	❌
migrate-static-to-wrapper	Migrate DateTime.UtcNow to TimeProvider in a service class	5.0/5 → 5.0/5	✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.08	❌ [14]
migrate-static-to-wrapper	Migrate only in scoped files, leaving others untouched	5.0/5 → 5.0/5	✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.08	❌ [15]
migrate-static-to-wrapper	Decline migration when wrapper does not exist yet	2.0/5 → 5.0/5 🟢	✅ migrate-static-to-wrapper; tools: skill	✅ 0.08	✅
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill	✅ 0.20	❌ [16]
test-anti-patterns	Detect flakiness indicators and test coupling	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; tools: skill, glob	✅ 0.20	❌ [17]
test-anti-patterns	Detect duplicated tests and magic values	4.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill	✅ 0.20	❌ [18]
test-anti-patterns	Recognize well-written tests without inventing false positives	4.0/5 → 4.0/5	✅ test-anti-patterns; tools: skill	✅ 0.20	❌ [19]
test-anti-patterns	Detect coverage-touching pattern across a service facade	5.0/5 → 4.0/5 🔴	✅ test-anti-patterns; tools: skill	✅ 0.20	❌
test-anti-patterns	Detect self-referential assertions in round-trip and identity tests	4.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.20	❌ [20]
test-anti-patterns	Polyglot: detect anti-patterns in a Python pytest suite	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	✅ 0.20	✅
crap-score	Calculate CRAP score for a single method with partial coverage	3.0/5 → 4.0/5 🟢	✅ crap-score; tools: skill, glob / ✅ crap-score; tools: skill	✅ 0.07	✅
crap-score	Identify riskiest methods across a file	5.0/5 → 5.0/5	✅ crap-score; tools: skill	✅ 0.07	✅
crap-score	Generate coverage then compute CRAP score	4.0/5 → 4.0/5	✅ crap-score; tools: skill	✅ 0.07	✅
detect-static-dependencies	Identify static dependencies in a multi-class project	4.0/5 → 5.0/5 🟢	✅ detect-static-dependencies; tools: skill	✅ 0.08	✅
detect-static-dependencies	Detect time-related statics and recommend TimeProvider	4.0/5 → 5.0/5 🟢	✅ detect-static-dependencies; tools: skill, glob / ✅ detect-static-dependencies; tools: skill	✅ 0.08	❌ [21]
detect-static-dependencies	Decline scan for non-C# project	5.0/5 → 5.0/5	✅ detect-static-dependencies; tools: skill / ℹ️ not activated (expected)	✅ 0.08	❌ [22]
detect-static-dependencies	Verify structured report includes file count, categories, and top patterns	5.0/5 → 5.0/5	✅ detect-static-dependencies; tools: skill, glob / ✅ detect-static-dependencies; tools: skill	✅ 0.08	❌ [23]
detect-static-dependencies	Exclude obj and bin directories from the scan	5.0/5 → 5.0/5	✅ detect-static-dependencies; tools: skill, glob / ✅ detect-static-dependencies; tools: skill	✅ 0.08	❌ [24]
detect-static-dependencies	Detect statics inside lambda expressions and LINQ queries	4.0/5 → 4.0/5	✅ detect-static-dependencies; tools: skill, glob	✅ 0.08	❌
generate-testability-wrappers	Generate TimeProvider adoption for DateTime.UtcNow	4.0/5 → 5.0/5 🟢	✅ generate-testability-wrappers; tools: skill, report_intent / ✅ generate-testability-wrappers; tools: skill, report_intent, glob, view, edit, bash	✅ 0.10	❌ [25]
generate-testability-wrappers	Generate custom Environment wrapper	3.0/5 → 5.0/5 🟢	✅ generate-testability-wrappers; tools: skill	✅ 0.10	✅
generate-testability-wrappers	Recommend System.IO.Abstractions for file system calls	4.0/5 → 5.0/5 🟢	✅ generate-testability-wrappers; tools: skill / ✅ generate-testability-wrappers; tools: report_intent, skill	✅ 0.10	✅
generate-testability-wrappers	Decline wrapper generation for already-abstracted code	2.0/5 → 5.0/5 🟢	✅ generate-testability-wrappers; tools: skill / ℹ️ not activated (expected)	✅ 0.10	✅
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED	✅ 0.14	❌ [26]
run-tests	Run tests with trx reporting on MTP project (SDK 9)	2.0/5 → 5.0/5 🟢	✅ run-tests; tools: report_intent, skill, view	✅ 0.14	✅
run-tests	Run tests with blame-hang on MTP project (SDK 10)	2.0/5 → 4.0/5 🟢	⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Run tests on a specific TFM with TRX in a multi-TFM MTP project (SDK 9)	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	✅ 0.14	❌ [27]
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: skill	✅ 0.14	❌ [28]
run-tests	Filter NUnit tests by class name on VSTest	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view	✅ 0.14	✅
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Filter xUnit v3 tests by class pattern and trait using query filter language	1.0/5 → 1.0/5 ⏰	✅ run-tests; tools: report_intent, skill, view, bash / ⚠️ NOT ACTIVATED	✅ 0.14	❌ [29]
run-tests	Filter TUnit tests by class using treenode-filter	1.0/5 → 4.0/5 🟢	⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Combine multiple filter criteria on VSTest MSTest	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	MTP project on SDK 9 must use -- separator for args	1.0/5 → 3.0/5 🟢	⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	MTP project on SDK 10 passes args directly	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Detect test platform from Directory.Build.props	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	✅ 0.14	✅
dotnet-test-frameworks	Cross-framework assertion equivalence mapping	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.12	❌ [30]
dotnet-test-frameworks	Identify TUnit framework and its unique attributes	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.12	❌ [31]
dotnet-test-frameworks	Replace try-catch with framework-native exception assertions	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.12	❌ [32]
dotnet-test-frameworks	Skip annotations across all four frameworks	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.12	❌ [33]
dotnet-test-frameworks	Convert NUnit lifecycle methods to xUnit equivalents	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.12	❌ [34]
dotnet-test-frameworks	Identify integration tests by markers and code patterns	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.12	❌ [35]
dotnet-test-frameworks	Convert cross-framework assertions to TUnit syntax	2.0/5 → 1.0/5 🔴	ℹ️ not activated (expected)	✅ 0.12	❌ [36]
dotnet-test-frameworks	Diagnose silently-passing TUnit test with missing await	4.0/5 → 4.0/5	ℹ️ not activated (expected)	✅ 0.12	❌ [37]
dotnet-test-frameworks	Refactor TUnit try/catch to native exception assertion	2.0/5 → 2.0/5	ℹ️ not activated (expected)	✅ 0.12	❌ [38]
dotnet-test-frameworks	TUnit lifecycle hooks at test, class, assembly, and session scope	4.0/5 → 4.0/5	ℹ️ not activated (expected)	✅ 0.12	❌ [39]
dotnet-test-frameworks	TUnit skip mechanisms — attribute, assembly-wide, and dynamic	3.0/5 → 3.0/5	ℹ️ not activated (expected)	✅ 0.12	✅
grade-tests	Grade a curated list of MSTest tests with mixed quality	4.0/5 → 5.0/5 🟢	✅ grade-tests; tools: skill, glob	🟡 0.21	✅
grade-tests	Grade pytest test methods using the same rubric	5.0/5 → 5.0/5	✅ grade-tests; tools: skill / ✅ grade-tests; tools: skill, bash, glob	🟡 0.21	✅
grade-tests	Decline to grade the entire workspace when no test list is provided	1.0/5 → 1.0/5	✅ grade-tests; tools: skill, glob / ✅ grade-tests; tools: skill	🟡 0.21	❌ [40]
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	3.0/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, create	✅ 0.10	✅
coverage-analysis	Run coverage from scratch without existing data	4.0/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, glob, read_bash, stop_bash, create / ✅ coverage-analysis; tools: skill, glob, create	✅ 0.10	✅
coverage-analysis	Coverage plateau diagnosis	4.0/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, bash, create	✅ 0.10	✅
code-testing-agent	Generate tests for ContosoUniversity ASP.NET Core MVC app	3.0/5 → 3.0/5	✅ code-testing-agent; tools: skill	🟡 0.21	❌ [41]
code-testing-agent	Generate pytest tests for the Flask tasks API (Python polyglot)	5.0/5 → 4.0/5 🔴	✅ code-testing-agent; tools: skill	🟡 0.21	❌
code-testing-agent	Generate Vitest tests for the shopping-cart library (TypeScript polyglot)	4.0/5 → 5.0/5 🟢	✅ code-testing-agent; tools: skill	🟡 0.21	✅
code-testing-agent	Does not revert a gutted-looking workspace (workspace integrity)	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	🟡 0.21	❌ [42]
optimizing-ef-core-queries	Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete	5.0/5 → 4.0/5 🔴	✅ optimizing-ef-core-queries; tools: report_intent, skill	🟡 0.27	❌

[1] (Plugin) Quality unchanged but weighted score is -2.9% due to: tokens (12709 → 17382), time (6.3s → 8.9s)
[2] (Plugin) Quality unchanged but weighted score is -6.8% due to: quality, tokens (161501 → 213972), tool calls (11 → 15)
[3] (Plugin) Quality unchanged but weighted score is -2.0% due to: tokens (12816 → 17376)
[4] (Plugin) Quality unchanged but weighted score is -21.0% due to: judgment, quality, tokens (13788 → 17868)
[5] (Isolated) Quality unchanged but weighted score is -0.2% due to: efficiency metrics
[6] (Plugin) Quality unchanged but weighted score is -5.5% due to: tokens (139842 → 425155), time (60.2s → 111.5s), tool calls (21 → 30)
[7] (Isolated) Quality unchanged but weighted score is -3.1% due to: tokens (114966 → 204093), time (55.3s → 76.0s), tool calls (19 → 23)
[8] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (26571 → 85903), tool calls (2 → 5), time (20.8s → 41.9s)
[9] (Plugin) Quality unchanged but weighted score is -1.0% due to: tokens (40609 → 85681), time (21.6s → 37.4s), tool calls (4 → 5)
[10] (Isolated) Quality unchanged but weighted score is -9.5% due to: tokens (40900 → 100100), tool calls (4 → 8), time (32.8s → 59.6s)
[11] (Plugin) Quality unchanged but weighted score is -1.8% due to: tokens (28144 → 37272)
[12] (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (41240 → 87606), time (23.9s → 49.9s), tool calls (5 → 7)
[13] (Plugin) Quality dropped but weighted score is +7.9% due to: completion (✗ → ✓), tokens (233474 → 110596), tool calls (21 → 7), time (102.2s → 76.8s)
[14] (Isolated) Quality unchanged but weighted score is -20.3% due to: judgment, tokens (53521 → 90318), quality, time (20.9s → 32.4s)
[15] (Isolated) Quality unchanged but weighted score is -19.0% due to: judgment, quality, tokens (68476 → 90944), time (26.3s → 33.8s), tool calls (8 → 10)
[16] (Plugin) Quality unchanged but weighted score is -8.5% due to: tokens (27791 → 65695), quality, tool calls (3 → 4)
[17] (Plugin) Quality unchanged but weighted score is -11.5% due to: tokens (40267 → 113966), quality, tool calls (3 → 7), time (28.5s → 43.4s)
[18] (Isolated) Quality improved but weighted score is -13.5% due to: judgment, tokens (40323 → 71088), tool calls (3 → 4)
[19] (Isolated) Quality unchanged but weighted score is -18.0% due to: completion (✓ → ✗), tokens (26398 → 51388), tool calls (2 → 3), time (14.9s → 21.0s)
[20] (Isolated) Quality improved but weighted score is -2.3% due to: tokens (42081 → 73087), tool calls (4 → 5)
[21] (Plugin) Quality unchanged but weighted score is -9.2% due to: tokens (39968 → 99609), tool calls (4 → 10), time (21.4s → 36.1s)
[22] (Isolated) Quality unchanged but weighted score is -6.7% due to: tokens (38629 → 74487), time (16.7s → 26.6s), tool calls (4 → 5)
[23] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (55800 → 100952), time (46.8s → 58.7s)
[24] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (52645 → 91698), time (24.9s → 33.2s)
[25] (Isolated) Quality improved but weighted score is -9.2% due to: tokens (12939 → 44171), tool calls (0 → 2), time (10.8s → 18.3s)
[26] (Plugin) Quality unchanged but weighted score is -1.7% due to: tokens (25319 → 34450)
[27] (Plugin) Quality unchanged but weighted score is -2.4% due to: tokens (25473 → 34665), time (10.1s → 12.5s)
[28] (Plugin) Quality unchanged but weighted score is -7.9% due to: tokens (25442 → 62594), tool calls (2 → 4)
[29] (Plugin) Quality unchanged but weighted score is -1.2% due to: tokens (12693 → 17210)
[30] (Plugin) Quality unchanged but weighted score is -2.3% due to: tokens (12836 → 17444)
[31] (Plugin) Quality unchanged but weighted score is -1.5% due to: tokens (13069 → 17617)
[32] (Plugin) Quality unchanged but weighted score is -2.2% due to: tokens (13107 → 17696)
[33] (Plugin) Quality unchanged but weighted score is -1.1% due to: tokens (12697 → 17169)
[34] (Isolated) Quality unchanged but weighted score is -0.2% due to: efficiency metrics
[35] (Plugin) Quality unchanged but weighted score is -1.5% due to: tokens (13334 → 17918)
[36] (Plugin) Quality unchanged but weighted score is -9.5% due to: tokens (12747 → 34598), tool calls (0 → 1), time (5.8s → 10.3s)
[37] (Isolated) Quality unchanged but weighted score is -0.1% due to: efficiency metrics
[38] (Isolated) Quality unchanged but weighted score is -0.6% due to: time (6.4s → 7.8s)
[39] (Plugin) Quality unchanged but weighted score is -2.6% due to: tokens (13154 → 17964), time (12.4s → 16.2s)
[40] (Isolated) Quality unchanged but weighted score is -2.6% due to: time (35.4s → 57.9s), tokens (82593 → 103252)
[41] (Isolated) Quality unchanged but weighted score is -15.3% due to: judgment, quality, tokens (663213 → 805576)
[42] (Isolated) Quality unchanged but weighted score is -15.7% due to: judgment, quality

⏰ timeout — run(s) hit the (120s, 180s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 803 in dotnet/skills, download eval artifacts with gh run download 28088953671 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/4765b90605280fea89478f211b55e516d1412265/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

Copilot

Copilot's findings

Files reviewed: 6/6 changed files
Comments generated: 2

…ptions under 15K rendered menu budget

Merge the polyglot find-untested-sources-polyglot skill into find-untested-sources as a single model-invocable skill documenting both engines (Roslyn for C#/.NET, tree-sitter for polyglot), restoring discoverability that was lost when both were hidden via disable-model-invocation to fit the budget.

Trim keyword-stuffed descriptions on coverage-analysis, test-anti-patterns, test-tagging, assertion-quality, grade-tests, and migrate-static-to-wrapper.

Rendered skill-menu size for dotnet-test drops to 14,722 chars (278 under the 15,000 cap enforced by skill-validator on PR dotnet#803).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…x stale cap comment

Address review feedback:
- Plumb DisableModelInvocation through SkillInfo (populated once in SkillDiscovery.DiscoverSkillAt) instead of re-deserializing each skill's YAML frontmatter in CheckCommand. Removes the per-skill double parse.
- Correct the SkillProfiler comment that still called the constant an 'aggregate description size cap'; it enforces a rendered skill-menu budget.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-06-25T12:07:28Z

Pushed two commits (72e07c31f, 9ecec6aa4):

1. dotnet-test slimmed under the cap — resolves the ⚠️ Sequencing blocker. dotnet-test now renders at 14,722 chars (278 under the 15,000 budget). Done by:

Consolidating find-untested-sources + find-untested-sources-polyglot into a single model-invocable find-untested-sources skill documenting both engines (Roslyn for C#/.NET, tree-sitter for polyglot). This restores discoverability for "which files have no tests?" — both skills were previously hidden via disable-model-invocation purely to fit the budget.
Trimming keyword-stuffed descriptions on coverage-analysis, test-anti-patterns, test-tagging, assertion-quality, grade-tests, and migrate-static-to-wrapper (all routing keywords preserved).

2. Addressed the two open review comments: plumbed DisableModelInvocation through SkillInfo (no more per-skill double YAML parse) and corrected the stale "aggregate description size cap" comment in SkillProfiler.

skill-validator builds clean (0 warnings), all 590 tests pass, and check --plugin dotnet-test now exits 0.
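
The budget arithmetic above can be illustrated with a minimal Python sketch. It follows the behaviour described in the PR description (skills sorted alphabetically by name, full descriptions emitted until the budget is exhausted, `disable-model-invocation: true` skills dropped from the menu). The `Skill` type, the `render_entry` markup, and the choice to count only described entries toward the budget are assumptions made for illustration; they are not the actual Copilot CLI or skill-validator implementation.

```python
from dataclasses import dataclass

SKILL_CHAR_BUDGET = 15_000  # budget value cited in the PR description


@dataclass
class Skill:
    # Hypothetical model of a skill's frontmatter, for illustration only.
    name: str
    description: str
    disable_model_invocation: bool = False


def render_entry(skill: Skill) -> str:
    # Assumed entry markup; the real rendered format is not shown in this PR.
    return (
        f"<skill><name>{skill.name}</name>"
        f"<description>{skill.description}</description></skill>"
    )


def split_by_budget(skills, budget=SKILL_CHAR_BUDGET):
    """Return (described, name_only, used_chars) for a list of skills.

    Skills hidden via disable_model_invocation are skipped entirely.
    Once one entry does not fit, every later skill is name-only.
    """
    visible = sorted(
        (s for s in skills if not s.disable_model_invocation),
        key=lambda s: s.name,
    )
    described, name_only, used = [], [], 0
    overflowed = False
    for skill in visible:
        cost = len(render_entry(skill))
        if not overflowed and used + cost <= budget:
            used += cost
            described.append(skill.name)
        else:
            overflowed = True
            name_only.append(skill.name)
    return described, name_only, used


if __name__ == "__main__":
    demo = [
        Skill("coverage-analysis", "a" * 100),
        Skill("reference-primitive", "b" * 1000, disable_model_invocation=True),
        Skill("run-tests", "c" * 100),
        Skill("test-tagging", "d" * 10),
    ]
    # A tiny budget makes the alphabetically late skills overflow.
    print(split_by_budget(demo, budget=200))
```

With a small budget, the alphabetically late entries fall into the name-only list even when a short one would fit on its own, which mirrors the plugin-arm activation failures attributed to menu overflow.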

Copilot

Copilot's findings

Files reviewed: 15/16 changed files
Comments generated: 0 new

Evangelink · 2026-06-25T12:15:08Z

/evaluate

github-actions · 2026-06-25T12:23:47Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
migrate-static-to-wrapper	Migrate DateTime.UtcNow to TimeProvider in a service class	4.7/5 → 5.0/5 🟢	✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.08	❌ [1]
migrate-static-to-wrapper	Migrate only in scoped files, leaving others untouched	5.0/5 → 5.0/5	✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.08	❌ [2]
migrate-static-to-wrapper	Update test doubles when migrating a service to TimeProvider	5.0/5 → 5.0/5	✅ migrate-static-to-wrapper; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.08	❌ [3]
migrate-static-to-wrapper	Handle a genuinely static class that cannot take constructor injection	2.7/5 → 2.0/5 🔴	⚠️ NOT ACTIVATED	✅ 0.08	❌ [4]
migrate-static-to-wrapper	Decline migration when wrapper does not exist yet	4.7/5 → 4.7/5	✅ migrate-static-to-wrapper; tools: skill	✅ 0.08	❌ [5]
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	2.3/5 → 4.7/5 🟢	✅ coverage-analysis; tools: skill, view, create, read_bash, stop_bash, bash / ✅ coverage-analysis; tools: skill, view, create, bash	✅ 0.08	✅
coverage-analysis	Run coverage from scratch without existing data	4.0/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, create / ✅ coverage-analysis; tools: skill, glob, create	✅ 0.08	✅
coverage-analysis	Coverage plateau diagnosis	3.0/5 → 4.7/5 🟢	✅ coverage-analysis; tools: skill, create, view, read_bash, stop_bash / ✅ coverage-analysis; tools: skill, create, view	✅ 0.08	✅
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill	✅ 0.20	❌ [6]
test-anti-patterns	Detect flakiness indicators and test coupling	5.0/5 → 4.7/5 🔴	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.20	❌ [7]
test-anti-patterns	Detect duplicated tests and magic values	4.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill	✅ 0.20	❌ [8]
test-anti-patterns	Recognize well-written tests without inventing false positives	4.0/5 → 4.3/5 🟢	✅ test-anti-patterns; tools: skill	✅ 0.20	❌ [9]
test-anti-patterns	Detect coverage-touching pattern across a service facade	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill	✅ 0.20	❌ [10]
test-anti-patterns	Detect self-referential assertions in round-trip and identity tests	5.0/5 → 4.7/5 🔴	✅ test-anti-patterns; tools: skill	✅ 0.20	❌
test-anti-patterns	Polyglot: detect anti-patterns in a Python pytest suite	4.3/5 → 5.0/5 🟢	⚠️ NOT ACTIVATED	✅ 0.20	✅ [11]
test-tagging	Tag an untagged MSTest test suite	4.0/5 → 4.7/5 🟢	✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob	🟡 0.24	✅
test-tagging	Tag an untagged xUnit test suite	3.0/5 → 4.7/5 🟢	✅ test-tagging; tools: skill	🟡 0.24	✅
test-tagging	Tag an untagged NUnit test suite	4.0/5 → 4.7/5 🟢	✅ test-tagging; tools: skill / ✅ test-tagging; tools: skill, glob	🟡 0.24	✅
test-tagging	Audit test distribution without modifying files	5.0/5 → 3.0/5 🔴	✅ test-tagging; tools: skill	🟡 0.24	❌
test-tagging	Decline request to write new tests	4.0/5 → 4.0/5	ℹ️ not activated (expected)	🟡 0.24	❌ [12]
test-tagging	Tag a partially-tagged MSTest suite without duplicating existing traits	4.0/5 → 4.0/5	✅ test-tagging; tools: skill	🟡 0.24	✅ [13]
test-tagging	Accurately classify NUnit tests with misleading method names	3.7/5 → 4.7/5 🟢	✅ test-tagging; tools: skill, glob / ⚠️ NOT ACTIVATED	🟡 0.24	✅
test-tagging	Tag MSTest tests and verify the project still builds	5.0/5 → 4.0/5 🔴	✅ test-tagging; tools: skill	🟡 0.24	❌ [14]
test-tagging	Report traits for a Go test suite without modifying source	3.3/5 → 4.3/5 🟢	✅ test-tagging; tools: skill	🟡 0.24	✅ [15]
assertion-quality	Identify low assertion diversity in equality-dominated test suite	3.0/5 → 5.0/5 🟢	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill	🟡 0.26	✅
assertion-quality	Flag assertion-free tests and trivial-only assertions	4.3/5 → 4.0/5 🔴	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill	🟡 0.26	❌ [16]
assertion-quality	Recognize well-diversified assertion usage	4.0/5 → 4.3/5 🟢	✅ assertion-quality; tools: skill, glob	🟡 0.26	✅
assertion-quality	Identify self-referential assertions in identity and round-trip tests	4.7/5 → 4.7/5	✅ assertion-quality; tools: skill, glob	🟡 0.26	❌ [17]
assertion-quality	Decline request to write new tests from scratch	4.0/5 → 4.0/5	ℹ️ not activated (expected)	🟡 0.26	❌ [18]
assertion-quality	Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite	5.0/5 → 5.0/5	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; tools: skill, bash, glob	🟡 0.26	❌ [19]
grade-tests	Grade a curated list of MSTest tests with mixed quality	3.3/5 → 4.3/5 🟢	✅ grade-tests; tools: skill, glob, bash / ✅ grade-tests; tools: skill, bash, glob	🟡 0.25	✅ [20]
grade-tests	Grade pytest test methods using the same rubric	5.0/5 → 4.0/5 🔴	✅ grade-tests; tools: skill, bash, glob / ✅ grade-tests; tools: skill, glob, bash	🟡 0.25	❌ [21]
grade-tests	Decline to grade the entire workspace when no test list is provided	1.0/5 → 3.7/5 🟢	✅ grade-tests; tools: skill, glob / ✅ grade-tests; test-analysis-extensions; tools: skill	🟡 0.25	❌ [22]
grade-tests	Grade Go table-driven tests without misreading the loop as branching	4.0/5 → 4.3/5 🟢	✅ grade-tests; tools: skill, glob / ✅ grade-tests; tools: skill, glob, bash	🟡 0.25	❌ [23]
grade-tests	Grade tests when the production code under test is unavailable	4.0/5 → 4.0/5	✅ grade-tests; tools: skill, bash, glob	🟡 0.25	❌ [24]

[1] (Isolated) Quality improved but weighted score is -16.9% due to: judgment, tokens (63523 → 90348), quality, time (41.9s → 50.7s)
[2] ⚠️ High run-to-run variance (CV=67%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -3.4% due to: tokens (68462 → 95982), time (42.5s → 55.6s), tool calls (8 → 10)
[3] ⚠️ High run-to-run variance (CV=56%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=91%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=419%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=74%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=118%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.9% due to: tokens (40534 → 55003)
[8] ⚠️ High run-to-run variance (CV=638%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -12.8% due to: judgment, tokens (40346 → 64825), tool calls (3 → 4)
[9] ⚠️ High run-to-run variance (CV=75%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -4.5% due to: tokens (26404 → 63362), time (22.4s → 35.2s), tool calls (2 → 3)
[10] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (28438 → 66697), quality, tool calls (3 → 4), time (35.1s → 44.6s)
[11] ⚠️ High run-to-run variance (CV=67%) — consider re-running with --runs 5
[12] (Plugin) Quality unchanged but weighted score is -7.6% due to: tokens (40876 → 166232), tool calls (3 → 10), time (31.7s → 84.3s)
[13] ⚠️ High run-to-run variance (CV=346%) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=157%) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=463%) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=102%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -8.2% due to: tokens (26539 → 61526), time (21.8s → 39.0s), tool calls (2 → 3)
[17] ⚠️ High run-to-run variance (CV=246%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.3% due to: tokens (28090 → 93589), tool calls (2 → 4), time (39.3s → 53.6s)
[18] ⚠️ High run-to-run variance (CV=163%) — consider re-running with --runs 5
[19] (Plugin) Quality unchanged but weighted score is -7.7% due to: tokens (27080 → 88360), tool calls (3 → 6), time (18.9s → 44.0s)
[20] ⚠️ High run-to-run variance (CV=55%) — consider re-running with --runs 5
[21] (Plugin) Quality unchanged but weighted score is -7.4% due to: tokens (26374 → 141873), tool calls (2 → 7), time (17.3s → 54.4s)
[22] ⚠️ High run-to-run variance (CV=316%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -8.0% due to: tokens (52652 → 116941), time (30.1s → 53.0s), tool calls (7 → 10)
[23] ⚠️ High run-to-run variance (CV=474%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -11.0% due to: judgment, tokens (26851 → 64514), quality, tool calls (3 → 6), time (22.9s → 60.0s)
[24] ⚠️ High run-to-run variance (CV=55%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -24.4% due to: judgment, tokens (26572 → 71217), quality, tool calls (2 → 5), time (21.4s → 40.0s)
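
The CV figures in the footnotes can be read with the following minimal Python sketch. It assumes the standard definition of the coefficient of variation (sample standard deviation divided by the absolute mean) applied to per-run scores. The function name and the sample values are hypothetical, and the validator's exact formula and flagging threshold are not shown in this thread.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by |mean|, as a fraction.

    Assumed definition; the validator may compute CV differently.
    """
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(values)
    if mean == 0:
        return math.inf  # the ratio is undefined when the mean is zero
    return statistics.stdev(values) / abs(mean)


if __name__ == "__main__":
    # Hypothetical per-run scores whose mean sits close to zero.
    runs = [0.30, -0.25, 0.05]
    print(f"CV = {coefficient_of_variation(runs):.0%}")
```

Because the mean is in the denominator, scores that average close to zero can produce CV values far above 100%, so a very large CV mostly signals that the runs disagree about the direction of the effect.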

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 803 in dotnet/skills, download eval artifacts with gh run download 28169350954 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/9ecec6aa4e5c184d6805b23559bc7ccb9705b51f/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

Copilot AI review requested due to automatic review settings June 22, 2026 15:37

Evangelink requested review from JanKrivanek and ViktorHofer as code owners June 22, 2026 15:37

Copilot started reviewing on behalf of Evangelink June 22, 2026 15:37

Copilot AI reviewed Jun 22, 2026

Comment thread eng/skill-validator/src/Check/CheckCommand.cs Outdated

Comment thread eng/skill-validator/src/Check/SkillProfiler.cs Outdated

github-actions Bot added the waiting-on-author label Jun 22, 2026

github-actions Bot added the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) and removed the waiting-on-author label Jun 22, 2026

github-actions Bot added a commit that referenced this pull request Jun 22, 2026

Update PR token usage data (PR #803)

4a6cb0f

AbhitejJohn approved these changes Jun 22, 2026

github-actions Bot added the ready-to-merge label and removed the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) Jun 22, 2026

Merge branch 'main' into restore-15k-skill-budget

bbc43d3

Copilot AI review requested due to automatic review settings June 23, 2026 06:59

Copilot started reviewing on behalf of Evangelink June 23, 2026 07:00

Copilot AI reviewed Jun 23, 2026

Comment thread eng/skill-validator/src/Check/CheckCommand.cs Outdated

github-actions Bot added the waiting-on-author label and removed the ready-to-merge label Jun 23, 2026

Evangelink enabled auto-merge (squash) June 23, 2026 10:11

YuliiaKovalova approved these changes Jun 23, 2026

github-actions Bot added a commit that referenced this pull request Jun 23, 2026

Update PR token usage data (PR #803)

54197bc

Copilot started reviewing on behalf of Evangelink June 24, 2026 06:51

Copilot AI reviewed Jun 24, 2026

Comment thread eng/skill-validator/src/Check/SkillProfiler.cs Outdated

Comment thread eng/skill-validator/tests/Check/CheckCommandTests.cs

github-actions Bot added the waiting-on-author label and removed the waiting-on-review label Jun 24, 2026

YuliiaKovalova approved these changes Jun 24, 2026

github-actions Bot added the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) and removed the waiting-on-author label Jun 24, 2026

github-actions Bot added a commit that referenced this pull request Jun 24, 2026

Update PR token usage data (PR #803)

36dd45e

github-actions Bot added the ready-to-merge label and removed the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) Jun 24, 2026

Merge branch 'main' into restore-15k-skill-budget

ac01b19

Copilot AI review requested due to automatic review settings June 25, 2026 10:12

Copilot started reviewing on behalf of Evangelink June 25, 2026 10:13

Copilot AI reviewed Jun 25, 2026

Comment thread eng/skill-validator/src/Check/SkillProfiler.cs Outdated

Comment thread eng/skill-validator/src/Check/CheckCommand.cs Outdated

github-actions Bot added the waiting-on-author label and removed the ready-to-merge label Jun 25, 2026

Evangelink and others added 3 commits June 25, 2026 14:01

Merge branch 'main' into restore-15k-skill-budget

862b859

Copilot AI review requested due to automatic review settings June 25, 2026 12:06

Copilot started reviewing on behalf of Evangelink June 25, 2026 12:07

Copilot AI reviewed Jun 25, 2026

github-actions Bot added a commit that referenced this pull request Jun 25, 2026

Update PR token usage data (PR #803)

f90be5c

YuliiaKovalova approved these changes Jun 25, 2026

Evangelink merged commit 592a530 into dotnet:main Jun 25, 2026
41 checks passed
