
chore: skill eval fix batch 1#4470

Closed
miyoungc wants to merge 1 commit into main from skills-eval-val-1

Conversation

@miyoungc

@miyoungc miyoungc commented May 28, 2026

Collaborator

Summary

Related Issue

Changes

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Your Name your-email@example.com

Summary by CodeRabbit

  • Tests
    • Enhanced evaluation expectations for agent skills and inference configuration modules with detailed behavior validation criteria
    • Strengthened testing framework to ensure consistent agent performance across multiple operational scenarios


@miyoungc

Collaborator Author

/nvskills-ci

@coderabbitai

coderabbitai Bot commented May 28, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28a10a04-44b0-4f23-b82c-ea499c55d699

📥 Commits

Reviewing files that changed from the base of the PR and between 442f64b and 07601e9.

📒 Files selected for processing (4)
  • .agents/skills/nemoclaw-user-agent-skills/evals/evals.json
  • .agents/skills/nemoclaw-user-configure-inference/evals/evals.json
  • skills/nemoclaw-user-agent-skills/evals/evals.json
  • skills/nemoclaw-user-configure-inference/evals/evals.json

📝 Walkthrough


Four evaluation JSON files are updated to add expected_behavior metadata arrays specifying required skill usage, reference documentation, and task answering criteria. Agent skills and inference configuration evals are strengthened across both .agents/skills/ and skills/ directory structures.
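As a rough illustration of the change described above, the sketch below shows what one eval case with an expected_behavior array might look like, along with a small check that flags cases where the array is missing or malformed. The `question`, `ground_truth`, and `expected_behavior` field names and the case ID come from this thread; the `id` key, the wording of each array entry, and the `missing_expected_behavior` helper are assumptions made for illustration and are not taken from the actual evals.json files.

```python
import json

# Hypothetical eval case. The "id" key and the exact wording of each
# expected_behavior entry are assumptions, not copied from the repository.
SAMPLE_CASE = json.loads("""
{
  "id": "docs-resources-agent-skills-001",
  "question": "<question text>",
  "ground_truth": "<ground truth text>",
  "expected_behavior": [
    "Uses the nemoclaw-user-agent-skills skill",
    "References references/agent-skills.md",
    "Answers the task directly"
  ]
}
""")


def missing_expected_behavior(cases):
    """Return identifiers of cases whose expected_behavior is absent,
    empty, or not a list of non-empty strings."""
    bad = []
    for index, case in enumerate(cases):
        behaviors = case.get("expected_behavior")
        ok = (
            isinstance(behaviors, list)
            and len(behaviors) > 0
            and all(isinstance(b, str) and b.strip() for b in behaviors)
        )
        if not ok:
            # Fall back to the position when the case has no "id" key.
            bad.append(case.get("id", f"<case {index}>"))
    return bad
```

A check of this kind could run as part of static validation, since it only reads the JSON and does not need a sandbox.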

Changes

Evaluation Specification Updates

| Layer / File(s) | Summary |
| --- | --- |
| Agent Skills Evaluation Specifications: .agents/skills/nemoclaw-user-agent-skills/evals/evals.json, skills/nemoclaw-user-agent-skills/evals/evals.json | Three agent skill test cases (docs-resources-agent-skills-001, -002, -003) are enhanced with expected_behavior arrays requiring use of the nemoclaw-user-agent-skills skill, reference to references/agent-skills.md, and direct task answering. Ground truth strings are also refined to clarify delegation guidance, scoping rationale, and trust/control framing. |
| Configure Inference Evaluation Specifications: .agents/skills/nemoclaw-user-configure-inference/evals/evals.json, skills/nemoclaw-user-configure-inference/evals/evals.json | Multiple inference configuration test cases (covering options selection, provider credentials, local routing, provider switching, sub-agent setup, and tool-calling reliability) are updated with expected_behavior arrays requiring use of the nemoclaw-user-configure-inference skill, reference to scenario-specific markdown under references/, and direct task answering. Existing question and ground_truth fields remain unchanged. |
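Because the same evals.json files are touched under both .agents/skills/ and skills/, a quick consistency check between the two trees may be useful. The sketch below assumes the two copies are meant to hold identical JSON, which this thread does not state explicitly; the `out_of_sync` helper is hypothetical and is not an existing repository script.

```python
import json
from pathlib import Path

# Skill directory names taken from the files listed in this review.
SKILLS = [
    "nemoclaw-user-agent-skills",
    "nemoclaw-user-configure-inference",
]


def out_of_sync(repo_root, skills=SKILLS):
    """Return the skills whose evals.json differs between the
    .agents/skills/ tree and the skills/ tree.

    Assumption: both copies are expected to contain the same JSON.
    Parsed content is compared, so whitespace differences are ignored.
    """
    root = Path(repo_root)
    drifted = []
    for skill in skills:
        rel = Path(skill) / "evals" / "evals.json"
        agents_copy = root / ".agents" / "skills" / rel
        skills_copy = root / "skills" / rel
        left = json.loads(agents_copy.read_text(encoding="utf-8"))
        right = json.loads(skills_copy.read_text(encoding="utf-8"))
        if left != right:
            drifted.append(skill)
    return drifted
```

Called from the repository root as `out_of_sync(".")`, it returns an empty list when both trees agree.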

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4463: Directly conflicts with this PR. It removes the expected_behavior field from the configure-inference eval schema, while this PR adds it.

Suggested reviewers

  • jyaunches

Poem

🐰 JSON evals bloom with clarity,
Expected behaviors crystalline and bright,
References guide the way with clarity,
Agent skills and inference aligned just right,
A structured path to truth takes flight! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title 'chore: skill eval fix batch 1' is vague and generic, using non-descriptive terms that don't convey what skill evaluations are being fixed or the specific nature of the improvements. | Consider a more descriptive title that specifies which skills are being updated and what aspect of evaluations is being improved, e.g., 'chore: add expected_behavior to agent-skills and configure-inference eval cases'. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions

Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • None. No NemoClaw E2E is recommended. The changes are limited to skill evaluation JSON metadata used to assess expected skill/reference usage; they do not affect runtime/user flows or security-sensitive paths. Prefer the existing NVSkills/static evaluation validation path rather than sandbox E2E.

Optional E2E

  • None.

New E2E recommendations

  • None.

@github-actions

Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. No scenario workflow, scenario metadata, scenario runtime, or validation-suite files changed.

Optional scenario E2E

  • None.

Relevant changed files

  • None.

@coderabbitai

coderabbitai Bot commented May 28, 2026

Contributor

Actionable comments posted: 0

@github-actions

Contributor

PR Review Advisor

Findings: 0 needs attention, 0 worth checking, 0 nice ideas
Top item: No blocking code-review findings

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@miyoungc miyoungc marked this pull request as draft May 28, 2026 23:15
@copy-pr-bot

copy-pr-bot Bot commented May 28, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wscurran wscurran added the fix, bug-fix (PR fixes a bug or regression), feature (PR adds or expands user-visible functionality), and area: skills (Skills, agent behaviors, prompts, or skill packaging) labels and removed the fix label May 29, 2026
@miyoungc miyoungc closed this Jun 4, 2026

Labels

  • area: skills (Skills, agent behaviors, prompts, or skill packaging)
  • bug-fix (PR fixes a bug or regression)
  • feature (PR adds or expands user-visible functionality)

Projects

None yet


2 participants