Refine dynamic mode skill eval #6389

Open
rostan-t wants to merge 2 commits into main from rtabet/ndd-skill-eval

@rostan-t rostan-t commented Jun 8, 2026

Collaborator

Category:

Other (e.g. Documentation, Tests, Configuration)

Description:

The current prompts in the evaluation for the dynamic mode skill are sometimes not specific enough, leading to issues that include, but are not limited to, the following:

  • The agent may write its output to a file while the evaluator expects it directly in the response
  • The agent fails to consume input files when the absolute path is not provided
  • The agent can fail to account for variable sizes in the input data
  • The evaluator can be too restrictive in what it accepts, leading to rejection of acceptable output
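
To make the variable-size issue concrete, here is a minimal sketch in plain Python of what padding a ragged batch to a common length involves. It is not DALI code: the helper name `pad_to_common_length` and the sample data are hypothetical, and it only illustrates why an agent that assumes equally sized samples goes wrong.

```python
from typing import List, Sequence


def pad_to_common_length(
    samples: Sequence[Sequence[float]], fill: float = 0.0
) -> List[List[float]]:
    """Pad every 1-D sample with `fill` so all samples share the longest length."""
    if not samples:
        return []
    target = max(len(s) for s in samples)
    return [list(s) + [fill] * (target - len(s)) for s in samples]


# Hypothetical detection output: one score per detected box, so each image
# contributes a different number of values (2, 0 and 3 here).
scores_per_image = [[0.9, 0.4], [], [0.8, 0.7, 0.1]]
padded = pad_to_common_length(scores_per_image)
print(padded)
```

Without the padding step, the three samples cannot be stacked into one rectangular array, which is the situation the updated prompt is meant to flag.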

Additional information:

Affected modules and functionalities:

Dynamic mode skill

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR tightens the evaluation prompts and assertions for the dali-dynamic-mode skill to address several observed agent failure modes. It also promotes the new run's benchmark results into both BENCHMARK.md and skill-card.md, and adds a scripts/requirements.txt for the eval harness.

  • Prompt changes (evals 1–6): Guards added against writing output to a file, absolute input path now supplied for the file-conversion task, and "variable sizes" note added to the object-detection task so agents remember to use .torch(pad=True).
  • Assertion changes: correct-slice-usage assertion relaxed (removed the superfluous "(not batch.tensors)" exclusion), correct-sample-inspection relaxed to focus on the batch.tensors[i] access pattern rather than requiring a specific .cpu() call chain, and correct-evalmode-syntax updated to accept both sync_cpu and sync_full.
  • SKILL.md guidance: New bullet added for .torch(pad=True) with variable-shape batches; code example in the Execution Model section updated from sync_full to sync_cpu; troubleshooting note updated to mention both modes.
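
As an illustration of the relaxed correct-evalmode-syntax rule, the sketch below shows a mechanical check that accepts either mode name. The assertions in evals.json are text rules, so this is only an assumed equivalent: the function `mentions_sync_mode` is hypothetical, and it matches the bare names `sync_cpu` and `sync_full` rather than any particular API spelling.

```python
import re

# Only the two mode names come from the PR; everything else is illustrative.
_SYNC_MODE = re.compile(r"\bsync_(cpu|full)\b")


def mentions_sync_mode(agent_output: str) -> bool:
    """Return True if the output names either accepted synchronous mode."""
    return _SYNC_MODE.search(agent_output) is not None


print(mentions_sync_mode("mode = sync_cpu"))   # accepted
print(mentions_sync_mode("mode = sync_full"))  # accepted
print(mentions_sync_mode("mode = deferred"))   # rejected
```

Matching on word boundaries keeps the check from accepting look-alike identifiers that merely contain one of the names.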

Confidence Score: 5/5

All changes are prompt/assertion text and documentation updates; no runtime code paths are touched.

The diff is confined to eval prompts, assertion strings, skill documentation, and benchmark reporting. Every change is a deliberate tightening or relaxation of an evaluator rule, each justified in the PR description. The new requirements.txt omits version pins, but that affects only eval reproducibility, not the correctness of the skill itself.

scripts/requirements.txt — missing version pins may cause silent dependency drift in the eval harness.
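
One way to catch this kind of drift early is a small check that lists requirement lines lacking an exact `==` pin. The sketch below is an assumption-laden example, not part of the PR: the helper `unpinned_requirements` is hypothetical, and the sample text assumes the file contains just the two package names mentioned in this review.

```python
import re
from typing import List

# A line counts as pinned only if it uses an exact `==` specifier.
_PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+\s*==\s*[^\s;]+")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that do not pin an exact version."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Assumed file contents, based only on the package names named in the review.
sample = "nvidia-dali-cuda130\ntorch\n"
print(unpinned_requirements(sample))
```

Range specifiers such as `>=` are reported as unpinned on purpose, since they still allow the resolved version to change between eval runs.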

Important Files Changed

| Filename | Overview |
| --- | --- |
| skills/dali-dynamic-mode/evals/evals.json | Prompt and assertion improvements across all six evals: adds absolute path for file input, suppresses file-save behaviour, relaxes over-strict assertions, adds correct-import check to evals 2–4. |
| skills/dali-dynamic-mode/SKILL.md | Adds .torch(pad=True) guidance for variable-shape batches; updates code example and troubleshooting note to reference sync_cpu alongside sync_full. |
| skills/dali-dynamic-mode/scripts/requirements.txt | New file: lists the runtime deps nvidia-dali-cuda130 and torch without version constraints, which may affect reproducibility. |
| skills/dali-dynamic-mode/BENCHMARK.md | Updated with full Tier 3 agent evaluation results (claude-code and codex) dated 2026-06-08; replaces all "not available" placeholders with actual metrics. |
| skills/dali-dynamic-mode/skill-card.md | Adds evaluation agent list, task count, underlying signal descriptions, and results table; adds references to SKILL.md and BENCHMARK.md; minor wording and license-format updates. |
| skills/dali-dynamic-mode/skill.oms.sig | Signature refreshed to cover the newly added scripts/requirements.txt and all updated file digests. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Eval harness reads evals.json] --> B{Has input files?}
    B -- Yes --> C[Stage files to /workspace/input/]
    B -- No --> D[Run agent with prompt only]
    C --> E[Run agent with absolute path in prompt]
    D --> F{Agent output}
    E --> F
    F --> G[Assert: correct-import]
    F --> H[Assert: correct API patterns]
    F --> I[Assert: .torch / pad=True for variable shapes]
    F --> J[Assert: no pipeline-mode constructs]
    G & H & I & J --> K{All assertions pass?}
    K -- Yes --> L[PASS]
    K -- No --> M[FAIL]
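
The flow in the chart can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual harness: `EvalCase`, `build_prompt`, and `evaluate` are hypothetical names, the real evals.json schema is not shown in this PR summary, and file staging itself is omitted so that only the prompt construction and the all-assertions-must-pass rule remain.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Callable, Dict, List

INPUT_DIR = PurePosixPath("/workspace/input")  # staging directory from the chart


@dataclass
class EvalCase:  # hypothetical shape, not the real evals.json schema
    prompt: str
    input_files: List[str] = field(default_factory=list)
    assertions: Dict[str, Callable[[str], bool]] = field(default_factory=dict)


def build_prompt(case: EvalCase) -> str:
    """Append absolute staged paths so the agent does not have to guess them."""
    if not case.input_files:
        return case.prompt
    paths = ", ".join(str(INPUT_DIR / name) for name in case.input_files)
    return f"{case.prompt}\nInput files: {paths}"


def evaluate(case: EvalCase, run_agent: Callable[[str], str]) -> bool:
    """PASS (True) only if every assertion holds on the agent's output."""
    output = run_agent(build_prompt(case))
    return all(check(output) for check in case.assertions.values())


# Usage with a stand-in agent; the assertion body is a placeholder check.
case = EvalCase(
    prompt="Convert the file to a tensor.",
    input_files=["data.npy"],
    assertions={"correct-import": lambda out: "import" in out},
)
result = evaluate(case, lambda prompt: "import something  # saw " + prompt)
print(result)
```

Passing the agent in as a callable keeps the pass/fail logic testable without launching a real agent.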

Reviews (3): Last reviewed commit: "Attach NVSkills validation signatures"

Comment thread skills/dali-dynamic-mode/SKILL.md Outdated
Comment thread skills/dali-dynamic-mode/evals/evals.json
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
rostan-t force-pushed the rtabet/ndd-skill-eval branch from ac41a26 to 5107f33 on June 8, 2026 15:40
JanuszL commented Jun 8, 2026

Contributor

/nvskills-ci

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
5 participants