Refine dynamic mode skill eval by rostan-t · Pull Request #6389 · NVIDIA/DALI

rostan-t · 2026-06-08T15:31:13Z

Category:

Other (e.g. Documentation, Tests, Configuration)

Description:

The current prompts in the evaluation for the dynamic mode skill are sometimes not specific enough, leading to issues that include, but are not limited to, the following:

The agent may write its output to a file while the evaluator expects it directly in the agent's response.
The agent fails to consume input files when the absolute path is not provided.
The agent may fail to account for variable sizes in the input data.
The evaluator can be too restrictive in what it accepts, leading it to reject acceptable output.
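
The first two issues can be illustrated with a small sketch of how an eval prompt might be composed. This is not the harness code from the PR: `build_prompt` and its parameters are hypothetical, and the `/workspace/input` staging directory is taken from the flowchart in the Greptile summary.

```python
from pathlib import PurePosixPath


def build_prompt(task: str, input_file: str,
                 staging_dir: str = "/workspace/input") -> str:
    """Compose a prompt that names the input by absolute path and asks for inline output.

    Hypothetical helper for illustration only; the real prompts live in evals.json.
    """
    # Only the file name is kept, because the file is assumed to be staged
    # directly inside staging_dir before the agent runs.
    abs_path = PurePosixPath(staging_dir) / PurePosixPath(input_file).name
    return (
        f"{task}\n"
        f"The input file is located at {abs_path}.\n"
        "Print the resulting code directly in your answer; do not write it to a file."
    )


prompt = build_prompt("Convert this pipeline to dynamic mode.", "data/pipeline.py")
print(prompt)
```

Spelling out both the absolute path and the expected output channel removes two sources of ambiguity that the description attributes to the earlier prompts.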

Additional information:

Affected modules and functionalities:

Dynamic mode skill

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

greptile-apps · 2026-06-08T15:34:44Z

Greptile Summary

This PR tightens the evaluation prompts and assertions for the dali-dynamic-mode skill to address several observed agent failure modes. It also promotes the new run's benchmark results into both BENCHMARK.md and skill-card.md, and adds a scripts/requirements.txt for the eval harness.

Prompt changes (evals 1–6): Guards added against writing output to a file, absolute input path now supplied for the file-conversion task, and "variable sizes" note added to the object-detection task so agents remember to use .torch(pad=True).
Assertion changes: correct-slice-usage assertion relaxed (removed the superfluous "(not batch.tensors)" exclusion), correct-sample-inspection relaxed to focus on the batch.tensors[i] access pattern rather than requiring a specific .cpu() call chain, and correct-evalmode-syntax updated to accept both sync_cpu and sync_full.
SKILL.md guidance: New bullet added for .torch(pad=True) with variable-shape batches; code example in the Execution Model section updated from sync_full to sync_cpu; troubleshooting note updated to mention both modes.
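
To make the "variable sizes" point concrete, the sketch below shows what padding a variable-shape batch means in general: every sample is extended to the size of the largest one so the batch becomes rectangular. It is written in plain Python and does not reproduce DALI's `.torch(pad=True)` behaviour; the `pad_boxes` helper, the 4-value box layout, and the zero fill value are all assumptions made for illustration.

```python
from typing import List, Sequence

Box = Sequence[float]


def pad_boxes(batch: Sequence[Sequence[Box]], fill: float = 0.0) -> List[List[List[float]]]:
    """Pad each sample's list of boxes to the longest sample in the batch."""
    longest = max((len(sample) for sample in batch), default=0)
    padded = []
    for sample in batch:
        rows = [list(box) for box in sample]
        # Append filler boxes until this sample matches the longest one.
        rows += [[fill] * 4 for _ in range(longest - len(rows))]
        padded.append(rows)
    return padded


batch = [
    [[0.1, 0.1, 0.5, 0.5]],                        # image with one box
    [[0.2, 0.2, 0.4, 0.4], [0.6, 0.6, 0.9, 0.9]],  # image with two boxes
]
dense = pad_boxes(batch)
print([len(sample) for sample in dense])  # [2, 2]
```

Without a step like this, a batch whose samples differ in shape cannot be stacked into a single dense tensor, which is the failure mode the updated object-detection prompt is meant to surface.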

Confidence Score: 5/5

All changes are prompt/assertion text and documentation updates; no runtime code paths are touched.

The diff is confined to eval prompts, assertion strings, skill documentation, and benchmark reporting. Every change is a deliberate tightening or relaxation of an evaluator rule, each justified in the PR description. The new requirements.txt omits version pins, but that affects only eval reproducibility, not the correctness of the skill itself.

scripts/requirements.txt — missing version pins may cause silent dependency drift in the eval harness.
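
A check for this kind of drift could look like the following minimal sketch. It assumes a simple requirements file with one package per line; the `unpinned` helper is hypothetical and is not part of the PR. The two package names are the ones the summary reports for scripts/requirements.txt.

```python
import re
from typing import List

# Any of these operators counts as a version constraint for this sketch.
_SPECIFIER = re.compile(r"===|==|~=|!=|>=|<=|<|>")


def unpinned(requirements_text: str) -> List[str]:
    """Return the requirement lines that carry no version specifier."""
    missing = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not _SPECIFIER.search(line):
            missing.append(line)
    return missing


print(unpinned("nvidia-dali-cuda130\ntorch\n"))
# ['nvidia-dali-cuda130', 'torch']
```

Running such a check in CI would turn the reproducibility concern into an explicit failure rather than a silent change in benchmark conditions.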

Important Files Changed

Filename	Overview
skills/dali-dynamic-mode/evals/evals.json	Prompt and assertion improvements across all six evals: adds absolute path for file input, suppresses file-save behaviour, relaxes over-strict assertions, adds correct-import check to evals 2–4.
skills/dali-dynamic-mode/SKILL.md	Adds .torch(pad=True) guidance for variable-shape batches; updates code example and troubleshooting note to reference sync_cpu alongside sync_full.
skills/dali-dynamic-mode/scripts/requirements.txt	New file: lists the runtime deps nvidia-dali-cuda130 and torch without version constraints, which may affect reproducibility.
skills/dali-dynamic-mode/BENCHMARK.md	Updated with full Tier 3 agent evaluation results (claude-code and codex) dated 2026-06-08; replaces all "not available" placeholders with actual metrics.
skills/dali-dynamic-mode/skill-card.md	Adds evaluation agent list, task count, underlying signal descriptions, and results table; adds references to SKILL.md and BENCHMARK.md; minor wording and license-format updates.
skills/dali-dynamic-mode/skill.oms.sig	Signature refreshed to cover the newly added scripts/requirements.txt and all updated file digests.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Eval harness reads evals.json] --> B{Has input files?}
    B -- Yes --> C[Stage files to /workspace/input/]
    B -- No --> D[Run agent with prompt only]
    C --> E[Run agent with absolute path in prompt]
    D --> F{Agent output}
    E --> F
    F --> G[Assert: correct-import]
    F --> H[Assert: correct API patterns]
    F --> I[Assert: .torch / pad=True for variable shapes]
    F --> J[Assert: no pipeline-mode constructs]
    G & H & I & J --> K{All assertions pass?}
    K -- Yes --> L[PASS]
    K -- No --> M[FAIL]
```
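
The final stage of the flowchart (several assertions feeding one PASS/FAIL verdict) can be sketched as below. This is a rough approximation under stated assumptions: the real assertions are text entries in evals.json and are presumably judged by an evaluator rather than by regular expressions. The `ASSERTIONS` table, the `evaluate` and `verdict` helpers, the import string, and the patterns used to spot pipeline-mode constructs are all illustrative guesses.

```python
import re
from typing import Callable, Dict

# Hypothetical string-level stand-ins for the assertions named in the flowchart.
ASSERTIONS: Dict[str, Callable[[str], bool]] = {
    # Assumed module path for dynamic mode.
    "correct-import": lambda out: "nvidia.dali.experimental.dynamic" in out,
    # Relaxed rule: either sync mode is acceptable.
    "correct-evalmode-syntax": lambda out: re.search(r"\bsync_(cpu|full)\b", out) is not None,
    # Heuristic markers of pipeline-mode code.
    "no-pipeline-mode": lambda out: re.search(r"@pipeline_def|\.build\(\)", out) is None,
}


def evaluate(agent_output: str) -> Dict[str, bool]:
    """Run every assertion against the agent's textual output."""
    return {name: check(agent_output) for name, check in ASSERTIONS.items()}


def verdict(results: Dict[str, bool]) -> str:
    """PASS only if all assertions hold, mirroring the final decision node."""
    return "PASS" if all(results.values()) else "FAIL"


sample = "import nvidia.dali.experimental.dynamic as ndd\n# ... uses sync_cpu ..."
print(verdict(evaluate(sample)))  # PASS
```

Accepting either `sync_cpu` or `sync_full` in one pattern reflects the relaxation described for correct-evalmode-syntax: the check no longer rejects output solely for choosing the other mode.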

Reviews (3): Last reviewed commit: "Attach NVSkills validation signatures"

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

JanuszL · 2026-06-08T15:42:08Z

/nvskills-ci

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

rostan-t added the Dynamic Mode label Jun 8, 2026

greptile-apps Bot reviewed Jun 8, 2026

Comment thread skills/dali-dynamic-mode/SKILL.md Outdated

Comment thread skills/dali-dynamic-mode/evals/evals.json

Refine dynamic mode skill eval

5107f33

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

rostan-t force-pushed the rtabet/ndd-skill-eval branch from ac41a26 to 5107f33 on June 8, 2026 15:40

Attach NVSkills validation signatures

bc6682a

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

dali-automaton assigned banasraf and mzient Jun 9, 2026

mzient approved these changes Jun 12, 2026

rostan-t wants to merge 2 commits into main from rtabet/ndd-skill-eval


5 participants
