NVIDIA · rostan-t · Jun 8, 2026 · Jun 8, 2026
diff --git a/skills/dali-dynamic-mode/BENCHMARK.md b/skills/dali-dynamic-mode/BENCHMARK.md
@@ -7,14 +7,18 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
 ## Evaluation Summary
 
 - Skill: `dali-dynamic-mode`
-- Evaluation date: 2026-05-28
+- Evaluation date: 2026-06-08
 - NVSkills-Eval profile: `external`
+- Environment: `astra-sandbox`
+- Dataset: 24 evaluation tasks
+- Attempts per task: 2
+- Pass threshold: 50%
 - Overall verdict: PASS
-- Tier 3 live agent evaluation: not available in this report
 
 ## Agents Used
 
-- Tier 3 agent details were not available in this report.
+- `claude-code`
+- `codex`
 
 ## Metrics Used
 
@@ -28,26 +32,40 @@ Reported benchmark dimensions:
 
 Underlying evaluation signals used in this run:
 
-- No Tier 3 evaluation signal details were available in this report.
+- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
+- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
+- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
+- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.
 
 ## Test Tasks
 
-Tier 3 evaluation task details were not available in this report.
+The benchmark included 24 recorded Tier 3 trials, but the source evaluation dataset was not available in this report payload.
 
 ## Results
 
-Tier 3 dimension rollup was not available in this report.
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 8 | 100% (+0%) | 100% (+0%) |
+| Correctness | 8 | 98% (+61%) | 86% (+31%) |
+| Discoverability | 8 | 97% (+84%) | 81% (+47%) |
+| Effectiveness | 8 | 77% (+45%) | 66% (+29%) |
+| Efficiency | 8 | 88% (+59%) | 76% (+41%) |
+
+Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
 
 ## Tier 1: Static Validation Summary
 
 Tier 1 validation passed. NVSkills-Eval ran 9 checks and found 0 total findings.
 
 Notable observations:
 
-- SECURITY: No security vulnerabilities detected (secrets, API keys, credentials)
+- SECURITY: no findings reported.
 - SCHEMA: Found skill manifest: SKILL.md
 - VERSION: No semantic version label present; resource will use commit-hash history (opting back out of an existing label is allowed)
-- PII: Scanning 1 files for PII
+- PII: Scanning 2 files for PII
 - LICENSE: no findings reported.
 
 ## Tier 2: Deduplication Summary

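Reviewer note on the Results table added to BENCHMARK.md: the table gives each score with its uplift in parentheses but never states the no-skill baseline. The sketch below derives it, assuming the uplift is expressed in percentage points (the report does not say whether it is absolute or relative). The `RESULTS` mapping and the `implied_baseline` helper are illustrative names, not part of NVSkills-Eval.

```python
import re

# Cells copied from the Results table, as (claude-code, codex) pairs.
RESULTS = {
    "Security": ("100% (+0%)", "100% (+0%)"),
    "Correctness": ("98% (+61%)", "86% (+31%)"),
    "Discoverability": ("97% (+84%)", "81% (+47%)"),
    "Effectiveness": ("77% (+45%)", "66% (+29%)"),
    "Efficiency": ("88% (+59%)", "76% (+41%)"),
}

# Matches a cell such as "98% (+61%)".
CELL = re.compile(r"(\d+)% \(([+-]\d+)%\)")


def implied_baseline(cell: str) -> int:
    """Return score minus uplift.

    ASSUMPTION: the uplift is in percentage points, so baseline = score - uplift.
    """
    match = CELL.fullmatch(cell)
    if match is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    score, uplift = int(match.group(1)), int(match.group(2))
    return score - uplift


for dimension, (claude_code, codex) in RESULTS.items():
    print(
        f"{dimension}: claude-code baseline {implied_baseline(claude_code)}%, "
        f"codex baseline {implied_baseline(codex)}%"
    )
```

If the uplift is in fact relative, these baselines do not apply. Stating the unit in the sentence under the table would remove the ambiguity.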
diff --git a/skills/dali-dynamic-mode/SKILL.md b/skills/dali-dynamic-mode/SKILL.md
@@ -29,6 +29,7 @@ Guide AI agents in writing, reviewing, and migrating code that uses DALI's imper
 - Treat readers as stateful: create them once, reuse them across epochs, and pass `batch_size` to `next_epoch(...)`.
 - Pass explicit `batch_size` to random ops; there is no pipeline-level batch size to inherit.
 - Use dynamic-mode API conventions: `device="gpu"` instead of pipeline-mode `"mixed"`, `Batch.tensors[...]` for sample selection, and `Batch.slice[...]` for per-sample slicing.
+- Use `.torch()` to convert a tensor or batch to a PyTorch tensor. Pass `pad=True` when the samples in a batch have different shapes.
 
 ## Prerequisites
 
@@ -149,7 +150,7 @@ Default mode is `eager` -- async execution in a background thread, returns immed
 For debugging, switch to synchronous mode so errors surface at the exact call site rather than later in the async queue:
 
 ```python
-with ndd.EvalMode.sync_full:
+with ndd.EvalMode.sync_cpu:
     images = ndd.decoders.image(jpegs, device="gpu")
     images = ndd.resize(images, size=[224, 224])
     # Any error surfaces here, at the exact op that failed
@@ -288,5 +289,5 @@ Dynamic mode is more flexible than pipeline mode, but can have slightly worse pe
 
 ## Troubleshooting
 
-- If errors surface later than the failing call, rerun the block under `with ndd.EvalMode.sync_full:`.
+- If errors surface later than the failing call, rerun the block under `with ndd.EvalMode.sync_cpu:` or `with ndd.EvalMode.sync_full:`.
 - If a reader behaves unexpectedly across epochs, check that it is created once and each `next_epoch()` iterator is fully consumed.
diff --git a/skills/dali-dynamic-mode/evals/evals.json b/skills/dali-dynamic-mode/evals/evals.json
@@ -3,7 +3,7 @@
   "evals": [
     {
       "id": 1,
-      "prompt": "Write a complete Python script that uses DALI dynamic mode to load and preprocess images for training an image classification model with PyTorch. The images are JPEGs on disk, and I need GPU-accelerated decode, resize to 224x224, and ImageNet normalization. The script should include the training loop.",
+      "prompt": "Write a Python script that uses DALI dynamic mode to load and preprocess images for training an image classification model with PyTorch. The images are JPEGs on disk, and I need GPU-accelerated decode, resize to 224x224, and ImageNet normalization. Show the full script; do not save it to a file.",
       "expected_output": "Complete pipeline using ndd.readers.File, ndd.decoders.image(device='gpu'), ndd.resize, ndd.crop_mirror_normalize, .torch() handoff",
       "files": [],
       "assertions": [
@@ -24,18 +24,20 @@
       "expected_output": "Uses batch.slice[0] and batch.slice[1] for samplewise slicing",
       "files": [],
       "assertions": [
-        {"name": "correct-slice-usage", "text": "Uses batch.slice[0] and batch.slice[1] (not batch.tensors)"},
+        {"name": "correct-import", "text": "Uses import nvidia.dali.experimental.dynamic as ndd"},
+        {"name": "correct-slice-usage", "text": "Uses batch.slice[0] and batch.slice[1]"},
         {"name": "no-getitem", "text": "Does not use batch[0] or batch[:, 0] (Batch has no __getitem__)"},
         {"name": "correct-slice-semantics", "text": "Correctly explains that .slice indexes within each sample, not across samples"},
         {"name": "batch-size-to-random", "text": "Passes batch_size to ndd.random.uniform()"}
       ]
     },
     {
       "id": 3,
-      "prompt": "Convert the following pipeline-mode DALI code to dynamic mode. Write the complete converted script.",
+      "prompt": "Convert the file /workspace/input/pipeline_to_convert.py to dynamic mode. Include the complete converted script in your response.",
       "expected_output": "Correct conversion with all pipeline-mode patterns replaced",
       "files": ["evals/files/pipeline_to_convert.py"],
       "assertions": [
+        {"name": "correct-import", "text": "Uses import nvidia.dali.experimental.dynamic as ndd"},
         {"name": "device-gpu-not-mixed", "text": "device='mixed' converted to device='gpu'"},
         {"name": "reader-pascalcase", "text": "fn.readers.file converted to ndd.readers.File (PascalCase)"},
         {"name": "no-pipeline-mode", "text": "No pipeline-mode constructs (no @pipeline_def, pipe.build(), pipe.run()) and operators called directly on ndd (e.g. ndd.rotate, not fn.rotate or ndd.fn.rotate)"},
@@ -48,20 +50,21 @@
     {
       "id": 4,
       "prompt": "My data loading code built with DALI's dynamic (imperative) API produces wrong results intermittently — images sometimes appear corrupted. The code decodes JPEG images on the GPU, resizes them, and normalizes them. How do I debug this? Write a debugging guide with code examples.",
-      "expected_output": "Recommends EvalMode.sync_full or sync_cpu for debugging, explains async execution model, code examples use correct dynamic mode patterns",
+      "expected_output": "Recommends EvalMode.sync_full or sync_cpu for debugging (not necessarily both), explains async execution model, code examples use correct dynamic mode patterns",
       "files": [],
       "assertions": [
+        {"name": "correct-import", "text": "Uses import nvidia.dali.experimental.dynamic as ndd"},
         {"name": "recommends-sync-mode", "text": "Recommends EvalMode.sync_full or EvalMode.sync_cpu for debugging"},
         {"name": "no-scatter-evaluate", "text": "Does not recommend adding .evaluate() after every operation as the primary debugging approach"},
-        {"name": "correct-evalmode-syntax", "text": "Uses correct context manager syntax: with ndd.EvalMode.sync_full: (not ndd.eval_mode(...) or other invented API)"},
-        {"name": "correct-sample-inspection", "text": "When inspecting intermediate values, uses batch.tensors[i].cpu() or np.asarray(batch.tensors[i].cpu()) — not batch[i] or batch.as_cpu().as_array()"},
+        {"name": "correct-evalmode-syntax", "text": "Uses correct context manager syntax: `with ndd.EvalMode.sync_cpu:` or `with ndd.EvalMode.sync_full:` (not ndd.eval_mode(...) or other invented API)"},
+        {"name": "correct-sample-inspection", "text": "When inspecting intermediate values, uses batch.tensors[i], not batch[i] or batch.as_cpu().as_array()"},
         {"name": "code-examples-no-pipeline-mode", "text": "All code examples in the guide use dynamic mode patterns (ndd.decoders.image, ndd.resize, etc.) — no fn.* or ndd.fn.* operators in any code snippet"},
         {"name": "code-examples-device-gpu", "text": "All code examples use device='gpu' for decode, NOT device='mixed'"}
       ]
     },
     {
       "id": 5,
-      "prompt": "I need to train a speech classification model on WAV files using PyTorch. Write a complete Python script that uses DALI dynamic mode for the data loading and audio feature extraction (mel spectrograms). My audio clips have different durations.",
+      "prompt": "I need to train a speech classification model on WAV files using PyTorch. Show me a complete Python script that uses DALI dynamic mode for the data loading and audio feature extraction (mel spectrograms). My audio clips have different durations.",
       "expected_output": "Uses ndd.readers, ndd.decoders.audio(), spectral ops, handles variable-length via .torch(pad=True)",
       "files": [],
       "assertions": [
@@ -75,7 +78,7 @@
     },
     {
       "id": 6,
-      "prompt": "Write a complete Python script for an object detection training pipeline using DALI dynamic mode and PyTorch. It should read COCO-format images and annotations, apply random horizontal flip as augmentation (both images and their bounding boxes), resize, normalize, and feed to a model.",
+      "prompt": "Write a complete Python script for an object detection training pipeline using DALI dynamic mode and PyTorch. It should read COCO-format images and annotations, apply random horizontal flip as augmentation (both images and their bounding boxes), resize, normalize, and feed to a model. Images are of variable sizes.",
       "expected_output": "DALI reader with bbox support, coordinated augmentation via ndd.random, correct dynamic mode patterns",
       "files": [],
       "assertions": [

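Reviewer note on the evals.json changes: this diff adds the `correct-import` assertion to several evals one at a time, so an entry could be missed. The sketch below checks which evals lack a given assertion name. It assumes the file keeps the shape visible in the diff (`evals` list, each entry with `id` and `assertions` holding `name`/`text` objects). The inline `SAMPLE` is abbreviated, its entry with id 99 is hypothetical, and `evals_missing` is an illustrative helper rather than part of the evaluation tooling.

```python
import json

# Abbreviated sample in the shape of evals.json; id 99 is a hypothetical entry.
SAMPLE = """
{
  "evals": [
    {
      "id": 3,
      "assertions": [
        {"name": "correct-import", "text": "Uses import nvidia.dali.experimental.dynamic as ndd"},
        {"name": "device-gpu-not-mixed", "text": "device='mixed' converted to device='gpu'"}
      ]
    },
    {
      "id": 99,
      "assertions": [
        {"name": "placeholder", "text": "Hypothetical entry without the import assertion"}
      ]
    }
  ]
}
"""


def evals_missing(document: dict, assertion_name: str) -> list[int]:
    """Return the ids of evals that have no assertion with the given name."""
    missing = []
    for entry in document["evals"]:
        names = {assertion["name"] for assertion in entry.get("assertions", [])}
        if assertion_name not in names:
            missing.append(entry["id"])
    return missing


document = json.loads(SAMPLE)
print(evals_missing(document, "correct-import"))
```

Against the real file, `SAMPLE` would be replaced by the contents of `skills/dali-dynamic-mode/evals/evals.json`.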
diff --git a/skills/dali-dynamic-mode/scripts/requirements.txt b/skills/dali-dynamic-mode/scripts/requirements.txt
@@ -0,0 +1,2 @@
+nvidia-dali-cuda130
+torch
diff --git a/skills/dali-dynamic-mode/skill-card.md b/skills/dali-dynamic-mode/skill-card.md
@@ -7,9 +7,9 @@ This skill is ready for commercial/non-commercial use. <br>
 NVIDIA <br>
 
 ### License/Terms of Use: <br>
-Apache-2.0 <br>
+Apache 2.0 <br>
 ## Use Case: <br>
-Developers and engineers writing, reviewing, or migrating code that uses NVIDIA DALI's imperative dynamic-mode API for GPU-accelerated data loading and preprocessing. <br>
+Developers and engineers writing, reviewing, or migrating data-loading code that uses NVIDIA DALI's imperative dynamic-mode API for GPU-accelerated data processing in deep learning workflows. <br>
 
 ### Deployment Geography for Use: <br>
 Global <br>
@@ -19,6 +19,8 @@ Risk: Review before execution as proposals could introduce incorrect or misleadi
 Mitigation: Review and scan skill before deployment. <br>
 
 ## Reference(s): <br>
+- [SKILL.md](SKILL.md) <br>
+- [BENCHMARK.md](BENCHMARK.md) <br>
 
 
 ## Skill Output: <br>
@@ -27,6 +29,15 @@ Mitigation: Review and scan skill before deployment. <br>
 **Output Parameters:** [1D] <br>
 **Other Properties Related to Output:** [None] <br>
 
+## Evaluation Agents Used: <br>
+- `claude-code` <br>
+- `codex` <br>
+
+
+
+## Evaluation Tasks: <br>
+Evaluated against 24 tasks with 2 attempts per task; pass threshold 50%. <br>
+
 ## Evaluation Metrics Used: <br>
 Reported benchmark dimensions: <br>
 - Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. <br>
@@ -35,10 +46,28 @@ Reported benchmark dimensions: <br>
 - Effectiveness: Checks whether the agent performs measurably better with the skill than without it. <br>
 - Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work. <br>
 
+Underlying evaluation signals used in this run: <br>
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access. <br>
+- `skill_execution`: Verifies that the agent loaded the expected skill and workflow. <br>
+- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage. <br>
+- `accuracy`: Grades final-answer correctness against the reference answer. <br>
+- `goal_accuracy`: Checks whether the overall user task completed successfully. <br>
+- `behavior_check`: Verifies expected behavior steps, including safety expectations. <br>
+- `token_efficiency`: Compares token usage with and without the skill. <br>
+
+
 
+## Evaluation Results: <br>
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 8 | 100% (+0%) | 100% (+0%) |
+| Correctness | 8 | 98% (+61%) | 86% (+31%) |
+| Discoverability | 8 | 97% (+84%) | 81% (+47%) |
+| Effectiveness | 8 | 77% (+45%) | 66% (+29%) |
+| Efficiency | 8 | 88% (+59%) | 76% (+41%) |
 
 ## Skill Version(s): <br>
-4d4cfdd1 (source: git SHA, committed 2026-05-28) <br>
+v2.2.0-dev-88-g5107f33d (source: git tag) <br>
 
 ## Ethical Considerations: <br>
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>