
Add dep check for ptq and runtime check for evaluation/deployment#1240

Open
kaix-nv wants to merge 2 commits into main from kaix/nemo_skills

Conversation

@kaix-nv
Contributor

@kaix-nv kaix-nv commented Apr 12, 2026

What does this PR do?

Type of change: ?

PTQ: model-specific dependency support

  • Add EXTRA_PIP_DEPS support to the launcher's ptq.sh so models requiring extra pip packages (e.g., mamba-ssm for hybrid Mamba architectures like Nemotron) can install them automatically before running PTQ. Also updates the PTQ skill with a new Step 2.5 for detecting model-specific dependencies.

Container registry auth checks

  • Add new section 6 covering auth detection for enroot/pyxis, Docker, and Singularity/Apptainer. Includes credential locations, how to add them, and common failure modes.
  • Add Step 7.5 with NEL default image table, DockerHub-first strategy with NGC fallback, and build-config CLI note.
  • Add auth check before remote SLURM deployment.
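The auth check described above can be sketched as a redacted credential probe: report which registries have credentials without ever printing the secrets. This is a minimal sketch, assuming enroot's standard netrc-style credentials file location; the function name `list_enroot_registries` is illustrative and not part of the PR:

```shell
#!/usr/bin/env bash
# Sketch: list registries that have enroot credentials, without leaking tokens.
# Default path is enroot's standard credentials file; pass another path to test.

list_enroot_registries() {
    local cred_file="${1:-$HOME/.config/enroot/.credentials}"
    # Each credential line looks like:
    #   machine <registry> login <user> password <token>
    # Emit only the registry hostname; never the login/password fields.
    grep -E '^[[:space:]]*machine[[:space:]]+' "$cred_file" 2>/dev/null | awk '{print $2}'
}

list_enroot_registries "$@"
```

A host with NGC credentials would print `nvcr.io` and nothing else, which is safe to paste into logs or transcripts.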

Usage

Set EXTRA_PIP_DEPS in the launcher YAML's environment section:

```yaml
task_0:
  script: common/hf/ptq.sh
  args:
    - --repo nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --local-dir /hf-local/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --
    - --quant nvfp4
    - --tasks quant
  environment:
    - EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"
```
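For reference, the consuming side in the launcher script can be sketched like this. This is a minimal stand-in under stated assumptions, not the exact `ptq.sh`; `parse_extra_deps` is an illustrative helper, and the real `pip install` is commented out so the sketch runs anywhere. Splitting the value into an array avoids the word-splitting and globbing pitfalls of an unquoted expansion:

```shell
#!/usr/bin/env bash
# Hedged sketch of how a launcher script might consume EXTRA_PIP_DEPS.
set -u

parse_extra_deps() {
    # Split the space-separated spec list into one array element per package.
    read -r -a extra_deps <<< "$1"
}

if [ -n "${EXTRA_PIP_DEPS:-}" ]; then
    echo "Installing extra dependencies: $EXTRA_PIP_DEPS"
    # PIP_CONSTRAINT can pin versions that conflict with model-specific extras.
    unset PIP_CONSTRAINT
    parse_extra_deps "$EXTRA_PIP_DEPS"
    # pip install "${extra_deps[@]}"   # real install, elided in this sketch
    printf 'would install: %s\n' "${extra_deps[@]}"
fi
```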

Testing

Tested end-to-end: NVFP4 quantization of NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 on a B200 cluster via the launcher. Job succeeded: mamba-ssm installed automatically, calibration completed (512 samples, 84s), checkpoint exported (18 GB, 2 safetensor shards).

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added container registry authentication verification workflow for SLURM deployments, including credential checks and remediation guidance
    • New dependency-checking step for models requiring custom imports
    • Updated deployment workflow to require credential validation before SLURM job submission
  • New Features

    • Added support for specifying extra pip dependencies during model processing via environment variable

@copy-pr-bot

copy-pr-bot bot commented Apr 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 12, 2026

📝 Walkthrough

Walkthrough

The changes introduce container registry authentication verification workflows across SLURM deployment processes and add model-specific dependency detection for PTQ operations. Documentation updates provide step-by-step credential validation instructions before job submission. A shell script enhancement enables optional extra pip package installation via environment variables.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Container Registry Authentication**<br>`.claude/skills/common/slurm-setup.md`, `.claude/skills/deployment/SKILL.md`, `.claude/skills/evaluation/SKILL.md` | Added new documentation section (Section 6) for pre-SLURM container registry authentication verification, including runtime detection, credential checking, and failure remediation. Deployment and evaluation guides updated with gating steps to verify credentials before job submission and specialized SLURM authentication checks. |
| **Model Dependency Management**<br>`.claude/skills/ptq/SKILL.md`, `tools/launcher/common/hf/ptq.sh` | Added Step 2.5 workflow in PTQ guide for detecting model-specific dependencies via trust_remote_code inspection and mapping imports to required packages. Enhanced PTQ shell script with conditional installation of extra packages controlled by EXTRA_PIP_DEPS environment variable. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as Deployment CLI
    participant SLURM
    participant Node as Compute Node
    participant Registry as Container Registry

    User->>CLI: Run deployment/evaluation
    CLI->>CLI: Check credentials on local system
    alt Credentials Missing
        CLI->>User: Provide remediation commands
        User->>CLI: Run credential setup commands
    end
    CLI->>SLURM: Submit job with verified credentials
    SLURM->>Node: Schedule and route to compute
    Node->>Registry: Pull container image
    Registry-->>Node: Image retrieved
    Node->>Node: Launch container & run workload
```
```mermaid
sequenceDiagram
    participant User
    participant PTQGuide as PTQ Workflow
    participant Model as Model Config
    participant Deps as Dependency Check
    participant Launcher as Launcher Script
    participant Env as Execution Environment

    User->>PTQGuide: Start PTQ process
    PTQGuide->>Model: Check config.json for trust_remote_code
    alt trust_remote_code Enabled
        PTQGuide->>Deps: Detect custom imports in modeling_*.py
        Deps->>Deps: Map imports to pip packages
        Deps->>User: Identify missing dependencies
        User->>Launcher: Set EXTRA_PIP_DEPS environment variable
    end
    Launcher->>Env: Check if EXTRA_PIP_DEPS is set
    alt EXTRA_PIP_DEPS Non-empty
        Env->>Env: Unset PIP_CONSTRAINT
        Env->>Env: pip install $EXTRA_PIP_DEPS
    end
    Launcher->>Launcher: Proceed with model download and PTQ execution
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the two main changes: adding dependency checking for PTQ and adding runtime/authentication checks for evaluation and deployment workflows. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Security Anti-Patterns | ✅ Passed | PR contains only documentation and shell script changes with no Python code modifications. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions bot commented Apr 12, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1240/

Built to branch gh-pages at 2026-04-13 21:19 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.91%. Comparing base (0b42c14) to head (b1ac809).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #1240   +/-   ##
=======================================
  Coverage   76.91%   76.91%
=======================================
  Files         350      350
  Lines       40481    40481
=======================================
  Hits        31137    31137
  Misses       9344     9344
```
| Flag | Coverage | Δ |
| --- | --- | --- |
| unit | 55.53% <ø> | (ø) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@kaix-nv kaix-nv requested review from Edwardf0t1 and mxinO April 13, 2026 02:57
kaix-nv added 2 commits April 13, 2026 14:15
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv kaix-nv changed the title Add dependencies check for ptq Add dependencies check for ptq; Add runtime check for evaluation/deployment Apr 13, 2026
@kaix-nv kaix-nv changed the title Add dependencies check for ptq; Add runtime check for evaluation/deployment Add dep check for ptq and runtime check for evaluation/deployment Apr 13, 2026
@kaix-nv kaix-nv marked this pull request as ready for review April 14, 2026 03:49
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/common/slurm-setup.md:
- Around line 235-256: The snippets that directly cat credential files (the
commands referencing ~/.config/enroot/.credentials, ~/.docker/config.json, and
~/.singularity/docker-config.json or ~/.apptainer/docker-config.json) must not
be left as-is; change them to commands that only enumerate registry keys or
"machine <registry>" entries and redact any secrets instead of printing full
files. Locate the three occurrences (the enroot credentials check, the Docker
config json check, and the Singularity/Apptainer check) and replace the raw-dump
approach with parsing that extracts and prints only registry hostnames/machine
names (e.g., use a JSON parser like jq or a simple grep/awk pipeline) and ensure
token/auth fields are omitted or redacted in the output. Ensure the help text
describes that the check shows only registry keys/machine entries, not secrets.
- Around line 287-305: Replace the CLI password flags that expose secrets in the
commands "docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>" and
"singularity remote login --username '$oauthtoken' --password <ngc_api_key>
docker://nvcr.io" with stdin-based or interactive login flows: for Docker use
the --password-stdin flow (pipe the API key into docker login) and for
Singularity/Apptainer omit the --password flag to trigger an interactive prompt
or use the tool's supported stdin/token mechanism or environment variable helper
so secrets are not passed as command-line arguments.

In @.claude/skills/evaluation/SKILL.md:
- Line 31: The checklist was updated to add "Step 7.5: Check container registry
auth (SLURM only)" but the subsequent "Now, copy this checklist" block in
SKILL.md still skips 7.5; open .claude/skills/evaluation/SKILL.md, find the
"Now, copy this checklist" section and add the corresponding checkbox line for
Step 7.5 (matching the exact wording "Step 7.5: Check container registry auth
(SLURM only)") so the copyable checklist and any progress-tracking list include
that step and numbering remains consistent with the original checklist.
- Around line 202-214: The decision flow contradicts the pre-submit auth gate:
update the flow in the "Decision flow" section so authentication is verified and
handled before any job submission rather than submitting and reacting to a 401;
specifically, replace the current steps with: check DockerHub credentials first
(per the pre-submit auth gate), attempt to add DockerHub creds if missing, if
creds cannot be provided then set deployment.image to the NGC image
(nvcr.io/nvidia/vllm:<YY.MM>-py3) and only then submit the job, and keep the "Do
not retry more than once" rule; ensure references to "Default behavior",
"pre-submit auth gate", "deployment.image", and the NGC image string are updated
accordingly.
- Around line 199-200: Replace the unsafe raw credential dump command ssh <host>
"cat ~/.config/enroot/.credentials 2>/dev/null" with a non-leaking check that
only verifies presence of registry/machine entries or keys rather than printing
values; update the SKILL.md example so it inspects the file structure (e.g.,
look for "machine" entries or specific registry keys) or parses JSON/YAML to
report existence/status and masks or omits secret values, ensuring no full
credentials are echoed or logged.
- Line 80: The file has conflicting guidance: one sentence ("DON'T ALLOW FOR ANY
OTHER OPTIONS") hard-restricts accepted categories while another warns that CLI
options "may differ from this list" and says "Use the CLI's current options, not
this list"; reconcile by removing or softening the hard restriction and making
the doc authoritative only as an example, e.g., replace "DON'T ALLOW FOR ANY
OTHER OPTIONS" with a consistent instruction that callers should validate
against the CLI output (nel skills build-config --help) and accept CLI-provided
categories rather than rejecting options not listed in the doc; update the text
around the example categories and any validation guidance to explicitly prefer
the CLI's current options and to allow other valid CLI values.

In `@tools/launcher/common/hf/ptq.sh`:
- Around line 29-33: The pip install invocation uses unquoted $EXTRA_PIP_DEPS
which risks word-splitting/globbing; split the value into a shell array and
install using the array expansion to preserve each dependency token. For
example, parse EXTRA_PIP_DEPS into an array (e.g., read -r -a extras <<<
"$EXTRA_PIP_DEPS" or IFS=' ' read -r -a extras <<< "$EXTRA_PIP_DEPS") and
replace pip install $EXTRA_PIP_DEPS with pip install "${extras[@]}", keeping the
existing unset PIP_CONSTRAINT and condition around EXTRA_PIP_DEPS.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5232f7df-8043-4cd3-85fd-ec40f3a1bfcc

📥 Commits

Reviewing files that changed from the base of the PR and between 14b78ae and b1ac809.

📒 Files selected for processing (5)
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/ptq/SKILL.md
  • tools/launcher/common/hf/ptq.sh

Comment on lines +235 to +256
```bash
cat ~/.config/enroot/.credentials 2>/dev/null
```

Look for `machine <registry>` lines:
- NGC → `machine nvcr.io`
- DockerHub → `machine auth.docker.io`
- GHCR → `machine ghcr.io`

#### Docker

```bash
cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; print(json.dumps(json.load(sys.stdin).get('auths',{}), indent=2))"
```

Look for registry keys (`https://index.docker.io/v1/`, `nvcr.io`, `ghcr.io`).

#### Singularity/Apptainer

```bash
cat ~/.singularity/docker-config.json 2>/dev/null || cat ~/.apptainer/docker-config.json 2>/dev/null
```
Contributor

⚠️ Potential issue | 🟠 Major

Avoid commands that print raw credential files.

Lines 236, 247, and 255 currently instruct dumping auth files directly, which can leak tokens into terminal logs/history. Prefer listing registry keys/machine entries only (redacted view).

Suggested safer checks:

```diff
-cat ~/.config/enroot/.credentials 2>/dev/null
+grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null

-cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; print(json.dumps(json.load(sys.stdin).get('auths',{}), indent=2))"
+cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; cfg=json.load(sys.stdin); print('\n'.join(cfg.get('auths', {}).keys()))"

-cat ~/.singularity/docker-config.json 2>/dev/null || cat ~/.apptainer/docker-config.json 2>/dev/null
+(cat ~/.singularity/docker-config.json 2>/dev/null || cat ~/.apptainer/docker-config.json 2>/dev/null) | \
+  python3 -c "import json,sys; cfg=json.load(sys.stdin); print('\n'.join(cfg.get('auths', {}).keys()))"
```

As per coding guidelines "Never hardcode secrets, credentials, tokens, passwords, or API keys in source code. Use environment variables or configuration files listed in .gitignore to store sensitive information."


Comment on lines +287 to +305
**Docker:**

```bash
# DockerHub
docker login

# NGC
docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>
```

**Singularity/Apptainer:**

```bash
# DockerHub
singularity remote login --username <user> docker://docker.io

# NGC
singularity remote login --username '$oauthtoken' --password <ngc_api_key> docker://nvcr.io
```
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
# Check if the file exists and examine its content around the specified lines
if [ -f ".claude/skills/common/slurm-setup.md" ]; then
  echo "File found. Reading content around lines 287-305:"
  sed -n '280,310p' ".claude/skills/common/slurm-setup.md" | cat -n
else
  echo "File not found at .claude/skills/common/slurm-setup.md"
  # Try to locate it
  find . -name "slurm-setup.md" 2>/dev/null
fi
```

Repository: NVIDIA/Model-Optimizer

Length of output: 981


🏁 Script executed:

# Check if there's any context or warnings around these code blocks
```bash
# Check if there's any context or warnings around these code blocks
sed -n '270,310p' ".claude/skills/common/slurm-setup.md" | cat -n
```

Repository: NVIDIA/Model-Optimizer

Length of output: 1294


🌐 Web query:

Docker login --password-stdin syntax and security best practices

💡 Result:

Syntax: docker login [SERVER] --username USER --password-stdin Examples: - echo "password" | docker login --username foo --password-stdin myregistry.com - cat password.txt | docker login --username foo --password-stdin - printf "%s" "$PASSWORD" | docker login --username foo --password-stdin (avoids trailing newline) Security Best Practices: - NEVER use --password flag; it exposes password in process list, shell history, logs. Docker warns: "Using --password via the CLI is insecure. Use --password-stdin." - Use --password-stdin for non-interactive login to avoid history/logs exposure. - Prefer credential helpers (credsStore or credHelpers in ~/.docker/config.json) for secure storage in OS keychain (osxkeychain, secretservice, wincred, pass). More secure than base64 in config.json. - Use Personal Access Tokens (PATs), not passwords. - In CI/CD: Use masked secrets/env vars with printf/echo piped to --password-stdin. - Rotate credentials regularly. - For specific registries (e.g., AWS ECR): Use dedicated helpers like ecr-login.



🌐 Web query:

Singularity/Apptainer remote login password handling best practices

💡 Result:

Best practices for handling passwords with Singularity/Apptainer remote login focus on minimizing exposure of credentials, as they are stored unencrypted in ~/.singularity/remotes.yaml or ~/.apptainer/remote.yaml. Key recommendations from official documentation: 1. Use token-based authentication instead of passwords where possible. Generate API tokens or access tokens from the remote service (e.g., Sylabs Cloud, OCI registries like Docker Hub) and use them as the password field. Tokens can be revoked easily if compromised. Examples: - For SylabsCloud: Generate token at https://cloud.sylabs.io/auth/tokens and paste during login. - For OCI registries: Use personal access tokens as password/token, with required username (sometimes dummy like '00000000-0000-0000-0000-000000000000' for Azure ACR). 2. Avoid entering passwords directly on command line with --password, as they appear in process lists (e.g., ps output). Instead: - Use interactive prompt: apptainer remote login - Or --password-stdin to pipe password: echo "token" | apptainer remote login --password-stdin 3. Use --tokenfile to read from a secure file: apptainer remote login --tokenfile /path/to/token.txt 4. Be aware credentials are stored unencrypted in remote.yaml in your home directory. Protect access to your home directory (permissions 700 or similar). Logout with 'remote logout' when no longer needed. Do not commit remote.yaml to version control. 5. For automation/CI (non-interactive): - Use environment variables if supported (older Singularity had SINGULARITY_DOCKER_USERNAME/PASSWORD). - Prefer short-lived tokens. - Avoid persistent storage; login per session if possible. 6. Use --insecure only if necessary (HTTP instead of HTTPS), but avoid it. 7. For OCI registries specifically: Consult registry docs for CLI login requirements. Singularity/Apptainer reads ~/.docker/config.json if present (but not external helpers). 
Examples: apptainer remote login SylabsCloud # Interactive token prompt apptainer remote login --username foo --password-stdin docker://docker.io # Pipe token INFO: Token stored in /home/user/.apptainer/remote.yaml Official docs explicitly warn: "the login command will store the supplied credentials or tokens unencrypted in your home directory."



Update documentation to avoid passing secrets via CLI arguments.

Lines 294 and 304 show credentials passed as CLI arguments, which exposes them in process lists and shell history. Use stdin-based or interactive login flows instead:

Suggested safer patterns:

```diff
 # NGC
-docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>
+echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin

 # NGC
-singularity remote login --username '$oauthtoken' --password <ngc_api_key> docker://nvcr.io
+echo "$NGC_API_KEY" | singularity remote login --username '$oauthtoken' --password-stdin docker://nvcr.io
```

Alternatively, use interactive prompts (omit --password* flags to trigger interactive login).


- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
Contributor

⚠️ Potential issue | 🟡 Minor

Step list drift: final checklist is missing Step 7.5.

Line 31 adds Step 7.5, but the “Now, copy this checklist” block still skips it, so progress tracking is inconsistent.


DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

> **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead.
> **Note:** These categories come from NEL's `build-config` CLI. **Always run `nel skills build-config --help` first** to get the current options — they may differ from this list (e.g., `chat_reasoning` instead of separate `chat`/`reasoning`, `general_knowledge` instead of `standard`). Use the CLI's current options, not this list, when they conflict.
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```bash
cat -n .claude/skills/evaluation/SKILL.md | sed -n '75,85p'
```

Repository: NVIDIA/Model-Optimizer

Length of output: 1163


Resolve conflicting instructions on valid configuration options.

Line 78 hard-restricts choices to only those listed above ("DON'T ALLOW FOR ANY OTHER OPTIONS"), while Line 80 explicitly states that options "may differ from this list" and instructs users to "Use the CLI's current options, not this list, when they conflict." These contradictory directives can cause users to reject valid options provided by the actual nel skills build-config --help output.


Comment on lines +199 to +200
```bash
ssh <host> "cat ~/.config/enroot/.credentials 2>/dev/null"
```
Contributor

⚠️ Potential issue | 🟠 Major

Do not instruct raw credential-file output over SSH.

Line 199 prints full enroot credentials, which can leak secrets into logs/transcripts. Check only machine entries or registry keys instead.

Suggested safer command:

```diff
-ssh <host> "cat ~/.config/enroot/.credentials 2>/dev/null"
+ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
```

As per coding guidelines "Never hardcode secrets, credentials, tokens, passwords, or API keys in source code. Use environment variables or configuration files listed in .gitignore to store sensitive information."


Comment on lines +202 to +214
**Default behavior: use DockerHub image first.** If the job fails with a `401` or image pull error, fall back to the NGC alternative by adding `deployment.image` to the config:

```yaml
deployment:
  image: nvcr.io/nvidia/vllm:<YY.MM>-py3 # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
```

**Decision flow:**
1. Submit with the default DockerHub image (`vllm/vllm-openai:latest`)
2. If the job fails with image pull auth error (401) → check credentials
3. If DockerHub credentials can be added → add them and resubmit
4. If not → override `deployment.image` to the NGC vLLM image and resubmit
5. **Do not retry more than once** without fixing the auth issue
Contributor

⚠️ Potential issue | 🟠 Major

Decision flow contradicts the pre-submit auth gate.

Line 196 says verify auth before submission, but Line 210 starts by submitting first and reacting to failure. This reintroduces the queued-failure path the gate is meant to prevent.


Comment on lines +29 to +33
```bash
if [ -n "$EXTRA_PIP_DEPS" ]; then
    echo "Installing extra dependencies: $EXTRA_PIP_DEPS"
    unset PIP_CONSTRAINT
    pip install $EXTRA_PIP_DEPS
fi
```
Contributor

⚠️ Potential issue | 🟠 Major

Parse EXTRA_PIP_DEPS into an array before installing.

On Line 32, pip install $EXTRA_PIP_DEPS is vulnerable to unintended word splitting/globbing. This can install the wrong packages when dependency specs include special characters.

Suggested fix:

```diff
 if [ -n "$EXTRA_PIP_DEPS" ]; then
     echo "Installing extra dependencies: $EXTRA_PIP_DEPS"
     unset PIP_CONSTRAINT
-    pip install $EXTRA_PIP_DEPS
+    read -r -a extra_pip_deps <<< "$EXTRA_PIP_DEPS"
+    pip install "${extra_pip_deps[@]}"
 fi
```
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 32-32: Double quote to prevent globbing and word splitting.

(SC2086)



---

## 6. Container Registry Authentication
Contributor

This also applies to local container setup. As I suggested in @Edwardf0t1's PR, maybe we can put this in env-setup.md for setting HF_TOKEN, the Docker login token, and the NGC token.

| Runtime | Typical clusters | SLURM integration |
| --- | --- | --- |
| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
| **Singularity/Apptainer** | HPC / academic clusters | `singularity exec` inside job script |
Contributor

About Singularity and Apptainer: it seems we don't have clusters to test on. Our overall workflow didn't consider this; we only added templates for pyxis and Docker. Maybe we can wait for a community contribution on this, wdyt?
