
Add dep check for ptq and runtime check for evaluation/deployment#1240

Open
kaix-nv wants to merge 2 commits into main from kaix/nemo_skills

Conversation

@kaix-nv
Contributor

@kaix-nv kaix-nv commented Apr 12, 2026

What does this PR do?

Type of change: ?

PTQ: model-specific dependency support

  • Add EXTRA_PIP_DEPS support to the launcher's ptq.sh so models requiring extra pip packages (e.g., mamba-ssm for hybrid Mamba architectures like Nemotron) can install them automatically before running PTQ. Also updates the PTQ skill with a new Step 2.5 for detecting model-specific dependencies.

Container registry auth checks

  • Add new section 6 covering auth detection for enroot/pyxis, Docker, and Singularity/Apptainer. Includes credential locations, how to add them, and common failure modes.
  • Add Step 7.5 with NEL default image table, DockerHub-first strategy with NGC fallback, and build-config CLI note.
  • Add auth check before remote SLURM deployment.
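The auth check described above can be sketched as a redacted credential probe: report which registries have credentials without ever printing the secrets. This is a minimal sketch, assuming enroot's standard netrc-style credentials file location; the function name `list_enroot_registries` is illustrative and not part of the PR:

```shell
#!/usr/bin/env bash
# Sketch: list registries that have enroot credentials, without leaking tokens.
# Default path is enroot's standard credentials file; pass another path to test.

list_enroot_registries() {
    local cred_file="${1:-$HOME/.config/enroot/.credentials}"
    # Each credential line looks like:
    #   machine <registry> login <user> password <token>
    # Emit only the registry hostname; never the login/password fields.
    grep -E '^[[:space:]]*machine[[:space:]]+' "$cred_file" 2>/dev/null | awk '{print $2}'
}

list_enroot_registries "$@"
```

A host with NGC credentials would print `nvcr.io` and nothing else, which is safe to paste into logs or transcripts.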

Usage

Set EXTRA_PIP_DEPS in the launcher YAML's environment section:

```yaml
task_0:
  script: common/hf/ptq.sh
  args:
    - --repo nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --local-dir /hf-local/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --
    - --quant nvfp4
    - --tasks quant
  environment:
    - EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"
```
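For reference, the consuming side in the launcher script can be sketched like this. This is a minimal stand-in under stated assumptions, not the exact `ptq.sh`; `parse_extra_deps` is an illustrative helper, and the real `pip install` is commented out so the sketch runs anywhere. Splitting the value into an array avoids the word-splitting and globbing pitfalls of an unquoted expansion:

```shell
#!/usr/bin/env bash
# Hedged sketch of how a launcher script might consume EXTRA_PIP_DEPS.
set -u

parse_extra_deps() {
    # Split the space-separated spec list into one array element per package.
    read -r -a extra_deps <<< "$1"
}

if [ -n "${EXTRA_PIP_DEPS:-}" ]; then
    echo "Installing extra dependencies: $EXTRA_PIP_DEPS"
    # PIP_CONSTRAINT can pin versions that conflict with model-specific extras.
    unset PIP_CONSTRAINT
    parse_extra_deps "$EXTRA_PIP_DEPS"
    # pip install "${extra_deps[@]}"   # real install, elided in this sketch
    printf 'would install: %s\n' "${extra_deps[@]}"
fi
```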

Testing

Tested end-to-end: NVFP4 quantization of NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 on a B200 cluster via the launcher. Job succeeded: mamba-ssm installed automatically, calibration completed (512 samples, 84s), checkpoint exported (18 GB, 2 safetensor shards).

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added container registry authentication verification workflow for SLURM deployments, including credential checks and remediation guidance
    • New dependency-checking step for models requiring custom imports
    • Updated deployment workflow to require credential validation before SLURM job submission
  • New Features

    • Added support for specifying extra pip dependencies during model processing via environment variable

@copy-pr-bot

copy-pr-bot bot commented Apr 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 12, 2026

📝 Walkthrough

Walkthrough

The changes introduce container registry authentication verification workflows across SLURM deployment processes and add model-specific dependency detection for PTQ operations. Documentation updates provide step-by-step credential validation instructions before job submission. A shell script enhancement enables optional extra pip package installation via environment variables.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Container Registry Authentication**<br>`.claude/skills/common/slurm-setup.md`, `.claude/skills/deployment/SKILL.md`, `.claude/skills/evaluation/SKILL.md` | Added new documentation section (Section 6) for pre-SLURM container registry authentication verification, including runtime detection, credential checking, and failure remediation. Deployment and evaluation guides updated with gating steps to verify credentials before job submission and specialized SLURM authentication checks. |
| **Model Dependency Management**<br>`.claude/skills/ptq/SKILL.md`, `tools/launcher/common/hf/ptq.sh` | Added Step 2.5 workflow in PTQ guide for detecting model-specific dependencies via trust_remote_code inspection and mapping imports to required packages. Enhanced PTQ shell script with conditional installation of extra packages controlled by EXTRA_PIP_DEPS environment variable. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as Deployment CLI
    participant SLURM
    participant Node as Compute Node
    participant Registry as Container Registry

    User->>CLI: Run deployment/evaluation
    CLI->>CLI: Check credentials on local system
    alt Credentials Missing
        CLI->>User: Provide remediation commands
        User->>CLI: Run credential setup commands
    end
    CLI->>SLURM: Submit job with verified credentials
    SLURM->>Node: Schedule and route to compute
    Node->>Registry: Pull container image
    Registry-->>Node: Image retrieved
    Node->>Node: Launch container & run workload
```
```mermaid
sequenceDiagram
    participant User
    participant PTQGuide as PTQ Workflow
    participant Model as Model Config
    participant Deps as Dependency Check
    participant Launcher as Launcher Script
    participant Env as Execution Environment

    User->>PTQGuide: Start PTQ process
    PTQGuide->>Model: Check config.json for trust_remote_code
    alt trust_remote_code Enabled
        PTQGuide->>Deps: Detect custom imports in modeling_*.py
        Deps->>Deps: Map imports to pip packages
        Deps->>User: Identify missing dependencies
        User->>Launcher: Set EXTRA_PIP_DEPS environment variable
    end
    Launcher->>Env: Check if EXTRA_PIP_DEPS is set
    alt EXTRA_PIP_DEPS Non-empty
        Env->>Env: Unset PIP_CONSTRAINT
        Env->>Env: pip install $EXTRA_PIP_DEPS
    end
    Launcher->>Launcher: Proceed with model download and PTQ execution
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the two main changes: adding dependency checking for PTQ and adding runtime/authentication checks for evaluation and deployment workflows. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Security Anti-Patterns | ✅ Passed | PR contains only documentation and shell script changes with no Python code modifications. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions bot commented Apr 12, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1240/

Built to branch gh-pages at 2026-04-13 21:19 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.91%. Comparing base (0b42c14) to head (b1ac809).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #1240   +/-   ##
=======================================
  Coverage   76.91%   76.91%
=======================================
  Files         350      350
  Lines       40481    40481
=======================================
  Hits        31137    31137
  Misses       9344     9344
```
| Flag | Coverage | Δ |
| --- | --- | --- |
| unit | 55.53% <ø> | (ø) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@kaix-nv kaix-nv requested review from Edwardf0t1 and mxinO April 13, 2026 02:57
kaix-nv added 2 commits April 13, 2026 14:15
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv kaix-nv changed the title Add dependencies check for ptq Add dependencies check for ptq; Add runtime check for evaluation/deployment Apr 13, 2026
@kaix-nv kaix-nv changed the title Add dependencies check for ptq; Add runtime check for evaluation/deployment Add dep check for ptq and runtime check for evaluation/deployment Apr 13, 2026
@kaix-nv kaix-nv marked this pull request as ready for review April 14, 2026 03:49
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/common/slurm-setup.md:
- Around line 235-256: The snippets that directly cat credential files (the
commands referencing ~/.config/enroot/.credentials, ~/.docker/config.json, and
~/.singularity/docker-config.json or ~/.apptainer/docker-config.json) must not
be left as-is; change them to commands that only enumerate registry keys or
"machine <registry>" entries and redact any secrets instead of printing full
files. Locate the three occurrences (the enroot credentials check, the Docker
config json check, and the Singularity/Apptainer check) and replace the raw-dump
approach with parsing that extracts and prints only registry hostnames/machine
names (e.g., use a JSON parser like jq or a simple grep/awk pipeline) and ensure
token/auth fields are omitted or redacted in the output. Ensure the help text
describes that the check shows only registry keys/machine entries, not secrets.
- Around line 287-305: Replace the CLI password flags that expose secrets in the
commands "docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>" and
"singularity remote login --username '$oauthtoken' --password <ngc_api_key>
docker://nvcr.io" with stdin-based or interactive login flows: for Docker use
the --password-stdin flow (pipe the API key into docker login) and for
Singularity/Apptainer omit the --password flag to trigger an interactive prompt
or use the tool's supported stdin/token mechanism or environment variable helper
so secrets are not passed as command-line arguments.

In @.claude/skills/evaluation/SKILL.md:
- Line 31: The checklist was updated to add "Step 7.5: Check container registry
auth (SLURM only)" but the subsequent "Now, copy this checklist" block in
SKILL.md still skips 7.5; open .claude/skills/evaluation/SKILL.md, find the
"Now, copy this checklist" section and add the corresponding checkbox line for
Step 7.5 (matching the exact wording "Step 7.5: Check container registry auth
(SLURM only)") so the copyable checklist and any progress-tracking list include
that step and numbering remains consistent with the original checklist.
- Around line 202-214: The decision flow contradicts the pre-submit auth gate:
update the flow in the "Decision flow" section so authentication is verified and
handled before any job submission rather than submitting and reacting to a 401;
specifically, replace the current steps with: check DockerHub credentials first
(per the pre-submit auth gate), attempt to add DockerHub creds if missing, if
creds cannot be provided then set deployment.image to the NGC image
(nvcr.io/nvidia/vllm:<YY.MM>-py3) and only then submit the job, and keep the "Do
not retry more than once" rule; ensure references to "Default behavior",
"pre-submit auth gate", "deployment.image", and the NGC image string are updated
accordingly.
- Around line 199-200: Replace the unsafe raw credential dump command ssh <host>
"cat ~/.config/enroot/.credentials 2>/dev/null" with a non-leaking check that
only verifies presence of registry/machine entries or keys rather than printing
values; update the SKILL.md example so it inspects the file structure (e.g.,
look for "machine" entries or specific registry keys) or parses JSON/YAML to
report existence/status and masks or omits secret values, ensuring no full
credentials are echoed or logged.
- Line 80: The file has conflicting guidance: one sentence ("DON'T ALLOW FOR ANY
OTHER OPTIONS") hard-restricts accepted categories while another warns that CLI
options "may differ from this list" and says "Use the CLI's current options, not
this list"; reconcile by removing or softening the hard restriction and making
the doc authoritative only as an example, e.g., replace "DON'T ALLOW FOR ANY
OTHER OPTIONS" with a consistent instruction that callers should validate
against the CLI output (nel skills build-config --help) and accept CLI-provided
categories rather than rejecting options not listed in the doc; update the text
around the example categories and any validation guidance to explicitly prefer
the CLI's current options and to allow other valid CLI values.

In `@tools/launcher/common/hf/ptq.sh`:
- Around line 29-33: The pip install invocation uses unquoted $EXTRA_PIP_DEPS
which risks word-splitting/globbing; split the value into a shell array and
install using the array expansion to preserve each dependency token. For
example, parse EXTRA_PIP_DEPS into an array (e.g., read -r -a extras <<<
"$EXTRA_PIP_DEPS" or IFS=' ' read -r -a extras <<< "$EXTRA_PIP_DEPS") and
replace pip install $EXTRA_PIP_DEPS with pip install "${extras[@]}", keeping the
existing unset PIP_CONSTRAINT and condition around EXTRA_PIP_DEPS.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5232f7df-8043-4cd3-85fd-ec40f3a1bfcc

📥 Commits

Reviewing files that changed from the base of the PR and between 14b78ae and b1ac809.

📒 Files selected for processing (5)
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/ptq/SKILL.md
  • tools/launcher/common/hf/ptq.sh

Comment on lines +235 to +256
```bash
cat ~/.config/enroot/.credentials 2>/dev/null
```

Look for `machine <registry>` lines:
- NGC → `machine nvcr.io`
- DockerHub → `machine auth.docker.io`
- GHCR → `machine ghcr.io`

#### Docker

```bash
cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; print(json.dumps(json.load(sys.stdin).get('auths',{}), indent=2))"
```

Look for registry keys (`https://index.docker.io/v1/`, `nvcr.io`, `ghcr.io`).

#### Singularity/Apptainer

```bash
cat ~/.singularity/docker-config.json 2>/dev/null || cat ~/.apptainer/docker-config.json 2>/dev/null
```
Contributor

⚠️ Potential issue | 🟠 Major

Avoid commands that print raw credential files.

Lines 236, 247, and 255 currently instruct dumping auth files directly, which can leak tokens into terminal logs/history. Prefer listing registry keys/machine entries only (redacted view).

Suggested safer checks:

```diff
-cat ~/.config/enroot/.credentials 2>/dev/null
+grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null

-cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; print(json.dumps(json.load(sys.stdin).get('auths',{}), indent=2))"
+cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; cfg=json.load(sys.stdin); print('\n'.join(cfg.get('auths', {}).keys()))"

-cat ~/.singularity/docker-config.json 2>/dev/null || cat ~/.apptainer/docker-config.json 2>/dev/null
+(cat ~/.singularity/docker-config.json 2>/dev/null || cat ~/.apptainer/docker-config.json 2>/dev/null) | \
+  python3 -c "import json,sys; cfg=json.load(sys.stdin); print('\n'.join(cfg.get('auths', {}).keys()))"
```

As per coding guidelines "Never hardcode secrets, credentials, tokens, passwords, or API keys in source code. Use environment variables or configuration files listed in .gitignore to store sensitive information."


Comment on lines +287 to +305
**Docker:**

```bash
# DockerHub
docker login

# NGC
docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>
```

**Singularity/Apptainer:**

```bash
# DockerHub
singularity remote login --username <user> docker://docker.io

# NGC
singularity remote login --username '$oauthtoken' --password <ngc_api_key> docker://nvcr.io
```
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
# Check if the file exists and examine its content around the specified lines
if [ -f ".claude/skills/common/slurm-setup.md" ]; then
  echo "File found. Reading content around lines 287-305:"
  sed -n '280,310p' ".claude/skills/common/slurm-setup.md" | cat -n
else
  echo "File not found at .claude/skills/common/slurm-setup.md"
  # Try to locate it
  find . -name "slurm-setup.md" 2>/dev/null
fi
```

Repository: NVIDIA/Model-Optimizer

Length of output: 981


🏁 Script executed:

# Check if there's any context or warnings around these code blocks
```bash
# Check if there's any context or warnings around these code blocks
sed -n '270,310p' ".claude/skills/common/slurm-setup.md" | cat -n
```

Repository: NVIDIA/Model-Optimizer

Length of output: 1294


🌐 Web query:

Docker login --password-stdin syntax and security best practices

💡 Result:

Syntax: docker login [SERVER] --username USER --password-stdin Examples: - echo "password" | docker login --username foo --password-stdin myregistry.com - cat password.txt | docker login --username foo --password-stdin - printf "%s" "$PASSWORD" | docker login --username foo --password-stdin (avoids trailing newline) Security Best Practices: - NEVER use --password flag; it exposes password in process list, shell history, logs. Docker warns: "Using --password via the CLI is insecure. Use --password-stdin." - Use --password-stdin for non-interactive login to avoid history/logs exposure. - Prefer credential helpers (credsStore or credHelpers in ~/.docker/config.json) for secure storage in OS keychain (osxkeychain, secretservice, wincred, pass). More secure than base64 in config.json. - Use Personal Access Tokens (PATs), not passwords. - In CI/CD: Use masked secrets/env vars with printf/echo piped to --password-stdin. - Rotate credentials regularly. - For specific registries (e.g., AWS ECR): Use dedicated helpers like ecr-login.



🌐 Web query:

Singularity/Apptainer remote login password handling best practices

💡 Result:

Best practices for handling passwords with Singularity/Apptainer remote login focus on minimizing exposure of credentials, as they are stored unencrypted in ~/.singularity/remotes.yaml or ~/.apptainer/remote.yaml. Key recommendations from official documentation: 1. Use token-based authentication instead of passwords where possible. Generate API tokens or access tokens from the remote service (e.g., Sylabs Cloud, OCI registries like Docker Hub) and use them as the password field. Tokens can be revoked easily if compromised. Examples: - For SylabsCloud: Generate token at https://cloud.sylabs.io/auth/tokens and paste during login. - For OCI registries: Use personal access tokens as password/token, with required username (sometimes dummy like '00000000-0000-0000-0000-000000000000' for Azure ACR). 2. Avoid entering passwords directly on command line with --password, as they appear in process lists (e.g., ps output). Instead: - Use interactive prompt: apptainer remote login - Or --password-stdin to pipe password: echo "token" | apptainer remote login --password-stdin 3. Use --tokenfile to read from a secure file: apptainer remote login --tokenfile /path/to/token.txt 4. Be aware credentials are stored unencrypted in remote.yaml in your home directory. Protect access to your home directory (permissions 700 or similar). Logout with 'remote logout' when no longer needed. Do not commit remote.yaml to version control. 5. For automation/CI (non-interactive): - Use environment variables if supported (older Singularity had SINGULARITY_DOCKER_USERNAME/PASSWORD). - Prefer short-lived tokens. - Avoid persistent storage; login per session if possible. 6. Use --insecure only if necessary (HTTP instead of HTTPS), but avoid it. 7. For OCI registries specifically: Consult registry docs for CLI login requirements. Singularity/Apptainer reads ~/.docker/config.json if present (but not external helpers). 
Examples: apptainer remote login SylabsCloud # Interactive token prompt apptainer remote login --username foo --password-stdin docker://docker.io # Pipe token INFO: Token stored in /home/user/.apptainer/remote.yaml Official docs explicitly warn: "the login command will store the supplied credentials or tokens unencrypted in your home directory."



Update documentation to avoid passing secrets via CLI arguments.

Lines 294 and 304 show credentials passed as CLI arguments, which exposes them in process lists and shell history. Use stdin-based or interactive login flows instead:

Suggested safer patterns:

```diff
 # NGC
-docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>
+echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin

 # NGC
-singularity remote login --username '$oauthtoken' --password <ngc_api_key> docker://nvcr.io
+echo "$NGC_API_KEY" | singularity remote login --username '$oauthtoken' --password-stdin docker://nvcr.io
```

Alternatively, use interactive prompts (omit --password* flags to trigger interactive login).


- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
Contributor

⚠️ Potential issue | 🟡 Minor

Step list drift: final checklist is missing Step 7.5.

Line 31 adds Step 7.5, but the “Now, copy this checklist” block still skips it, so progress tracking is inconsistent.


DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

> **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead.
> **Note:** These categories come from NEL's `build-config` CLI. **Always run `nel skills build-config --help` first** to get the current options — they may differ from this list (e.g., `chat_reasoning` instead of separate `chat`/`reasoning`, `general_knowledge` instead of `standard`). Use the CLI's current options, not this list, when they conflict.
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```bash
cat -n .claude/skills/evaluation/SKILL.md | sed -n '75,85p'
```

Repository: NVIDIA/Model-Optimizer

Length of output: 1163


Resolve conflicting instructions on valid configuration options.

Line 78 hard-restricts choices to only those listed above ("DON'T ALLOW FOR ANY OTHER OPTIONS"), while Line 80 explicitly states that options "may differ from this list" and instructs users to "Use the CLI's current options, not this list, when they conflict." These contradictory directives can cause users to reject valid options provided by the actual nel skills build-config --help output.


Comment on lines +199 to +200
```bash
ssh <host> "cat ~/.config/enroot/.credentials 2>/dev/null"
```
Contributor

⚠️ Potential issue | 🟠 Major

Do not instruct raw credential-file output over SSH.

Line 199 prints full enroot credentials, which can leak secrets into logs/transcripts. Check only machine entries or registry keys instead.

Suggested safer command:

```diff
-ssh <host> "cat ~/.config/enroot/.credentials 2>/dev/null"
+ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
```

As per coding guidelines "Never hardcode secrets, credentials, tokens, passwords, or API keys in source code. Use environment variables or configuration files listed in .gitignore to store sensitive information."


Comment on lines +202 to +214
**Default behavior: use DockerHub image first.** If the job fails with a `401` or image pull error, fall back to the NGC alternative by adding `deployment.image` to the config:

```yaml
deployment:
  image: nvcr.io/nvidia/vllm:<YY.MM>-py3 # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
```

**Decision flow:**
1. Submit with the default DockerHub image (`vllm/vllm-openai:latest`)
2. If the job fails with image pull auth error (401) → check credentials
3. If DockerHub credentials can be added → add them and resubmit
4. If not → override `deployment.image` to the NGC vLLM image and resubmit
5. **Do not retry more than once** without fixing the auth issue
Contributor

⚠️ Potential issue | 🟠 Major

Decision flow contradicts the pre-submit auth gate.

Line 196 says verify auth before submission, but Line 210 starts by submitting first and reacting to failure. This reintroduces the queued-failure path the gate is meant to prevent.


Comment on lines +29 to +33
```bash
if [ -n "$EXTRA_PIP_DEPS" ]; then
    echo "Installing extra dependencies: $EXTRA_PIP_DEPS"
    unset PIP_CONSTRAINT
    pip install $EXTRA_PIP_DEPS
fi
```
Contributor

⚠️ Potential issue | 🟠 Major

Parse EXTRA_PIP_DEPS into an array before installing.

On Line 32, pip install $EXTRA_PIP_DEPS is vulnerable to unintended word splitting/globbing. This can install the wrong packages when dependency specs include special characters.

Suggested fix:

```diff
 if [ -n "$EXTRA_PIP_DEPS" ]; then
     echo "Installing extra dependencies: $EXTRA_PIP_DEPS"
     unset PIP_CONSTRAINT
-    pip install $EXTRA_PIP_DEPS
+    read -r -a extra_pip_deps <<< "$EXTRA_PIP_DEPS"
+    pip install "${extra_pip_deps[@]}"
 fi
```
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 32-32: Double quote to prevent globbing and word splitting.

(SC2086)



---

## 6. Container Registry Authentication
Contributor

This also applies to local container setup. As I suggested in @Edwardf0t1's PR, maybe we can put this in env-setup.md for setting HF_TOKEN, the Docker login token, and the NGC token.

| Runtime | Typical clusters | SLURM integration |
| --- | --- | --- |
| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
| **Singularity/Apptainer** | HPC / academic clusters | `singularity exec` inside job script |
Contributor

About Singularity and Apptainer: it seems we don't have clusters to test on. Our overall workflow didn't consider this; we only added templates for pyxis and Docker. Maybe we can wait for a community contribution on this, wdyt?
