Commit b1be817

Add NEL CI learnings: wrapper script pattern, cross-cluster copy, Hydra escaping
- Add wrapper script pattern for complex deployment commands that break Hydra's override parser (put serve.sh in checkpoint dir, reference as bash /checkpoint/serve.sh)
- Add NEL_CONFIG_BASE64 + Python trigger pattern for custom configs
- Add cross-cluster checkpoint copy via tar pipe
- Add Hydra LexerNoViableAltException and Bad Request to common issues

Learned from triggering full AA evaluation (MMLU-PRO, GPQA Diamond, LiveCodeBench, SCICODE, AIME 2025, Terminal-Bench Hard) for Devstral-Small-2-24B NVFP4 on oci-hsg via NEL CI.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 1b94fc9 commit b1be817

1 file changed

Lines changed: 84 additions & 2 deletions

.claude/skills/evaluation/references/nel-ci-guide.md

@@ -60,13 +60,26 @@ rsync -av /path/to/local/checkpoint \
     <cluster-login>:/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/
 ```

+**Cross-cluster copy** (e.g., dlcluster → oci-hsg): If the two clusters can't SSH to each other directly, pipe through your workstation without staging to disk:
+
+```bash
+ssh user@source-cluster "tar czf - -C /path/to/checkpoint ." | \
+    ssh user@target-cluster "tar xzf - -C /lustre/.../checkpoints/model-name"
+```
+
+After copying, set permissions for svc-jet: `chmod -R 777 /lustre/.../checkpoints/model-name`
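
The tar pipe above can also be driven from Python, which makes it easier to check the exit status of both ends of the pipe (a plain shell pipeline reports only the last command's status unless `pipefail` is set). The following is a minimal sketch, not part of the committed guide: the helper names `tar_pipe_commands` and `run_pipe` are hypothetical, and the `mkdir -p` on the target is an added assumption, since `tar -C` fails when the target directory does not exist.

```python
import shlex
import subprocess


def tar_pipe_commands(src_host, src_dir, dst_host, dst_dir):
    """Return (sender_argv, receiver_argv) for an `ssh ... | ssh ...` tar pipe.

    Hypothetical helper: paths are quoted for the remote shell with shlex.quote.
    """
    send = ["ssh", src_host, f"tar czf - -C {shlex.quote(src_dir)} ."]
    # Assumption: create the target directory first, because `tar -C` needs it.
    recv = [
        "ssh",
        dst_host,
        f"mkdir -p {shlex.quote(dst_dir)} && tar xzf - -C {shlex.quote(dst_dir)}",
    ]
    return send, recv


def run_pipe(send, recv):
    """Run `send | recv` and raise if either side exits non-zero."""
    sender = subprocess.Popen(send, stdout=subprocess.PIPE)
    receiver = subprocess.Popen(recv, stdin=sender.stdout)
    # Close our copy of the pipe so the sender sees SIGPIPE if the receiver dies.
    sender.stdout.close()
    recv_rc = receiver.wait()
    send_rc = sender.wait()
    if send_rc != 0 or recv_rc != 0:
        raise RuntimeError(f"copy failed: sender={send_rc}, receiver={recv_rc}")
```

Usage would look like `run_pipe(*tar_pipe_commands("user@source-cluster", "/path/to/checkpoint", "user@target-cluster", "/lustre/.../checkpoints/model-name"))`, with the elided path filled in for your cluster.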
+
 For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes.

 ---

 ## 4. NEL CI Trigger Pattern

-For JET clusters, trigger evaluations via the GitLab API. Use `NEL_DEPLOYMENT_COMMAND` (not `NEL_OTHER_OVERRIDES` with `deployment.extra_args`) because `NEL_OTHER_OVERRIDES` splits values on spaces, breaking multi-flag commands.
+For JET clusters, trigger evaluations via the GitLab API.
+
+### Simple deployment (standard models)
+
+For models that work with stock vLLM/SGLang, use `NEL_DEPLOYMENT_COMMAND` directly:

 ```bash
 export GITLAB_TOKEN=<your_gitlab_token>
@@ -82,7 +95,6 @@ curl -k --request POST \
     {"key": "NEL_CLUSTER", "value": "oci-hsg"},
     {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"},
     {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"},
-    {"key": "NEL_TASKS", "value": "simple_evals.gpqa_diamond_aa_v3"},
     {"key": "NEL_DEPLOYMENT_COMMAND", "value": "vllm serve /checkpoint --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --quantization modelopt_fp4 --trust-remote-code --served-model-name my-model"},
     {"key": "NEL_OTHER_OVERRIDES", "value": "deployment.tensor_parallel_size=4 execution.walltime=04:00:00"},
     {"key": "NEL_HF_HOME", "value": "/lustre/.../cache/huggingface"},
@@ -93,6 +105,74 @@ curl -k --request POST \
     "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline"
 ```

+### Complex deployment (unsupported models needing runtime patches)
+
+If the model needs runtime patches (e.g., transformers upgrade, framework source fixes), **do NOT put multi-step commands in `NEL_DEPLOYMENT_COMMAND`** — Hydra's override parser will break on nested quotes, `&&`, `$()`, etc.
+
+Instead, use the **wrapper script pattern**: place a `serve.sh` in the checkpoint directory on the cluster, then point `NEL_DEPLOYMENT_COMMAND` to it.
+
+**Step 1** — Write wrapper script to the checkpoint directory on the cluster:
+
+```bash
+ssh <cluster-login> 'cat > /lustre/.../checkpoint/serve.sh << '"'"'EOF'"'"'
+#!/bin/bash
+set -e
+pip install "transformers>=5.0.0.dev0" "huggingface_hub>=0.32.0" --pre -q
+# Patch vLLM for ministral3 support (example)
+MISTRAL3_PY=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1)
+sed -i "s/old_pattern/new_pattern/" "$MISTRAL3_PY"
+exec vllm serve /checkpoint --host 0.0.0.0 --port 8000 \
+    --tensor-parallel-size 4 --quantization modelopt_fp4 \
+    --trust-remote-code --served-model-name my-model --gpu-memory-utilization 0.9
+EOF
+chmod 777 /lustre/.../checkpoint/serve.sh'
+```
+
+**Step 2** — Set `NEL_DEPLOYMENT_COMMAND` to the wrapper:
+
+```
+{"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"}
+```
+
+This works because the checkpoint is mounted at `/checkpoint` inside the container. The script is Hydra-safe (no special characters in the override value).
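
To decide between the simple and wrapper-script patterns before triggering a pipeline, a rough pre-flight check can scan the intended `NEL_DEPLOYMENT_COMMAND` for the constructs the guide names as problematic (quotes, `&&`, `$()`). This is only a sketch: the helper `risky_tokens` and its token list are hypothetical, and they do not reproduce Hydra's actual override grammar, so a clean result is not a guarantee that Hydra will accept the value.

```python
# Constructs the guide names as breaking Hydra's override parser.
# Assumption: this list is illustrative, not Hydra's real grammar.
RISKY_TOKENS = ("'", '"', "&&", "$(")


def risky_tokens(command: str) -> list[str]:
    """Return the risky constructs found in a deployment command string."""
    return [tok for tok in RISKY_TOKENS if tok in command]


# A single-step command with plain flags reports nothing.
print(risky_tokens("bash /checkpoint/serve.sh"))

# A multi-step command with quotes and && is flagged; use serve.sh instead.
print(risky_tokens('pip install "transformers>=5.0.0.dev0" && vllm serve /checkpoint'))
```

If the list is non-empty, moving the command into `serve.sh` as described above avoids the parser entirely.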
+
+### Custom configs with `NEL_CONFIG_BASE64`
+
+When using a custom config (not from the repo), use `NEL_CONFIG_BASE64` instead of `NEL_CONFIG_PATH`. This requires setting `NEL_UNTRUSTED_EVAL=true`:
+
+```python
+import json, base64, subprocess, os
+
+with open("my_config.yaml") as f:
+    config_b64 = base64.b64encode(f.read().encode()).decode()
+
+payload = {
+    "ref": "main",
+    "variables": [
+        {"key": "NEL_CONFIG_BASE64", "value": config_b64},
+        {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"},
+        {"key": "NEL_CLUSTER", "value": "oci-hsg"},
+        {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"},
+        {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"},
+        {"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"},
+        {"key": "NEL_UNTRUSTED_EVAL", "value": "true"},
+        # ... other variables
+    ]
+}
+
+# Use Python to construct JSON (avoids shell escaping issues with curl)
+token = os.environ["GITLAB_TOKEN"]
+subprocess.run(
+    ["curl", "-k", "--request", "POST",
+     "--header", f"PRIVATE-TOKEN: {token}",
+     "--header", "Content-Type: application/json",
+     "--data", json.dumps(payload),
+     "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline"],
+)
+```
+
+> **Tip**: Use Python (not bash) to construct the JSON payload for `curl`. Shell escaping of base64 strings and nested quotes is error-prone.
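
The tip can be illustrated with a small round-trip check: build the payload with `json.dumps`, parse it back, and confirm the config decodes to the original text even when it contains quotes, `$()` and `&&`. This is a sketch under stated assumptions: `build_payload` is a hypothetical helper, and the config text is a made-up placeholder rather than a real NEL config.

```python
import base64
import json


def build_payload(config_text: str, variables: dict[str, str], ref: str = "main") -> str:
    """Serialize a pipeline-trigger body with the config embedded as base64.

    Hypothetical helper; the variable names follow the guide's example.
    """
    config_b64 = base64.b64encode(config_text.encode()).decode()
    merged = {"NEL_CONFIG_BASE64": config_b64, "NEL_UNTRUSTED_EVAL": "true", **variables}
    return json.dumps(
        {"ref": ref, "variables": [{"key": k, "value": v} for k, v in merged.items()]}
    )


# Placeholder config text with characters that are awkward to escape in bash.
config = 'note: "quotes, $(subshells) && ampersands"\n'
body = build_payload(config, {"NEL_CLUSTER": "oci-hsg"})

# Round trip: the body is valid JSON and the config survives unchanged.
parsed = json.loads(body)
by_key = {v["key"]: v["value"] for v in parsed["variables"]}
restored = base64.b64decode(by_key["NEL_CONFIG_BASE64"]).decode()
assert restored == config
```

Running a check like this locally before calling the API separates payload-construction problems from server-side rejections.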
+
 ---

 ## 5. Environment Variables
@@ -173,6 +253,8 @@ evaluation:
 | `NEL_OTHER_OVERRIDES` splits `extra_args` | Space-separated parsing breaks multi-flag values | Use `NEL_DEPLOYMENT_COMMAND` instead |
 | Checkpoint not found in container | Path not on cluster compute-node filesystem | Copy checkpoint to `/lustre/` (or cluster-accessible path) first |
 | `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config |
+| `LexerNoViableAltException` in Hydra | `NEL_DEPLOYMENT_COMMAND` contains quotes, `&&`, `$()` | Use wrapper script pattern (section 4): put script in checkpoint dir, set command to `bash /checkpoint/serve.sh` |
+| `Bad Request` from GitLab API trigger | Shell escaping mangled the JSON payload | Use Python to construct JSON (section 4) instead of bash heredocs/string interpolation |

 ---
