You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add NEL CI learnings: wrapper script pattern, cross-cluster copy, Hydra escaping
- Add wrapper script pattern for complex deployment commands that break
Hydra's override parser (put serve.sh in checkpoint dir, reference as
bash /checkpoint/serve.sh)
- Add NEL_CONFIG_BASE64 + Python trigger pattern for custom configs
- Add cross-cluster checkpoint copy via tar pipe
- Add Hydra LexerNoViableAltException and Bad Request to common issues
Learned from triggering full AA evaluation (MMLU-PRO, GPQA Diamond,
LiveCodeBench, SCICODE, AIME 2025, Terminal-Bench Hard) for
Devstral-Small-2-24B NVFP4 on oci-hsg via NEL CI.
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
**Cross-cluster copy** (e.g., dlcluster → oci-hsg): If the two clusters can't SSH to each other directly, pipe through your workstation without staging to disk:
After copying, set permissions for svc-jet: `chmod -R 777 /lustre/.../checkpoints/model-name`
71
+
63
72
For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes.
64
73
65
74
---
66
75
67
76
## 4. NEL CI Trigger Pattern
68
77
69
-
For JET clusters, trigger evaluations via the GitLab API. Use `NEL_DEPLOYMENT_COMMAND` (not `NEL_OTHER_OVERRIDES` with `deployment.extra_args`) because `NEL_OTHER_OVERRIDES` splits values on spaces, breaking multi-flag commands.
78
+
For JET clusters, trigger evaluations via the GitLab API.
79
+
80
+
### Simple deployment (standard models)
81
+
82
+
For models that work with stock vLLM/SGLang, use `NEL_DEPLOYMENT_COMMAND` directly:
If the model needs runtime patches (e.g., transformers upgrade, framework source fixes), **do NOT put multi-step commands in `NEL_DEPLOYMENT_COMMAND`** — Hydra's override parser will break on nested quotes, `&&`, `$()`, etc.
111
+
112
+
Instead, use the **wrapper script pattern**: place a `serve.sh` in the checkpoint directory on the cluster, then point `NEL_DEPLOYMENT_COMMAND` to it.
113
+
114
+
**Step 1** — Write wrapper script to the checkpoint directory on the cluster:
This works because the checkpoint is mounted at `/checkpoint` inside the container. The script is Hydra-safe (no special characters in the override value).
138
+
139
+
### Custom configs with `NEL_CONFIG_BASE64`
140
+
141
+
When using a custom config (not from the repo), use `NEL_CONFIG_BASE64` instead of `NEL_CONFIG_PATH`. This requires setting `NEL_UNTRUSTED_EVAL=true`:
| Checkpoint not found in container | Path not on cluster compute-node filesystem | Copy checkpoint to `/lustre/` (or cluster-accessible path) first |
175
255
| `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config |
256
+
| `LexerNoViableAltException` in Hydra | `NEL_DEPLOYMENT_COMMAND` contains quotes, `&&`, `$()` | Use wrapper script pattern (section 4): put script in checkpoint dir, set command to `bash /checkpoint/serve.sh` |
257
+
| `Bad Request` from GitLab API trigger | Shell escaping mangled the JSON payload | Use Python to construct JSON (section 4) instead of bash heredocs/string interpolation |
0 commit comments