feat(models): vllm k8s support by benmccown · Pull Request #305 · NVIDIA-NeMo/nemo-platform

benmccown · 2026-06-12T17:19:59Z

No description provided.

Signed-off-by: Ben McCown <bmccown@nvidia.com>

Fixes found validating the k8s vLLM path on a real cluster:

- model-source and health-path are stored as annotations, not labels (their values contain '/', which is invalid in k8s label values).
- puller Job runs the image entrypoint via container args, not command (command would override the entrypoint and exec the first token).
- puller HF_ENDPOINT resolves the cluster-routable Files URL (service discovery / base_url), bypassing the in-process local-service routing that returns localhost (unreachable from the puller pod).
- release the RWO weights volume before serving: delete the completed puller Job at P3 so its pod releases the volume attachment, avoiding a Multi-Attach error when the server pod schedules on another node. Only on the success path; a failed Job is kept for status/log reporting.
- status reads the Deployment first (source of truth once created); a missing Job with the PVC still present resumes P3 instead of reporting LOST (prevents a re-pull loop via drift recovery).
- pod securityContext: do not force runAsUser on the server pod (images like vLLM lack a passwd entry for uid 1000 -> getpwuid crash); the puller keeps fsGroup so it can write the freshly-provisioned PVC.
- orphan listing tolerates NIM CRD 403 quietly on the vLLM-only path.

Signed-off-by: Ben McCown <bmccown@nvidia.com>
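The first fix follows from the Kubernetes label-value syntax: at most 63 characters, alphanumeric at both ends, with only '-', '_' and '.' allowed in between. A minimal sketch of that check is below. The helper names and the example values are hypothetical and not taken from this PR; the PR itself always stores model-source and health-path as annotations rather than deciding per value.

```python
import re

# Kubernetes label-value syntax: empty, or alphanumeric at both ends with
# '-', '_', '.' allowed in the middle; at most 63 characters in total.
_LABEL_VALUE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """Return True if `value` is acceptable as a Kubernetes label value."""
    return len(value) <= 63 and _LABEL_VALUE.fullmatch(value) is not None


def split_metadata(items: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Hypothetical helper: route label-safe values to labels, the rest to annotations."""
    labels: dict[str, str] = {}
    annotations: dict[str, str] = {}
    for key, value in items.items():
        target = labels if is_valid_label_value(value) else annotations
        target[key] = value
    return labels, annotations


# Example values are made up for illustration.
labels, annotations = split_metadata(
    {"engine": "vllm", "model-source": "hf://org/model", "health-path": "/health"}
)
```

With these inputs, only `engine` stays a label; both values containing '/' end up as annotations.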

Teardown deletes every resource a deployment could own, by name, across both the operator path (NIMService/NIMCache CRs) and the directly-emitted vLLM path (Deployment/Service/Job/PVC). delete_model_deployment has only workspace/name (no engine/config -- it is also driven by orphan reconciliation), so attempting all resource types by name is the correct, idempotent, self-healing teardown.

Previously a single delete failure (notably a 403 deleting a NIMService when the ServiceAccount lacked that RBAC on the vLLM-only path) raised out of the routine and aborted the whole teardown, leaving the deployment stuck in DELETING and re-erroring every reconcile. Now each delete is independent and 404-tolerant ('already gone' is success); non-404 failures are logged concisely (no stack trace) and aggregated, so one resource's failure never blocks the others, and the result is ERROR (visible/stuck) rather than a false DELETED that would orphan cluster resources.

The platform Helm chart's models ServiceAccount must grant get/list/watch/delete on apps.nvidia.com nimservices/nimcaches (in addition to the core/apps/batch resources) so engine-agnostic teardown does not 403.

Signed-off-by: Ben McCown <bmccown@nvidia.com>
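The independent, 404-tolerant delete loop described in this commit might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `ApiError`, `TeardownResult` and `teardown_all` are hypothetical names, and the per-resource delete callables stand in for whatever Kubernetes client calls the platform actually makes.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Iterable

log = logging.getLogger("teardown")


class ApiError(Exception):
    """Hypothetical stand-in for a Kubernetes client error carrying an HTTP status."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"{status} {reason}".strip())
        self.status = status


@dataclass
class TeardownResult:
    deleted: list[str] = field(default_factory=list)
    failures: dict[str, str] = field(default_factory=dict)

    @property
    def status(self) -> str:
        # Any real failure surfaces as ERROR instead of a false DELETED.
        return "ERROR" if self.failures else "DELETED"


def teardown_all(deleters: Iterable[tuple[str, Callable[[], None]]]) -> TeardownResult:
    """Attempt every delete; one failure never blocks the remaining resources."""
    result = TeardownResult()
    for kind, delete in deleters:
        try:
            delete()
        except ApiError as exc:
            if exc.status != 404:
                # Concise log line, no stack trace; keep going.
                log.warning("delete %s failed: %s", kind, exc)
                result.failures[kind] = str(exc)
                continue
            # 404 means the resource is already gone, which counts as success.
        result.deleted.append(kind)
    return result
```

A caller would pass one `(kind, callable)` pair per resource type (NIMService, NIMCache, Deployment, Service, Job, PVC) and map `result.status` onto the deployment state.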

The vLLM puller + server pods previously inherited the NIM-oriented default securityContext (or none). Running the server as an arbitrary uid 1000 made torch/inductor crash at startup (getpass.getuser -> pwd.getpwuid: 'uid not found: 1000') because the vllm/vllm-openai image has no /etc/passwd entry for 1000, and running unset defaulted to root.

Inspecting the image, its provisioned non-root user is 'vllm' (uid 2000, gid 0) with a real passwd entry. Add dedicated default_vllm_user_id (2000) / default_vllm_group_id (0) config and apply them to both the puller and the server: the server runs as a non-root user that resolves cleanly, and the puller writes the weights under the same uid/gid so the server can read them. Operators can override via config; we do not hardcode root.

Signed-off-by: Ben McCown <bmccown@nvidia.com>
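The crash mechanism can be reproduced with the standard library alone: `pwd.getpwuid` raises `KeyError` when the uid has no passwd entry, and `getpass.getuser` falls back to that lookup when no user-name environment variable is set. The sketch below is a hypothetical pre-flight check, not part of this PR; the injectable `lookup` parameter is only there so it can be exercised without a real container image.

```python
import pwd
from typing import Callable


def uid_has_passwd_entry(uid: int, lookup: Callable[[int], object] = pwd.getpwuid) -> bool:
    """Return True if `uid` resolves to a passwd entry, False if the lookup fails."""
    try:
        lookup(uid)
    except KeyError:
        # pwd.getpwuid raises KeyError for a uid with no passwd entry.
        return False
    return True


def fake_vllm_image_lookup(uid: int) -> str:
    """Toy lookup mimicking an image whose only provisioned non-root user is uid 2000."""
    users = {0: "root", 2000: "vllm"}
    return users[uid]  # raises KeyError for any other uid, like pwd.getpwuid
```

Against the toy lookup, uid 2000 resolves and uid 1000 does not, which mirrors the behaviour the commit message reports for the vllm/vllm-openai image.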

… raw-object migration

When NIM is migrated off k8s-nim-operator onto the shared raw-object compilers (vllm_k8s_compiler), the securityContext uid/gid params are engine-specific and must not be shared: vLLM uses 2000/0 (its image's 'vllm' user, which has an /etc/passwd entry), while NIM images expect the operator's historical 1000/2000. Reusing default_vllm_user_id for NIM (or hardcoding either in the compiler) would reintroduce the getpwuid crash or break NIM.

Add a FUTURE note in the compiler module docstring and cross-references on the NIM uid/gid config fields and the vLLM puller call site so a future implementer (possibly not us) does not hit this.

Signed-off-by: Ben McCown <bmccown@nvidia.com>
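One way to keep the uid/gid defaults engine-specific is to key them by engine and refuse to fall back across engines. The sketch below assumes a simple lookup table; `PodIdentity`, `ENGINE_DEFAULTS` and `pod_security_context` are hypothetical names, and the choice of securityContext fields (runAsUser, runAsGroup, fsGroup) is an assumption rather than what the compiler necessarily emits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PodIdentity:
    user_id: int
    group_id: int


# Values from the commit message: vLLM 2000/0, NIM 1000/2000 (uid/gid).
ENGINE_DEFAULTS = {
    "vllm": PodIdentity(user_id=2000, group_id=0),
    "nim": PodIdentity(user_id=1000, group_id=2000),
}


def pod_security_context(engine: str, override: Optional[PodIdentity] = None) -> dict:
    """Build a pod securityContext dict for `engine`, honouring an operator override."""
    if engine not in ENGINE_DEFAULTS:
        # Never borrow another engine's identity; fail loudly instead.
        raise ValueError(f"no securityContext defaults for engine {engine!r}")
    identity = override or ENGINE_DEFAULTS[engine]
    return {
        "runAsUser": identity.user_id,
        "runAsGroup": identity.group_id,
        "fsGroup": identity.group_id,
    }
```

Raising on an unknown engine, rather than defaulting to the vLLM values, is the design choice that guards against the cross-engine reuse the note warns about.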

github-actions · 2026-06-12T17:30:37Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	19033/25227	75.4%	61.3%
Integration Tests	11028/23999	46.0%	20.4%

benmccown added 5 commits June 11, 2026 10:43

feat(models): vllm k8s support

56ecd84

Signed-off-by: Ben McCown <bmccown@nvidia.com>

benmccown self-assigned this Jun 12, 2026

github-actions Bot added the feat label Jun 12, 2026

benmccown wants to merge 5 commits into main from vllm-k8s-support/bmccown
