Skip to content

feat(models): vllm k8s support#305

Draft
benmccown wants to merge 5 commits into
mainfrom
vllm-k8s-support/bmccown
Draft

feat(models): vllm k8s support#305
benmccown wants to merge 5 commits into
mainfrom
vllm-k8s-support/bmccown

Conversation

@benmccown

Copy link
Copy Markdown
Contributor

No description provided.

Signed-off-by: Ben McCown <bmccown@nvidia.com>
Fixes found validating the k8s vLLM path on a real cluster:
- model-source and health-path are stored as annotations, not labels
  (their values contain '/', invalid for k8s label values).
- puller Job runs the image entrypoint via container args, not command
  (command would override the entrypoint and exec the first token).
- puller HF_ENDPOINT resolves the cluster-routable Files URL (service
  discovery / base_url), bypassing the in-process local-service routing
  that returns localhost (unreachable from the puller pod).
- release the RWO weights volume before serving: delete the completed
  puller Job at P3 so its pod releases the volume attachment, avoiding a
  Multi-Attach error when the server pod schedules on another node. Only
  on the success path; a failed Job is kept for status/log reporting.
- status reads the Deployment first (source of truth once created); a
  missing Job with the PVC still present resumes P3 instead of reporting
  LOST (prevents a re-pull loop via drift recovery).
- pod securityContext: do not force runAsUser on the server pod (images
  like vLLM lack a passwd entry for uid 1000 -> getpwuid crash); the
  puller keeps fsGroup so it can write the freshly-provisioned PVC.
- orphan listing tolerates NIM CRD 403 quietly on the vLLM-only path.

Signed-off-by: Ben McCown <bmccown@nvidia.com>
Teardown deletes every resource a deployment could own, by name, across
both the operator path (NIMService/NIMCache CRs) and the directly-emitted
vLLM path (Deployment/Service/Job/PVC). delete_model_deployment has only
workspace/name (no engine/config -- it is also driven by orphan
reconciliation), so attempting all resource types by name is the correct,
idempotent, self-healing teardown.

Previously a single delete failure (notably a 403 deleting a NIMService
when the ServiceAccount lacked that RBAC on the vLLM-only path) raised
out of the routine and aborted the whole teardown, leaving the deployment
stuck in DELETING and re-erroring every reconcile.

Now each delete is independent and 404-tolerant ('already gone' is
success); non-404 failures are logged concisely (no stack trace) and
aggregated, so one resource's failure never blocks the others, and the
result is ERROR (visible/stuck) rather than a false DELETED that would
orphan cluster resources.

The platform Helm chart's models ServiceAccount must grant get/list/
watch/delete on apps.nvidia.com nimservices/nimcaches (in addition to the
core/apps/batch resources) so engine-agnostic teardown does not 403.

Signed-off-by: Ben McCown <bmccown@nvidia.com>
The vLLM puller + server pods previously inherited the NIM-oriented default
securityContext (or none). Running the server as an arbitrary uid 1000 made
torch/inductor crash at startup (getpass.getuser -> pwd.getpwuid: 'uid not
found: 1000') because the vllm/vllm-openai image has no /etc/passwd entry for
1000, and running unset defaulted to root.

Inspecting the image, its provisioned non-root user is 'vllm' (uid 2000, gid 0)
with a real passwd entry. Add dedicated default_vllm_user_id (2000) /
default_vllm_group_id (0) config and apply them to both the puller and the
server: the server runs as a non-root user that resolves cleanly, and the
puller writes the weights under the same uid/gid so the server can read them.
Operators can override via config; we do not hardcode root.

Signed-off-by: Ben McCown <bmccown@nvidia.com>
… raw-object migration

When NIM is migrated off k8s-nim-operator onto the shared raw-object
compilers (vllm_k8s_compiler), the securityContext uid/gid params are
engine-specific and must not be shared: vLLM uses 2000/0 (its image's
'vllm' user, which has an /etc/passwd entry), while NIM images expect the
operator's historical 1000/2000. Reusing default_vllm_user_id for NIM (or
hardcoding either in the compiler) would reintroduce the getpwuid crash or
break NIM. Add a FUTURE note in the compiler module docstring and
cross-references on the NIM uid/gid config fields and the vLLM puller call
site so a future implementer (possibly not us) does not hit this.

Signed-off-by: Ben McCown <bmccown@nvidia.com>
@benmccown benmccown self-assigned this Jun 12, 2026
@github-actions github-actions Bot added the feat label Jun 12, 2026
@github-actions

Copy link
Copy Markdown
Contributor
Suite Lines Covered Line Rate Branch Rate
Unit Tests 19033/25227 75.4% 61.3%
Integration Tests 11028/23999 46.0% 20.4%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant