fix(examples): vllm example uses python3 + moves port into yaml

V2arK · V2arK · commit 82decf1dca57 · 2026-05-21T17:51:48.000-04:00
Found during dev B200 verification (Qwen2.5-0.5B variant of this shape):

- vllm/vllm-openai image has python3 only, no `python` symlink — the
  K8s container fails to start with `exec: "python": executable file
  not found in $PATH` because the platform's `command` field overrides
  the image's ENTRYPOINT entirely.
- The platform helm path passes numeric arg tokens to the Rollout spec
  as integers, and the K8s API server rejects them (`args[3] ... must
  be of type string: "integer"`). Moving `port: 8000` into the YAML
  config keeps every CLI token a non-numeric string while still letting
  vLLM pick up the port — and matches the "all config in one file"
  intent of --config.

Verified end-to-end on dev (cluster c-01-c-11-centml-org, hw x1-large-b200):
created deployment via SDK → rollout HEALTHY at t+76s → POST
/v1/chat/completions returned HTTP 200 with a real Qwen completion.

Signed-off-by: Honglin Cao &lt;hocao@nvidia.com&gt;
diff --git a/examples/sdk/create_inference_vllm.py b/examples/sdk/create_inference_vllm.py
@@ -27,7 +27,7 @@ def main():
             healthcheck="/health",
             concurrency=10,
             env_vars={"HF_TOKEN": "<your-hf-token>"},
-            command="python -m vllm.entrypoints.openai.api_server --port 8000 --config /etc/vllm/vllm_config.yaml",
+            command="python3 -m vllm.entrypoints.openai.api_server --config /etc/vllm/vllm_config.yaml",
             config_file=load_config_file_mount(path="./vllm_config.yaml", mount_path="/etc/vllm"),
         )
         response = cclient.create_inference(request)
diff --git a/examples/sdk/vllm_config.yaml b/examples/sdk/vllm_config.yaml
@@ -1,3 +1,4 @@
+port: 8000
 model: meta-llama/Llama-3.1-8B-Instruct
 tokenizer: meta-llama/Llama-3.1-8B-Instruct
 runner: generate

Original file line number	Diff line number	Diff line change
`@@ -27,7 +27,7 @@ def main():`
`27`	`27`	`healthcheck="/health",`
`28`	`28`	`concurrency=10,`
`29`	`29`	`env_vars={"HF_TOKEN": "<your-hf-token>"},`
`30`		`- command="python -m vllm.entrypoints.openai.api_server --port 8000 --config /etc/vllm/vllm_config.yaml",`
	`30`	`+ command="python3 -m vllm.entrypoints.openai.api_server --config /etc/vllm/vllm_config.yaml",`
`31`	`31`	`config_file=load_config_file_mount(path="./vllm_config.yaml", mount_path="/etc/vllm"),`
`32`	`32`	`)`
`33`	`33`	`response = cclient.create_inference(request)`
Original file line number	Diff line number	Diff line change
`@@ -1,3 +1,4 @@`
	`1`	`+port: 8000`
`1`	`2`	`model: meta-llama/Llama-3.1-8B-Instruct`
`2`	`3`	`tokenizer: meta-llama/Llama-3.1-8B-Instruct`
`3`	`4`	`runner: generate`