diff --git a/inference/a4x/disaggregated-serving/dynamo/README.md b/inference/a4x/disaggregated-serving/dynamo/README.md
new file mode 100644
index 00000000..c53cdba0
--- /dev/null
+++ b/inference/a4x/disaggregated-serving/dynamo/README.md
@@ -0,0 +1,340 @@
+# Disaggregated Multi-Node Inference with NVIDIA Dynamo on A4X GKE
+
+This document outlines the steps to deploy and serve Large Language Models (LLMs) using the [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) disaggregated inference platform on [A4X GKE node pools](https://cloud.google.com/kubernetes-engine).
+
+Dynamo provides a disaggregated architecture that separates prefill and decode operations for optimized inference performance, supporting both single-node (4 GPUs) and multi-node NVL72 (72 GPUs) configurations. Dynamo also supports multiple inference framework backends, such as [vLLM](https://docs.nvidia.com/dynamo/latest/components/backends/vllm/README.html) and [SGLang](https://docs.nvidia.com/dynamo/latest/components/backends/sglang/README.html). This recipe focuses on serving with the SGLang backend.
+
+
+## Table of Contents
+
+* [1. Test Environment](#1-test-environment)
+* [2. Environment Setup (One-Time)](#2-environment-setup-one-time)
+  * [2.1. Clone the Repository](#21-clone-the-repository)
+  * [2.2. Configure Environment Variables](#22-configure-environment-variables)
+  * [2.3. Connect to your GKE Cluster](#23-connect-to-your-gke-cluster)
+  * [2.4. Create Secrets](#24-create-secrets)
+  * [2.5. Install Dynamo Platform](#25-install-dynamo-platform-one-time-setup)
+  * [2.6. Setup GCS Bucket for GKE](#26-setup-gcs-bucket-for-gke-one-time-setup)
+  * [2.7. Build Dynamo Image](#27-build-dynamo-image)
+* [3. Deploy with SGLang Backend](#3-deploy-with-sglang-backend)
+  * [3.1. SGLang Deployment without DeepEP (8 GPUs)](#31-sglang-deployment-without-deepep-8-gpus)
+  * [3.2. SGLang Deployment with DeepEP (72 GPUs)](#32-sglang-deployment-with-deepep-72-gpus)
+* [4. Inference Request](#4-inference-request)
+* [5. Monitoring and Troubleshooting](#5-monitoring-and-troubleshooting)
+* [6. Cleanup](#6-cleanup)
+
+
+## 1. Test Environment
+
+[Back to Top](#table-of-contents)
+
+This recipe has been tested with the following configuration:
+
+* **GKE Cluster**:
+  * GPU node pools with [a4x-highgpu-4g](https://docs.cloud.google.com/compute/docs/gpus#gb200-gpus) machines (4 GPUs each):
+    * 2 machines (8 GPUs total) for the deployment without DeepEP
+    * 18 machines (72 GPUs total) for the deployment with DeepEP
+  * [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled
+  * [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled
+
+> [!IMPORTANT]
+> To prepare the required environment, see the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a4x.md).
+
+
+## 2. Environment Setup (One-Time)
+
+[Back to Top](#table-of-contents)
+
+
+### 2.1. Clone the Repository
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=$(pwd)
+export RECIPE_ROOT=$REPO_ROOT/inference/a4x/disaggregated-serving/dynamo
+```
+
+
+### 2.2. Configure Environment Variables
+
+```bash
+export PROJECT_ID=
+export CLUSTER_REGION=
+export CLUSTER_NAME=
+export NAMESPACE=dynamo-cloud
+export NGC_API_KEY=
+export HF_TOKEN=
+export RELEASE_VERSION=0.7.0
+export GCS_BUCKET=
+
+# Set the project for gcloud commands
+gcloud config set project $PROJECT_ID
+```
+
+Replace the following values:
+
+| Variable | Description | Example |
+| -------- | ----------- | ------- |
+| `PROJECT_ID` | Your Google Cloud Project ID | `gcp-project-12345` |
+| `CLUSTER_REGION` | The GCP region where your GKE cluster is located | `us-central1` |
+| `CLUSTER_NAME` | The name of your GKE cluster | `a4x-cluster` |
+| `NGC_API_KEY` | Your NVIDIA NGC API key (get one from [NGC](https://ngc.nvidia.com)) | `nvapi-xxx...` |
+| `HF_TOKEN` | Your Hugging Face access token | `hf_xxx...` |
+| `GCS_BUCKET` | Your Cloud Storage bucket name, without the `gs://` prefix | `my-model-bucket` |
+
+
+### 2.3. Connect to your GKE Cluster
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+
+### 2.4. Create Secrets
+
+Create the namespace:
+```bash
+kubectl create namespace ${NAMESPACE}
+kubectl config set-context --current --namespace=$NAMESPACE
+```
+
+Create the Docker registry secret for the NVIDIA Container Registry:
+```bash
+kubectl create secret docker-registry nvcr-secret \
+  --namespace=${NAMESPACE} \
+  --docker-server=nvcr.io \
+  --docker-username='$oauthtoken' \
+  --docker-password=${NGC_API_KEY}
+```
+
+Create the secret for the Hugging Face token:
+```bash
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
+
+
+### 2.5. Install Dynamo Platform (One-Time Setup)
+
+Add the NVIDIA Helm repository:
+```bash
+helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
+  --username='$oauthtoken' --password=${NGC_API_KEY}
+helm repo update
+```
+
+Fetch the Dynamo Helm charts:
+```bash
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
+```
+
+Install the Dynamo CRDs:
+```bash
+helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz \
+  --namespace default \
+  --wait \
+  --atomic
+```
+
+Install the Dynamo Platform with Grove and the KAI scheduler enabled:
+```bash
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
+  --namespace ${NAMESPACE} --set grove.enabled=true --set kai-scheduler.enabled=true
+```
+
+Verify the installation:
+```bash
+kubectl get pods -n ${NAMESPACE}
+```
+
+Wait until all pods show a `Running` status before proceeding.
+
+
+### 2.6. Setup GCS Bucket for GKE (One-Time Setup)
+
+We recommend using [Cloud Storage FUSE](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-setup) to serve the model weights from a Cloud Storage bucket; this speeds up model access and avoids [Hugging Face rate limits](https://huggingface.co/docs/hub/en/rate-limits#hub-rate-limits).
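+
+The deployment commands in this recipe mount the bucket at `/data/model` and point the workers at `/data/model/deepseek-ai/DeepSeek-R1`, so the weights are expected under `deepseek-ai/DeepSeek-R1` inside the bucket. A minimal staging sketch (illustrative paths; run it from any machine with write access to the bucket, and note that the checkpoint is several hundred gigabytes):
+
+```bash
+# Download the weights from Hugging Face, then copy them into the bucket.
+pip install -U "huggingface_hub[cli]"
+huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir ./deepseek-ai/DeepSeek-R1
+gcloud storage cp -r ./deepseek-ai gs://${GCS_BUCKET}/
+```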
+
+Next, configure Workload Identity Federation so that pods in the namespace can read the bucket.
+
+List the Kubernetes service accounts and their annotated IAM service accounts (the annotation is usually on `default`):
+```bash
+kubectl get serviceaccounts -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.iam\.gke\.io/gcp-service-account}{"\n"}{end}'
+```
+
+Capture the service account email:
+```bash
+export SERVICE_ACCOUNT_EMAIL=$(kubectl get serviceaccount/default -n ${NAMESPACE} -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}')
+```
+
+Authorize the Kubernetes service account to impersonate the IAM service account:
+```bash
+gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT_EMAIL} \
+  --role roles/iam.workloadIdentityUser \
+  --member "serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/default]"
+```
+
+Grant read access to the bucket:
+```bash
+gcloud storage buckets add-iam-policy-binding gs://${GCS_BUCKET} \
+  --member "serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
+  --role "roles/storage.objectViewer"
+```
+
+Finally, make sure the model files are in the bucket (see the staging example above) and set your bucket name in the values file, or pass it with `--set volumes.gcsfuse.bucketName=${GCS_BUCKET}` as shown in the deployment commands below.
+
+
+### 2.7. Build Dynamo Image
+
+Follow the [Dynamo container guide](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md) to build the image, then push it to your Artifact Registry repository.
+
+Build the image like this:
+```bash
+docker build -f container/Dockerfile.sglang . -t dynamo-wideep --no-cache --build-arg DYNAMO_VERSION=0.7.0 --platform linux/arm64
+```
+
+Set the image reference to the full path of the image you pushed (for example, `us-central1-docker.pkg.dev/<project>/<repo>/dynamo-wideep:latest`):
+```bash
+export ARTIFACT_REGISTRY=
+```
+
+
+## 3. Deploy with SGLang Backend
+
+[Back to Top](#table-of-contents)
+
+Deploy Dynamo with the SGLang backend for high-performance disaggregated inference.
+
+
+### 3.1. SGLang Deployment without DeepEP (8 GPUs)
+
+The two-node deployment uses 8 GPUs across 2 A4X machines and targets low latency.
+
+#### DeepSeekR1 671B Model
+
+Deploy DeepSeek-R1 671B across 2 nodes for testing and validation. Note the use of `--set-file prefill_serving_config` and `--set-file decode_serving_config`, which point to the serving configuration files for this topology.
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values_wo_deepep.yaml \
+--set workload.image=${ARTIFACT_REGISTRY} \
+--set volumes.gcsfuse.bucketName=${GCS_BUCKET} \
+--set-file prefill_serving_config=$REPO_ROOT/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-1p1d-prefill.yaml \
+--set-file decode_serving_config=$REPO_ROOT/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-1p1d-decode.yaml \
+$USER-dynamo-a4x-1p1d \
+$REPO_ROOT/src/helm-charts/a4x/inference-templates/dynamo-deployment
+```
+
+
+### 3.2. SGLang Deployment with DeepEP (72 GPUs)
+
+The multi-node deployment uses 72 GPUs across 18 A4X machines (10 prefill nodes and 8 decode nodes), providing increased capacity for larger models and higher throughput.
+
+#### DeepSeekR1 671B Model
+
+Deploy DeepSeek-R1 671B across 18 nodes for production workloads. Note the use of `--set-file prefill_serving_config` and `--set-file decode_serving_config`, which point to the serving configuration files for the multi-node scenario:
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values_deepep.yaml \
+--set workload.image=${ARTIFACT_REGISTRY} \
+--set volumes.gcsfuse.bucketName=${GCS_BUCKET} \
+--set-file prefill_serving_config=$REPO_ROOT/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-10p8d-prefill.yaml \
+--set-file decode_serving_config=$REPO_ROOT/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-10p8d-decode.yaml \
+$USER-dynamo-a4x-multi-node \
+$REPO_ROOT/src/helm-charts/a4x/inference-templates/dynamo-deployment
+```
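+
+After either install, the chart renders a `DynamoGraphDeployment` custom resource, and the Dynamo operator creates the frontend, prefill, and decode pods from it. A quick way to watch the rollout (a sketch; names depend on your release and deployment names):
+
+```bash
+# Inspect the DynamoGraphDeployment resource and watch its pods come up.
+kubectl get dynamographdeployments -n ${NAMESPACE}
+kubectl get pods -n ${NAMESPACE} -w
+```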
+
+
+## 4. Inference Request
+
+[Back to Top](#table-of-contents)
+
+Check that the pods are in `Running` status before sending inference requests:
+
+```bash
+kubectl get pods -n ${NAMESPACE}
+```
+
+We can then deploy the benchmark client and send benchmark requests.
+Deploy the benchmark client like this:
+```bash
+kubectl apply -f bench_client.yaml -n ${NAMESPACE}
+```
+
+Then send a benchmark request like this:
+
+```bash
+kubectl exec -it bench-client -- bash -c "cd /workspace/dynamo/examples/backends/sglang/slurm_jobs/scripts/vllm && python3 -u benchmark_serving.py --host $USER-dynamo-a4x-1p1d-frontend --port 8000 --model deepseek-ai/DeepSeek-R1 --tokenizer deepseek-ai/DeepSeek-R1 --backend 'dynamo' --endpoint /v1/completions --disable-tqdm --dataset-name random --num-prompts 2560 --random-input-len 8192 --random-output-len 1024 --random-range-ratio 0.8 --ignore-eos --request-rate inf --percentile-metrics ttft,tpot,itl,e2el --max-concurrency 512"
+```
+
+Alternatively, we can send a benchmark request from a frontend pod (replace the pod name with the one shown by `kubectl get pods`):
+
+```bash
+kubectl exec -n ${NAMESPACE} $USER-dynamo-multi-node-serving-frontend -- python3 -u -m sglang.bench_serving --backend sglang-oai-chat --base-url http://localhost:8000 --model "deepseek-ai/DeepSeek-R1" --tokenizer /data/model/deepseek-ai/DeepSeek-R1 --dataset-name random --num-prompts 10240 --random-input-len 8192 --random-range-ratio 0.8 --random-output-len 1024 --max-concurrency 2048
+```
+
+
+## 5. Monitoring and Troubleshooting
+
+[Back to Top](#table-of-contents)
+
+View logs for the different components (replace the names below with your deployment's names).
+
+You can find the exact pod and deployment names with:
+```bash
+kubectl get pods -n ${NAMESPACE}
+```
+
+Frontend logs:
+```bash
+kubectl logs -f deployment/$USER-dynamo-multi-node-serving-frontend -n ${NAMESPACE}
+```
+
+Decode worker logs:
+```bash
+kubectl logs -f deployment/$USER-dynamo-multi-node-serving-decode-worker -n ${NAMESPACE}
+```
+
+Prefill worker logs:
+```bash
+kubectl logs -f deployment/$USER-dynamo-multi-node-serving-prefill-worker -n ${NAMESPACE}
+```
+
+Common issues:
+
+* **Pods stuck in `Pending`**: Check that the nodes have sufficient resources (especially for multi-node deployments).
+* **Slow model download**: Large models like DeepSeek-R1 671B can take 30+ minutes to download.
+* **Multi-node issues**: Verify network connectivity between nodes and proper subnet configuration.
+* **DeepEP timeout issues**: Recompile DeepEP during the image build, patching (increasing) `NUM_CPU_TIMEOUT_SECS` and `NUM_TIMEOUT_CYCLES` in `csrc/kernels/configs.cuh`.
+
+
+## 6. Cleanup
+
+[Back to Top](#table-of-contents)
+
+List deployed releases:
+```bash
+helm list -n ${NAMESPACE} --filter $USER-dynamo-
+```
+
+Uninstall specific deployments:
+```bash
+helm uninstall $USER-dynamo-a4x-1p1d -n ${NAMESPACE}
+helm uninstall $USER-dynamo-a4x-multi-node -n ${NAMESPACE}
+```
+
+Uninstall the Dynamo platform (if no longer needed):
+```bash
+helm uninstall dynamo-platform -n ${NAMESPACE}
+helm uninstall dynamo-crds -n default
+```
+
+Delete the namespace and its secrets:
+```bash
+kubectl delete namespace ${NAMESPACE}
+```
+
+Clean up the downloaded charts:
+```bash
+rm -f dynamo-crds-${RELEASE_VERSION}.tgz
+rm -f dynamo-platform-${RELEASE_VERSION}.tgz
+```
+
diff --git a/inference/a4x/disaggregated-serving/dynamo/bench_client.yaml b/inference/a4x/disaggregated-serving/dynamo/bench_client.yaml
new file mode 100644
index 00000000..16a96971
--- /dev/null
+++ b/inference/a4x/disaggregated-serving/dynamo/bench_client.yaml
@@ -0,0 +1,47 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: Pod +metadata: + name: bench-client + labels: + app: bench-client +spec: + restartPolicy: Never + containers: + - name: benchmark + image: python:3.10 + workingDir: /workspace + command: ["/bin/bash", "-c"] + # This script runs ONCE when the pod starts to set everything up. + # Then it sleeps forever so the pod stays open for you. + args: + - | + echo "--- STARTING SETUP ---" + + # 1. Install Git + apt-get update && apt-get install -y git + + # 2. Install Python Dependencies + pip install -q transformers aiohttp numpy requests tqdm pandas datasets Pillow + + # 3. Clone the Repo (Specific Branch) + echo "Cloning repo..." + git clone --single-branch --branch ishan/sa-1.1-sgl-dsr1-fp8 https://github.com/ai-dynamo/dynamo.git /workspace/dynamo + + echo "--- SETUP COMPLETE. POD IS READY. ---" + + # 4. Keep the pod alive indefinitely + sleep infinity diff --git a/inference/a4x/disaggregated-serving/dynamo/values_deepep.yaml b/inference/a4x/disaggregated-serving/dynamo/values_deepep.yaml new file mode 100644 index 00000000..a68f203a --- /dev/null +++ b/inference/a4x/disaggregated-serving/dynamo/values_deepep.yaml @@ -0,0 +1,196 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
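+
+# Helm values for the 18-node DeepEP deployment: 5 prefill groups of 2 nodes each
+# (10 prefill nodes) plus one 8-node decode group, with 4 GPUs per node (72 GPUs).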
+ +dynamo: + namespace: dynamo-cloud + releaseVersion: "0.7.0" + deploymentName: dynamo-disagg10p8d + computeDomain: + name: a4x-domain + numNodes: 18 + resourceClaimTemplateName: a4x-channel + serviceAccountName: + frontend: + image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.7.0 + replicas: 9 + livenessProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 150 + failureThreshold: 100 + readinessProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 300 + failureThreshold: 100 + decodeWorker: + nodeCount: 8 + replicas: 1 + envs: + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: hf-token-secret + key: HF_TOKEN + - name: HF_HUB_ENABLE_HF_TRANSFER + value: "1" + - name: LD_LIBRARY_PATH + value: "/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:/opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu:/opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu/plugins:/usr/local/nvidia/lib64" + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: TP_SOCKET_IFNAME + value: eth0 + - name: DYN_SKIP_SGLANG_LOG_FORMATTING + value: "1" + - name: SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK + value: "256" + - name: MC_TE_METRIC + value: "true" + - name: SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE + value: "100000" + - name: SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT + value: "100000" + - name: SGLANG_DISAGGREGATION_WAITING_TIMEOUT + value: "100000" + - name: SGLANG_DECODE_BOOTSTRAP_TIMEOUT + value: "1000" + - name: SGLANG_HACK_SEQ_BOOTSTRAP_ROOM + value: "1" + - name: SGLANG_MOONCAKE_CUSTOM_MEM_POOL + value: "True" + - name: MC_FORCE_MNNVL + value: "1" + - name: NCCL_MNNVL_ENABLE + value: "1" + - name: NCCL_CUMEM_ENABLE + value: "1" + - name: SGLANG_USE_MESSAGE_QUEUE_BROADCASTER + value: "0" + - name: SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK + value: "1" + - name: PYTHONUNBUFFERED + value: "1" + livenessProbe: + initialDelaySeconds: 600 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 60 + readinessProbe: + initialDelaySeconds: 60 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 60 + startupProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 1800 + prefillWorker: + nodeCount: 2 + replicas: 5 + envs: + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: hf-token-secret + key: HF_TOKEN + - name: HF_HUB_ENABLE_HF_TRANSFER + value: "1" + - name: LD_LIBRARY_PATH + value: "/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:/opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu:/opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu/plugins:/usr/local/nvidia/lib64" + - name: UCX_TLS + value: "^tcp" + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: TP_SOCKET_IFNAME + value: eth0 + - name: DYN_SKIP_SGLANG_LOG_FORMATTING + value: "1" + - name: MC_TE_METRIC + value: "true" + - name: SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE + value: "100000" + - name: SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT + value: "100000" + - name: SGLANG_DISAGGREGATION_WAITING_TIMEOUT + value: "100000" + - name: SGLANG_MOONCAKE_CUSTOM_MEM_POOL + value: "True" + - name: MC_FORCE_MNNVL + value: "1" + - name: NCCL_MNNVL_ENABLE + value: "1" + - name: NCCL_CUMEM_ENABLE + value: "1" + - name: SGLANG_USE_MESSAGE_QUEUE_BROADCASTER + value: "0" + - name: SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK + value: "1" + - name: PYTHONUNBUFFERED + value: "1" + livenessProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 60 + readinessProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 60 + startupProbe: + initialDelaySeconds: 3000 
+ periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 1800 + + +secrets: + ngc: + secretName: nvcr-secret + huggingface: + secretName: hf-token-secret + secretData: + token: "hf_api_token" + +volumes: + useGcs: true + gcsfuse: + bucketName: + ssdMountPath: "/ssd" + gcsMounts: + mountPath: "/data/model" + +service: + type: ClusterIP + ports: + frontend: 8000 + worker: 9090 + +workload: + model: deepseek-ai/DeepSeek-R1 + image: + framework: sglang + configFile: serving-args.yaml + configPath: /workload/configs + +network: + subnetworks: [] + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.0.7 + ncclSettings: + - name: NCCL_DEBUG + value: "VERSION" + +quantizations: + - "fp8" diff --git a/inference/a4x/disaggregated-serving/dynamo/values_wo_deepep.yaml b/inference/a4x/disaggregated-serving/dynamo/values_wo_deepep.yaml new file mode 100644 index 00000000..5308d69f --- /dev/null +++ b/inference/a4x/disaggregated-serving/dynamo/values_wo_deepep.yaml @@ -0,0 +1,202 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +dynamo: + namespace: dynamo-cloud + releaseVersion: "0.7.0" + deploymentName: dynamo-disagg1p1d + computeDomain: + name: a4x-domain + numNodes: 2 + resourceClaimTemplateName: a4x-channel + serviceAccountName: + frontend: + image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.7.0 + replicas: 9 + livenessProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 150 + failureThreshold: 100 + readinessProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 300 + failureThreshold: 100 + decodeWorker: + nodeCount: 1 + replicas: 1 + envs: + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: hf-token-secret + key: HF_TOKEN + - name: HF_HUB_ENABLE_HF_TRANSFER + value: "1" + - name: LD_LIBRARY_PATH + value: "/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:/opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu:/opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu/plugins:/usr/local/nvidia/lib64" + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: TP_SOCKET_IFNAME + value: eth0 + - name: PYTHONUNBUFFERED + value: "1" + - name: DYN_SKIP_SGLANG_LOG_FORMATTING + value: "1" + - name: SGLANG_ENABLE_JIT_DEEPGEMM + value: "false" + - name: SGLANG_ENABLE_FLASHINFER_GEMM + value: "1" + - name: SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE + value: "100000" + - name: SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT + value: "100000" + - name: SGLANG_DISAGGREGATION_WAITING_TIMEOUT + value: "100000" + - name: SGLANG_DECODE_BOOTSTRAP_TIMEOUT + value: "1000" + - name: SGLANG_HACK_SEQ_BOOTSTRAP_ROOM + value: "1" + - name: SGLANG_MOONCAKE_CUSTOM_MEM_POOL + value: "True" + - name: SGLANG_USE_MESSAGE_QUEUE_BROADCASTER + value: "0" + - name: SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK + value: "1" + - name: MC_TE_METRIC + value: "true" + - name: MC_FORCE_MNNVL + value: "1" + - name: NCCL_MNNVL_ENABLE + value: "1" + - name: NCCL_CUMEM_ENABLE + value: "1" + livenessProbe: + initialDelaySeconds: 600 + periodSeconds: 60 + timeoutSeconds: 30 + 
failureThreshold: 60 + readinessProbe: + initialDelaySeconds: 60 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 60 + startupProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 1800 + prefillWorker: + nodeCount: 1 + replicas: 1 + envs: + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: hf-token-secret + key: HF_TOKEN + - name: HF_HUB_ENABLE_HF_TRANSFER + value: "1" + - name: LD_LIBRARY_PATH + value: "/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:/opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu:/opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu/plugins:/usr/local/nvidia/lib64" + - name: UCX_TLS + value: "^tcp" + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: TP_SOCKET_IFNAME + value: eth0 + - name: PYTHONUNBUFFERED + value: "1" + - name: DYN_SKIP_SGLANG_LOG_FORMATTING + value: "1" + - name: SGLANG_ENABLE_JIT_DEEPGEMM + value: "false" + - name: SGLANG_ENABLE_FLASHINFER_GEMM + value: "1" + - name: SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE + value: "100000" + - name: SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT + value: "100000" + - name: SGLANG_DISAGGREGATION_WAITING_TIMEOUT + value: "100000" + - name: SGLANG_MOONCAKE_CUSTOM_MEM_POOL + value: "True" + - name: SGLANG_USE_MESSAGE_QUEUE_BROADCASTER + value: "0" + - name: SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK + value: "1" + - name: MC_TE_METRIC + value: "true" + - name: MC_FORCE_MNNVL + value: "1" + - name: NCCL_MNNVL_ENABLE + value: "1" + - name: NCCL_CUMEM_ENABLE + value: "1" + livenessProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 60 + readinessProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 60 + startupProbe: + initialDelaySeconds: 3000 + periodSeconds: 60 + timeoutSeconds: 30 + failureThreshold: 1800 + + +secrets: + ngc: + secretName: nvcr-secret + huggingface: + secretName: hf-token-secret + secretData: + token: "hf_api_token" + +volumes: + useGcs: true + gcsfuse: + bucketName: + ssdMountPath: "/ssd" + gcsMounts: + mountPath: "/data/model" + +service: + type: ClusterIP + ports: + frontend: 8000 + worker: 9090 + +workload: + model: deepseek-ai/DeepSeek-R1 + image: + framework: sglang + configFile: serving-args.yaml + configPath: /workload/configs + +network: + subnetworks: [] + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.0.7 + ncclSettings: + - name: NCCL_DEBUG + value: "VERSION" + +quantizations: + - "fp8" diff --git a/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-10p8d-decode.yaml b/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-10p8d-decode.yaml new file mode 100644 index 00000000..4369e1ce --- /dev/null +++ b/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-10p8d-decode.yaml @@ -0,0 +1,47 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
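+
+# SGLang server arguments for the decode workers of the 18-node (10 prefill,
+# 8 decode) DeepEP deployment. The Helm chart's launcher script turns each
+# `key: value` pair below into a `--key value` flag for `python3 -m dynamo.sglang`.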
+ +served-model-name: deepseek-ai/DeepSeek-R1 +disaggregation-mode: decode +disaggregation-bootstrap-port: "30001" +host: "0.0.0.0" +port: "9090" +trust-remote-code: true +skip-tokenizer-init: true +tp-size: "32" +dp-size: "32" +ep-size: "32" +quantization: "fp8" +enable-dp-attention: true +attention-backend: "trtllm_mla" +kv-cache-dtype: "fp8_e4m3" +disable-radix-cache: true +stream-interval: "50" +decode-log-interval: "1000" +max-running-requests: "8192" +context-length: "9300" +watchdog-timeout: "1000000" +disable-shared-experts-fusion: true +eplb-algorithm: deepseek +mem-fraction-static: "0.82" +chunked-prefill-size: "36864" +moe-a2a-backend: "deepep" +deepep-mode: "low_latency" +ep-dispatch-algorithm: static +moe-dense-tp-size: "1" +enable-dp-lm-head: true +prefill-round-robin-balance: true +ep-num-redundant-experts: "32" +cuda-graph-max-bs: "256" +deepep-config: '{"normal_dispatch": {"num_sms": 128,"num_max_nvl_chunked_send_tokens": 28,"num_max_nvl_chunked_recv_tokens": 256,"num_max_rdma_chunked_send_tokens": 6,"num_max_rdma_chunked_recv_tokens": 256}, "normal_combine": {"num_sms": 128,"num_max_nvl_chunked_send_tokens": 15,"num_max_nvl_chunked_recv_tokens": 256,"num_max_rdma_chunked_send_tokens": 6,"num_max_rdma_chunked_recv_tokens": 128}}' \ No newline at end of file diff --git a/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-10p8d-prefill.yaml b/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-10p8d-prefill.yaml new file mode 100644 index 00000000..9c86f420 --- /dev/null +++ b/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-10p8d-prefill.yaml @@ -0,0 +1,46 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
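+
+# SGLang server arguments for the prefill workers of the 18-node DeepEP
+# deployment (tp/dp/ep of 8 within each 2-node prefill group); parsed into
+# `--key value` flags by the Helm chart's launcher script.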
+ +served-model-name: deepseek-ai/DeepSeek-R1 +disaggregation-mode: prefill +disaggregation-bootstrap-port: "30001" +host: "0.0.0.0" +port: "9090" +trust-remote-code: true +tp-size: "8" +dp-size: "8" +ep-size: "8" +quantization: "fp8" +enable-dp-attention: true +attention-backend: "trtllm_mla" +kv-cache-dtype: "fp8_e4m3" +disable-radix-cache: true +stream-interval: "50" +max-running-requests: "30000" +context-length: "9300" +watchdog-timeout: "1000000" +disable-shared-experts-fusion: true +eplb-algorithm: deepseek +mem-fraction-static: "0.8" +max-total-tokens: "524288" +chunked-prefill-size: "131072" +load-balance-method: round_robin +disable-cuda-graph: true +moe-a2a-backend: deepep +deepep-mode: normal +ep-dispatch-algorithm: "dynamic" +moe-dense-tp-size: "1" +enable-dp-lm-head: true +ep-num-redundant-experts: "32" +deepep-config: '{"normal_dispatch": {"num_sms": 128,"num_max_nvl_chunked_send_tokens": 28,"num_max_nvl_chunked_recv_tokens": 256,"num_max_rdma_chunked_send_tokens": 6,"num_max_rdma_chunked_recv_tokens": 256}, "normal_combine": {"num_sms": 128,"num_max_nvl_chunked_send_tokens": 15,"num_max_nvl_chunked_recv_tokens": 256,"num_max_rdma_chunked_send_tokens": 6,"num_max_rdma_chunked_recv_tokens": 128}}' \ No newline at end of file diff --git a/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-1p1d-decode.yaml b/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-1p1d-decode.yaml new file mode 100644 index 00000000..ff0f3c47 --- /dev/null +++ b/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-1p1d-decode.yaml @@ -0,0 +1,40 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +enable-metrics: true +served-model-name: deepseek-ai/DeepSeek-R1 +disaggregation-bootstrap-port: "30001" +disaggregation-mode: decode +host: "0.0.0.0" +port: "9090" +disable-radix-cache: true +tensor-parallel-size: 4 +data-parallel-size: 1 +expert-parallel-size: 1 +trust-remote-code: true +kv-cache-dtype: "fp8_e4m3" +attention-backend: "trtllm_mla" +quantization: "fp8" +moe-runner-backend: "flashinfer_trtllm" +watchdog-timeout: "1000000" +context-length: "9600" +mem-fraction-static: "0.95" +chunked-prefill-size: "8192" +cuda-graph-max-bs: "512" +max-running-requests: "512" +scheduler-recv-interval: "10" +enable-flashinfer-allreduce-fusion: true +enable-symm-mem: true +moe-dense-tp-size: "1" +prefill-round-robin-balance: true \ No newline at end of file diff --git a/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-1p1d-prefill.yaml b/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-1p1d-prefill.yaml new file mode 100644 index 00000000..e42cb117 --- /dev/null +++ b/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-1p1d-prefill.yaml @@ -0,0 +1,40 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +enable-metrics: true +served-model-name: deepseek-ai/DeepSeek-R1 +disaggregation-bootstrap-port: "30001" +disaggregation-mode: prefill +host: "0.0.0.0" +port: "9090" +disable-radix-cache: true +tensor-parallel-size: 4 +data-parallel-size: 1 +expert-parallel-size: 1 +trust-remote-code: true +kv-cache-dtype: "fp8_e4m3" +attention-backend: "trtllm_mla" +quantization: "fp8" +moe-runner-backend: "flashinfer_trtllm" +watchdog-timeout: "1000000" +context-length: "9600" +mem-fraction-static: "0.95" +max-total-tokens: "32768" +chunked-prefill-size: "24576" +cuda-graph-max-bs: "512" +max-running-requests: "512" +load-balance-method: round_robin +scheduler-recv-interval: "10" +enable-flashinfer-allreduce-fusion: true +moe-dense-tp-size: "1" \ No newline at end of file diff --git a/src/helm-charts/a4x/inference-templates/dynamo-deployment/Chart.yaml b/src/helm-charts/a4x/inference-templates/dynamo-deployment/Chart.yaml new file mode 100644 index 00000000..25a2209e --- /dev/null +++ b/src/helm-charts/a4x/inference-templates/dynamo-deployment/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: a4x-dynamo-deployment +description: a4x-dynamo-deployment +type: application +version: 0.1.0 +appVersion: "0.4.0" \ No newline at end of file diff --git a/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-compute-domain.yaml b/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-compute-domain.yaml new file mode 100644 index 00000000..dc2ab53a --- /dev/null +++ b/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-compute-domain.yaml @@ -0,0 +1,24 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
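+
+# A single NVIDIA ComputeDomain spanning all nodes of the deployment. Worker
+# pods attach to it through the referenced resource claim template, which
+# provides the cross-node NVLink (IMEX) channel on the GB200 NVL72 rack.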
+ +apiVersion: resource.nvidia.com/v1beta1 +kind: ComputeDomain +metadata: + name: {{ .Values.dynamo.computeDomain.name }} + namespace: {{ .Values.dynamo.namespace }} +spec: + numNodes: {{ .Values.dynamo.computeDomain.numNodes }} + channel: + resourceClaimTemplate: + name: {{ .Values.dynamo.computeDomain.resourceClaimTemplateName }} diff --git a/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-graph-deployment.yaml b/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-graph-deployment.yaml new file mode 100644 index 00000000..8002e43a --- /dev/null +++ b/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-graph-deployment.yaml @@ -0,0 +1,416 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: {{ .Values.dynamo.deploymentName }} + namespace: {{ .Values.dynamo.namespace }} +spec: + {{- if .Values.workload.framework }} + backendFramework: {{ .Values.workload.framework }} + {{- end }} + services: + Frontend: + dynamoNamespace: {{ .Values.dynamo.namespace }} + componentType: frontend + replicas: {{ .Values.dynamo.frontend.replicas }} + resources: + requests: + cpu: "5" + memory: "50Gi" + limits: + cpu: "5" + memory: "50Gi" + extraPodMetadata: + annotations: + {{- if eq .Values.volumes.useGcs true }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "0" + gke-gcsfuse/memory-limit: "0" + gke-gcsfuse/ephemeral-storage-limit: "0" + gke-gcsfuse/file-cache-capacity: "500Gi" + gke-gcsfuse/cache-path: "/gcs-cache" + {{- end }} + extraPodSpec: + tolerations: + - key: "kubernetes.io/arch" + operator: "Equal" + value: "arm64" + effect: "NoSchedule" + - key: "nvidia.com/gpu" + operator: "Exists" + effect: "NoSchedule" + volumes: + - name: local-ssd + emptyDir: {} + {{- if eq .Values.volumes.useGcs true }} + - name: gcs-model-volume + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: {{ .Values.volumes.gcsfuse.bucketName }} + mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:50,file-cache:max-size-mb:-1" + {{- end }} + mainContainer: + image: {{ .Values.dynamo.frontend.image }} + {{- if eq .Values.volumes.useGcs true }} + volumeMounts: + - name: local-ssd + mountPath: /gcs-cache + - name: gcs-model-volume + mountPath: /data/model + readOnly: true + {{- end }} + resources: + requests: + ephemeral-storage: "30Gi" + limits: + ephemeral-storage: "30Gi" + + Decode: + {{- if gt (int .Values.dynamo.decodeWorker.nodeCount) 1 }} + multinode: + nodeCount: {{ .Values.dynamo.decodeWorker.nodeCount }} + {{- end }} + dynamoNamespace: {{ .Values.dynamo.namespace }} + envFromSecret: {{ .Values.secrets.huggingface.secretName }} + componentType: worker + subComponentType: decode + replicas: {{ .Values.dynamo.decodeWorker.replicas }} + livenessProbe: + httpGet: + path: /live + port: 
system + initialDelaySeconds: {{ .Values.dynamo.decodeWorker.livenessProbe.initialDelaySeconds }} + periodSeconds: {{ .Values.dynamo.decodeWorker.livenessProbe.periodSeconds }} + timeoutSeconds: {{ .Values.dynamo.decodeWorker.livenessProbe.timeoutSeconds }} + failureThreshold: {{ .Values.dynamo.decodeWorker.livenessProbe.failureThreshold }} + readinessProbe: + httpGet: + path: /health + port: system + initialDelaySeconds: {{ .Values.dynamo.decodeWorker.readinessProbe.initialDelaySeconds }} + timeoutSeconds: {{ .Values.dynamo.decodeWorker.readinessProbe.timeoutSeconds }} + periodSeconds: {{ .Values.dynamo.decodeWorker.readinessProbe.periodSeconds }} + failureThreshold: {{ .Values.dynamo.decodeWorker.readinessProbe.failureThreshold }} + sharedMemory: + size: 80Gi + resources: + limits: + gpu: "4" + claims: + - name: compute-domain-channel + envs: + - name: SERVER_ARGS_FILE + value: {{ .Values.workload.configPath }}/{{ .Values.workload.configFile }} + {{- if eq .Values.volumes.useGcs true }} + - name: MODEL_PATH + value: {{ .Values.volumes.gcsMounts.mountPath }}/{{ .Values.workload.model }} + {{- end }} + {{- if .Values.dynamo.decodeWorker.envs }} + {{- toYaml .Values.dynamo.decodeWorker.envs | nindent 8 }} + {{- end }} + extraPodMetadata: + annotations: + {{- if eq .Values.volumes.useGcs true }} + gke-gcsfuse/cpu-limit: "0" + gke-gcsfuse/ephemeral-storage-limit: "0" + gke-gcsfuse/memory-limit: "0" + gke-gcsfuse/volumes: "true" + {{- end }} + networking.gke.io/default-interface: 'eth0' + networking.gke.io/interfaces: | + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth2","network":"rdma-0"}, + {"interfaceName":"eth3","network":"rdma-1"}, + {"interfaceName":"eth4","network":"rdma-2"}, + {"interfaceName":"eth5","network":"rdma-3"} + ] + extraPodSpec: + {{- if .Values.dynamo.serviceAccountName }} + serviceAccountName: {{ .Values.dynamo.serviceAccountName }} + {{- end }} + resourceClaims: + - name: compute-domain-channel + resourceClaimTemplateName: {{ .Values.dynamo.computeDomain.resourceClaimTemplateName }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/arch + operator: In + values: + - arm64 + mainContainer: + securityContext: + privileged: true + image: {{ .Values.workload.image }} + workingDir: /sgl-workspace/dynamo/components/backends/sglang + startupProbe: + failureThreshold: {{ .Values.dynamo.decodeWorker.startupProbe.failureThreshold }} + httpGet: + path: /live + port: system + periodSeconds: {{ .Values.dynamo.decodeWorker.startupProbe.periodSeconds }} + timeoutSeconds: {{ .Values.dynamo.decodeWorker.startupProbe.timeoutSeconds }} + initialDelaySeconds: {{ .Values.dynamo.decodeWorker.startupProbe.initialDelaySeconds }} + command: ["/bin/bash", "-c"] + stdin: true + tty: true + args: + - | + set -e + nvidia-smi + . 
/usr/local/gib/scripts/set_nccl_env.sh + + echo "--- VERIFYING NCCL ENV VARS IN SHELL ---" + env | grep NCCL_ + echo "--- END VERIFICATION ---" + pip install hf_transfer + + ARGS=() + if [ -n "$MODEL_PATH" ]; then + echo "Adding model path from env var: $MODEL_PATH" + ARGS+=("--model-path" "$MODEL_PATH") + else + echo "No MODEL_PATH env var set from gcsfuse, relying on config file for model" + ARGS+=("--model" "{{ .Values.workload.model }}") + fi + if [ -f "$SERVER_ARGS_FILE" ]; then + echo "Loading server arguments from ConfigMap" + while IFS=': ' read -r key value || [ -n "$key" ]; do + [[ -z "$key" || "$key" == \#* ]] && continue + key=$(echo "$key" | xargs) + value=$(echo "$value" | xargs) + + if [ -n "$key" ]; then + if [[ "$value" == "true" ]]; then + ARGS+=("--$key") + elif [[ "$value" == "false" ]]; then + ARGS+=("--$key" "false") + elif [ -n "$value" ]; then + ARGS+=("--$key" "$value") + else + ARGS+=("--$key") + fi + fi + done < "$SERVER_ARGS_FILE" + fi + echo "Running: python3 -m dynamo.sglang ${ARGS[@]}" + exec python3 -m dynamo.sglang "${ARGS[@]}" + + volumeMounts: + {{- if eq .Values.volumes.useGcs true }} + - mountPath: /data/model + name: gcs-model-volume + {{- end }} + - name: library-dir-host + mountPath: /usr/local/nvidia + - name: gib + mountPath: /usr/local/gib + - name: serving-configuration + mountPath: {{ .Values.workload.configPath | default "/workload/configs" }} + volumes: + {{- if eq .Values.volumes.useGcs true }} + - name: gcs-model-volume + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: {{ .Values.volumes.gcsfuse.bucketName }} + mountOptions: implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1 + {{- end }} + - name: library-dir-host + hostPath: + path: /home/kubernetes/bin/nvidia + - name: gib + hostPath: + path: /home/kubernetes/bin/gib + - name: serving-configuration + configMap: + name: "{{ .Release.Name }}-decode-config" + items: + - key: serving-configuration + path: {{ .Values.workload.configFile | default "serving-args.yaml" }} + + Prefill: + {{- if gt (int .Values.dynamo.prefillWorker.nodeCount) 1 }} + multinode: + nodeCount: {{ .Values.dynamo.prefillWorker.nodeCount }} + {{- end }} + dynamoNamespace: {{ .Values.dynamo.namespace }} + envFromSecret: {{ .Values.secrets.huggingface.secretName }} + componentType: worker + subComponentType: prefill + replicas: {{ .Values.dynamo.prefillWorker.replicas }} + livenessProbe: + exec: + command: + - /bin/sh + - -c + - "exit 0" + initialDelaySeconds: {{ .Values.dynamo.prefillWorker.livenessProbe.initialDelaySeconds }} + periodSeconds: {{ .Values.dynamo.prefillWorker.livenessProbe.periodSeconds }} + timeoutSeconds: {{ .Values.dynamo.prefillWorker.livenessProbe.timeoutSeconds }} + failureThreshold: {{ .Values.dynamo.prefillWorker.livenessProbe.failureThreshold }} + readinessProbe: + httpGet: + path: /health + port: system + initialDelaySeconds: {{ .Values.dynamo.prefillWorker.readinessProbe.initialDelaySeconds }} + timeoutSeconds: {{ .Values.dynamo.prefillWorker.readinessProbe.timeoutSeconds }} + periodSeconds: {{ .Values.dynamo.prefillWorker.readinessProbe.periodSeconds }} + failureThreshold: {{ .Values.dynamo.prefillWorker.readinessProbe.failureThreshold }} + sharedMemory: + size: 80Gi + resources: + limits: + gpu: "4" + claims: + - name: compute-domain-channel + envs: + - name: SERVER_ARGS_FILE + value: {{ .Values.workload.configPath }}/{{ 
.Values.workload.configFile }} + {{- if eq .Values.volumes.useGcs true }} + - name: MODEL_PATH + value: {{ .Values.volumes.gcsMounts.mountPath }}/{{ .Values.workload.model }} + {{- end }} + {{- if .Values.dynamo.prefillWorker.envs }} + {{- toYaml .Values.dynamo.prefillWorker.envs | nindent 8 }} + {{- end }} + extraPodMetadata: + annotations: + {{- if eq .Values.volumes.useGcs true }} + gke-gcsfuse/cpu-limit: "0" + gke-gcsfuse/ephemeral-storage-limit: "0" + gke-gcsfuse/memory-limit: "0" + gke-gcsfuse/volumes: "true" + {{- end }} + networking.gke.io/default-interface: 'eth0' + networking.gke.io/interfaces: | + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth2","network":"rdma-0"}, + {"interfaceName":"eth3","network":"rdma-1"}, + {"interfaceName":"eth4","network":"rdma-2"}, + {"interfaceName":"eth5","network":"rdma-3"} + ] + extraPodSpec: + {{- if .Values.dynamo.serviceAccountName }} + serviceAccountName: {{ .Values.dynamo.serviceAccountName }} + {{- end }} + resourceClaims: + - name: compute-domain-channel + resourceClaimTemplateName: {{ .Values.dynamo.computeDomain.resourceClaimTemplateName }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/arch + operator: In + values: + - arm64 + mainContainer: + securityContext: + privileged: true + stdin: true + tty: true + image: {{ .Values.workload.image }} + workingDir: /sgl-workspace/dynamo/components/backends/sglang + startupProbe: + failureThreshold: {{ .Values.dynamo.prefillWorker.startupProbe.failureThreshold }} + httpGet: + path: /live + port: system + periodSeconds: {{ .Values.dynamo.prefillWorker.startupProbe.periodSeconds }} + timeoutSeconds: {{ .Values.dynamo.prefillWorker.startupProbe.timeoutSeconds }} + initialDelaySeconds: {{ .Values.dynamo.prefillWorker.startupProbe.initialDelaySeconds }} + command: ["/bin/bash", "-c"] + args: + - | + set -e + nvidia-smi + . 
/usr/local/gib/scripts/set_nccl_env.sh + + echo "--- VERIFYING NCCL ENV VARS IN SHELL ---" + env | grep NCCL_ + echo "--- END VERIFICATION ---" + pip install hf_transfer + + ARGS=() + if [ -n "$MODEL_PATH" ]; then + echo "Adding model path from env var: $MODEL_PATH" + ARGS+=("--model-path" "$MODEL_PATH") + else + echo "No MODEL_PATH env var set from gcsfuse, relying on config file for model" + ARGS+=("--model" "{{ .Values.workload.model }}") + fi + if [ -f "$SERVER_ARGS_FILE" ]; then + echo "Loading server arguments from ConfigMap" + while IFS=': ' read -r key value || [ -n "$key" ]; do + [[ -z "$key" || "$key" == \#* ]] && continue + key=$(echo "$key" | xargs) + value=$(echo "$value" | xargs) + + if [ -n "$key" ]; then + if [[ "$value" == "true" ]]; then + ARGS+=("--$key") + elif [[ "$value" == "false" ]]; then + ARGS+=("--$key" "false") + elif [ -n "$value" ]; then + ARGS+=("--$key" "$value") + else + ARGS+=("--$key") + fi + fi + done < "$SERVER_ARGS_FILE" + fi + echo "Running: python3 -m dynamo.sglang ${ARGS[@]}" + exec python3 -m dynamo.sglang "${ARGS[@]}" + + volumeMounts: + {{- if eq .Values.volumes.useGcs true }} + - mountPath: /data/model + name: gcs-model-volume + {{- end }} + - name: library-dir-host + mountPath: /usr/local/nvidia + - name: gib + mountPath: /usr/local/gib + - name: serving-configuration + mountPath: {{ .Values.workload.configPath | default "/workload/configs" }} + volumes: + {{- if eq .Values.volumes.useGcs true }} + - name: gcs-model-volume + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: {{ .Values.volumes.gcsfuse.bucketName }} + mountOptions: implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1 + {{- end }} + - name: library-dir-host + hostPath: + path: /home/kubernetes/bin/nvidia + - name: gib + hostPath: + path: /home/kubernetes/bin/gib + - name: serving-configuration + configMap: + name: "{{ .Release.Name }}-prefill-config" + items: + - key: serving-configuration + path: {{ .Values.workload.configFile | default "serving-args.yaml" }} \ No newline at end of file diff --git a/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-launcher-configmap.yaml b/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-launcher-configmap.yaml new file mode 100644 index 00000000..01e9b51f --- /dev/null +++ b/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
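+
+# Optional launcher-script ConfigMap: if a `workload_launcher` value is supplied
+# (for example via `--set-file`), it is rendered as launch-workload.sh;
+# otherwise a stub script that exits with an error is used.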
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" + namespace: {{ .Values.dynamo.namespace }} +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} \ No newline at end of file diff --git a/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-worker-configmap.yaml b/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-worker-configmap.yaml new file mode 100644 index 00000000..f82580ae --- /dev/null +++ b/src/helm-charts/a4x/inference-templates/dynamo-deployment/templates/dynamo-worker-configmap.yaml @@ -0,0 +1,35 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{- if .Values.prefill_serving_config }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-prefill-config" + namespace: {{ .Values.dynamo.namespace }} +data: + serving-configuration: |- +{{ .Values.prefill_serving_config | nindent 4 }} +{{- end }} +--- +{{- if .Values.decode_serving_config }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-decode-config" + namespace: {{ .Values.dynamo.namespace }} +data: + serving-configuration: |- +{{ .Values.decode_serving_config | nindent 4 }} +{{- end }} \ No newline at end of file