Deploy a scalable DGX cluster on-prem or in the cloud
- Overview
- Prerequisites
- Installation Steps
- Cluster Usage
- Troubleshooting
- Open Source Software
- Copyright and License
- Issues and Contributing
The DeepOps project aims to facilitate deployment of multi-node GPU clusters for Deep Learning and HPC environments, in an on-prem, optionally air-gapped datacenter or in the cloud.
This document is written as a step-by-step guide which should allow a person with minimal Linux system administration experience to install and configure an entire cluster from scratch. More experienced administrators should be able to pick and choose items that may be useful; it is not required to follow all steps in the guide if existing software or infrastructure is to be used.
Installation involves first bootstrapping management server(s) with a Kubernetes installation and persistent volume storage using Ceph. Cluster services for provisioning operating systems, monitoring, and mirroring container and package repos are then deployed on Kubernetes. From there, DGX servers are booted and installed with the DGX base OS, and Kubernetes is extended across the entire cluster to facilitate job management. An optional login server can be used to allow users a place to interact with data locally and launch jobs. The Slurm job scheduler can also be installed in parallel with Kubernetes to facilitate easier large-scale training jobs or more traditional HPC workloads.
For more information on deploying DGX in the datacenter, consult the DGX Data Center Reference Design Whitepaper.
- 1 or more CPU-only servers for management
- 3 or more servers can be used for high-availability
- Minimum: 4 CPU cores, 16GB RAM, 100GB hard disk
- More storage required if storing containers in registry, etc.
- More RAM required if running more services on Kubernetes or using one/few servers
- Ubuntu 16.04 LTS installed
- 1 or more DGX compute nodes
- Laptop or workstation for provisioning/deployment
- (optional) 1 CPU-only server for user job launch, data management, etc.
The administrator's provisioning system should have the following installed:
- Ansible 2.5 or later
- git
- docker (to build containers)
- ipmitool
- python-netaddr (for kubespray)
The management server(s) should be pre-installed with Ubuntu 16.04 LTS before starting the installation steps. If you already have a bare-metal provisioning system, it can be used to install Ubuntu on the management server(s). Integrating the DGX Base OS with other bare-metal provisioning systems is outside the scope of this project.
The DeepOps service container "DGXie" provides DHCP, DNS, and PXE services to the cluster, and will allow you to automatically install the official DGX base OS on DGX servers. If you elect to use this management service, you will need to have a dedicated network segment and subnet which can be controlled by the DHCP server.
- Download and configure DeepOps repo
- Deploy management server(s)
- Bootstrap
- Deploy Kubernetes
- Deploy Ceph persistent storage on management nodes
- Deploy cluster service containers on Kubernetes
- DHCP/DNS/PXE, container registry, Apt repo, monitoring, alerting
- Deploy DGX-1 compute nodes
- Install DGX OS (via PXE), bootstrap (via Ansible)
- Update firmware (via Ansible, if required)
- Join DGX-1 compute nodes to Kubernetes cluster and deploy GPU device plugin
- Deploy login node
- Install OS (via PXE), bootstrap (via Ansible)
- Install/build HPC software and modules
- Deploy cluster SW layers
- Install Slurm HPC scheduler on login and compute nodes
- Configure Kubernetes Oauth integration for user access
Download the DeepOps repo onto the provisioning system and copy the example configuration files so that you can make local changes:
```
git clone --recursive https://github.com/NVIDIA/deepops.git
cp -r config.example/ config/
ansible-galaxy install -r requirements.yml
```

Note: In Git 2.16.2 or later, use `--recurse-submodules` instead of `--recursive`. If you did a non-recursive clone, you can later run `git submodule update --init --recursive` to pull down submodules.
The config/ directory is ignored by git, so a new git repository can be created in this
directory to track local changes:
```
cd config/
git init .
git add .
git commit -am 'initial commit'
```

Use the config/inventory file to set the cluster server hostnames, and optional per-host info like IP addresses and network interfaces. The cluster should ideally use DNS, but you can also explicitly set server IP addresses in the inventory file.
Optional inventory settings:
- Use the `ansible_host` variable to set alternate IP addresses for servers or for servers which do not have resolvable hostnames
- Use the `ib_bond_addr` variable to configure the InfiniBand network adapters with IPoIB in a single bonded interface
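For example, a minimal inventory sketch (hostnames and addresses are placeholders; the mgmt and dgx-servers group names are the ones used throughout this guide):

```
[mgmt]
mgmt01    ansible_host=10.0.0.1

[dgx-servers]
dgx01     ansible_host=10.0.0.21
dgx02     ansible_host=10.0.0.22
```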
Configure cluster parameters by modifying the various yaml files in the config/group_vars
directory. The cluster-wide global config resides in the all.yml file, while
group-specific options reside in the other files. File names correspond to groups in the
inventory file, i.e. [dgx-servers] in the inventory corresponds with
config/group_vars/dgx-servers.yml.
The configuration assumes a single cpu-only management server, but multiple management servers can be used for high-availability.
Install the latest version of Ubuntu Server 16.04 LTS on each management server. Be sure to enable SSH and record the user and password used during install.
Bootstrap:
The password and SSH keys added to the ubuntu user in the config/group_vars/all.yml
file will be configured on the management node. You should add an SSH key to the configuration
file, or you will have to append the -k flag and type the password for the ubuntu
user for all Ansible commands following the bootstrap.
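For reference, a hypothetical excerpt of config/group_vars/all.yml (the variable names here are illustrative placeholders; keep the names already present in the copied example config):

```
# illustrative only -- config.example defines the real variable names
ubuntu_password: "<crypted-password-hash>"
ubuntu_ssh_keys:
  - "ssh-rsa AAAA...your-public-key... admin@provisioner"
```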
Deploy management node(s):
Type the password for the user you configured during management server OS installation when prompted, to allow for the use of `sudo` when configuring the management servers. If the management servers were installed with the use of SSH keys and sudo does not require a password, you may omit the `-k` and `-K` flags:

```
ansible-playbook -l mgmt -k -K ansible/playbooks/bootstrap.yml
```

Where `mgmt` is the group of servers in your config/inventory file which will become management servers for the cluster.
To run arbitrary commands in parallel across nodes in the cluster, you can use ansible and the groups or hosts defined in the inventory file, for example:
```
ansible mgmt -a hostname
```

For more info, see: https://docs.ansible.com/ansible/latest/user_guide/intro_adhoc.html
Apply additional changes to management servers to disable swap (required for Kubernetes):
```
ansible mgmt -b -a "swapoff -a"
```

If you need to configure a secondary network interface for the private DGX network, modify /etc/network/interfaces. For example:

```
auto ens192
iface ens192 inet static
    address 192.168.1.1/24
    dns-nameservers 8.8.8.8 8.8.4.4
    gateway 192.168.1.1
    mtu 1500
```

Kubernetes:
Deploy Kubernetes on management servers:
Modify the file config/kube.yml if needed and deploy Kubernetes:
```
ansible-playbook -l mgmt -v -b --flush-cache --extra-vars "@config/kube.yml" kubespray/cluster.yml
```

Set up Kubernetes for remote administration:

```
ansible mgmt -b -m fetch -a "src=/etc/kubernetes/admin.conf flat=yes dest=./"
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
chmod +x ./kubectl
```

To make administration easier, you may want to copy the kubectl binary to someplace in your $PATH
and copy the admin.conf configuration file to ~/.kube/config so that it is used by default.
Otherwise, you may use the kubectl flag --kubeconfig=./admin.conf instead of copying the configuration file.
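A minimal sketch of those two conveniences (the install path is one common choice, not mandated by this guide):

```
# put kubectl on the PATH and make admin.conf the default kubeconfig
sudo cp ./kubectl /usr/local/bin/kubectl
mkdir -p ~/.kube && cp ./admin.conf ~/.kube/config
```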
If you have an existing Kubernetes configuration file, you can merge the two with:
```
mv ~/.kube/config{,.bak} && KUBECONFIG=./admin.conf:~/.kube/config.bak kubectl config view --flatten | tee ~/.kube/config
```

Test that you can access the Kubernetes cluster:

```
$ kubectl get nodes
NAME      STATUS   ROLES         AGE   VERSION
mgmt01    Ready    master,node   7m    v1.11.0
```

Helm:
Some services are installed using Helm, a package manager for Kubernetes.
Install the Helm client by following the instructions for the OS on your provisioning system: https://docs.helm.sh/using_helm/#installing-helm
If you're using Linux, the script scripts/helm_install_linux.sh will set up Helm for the current user.
Be sure to install a version of Helm matching the version in config/kube.yml.
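To double-check the match, assuming the version is pinned under a helm_version key (an assumption; inspect config/kube.yml for the actual variable name):

```
helm version --client
grep helm_version config/kube.yml
```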
(Optional) If helm_enabled is true in config/kube.yml,
the Helm server will already be deployed in Kubernetes.
If it needs to be installed manually for some reason, run:
```
kubectl create sa tiller --namespace kube-system
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --node-selectors node-role.kubernetes.io/master=true
```

Ceph:
Persistent storage for Kubernetes on the management nodes is supplied by Ceph. Ceph is provisioned using Rook to simplify deployment:
```
helm repo add rook-master https://charts.rook.io/master
helm install --namespace rook-ceph-system --name rook-ceph rook-master/rook-ceph --version v0.7.0-284.g863c10f --set agent.flexVolumeDirPath=/var/lib/kubelet/volume-plugins/
kubectl create -f services/rook-cluster.yml
```

Note: It will take a few minutes for containers to be pulled and started. Wait for Rook to be fully installed before proceeding.
You can check Ceph status with:
```
kubectl -n rook-ceph exec -ti rook-ceph-tools ceph status
```

An ingress controller routes external traffic to services.
Modify config/ingress.yml if needed and install the ingress controller:
```
helm install --values config/ingress.yml stable/nginx-ingress
```

You can check the ingress controller logs with:
```
kubectl logs -l app=nginx-ingress
```

DGXie is an all-in-one container for DHCP, DNS, and PXE, specifically tailored to the DGX Base OS. If you already have DHCP, DNS, or PXE servers you can skip this step.
Setup
You will need to download the official DGX Base OS ISO image to your provisioning machine. The latest DGX Base OS is available via the NVIDIA Enterprise Support Portal (ESP).
Copy the DGX Base OS ISO to shared storage via a container running in Kubernetes,
substituting the path to the DGX ISO you downloaded (be sure to wait for the iso-loader POD
to be in the Running state before attempting to copy the ISO):
```
kubectl apply -f services/iso-loader.yml
kubectl cp /path/to/DGXServer-3.1.2.170902_f8777e.iso $(kubectl get pod -l app=iso-loader -o custom-columns=:metadata.name --no-headers):/data/iso/
```

Note: If the `iso-loader` POD fails to mount the CephFS volume, you may need to restart the kubelet service on the master node(s): `ansible mgmt -b -a "systemctl restart kubelet"`
Configure
Modify the DGXie configuration in config/dgxie.yml to set values for the DHCP server and the DGX install process.
Modify config/dhcpd.hosts.conf to add a static IP lease for each login node and DGX
server in the cluster if required. IP addresses should match those used in the config/inventory file.
You may also add other valid configuration options for dnsmasq to this file.
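For example, a static lease in dnsmasq's dhcp-host syntax (the MAC, hostname, and IP here are placeholders):

```
dhcp-host=de:ad:be:ef:00:01,dgx01,192.168.1.21,infinite
```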
You can get the MAC address of DGX system interfaces via the BMC, for example:
```
# interface 1
ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> raw 0x30 0x19 0x00 0x02 | tail -c 18 | tr ' ' ':'
# interface 2
ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> raw 0x30 0x19 0x00 0x12 | tail -c 18 | tr ' ' ':'
```

Modify config/machines.json to add a PXE entry for each DGX. Copy the dgx-example section and modify
the MAC address for each DGX you would like to boot. You can modify boot parameters or install
alternate operating systems if required.
Store the config files as config-maps in Kubernetes, even if you have not made any changes (the DGXie container will try to mount these config maps):
```
kubectl create configmap dhcpd --from-file=config/dhcpd.hosts.conf
kubectl create configmap pxe-machines --from-file=config/machines.json
```

Deploy DGXie service
Launch the DGXie service:
```
helm install --values config/dgxie.yml services/dgxie
```

Check the DGXie logs to make sure the services were started without errors:
```
kubectl logs -l app=dgxie
```

Configure the management server(s) to use DGXie for cluster-wide DNS:
```
ansible-playbook -l mgmt ansible/playbooks/resolv.yml
```

If you later make changes to config/dhcpd.hosts.conf, you can update the file in Kubernetes
and restart the service with:
```
kubectl create configmap dhcpd --from-file=config/dhcpd.hosts.conf -o yaml --dry-run | kubectl replace -f -
kubectl delete pod -l app=dgxie
```

If you make changes to machines.json, you can update the file without having to restart the DGXie POD:

```
kubectl create configmap pxe-machines --from-file=config/machines.json -o yaml --dry-run | kubectl replace -f -
```

Launch the Apt repo service, which runs on port 30000 (http://mgmt:30000):

```
kubectl apply -f services/apt.yml
```

Modify config/registry.yml if needed and launch the container registry:
```
helm repo add stable https://kubernetes-charts.storage.googleapis.com
helm install --values config/registry.yml stable/docker-registry --version 1.4.3
```

Once you have provisioned DGX servers, configure them to allow access to the local (insecure) container registry:

```
ansible-playbook -k ansible/playbooks/docker.yml
```

You can check the container registry logs with:

```
kubectl logs -l app=docker-registry
```

The container registry will be available to nodes in the cluster at registry.local, for example:
```
# pull container image from Docker Hub
docker pull busybox:latest
# tag image for local container registry
# (you can also get the image ID manually with: docker images)
docker tag $(docker images -f reference=busybox --format "{{.ID}}") registry.local/busybox
# push image to local container registry
docker push registry.local/busybox
```

Cluster monitoring is provided by Prometheus and Grafana.
Service addresses:
- Grafana: http://mgmt:30200
- Prometheus: http://mgmt:30500
- Alertmanager: http://mgmt:30400
Where mgmt represents a DNS name or IP address of one of the management hosts in the Kubernetes cluster.
The default login for Grafana is admin for the username and password.
Modify config/prometheus-operator.yml and config/kube-prometheus.yml if desired and deploy the monitoring
and alerting stack:
```
helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring --values config/prometheus-operator.yml
kubectl create configmap kube-prometheus-grafana-gpu --from-file=config/gpu-dashboard.json -n monitoring
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring --values config/kube-prometheus.yml
```

To collect GPU metrics, label each GPU node and deploy the DCGM Prometheus exporter:
```
kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
kubectl create -f services/dcgm-exporter.yml
```

Centralized logging is provided by Filebeat, Elasticsearch, and Kibana.
Note: The ELK Helm chart is currently out of date and does not provide support for setting the Kibana NodePort.
todo:
- filebeat syslog module needs to be in UTC somehow, syslog in UTC?
- fix kibana nodeport issue
Make sure all systems are set to the same timezone:
```
ansible all -k -b -a 'timedatectl status'
```

To update, use: `ansible all -k -b -a 'timedatectl set-timezone <timezone>'`
Install Osquery:
```
ansible-playbook -k ansible/playbooks/osquery.yml
```

Deploy Elasticsearch and Kibana:
```
helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install --name elk --namespace logging --values config/elk.yml incubator/elastic-stack
```

The ELK stack will take several minutes to install; wait for Elasticsearch to be ready in Kibana before proceeding.
Launch Filebeat, which will create an Elasticsearch index automatically:
```
helm install --name log --namespace logging --values config/filebeat.yml stable/filebeat
```

The logging stack can be deleted with:
```
helm del --purge log
helm del --purge elk
kubectl delete statefulset/elk-elasticsearch-data
kubectl delete pvc -l app=elasticsearch
# wait for all statefulsets to be removed before re-installing...
```

Provisioning:
Provision DGX nodes with the official DGX ISO over PXE boot using DGXie.
Note: The `scripts/do_ipmi.sh` script has these commands and can be looped over multiple hosts.
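If you would rather loop by hand, a minimal sketch (BMC addresses and credentials are placeholders; substitute any of the ipmitool commands below):

```
for bmc in 192.168.2.121 192.168.2.122; do
  ipmitool -I lanplus -U <username> -P <password> -H $bmc power status
done
```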
Disable the DGX IPMI boot device selection 60s timeout; you only need to do this once for each DGX, but it is required:
```
ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> raw 0x00 0x08 0x03 0x08
```

Note: The default IPMI username and password is `qct.admin`.
Set the DGX to boot from the first disk, using EFI, and to persist the setting:
```
ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> raw 0x00 0x08 0x05 0xe0 0x08 0x00 0x00 0x00
```

Set the DGX to boot from the network in EFI mode, for the next boot only. If you set the DGX to always boot from the network, it will get stuck in an install loop. The installer should set the system to boot to the first disk via EFI after the install is finished:

```
ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> chassis bootdev pxe options=efiboot
```

Note: If you have manually modified the boot order in the DGX SBIOS, you may need to manually return it to boot from disk by default before running the IPMI commands above to alter the boot order.
Power cycle/on the DGX to begin the install process:

```
ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> power cycle
```

The DGX install process will take approximately 15 minutes. You can check the DGXie logs with:
```
kubectl logs -l app=dgxie
```

If your DGX systems are on an un-routable subnet, uncomment the ansible_ssh_common_args variable in the
config/group_vars/dgx-servers.yml file and modify the IP address to the IP address of the management server
with access to the private subnet, i.e.
```
ansible_ssh_common_args: '-o ProxyCommand="ssh -W %h:%p -q ubuntu@10.0.0.1"'
```

Test the connection to the DGX servers via the bastion host (management server). Type the password
for dgxuser on the DGX when prompted. The default password for dgxuser is DgxUser123:
```
ansible dgx-servers -k -a 'hostname'
```

Configuration:
Configuration of the DGX is accomplished via Ansible roles.
Various playbooks to install components are available in ansible/playbooks.
Modify the file ansible/site.yml to enable or disable various playbooks, or run playbooks
directly.
Type the default password for dgxuser on the DGX when prompted while running the bootstrap playbook.
The default password for dgxuser is DgxUser123:
```
ansible-playbook -k -K -l dgx-servers ansible/playbooks/bootstrap.yml
```

After running the first command, you may omit the -K flag on subsequent runs. The password for the deepops user will also change to the one set in the config/group_vars/all.yml file (by default, this password is deepops). Run the site playbook to finish configuring the DGX:
```
ansible-playbook -k -l dgx-servers ansible/site.yml
```

Updating Firmware:
Firmware on the DGX can be updated through the firmware update container(s) and Ansible.
- Download the firmware update container package from the NVIDIA Enterprise Support Portal.
Updates are published as announcements on the support portal (example: https://goo.gl/3zimCk).
Make sure you download the correct package depending on the GPU in the DGX-1:
- For V100 (Volta), download the '0102' package - for example: https://dgxdownloads.nvidia.com/custhelp/dgx1/NVIDIA_Containers/nvidia-dgx-fw-0102-20180424.tar.gz
- For P100 (Pascal), download the '0101' package - for example: https://dgxdownloads.nvidia.com/custhelp/dgx1/NVIDIA_Containers/nvidia-dgx-fw-0101-20180424.tar.gz
- Once you've downloaded the `.tar.gz` file, copy or move it inside `containers/dgx-firmware`
- Edit the value of `firmware_update_container` in the file `ansible/roles/nvidia-dgx-firmware/vars/main.yml` to match the name of the downloaded firmware container.
- Run the Ansible playbook to update DGX firmware:
```
ansible-playbook -k -l dgx-servers ansible/playbooks/firmware.yml
```

Adding DGX to Kubernetes:
Create the NVIDIA GPU k8s device plugin daemon set (you only need to do this once):

```
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
```

If the DGX is a member of the Slurm cluster, be sure to drain the node in Slurm so that it does not accept Slurm jobs. From the login node, run:

```
sudo scontrol update node=dgx01 state=drain reason=k8s
```

Modify the config/inventory file to add the DGX to the kube-node and k8s-gpu categories by uncommenting the dgx-servers entry in these sections, as sketched below.
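A hypothetical view of the result (mirror the commented entries in the shipped inventory rather than copying this verbatim; the group names are the ones used in this guide):

```
[kube-node:children]
mgmt
dgx-servers

[k8s-gpu:children]
dgx-servers
```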
Re-run Kubespray to install Kubernetes on the DGX:
```
ansible-playbook -l k8s-cluster -k -v -b --flush-cache --extra-vars "@config/kube.yml" kubespray/cluster.yml
```

Note: If the Kubespray run fails for any reason, try running it again.
Check that the installation was successful:
```
$ kubectl get nodes
NAME      STATUS   ROLES         AGE   VERSION
dgx01     Ready    node          3m    v1.11.0
mgmt01    Ready    master,node   2d    v1.11.0
```

Place a hold on the docker-ce package so it doesn't get upgraded:

```
ansible dgx-servers -k -b -a "apt-mark hold docker-ce"
```

Install the nvidia-container-runtime on the DGX:
```
ansible-playbook -l k8s-gpu -k -v -b --flush-cache --extra-vars "@config/kube.yml" playbooks/k8s-gpu.yml
```

Test that GPU support is working:
```
kubectl apply -f tests/gpu-test-job.yml
kubectl exec -ti gpu-pod -- nvidia-smi -L
kubectl delete pod gpu-pod
```

Note: If you do not require a login node, you may skip this section.
Note: By default the login node(s) are not part of the Kubernetes cluster. If you need to add login node(s) to the Kubernetes cluster, add the login servers to the Kubernetes categories in the `config/inventory` file and re-run the Ansible playbooks as above for the management and DGX servers.
Provisioning:
Modify config/dhcpd.hosts.conf to add a static IP lease for each login node
if required. IP addresses should match those used in the config/inventory file.
Update the dhcpd.hosts.conf config map if modified and restart the DGXie POD:
```
kubectl create configmap dhcpd --from-file=config/dhcpd.hosts.conf -o yaml --dry-run | kubectl replace -f -
kubectl delete pod -l app=dgxie
```

Modify config/machines.json to add a PXE entry for each login node.
Copy the 64-bit-ubuntu-example section and modify
the MAC address for each login node you would like to boot. You can modify boot parameters or install
alternate operating systems if required.
Update the PXE server config map:
```
kubectl create configmap pxe-machines --from-file=config/machines.json -o yaml --dry-run | kubectl replace -f -
```

Set login nodes to boot from the network for the next boot only, and power on the systems. The login nodes should receive a response from the DGXie service and begin the OS install process.
Note: Be sure to either monitor the PXE install or configure servers to boot from the network on the next boot only to avoid a re-install loop
If manually configuring the install, be sure the initial user matches the user in config/group_vars/login.yml.
Configuration:
Once OS installation is complete, bootstrap and configure the login node(s) via Ansible.
If your login nodes are on an un-routable subnet, uncomment the ansible_ssh_common_args variable in the
config/group_vars/login.yml file and modify the IP address to the IP address of the management server
with access to the private subnet, i.e.
```
ansible_ssh_common_args: '-o ProxyCommand="ssh -W %h:%p -q ubuntu@10.0.0.1"'
```

Various playbooks to install components are available in ansible/playbooks.
Modify the file ansible/site.yml to enable or disable various playbooks, or run playbooks
directly:
```
ansible-playbook -k -K -l login ansible/playbooks/bootstrap.yml
ansible-playbook -k -l login ansible/site.yml
```

Slurm overview: https://slurm.schedmd.com/overview.html
"Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters."
Note: For more information on Slurm and GPUs, see: https://github.com/dholt/slurm-gpu
To install Slurm, configure nodes in config/inventory and run the Ansible playbook:
```
ansible-playbook -k -l slurm-cluster ansible/playbooks/slurm.yml
```

DGX nodes may appear 'down' in Slurm after install due to rebooting. Set nodes to idle if required:

```
sudo scontrol update node=dgx01 state=idle
```

Adding Software:
To modify installed software on cluster nodes, edit the package list in ansible/roles/software/defaults/main.yml
and apply the changes:
```
ansible-playbook -k -l login ansible/playbooks/software.yml
```

The playbooks/extra.yml file contains optional configuration (these will be moved at a later date):

```
ansible-playbook -k -l all playbooks/extra.yml
```

Building software:
HPC clusters generally utilize a system of versioned software modules instead of installing software via the OS package manager. These software builds can be made easier with the EasyBuild tool. The software build environment should be set up on the login node in a shared directory accessible by all cluster nodes.
Assuming you created or used an existing NFS share during cluster bootstrap, create a directory
to hold software builds and create a direnv file to facilitate easier EasyBuild builds:
EasyBuild environment file:
```
$ cat /shared/.envrc
export EASYBUILD_PREFIX=/shared/sw
export EASYBUILD_MODULES_TOOL=Lmod
export EASYBUILD_JOB_BACKEND=GC3Pie
export EASYBUILD_JOB_BACKEND_CONFIG=/shared/.gc3pie.cfg
module use /shared/sw/modules/all
module load EasyBuild
```

Where the shared NFS directory is /shared, and initial software/modules built with EasyBuild are installed in /shared/sw.
The direnv package should have been installed by default during cluster node configuration.
For more information on direnv, see: https://direnv.net/.
Use direnv to automatically set your EasyBuild environment; first
add an appropriate command to your shell login scripts:
```
type direnv >/dev/null 2>&1 && eval "$(direnv hook bash)"
```

Then `cd /shared` and run `direnv allow`. The .envrc file should set up the environment to use EasyBuild.
Install EasyBuild using the shared directory as the install path:
```
# pick an installation prefix to install EasyBuild to (change this to your liking)
EASYBUILD_PREFIX=/shared/sw
# download script
curl -O https://raw.githubusercontent.com/easybuilders/easybuild-framework/develop/easybuild/scripts/bootstrap_eb.py
# bootstrap EasyBuild
python bootstrap_eb.py $EASYBUILD_PREFIX
# update $MODULEPATH, and load the EasyBuild module
module use $EASYBUILD_PREFIX/modules/all
module load EasyBuild
```

Example usage for building software:
```
# search
eb -S gcc-6
# build
eb GCC-6.4.0-2.28.eb -r
```

Example usage for using software:
```
# prepend environment module path
export MODULEPATH=$EASYBUILD_PREFIX/modules/all:$MODULEPATH
# load environment module
module load HPL
```

Slurm updates:
```
# whole shebang:
ansible-playbook -k -l slurm-cluster ansible/playbooks/slurm.yml
# just prolog and/or epilog:
ansible-playbook -k -l compute-nodes --tags prolog,epilog -e 'gather_facts=no' ansible/playbooks/slurm.yml
```

Modify GPU drivers:
```
ansible-playbook -k -l <dgx-hostname> playbooks/gpu-driver.yml
```

Extra:
Set up /raid RAID-0 array cache (can also add rebuild-raid to PXE boot cmdline when installing):
```
ansible dgx-servers -k -b -a "/usr/bin/configure_raid_array.py -i"
```

Un-freeze NVLINK counters (may want to use 0brw for just read/write):

```
ansible dgx-servers -k -b -a "nvidia-smi nvlink -sc 0bz"
```

Managing DGX scheduler allocation:
Once the DGX compute nodes have been added to Kubernetes and Slurm, you can use the scripts/doctl.sh
script to manage which scheduler each DGX is allowed to run jobs from.
NVIDIA GPU Cloud Container Registry (NGC):
Create secret for registry login:
```
kubectl create secret docker-registry ngc --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=<api-key> --docker-email='foo@example.com'
```

Add to Kubernetes pod spec:

```
imagePullSecrets:
  - name: ngc
```

Upgrading Helm Charts:
If you make changes to configuration or want to update Helm charts, you can use the helm upgrade
command to apply changes
Show currently installed releases:
```
helm list
```

To upgrade the ingress controller with new values from config/ingress.yml, for example, you would run:

```
helm upgrade --values config/ingress.yml <release_name> stable/nginx-ingress
```

Where <release_name> is the name of the deployed ingress controller chart obtained from `helm list`.
TODO:
- (done) restrict namespace to nodes with specific labels, i.e. `scheduler=k8s`
- wait for k8s fix to daemonset and PodNodeSelector issues
Using OAuth2
References: https://medium.com/@jessgreb01/kubernetes-authn-authz-with-google-oidc-and-rbac-74509ca8267e
Copy admin.conf and ca.pem from a kube master (i.e. mgmt01) to /root/.kube on the login
node (i.e. login01).
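The earlier "Set up Kubernetes for remote administration" step already fetched admin.conf to the provisioning machine; a hedged sketch for staging both files on the login node (the ca.pem path assumes Kubespray's default of /etc/kubernetes/ssl/ca.pem, and the hostnames are the examples from the text):

```
# fetch ca.pem from the kube master the same way admin.conf was fetched
ansible mgmt01 -b -m fetch -a "src=/etc/kubernetes/ssl/ca.pem flat=yes dest=./"
# stage both files on the login node and move them into /root/.kube
scp admin.conf ca.pem login01:/tmp/
ssh login01 'sudo mkdir -p /root/.kube && sudo mv /tmp/admin.conf /tmp/ca.pem /root/.kube/'
```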
Generate an OAUTH2 client JSON config file and copy the user script to the login node:
```
sudo mkdir -p /shared/{bin,etc}
sudo cp scripts/k8s_user.sh /shared/bin/
sudo chmod +x /shared/bin/k8s_user.sh
sudo cp config/google_oauth2_client.json /shared/etc/
```

Download kubectl and ks (ksonnet) and put them in /shared/bin.
Users can run the script to log in to Google Auth, generate tokens and create a kube config:
```
sudo /shared/bin/k8s_user.sh
```
Restrict Namespaces:
todo: a daemonset will still continuously try and fail to schedule pods on all nodes
User namespaces need to be restricted to nodes which are in k8s scheduling mode. Otherwise users can run pods on management nodes and nodes which are being managed by Slurm (via a DaemonSet for example).
Update the Kubespray config in config/kube.yml to tell the Kube API server to use the PodNodeSelector
admission controller (this should already be the default):
```
kube_apiserver_admission_control:
  ...
  - PodNodeSelector
```

Patch namespaces to apply a specific node selector to every pod:

```
kubectl patch namespace <username> -p '{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/node-selector":"scheduler=k8s"}}}'
kubectl get ns <username> -o yaml
```

Where <username> is the name of the namespace, typically the same as the username.
Using certs
Source: https://docs.bitnami.com/kubernetes/how-to/configure-rbac-in-your-kubernetes-cluster/
Copy the script to one of the management nodes and run to create a user:
```
scp scripts/add_user.sh mgmt-01:/tmp
ssh mgmt-01 /tmp/add_user.sh <username>
scp mgmt-01:~/<username>.kubeconfig ~/.kube/config
```

Where <username> is the name of the new user account being created.
Service Mesh:
This may be needed for L7 load-balancing for GRPC services
```
kubectl apply -f services/ambassador-service.yml
kubectl apply -f services/ambassador-rbac.yml
```

If Ansible complains that a variable is undefined, you can check node values with something like:

```
ansible all -m debug -a "var=ansible_default_ipv4"
```

Where ansible_default_ipv4 is the variable in question.
Rook:
If you need to remove Rook for any reason, here are the steps:
```
kubectl delete -f services/rook-cluster.yml
helm del --purge rook-ceph
ansible mgmt -b -m file -a "path=/var/lib/rook state=absent"
```

Software used in this project:
- Ansible roles:
- Kubespray: https://github.com/kubernetes-incubator/kubespray
- Ceph: https://github.com/ceph/ceph-ansible
- Pixiecore: https://github.com/google/netboot/tree/master/pixiecore
This project is released under the BSD 3-clause license.
A signed copy of the Contributor License Agreement needs to be provided to deepops@nvidia.com before any change can be accepted.
- Please let us know by filing a new issue
- You can contribute by opening a pull request