What happened:
I have a Kubernetes cluster deployed on AKS running the network-operator. About an hour after deployment, the Multus plugin fails to assign IPs to the pods. The event I get looks like this:
0s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-45ksq
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox
"26ed6556afa0ebd0407fa6db67b308e56524733d73304a2611eacdc1408a27aa": plugin type="multus" name="multus-cni-network"
failed (add): Multus: [default/ibtest-57599dd56f-45ksq/811508e0-b5d0-4126-9b00-16def7acfd6f]: error waiting for pod: Unauthorized
Here is the full list of events:
$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
6s Normal Scheduled pod/ibtest-57599dd56f-45ksq Successfully assigned default/ibtest-57599dd56f-45ksq to aks-gpunodes-17372139-vmss000000
6s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-45ksq Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2e21cf024a4aab49665572445258e58c8ba1f53ed6e55dbae18816c339872b1d": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-45ksq/811508e0-b5d0-4126-9b00-16def7acfd6f]: error waiting for pod: Unauthorized
6s Normal Scheduled pod/ibtest-57599dd56f-k2xmt Successfully assigned default/ibtest-57599dd56f-k2xmt to aks-gpunodes-17372139-vmss000001
6s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-k2xmt Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "449727243cb4f36d9ab9e55a5cd7df86675b1346facc3b1e48ce3cf50f2ed7c5": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-k2xmt/43887458-6c67-4217-8e47-e08b09b114ce]: error waiting for pod: Unauthorized
6s Normal SuccessfulCreate replicaset/ibtest-57599dd56f Created pod: ibtest-57599dd56f-k2xmt
6s Normal SuccessfulCreate replicaset/ibtest-57599dd56f Created pod: ibtest-57599dd56f-45ksq
6s Normal ScalingReplicaSet deployment/ibtest Scaled up replica set ibtest-57599dd56f to 2
0s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-k2xmt Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1d4aec3c086b534f216d6ca1e2d0c84d16ccfdd7283fadb50c96f3cc544014ae": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-k2xmt/43887458-6c67-4217-8e47-e08b09b114ce]: error waiting for pod: Unauthorized
0s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-45ksq Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "26ed6556afa0ebd0407fa6db67b308e56524733d73304a2611eacdc1408a27aa": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-45ksq/811508e0-b5d0-4126-9b00-16def7acfd6f]: error waiting for pod: Unauthorized
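The `Unauthorized` in the events above suggests the API server is rejecting the credentials in the kubeconfig that Multus generates on the node, commonly because a service-account token has expired. A minimal sketch of checking a JWT-style bearer token's `exp` claim, assuming the token format only; the helper name and the fabricated token are illustrative, not from the real cluster:

```python
import base64
import json
import time

def jwt_exp(token: str) -> int:
    """Return the `exp` claim of a JWT-style token (no signature verification)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))["exp"]

# Fabricated token standing in for the real one, which lives on the node
# (typically in the Multus kubeconfig under /etc/cni/net.d/multus.d/).
claims = base64.urlsafe_b64encode(
    json.dumps({"exp": 1741300000}).encode()
).rstrip(b"=").decode()
token = f"eyJhbGciOiJub25lIn0.{claims}."

exp = jwt_exp(token)
print("token expired:", exp < int(time.time()))
```

If the token in the real kubeconfig has expired, restarting the kube-multus DaemonSet pods may regenerate it, which would at least distinguish a token-rotation problem from an RBAC one.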
What you expected to happen:
I expected the pod to start with the secondary IP assigned by the Multus plugin.
How to reproduce it (as minimally and precisely as possible):
Steps here: https://gist.github.com/surajssd/a0596ca7785228f025be5c3ac177219f
But here are the relevant steps:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade -i \
--wait \
--create-namespace \
-n network-operator \
network-operator \
nvidia/network-operator \
--set nfd.deployNodeFeatureRules=false
Anything else we need to know?:
I am using the machine type: Standard_HB120-16rs_v3 on Azure.
Logs:
- NicClusterPolicy CR spec and state:
---
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nfd-network-rule
spec:
  rules:
    - name: "nfd-network-rule"
      labels:
        "feature.node.kubernetes.io/pci-15b3.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            device: {op: In, value: ["101c", "101e"]}
---
# Try to match the versions from: https://github.com/Mellanox/network-operator/blob/master/example/crs/mellanox.com_v1alpha1_nicclusterpolicy_cr-full.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nvIpam:
    enableWebhook: false
    repository: ghcr.io/mellanox
    image: nvidia-k8s-ipam
    # Latest tag: https://github.com/mellanox/nvidia-k8s-ipam/pkgs/container/nvidia-k8s-ipam
    version: v0.2.0
  ofedDriver:
    forcePrecompiled: false
    repository: nvcr.io/nvidia/mellanox
    image: doca-driver
    # Latest tag: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/mellanox/containers/doca-driver/tags
    # When this is deployed, a suffix is appended in the format "-<os><os version>-<cpu arch>", e.g. "-ubuntu22.04-amd64".
    version: 24.10-0.7.0.0-0
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
      maxParallelUpgrades: 1
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    repository: ghcr.io/mellanox
    image: k8s-rdma-shared-dev-plugin
    # Latest tag: https://github.com/mellanox/k8s-rdma-shared-dev-plugin/pkgs/container/k8s-rdma-shared-dev-plugin
    version: v1.5.2
    useCdi: true
    # The config below propagates directly to the k8s-rdma-shared-device-plugin configuration.
    # Adjust the selectors to match your (RDMA-capable) netdevices.
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"]
            }
          }
        ]
      }
  secondaryNetwork:
    cniPlugins:
      repository: ghcr.io/k8snetworkplumbingwg
      image: plugins
      # Latest tag: https://github.com/k8snetworkplumbingwg/plugins/pkgs/container/plugins
      version: v1.5.0
    multus:
      repository: ghcr.io/k8snetworkplumbingwg
      image: multus-cni
      # Latest tag: https://github.com/k8snetworkplumbingwg/multus-cni/pkgs/container/multus-cni
      version: v4.1.0
    ipamPlugin:
      repository: ghcr.io/k8snetworkplumbingwg
      image: whereabouts
      # Latest tag: https://github.com/k8snetworkplumbingwg/whereabouts/pkgs/container/whereabouts
      version: v0.7.0
    ipoib:
      repository: ghcr.io/mellanox
      image: ipoib-cni
      # Latest tag: https://github.com/mellanox/ipoib-cni/pkgs/container/ipoib-cni
      version: v1.2.1
  # This is needed so that pods are not scheduled on non-InfiniBand machines.
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # This node label is added by NFD.
            - key: feature.node.kubernetes.io/pci-15b3.present
              operator: In
              values:
                - "true"
---
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: aks-infiniband
spec:
  networkNamespace: "default"
  # This is an alternative name (altname) of the IPoIB interface. Per this blog
  # it appears to be common across VMs of the same SKU:
  # https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-tightly-coupled-hpcai-workloads-with-infiniband-using-nvidia-network-ope/4117209
  # It has been consistent on the machine types Standard_HB120rs_v3 and Standard_ND96asr_v4.
  # TODO: Figure out how to derive this name generically.
  master: "ibP257p0s0"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.0.0/16",
      "exclude": [
        "192.168.0.0/32",
        "192.168.255.255/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info",
      "gateway": "192.168.0.1"
    }
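For context on the IPAM config above, a quick sketch with Python's `ipaddress` module showing the size of the whereabouts pool after the two /32 exclusions (assuming whereabouts excludes exactly the listed addresses):

```python
import ipaddress

pool = ipaddress.ip_network("192.168.0.0/16")
excluded = [ipaddress.ip_network(e) for e in ("192.168.0.0/32", "192.168.255.255/32")]

# Total pool minus the excluded network and broadcast addresses.
usable = pool.num_addresses - sum(n.num_addresses for n in excluded)
print("pool size:", pool.num_addresses)        # 65536 addresses in a /16
print("usable after exclusions:", usable)      # 65534
```

So address exhaustion is very unlikely to be the cause here; with two pods the pool is effectively empty.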
- Output of kubectl get -n network-operator all:
✗ kubectl get -n network-operator all
NAME READY STATUS RESTARTS AGE
pod/cni-plugins-ds-j5v8k 1/1 Running 0 101m
pod/cni-plugins-ds-x9h28 1/1 Running 0 101m
pod/kube-ipoib-cni-ds-p4lfq 1/1 Running 0 101m
pod/kube-ipoib-cni-ds-pwwmh 1/1 Running 0 101m
pod/kube-multus-ds-bt6h6 1/1 Running 0 101m
pod/kube-multus-ds-ss8dn 1/1 Running 0 101m
pod/mofed-ubuntu22.04-6fd94b4c6b-ds-8d2k6 1/1 Running 5 (94m ago) 101m
pod/mofed-ubuntu22.04-6fd94b4c6b-ds-s5sp9 1/1 Running 4 (95m ago) 101m
pod/network-operator-5ff6ff9559-qvqg8 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-gc-6d48649f49-wqwmd 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-master-57648d678f-cmw7v 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-worker-9n887 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-worker-qtlpm 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-worker-rtpmp 1/1 Running 0 101m
pod/nv-ipam-controller-67556c846b-c5ljf 1/1 Running 0 101m
pod/nv-ipam-controller-67556c846b-zb2nw 1/1 Running 0 101m
pod/nv-ipam-node-4bqhh 1/1 Running 0 101m
pod/nv-ipam-node-t4l47 1/1 Running 0 101m
pod/rdma-shared-dp-ds-kvfhb 1/1 Running 0 92m
pod/rdma-shared-dp-ds-l442j 1/1 Running 0 93m
pod/whereabouts-x5nql 1/1 Running 0 101m
pod/whereabouts-xwqzt 1/1 Running 0 101m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/cni-plugins-ds 2 2 2 2 2 <none> 101m
daemonset.apps/kube-ipoib-cni-ds 2 2 2 2 2 <none> 101m
daemonset.apps/kube-multus-ds 2 2 2 2 2 <none> 101m
daemonset.apps/mofed-ubuntu22.04-6fd94b4c6b-ds 2 2 2 2 2 feature.node.kubernetes.io/kernel-version.full=5.15.0-1079-azure,feature.node.kubernetes.io/pci-15b3.present=true,feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04 101m
daemonset.apps/network-operator-node-feature-discovery-worker 3 3 3 3 3 <none> 101m
daemonset.apps/nv-ipam-node 2 2 2 2 2 <none> 101m
daemonset.apps/rdma-shared-dp-ds 2 2 2 2 2 feature.node.kubernetes.io/pci-15b3.present=true,network.nvidia.com/operator.mofed.wait=false 101m
daemonset.apps/whereabouts 2 2 2 2 2 <none> 101m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/network-operator 1/1 1 1 101m
deployment.apps/network-operator-node-feature-discovery-gc 1/1 1 1 101m
deployment.apps/network-operator-node-feature-discovery-master 1/1 1 1 101m
deployment.apps/nv-ipam-controller 2/2 2 2 101m
NAME DESIRED CURRENT READY AGE
replicaset.apps/network-operator-5ff6ff9559 1 1 1 101m
replicaset.apps/network-operator-node-feature-discovery-gc-6d48649f49 1 1 1 101m
replicaset.apps/network-operator-node-feature-discovery-master-57648d678f 1 1 1 101m
replicaset.apps/nv-ipam-controller-67556c846b 2 2 2 101m
- Network Operator version:
✗ helm -n network-operator ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
network-operator network-operator 1 2025-03-05 16:37:15.858515 -0800 PST deployed network-operator-25.1.0 v25.1.0
- Logs of Network Operator controller:
✗ kubectl logs -n network-operator network-operator-5ff6ff9559-qvqg8
full log here: https://gist.github.com/surajssd/3c89a7aed0a57ca40d7b371bf84940a9
Environment:
- Kubernetes version (use kubectl version):
✗ kubectl version
Client Version: v1.32.2
Kustomize Version: v5.5.0
Server Version: v1.30.9
WARNING: version difference between client (1.32) and server (1.30) exceeds the supported minor version skew of +/-1
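The warning above comes from kubectl's +/-1 minor-version skew policy; conceptually the check is just a minor-version comparison. A sketch using the versions from the output (the helper name is illustrative):

```python
def minor(version: str) -> int:
    # "v1.32.2" -> 32
    return int(version.lstrip("v").split(".")[1])

client, server = "v1.32.2", "v1.30.9"
skew = abs(minor(client) - minor(server))
print(f"minor-version skew: {skew} (supported: {skew <= 1})")
```

The skew here is only a client-side warning and is unlikely to be related to the Multus failure, but noting it in case it matters.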
- Hardware configuration:
- Network adapter model and firmware version:
mlx5_core driver version: 24.10-0.7.0
Logs from the mofed pods: https://gist.github.com/surajssd/1521f546efe4612c2cd9b1280d1dc654
- OS (e.g. cat /etc/os-release):
# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
# uname -a
Linux aks-gpunodes-17372139-vmss000000 5.15.0-1079-azure #88-Ubuntu SMP Thu Jan 16 19:18:54 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux