multus-cni-network fails to assign IP: error waiting for pod: Unauthorized #1363

@surajssd

What happened:

I have a Kubernetes cluster deployed on AKS and am running network-operator on it. An hour after the deployment, the Multus plugin fails to assign IPs to pods. The event looks like this:

0s          Warning   FailedCreatePodSandBox   pod/ibtest-57599dd56f-45ksq
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox 
"26ed6556afa0ebd0407fa6db67b308e56524733d73304a2611eacdc1408a27aa": plugin type="multus" name="multus-cni-network"
failed (add): Multus: [default/ibtest-57599dd56f-45ksq/811508e0-b5d0-4126-9b00-16def7acfd6f]: error waiting for pod: Unauthorized
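
An `Unauthorized` error that only appears an hour after deployment suggests an expired credential, though nothing in this report confirms that. The following is a minimal sketch for checking when a service account token (a JWT) expires. The function names `jwt_exp` and `token_expired` are hypothetical helpers, and the kubeconfig path in the usage comment is an assumption about where the Multus thin plugin writes its kubeconfig; adjust it to your node.

```shell
# jwt_exp TOKEN: print the "exp" claim (Unix seconds) from a JWT payload.
jwt_exp() {
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops padding; restore it so `base64 -d` accepts the input.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d | grep -oE '"exp" *: *[0-9]+' | grep -oE '[0-9]+$'
}

# token_expired TOKEN [NOW]: exit 0 if the token's exp is at or before NOW
# (NOW defaults to the current time in Unix seconds).
token_expired() {
  exp=$(jwt_exp "$1")
  now=${2:-$(date +%s)}
  [ -n "$exp" ] && [ "$exp" -le "$now" ]
}

# Usage on an affected node (path is an assumption, not taken from this report):
#   token=$(awk '/token:/ {gsub(/"/, "", $2); print $2}' /etc/cni/net.d/multus.d/multus.kubeconfig)
#   token_expired "$token" && echo "Multus kubeconfig token has expired"
```

If the token turns out to be expired, that would point at token refresh rather than at IPAM itself; if it is still valid, this line of inquiry can be dropped.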

Here is the full list of events:

$ kubectl get events
LAST SEEN   TYPE      REASON                   OBJECT                                  MESSAGE
6s          Normal    Scheduled                pod/ibtest-57599dd56f-45ksq             Successfully assigned default/ibtest-57599dd56f-45ksq to aks-gpunodes-17372139-vmss000000
6s          Warning   FailedCreatePodSandBox   pod/ibtest-57599dd56f-45ksq             Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2e21cf024a4aab49665572445258e58c8ba1f53ed6e55dbae18816c339872b1d": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-45ksq/811508e0-b5d0-4126-9b00-16def7acfd6f]: error waiting for pod: Unauthorized
6s          Normal    Scheduled                pod/ibtest-57599dd56f-k2xmt             Successfully assigned default/ibtest-57599dd56f-k2xmt to aks-gpunodes-17372139-vmss000001
6s          Warning   FailedCreatePodSandBox   pod/ibtest-57599dd56f-k2xmt             Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "449727243cb4f36d9ab9e55a5cd7df86675b1346facc3b1e48ce3cf50f2ed7c5": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-k2xmt/43887458-6c67-4217-8e47-e08b09b114ce]: error waiting for pod: Unauthorized
6s          Normal    SuccessfulCreate         replicaset/ibtest-57599dd56f            Created pod: ibtest-57599dd56f-k2xmt
6s          Normal    SuccessfulCreate         replicaset/ibtest-57599dd56f            Created pod: ibtest-57599dd56f-45ksq
6s          Normal    ScalingReplicaSet        deployment/ibtest                       Scaled up replica set ibtest-57599dd56f to 2
0s          Warning   FailedCreatePodSandBox   pod/ibtest-57599dd56f-k2xmt             Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1d4aec3c086b534f216d6ca1e2d0c84d16ccfdd7283fadb50c96f3cc544014ae": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-k2xmt/43887458-6c67-4217-8e47-e08b09b114ce]: error waiting for pod: Unauthorized
0s          Warning   FailedCreatePodSandBox   pod/ibtest-57599dd56f-45ksq             Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "26ed6556afa0ebd0407fa6db67b308e56524733d73304a2611eacdc1408a27aa": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-45ksq/811508e0-b5d0-4126-9b00-16def7acfd6f]: error waiting for pod: Unauthorized
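
To see at a glance which pods are affected, the event list above can be reduced to the unique pods that hit this failure. This is a small sketch; `unauthorized_pods` is a hypothetical helper name, and it only relies on the event text shown above.

```shell
# unauthorized_pods: read `kubectl get events` output on stdin and print each
# pod that hit the Multus "Unauthorized" sandbox failure, once per pod.
unauthorized_pods() {
  awk '/FailedCreatePodSandBox/ && /error waiting for pod: Unauthorized/ {
         # The OBJECT column is the first field that starts with "pod/".
         for (i = 1; i <= NF; i++) if ($i ~ /^pod\//) { print $i; break }
       }' | sort -u
}

# Usage (assumes kubectl is pointed at the affected cluster):
#   kubectl get events | unauthorized_pods
```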

What you expected to happen:

I expected the pod to be started with the secondary IP assigned by the multus plugin.

How to reproduce it (as minimally and precisely as possible):

The full steps are here: https://gist.github.com/surajssd/a0596ca7785228f025be5c3ac177219f

The relevant steps are:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm upgrade -i \
    --wait \
    --create-namespace \
    -n network-operator \
    network-operator \
    nvidia/network-operator \
    --set nfd.deployNodeFeatureRules=false

Anything else we need to know?:

I am using the Standard_HB120-16rs_v3 machine type on Azure.

Logs:

  • NicClusterPolicy CR spec and state:
---
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nfd-network-rule
spec:
   rules:
   - name: "nfd-network-rule"
     labels:
        "feature.node.kubernetes.io/pci-15b3.present": "true"
     matchFeatures:
        - feature: pci.device
          matchExpressions:
            device: {op: In, value: ["101c", "101e"]}
---
# Try to match the versions from: https://github.com/Mellanox/network-operator/blob/master/example/crs/mellanox.com_v1alpha1_nicclusterpolicy_cr-full.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nvIpam:
    enableWebhook: false
    repository: ghcr.io/mellanox
    image: nvidia-k8s-ipam
    # Latest tag: https://github.com/mellanox/nvidia-k8s-ipam/pkgs/container/nvidia-k8s-ipam
    version: v0.2.0

  ofedDriver:
    forcePrecompiled: false
    repository: nvcr.io/nvidia/mellanox
    image: doca-driver
    # Latest tag: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/mellanox/containers/doca-driver/tags
    # When this is deployed, a suffix is added in the format "-<os><os version>-<cpu arch>", e.g. "-ubuntu22.04-amd64".
    version: 24.10-0.7.0.0-0

    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
      maxParallelUpgrades: 1
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30

  rdmaSharedDevicePlugin:
    repository: ghcr.io/mellanox
    image: k8s-rdma-shared-dev-plugin
    # Latest tag: https://github.com/mellanox/k8s-rdma-shared-dev-plugin/pkgs/container/k8s-rdma-shared-dev-plugin
    version: v1.5.2
    useCdi: true

    # The config below directly propagates to k8s-rdma-shared-device-plugin configuration.
    # Replace 'devices' with your (RDMA capable) netdevice name.
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"]
            }
          }
        ]
      }

  secondaryNetwork:
    cniPlugins:
      repository: ghcr.io/k8snetworkplumbingwg
      image: plugins
      # Latest tag: https://github.com/k8snetworkplumbingwg/plugins/pkgs/container/plugins
      version: v1.5.0

    multus:
      repository: ghcr.io/k8snetworkplumbingwg
      image: multus-cni
      # Latest tag: https://github.com/k8snetworkplumbingwg/multus-cni/pkgs/container/multus-cni
      version: v4.1.0

    ipamPlugin:
      repository: ghcr.io/k8snetworkplumbingwg
      image: whereabouts
      # Latest tag: https://github.com/k8snetworkplumbingwg/whereabouts/pkgs/container/whereabouts
      version: v0.7.0

    ipoib:
      repository: ghcr.io/mellanox
      image: ipoib-cni
      # Latest tag: https://github.com/mellanox/ipoib-cni/pkgs/container/ipoib-cni
      version: v1.2.1

  # This is needed so that you don't schedule pods on the non-infiniband machines.
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # This node label is added by NFD.
        - key: feature.node.kubernetes.io/pci-15b3.present
          operator: In
          values:
          - "true"
---
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: aks-infiniband
spec:
  networkNamespace: "default"
  # This is an alt interface name for the IPoIB interface. As per this blog it
  # seems to be common across the same VM SKU:
  # https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-tightly-coupled-hpcai-workloads-with-infiniband-using-nvidia-network-ope/4117209
  # It has been consistent on the machine types: Standard_HB120rs_v3 and Standard_ND96asr_v4.
  # TODO: Figure out how to get this generic name?
  master: "ibP257p0s0"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.0.0/16",
      "exclude": [
       "192.168.0.0/32",
       "192.168.255.255/32"
      ],
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info",
      "gateway": "192.168.0.1"
    }
  • Output of: kubectl get -n network-operator all:
$ kubectl get -n network-operator all
NAME                                                                  READY   STATUS    RESTARTS      AGE
pod/cni-plugins-ds-j5v8k                                              1/1     Running   0             101m
pod/cni-plugins-ds-x9h28                                              1/1     Running   0             101m
pod/kube-ipoib-cni-ds-p4lfq                                           1/1     Running   0             101m
pod/kube-ipoib-cni-ds-pwwmh                                           1/1     Running   0             101m
pod/kube-multus-ds-bt6h6                                              1/1     Running   0             101m
pod/kube-multus-ds-ss8dn                                              1/1     Running   0             101m
pod/mofed-ubuntu22.04-6fd94b4c6b-ds-8d2k6                             1/1     Running   5 (94m ago)   101m
pod/mofed-ubuntu22.04-6fd94b4c6b-ds-s5sp9                             1/1     Running   4 (95m ago)   101m
pod/network-operator-5ff6ff9559-qvqg8                                 1/1     Running   0             101m
pod/network-operator-node-feature-discovery-gc-6d48649f49-wqwmd       1/1     Running   0             101m
pod/network-operator-node-feature-discovery-master-57648d678f-cmw7v   1/1     Running   0             101m
pod/network-operator-node-feature-discovery-worker-9n887              1/1     Running   0             101m
pod/network-operator-node-feature-discovery-worker-qtlpm              1/1     Running   0             101m
pod/network-operator-node-feature-discovery-worker-rtpmp              1/1     Running   0             101m
pod/nv-ipam-controller-67556c846b-c5ljf                               1/1     Running   0             101m
pod/nv-ipam-controller-67556c846b-zb2nw                               1/1     Running   0             101m
pod/nv-ipam-node-4bqhh                                                1/1     Running   0             101m
pod/nv-ipam-node-t4l47                                                1/1     Running   0             101m
pod/rdma-shared-dp-ds-kvfhb                                           1/1     Running   0             92m
pod/rdma-shared-dp-ds-l442j                                           1/1     Running   0             93m
pod/whereabouts-x5nql                                                 1/1     Running   0             101m
pod/whereabouts-xwqzt                                                 1/1     Running   0             101m

NAME                                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                                                                                                                                            AGE
daemonset.apps/cni-plugins-ds                                   2         2         2       2            2           <none>                                                                                                                                                                                                                                   101m
daemonset.apps/kube-ipoib-cni-ds                                2         2         2       2            2           <none>                                                                                                                                                                                                                                   101m
daemonset.apps/kube-multus-ds                                   2         2         2       2            2           <none>                                                                                                                                                                                                                                   101m
daemonset.apps/mofed-ubuntu22.04-6fd94b4c6b-ds                  2         2         2       2            2           feature.node.kubernetes.io/kernel-version.full=5.15.0-1079-azure,feature.node.kubernetes.io/pci-15b3.present=true,feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04   101m
daemonset.apps/network-operator-node-feature-discovery-worker   3         3         3       3            3           <none>                                                                                                                                                                                                                                   101m
daemonset.apps/nv-ipam-node                                     2         2         2       2            2           <none>                                                                                                                                                                                                                                   101m
daemonset.apps/rdma-shared-dp-ds                                2         2         2       2            2           feature.node.kubernetes.io/pci-15b3.present=true,network.nvidia.com/operator.mofed.wait=false                                                                                                                                            101m
daemonset.apps/whereabouts                                      2         2         2       2            2           <none>                                                                                                                                                                                                                                   101m

NAME                                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator                                 1/1     1            1           101m
deployment.apps/network-operator-node-feature-discovery-gc       1/1     1            1           101m
deployment.apps/network-operator-node-feature-discovery-master   1/1     1            1           101m
deployment.apps/nv-ipam-controller                               2/2     2            2           101m

NAME                                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-5ff6ff9559                                 1         1         1       101m
replicaset.apps/network-operator-node-feature-discovery-gc-6d48649f49       1         1         1       101m
replicaset.apps/network-operator-node-feature-discovery-master-57648d678f   1         1         1       101m
replicaset.apps/nv-ipam-controller-67556c846b                               2         2         2       101m
  • Network Operator version:
$ helm -n network-operator ls
NAME                    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
network-operator        network-operator        1               2025-03-05 16:37:15.858515 -0800 PST    deployed        network-operator-25.1.0 v25.1.0
  • Logs of Network Operator controller:
$ kubectl logs -n network-operator network-operator-5ff6ff9559-qvqg8

full log here: https://gist.github.com/surajssd/3c89a7aed0a57ca40d7b371bf84940a9
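
The pod listing above shows every operator pod as Running, so the failure is not visible from pod status alone. The only anomaly is the restart count on the two mofed pods. A sketch for pulling both facts out of a pod listing follows. `pods_not_ready` and `pods_restarted` are hypothetical helper names, and they expect plain `kubectl get pods` output (no `pod/` prefix, one header line), not `kubectl get all`.

```shell
# pods_not_ready: read `kubectl get pods` output on stdin; print the name of
# every pod whose READY count is short or whose STATUS is not Running.
pods_not_ready() {
  awk 'NR > 1 && $2 ~ /^[0-9]+\/[0-9]+$/ {
         split($2, r, "/")
         if (r[1] != r[2] || $3 != "Running") print $1
       }'
}

# pods_restarted: print "<pod> <restarts>" for every pod that restarted at least once.
pods_restarted() {
  awk 'NR > 1 && $4 + 0 > 0 { print $1, $4 }'
}

# Usage (assumes kubectl is pointed at the affected cluster):
#   kubectl get -n network-operator pods | pods_not_ready
#   kubectl get -n network-operator pods | pods_restarted
```

For the listing in this report, the first filter should print nothing and the second should list only the two mofed pods.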

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: v1.32.2
Kustomize Version: v5.5.0
Server Version: v1.30.9
WARNING: version difference between client (1.32) and server (1.30) exceeds the supported minor version skew of +/-1
  • Hardware configuration:
    • Network adapter model and firmware version: mlx5_core driver version: 24.10-0.7.0

Logs from the mofed pods: https://gist.github.com/surajssd/1521f546efe4612c2cd9b1280d1dc654

  • OS (e.g: cat /etc/os-release):
# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
  • Kernel (e.g. uname -a):
# uname -a
Linux aks-gpunodes-17372139-vmss000000 5.15.0-1079-azure #88-Ubuntu SMP Thu Jan 16 19:18:54 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
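
The skew warning in the `kubectl version` output above comes down to simple arithmetic on the minor versions. It is probably unrelated to the Multus failure, but it is easy to rule in or out. A minimal sketch, with `minor_of` and `minor_skew` as hypothetical helper names:

```shell
# minor_of VERSION: print the minor number of a version such as v1.32.2.
minor_of() {
  printf '%s\n' "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# minor_skew CLIENT SERVER: print the absolute difference of the minor versions.
minor_skew() {
  a=$(minor_of "$1")
  b=$(minor_of "$2")
  d=$(( a - b ))
  if [ "$d" -lt 0 ]; then d=$(( -d )); fi
  printf '%s\n' "$d"
}

# Usage with the versions from this report:
#   minor_skew v1.32.2 v1.30.9   # prints 2, which exceeds the supported skew of 1
```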

Labels: bug (Something isn't working)
