What happened:
I have a Kubernetes cluster deployed on AKS running the network-operator. About an hour after deployment, the Multus plugin fails to assign IPs to the pods. The event I get looks like this:
0s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-45ksq
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox
"26ed6556afa0ebd0407fa6db67b308e56524733d73304a2611eacdc1408a27aa": plugin type="multus" name="multus-cni-network"
failed (add): Multus: [default/ibtest-57599dd56f-45ksq/811508e0-b5d0-4126-9b00-16def7acfd6f]: error waiting for pod: Unauthorized
Here is the full list of events:
$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
6s Normal Scheduled pod/ibtest-57599dd56f-45ksq Successfully assigned default/ibtest-57599dd56f-45ksq to aks-gpunodes-17372139-vmss000000
6s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-45ksq Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2e21cf024a4aab49665572445258e58c8ba1f53ed6e55dbae18816c339872b1d": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-45ksq/811508e0-b5d0-4126-9b00-16def7acfd6f]: error waiting for pod: Unauthorized
6s Normal Scheduled pod/ibtest-57599dd56f-k2xmt Successfully assigned default/ibtest-57599dd56f-k2xmt to aks-gpunodes-17372139-vmss000001
6s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-k2xmt Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "449727243cb4f36d9ab9e55a5cd7df86675b1346facc3b1e48ce3cf50f2ed7c5": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-k2xmt/43887458-6c67-4217-8e47-e08b09b114ce]: error waiting for pod: Unauthorized
6s Normal SuccessfulCreate replicaset/ibtest-57599dd56f Created pod: ibtest-57599dd56f-k2xmt
6s Normal SuccessfulCreate replicaset/ibtest-57599dd56f Created pod: ibtest-57599dd56f-45ksq
6s Normal ScalingReplicaSet deployment/ibtest Scaled up replica set ibtest-57599dd56f to 2
0s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-k2xmt Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1d4aec3c086b534f216d6ca1e2d0c84d16ccfdd7283fadb50c96f3cc544014ae": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-k2xmt/43887458-6c67-4217-8e47-e08b09b114ce]: error waiting for pod: Unauthorized
0s Warning FailedCreatePodSandBox pod/ibtest-57599dd56f-45ksq Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "26ed6556afa0ebd0407fa6db67b308e56524733d73304a2611eacdc1408a27aa": plugin type="multus" name="multus-cni-network" failed (add): Multus: [default/ibtest-57599dd56f-45ksq/811508e0-b5d0-4126-9b00-16def7acfd6f]: error waiting for pod: Unauthorized
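The `Unauthorized` in the events above suggests the API server is rejecting the credentials in the kubeconfig that Multus generates on the node, commonly because a service-account token has expired. A minimal sketch of checking a JWT-style bearer token's `exp` claim, assuming the token format only; the helper name and the fabricated token are illustrative, not from the real cluster:

```python
import base64
import json
import time

def jwt_exp(token: str) -> int:
    """Return the `exp` claim of a JWT-style token (no signature verification)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))["exp"]

# Fabricated token standing in for the real one, which lives on the node
# (typically in the Multus kubeconfig under /etc/cni/net.d/multus.d/).
claims = base64.urlsafe_b64encode(
    json.dumps({"exp": 1741300000}).encode()
).rstrip(b"=").decode()
token = f"eyJhbGciOiJub25lIn0.{claims}."

exp = jwt_exp(token)
print("token expired:", exp < int(time.time()))
```

If the token in the real kubeconfig has expired, restarting the kube-multus DaemonSet pods may regenerate it, which would at least distinguish a token-rotation problem from an RBAC one.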
What you expected to happen:
I expected the pod to start with the secondary IP assigned by the Multus plugin.
How to reproduce it (as minimally and precisely as possible):
Steps here: https://gist.github.com/surajssd/a0596ca7785228f025be5c3ac177219f
But here are the relevant steps:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade -i \
--wait \
--create-namespace \
-n network-operator \
network-operator \
nvidia/network-operator \
--set nfd.deployNodeFeatureRules=false
Anything else we need to know?:
I am using the machine type: Standard_HB120-16rs_v3 on Azure.
Logs:
- NicClusterPolicy CR spec and state:
---
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nfd-network-rule
spec:
  rules:
    - name: "nfd-network-rule"
      labels:
        "feature.node.kubernetes.io/pci-15b3.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            device: {op: In, value: ["101c", "101e"]}
---
# Try to match the versions from: https://github.com/Mellanox/network-operator/blob/master/example/crs/mellanox.com_v1alpha1_nicclusterpolicy_cr-full.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nvIpam:
    enableWebhook: false
    repository: ghcr.io/mellanox
    image: nvidia-k8s-ipam
    # Latest tag: https://github.com/mellanox/nvidia-k8s-ipam/pkgs/container/nvidia-k8s-ipam
    version: v0.2.0
  ofedDriver:
    forcePrecompiled: false
    repository: nvcr.io/nvidia/mellanox
    image: doca-driver
    # Latest tag: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/mellanox/containers/doca-driver/tags
    # When this is deployed, a suffix is appended in the format "-<os><os version>-<cpu arch>", e.g. "-ubuntu22.04-amd64".
    version: 24.10-0.7.0.0-0
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
      maxParallelUpgrades: 1
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    repository: ghcr.io/mellanox
    image: k8s-rdma-shared-dev-plugin
    # Latest tag: https://github.com/mellanox/k8s-rdma-shared-dev-plugin/pkgs/container/k8s-rdma-shared-dev-plugin
    version: v1.5.2
    useCdi: true
    # The config below propagates directly to the k8s-rdma-shared-device-plugin configuration.
    # Adjust the selectors to match your (RDMA-capable) netdevices.
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"]
            }
          }
        ]
      }
  secondaryNetwork:
    cniPlugins:
      repository: ghcr.io/k8snetworkplumbingwg
      image: plugins
      # Latest tag: https://github.com/k8snetworkplumbingwg/plugins/pkgs/container/plugins
      version: v1.5.0
    multus:
      repository: ghcr.io/k8snetworkplumbingwg
      image: multus-cni
      # Latest tag: https://github.com/k8snetworkplumbingwg/multus-cni/pkgs/container/multus-cni
      version: v4.1.0
    ipamPlugin:
      repository: ghcr.io/k8snetworkplumbingwg
      image: whereabouts
      # Latest tag: https://github.com/k8snetworkplumbingwg/whereabouts/pkgs/container/whereabouts
      version: v0.7.0
    ipoib:
      repository: ghcr.io/mellanox
      image: ipoib-cni
      # Latest tag: https://github.com/mellanox/ipoib-cni/pkgs/container/ipoib-cni
      version: v1.2.1
  # This is needed so that pods are not scheduled on non-InfiniBand machines.
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # This node label is added by NFD.
            - key: feature.node.kubernetes.io/pci-15b3.present
              operator: In
              values:
                - "true"
---
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: aks-infiniband
spec:
  networkNamespace: "default"
  # This is an alternative name (altname) of the IPoIB interface. Per this blog
  # it appears to be common across VMs of the same SKU:
  # https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-tightly-coupled-hpcai-workloads-with-infiniband-using-nvidia-network-ope/4117209
  # It has been consistent on the machine types Standard_HB120rs_v3 and Standard_ND96asr_v4.
  # TODO: Figure out how to derive this name generically.
  master: "ibP257p0s0"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.0.0/16",
      "exclude": [
        "192.168.0.0/32",
        "192.168.255.255/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info",
      "gateway": "192.168.0.1"
    }
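For context on the IPAM config above, a quick sketch with Python's `ipaddress` module showing the size of the whereabouts pool after the two /32 exclusions (assuming whereabouts excludes exactly the listed addresses):

```python
import ipaddress

pool = ipaddress.ip_network("192.168.0.0/16")
excluded = [ipaddress.ip_network(e) for e in ("192.168.0.0/32", "192.168.255.255/32")]

# Total pool minus the excluded network and broadcast addresses.
usable = pool.num_addresses - sum(n.num_addresses for n in excluded)
print("pool size:", pool.num_addresses)        # 65536 addresses in a /16
print("usable after exclusions:", usable)      # 65534
```

So address exhaustion is very unlikely to be the cause here; with two pods the pool is effectively empty.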
- Output of kubectl get -n network-operator all:
✗ kubectl get -n network-operator all
NAME READY STATUS RESTARTS AGE
pod/cni-plugins-ds-j5v8k 1/1 Running 0 101m
pod/cni-plugins-ds-x9h28 1/1 Running 0 101m
pod/kube-ipoib-cni-ds-p4lfq 1/1 Running 0 101m
pod/kube-ipoib-cni-ds-pwwmh 1/1 Running 0 101m
pod/kube-multus-ds-bt6h6 1/1 Running 0 101m
pod/kube-multus-ds-ss8dn 1/1 Running 0 101m
pod/mofed-ubuntu22.04-6fd94b4c6b-ds-8d2k6 1/1 Running 5 (94m ago) 101m
pod/mofed-ubuntu22.04-6fd94b4c6b-ds-s5sp9 1/1 Running 4 (95m ago) 101m
pod/network-operator-5ff6ff9559-qvqg8 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-gc-6d48649f49-wqwmd 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-master-57648d678f-cmw7v 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-worker-9n887 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-worker-qtlpm 1/1 Running 0 101m
pod/network-operator-node-feature-discovery-worker-rtpmp 1/1 Running 0 101m
pod/nv-ipam-controller-67556c846b-c5ljf 1/1 Running 0 101m
pod/nv-ipam-controller-67556c846b-zb2nw 1/1 Running 0 101m
pod/nv-ipam-node-4bqhh 1/1 Running 0 101m
pod/nv-ipam-node-t4l47 1/1 Running 0 101m
pod/rdma-shared-dp-ds-kvfhb 1/1 Running 0 92m
pod/rdma-shared-dp-ds-l442j 1/1 Running 0 93m
pod/whereabouts-x5nql 1/1 Running 0 101m
pod/whereabouts-xwqzt 1/1 Running 0 101m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/cni-plugins-ds 2 2 2 2 2 <none> 101m
daemonset.apps/kube-ipoib-cni-ds 2 2 2 2 2 <none> 101m
daemonset.apps/kube-multus-ds 2 2 2 2 2 <none> 101m
daemonset.apps/mofed-ubuntu22.04-6fd94b4c6b-ds 2 2 2 2 2 feature.node.kubernetes.io/kernel-version.full=5.15.0-1079-azure,feature.node.kubernetes.io/pci-15b3.present=true,feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04 101m
daemonset.apps/network-operator-node-feature-discovery-worker 3 3 3 3 3 <none> 101m
daemonset.apps/nv-ipam-node 2 2 2 2 2 <none> 101m
daemonset.apps/rdma-shared-dp-ds 2 2 2 2 2 feature.node.kubernetes.io/pci-15b3.present=true,network.nvidia.com/operator.mofed.wait=false 101m
daemonset.apps/whereabouts 2 2 2 2 2 <none> 101m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/network-operator 1/1 1 1 101m
deployment.apps/network-operator-node-feature-discovery-gc 1/1 1 1 101m
deployment.apps/network-operator-node-feature-discovery-master 1/1 1 1 101m
deployment.apps/nv-ipam-controller 2/2 2 2 101m
NAME DESIRED CURRENT READY AGE
replicaset.apps/network-operator-5ff6ff9559 1 1 1 101m
replicaset.apps/network-operator-node-feature-discovery-gc-6d48649f49 1 1 1 101m
replicaset.apps/network-operator-node-feature-discovery-master-57648d678f 1 1 1 101m
replicaset.apps/nv-ipam-controller-67556c846b 2 2 2 101m
- Network Operator version:
✗ helm -n network-operator ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
network-operator network-operator 1 2025-03-05 16:37:15.858515 -0800 PST deployed network-operator-25.1.0 v25.1.0
- Logs of Network Operator controller:
✗ kubectl logs -n network-operator network-operator-5ff6ff9559-qvqg8
full log here: https://gist.github.com/surajssd/3c89a7aed0a57ca40d7b371bf84940a9
Environment:
- Kubernetes version (use kubectl version):
✗ kubectl version
Client Version: v1.32.2
Kustomize Version: v5.5.0
Server Version: v1.30.9
WARNING: version difference between client (1.32) and server (1.30) exceeds the supported minor version skew of +/-1
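The warning above comes from kubectl's +/-1 minor-version skew policy; conceptually the check is just a minor-version comparison. A sketch using the versions from the output (the helper name is illustrative):

```python
def minor(version: str) -> int:
    # "v1.32.2" -> 32
    return int(version.lstrip("v").split(".")[1])

client, server = "v1.32.2", "v1.30.9"
skew = abs(minor(client) - minor(server))
print(f"minor-version skew: {skew} (supported: {skew <= 1})")
```

The skew here is only a client-side warning and is unlikely to be related to the Multus failure, but noting it in case it matters.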
- Hardware configuration:
- Network adapter model and firmware version:
mlx5_core driver version: 24.10-0.7.0
Logs from the mofed pods: https://gist.github.com/surajssd/1521f546efe4612c2cd9b1280d1dc654
- OS (e.g. cat /etc/os-release):
# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
# uname -a
Linux aks-gpunodes-17372139-vmss000000 5.15.0-1079-azure #88-Ubuntu SMP Thu Jan 16 19:18:54 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux