Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
83 changes: 45 additions & 38 deletions docs/en/hardware_accelerator/npu/npu_operator/installation.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -43,31 +43,51 @@ weight: 20

## Procedure

### Downloading Cluster plugin
### Downloading Packages

:::info
You can download the app named `Alauda Build of NPU Operator` and `Alauda Build of Node Feature Discovery` from the Marketplace on the Customer Portal website.
:::
From the Marketplace on the __<Term name="company"/> Customer Portal__ website, download both:

Note: The Volcano cluster plugin can be left uninstalled for now.
- The __Alauda Build of NPU Operator__ operator package (delivered as an OLM OperatorBundle).
- The __Alauda Build of Node Feature Discovery__ cluster plugin package.
- (Optional) The __Volcano__ cluster plugin package — only needed if you plan to enable the **ClusterD** component during deployment.
:::

### Uploading the Cluster plugin
### Uploading Packages

The platform provides the **`violet`** command-line tool for uploading packages downloaded from the Custom Portal Marketplace.
The platform provides the **`violet`** command-line tool for uploading both operator packages and cluster plugin packages downloaded from the Customer Portal Marketplace.

For details, see [Upload Packages](/extend/upload_package).

### Installing the Node Feature Discovery Cluster Plugin

`Alauda Build of Node Feature Discovery` is a **cluster plugin**, not an operator. Install it first because the NPU Operator depends on its node labelling.

1. Navigate to __Administrator__ > __Marketplace__ > __Cluster Plugins__.
2. Switch to the target cluster.
3. Locate __Alauda Build of Node Feature Discovery__ and click __Install__.

:::tip
The __Volcano__ cluster plugin can be left uninstalled for now. Install it from the same __Cluster Plugins__ page only if you later enable the **ClusterD** component of the NPU Operator.
:::

### Installing Alauda Build of NPU Operator
### Installing the Alauda Build of NPU Operator

`Alauda Build of NPU Operator` is delivered as an **operator** (OLM bundle), so the install flow is the OperatorHub flow, not the Cluster Plugins flow.

1. Apply the label `masterselector=dls-master-node` to all master nodes and the label `workerselector=dls-worker-node` to all worker nodes.
```bash
kubectl label nodes {master-node-id} masterselector=dls-master-node
kubectl label nodes {worker-node-id} workerselector=dls-worker-node
```

2. Go to the `Administrator` -> `Marketplace` -> `Cluster Plugin` page, switch to the target cluster, and then deploy the `Alauda Build of NPU Operator` Cluster plugin.
2. Navigate to __Administrator__ > __Marketplace__ > __OperatorHub__, switch to the target cluster, and locate the __Alauda Build of NPU Operator__ entry.

3. If the status is _Absent_, confirm the operator package was uploaded with `violet` in the previous step.

4. Click the operator to open its details page, then click __Install__.

5. On the install page, choose the target namespace (the default is `npu-operator`; all pods managed by this operator will land here) and fill in the deployment form below. Click __Install__ to start; confirm in the dialog and wait for the subscription to reach _Succeeded_.

Deployment form parameter description:
:::warning
Expand All @@ -78,28 +98,35 @@ For details, see [Upload Packages](/extend/upload_package).
Ascend Operator, NodeD, ClusterD, Resilience Controller, MindIO TFT, and MindIO ACP are not deployed by default. Please deploy them only when there is a clear need for them.
:::

:::note
All pods created by the operator (driver, device plugin, docker runtime, NPU exporter, Ascend Operator, NodeD, ClusterD, Resilience Controller, MindIO TFT, MindIO ACP, NPU Feature Discovery, and the operator itself) are deployed into the same namespace as the operator pod (the default is `npu-operator`). Volcano-related components (vc-controller / vc-scheduler) are intentionally not exposed in this form — the platform's own Volcano cluster plugin should be installed separately when ClusterD is enabled.
:::

| Component | Default | Description |
|------------|---------|-------------|
| Driver | Enabled | Whether to install driver and firmware.|
| Version | 24.1.RC3 | Driver and firmware version. You must select the version number from the repository directory [npu-driver-installer](https://gitcode.com/openFuyao/npu-driver-installer/tree/1.1.1/downloader/NPU).|
| Driver | Enabled | Whether to install driver and firmware. |
| Driver Version | 25.5.0 | Driver and firmware version. You must select the version number from the repository directory [npu-driver-installer](https://gitcode.com/openFuyao/npu-driver-installer). Hidden when **Driver** is disabled. |
| Ascend Device Plugin | Enabled | Whether to install Ascend Device Plugin. |
| Ascend Docker Runtime | Enabled | Whether to install Ascend Docker Runtime. |
| NPU exporter | Enabled | Whether to install NPU exporter. |
| NPU Exporter | Enabled | Whether to install NPU Exporter. |
| Ascend Operator | Disabled | Whether to install Ascend Operator. |
| NodeD | Disabled | Whether to install NodeD. |
| ClusterD | Disabled | Whether to install ClusterD. Need to install the Volcano cluster plugin first. |
| ClusterD | Disabled | Whether to install ClusterD. Requires the Volcano cluster plugin to be installed first. |
| Resilience Controller | Disabled | Whether to install Resilience Controller. |
| MindIO TFT | Disabled | Whether to install MindIO TFT. |
| MindIO ACP | Disabled | Whether to install MindIO ACP. |


#### Verification

1. First, you can see the status of "Installed" in the `Alauda Build of NPU Operator` Cluster plugin page.
1. On the __Administrator__ > __Marketplace__ > __OperatorHub__ details page of __Alauda Build of NPU Operator__, the subscription status should be _Succeeded_. The corresponding CSV (`npu-operator.v<version>`) appears under __Installed Operators__ in the target namespace.
2. Wait for the npu-driver pod to become running. Offline installation takes about 10 minutes, while online installation is much faster.
```bash
kubectl -n kube-system get pod -w | grep npu-driver
kubectl -n npu-operator get pod -w | grep npu-driver
Comment thread
coderabbitai[bot] marked this conversation as resolved.
```
:::note
Replace `npu-operator` with the namespace you chose during installation if you installed the operator into a different namespace. All pods managed by the operator share this namespace.
:::
3. Reboot all the NPU nodes.

4. Run the following command on the npu node.
Expand Down Expand Up @@ -184,34 +211,14 @@ For details, see [Upload Packages](/extend/upload_package).

### Installing Monitor

If the NPU exporter component was deployed when installing the Alauda Build of NPU Operator, perform the following steps to create a monitoring panel.
If the NPU Exporter component was deployed when installing the Alauda Build of NPU Operator, perform the following steps to create a monitoring panel.

1. Execute commands on the **control node** of the cluster.
1. The operator automatically deploys a `ServiceMonitor` named `npu-exporter-servicemonitor` in the operator namespace, wired up to the `npu-exporter` Service. No manual `ServiceMonitor` creation is required. You can verify it with:

```bash
cat << EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
labels:
prometheus: kube-prometheus
serviceMonitorSelector: prometheus
name: npu-exporter-ai
namespace: monitoring
spec:
endpoints:
- interval: 10s
path: /metrics
port: http
targetPort: 8082
namespaceSelector:
matchNames:
- npu-exporter
selector:
matchLabels:
app: npu-exporter-svc
EOF
kubectl -n npu-operator get servicemonitor npu-exporter-servicemonitor
```

2. You can import a Grafana dashboard JSON file by following [Import Dashboard](../../../observability/monitor/functions/manage_dashboard.html#import-dashboard), which converts it into a monitoring dashboard for display.
The JSON file is available in [ascend-npu-dashboard](https://github.com/Cz200OK/ascend-npu-dashboard/blob/main/Dashboards/ascend-npu-exporter-Dashboard_20240301.json).

Expand Down
38 changes: 38 additions & 0 deletions docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
weight: 30
---

# Release Notes

## v1.2.0

Based on the openFuyao [npu-operator 1.2.0](https://gitcode.com/openFuyao/npu-operator/tree/1.2.0) community release.

### Breaking changes

- **Delivery model changed from cluster plugin to operator.** Earlier versions of __Alauda Build of NPU Operator__ shipped as a cluster plugin, installed from __Marketplace__ > __Cluster Plugins__. Starting with v1.2.0 the project is packaged as an OLM operator bundle and installed from __Marketplace__ > __OperatorHub__.

:::warning
In-place upgrade from v1.1.3 (or any earlier cluster-plugin release) is **not** supported. To move to v1.2.0, uninstall the old cluster plugin first and then install the new operator from scratch:

1. Uninstall the existing __Alauda Build of NPU Operator__ cluster plugin from __Marketplace__ > __Cluster Plugins__.
2. If the driver is no longer needed on the host, run on each NPU node:
```bash
/usr/local/Ascend/driver/script/uninstall.sh
```
3. Install v1.2.0 from __Marketplace__ > __OperatorHub__ following the [Installation](./installation.mdx) page.
:::

### New features

- **MindCluster / Ascend component stack upgraded to v7.3.0.** Picks up the v7.3.0 train of `ascend-docker-runtime`, `ascend-operator`, `ascend-k8sdeviceplugin`, `noded`, `vc-controller-manager`, `vc-scheduler`, `clusterd`, and `npu-exporter`. The driver and firmware default version is bumped to `25.5.0`.

### Bug fixes

- Fixed `npu-exporter` `ServiceMonitor` not taking effect: it was created in the wrong namespace and missed Prometheus selector labels, so the platform's Prometheus never scraped the exporter. The operator now ships a `ServiceMonitor` in its own namespace with the correct `prometheus: kube-prometheus` label and `honorLabels: true`.
- (Community) Fixed installation / detection logic on nodes that do not have NPU cards.
- (Community) Fixed environment-variable indicators not being deduplicated.

## v1.1.3

Based on the openFuyao [npu-operator 1.1.1](https://gitcode.com/openFuyao/npu-operator/tree/1.1.1) community release. Backed by the MindCluster / Ascend v7.2.RC1 component stack and delivered as a cluster plugin.