From 80ead9b9989a6e157a727f46e7bbacf71be96057 Mon Sep 17 00:00:00 2001 From: luohua13 Date: Thu, 14 May 2026 03:56:34 +0000 Subject: [PATCH] docs(npu-operator): rewrite install page for OperatorHub flow + add release notes The 1.2.0 release of Alauda Build of NPU Operator changes the delivery model from cluster plugin (`Marketplace > Cluster Plugins`) to OLM operator (`Marketplace > OperatorHub`). Adapt the installation page and add a release notes page mirroring the per-product layout used by hami-docs. Installation page: - Rename Downloading/Uploading sections from "Cluster plugin" to "Packages" and enumerate both the operator package (npu-operator) and the cluster plugin packages (NFD required, Volcano optional). - Split installation into two subsections: NFD as a Cluster Plugin and NPU Operator via OperatorHub (Install dialog, namespace selector, deployment form). - Bump default Driver Version to 25.5.0 and rename the row to match the form label; add a note that all operator-managed pods land in the operator namespace and that Volcano components are absent. - Verification step 1 switches from the Cluster plugin page to the OperatorHub details page / Installed Operators view. - Verification step 2 watches the npu-driver pod in the operator namespace (was kube-system) with a note for non-default namespaces. - Installing Monitor: drop the obsolete manual ServiceMonitor snippet (which targeted the wrong namespaces); the operator now auto- installs npu-exporter-servicemonitor in its own namespace. Release notes: - v1.2.0 mapped to openFuyao npu-operator 1.2.0; headline is the cluster-plugin-to-operator delivery change (no in-place upgrade from v1.1.3) and the MindCluster/Ascend v7.3.0 stack bump. - Downstream bug fix highlighted is the npu-exporter ServiceMonitor not taking effect; plus the two community 1.1.1 -> 1.2.0 fixes. - v1.1.3 mapped to openFuyao npu-operator 1.1.1 (MindCluster v7.2.RC1, cluster-plugin delivery). --- .../npu/npu_operator/installation.mdx | 83 ++++++++++--------- .../npu/npu_operator/release_notes.mdx | 38 +++++++++ 2 files changed, 83 insertions(+), 38 deletions(-) create mode 100644 docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx diff --git a/docs/en/hardware_accelerator/npu/npu_operator/installation.mdx b/docs/en/hardware_accelerator/npu/npu_operator/installation.mdx index 621959806..edf9a55b5 100644 --- a/docs/en/hardware_accelerator/npu/npu_operator/installation.mdx +++ b/docs/en/hardware_accelerator/npu/npu_operator/installation.mdx @@ -43,23 +43,37 @@ weight: 20 ## Procedure -### Downloading Cluster plugin +### Downloading Packages :::info -You can download the app named `Alauda Build of NPU Operator` and `Alauda Build of Node Feature Discovery` from the Marketplace on the Customer Portal website. -::: +From the Marketplace on the __ Customer Portal__ website, download both: -Note: The Volcano cluster plugin can be left uninstalled for now. +- The __Alauda Build of NPU Operator__ operator package (delivered as an OLM OperatorBundle). +- The __Alauda Build of Node Feature Discovery__ cluster plugin package. +- (Optional) The __Volcano__ cluster plugin package — only needed if you plan to enable the **ClusterD** component during deployment. +::: -### Uploading the Cluster plugin +### Uploading Packages -The platform provides the **`violet`** command-line tool for uploading packages downloaded from the Custom Portal Marketplace. +The platform provides the **`violet`** command-line tool for uploading both operator packages and cluster plugin packages downloaded from the Customer Portal Marketplace. For details, see [Upload Packages](/extend/upload_package). +### Installing the Node Feature Discovery Cluster Plugin + +`Alauda Build of Node Feature Discovery` is a **cluster plugin**, not an operator. Install it first because the NPU Operator depends on its node labelling. + +1. Navigate to __Administrator__ > __Marketplace__ > __Cluster Plugins__. +2. Switch to the target cluster. +3. Locate __Alauda Build of Node Feature Discovery__ and click __Install__. + +:::tip +The __Volcano__ cluster plugin can be left uninstalled for now. Install it from the same __Cluster Plugins__ page only if you later enable the **ClusterD** component of the NPU Operator. +::: -### Installing Alauda Build of NPU Operator +### Installing the Alauda Build of NPU Operator +`Alauda Build of NPU Operator` is delivered as an **operator** (OLM bundle), so the install flow is the OperatorHub flow, not the Cluster Plugins flow. 1. Apply the label `masterselector=dls-master-node` to all master nodes and the label `workerselector=dls-worker-node` to all worker nodes. ```bash @@ -67,7 +81,13 @@ For details, see [Upload Packages](/extend/upload_package). kubectl label nodes {worker-node-id} workerselector=dls-worker-node ``` -2. Go to the `Administrator` -> `Marketplace` -> `Cluster Plugin` page, switch to the target cluster, and then deploy the `Alauda Build of NPU Operator` Cluster plugin. +2. Navigate to __Administrator__ > __Marketplace__ > __OperatorHub__, switch to the target cluster, and locate the __Alauda Build of NPU Operator__ entry. + +3. If the status is _Absent_, confirm the operator package was uploaded with `violet` in the previous step. + +4. Click the operator to open its details page, then click __Install__. + +5. On the install page, choose the target namespace (the default is `npu-operator`; all pods managed by this operator will land here) and fill in the deployment form below. Click __Install__ to start; confirm in the dialog and wait for the subscription to reach _Succeeded_. Deployment form parameter description: :::warning @@ -78,16 +98,20 @@ For details, see [Upload Packages](/extend/upload_package). Ascend Operator, NodeD, ClusterD, Resilience Controller, MindIO TFT, and MindIO ACP are not deployed by default. Please deploy them only when there is a clear need for them. ::: + :::note + All pods created by the operator (driver, device plugin, docker runtime, NPU exporter, Ascend Operator, NodeD, ClusterD, Resilience Controller, MindIO TFT, MindIO ACP, NPU Feature Discovery, and the operator itself) are deployed into the same namespace as the operator pod (the default is `npu-operator`). Volcano-related components (vc-controller / vc-scheduler) are intentionally not exposed in this form — the platform's own Volcano cluster plugin should be installed separately when ClusterD is enabled. + ::: + | Component | Default | Description | |------------|---------|-------------| - | Driver | Enabled | Whether to install driver and firmware.| - | Version | 24.1.RC3 | Driver and firmware version. You must select the version number from the repository directory [npu-driver-installer](https://gitcode.com/openFuyao/npu-driver-installer/tree/1.1.1/downloader/NPU).| + | Driver | Enabled | Whether to install driver and firmware. | + | Driver Version | 25.5.0 | Driver and firmware version. You must select the version number from the repository directory [npu-driver-installer](https://gitcode.com/openFuyao/npu-driver-installer). Hidden when **Driver** is disabled. | | Ascend Device Plugin | Enabled | Whether to install Ascend Device Plugin. | | Ascend Docker Runtime | Enabled | Whether to install Ascend Docker Runtime. | - | NPU exporter | Enabled | Whether to install NPU exporter. | + | NPU Exporter | Enabled | Whether to install NPU Exporter. | | Ascend Operator | Disabled | Whether to install Ascend Operator. | | NodeD | Disabled | Whether to install NodeD. | - | ClusterD | Disabled | Whether to install ClusterD. Need to install the Volcano cluster plugin first. | + | ClusterD | Disabled | Whether to install ClusterD. Requires the Volcano cluster plugin to be installed first. | | Resilience Controller | Disabled | Whether to install Resilience Controller. | | MindIO TFT | Disabled | Whether to install MindIO TFT. | | MindIO ACP | Disabled | Whether to install MindIO ACP. | @@ -95,11 +119,14 @@ For details, see [Upload Packages](/extend/upload_package). #### Verification -1. First, you can see the status of "Installed" in the `Alauda Build of NPU Operator` Cluster plugin page. +1. On the __Administrator__ > __Marketplace__ > __OperatorHub__ details page of __Alauda Build of NPU Operator__, the subscription status should be _Succeeded_. The corresponding CSV (`npu-operator.v`) appears under __Installed Operators__ in the target namespace. 2. Wait for the npu-driver pod to become running. Offline installation takes about 10 minutes, while online installation is much faster. ```bash - kubectl -n kube-system get pod -w | grep npu-driver + kubectl -n npu-operator get pod -w | grep npu-driver ``` + :::note + Replace `npu-operator` with the namespace you chose during installation if you installed the operator into a different namespace. All pods managed by the operator share this namespace. + ::: 3. Reboot all the NPU nodes. 4. Run the following command on the npu node. @@ -184,34 +211,14 @@ For details, see [Upload Packages](/extend/upload_package). ### Installing Monitor -If the NPU exporter component was deployed when installing the Alauda Build of NPU Operator, perform the following steps to create a monitoring panel. +If the NPU Exporter component was deployed when installing the Alauda Build of NPU Operator, perform the following steps to create a monitoring panel. -1. Execute commands on the **control node** of the cluster. +1. The operator automatically deploys a `ServiceMonitor` named `npu-exporter-servicemonitor` in the operator namespace, wired up to the `npu-exporter` Service. No manual `ServiceMonitor` creation is required. You can verify it with: ```bash - cat << EOF | kubectl apply -f - - apiVersion: monitoring.coreos.com/v1 - kind: ServiceMonitor - metadata: - labels: - prometheus: kube-prometheus - serviceMonitorSelector: prometheus - name: npu-exporter-ai - namespace: monitoring - spec: - endpoints: - - interval: 10s - path: /metrics - port: http - targetPort: 8082 - namespaceSelector: - matchNames: - - npu-exporter - selector: - matchLabels: - app: npu-exporter-svc - EOF + kubectl -n npu-operator get servicemonitor npu-exporter-servicemonitor ``` + 2. You can import a Grafana dashboard JSON file by following [Import Dashboard](../../../observability/monitor/functions/manage_dashboard.html#import-dashboard), which converts it into a monitoring dashboard for display. The JSON file is available in [ascend-npu-dashboard](https://github.com/Cz200OK/ascend-npu-dashboard/blob/main/Dashboards/ascend-npu-exporter-Dashboard_20240301.json). diff --git a/docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx b/docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx new file mode 100644 index 000000000..3533c6bca --- /dev/null +++ b/docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx @@ -0,0 +1,38 @@ +--- +weight: 30 +--- + +# Release Notes + +## v1.2.0 + +Based on the openFuyao [npu-operator 1.2.0](https://gitcode.com/openFuyao/npu-operator/tree/1.2.0) community release. + +### Breaking changes + +- **Delivery model changed from cluster plugin to operator.** Earlier versions of __Alauda Build of NPU Operator__ shipped as a cluster plugin, installed from __Marketplace__ > __Cluster Plugins__. Starting with v1.2.0 the project is packaged as an OLM operator bundle and installed from __Marketplace__ > __OperatorHub__. + + :::warning + In-place upgrade from v1.1.3 (or any earlier cluster-plugin release) is **not** supported. To move to v1.2.0, uninstall the old cluster plugin first and then install the new operator from scratch: + + 1. Uninstall the existing __Alauda Build of NPU Operator__ cluster plugin from __Marketplace__ > __Cluster Plugins__. + 2. If the driver is no longer needed on the host, run on each NPU node: + ```bash + /usr/local/Ascend/driver/script/uninstall.sh + ``` + 3. Install v1.2.0 from __Marketplace__ > __OperatorHub__ following the [Installation](./installation.mdx) page. + ::: + +### New features + +- **MindCluster / Ascend component stack upgraded to v7.3.0.** Picks up the v7.3.0 train of `ascend-docker-runtime`, `ascend-operator`, `ascend-k8sdeviceplugin`, `noded`, `vc-controller-manager`, `vc-scheduler`, `clusterd`, and `npu-exporter`. The driver and firmware default version is bumped to `25.5.0`. + +### Bug fixes + +- Fixed `npu-exporter` `ServiceMonitor` not taking effect: it was created in the wrong namespace and missed Prometheus selector labels, so the platform's Prometheus never scraped the exporter. The operator now ships a `ServiceMonitor` in its own namespace with the correct `prometheus: kube-prometheus` label and `honorLabels: true`. +- (Community) Fixed installation / detection logic on nodes that do not have NPU cards. +- (Community) Fixed environment-variable indicators not being deduplicated. + +## v1.1.3 + +Based on the openFuyao [npu-operator 1.1.1](https://gitcode.com/openFuyao/npu-operator/tree/1.1.1) community release. Backed by the MindCluster / Ascend v7.2.RC1 component stack and delivered as a cluster plugin.