RFE-8500: Display GPU metrics on the Node Details page #16456

Open
swshende-cmd wants to merge 3 commits into openshift:main from swshende-cmd:gpu-metrics-node-details

@swshende-cmd swshende-cmd commented May 15, 2026

Summary

  • Adds a new GPU Metrics section to the Node Details page (Compute > Nodes > <node> > Details) that displays real-time DCGM exporter metrics per GPU device: utilization (%), temperature (°C), power usage (W), and framebuffer memory (used/free).
  • Shows GPU summary information (count, model name, capacity, allocatable) from both Prometheus and the Kubernetes Node status.capacity/status.allocatable fields.
  • The section is conditionally rendered only for nodes that report GPU capacity (nvidia.com/gpu or amd.com/gpu) or have active DCGM metrics, ensuring no visual impact on non-GPU nodes.
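
A minimal sketch of the visibility rule described in the last bullet, assuming only the two resource names listed there. `NodeLike`, `sumGpuQuantity`, and `shouldShowGpuSection` are hypothetical names used for illustration and may not match the PR's code:

```typescript
// Hypothetical, minimal shape of the Node fields used here; the console's
// real Node type carries far more.
type NodeLike = {
  status?: {
    capacity?: Record<string, string>;
    allocatable?: Record<string, string>;
  };
};

// Extended resource names mentioned in the PR description.
const GPU_RESOURCE_KEYS = ['nvidia.com/gpu', 'amd.com/gpu'] as const;

// Sum the GPU quantities found in a capacity or allocatable map.
// Missing or non-numeric entries contribute nothing.
const sumGpuQuantity = (resources: Record<string, string> | undefined): number =>
  GPU_RESOURCE_KEYS.reduce((total, key) => {
    const parsed = parseInt(resources?.[key] ?? '', 10);
    return Number.isFinite(parsed) ? total + parsed : total;
  }, 0);

// Show the section if the node reports GPU capacity or DCGM series exist.
const shouldShowGpuSection = (node: NodeLike, hasDcgmMetrics: boolean): boolean =>
  sumGpuQuantity(node.status?.capacity) > 0 || hasDcgmMetrics;
```

With this shape, a node whose capacity lists only `cpu` and `memory` stays hidden unless DCGM series are present for it.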

Details

Problem

GPU metrics are currently only accessible through a separate GPU dashboard. Customers have requested that key GPU performance metrics be displayed directly on the Node Details page for improved visibility and faster access to critical information.

Ref: RFE-8500

Solution

New files:

  • nodeGpuMetricsQueries.ts — PromQL query builders targeting DCGM exporter metrics (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_POWER_USAGE, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE). Queries use both Hostname and node label selectors joined with or to support common DCGM labeling conventions.
  • NodeDetailsGpuMetrics.tsx — React component that polls Prometheus via usePrometheusPoll, displays a summary DescriptionList (GPU count, model, capacity, allocatable) and a per-GPU table with device labels (e.g., "GPU 0 — Tesla T4").
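
To make the dual-label approach concrete, here is a rough sketch of how one such query string could be assembled. The label names `Hostname` and `node` come from the description above; the single-quote style, the helper names, and the assumption that the node name needs no escaping (Kubernetes node names are DNS subdomain names) are illustrative and may differ from nodeGpuMetricsQueries.ts:

```typescript
// Hypothetical helper: one selector per DCGM labeling convention.
const buildNodeSelectorsSketch = (nodeName: string): { hn: string; nd: string } => ({
  hn: `Hostname='${nodeName}'`,
  nd: `node='${nodeName}'`,
});

// Join both selectors with PromQL's `or`, so that whichever label the
// exporter actually sets yields the series.
const dualLabelQuery = (metric: string, nodeName: string): string => {
  const { hn, nd } = buildNodeSelectorsSketch(nodeName);
  return `${metric}{${hn}} or ${metric}{${nd}}`;
};

// Example: the temperature query for a node named "worker-0".
const exampleQuery = dualLabelQuery('DCGM_FI_DEV_GPU_TEMP', 'worker-0');
```

Here `exampleQuery` evaluates to `DCGM_FI_DEV_GPU_TEMP{Hostname='worker-0'} or DCGM_FI_DEV_GPU_TEMP{node='worker-0'}`.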

Modified files:

  • NodeDetails.tsx — Imports and conditionally renders <NodeDetailsGpuMetrics> after the overview section.
  • console-app.json — Adds 11 i18n keys for GPU metrics labels and messages.

Prerequisites

Requires the NVIDIA GPU Operator with the DCGM exporter deployed on the cluster to expose GPU metrics to Prometheus. Nodes that have neither GPU capacity nor DCGM metrics do not show the section.

Test plan

  • Unit tests: 20 tests pass across nodeGpuMetricsQueries.spec.ts (14 tests) and NodeDetailsGpuMetrics.spec.tsx (6 tests)
  • Lint: ESLint passes via pre-commit hook
  • Manual testing: Verified on OCP cluster with Tesla T4 GPUs — GPU metrics (utilization 100%, temp 43°C, power 70W) displayed correctly under load using CUDA nbody benchmark
  • Verify section does not render on non-GPU nodes
  • Verify graceful fallback message when DCGM exporter is not installed but node has GPU capacity

Made with Cursor

Summary by CodeRabbit

New Features

  • Added GPU metrics section in Node Details displaying GPU count, model, temperature, power usage, and memory information
  • GPU metrics automatically display when available; shows informative message if metrics are unavailable
  • Supports multiple GPU resource types

Tests

  • Added comprehensive test coverage for GPU metrics functionality

Adds a new GPU metrics section to the Node Details page that surfaces
DCGM exporter metrics (utilization, temperature, power usage, framebuffer
memory) per GPU device, along with summary information (GPU count, model,
capacity, allocatable) from the Kubernetes Node resource.

The section is only rendered for nodes that report GPU capacity
(nvidia.com/gpu or amd.com/gpu) or have active DCGM metrics. PromQL
queries use both Hostname and node label selectors joined with `or` to
support common DCGM exporter labeling conventions.

Includes unit tests for query generation helpers and component rendering.

Co-authored-by: Cursor <cursoragent@cursor.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 15, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented May 15, 2026

@swshende-cmd: This pull request references RFE-8500 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature request to target the "5.0.0" version, but no target version was set.

@openshift-ci openshift-ci Bot requested review from TheRealJon and cajieh May 15, 2026 18:08
@openshift-ci openshift-ci Bot added the component/core Related to console core functionality label May 15, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: swshende-cmd
Once this PR has been reviewed and has the lgtm label, please assign spadgett for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added kind/i18n Indicates issue or PR relates to internationalization or has content that needs to be translated needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 15, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 15, 2026

Hi @swshende-cmd. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@coderabbitai
Contributor

coderabbitai Bot commented May 15, 2026

📝 Walkthrough

This PR extends the OpenShift Console's node details view with GPU metrics visualization powered by Prometheus DCGM exporter data. It introduces a GPU query module that builds PromQL expressions for count, utilization, temperature, power usage, and framebuffer memory metrics while handling PromQL label escaping and dual DCGM label conventions. A new NodeDetailsGpuMetrics component polls these metrics, transforms results into per-GPU rows, formats values with units, and renders conditional UI based on loading and data availability states. The component integrates into NodeDetails.tsx via conditional rendering when Prometheus is configured, and includes eleven localization keys for GPU-related labels and an availability message tied to DCGM exporter prerequisites.
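
The "transforms results into per-GPU rows, formats values with units" step might look roughly like the sketch below. It assumes each metric has already been reduced to a map from GPU identifier to raw sample string; `ValuesByGpu`, `GpuRow`, `withUnit`, and `toRows` are hypothetical names, and the dash placeholder for missing values is an assumption, not the component's confirmed behavior:

```typescript
// Hypothetical per-metric lookup: GPU id -> raw Prometheus sample (a string).
type ValuesByGpu = Record<string, string>;

type GpuRow = { gpu: string; utilization: string; temperature: string; power: string };

// Format a raw sample with a unit; fall back to a dash when the sample is
// missing or not a finite number.
const withUnit = (raw: string | undefined, unit: string): string => {
  const n = raw === undefined ? NaN : parseFloat(raw);
  return Number.isFinite(n) ? `${n.toFixed(0)}${unit}` : '-';
};

// Merge the per-metric maps into one row per GPU id seen in any of them.
const toRows = (util: ValuesByGpu, temp: ValuesByGpu, power: ValuesByGpu): GpuRow[] =>
  Object.keys({ ...util, ...temp, ...power })
    .sort()
    .map((gpu) => ({
      gpu,
      utilization: withUnit(util[gpu], '%'),
      temperature: withUnit(temp[gpu], ' °C'),
      power: withUnit(power[gpu], ' W'),
    }));
```

Building rows from the union of identifiers means a GPU that is missing one metric still gets a row, with a placeholder in the empty cell.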

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly summarizes the main change: adding GPU metrics display to the Node Details page, directly matching the changeset's core purpose. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering problem statement, solution details, test results, and outstanding verifications, matching most required template sections effectively. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR adds Jest tests, not Ginkgo tests. Custom check applies only to Ginkgo frameworks. Jest test names are stable and descriptive with no dynamic content. |
| Test Structure And Quality | ✅ Passed | Custom check requires Ginkgo test review. PR contains only Jest/React Testing Library tests (TypeScript). No Ginkgo tests present. Check not applicable to this frontend PR. |
| Microshift Test Compatibility | ✅ Passed | This PR adds no Ginkgo e2e tests. All test files are Jest unit tests (spec.ts/tsx). The check applies only to Ginkgo e2e tests, which are not present here. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. PR contains only Jest/RTL frontend component unit tests, which are not subject to SNO compatibility checks. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Check not applicable. PR modifies only frontend React/TypeScript UI components and localization. No deployment manifests, operator code, controllers, or scheduling constraints present. |
| Ote Binary Stdout Contract | ✅ Passed | OTE contract does not apply. PR contains only frontend TypeScript/React/JSON with no Go test code or process-level setup that could violate OTE stdout contract. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The custom check targets Ginkgo e2e tests (Go), but this PR adds only Jest/React unit tests (TypeScript/JavaScript) for console frontend components. The check is not applicable. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts (1)

1-1: ⚡ Quick win

Consider using template literals instead of lodash templates.

Lodash is imported solely for _.template(), which adds bundle weight. Native template literals can achieve the same result with zero dependencies and better performance:

const buildQuery = (hn: string, nd: string, metric: string) =>
  `${metric}{${hn}} or ${metric}{${nd}}`;
♻️ Proposed refactor using template literals
-import * as _ from 'lodash';
-
 export enum GpuMetricQuery {
   GPU_COUNT = 'GPU_COUNT',
   GPU_UTILIZATION = 'GPU_UTILIZATION',
@@ -32,24 +30,24 @@
 };

 const gpuQueries = {
-  [GpuMetricQuery.GPU_COUNT]: _.template(
-    `count(DCGM_FI_DEV_GPU_UTIL{<%= hn %>} or DCGM_FI_DEV_GPU_UTIL{<%= nd %>})`,
-  ),
-  [GpuMetricQuery.GPU_UTILIZATION]: _.template(
-    `DCGM_FI_DEV_GPU_UTIL{<%= hn %>} or DCGM_FI_DEV_GPU_UTIL{<%= nd %>}`,
-  ),
-  [GpuMetricQuery.GPU_TEMPERATURE]: _.template(
-    `DCGM_FI_DEV_GPU_TEMP{<%= hn %>} or DCGM_FI_DEV_GPU_TEMP{<%= nd %>}`,
-  ),
-  [GpuMetricQuery.GPU_POWER_USAGE]: _.template(
-    `DCGM_FI_DEV_POWER_USAGE{<%= hn %>} or DCGM_FI_DEV_POWER_USAGE{<%= nd %>}`,
-  ),
-  [GpuMetricQuery.GPU_FB_USED]: _.template(
-    `DCGM_FI_DEV_FB_USED{<%= hn %>} or DCGM_FI_DEV_FB_USED{<%= nd %>}`,
-  ),
-  [GpuMetricQuery.GPU_FB_FREE]: _.template(
-    `DCGM_FI_DEV_FB_FREE{<%= hn %>} or DCGM_FI_DEV_FB_FREE{<%= nd %>}`,
-  ),
+  [GpuMetricQuery.GPU_COUNT]: (hn: string, nd: string) =>
+    `count(DCGM_FI_DEV_GPU_UTIL{${hn}} or DCGM_FI_DEV_GPU_UTIL{${nd}})`,
+  [GpuMetricQuery.GPU_UTILIZATION]: (hn: string, nd: string) =>
+    `DCGM_FI_DEV_GPU_UTIL{${hn}} or DCGM_FI_DEV_GPU_UTIL{${nd}}`,
+  [GpuMetricQuery.GPU_TEMPERATURE]: (hn: string, nd: string) =>
+    `DCGM_FI_DEV_GPU_TEMP{${hn}} or DCGM_FI_DEV_GPU_TEMP{${nd}}`,
+  [GpuMetricQuery.GPU_POWER_USAGE]: (hn: string, nd: string) =>
+    `DCGM_FI_DEV_POWER_USAGE{${hn}} or DCGM_FI_DEV_POWER_USAGE{${nd}}`,
+  [GpuMetricQuery.GPU_FB_USED]: (hn: string, nd: string) =>
+    `DCGM_FI_DEV_FB_USED{${hn}} or DCGM_FI_DEV_FB_USED{${nd}}`,
+  [GpuMetricQuery.GPU_FB_FREE]: (hn: string, nd: string) =>
+    `DCGM_FI_DEV_FB_FREE{${hn}} or DCGM_FI_DEV_FB_FREE{${nd}}`,
 };

 export const getGpuMetricQueries = (nodeName: string): Record<GpuMetricQuery, string> => {
   const selectors = buildNodeSelectors(nodeName);
   return {
-    [GpuMetricQuery.GPU_COUNT]: gpuQueries[GpuMetricQuery.GPU_COUNT](selectors),
-    [GpuMetricQuery.GPU_UTILIZATION]: gpuQueries[GpuMetricQuery.GPU_UTILIZATION](selectors),
-    [GpuMetricQuery.GPU_TEMPERATURE]: gpuQueries[GpuMetricQuery.GPU_TEMPERATURE](selectors),
-    [GpuMetricQuery.GPU_POWER_USAGE]: gpuQueries[GpuMetricQuery.GPU_POWER_USAGE](selectors),
-    [GpuMetricQuery.GPU_FB_USED]: gpuQueries[GpuMetricQuery.GPU_FB_USED](selectors),
-    [GpuMetricQuery.GPU_FB_FREE]: gpuQueries[GpuMetricQuery.GPU_FB_FREE](selectors),
+    [GpuMetricQuery.GPU_COUNT]: gpuQueries[GpuMetricQuery.GPU_COUNT](selectors.hn, selectors.nd),
+    [GpuMetricQuery.GPU_UTILIZATION]: gpuQueries[GpuMetricQuery.GPU_UTILIZATION](selectors.hn, selectors.nd),
+    [GpuMetricQuery.GPU_TEMPERATURE]: gpuQueries[GpuMetricQuery.GPU_TEMPERATURE](selectors.hn, selectors.nd),
+    [GpuMetricQuery.GPU_POWER_USAGE]: gpuQueries[GpuMetricQuery.GPU_POWER_USAGE](selectors.hn, selectors.nd),
+    [GpuMetricQuery.GPU_FB_USED]: gpuQueries[GpuMetricQuery.GPU_FB_USED](selectors.hn, selectors.nd),
+    [GpuMetricQuery.GPU_FB_FREE]: gpuQueries[GpuMetricQuery.GPU_FB_FREE](selectors.hn, selectors.nd),
   };
 };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts`
at line 1, The code imports lodash solely for _.template; remove the lodash
import and replace uses of _.template (e.g., the function/buildQuery that
constructs metric query strings) with native template literals — update any
occurrences that call _.template(...) and the returned function invocation to a
simple arrow function like buildQuery(hn, nd, metric) that returns
`${metric}{${hn}} or ${metric}{${nd}}`, ensuring all call sites use the new
buildQuery signature.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx`:
- Around line 41-59: The resultsByGpu function currently uses an empty string
when no GPU identifier is present, causing later unlabeled results to overwrite
earlier ones; update the reducer to skip any PrometheusResult where the computed
gpu identifier (the variable gpu derived from r.metric?.gpu ??
r.metric?.GPU_I_ID ?? r.metric?.UUID ?? r.metric?.device) is falsy, so only
entries with a valid identifier are added to the acc, and optionally emit a
console.warn or logger warning when a result is skipped; ensure you reference
resultsByGpu, the gpu variable, acc, and response.data.result when making the
change.
- Around line 164-168: The GPU count computation currently turns non-numeric
values into the string "NaN" because parseFloat(gpuCountValue) can produce NaN;
update the logic in NodeDetailsGpuMetrics where gpuCountValue and gpuCountStr
are computed to parse the value into a number (e.g., const parsed =
parseFloat(gpuCountValue)), then check Number.isFinite(parsed) (or
!Number.isNaN(parsed)) before calling Math.round and String; only set
gpuCountStr when the parsed value is a valid number, otherwise set it to
undefined or an empty display-safe value.
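
The GPU count guard described in the second item can be sketched as follows. `formatGpuCount` is a hypothetical helper name; the component's actual variables are `gpuCountValue` and `gpuCountStr`, and returning `undefined` for invalid input is one of the two options the comment allows:

```typescript
// Convert a raw Prometheus sample string into a display-safe GPU count.
// Returns undefined instead of the string "NaN" when the sample is missing
// or does not parse to a finite number.
const formatGpuCount = (raw: string | undefined): string | undefined => {
  if (raw === undefined) {
    return undefined;
  }
  const parsed = parseFloat(raw);
  return Number.isFinite(parsed) ? String(Math.round(parsed)) : undefined;
};
```

A caller can then render the count only when the result is defined, which keeps "NaN" out of the summary list.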

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3e407984-1bc8-4e92-b9ff-9044fcc1406f

📥 Commits

Reviewing files that changed from the base of the PR and between ee38b13 and 23e866c.

📒 Files selected for processing (6)
  • frontend/packages/console-app/locales/en/console-app.json
  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (12)
frontend/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

frontend/**/*.{ts,tsx,js,jsx}: Never import from package index files (e.g., @console/shared) in new code, as they can create circular dependencies and slow builds. Import from specific file paths instead.
Do not use backticks in t() calls for i18n strings, as the i18n parser cannot extract keys from template literals. Use single or double quotes instead.

Files:

  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Never import from deprecated packages or use code with the @deprecated TSdoc tag in new code.

**/*.{ts,tsx}: Use React functional components with hooks instead of class components
State Management: Use React hooks and Context API (migrating away from legacy Redux/Immutable.js)
Hooks: Use existing hooks from console-shared when possible (useK8sWatchResource, useUserSettings, etc.)
API calls: Use k8s resource hooks for data fetching, consoleFetchJSON for HTTP requests
Extensions: Use console extension points for plugin integration
Types: Check existing types in console-shared before creating new ones
Dynamic Plugins: Use console extension points for plugin integration
Styling: Use SCSS modules co-located with components, PatternFly design system components, avoid any SCSS/CSS if possible
Accessibility: Follow WCAG 2.1 AA standards, use semantic HTML, ARIA labels where needed, ensure keyboard navigation, test with screen readers
i18n: Use useTranslation('namespace') hook with key format for translation keys
Error Handling: Use ErrorBoundary components and graceful degradation patterns
Optimize re-renders: Use useCallback for memoized callbacks to avoid function recreation every render
Optimize re-renders: Use useMemo for expensive computations to avoid recalculating on every render
Lazy loading: Use React.lazy() to lazy load heavy components
TypeScript type safety: Avoid using any type; suggest proper type definitions and verify null/undefined are handled properly
Type component props properly: Reuse existing component prop types instead of duplicating type definitions
Use proper hooks: Use specialized hooks like usePluginInfo for plugin data instead of generic data fetching patterns
Avoid deprecated components: Check for JSDoc @deprecated tags, import paths containing /deprecated, and DEPRECATED_ file name prefix before using components
Importing from barrel files and circular dependencies: Import directly from specific files instead...

Files:

  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
frontend/**/*.{ts,tsx,js,jsx,json}

📄 CodeRabbit inference engine (AGENTS.md)

Never use absolute URLs or paths in the console code. The console runs behind a proxy under an arbitrary path.

Files:

  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
  • frontend/packages/console-app/locales/en/console-app.json
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
frontend/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

When writing code for static plugins, ensure that all $codeRef reference the corresponding extension type from the @console/dynamic-plugin-sdk package.

Files:

  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{tsx,ts}

📄 CodeRabbit inference engine (TESTING.md)

**/*.{tsx,ts}: Always use page.getByTestId('x') for Playwright selectors which queries [data-test="x"]. If a React element only has a legacy test attribute, add data-test to the element. Never remove legacy attributes
Prefer data-test attributes in Cypress selectors (e.g., cy.get('[data-test="create-deployment"]')) over brittle CSS/ARIA selectors

File Naming: PascalCase for components, kebab-case for utilities, *.spec.ts(x) for tests

Files:

  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{go,ts,tsx,js,jsx}

📄 CodeRabbit inference engine (STYLEGUIDE.md)

Use lowercase dash-separated names for all files to avoid git issues with case-insensitive file systems

Files:

  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (STYLEGUIDE.md)

**/*.{ts,tsx,js,jsx}: New code MUST be written in TypeScript, not JavaScript
Prefer functional programming patterns and immutable data structures
Run the linter and follow all rules defined in .eslintrc
Never use absolute paths in code - the app should be able to run behind a proxy under an arbitrary path

Files:

  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
frontend/**/*.{js,ts,tsx}

📄 CodeRabbit inference engine (README.md)

frontend/**/*.{js,ts,tsx}: Support only the latest versions of Edge, Chrome, Safari, and Firefox browsers; IE 11 and earlier are not supported
CSP violations should be automatically reported to telemetry by parsing dynamic plugin names from securitypolicyviolation events, with throttling to prevent duplicate reports within a day

Files:

  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (INTERNATIONALIZATION.md)

For dynamic translation keys that cannot be parsed by i18next-parser (t(key), t('key' + id), t(`key${id}`)), specify possible static values in comments for the parser to extract

Files:

  • frontend/packages/console-app/src/components/nodes/NodeDetails.tsx
  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.spec.{ts,tsx}

📄 CodeRabbit inference engine (STYLEGUIDE.md)

Tests should follow a similar 'test tables' convention as used in Go where applicable

Files:

  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.ts

📄 CodeRabbit inference engine (STYLEGUIDE.md)

Plugin SDK Changes: Any updates to console-dynamic-plugin-sdk should aim to maintain backward compatibility as it's a public API - use the plugin-api-review skill to vet changes for public API impact and ensure proper documentation updates

Files:

  • frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts
  • frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
**/__tests__/**/*.spec.tsx

📄 CodeRabbit inference engine (TESTING.md)

**/__tests__/**/*.spec.tsx: Unit tests must use Jest framework with @testing-library/react and @testing-library/jest-dom libraries for testing React components, hooks, and utilities
Test what users see and interact with - DO NOT test internal component state, private methods, props passed to child components, CSS class names/styles, or component structure
Use accessibility-first queries that match how screen readers and users interact with the UI
Always prefer role-based queries (e.g., getByRole) over generic selectors for semantic testing
Use reusable helper functions such as renderWithProviders and renderHookWithProviders from frontend/packages/console-shared/src/test-utils. Extract repetitive setup not covered by these helpers into custom functions if needed
Handle asynchronous updates with findBy* and waitFor in tests
Use proper TypeScript types for props, state, and mock data in unit tests
Structure tests following the Arrange-Act-Assert (AAA) pattern: Arrange (render component with mocks), Act (perform user actions), Assert (verify expected state)
Test files must be located in __tests__/ directory within the component directory, use the same name as the implementation file, and use .spec.tsx extension
When mocking, ALWAYS use ESM import statements at the top of the file - NEVER use require('react') or React.createElement() in mocks
Prefer jest.mock() for module mocks and jest.fn() for component mocks instead of jest.spyOn()
Keep mocks simple - return null, strings, or children directly. Use jest.fn(() => null) for simple component mocks, jest.fn(() => 'ComponentName') for mocks that display text, and jest.fn((props) => props.children) for wrapper components
Mock custom hooks with jest.fn() returning mock data
Clean up mocks with afterEach(() => { jest.restoreAllMocks(); })

Files:

  • frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
🔇 Additional comments (14)
frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts (4)

26-32: LGTM!


55-65: LGTM!


70-74: LGTM!


17-18: Verify PromQL label escaping completeness.

The escaping logic handles backslashes and single quotes for PromQL single-quoted string literals. Confirm this covers all required escaping per the PromQL specification.
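
For reference while verifying, a hedged sketch of what escaping for a single-quoted PromQL string could look like. It assumes PromQL follows Go-style escape sequences inside quoted strings and that a raw newline is not accepted there; `escapePromqlSingleQuoted` is a hypothetical name, and the PR's own logic at lines 17-18 may differ:

```typescript
// Escape a value for use inside a single-quoted PromQL string literal.
// Backslashes are handled first so that the backslashes introduced by the
// later replacements are not escaped a second time.
const escapePromqlSingleQuoted = (value: string): string =>
  value
    .replace(/\\/g, '\\\\')
    .replace(/'/g, "\\'")
    .replace(/\n/g, '\\n');

// Usage: build a label matcher from an arbitrary value.
const hostnameMatcher = (value: string): string =>
  `Hostname='${escapePromqlSingleQuoted(value)}'`;
```

Valid Kubernetes node names contain only lowercase alphanumerics, "-" and ".", so the escaping is defensive rather than something ordinary node names would exercise.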

frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts (1)

1-87: LGTM!

frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx (6)

1-24: LGTM!


25-39: LGTM!


61-87: LGTM!


89-102: LGTM!


108-162: LGTM!


232-259: LGTM!

frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx (1)

1-157: LGTM!

frontend/packages/console-app/src/components/nodes/NodeDetails.tsx (1)

2-2: LGTM!

Also applies to: 5-5, 16-16

frontend/packages/console-app/locales/en/console-app.json (1)

440-450: LGTM!

Comment on lines +41 to +59
const resultsByGpu = (
  response: PrometheusResponse | undefined,
): Record<string, GpuMetricResult> => {
  if (!response?.data?.result?.length) {
    return {};
  }
  return response.data.result.reduce<Record<string, GpuMetricResult>>(
    (acc, r: PrometheusResult) => {
      const gpu = r.metric?.gpu ?? r.metric?.GPU_I_ID ?? r.metric?.UUID ?? r.metric?.device ?? '';
      acc[gpu] = {
        value: r.value?.[1] ?? '',
        modelName: r.metric?.modelName,
        device: r.metric?.device,
      };
      return acc;
    },
    {},
  );
};

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle missing GPU identifiers to prevent data loss.

When no GPU identifier label is found (line 49), the code defaults to an empty string ''. If multiple metric results lack identifiers, they will overwrite each other in the acc accumulator, silently losing data.

Consider either:

  1. Skipping results without valid GPU identifiers
  2. Generating a unique fallback ID (e.g., using array index)
  3. Logging a warning when identifiers are missing
🛡️ Proposed fix to skip metrics without GPU identifiers
 const resultsByGpu = (
   response: PrometheusResponse | undefined,
 ): Record<string, GpuMetricResult> => {
   if (!response?.data?.result?.length) {
     return {};
   }
   return response.data.result.reduce<Record<string, GpuMetricResult>>(
     (acc, r: PrometheusResult) => {
       const gpu = r.metric?.gpu ?? r.metric?.GPU_I_ID ?? r.metric?.UUID ?? r.metric?.device ?? '';
+      if (!gpu) {
+      // Skip results without a GPU identifier so they cannot overwrite each other under the '' key
+        return acc;
+      }
       acc[gpu] = {
         value: r.value?.[1] ?? '',
         modelName: r.metric?.modelName,
         device: r.metric?.device,
       };
       return acc;
     },
     {},
   );
 };
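
The second option listed above (a unique fallback ID) could be sketched as follows. This is only an illustration under stated assumptions: the `PrometheusResult` and `GpuMetricResult` types are simplified local stand-ins rather than the console's real types, and `resultsByGpuWithFallback` and the `unlabeled-` prefix are hypothetical names.

```typescript
// Simplified stand-ins for the types used in the PR (assumed shapes).
type PrometheusResult = {
  metric?: Record<string, string>;
  value?: [number, string];
};
type GpuMetricResult = { value: string; modelName?: string; device?: string };

// Hypothetical variant: keep unlabeled results under an index-based key
// instead of dropping them or letting them collide on ''.
const resultsByGpuWithFallback = (
  results: PrometheusResult[],
): Record<string, GpuMetricResult> =>
  results.reduce<Record<string, GpuMetricResult>>((acc, r, index) => {
    const labelled =
      r.metric?.gpu ?? r.metric?.GPU_I_ID ?? r.metric?.UUID ?? r.metric?.device;
    // `||` also treats an empty-string label as missing.
    const gpu = labelled || `unlabeled-${index}`;
    acc[gpu] = {
      value: r.value?.[1] ?? '',
      modelName: r.metric?.modelName,
      device: r.metric?.device,
    };
    return acc;
  }, {});
```

One trade-off to keep in mind: an index-based key is only unique within a single query response. It may not line up across the separate utilization, temperature, and power queries, so skipping unlabeled results, as in the proposed fix, is likely the safer choice when rows are joined by GPU key.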
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx`
around lines 41 - 59, The resultsByGpu function currently uses an empty string
when no GPU identifier is present, causing later unlabeled results to overwrite
earlier ones; update the reducer to skip any PrometheusResult where the computed
gpu identifier (the variable gpu derived from r.metric?.gpu ??
r.metric?.GPU_I_ID ?? r.metric?.UUID ?? r.metric?.device) is falsy, so only
entries with a valid identifier are added to the acc, and optionally emit a
console.warn or logger warning when a result is skipped; ensure you reference
resultsByGpu, the gpu variable, acc, and response.data.result when making the
change.

Comment on lines +164 to +168
const gpuCountValue = countResponse?.data?.result?.[0]?.value?.[1];
const gpuCountStr =
  gpuCountValue !== undefined && gpuCountValue !== ''
    ? String(Math.round(parseFloat(gpuCountValue)))
    : undefined;

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle NaN in GPU count display.

If gpuCountValue is non-numeric, parseFloat returns NaN, and String(Math.round(NaN)) produces "NaN", which would be displayed in the UI.

🛡️ Proposed fix to handle NaN
   const gpuCountValue = countResponse?.data?.result?.[0]?.value?.[1];
-  const gpuCountStr =
-    gpuCountValue !== undefined && gpuCountValue !== ''
-      ? String(Math.round(parseFloat(gpuCountValue)))
-      : undefined;
+  const gpuCountStr = (() => {
+    if (gpuCountValue === undefined || gpuCountValue === '') return undefined;
+    const parsed = parseFloat(gpuCountValue);
+    return Number.isNaN(parsed) ? undefined : String(Math.round(parsed));
+  })();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx`
around lines 164 - 168, The GPU count computation currently turns non-numeric
values into the string "NaN" because parseFloat(gpuCountValue) can produce NaN;
update the logic in NodeDetailsGpuMetrics where gpuCountValue and gpuCountStr
are computed to parse the value into a number (e.g., const parsed =
parseFloat(gpuCountValue)), then check Number.isFinite(parsed) (or
!Number.isNaN(parsed)) before calling Math.round and String; only set
gpuCountStr when the parsed value is a valid number, otherwise set it to
undefined or an empty display-safe value.
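
The guard described above can also be written as a small standalone helper, which makes the edge cases easy to unit test. This is a sketch, and `formatGpuCount` is a hypothetical name that does not appear in the PR. It uses `Number.isFinite`, which rejects both `NaN` and infinite values.

```typescript
// Hypothetical helper: turn a raw Prometheus sample value into a display-safe
// GPU count, or undefined when the value is missing or not a finite number.
const formatGpuCount = (raw: string | undefined): string | undefined => {
  if (raw === undefined || raw === '') {
    return undefined;
  }
  const parsed = parseFloat(raw);
  // Number.isFinite is false for NaN, Infinity, and -Infinity.
  return Number.isFinite(parsed) ? String(Math.round(parsed)) : undefined;
};
```

With this helper, `formatGpuCount('3.6')` yields `'4'`, while `formatGpuCount('NaN')` and `formatGpuCount(undefined)` both yield `undefined`, so the UI never renders the literal text "NaN".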

- Skip Prometheus results without a valid GPU identifier to prevent
  silent data loss when multiple results lack label keys.
- Guard GPU count display against NaN from non-numeric Prometheus values.

Co-authored-by: Cursor <cursoragent@cursor.com>
@swshende-cmd

/verified by @swshende-cmd

The screenshot below shows the Node Details page before the RFE implementation:
Nodes_Details_Page_Before_Change

The screenshot below shows the GPU metrics on the Node Details page:
GPU_Nodes_Details_After_implementing_RFE

The screenshot below shows real-time GPU metrics on the Node Details page while a GPU stress test was actively running. It also shows the nvidia-smi output:
GPU_Node_Details_After_Implementing_RFE_And_GPU_Metrics_ShownRealtime

@openshift-ci-robot

@swshende-cmd: Jira verification commands are restricted to collaborators for this repo.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Remove the lodash dependency from nodeGpuMetricsQueries.ts and use
native template literals for PromQL query construction, reducing
bundle weight with zero functional change.

Co-authored-by: Cursor <cursoragent@cursor.com>
@swshende-cmd

/verified by @swshende-cmd

@openshift-ci-robot

@swshende-cmd: Jira verification commands are restricted to collaborators for this repo.



Labels

  • component/core: Related to console core functionality
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • kind/i18n: Indicates issue or PR relates to internationalization or has content that needs to be translated
  • needs-ok-to-test: Indicates a PR that requires an org member to verify it is safe to test.
