RFE-8500: Display GPU metrics on the Node Details page#16456
RFE-8500: Display GPU metrics on the Node Details page#16456swshende-cmd wants to merge 3 commits into
Conversation
Adds a new GPU metrics section to the Node Details page that surfaces DCGM exporter metrics (utilization, temperature, power usage, framebuffer memory) per GPU device, along with summary information (GPU count, model, capacity, allocatable) from the Kubernetes Node resource. The section is only rendered for nodes that report GPU capacity (nvidia.com/gpu or amd.com/gpu) or have active DCGM metrics. PromQL queries use both Hostname and node label selectors joined with `or` to support common DCGM exporter labeling conventions. Includes unit tests for query generation helpers and component rendering. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@swshende-cmd: This pull request references RFE-8500 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature request to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: swshende-cmd The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @swshende-cmd. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
📝 WalkthroughWalkthroughThis PR extends the OpenShift Console's node details view with GPU metrics visualization powered by Prometheus DCGM exporter data. It introduces a GPU query module that builds PromQL expressions for count, utilization, temperature, power usage, and framebuffer memory metrics while handling PromQL label escaping and dual DCGM label conventions. A new 🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts (1)
1-1: ⚡ Quick winConsider using template literals instead of lodash templates.
Lodash is imported solely for
_.template(), which adds bundle weight. Native template literals can achieve the same result with zero dependencies and better performance:const buildQuery = (hn: string, nd: string, metric: string) => `${metric}{${hn}} or ${metric}{${nd}}`;♻️ Proposed refactor using template literals
-import * as _ from 'lodash'; - export enum GpuMetricQuery { GPU_COUNT = 'GPU_COUNT', GPU_UTILIZATION = 'GPU_UTILIZATION', @@ -32,24 +30,24 @@ }; const gpuQueries = { - [GpuMetricQuery.GPU_COUNT]: _.template( - `count(DCGM_FI_DEV_GPU_UTIL{<%= hn %>} or DCGM_FI_DEV_GPU_UTIL{<%= nd %>})`, - ), - [GpuMetricQuery.GPU_UTILIZATION]: _.template( - `DCGM_FI_DEV_GPU_UTIL{<%= hn %>} or DCGM_FI_DEV_GPU_UTIL{<%= nd %>}`, - ), - [GpuMetricQuery.GPU_TEMPERATURE]: _.template( - `DCGM_FI_DEV_GPU_TEMP{<%= hn %>} or DCGM_FI_DEV_GPU_TEMP{<%= nd %>}`, - ), - [GpuMetricQuery.GPU_POWER_USAGE]: _.template( - `DCGM_FI_DEV_POWER_USAGE{<%= hn %>} or DCGM_FI_DEV_POWER_USAGE{<%= nd %>}`, - ), - [GpuMetricQuery.GPU_FB_USED]: _.template( - `DCGM_FI_DEV_FB_USED{<%= hn %>} or DCGM_FI_DEV_FB_USED{<%= nd %>}`, - ), - [GpuMetricQuery.GPU_FB_FREE]: _.template( - `DCGM_FI_DEV_FB_FREE{<%= hn %>} or DCGM_FI_DEV_FB_FREE{<%= nd %>}`, - ), + [GpuMetricQuery.GPU_COUNT]: (hn: string, nd: string) => + `count(DCGM_FI_DEV_GPU_UTIL{${hn}} or DCGM_FI_DEV_GPU_UTIL{${nd}})`, + [GpuMetricQuery.GPU_UTILIZATION]: (hn: string, nd: string) => + `DCGM_FI_DEV_GPU_UTIL{${hn}} or DCGM_FI_DEV_GPU_UTIL{${nd}}`, + [GpuMetricQuery.GPU_TEMPERATURE]: (hn: string, nd: string) => + `DCGM_FI_DEV_GPU_TEMP{${hn}} or DCGM_FI_DEV_GPU_TEMP{${nd}}`, + [GpuMetricQuery.GPU_POWER_USAGE]: (hn: string, nd: string) => + `DCGM_FI_DEV_POWER_USAGE{${hn}} or DCGM_FI_DEV_POWER_USAGE{${nd}}`, + [GpuMetricQuery.GPU_FB_USED]: (hn: string, nd: string) => + `DCGM_FI_DEV_FB_USED{${hn}} or DCGM_FI_DEV_FB_USED{${nd}}`, + [GpuMetricQuery.GPU_FB_FREE]: (hn: string, nd: string) => + `DCGM_FI_DEV_FB_FREE{${hn}} or DCGM_FI_DEV_FB_FREE{${nd}}`, }; export const getGpuMetricQueries = (nodeName: string): Record<GpuMetricQuery, string> => { const selectors = buildNodeSelectors(nodeName); return { - [GpuMetricQuery.GPU_COUNT]: gpuQueries[GpuMetricQuery.GPU_COUNT](selectors), - [GpuMetricQuery.GPU_UTILIZATION]: gpuQueries[GpuMetricQuery.GPU_UTILIZATION](selectors), - [GpuMetricQuery.GPU_TEMPERATURE]: gpuQueries[GpuMetricQuery.GPU_TEMPERATURE](selectors), - [GpuMetricQuery.GPU_POWER_USAGE]: gpuQueries[GpuMetricQuery.GPU_POWER_USAGE](selectors), - [GpuMetricQuery.GPU_FB_USED]: gpuQueries[GpuMetricQuery.GPU_FB_USED](selectors), - [GpuMetricQuery.GPU_FB_FREE]: gpuQueries[GpuMetricQuery.GPU_FB_FREE](selectors), + [GpuMetricQuery.GPU_COUNT]: gpuQueries[GpuMetricQuery.GPU_COUNT](selectors.hn, selectors.nd), + [GpuMetricQuery.GPU_UTILIZATION]: gpuQueries[GpuMetricQuery.GPU_UTILIZATION](selectors.hn, selectors.nd), + [GpuMetricQuery.GPU_TEMPERATURE]: gpuQueries[GpuMetricQuery.GPU_TEMPERATURE](selectors.hn, selectors.nd), + [GpuMetricQuery.GPU_POWER_USAGE]: gpuQueries[GpuMetricQuery.GPU_POWER_USAGE](selectors.hn, selectors.nd), + [GpuMetricQuery.GPU_FB_USED]: gpuQueries[GpuMetricQuery.GPU_FB_USED](selectors.hn, selectors.nd), + [GpuMetricQuery.GPU_FB_FREE]: gpuQueries[GpuMetricQuery.GPU_FB_FREE](selectors.hn, selectors.nd), }; };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts` at line 1, The code imports lodash solely for _.template; remove the lodash import and replace uses of _.template (e.g., the function/buildQuery that constructs metric query strings) with native template literals — update any occurrences that call _.template(...) and the returned function invocation to a simple arrow function like buildQuery(hn, nd, metric) that returns `${metric}{${hn}} or ${metric}{${nd}}`, ensuring all call sites use the new buildQuery signature.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx`:
- Around line 41-59: The resultsByGpu function currently uses an empty string
when no GPU identifier is present, causing later unlabeled results to overwrite
earlier ones; update the reducer to skip any PrometheusResult where the computed
gpu identifier (the variable gpu derived from r.metric?.gpu ||
r.metric?.GPU_I_ID || r.metric?.UUID || r.metric?.device) is falsy, so only
entries with a valid identifier are added to the acc, and optionally emit a
console.warn or logger warning when a result is skipped; ensure you reference
resultsByGpu, the gpu variable, acc, and response.data.result when making the
change.
- Around line 164-168: The GPU count computation currently turns non-numeric
values into the string "NaN" because parseFloat(gpuCountValue) can produce NaN;
update the logic in NodeDetailsGpuMetrics where gpuCountValue and gpuCountStr
are computed to parse the value into a number (e.g., const parsed =
parseFloat(gpuCountValue)), then check Number.isFinite(parsed) (or
!Number.isNaN(parsed)) before calling Math.round and String; only set
gpuCountStr when the parsed value is a valid number, otherwise set it to
undefined or an empty display-safe value.
---
Nitpick comments:
In `@frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts`:
- Line 1: The code imports lodash solely for _.template; remove the lodash
import and replace uses of _.template (e.g., the function/buildQuery that
constructs metric query strings) with native template literals — update any
occurrences that call _.template(...) and the returned function invocation to a
simple arrow function like buildQuery(hn, nd, metric) that returns
`${metric}{${hn}} or ${metric}{${nd}}`, ensuring all call sites use the new
buildQuery signature.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3e407984-1bc8-4e92-b9ff-9044fcc1406f
📒 Files selected for processing (6)
frontend/packages/console-app/locales/en/console-app.jsonfrontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (12)
frontend/**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
frontend/**/*.{ts,tsx,js,jsx}: Never import from package index files (e.g.,@console/shared) in new code, as they can create circular dependencies and slow builds. Import from specific file paths instead.
Do not use backticks int()calls for i18n strings, as the i18n parser cannot extract keys from template literals. Use single or double quotes instead.
Files:
frontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.tsfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Never import from deprecated packages or use code with the
@deprecatedTSdoc tag in new code.
**/*.{ts,tsx}: Use React functional components with hooks instead of class components
State Management: Use React hooks and Context API (migrating away from legacy Redux/Immutable.js)
Hooks: Use existing hooks fromconsole-sharedwhen possible (useK8sWatchResource,useUserSettings, etc.)
API calls: Use k8s resource hooks for data fetching,consoleFetchJSONfor HTTP requests
Extensions: Use console extension points for plugin integration
Types: Check existing types inconsole-sharedbefore creating new ones
Dynamic Plugins: Use console extension points for plugin integration
Styling: Use SCSS modules co-located with components, PatternFly design system components, avoid any SCSS/CSS if possible
Accessibility: Follow WCAG 2.1 AA standards, use semantic HTML, ARIA labels where needed, ensure keyboard navigation, test with screen readers
i18n: UseuseTranslation('namespace')hook withkeyformat for translation keys
Error Handling: Use ErrorBoundary components and graceful degradation patterns
Optimize re-renders: UseuseCallbackfor memoized callbacks to avoid function recreation every render
Optimize re-renders: UseuseMemofor expensive computations to avoid recalculating on every render
Lazy loading: UseReact.lazy()to lazy load heavy components
TypeScript type safety: Avoid usinganytype; suggest proper type definitions and verify null/undefined are handled properly
Type component props properly: Reuse existing component prop types instead of duplicating type definitions
Use proper hooks: Use specialized hooks likeusePluginInfofor plugin data instead of generic data fetching patterns
Avoid deprecated components: Check for JSDoc@deprecatedtags, import paths containing/deprecated, andDEPRECATED_file name prefix before using components
Importing from barrel files and circular dependencies: Import directly from specific files instead...
Files:
frontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.tsfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
frontend/**/*.{ts,tsx,js,jsx,json}
📄 CodeRabbit inference engine (AGENTS.md)
Never use absolute URLs or paths in the console code. The console runs behind a proxy under an arbitrary path.
Files:
frontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.tsfrontend/packages/console-app/locales/en/console-app.jsonfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
frontend/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
When writing code for static plugins, ensure that all
$codeRefreference the corresponding extension type from the@console/dynamic-plugin-sdkpackage.
Files:
frontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.tsfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{tsx,ts}
📄 CodeRabbit inference engine (TESTING.md)
**/*.{tsx,ts}: Always usepage.getByTestId('x')for Playwright selectors which queries[data-test="x"]. If a React element only has a legacy test attribute, adddata-testto the element. Never remove legacy attributes
Preferdata-testattributes in Cypress selectors (e.g.,cy.get('[data-test="create-deployment"]')) over brittle CSS/ARIA selectorsFile Naming: PascalCase for components, kebab-case for utilities,
*.spec.ts(x)for tests
Files:
frontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.tsfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{go,ts,tsx,js,jsx}
📄 CodeRabbit inference engine (STYLEGUIDE.md)
Use lowercase dash-separated names for all files to avoid git issues with case-insensitive file systems
Files:
frontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.tsfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (STYLEGUIDE.md)
**/*.{ts,tsx,js,jsx}: New code MUST be written in TypeScript, not JavaScript
Prefer functional programming patterns and immutable data structures
Run the linter and follow all rules defined in .eslintrc
Never use absolute paths in code - the app should be able to run behind a proxy under an arbitrary path
Files:
frontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.tsfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
frontend/**/*.{js,ts,tsx}
📄 CodeRabbit inference engine (README.md)
frontend/**/*.{js,ts,tsx}: Support only the latest versions of Edge, Chrome, Safari, and Firefox browsers; IE 11 and earlier are not supported
CSP violations should be automatically reported to telemetry by parsing dynamic plugin names from securitypolicyviolation events, with throttling to prevent duplicate reports within a day
Files:
frontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.tsfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.{ts,tsx,js}
📄 CodeRabbit inference engine (INTERNATIONALIZATION.md)
For dynamic translation keys that cannot be parsed by i18next-parser (t(key), t('key' + id), t(
key${id})), specify possible static values in comments for the parser to extract
Files:
frontend/packages/console-app/src/components/nodes/NodeDetails.tsxfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsxfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.tsfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.spec.{ts,tsx}
📄 CodeRabbit inference engine (STYLEGUIDE.md)
Tests should follow a similar 'test tables' convention as used in Go where applicable
Files:
frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
**/*.ts
📄 CodeRabbit inference engine (STYLEGUIDE.md)
Plugin SDK Changes: Any updates to
console-dynamic-plugin-sdkshould aim to maintain backward compatibility as it's a public API - use theplugin-api-reviewskill to vet changes for public API impact and ensure proper documentation updates
Files:
frontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.tsfrontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts
**/__tests__/**/*.spec.tsx
📄 CodeRabbit inference engine (TESTING.md)
**/__tests__/**/*.spec.tsx: Unit tests must use Jest framework with@testing-library/reactand@testing-library/jest-domlibraries for testing React components, hooks, and utilities
Test what users see and interact with - DO NOT test internal component state, private methods, props passed to child components, CSS class names/styles, or component structure
Use accessibility-first queries that match how screen readers and users interact with the UI
Always prefer role-based queries (e.g.,getByRole) over generic selectors for semantic testing
Use reusable helper functions such asrenderWithProvidersandrenderHookWithProvidersfromfrontend/packages/console-shared/src/test-utils. Extract repetitive setup not covered by these helpers into custom functions if needed
Handle asynchronous updates withfindBy*andwaitForin tests
Use proper TypeScript types for props, state, and mock data in unit tests
Structure tests following the Arrange-Act-Assert (AAA) pattern: Arrange (render component with mocks), Act (perform user actions), Assert (verify expected state)
Test files must be located in__tests__/directory within the component directory, use the same name as the implementation file, and use.spec.tsxextension
When mocking, ALWAYS use ESMimportstatements at the top of the file - NEVER userequire('react')orReact.createElement()in mocks
Preferjest.mock()for module mocks andjest.fn()for component mocks instead ofjest.spyOn()
Keep mocks simple - returnnull, strings, orchildrendirectly. Usejest.fn(() => null)for simple component mocks,jest.fn(() => 'ComponentName')for mocks that display text, andjest.fn((props) => props.children)for wrapper components
Mock custom hooks withjest.fn()returning mock data
Clean up mocks withafterEach(() => { jest.restoreAllMocks(); })
Files:
frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx
🔇 Additional comments (14)
frontend/packages/console-app/src/components/nodes/nodeGpuMetricsQueries.ts (4)
26-32: LGTM!
55-65: LGTM!
70-74: LGTM!
17-18: Verify PromQL label escaping completeness.The escaping logic handles backslashes and single quotes for PromQL single-quoted string literals. Confirm this covers all required escaping per the PromQL specification.
PromQL label matcher escaping rules for single-quoted stringsfrontend/packages/console-app/src/components/nodes/__tests__/nodeGpuMetricsQueries.spec.ts (1)
1-87: LGTM!frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx (6)
1-24: LGTM!
25-39: LGTM!
61-87: LGTM!
89-102: LGTM!
108-162: LGTM!
232-259: LGTM!frontend/packages/console-app/src/components/nodes/__tests__/NodeDetailsGpuMetrics.spec.tsx (1)
1-157: LGTM!frontend/packages/console-app/src/components/nodes/NodeDetails.tsx (1)
2-2: LGTM!Also applies to: 5-5, 16-16
frontend/packages/console-app/locales/en/console-app.json (1)
440-450: LGTM!
| const resultsByGpu = ( | ||
| response: PrometheusResponse | undefined, | ||
| ): Record<string, GpuMetricResult> => { | ||
| if (!response?.data?.result?.length) { | ||
| return {}; | ||
| } | ||
| return response.data.result.reduce<Record<string, GpuMetricResult>>( | ||
| (acc, r: PrometheusResult) => { | ||
| const gpu = r.metric?.gpu ?? r.metric?.GPU_I_ID ?? r.metric?.UUID ?? r.metric?.device ?? ''; | ||
| acc[gpu] = { | ||
| value: r.value?.[1] ?? '', | ||
| modelName: r.metric?.modelName, | ||
| device: r.metric?.device, | ||
| }; | ||
| return acc; | ||
| }, | ||
| {}, | ||
| ); | ||
| }; |
There was a problem hiding this comment.
Handle missing GPU identifiers to prevent data loss.
When no GPU identifier label is found (line 49), the code defaults to an empty string ''. If multiple metric results lack identifiers, they will overwrite each other in the acc accumulator, silently losing data.
Consider either:
- Skipping results without valid GPU identifiers
- Generating a unique fallback ID (e.g., using array index)
- Logging a warning when identifiers are missing
🛡️ Proposed fix to skip metrics without GPU identifiers
const resultsByGpu = (
response: PrometheusResponse | undefined,
): Record<string, GpuMetricResult> => {
if (!response?.data?.result?.length) {
return {};
}
return response.data.result.reduce<Record<string, GpuMetricResult>>(
(acc, r: PrometheusResult) => {
const gpu = r.metric?.gpu ?? r.metric?.GPU_I_ID ?? r.metric?.UUID ?? r.metric?.device ?? '';
+ if (!gpu) {
+ // Skip metrics without GPU identifier to prevent data loss
+ return acc;
+ }
acc[gpu] = {
value: r.value?.[1] ?? '',
modelName: r.metric?.modelName,
device: r.metric?.device,
};
return acc;
},
{},
);
};🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx`
around lines 41 - 59, The resultsByGpu function currently uses an empty string
when no GPU identifier is present, causing later unlabeled results to overwrite
earlier ones; update the reducer to skip any PrometheusResult where the computed
gpu identifier (the variable gpu derived from r.metric?.gpu ||
r.metric?.GPU_I_ID || r.metric?.UUID || r.metric?.device) is falsy, so only
entries with a valid identifier are added to the acc, and optionally emit a
console.warn or logger warning when a result is skipped; ensure you reference
resultsByGpu, the gpu variable, acc, and response.data.result when making the
change.
| const gpuCountValue = countResponse?.data?.result?.[0]?.value?.[1]; | ||
| const gpuCountStr = | ||
| gpuCountValue !== undefined && gpuCountValue !== '' | ||
| ? String(Math.round(parseFloat(gpuCountValue))) | ||
| : undefined; |
There was a problem hiding this comment.
Handle NaN in GPU count display.
If gpuCountValue is non-numeric, parseFloat returns NaN, and String(Math.round(NaN)) produces "NaN", which would be displayed in the UI.
🛡️ Proposed fix to handle NaN
const gpuCountValue = countResponse?.data?.result?.[0]?.value?.[1];
- const gpuCountStr =
- gpuCountValue !== undefined && gpuCountValue !== ''
- ? String(Math.round(parseFloat(gpuCountValue)))
- : undefined;
+ const gpuCountStr = (() => {
+ if (gpuCountValue === undefined || gpuCountValue === '') return undefined;
+ const parsed = parseFloat(gpuCountValue);
+ return Number.isNaN(parsed) ? undefined : String(Math.round(parsed));
+ })();📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const gpuCountValue = countResponse?.data?.result?.[0]?.value?.[1]; | |
| const gpuCountStr = | |
| gpuCountValue !== undefined && gpuCountValue !== '' | |
| ? String(Math.round(parseFloat(gpuCountValue))) | |
| : undefined; | |
| const gpuCountValue = countResponse?.data?.result?.[0]?.value?.[1]; | |
| const gpuCountStr = (() => { | |
| if (gpuCountValue === undefined || gpuCountValue === '') return undefined; | |
| const parsed = parseFloat(gpuCountValue); | |
| return Number.isNaN(parsed) ? undefined : String(Math.round(parsed)); | |
| })(); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/packages/console-app/src/components/nodes/NodeDetailsGpuMetrics.tsx`
around lines 164 - 168, The GPU count computation currently turns non-numeric
values into the string "NaN" because parseFloat(gpuCountValue) can produce NaN;
update the logic in NodeDetailsGpuMetrics where gpuCountValue and gpuCountStr
are computed to parse the value into a number (e.g., const parsed =
parseFloat(gpuCountValue)), then check Number.isFinite(parsed) (or
!Number.isNaN(parsed)) before calling Math.round and String; only set
gpuCountStr when the parsed value is a valid number, otherwise set it to
undefined or an empty display-safe value.
- Skip Prometheus results without a valid GPU identifier to prevent silent data loss when multiple results lack label keys. - Guard GPU count display against NaN from non-numeric Prometheus values. Co-authored-by: Cursor <cursoragent@cursor.com>
|
/verified by @swshende-cmd Below are the screen shots before the RFE implementation Below screenprint shows the GPU metric details on the Node details page Below screenprint shows the realtime GPU metric details on the Node details page , while testing The GPU stress test was actively run. Here's what nvidia-smi showed. |
|
@swshende-cmd: Jira verification commands are restricted to collaborators for this repo. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Remove the lodash dependency from nodeGpuMetricsQueries.ts and use native template literals for PromQL query construction, reducing bundle weight with zero functional change. Co-authored-by: Cursor <cursoragent@cursor.com>
|
/verified by @swshende-cmd |
|
@swshende-cmd: Jira verification commands are restricted to collaborators for this repo. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |



Summary
Compute > Nodes > <node> > Details) that displays real-time DCGM exporter metrics per GPU device: utilization (%), temperature (°C), power usage (W), and framebuffer memory (used/free).status.capacity/status.allocatablefields.nvidia.com/gpuoramd.com/gpu) or have active DCGM metrics, ensuring no visual impact on non-GPU nodes.Details
Problem
GPU metrics are currently only accessible through a separate GPU dashboard. Customers have requested that key GPU performance metrics be displayed directly on the Node Details page for improved visibility and faster access to critical information.
Ref: RFE-8500
Solution
New files:
nodeGpuMetricsQueries.ts— PromQL query builders targeting DCGM exporter metrics (DCGM_FI_DEV_GPU_UTIL,DCGM_FI_DEV_GPU_TEMP,DCGM_FI_DEV_POWER_USAGE,DCGM_FI_DEV_FB_USED,DCGM_FI_DEV_FB_FREE). Queries use bothHostnameandnodelabel selectors joined withorto support common DCGM labeling conventions.NodeDetailsGpuMetrics.tsx— React component that polls Prometheus viausePrometheusPoll, displays a summaryDescriptionList(GPU count, model, capacity, allocatable) and a per-GPU table with device labels (e.g., "GPU 0 — Tesla T4").Modified files:
NodeDetails.tsx— Imports and conditionally renders<NodeDetailsGpuMetrics>after the overview section.console-app.json— Adds 11 i18n keys for GPU metrics labels and messages.Prerequisites
Requires the NVIDIA GPU Operator with DCGM exporter deployed on the cluster to expose GPU metrics to Prometheus. Nodes without GPU capacity or DCGM metrics will not show the section.
Test plan
nodeGpuMetricsQueries.spec.ts(14 tests) andNodeDetailsGpuMetrics.spec.tsx(6 tests)Made with Cursor
Summary by CodeRabbit
New Features
Tests