[dashboard] Show TPU stats on Cluster tab #63774

Merged
edoakes merged 9 commits into ray-project:master from spencer-p:tpu-dashboard-util on Jun 8, 2026

Conversation

@spencer-p

Contributor

Description

This change shows TPU tensor core utilization and High Bandwidth Memory (HBM) usage in the Ray cluster dashboard.

  • TPU worker rows show tensor core util and HBM usage.
  • If the cluster has TPUs and no GPUs, the column names change from "GPU" and "GRAM" to "TPU" and "HBM" respectively.
  • If the cluster is mixed, both titles are shown, like "GPU / TPU".
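
The title rules above can be sketched as a small function. This is a minimal illustration, not the PR's code: `acceleratorColumnTitles` and `ColumnTitles` are hypothetical names, the mixed memory title "GRAM / HBM" is assumed by analogy with "GPU / TPU", and falling back to the GPU titles when no accelerators are present is also an assumption.

```typescript
// Hypothetical helper: names and signature are illustrative, not taken from the PR.
type ColumnTitles = { utilization: string; memory: string };

function acceleratorColumnTitles(hasGpu: boolean, hasTpu: boolean): ColumnTitles {
  if (hasGpu && hasTpu) {
    // Mixed cluster: show both names. "GRAM / HBM" is assumed by analogy.
    return { utilization: "GPU / TPU", memory: "GRAM / HBM" };
  }
  if (hasTpu) {
    // TPU-only cluster: swap the titles.
    return { utilization: "TPU", memory: "HBM" };
  }
  // GPU-only clusters, and (by assumption) clusters with no accelerators,
  // keep the original titles.
  return { utilization: "GPU", memory: "GRAM" };
}
```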

Example screenshot:
image

Related issues

#57829

@spencer-p spencer-p requested a review from a team as a code owner June 1, 2026 17:08

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces TPU monitoring support to the Ray dashboard, updating components like GPUColumn, GRAMColumn, and index.tsx to dynamically display TPU metrics alongside GPU metrics, and adding a new TPUColumn component. It also updates the reporter agent and models to handle TPU stats. However, several critical issues were identified in the review: a potential runtime crash in the reporter agent due to treating a Pydantic model as a dictionary, and multiple potential TypeErrors in the frontend code if GPU or TPU data is null rather than undefined.
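
The null-versus-undefined concern raised for the frontend can be illustrated with a short sketch. The types and function names here (`DeviceStat`, `NodeStats`, `countUnsafe`, `countDevices`, `totalAccelerators`) are hypothetical and do not come from the dashboard source; the point is only that a default parameter value replaces `undefined` but not `null`, while `??` covers both.

```typescript
// Hypothetical shapes; the real dashboard types may differ.
interface DeviceStat {
  utilization: number; // percent
}

interface NodeStats {
  gpus?: DeviceStat[] | null;
  tpus?: DeviceStat[] | null;
}

// Unsafe: a default parameter only replaces `undefined`,
// so an explicit `null` still reaches `.length` and throws a TypeError.
function countUnsafe(devices: DeviceStat[] | null = []): number {
  return (devices as DeviceStat[]).length;
}

// Safer: `??` replaces both `null` and `undefined` with an empty list.
function countDevices(devices?: DeviceStat[] | null): number {
  return (devices ?? []).length;
}

// Total accelerator count for a node, tolerant of missing or null fields.
function totalAccelerators(node: NodeStats): number {
  return countDevices(node.gpus) + countDevices(node.tpus);
}
```

With this shape, `totalAccelerators({ gpus: null })` returns 0 rather than throwing.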

Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
Comment thread python/ray/dashboard/client/src/pages/node/GPUColumn.tsx Outdated
Comment thread python/ray/dashboard/client/src/pages/node/GRAMColumn.tsx Outdated
Comment thread python/ray/dashboard/client/src/pages/node/index.tsx Outdated
Comment thread python/ray/dashboard/client/src/pages/node/index.tsx Outdated
@spencer-p

Contributor Author

I don't like plumbing through alternate strings and nodes into the GPU column too much. It doesn't seem like it will scale well for other accelerators.

I think it might be preferable to have a new TPU column that is conditionally hidden, and hide the GPU columns iff there are TPUs and no GPUs. It would be ugly for a cluster with a mix of TPUs and GPUs, but I suspect that is rare.

@edoakes

edoakes commented Jun 1, 2026

Collaborator

I don't like plumbing through alternate strings and nodes into the GPU column too much. It doesn't seem like it will scale well for other accelerators.

I think it might be preferable to have a new TPU column that is conditionally hidden, and hide the GPU columns iff there are TPUs and no GPUs. It would be ugly for a cluster with a mix of TPUs and GPUs, but I suspect that is rare.

Agreed that mixing accelerator types would be rare, but should still be supported. Could we abstract the column into some kind of base "accelerator" and then pass an enum that would key into the column names?
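
One way to read this suggestion is an enum whose values key into a table of column names. The sketch below is only an illustration of that idea: `AcceleratorType`, `ACCELERATOR_LABELS`, and `columnLabel` are hypothetical names, not the merged `AcceleratorColumn` implementation.

```typescript
// Hypothetical names throughout; not the PR's actual code.
enum AcceleratorType {
  GPU = "GPU",
  TPU = "TPU",
}

type LabelKind = "utilization" | "memory";

// Labels keyed by the enum. Because the Record is exhaustive, adding a new
// enum member without a matching entry here is a compile-time error.
const ACCELERATOR_LABELS: Record<AcceleratorType, Record<LabelKind, string>> = {
  [AcceleratorType.GPU]: { utilization: "GPU", memory: "GRAM" },
  [AcceleratorType.TPU]: { utilization: "TPU", memory: "HBM" },
};

// Look up the column name for one accelerator type.
function columnLabel(type: AcceleratorType, kind: LabelKind): string {
  return ACCELERATOR_LABELS[type][kind];
}
```

Under this design, supporting another accelerator would mean adding one enum member and one table entry rather than threading alternate strings through the column component.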

@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from 32d1a71 to a1b71bd Compare June 1, 2026 19:13
@ray-gardener ray-gardener Bot added the dashboard, core, observability, and community-contribution labels Jun 1, 2026
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from a1b71bd to 76c7c7a Compare June 6, 2026 00:29
Comment thread python/ray/dashboard/client/src/pages/node/AcceleratorColumn.tsx
Comment thread python/ray/dashboard/client/src/components/ActorTable.tsx
@spencer-p

Contributor Author

Could we abstract the column into some kind of base "accelerator"

Agreed. Just pushed all new changes in this direction.

Comment thread python/ray/dashboard/client/src/pages/node/index.tsx Outdated
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from c83b4dc to 5696790 Compare June 8, 2026 17:16
Comment thread python/ray/dashboard/modules/node/datacenter.py Outdated
Comment thread python/ray/dashboard/modules/node/datacenter.py Outdated
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from afab5a5 to 6f3a676 Compare June 8, 2026 19:03
@spencer-p

Contributor Author

The Actors tab now has stubs for TPUs. I'd prefer to complete that in another PR so we can land incremental changes and avoid scope creep in this one.

The Cluster tab change looks great. Here's what a 4x4 cluster looks like:

image

And here's a dashboard showing one TPU v5 chip and an NVIDIA L4 on the same page:

image

Note the unified column with dynamic title :)

@edoakes edoakes added the go label Jun 8, 2026
@edoakes

edoakes commented Jun 8, 2026

Collaborator

Note the unified column with dynamic title :)

Very nice! Exactly what I was thinking :)

@edoakes edoakes enabled auto-merge (squash) June 8, 2026 20:09
@github-actions github-actions Bot disabled auto-merge June 8, 2026 20:09

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 658e440.

Comment thread python/ray/dashboard/client/src/pages/node/index.tsx
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from 658e440 to dfa4c6b Compare June 8, 2026 21:00
spencer-p added 5 commits June 8, 2026 21:02
Signed-off-by: Spencer Peterson <spencerjp@google.com>
- memory usage is shown in GiB if >1024MiB
- omit placeholder chips
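
The memory display rule in this commit message could look roughly like the following. This is a sketch under assumptions: `formatMemoryMiB` is a hypothetical name, and the one-decimal rounding and the exact output strings are illustrative choices, not taken from the PR.

```typescript
// Hypothetical formatter; rounding and exact strings are assumptions.
function formatMemoryMiB(mib: number): string {
  if (mib > 1024) {
    // Strictly above 1024 MiB: switch to GiB with one decimal place.
    return `${(mib / 1024).toFixed(1)}GiB`;
  }
  // At or below 1024 MiB: stay in MiB.
  return `${Math.round(mib)}MiB`;
}
```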

Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
spencer-p added 4 commits June 8, 2026 21:02
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from dfa4c6b to cbaebb8 Compare June 8, 2026 21:02
@edoakes edoakes enabled auto-merge (squash) June 8, 2026 21:06
@edoakes edoakes merged commit 29abd06 into ray-project:master Jun 8, 2026
7 checks passed

Labels

  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • dashboard: Issues specific to the Ray Dashboard
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

2 participants