[dashboard] Show TPU stats on Cluster tab #63774
Code Review
This pull request introduces TPU monitoring support to the Ray dashboard, updating components like GPUColumn, GRAMColumn, and index.tsx to dynamically display TPU metrics alongside GPU metrics, and adding a new TPUColumn component. It also updates the reporter agent and models to handle TPU stats. However, several critical issues were identified in the review: a potential runtime crash in the reporter agent due to treating a Pydantic model as a dictionary, and multiple potential TypeErrors in the frontend code if GPU or TPU data is null rather than undefined.
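The null-versus-undefined concern can be illustrated with a minimal sketch. The type and function names below (`AcceleratorStat`, `NodeStats`, `averageUtilization`) are hypothetical and are not taken from the PR; the sketch only assumes that the backend may send `null` for a missing accelerator list rather than omitting the field.

```typescript
// Hypothetical shapes for illustration; the real dashboard types may differ.
type AcceleratorStat = { utilization: number };

type NodeStats = {
  // The field may be omitted (undefined) or sent explicitly as null.
  gpus?: AcceleratorStat[] | null;
  tpus?: AcceleratorStat[] | null;
};

// `?? []` handles both null and undefined, so the array calls below
// cannot throw a TypeError when no accelerator data is present.
function averageUtilization(
  stats: AcceleratorStat[] | null | undefined,
): number | undefined {
  const list = stats ?? [];
  if (list.length === 0) {
    return undefined;
  }
  const total = list.reduce((sum, s) => sum + s.utilization, 0);
  return total / list.length;
}

// Usage: a node that reports null GPUs and two TPU chips.
const node: NodeStats = {
  gpus: null,
  tpus: [{ utilization: 50 }, { utilization: 100 }],
};
const gpuAverage = averageUtilization(node.gpus);
const tpuAverage = averageUtilization(node.tpus);
```

A check such as `stats !== undefined` would let `null` through and fail on `stats.length`, which is the kind of failure the review flags.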
I don't like plumbing through alternate strings and nodes into the GPU column too much. It doesn't seem like it will scale well for other accelerators. I think it might be preferable to have a new TPU column that is conditionally hidden, and hide the GPU columns iff there are TPUs and no GPUs. It would be ugly for a cluster with a mix of TPUs and GPUs, but I suspect that is rare.
Agreed that mixing accelerator types would be rare, but should still be supported. Could we abstract the column into some kind of base "accelerator" and then pass an enum that would key into the column names?
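One way to read the enum suggestion is sketched below. This is an assumption about the shape of the idea, not the code that was merged: `AcceleratorType`, `COLUMN_LABELS`, and `visibleAccelerators` are hypothetical names, and the label strings are placeholders. The visibility rule follows the earlier comment: show the TPU column only when TPUs exist, and hide the GPU columns only when there are TPUs and no GPUs.

```typescript
// Hypothetical enum that keys into per-accelerator column labels.
enum AcceleratorType {
  GPU = "GPU",
  TPU = "TPU",
}

type ColumnLabels = { utilization: string; memory: string };

// Placeholder labels; the actual dashboard strings may differ.
const COLUMN_LABELS: Record<AcceleratorType, ColumnLabels> = {
  [AcceleratorType.GPU]: { utilization: "GPU", memory: "GRAM" },
  [AcceleratorType.TPU]: { utilization: "TPU", memory: "HBM" },
};

// Decide which accelerator column groups to render for a cluster.
function visibleAccelerators(
  hasGpus: boolean,
  hasTpus: boolean,
): AcceleratorType[] {
  const visible: AcceleratorType[] = [];
  // GPU columns stay visible unless the cluster has TPUs and no GPUs.
  if (hasGpus || !hasTpus) {
    visible.push(AcceleratorType.GPU);
  }
  // TPU columns appear only when at least one TPU is present.
  if (hasTpus) {
    visible.push(AcceleratorType.TPU);
  }
  return visible;
}

// Usage: a mixed cluster renders both column groups.
const mixedLabels = visibleAccelerators(true, true).map(
  (kind) => COLUMN_LABELS[kind],
);
```

Because the labels live in one record keyed by the enum, adding another accelerator type would mean adding one enum member and one record entry rather than threading new strings through the GPU column.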
Agreed. Just pushed all new changes in this direction.
Very nice! Exactly what I was thinking :)
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 658e440.
Signed-off-by: Spencer Peterson <spencerjp@google.com>
- memory usage is shown in GiB if >1024MiB
- omit placeholder chips
Signed-off-by: Spencer Peterson <spencerjp@google.com>
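The unit rule in this commit message can be sketched as follows. `formatMemory` is a hypothetical helper, and the rounding (two decimals for GiB, whole numbers for MiB) is an assumption; only the "GiB above 1024MiB" threshold comes from the commit message.

```typescript
// Hypothetical formatter: values above 1024 MiB are shown in GiB.
function formatMemory(mib: number): string {
  if (mib > 1024) {
    // 1 GiB = 1024 MiB; two decimals is an assumed display precision.
    return `${(mib / 1024).toFixed(2)}GiB`;
  }
  return `${mib.toFixed(0)}MiB`;
}

// Usage: format used and total memory for one chip.
const usage = `${formatMemory(512)}/${formatMemory(16384)}`;
```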



Description
This change shows TPU tensor core utilization and High Bandwidth Memory (HBM) utilization in the Ray cluster dashboard.

Related issues
#57829