
GPU Monitoring

Niccanor Dhas edited this page Feb 22, 2026 · 1 revision

tmam can collect real-time GPU metrics from NVIDIA and AMD GPUs and display them in the dashboard's GPU Analytics view.


Enabling GPU Monitoring

Pass collect_gpu_stats=True to init():

from tmam import init

init(
    url="http://localhost:5050/api/sdk",
    public_key="pk-tmam-xxxxxxxx",
    secrect_key="sk-tmam-xxxxxxxx",
    application_name="my-gpu-app",
    collect_gpu_stats=True,
)

tmam will auto-detect whether an NVIDIA or AMD GPU is present.
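The exact detection mechanism is internal to tmam, but the effect is roughly what this sketch shows; `detect_gpu_vendor` is a hypothetical helper that guesses the vendor from which management library is importable, not a tmam API:

```python
import importlib.util

def detect_gpu_vendor():
    """Guess the GPU vendor from which vendor library is installed."""
    if importlib.util.find_spec("pynvml") is not None:
        return "nvidia"
    if importlib.util.find_spec("amdsmi") is not None:
        return "amd"
    return None  # no supported GPU library found

print(detect_gpu_vendor())
```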


GPU Vendor Requirements

NVIDIA GPUs

Install the pynvml library (NVIDIA Management Library Python bindings):

pip install pynvml

Requires NVIDIA drivers to be installed on the host. Works with any CUDA-capable GPU.
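For a sense of what the collector reads, here is a minimal sketch that queries NVML directly through pynvml. `read_nvidia_metrics` is a hypothetical helper for illustration (tmam's internal collector may differ); it degrades to an empty list on hosts without pynvml or an NVIDIA driver:

```python
def read_nvidia_metrics():
    """Return a list of per-GPU metric dicts, or [] if NVML is unavailable."""
    try:
        import pynvml
    except ImportError:
        return []  # pynvml not installed
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # no NVIDIA driver / GPU on this host
    metrics = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # byte counts
        metrics.append({
            "gpu.index": i,
            "gpu.utilization": pynvml.nvmlDeviceGetUtilizationRates(handle).gpu,
            "gpu.temperature": pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU),
            "gpu.memory.used": mem.used // (1024 * 1024),   # bytes -> MB
            "gpu.memory.total": mem.total // (1024 * 1024),
        })
    pynvml.nvmlShutdown()
    return metrics

print(read_nvidia_metrics())
```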

AMD GPUs

Install the amdsmi library:

pip install amdsmi

Requires AMD ROCm drivers to be installed on the host.


Metrics Collected

For each GPU in the system, tmam collects the following metrics, tagged with the GPU index, UUID, and name:

| Metric Name | OTel Name | Description |
| --- | --- | --- |
| Utilization | gpu.utilization | Core utilization % |
| Encoder Utilization | gpu.enc.utilization | Video encoder utilization % |
| Decoder Utilization | gpu.dec.utilization | Video decoder utilization % |
| Temperature | gpu.temperature | Temperature in °C |
| Fan Speed | gpu.fan_speed | Fan speed (NVIDIA only) |
| Memory Available | gpu.memory.available | Available VRAM in MB |
| Memory Total | gpu.memory.total | Total VRAM in MB |
| Memory Used | gpu.memory.used | Used VRAM in MB |
| Memory Free | gpu.memory.free | Free VRAM in MB |
| Power Draw | gpu.power.draw | Current power draw in W |
| Power Limit | gpu.power.limit | Power limit in W |
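The gpu.memory.* values are reported in MB, while driver libraries such as NVML report raw byte counts. A quick sketch of the conversion (assuming the binary scaling 1 MB = 1024 × 1024 bytes, which is how NVML-style byte counts are usually divided; `bytes_to_mb` is an illustrative helper, not a tmam API):

```python
def bytes_to_mb(n_bytes):
    """Convert a raw byte count to MB, as used by the gpu.memory.* metrics."""
    return n_bytes / (1024 * 1024)

total = bytes_to_mb(8 * 1024**3)  # an 8 GiB card -> 8192.0 MB
used = bytes_to_mb(3 * 1024**3)   # 3 GiB in use  -> 3072.0 MB
print(total - used)               # free VRAM: 5120.0 MB
```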

All metrics are tagged with:

  • gpu.index — GPU index (0, 1, 2...)
  • gpu.uuid — GPU UUID
  • gpu.name — GPU model name
  • service.name — your application_name
  • deployment.environment — your environment
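The tag set above can be pictured as the attribute dictionary attached to each metric data point. This is illustrative only; `gpu_metric_attributes` is a hypothetical helper, and the example values are placeholders:

```python
def gpu_metric_attributes(index, uuid, name):
    """Build the attribute set attached to every GPU metric data point."""
    return {
        "gpu.index": index,
        "gpu.uuid": uuid,
        "gpu.name": name,
        "service.name": "my-gpu-app",     # from application_name
        "deployment.environment": "dev",  # from environment
    }

print(gpu_metric_attributes(0, "GPU-xxxxxxxx", "NVIDIA A100"))
```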

Viewing GPU Metrics

In the dashboard, navigate to Analytics → GPU to see:

  • GPU utilization over time
  • Memory usage (used vs. total)
  • Temperature and power draw
  • Per-GPU breakdowns for multi-GPU systems

No GPU Detected

If collect_gpu_stats=True but no supported GPU is found, tmam logs:

Tmam GPU Instrumentation Error: No supported GPUs found.
If this is a non-GPU host, set `collect_gpu_stats=False` to disable GPU stats.

This does not affect other tracing or metrics collection — it is non-fatal.
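Since the missing-GPU case is non-fatal but noisy, one option is to enable collection only when a GPU actually appears to be present. A sketch of that guard (probing for the vendor CLIs on PATH is a heuristic of this example, not something tmam does itself):

```python
import shutil

# Heuristic: enable GPU stats only when a vendor CLI tool is on PATH.
gpu_present = any(shutil.which(tool) for tool in ("nvidia-smi", "rocm-smi"))
print(gpu_present)
```

The resulting boolean can then be passed as collect_gpu_stats=gpu_present in the init() call, so the same code runs cleanly on both GPU and non-GPU hosts.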


Example: LLM + GPU Monitoring

from tmam import init
from transformers import pipeline

init(
    url="http://localhost:5050/api/sdk",
    public_key="pk-tmam-xxxxxxxx",
    secrect_key="sk-tmam-xxxxxxxx",
    application_name="local-llm",
    environment="dev",
    collect_gpu_stats=True,  # monitor GPU while running inference
)

# Transformers calls are auto-instrumented
generator = pipeline("text-generation", model="gpt2", device=0)
output = generator("The future of AI is", max_new_tokens=50)
print(output[0]["generated_text"])

While inference runs, tmam records both the LLM span (tokens, latency) and GPU metrics (VRAM usage, utilization) — correlatable by timestamp in the dashboard.
