
NVIDIA DCGM
Collect GPU power, utilization, memory, temperature, and health metrics.
Overview
NVIDIA Data Center GPU Manager (DCGM) provides cluster-level GPU telemetry for production AI infrastructure. Matcha uses DCGM metrics to understand GPU power behavior, utilization patterns, memory pressure, temperature, and health signals across nodes. Together, these metrics form the telemetry layer for workload-level energy attribution.
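To make the link between power telemetry and energy attribution concrete, here is a minimal sketch of turning periodic power readings into an energy figure. It assumes power is sampled in watts at known timestamps and uses simple trapezoidal integration. The function name `energy_joules` and the sample values are hypothetical, and this is not a description of how Matcha computes attribution internally.

```python
def energy_joules(samples):
    """Integrate (timestamp_seconds, power_watts) samples into joules.

    Uses the trapezoidal rule between consecutive samples. The input
    must be sorted by timestamp. This is an illustrative helper, not
    part of DCGM or Matcha.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2.0 * (t1 - t0)
    return total


# Hypothetical readings for one GPU: 300 W, 320 W, 310 W at 10 s intervals.
samples = [(0, 300.0), (10, 320.0), (20, 310.0)]
joules = energy_joules(samples)  # 3100 J + 3150 J = 6250 J
kwh = joules / 3.6e6             # 1 kWh = 3.6e6 J
```

If a GPU is dedicated to a single workload for the whole interval, the result can be assigned to that workload directly. Shared GPUs need an additional splitting rule, which this sketch does not cover.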
Configuration Steps
Enable the DCGM exporter on your GPU nodes.
Connect Matcha to your DCGM metrics endpoint.
Select the GPU metrics you want to ingest.
Map GPU IDs to nodes, pods, jobs, or workloads.
Verify streaming telemetry in Matcha.
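The metric selection and GPU-to-workload mapping steps above can be sketched in a few lines of Python. The DCGM exporter publishes metrics in Prometheus text format (by default on port 9400 at `/metrics`), and when running on Kubernetes it can attach `namespace` and `pod` labels to each sample. The sketch below parses a small, made-up sample of that output, keeps only a chosen set of metrics, and groups them by host and GPU UUID. The sample values, the `parse_metrics` helper, and the output shape are assumptions for illustration. Matcha's own ingestion interface is not shown here.

```python
import re
from collections import defaultdict

# Metrics to ingest. These are standard dcgm-exporter field names.
SELECTED = {
    "DCGM_FI_DEV_POWER_USAGE",  # power draw, watts
    "DCGM_FI_DEV_GPU_UTIL",     # GPU utilization, percent
    "DCGM_FI_DEV_FB_USED",      # framebuffer memory used, MiB
    "DCGM_FI_DEV_GPU_TEMP",     # GPU temperature, degrees C
}

# Illustrative exporter output; all values and label values are made up.
SAMPLE = '''\
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-aaa",Hostname="node-1",namespace="ml",pod="train-0"} 312.5
DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-bbb",Hostname="node-1",namespace="",pod=""} 61.0
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaa",Hostname="node-1",namespace="ml",pod="train-0"} 97
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-aaa",Hostname="node-1",namespace="ml",pod="train-0"} 1410
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text, selected=SELECTED):
    """Group selected samples by (hostname, GPU UUID).

    Returns {(hostname, uuid): {"workload": str or None, "metrics": {...}}}.
    "workload" is "namespace/pod" when the exporter attached a pod label,
    otherwise None (for example, an idle GPU).
    """
    gpus = defaultdict(lambda: {"workload": None, "metrics": {}})
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE_RE.match(line)
        if not m or m.group(1) not in selected:
            continue  # skip unparseable lines and unselected metrics
        name, raw_labels, value = m.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        entry = gpus[(labels.get("Hostname"), labels.get("UUID"))]
        if labels.get("pod"):
            entry["workload"] = f'{labels.get("namespace", "")}/{labels["pod"]}'
        entry["metrics"][name] = float(value)
    return dict(gpus)


by_gpu = parse_metrics(SAMPLE)
```

Keying on the GPU UUID rather than the per-node index avoids collisions across nodes, since index `0` exists on every host. In a live setup the sample text would be replaced by the body of an HTTP GET against the exporter endpoint.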
