Monitor your NVIDIA Base Command Manager (BCM) cluster by enabling this ActiveGate extension.
NVIDIA Base Command Manager (BCM) streamlines cluster provisioning, workload management, and infrastructure monitoring. It provides all the tools you need to deploy and manage an AI data center.
This extension provides real-time insights into your whole cluster—including nodes, disks, and GPUs—allowing you to correlate that data with the rest of your monitored environment and easily pinpoint issues and bottlenecks.
/root/.cm on the head node. Once located, you will need to copy them into the filesystem of the ActiveGate and make sure that the dtuserag system user (Linux) or Local Service (Windows) can access them.
The metrics collected through this extension consume Dynatrace Davis Data Units (see DDUs for metrics).
A rough estimation of the amount of DDUs consumed by metric ingest can be obtained through the following formula:
((4 * number of clusters) + (10 * number of nodes, counting both head and worker nodes) + (1 * number of disks) + (2 * number of GPUs)) * 525.6 DDUs/year
If your license consists of Custom Metrics, each custom metric is equivalent to 525.6 DDUs/year. For details, see Metric Cost Calculation.
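The estimate above can be worked through in a few lines of Python. This is only a sketch of the arithmetic: the function name `estimate_annual_ddus` and the example cluster sizes are hypothetical and not part of the extension.

```python
# Yearly DDU cost of a single metric, taken from the formula above.
DDUS_PER_METRIC_PER_YEAR = 525.6


def estimate_annual_ddus(clusters: int, nodes: int, disks: int, gpus: int) -> float:
    """Rough yearly DDU estimate; `nodes` counts head and worker nodes together."""
    metrics = 4 * clusters + 10 * nodes + 1 * disks + 2 * gpus
    return metrics * DDUS_PER_METRIC_PER_YEAR


# Hypothetical environment: 1 cluster, 8 nodes, 16 disks, 32 GPUs.
# 4*1 + 10*8 + 1*16 + 2*32 = 164 metrics.
estimate = estimate_annual_ddus(clusters=1, nodes=8, disks=16, gpus=32)
print(f"{estimate:,.1f} DDUs/year")  # 86,198.4 DDUs/year
```

The result is a rough upper-level estimate only. Limiting a monitoring configuration to fewer feature sets reduces the number of metrics reported and therefore the actual consumption.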
When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after activation.
In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.
All metrics that aren't categorized into any feature set are considered to be the default and are always reported.
A metric inherits the feature set of a subgroup, which in turn inherits the feature set of a group. Also, the feature set defined on the metric level overrides the feature set defined on the subgroup level, which in turn overrides the feature set defined on the group level.
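The inheritance and override rules can be illustrated with a short Python sketch. It is not how the extension framework is implemented; the function `effective_feature_set` and the feature set names in the usage lines are hypothetical and serve only to show the order of precedence: metric first, then subgroup, then group, and finally the default.

```python
from typing import Optional

# Name used here for metrics that belong to no feature set (always reported).
DEFAULT_FEATURE_SET = "default"


def effective_feature_set(
    metric: Optional[str] = None,
    subgroup: Optional[str] = None,
    group: Optional[str] = None,
) -> str:
    """Return the feature set that applies to a metric.

    The most specific level that defines a feature set wins:
    metric overrides subgroup, which overrides group.
    """
    for candidate in (metric, subgroup, group):
        if candidate is not None:
            return candidate
    return DEFAULT_FEATURE_SET


# Hypothetical feature set names, for illustration only.
print(effective_feature_set(group="node"))                   # node
print(effective_feature_set(subgroup="gpu", group="node"))   # gpu
print(effective_feature_set(metric="memory", group="node"))  # memory
print(effective_feature_set())                               # default
```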
| Metric name | Metric key | Description |
|---|---|---|
| GPU memory free | nvidia.bcm.gpu_mem_free | — |
| GPU memory utilization | nvidia.bcm.gpu_utilization | — |

| Metric name | Metric key | Description |
|---|---|---|
| Disk free space | nvidia.bcm.free_space | — |

| Metric name | Metric key | Description |
|---|---|---|
| CPU System | nvidia.bcm.cpu_system | — |
| CPU Usage | nvidia.bcm.cpu_usage | — |
| CPU User | nvidia.bcm.cpu_user | — |
| CPU Wait | nvidia.bcm.cpu_wait | — |

| Metric name | Metric key | Description |
|---|---|---|
| Hardware corrupted memory | nvidia.bcm.hardware_corrupted_memory | — |
| Memory free | nvidia.bcm.memory_free | — |
| Page swap in | nvidia.bcm.page_swap_in | — |
| Page swap out | nvidia.bcm.page_swap_out | — |
| Swap free | nvidia.bcm.swap_free | — |
| Out of memory killer | nvidia.bcm.oomkiller | — |
| Total free memory | nvidia.bcm.total_memory_free | — |
| Total free swap | nvidia.bcm.total_swap_free | — |