Monitor your NVIDIA Base Command Manager (BCM) cluster by enabling this ActiveGate extension.
NVIDIA Base Command Manager (BCM) streamlines cluster provisioning, workload management, and infrastructure monitoring. It provides all the tools you need to deploy and manage an AI data center.
This extension provides real-time insights into your whole cluster—including nodes, disks, and GPUs—allowing you to correlate that data with the rest of your monitored environment and easily pinpoint issues and bottlenecks.
/root/.cm on the head node. Once located, copy them into the filesystem of the ActiveGate and make sure that the dtuserag system user (Linux) or Local Service (Windows) can access them.
There is no charge to use the extension. You are only charged for the data that the extension ingests.
The NVIDIA BCM extension ingests custom metrics, which consume Davis Data Units (DDUs) (Dynatrace classic license) or Metrics powered by Grail (DPS), according to your license model.
The approximate number of metric data points per minute is:
(4 * <number of clusters>) + (10 * <number of nodes (both head and worker nodes)>) + (1 * <number of disks>) + (2 * <number of GPUs>)
In the Dynatrace Platform Subscription, metric ingestion consumes Metrics powered by Grail according to the number of ingested metric data points.
To calculate the approximate yearly consumption, apply the following calculation: <metric data points per minute> * 60 minutes * 24 hours * 365 days.
In the classic licensing model, metric ingestion consumes Davis Data Units (DDUs) at the rate of 0.001 DDUs per metric data point. To estimate annual DDU usage, multiply the annual number of metric data points from the calculation above by 0.001.
The DDU cost above does not include any possible log events or custom events that are triggered by the extension. For more information, see DDU events.
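As a worked example of the estimate above, the following Python sketch evaluates the per-minute formula and derives the yearly figures from it. The function names and the example environment (1 cluster, 10 nodes, 20 disks, 16 GPUs) are illustrative assumptions, not values taken from the extension. The rate of 0.001 DDUs per data point is the one quoted above.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # 60 minutes * 24 hours * 365 days
DDU_PER_DATA_POINT = 0.001        # classic licensing rate quoted above


def data_points_per_minute(clusters: int, nodes: int, disks: int, gpus: int) -> int:
    """Approximate metric data points per minute.

    `nodes` counts both head and worker nodes.
    """
    return 4 * clusters + 10 * nodes + 1 * disks + 2 * gpus


def data_points_per_year(per_minute: int) -> int:
    """Approximate yearly metric data points (relevant for DPS consumption)."""
    return per_minute * MINUTES_PER_YEAR


def ddus_per_year(per_minute: int) -> float:
    """Approximate yearly DDU usage in the classic licensing model."""
    return data_points_per_year(per_minute) * DDU_PER_DATA_POINT


# Hypothetical environment: 1 cluster, 10 nodes, 20 disks, 16 GPUs.
per_minute = data_points_per_minute(clusters=1, nodes=10, disks=20, gpus=16)
print(per_minute)                        # 156
print(data_points_per_year(per_minute))  # 81993600
print(round(ddus_per_year(per_minute)))  # 81994
```

As in the text above, this estimate covers metric data points only and leaves out any log events or custom events.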
When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after the activation.
In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.
All metrics that aren't categorized into any feature set are considered to be the default and are always reported.
A metric inherits the feature set of a subgroup, which in turn inherits the feature set of a group. Also, the feature set defined on the metric level overrides the feature set defined on the subgroup level, which in turn overrides the feature set defined on the group level.
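The inheritance and override rule above can be sketched in Python as follows. This is an illustration only: the function name `effective_feature_set`, its parameters, and the example feature set names are hypothetical and do not reflect the extension's actual definition format.

```python
from typing import Optional


def effective_feature_set(
    group: Optional[str] = None,
    subgroup: Optional[str] = None,
    metric: Optional[str] = None,
) -> str:
    """Return the feature set a metric ends up in.

    The most specific level that defines a feature set wins:
    metric first, then subgroup, then group. A metric with no
    feature set at any level counts as "default" and is always
    reported.
    """
    for candidate in (metric, subgroup, group):
        if candidate is not None:
            return candidate
    return "default"


# Hypothetical feature set names, used only for illustration.
print(effective_feature_set(group="cpu"))                     # cpu
print(effective_feature_set(group="cpu", subgroup="memory"))  # memory
print(effective_feature_set(group="cpu", metric="gpu"))       # gpu
print(effective_feature_set())                                # default
```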
| Metric name | Metric key | Description |
|---|---|---|
| Hardware corrupted memory | nvidia.bcm.hardware_corrupted_memory | — |
| Memory free | nvidia.bcm.memory_free | — |
| Page swap in | nvidia.bcm.page_swap_in | — |
| Page swap out | nvidia.bcm.page_swap_out | — |
| Swap free | nvidia.bcm.swap_free | — |
| Out of memory killer | nvidia.bcm.oomkiller | — |
| Total free memory | nvidia.bcm.total_memory_free | — |
| Total free swap | nvidia.bcm.total_swap_free | — |

| Metric name | Metric key | Description |
|---|---|---|
| GPU memory free | nvidia.bcm.gpu_mem_free | — |
| GPU memory utilization | nvidia.bcm.gpu_utilization | — |

| Metric name | Metric key | Description |
|---|---|---|
| Disk free space | nvidia.bcm.free_space | — |

| Metric name | Metric key | Description |
|---|---|---|
| CPU System | nvidia.bcm.cpu_system | — |
| CPU Usage | nvidia.bcm.cpu_usage | — |
| CPU User | nvidia.bcm.cpu_user | — |
| CPU Wait | nvidia.bcm.cpu_wait | — |

| Metric name | Metric key | Description |
|---|---|---|
| — | nvidia.bcm.connectivity | — |