NVIDIA BCM extension

  • Latest Dynatrace
  • Extension

Monitor your NVIDIA Base Command Manager (BCM) cluster by enabling this ActiveGate extension.

Get started

Overview

NVIDIA Base Command Manager (BCM) streamlines cluster provisioning, workload management, and infrastructure monitoring. It provides all the tools you need to deploy and manage an AI data center.

This extension provides real-time insight into your entire cluster, including nodes, disks, and GPUs, so that you can correlate that data with the rest of your monitored environment and pinpoint issues and bottlenecks.

Requirements

  • Dynatrace version 1.309+
  • ActiveGate version 1.309+
  • ActiveGate with the Extensions 2.0 module enabled
  • A certificate and its key to access the NVIDIA BCM API, typically found under /root/.cm on the head node. Copy both files to the filesystem of the ActiveGate and make sure that the dtuserag system user (Linux) or Local Service (Windows) can read them.
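A quick way to confirm the last requirement is to run a small readability check on the ActiveGate host as the account that runs the extension (on Linux, for example, through sudo -u dtuserag). The following is only a minimal sketch: the function name and the file paths are hypothetical placeholders, not locations defined by the extension.

```python
import os


def check_readable(paths):
    """Map each path to True if it is an existing file the current user can read."""
    return {p: os.path.isfile(p) and os.access(p, os.R_OK) for p in paths}


if __name__ == "__main__":
    # Hypothetical locations: replace with where you copied the files.
    candidates = [
        "/var/lib/dynatrace/bcm/cert.pem",
        "/var/lib/dynatrace/bcm/cert.key",
    ]
    for path, ok in check_readable(candidates).items():
        print(f"{path}: {'readable' if ok else 'NOT readable'}")
```

The result reflects the permissions of whichever user runs the script, so a check run as root does not say anything about dtuserag.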

Activation and setup

  1. Under Extensions in the left menu, select NVIDIA BCM.
  2. Select an ActiveGate group where the extension will run.
  3. Configure it as follows:
    • URL: Address to connect to the API of the head node, usually on port 8081.
    • Certificate path: Path on the ActiveGate filesystem where you copied the certificate. The file must be readable by the dtuserag system user (Linux) or Local Service (Windows).
    • Key path: Path on the ActiveGate filesystem where you copied the key for the above certificate. The file must be readable by the dtuserag system user (Linux) or Local Service (Windows).
    • HTTPS Proxy: Address for the proxy, if one is required.
    • Proxy username: Username to authenticate against the proxy.
    • Proxy password: Password for the above user.
    • Debug: Produces more verbose logs for troubleshooting.
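Before saving the configuration, it can help to confirm that the ActiveGate host can reach the address you plan to enter as the URL. The sketch below only tests TCP reachability; it does not authenticate or call the BCM API. The helper names and the example host are hypothetical, and falling back to port 8081 when the URL has no port is an assumption based on the usual port mentioned above.

```python
import socket
from urllib.parse import urlsplit


def endpoint_from_url(url, default_port=8081):
    """Extract (host, port) from a URL, assuming 8081 when no port is given."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"no host in URL: {url!r}")
    return parts.hostname, parts.port or default_port


def can_connect(url, timeout=5.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = endpoint_from_url(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (hypothetical host name):
#   can_connect("https://bcm-head.example.com:8081")
```

If a proxy is required between the ActiveGate and the head node, a direct connection test like this one is expected to fail even when the extension itself would work through the proxy.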

Details

Licensing and costs

There is no charge to use the extension. You are only charged for the data that the extension ingests.

The NVIDIA BCM extension ingests custom metrics, which consume Davis Data Units (DDUs) under the Dynatrace classic license or Metrics powered by Grail under the Dynatrace Platform Subscription (DPS), depending on your license model.

The approximate number of metric data points per minute is:

(4 * <number of clusters>)
+ (10 * <number of nodes (both head and worker nodes)>)
+ (1 * <number of disks>)
+ (2 * <number of GPUs>)
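As a worked example, the formula above can be written as a small function. The function name and the cluster sizes are illustrative assumptions; only the coefficients come from the formula.

```python
def data_points_per_minute(clusters, nodes, disks, gpus):
    """Approximate metric data points per minute, using the coefficients above.

    nodes counts both head and worker nodes.
    """
    return 4 * clusters + 10 * nodes + 1 * disks + 2 * gpus


if __name__ == "__main__":
    # Hypothetical environment: 1 cluster, 10 nodes, 20 disks, 32 GPUs.
    print(data_points_per_minute(clusters=1, nodes=10, disks=20, gpus=32))  # 188
```

For this hypothetical environment the estimate is 4 + 100 + 20 + 64 = 188 data points per minute.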

Dynatrace Platform Subscription

In the Dynatrace Platform Subscription, metric ingestion consumes Metrics powered by Grail according to the number of ingested metric data points.

To calculate the approximate yearly consumption, apply the following calculation: <metric data points per minute> * 60 minutes * 24 hours * 365 days.

Dynatrace classic license

In the classic licensing model, metric ingestion consumes Davis Data Units (DDUs) at the rate of 0.001 DDUs per metric data point. To estimate annual DDU usage, multiply the annual number of metric data points (<metric data points per minute> * 60 minutes * 24 hours * 365 days) by 0.001.

The DDU cost above does not include any possible log events or custom events that are triggered by the extension. For more information, see DDU events.
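The yearly estimates for both license models can be sketched as follows. The function names and the sample rate of 100 data points per minute are illustrative assumptions; the multipliers are the ones stated above.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes
DDU_PER_DATA_POINT = 0.001        # classic license rate stated above


def annual_data_points(per_minute):
    """Approximate metric data points ingested per year."""
    return per_minute * MINUTES_PER_YEAR


def annual_ddus(per_minute):
    """Approximate yearly DDU consumption under the classic license."""
    return annual_data_points(per_minute) * DDU_PER_DATA_POINT


if __name__ == "__main__":
    rate = 100  # hypothetical data points per minute
    print(f"{annual_data_points(rate):,} data points per year")
    print(f"{annual_ddus(rate):,.1f} DDUs per year")
```

At 100 data points per minute, this gives 52,560,000 data points per year, or roughly 52,560 DDUs. As noted above, the estimate excludes any log events or custom events.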

Feature sets

When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after the activation.

In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.

All metrics that aren't categorized into any feature set are considered to be the default and are always reported.

A metric inherits the feature set of its subgroup, which in turn inherits the feature set of its group. A feature set defined at the metric level overrides the one defined at the subgroup level, which in turn overrides the one defined at the group level.
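The inheritance and override rules can be illustrated with a short resolution function. This is a conceptual sketch only, not how the extension framework is implemented; the function and parameter names are hypothetical.

```python
def effective_feature_set(metric_fs=None, subgroup_fs=None, group_fs=None):
    """Resolve a metric's feature set: metric level wins, then subgroup, then group.

    A metric with no feature set at any level falls into "default".
    """
    for feature_set in (metric_fs, subgroup_fs, group_fs):
        if feature_set is not None:
            return feature_set
    return "default"


if __name__ == "__main__":
    print(effective_feature_set(group_fs="CPU"))                   # CPU
    print(effective_feature_set(metric_fs="GPU", group_fs="CPU"))  # GPU
    print(effective_feature_set())                                 # default
```

The most specific level that defines a feature set wins, and a metric with none defined anywhere is always reported as part of the default set.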

Memory
Metric name | Metric key | Description
Hardware corrupted memory | nvidia.bcm.hardware_corrupted_memory | -
Memory free | nvidia.bcm.memory_free | -
Page swap in | nvidia.bcm.page_swap_in | -
Page swap out | nvidia.bcm.page_swap_out | -
Swap free | nvidia.bcm.swap_free | -
Out of memory killer | nvidia.bcm.oomkiller | -
Total free memory | nvidia.bcm.total_memory_free | -
Total free swap | nvidia.bcm.total_swap_free | -

GPU
Metric name | Metric key | Description
GPU memory free | nvidia.bcm.gpu_mem_free | -
GPU memory utilization | nvidia.bcm.gpu_utilization | -

Disk
Metric name | Metric key | Description
Disk free space | nvidia.bcm.free_space | -

CPU
Metric name | Metric key | Description
CPU System | nvidia.bcm.cpu_system | -
CPU Usage | nvidia.bcm.cpu_usage | -
CPU User | nvidia.bcm.cpu_user | -
CPU Wait | nvidia.bcm.cpu_wait | -

default
Metric name | Metric key | Description
- | nvidia.bcm.connectivity | -

Related tags
Compute, Python, GPU, NVIDIA, Infrastructure Observability