Databricks Workspace extension

Remotely monitor your Databricks workspaces.

Overview

With the Dynatrace Databricks Workspace extension, you can remotely monitor your Databricks workspaces.

This extension works in harmony with the OneAgent-based Databricks extension, but is also ideal for workspaces and clusters where the OneAgent cannot be installed, such as Databricks serverless compute.

Use cases

  • Gather Databricks Job Run metrics, including success rate and job duration.
  • Understand the cost of Databricks Jobs running on all-purpose and job compute clusters (currently, only Azure Databricks is supported).
  • Ingest job and task run information as traces for further analysis.
  • Gather health metrics and detailed usage information from your Databricks model serving endpoints.
  • Ingest billing data from Databricks to understand usage across workspaces, SKU & product category, jobs, and more.
  • Get rightsizing recommendations based on resource utilization metrics collected from your Databricks clusters.
  • Remotely collect Spark metrics from clusters for detailed information on jobs, tasks, stages, executors, and RDDs.
  • Ingest audit logs from your workspaces.

Requirements

The extension calls the Databricks REST APIs at versions 2.0, 2.1, and 2.2, depending on the endpoint.
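
For illustration, the job-run data behind the job metrics can be fetched from the Jobs API. A minimal sketch in Python, assuming a personal access token (the URL and token are placeholders):

    import requests

    WORKSPACE_URL = "https://<workspace-url>"
    TOKEN = "<databricks-access-token>"

    # List recently completed job runs, including their tasks.
    resp = requests.get(
        f"{WORKSPACE_URL}/api/2.2/jobs/runs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"completed_only": "true", "expand_tasks": "true"},
        timeout=30,
    )
    resp.raise_for_status()
    for run in resp.json().get("runs", []):
        print(run["run_id"], run["status"]["state"])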

The following system tables are queried, depending on the data being ingested:

  • Model serving endpoint data: system.serving.endpoint_usage, system.serving.served_entities
  • Billing & cost data: system.billing.usage, system.billing.list_prices, system.lakeflow.jobs, system.access.workspaces_latest
  • Cluster resource utilization: system.compute.clusters, system.compute.node_timeline
  • Audit logs: system.access.audit

To query any of the above system table data, the workspace must also have:

  • Unity Catalog enabled.
  • A SQL warehouse available to execute system table queries against.
    • Note that running these queries can incur additional costs in Databricks. To minimize this, we recommend using an existing active SQL warehouse if you have one.
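
These system-table queries are ordinary SQL statements executed on the configured warehouse. For illustration, a minimal sketch using the Databricks SQL Statement Execution API in Python (URL, token, and warehouse ID are placeholders):

    import requests

    WORKSPACE_URL = "https://<workspace-url>"
    TOKEN = "<databricks-access-token>"
    WAREHOUSE_ID = "<sql-warehouse-id>"

    # Submit a query against a system table and wait up to 30s for the result.
    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/sql/statements/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "warehouse_id": WAREHOUSE_ID,
            "statement": "SELECT count(*) FROM system.billing.usage",
            "wait_timeout": "30s",
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["status"]["state"])  # e.g. SUCCEEDED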

To capture metrics from the Spark API, the token or service principal must have the Can Attach To permission on the clusters you want to monitor.

For the databricks.job.cost metric, currently only Azure Databricks workspaces are supported.

Activation and setup

  1. Install Dynatrace Environment ActiveGate.

  2. Ensure connectivity between this ActiveGate and your Databricks workspace URL.
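
    To verify reachability from the ActiveGate host, any HTTP response from the workspace (even 401 Unauthorized) confirms connectivity, while a timeout or connection error points to a firewall or proxy issue. A minimal check in Python, with a placeholder URL:

    import requests

    WORKSPACE_URL = "https://<workspace-url>"

    # Any HTTP status code proves network-level reachability; an exception
    # (timeout, DNS, TLS) indicates a network or proxy problem.
    resp = requests.get(f"{WORKSPACE_URL}/api/2.0/clusters/list", timeout=10)
    print(resp.status_code)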

  3. Create a Databricks access token or a service principal for your Databricks workspace with access to all of the API scopes and system tables listed under Requirements. The following commands grant all of the required system table permissions to a service principal:

    GRANT USE SCHEMA ON SCHEMA system.access TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.access.workspaces_latest TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.access.audit TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.billing TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.billing.usage TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.billing.list_prices TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.serving TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.serving.endpoint_usage TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.serving.served_entities TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.lakeflow TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.lakeflow.jobs TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.compute TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.compute.clusters TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.compute.node_timeline TO `<service principal application id>`;

    If using a service principal, note that you will also need to do one of the following:

  4. Create a Dynatrace access token with the openTelemetryTrace.ingest scope.

  5. If ingesting billing and/or model serving endpoint data, provide the ID of an existing SQL warehouse used to execute system table queries. Ensure that the warehouse is up and running before enabling the extension, as there may be some startup delay depending on the type of warehouse.

  6. Create a new monitoring configuration in Dynatrace, using the URL, Dynatrace token, and either the Databricks access token or the client ID/secret associated with the service principal.
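
A service principal authenticates to Databricks through an OAuth client-credentials exchange rather than a static token. For reference, a minimal sketch of that flow in Python (all values are placeholders):

    import requests

    WORKSPACE_URL = "https://<workspace-url>"
    CLIENT_ID = "<service principal application id>"
    CLIENT_SECRET = "<service principal oauth secret>"

    # Databricks machine-to-machine OAuth: exchange the client credentials for
    # a short-lived bearer token used in the Authorization header of API calls.
    resp = requests.post(
        f"{WORKSPACE_URL}/oidc/v1/token",
        auth=(CLIENT_ID, CLIENT_SECRET),
        data={"grant_type": "client_credentials", "scope": "all-apis"},
        timeout=30,
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]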

Details

This extension remotely queries the Databricks APIs and system tables using the provided Databricks URL and access token or OAuth client.

With that information, it calculates and reports the various metrics selected from the feature sets in the Dynatrace monitoring configuration.

If trace ingestion is configured, the extension transforms the data from the Databricks APIs into OpenTelemetry traces with the job as the parent span and the tasks in that job as child spans.
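
As an illustration of that span structure, a sketch using the OpenTelemetry Python SDK with hypothetical job-run data (not the extension's actual code):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.trace import set_span_in_context

    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer("databricks.jobs")

    # Hypothetical job-run payload; timestamps in epoch milliseconds.
    job_run = {
        "run_name": "nightly-etl",
        "start_time": 1700000000000,
        "end_time": 1700000600000,
        "tasks": [
            {"task_key": "extract", "start_time": 1700000000000, "end_time": 1700000300000},
            {"task_key": "load", "start_time": 1700000300000, "end_time": 1700000600000},
        ],
    }

    MS = 1_000_000  # OpenTelemetry expects nanosecond timestamps

    # One parent span for the job run, one child span per task.
    job_span = tracer.start_span(job_run["run_name"], start_time=job_run["start_time"] * MS)
    parent = set_span_in_context(job_span)
    for task in job_run["tasks"]:
        child = tracer.start_span(task["task_key"], context=parent,
                                  start_time=task["start_time"] * MS)
        child.end(end_time=task["end_time"] * MS)
    job_span.end(end_time=job_run["end_time"] * MS)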

A job's metrics and trace are ingested only once the job completes; no data about a job is ingested while it is running.

Ingesting any system table data incurs a Databricks cost, because it involves executing queries against a SQL warehouse. The cost varies with the warehouse type (Serverless, Pro, or Classic) and your specific billing model.

Note that for the Databricks Cost Management dashboard, billing data is reported with a delay of up to 3 hours due to the system table refresh rate. If you don't see data in the dashboard as expected, expand the selected timeframe.

Licensing and cost

If all the feature sets are enabled, the number of metric data points is:

  • 8 * # of jobs
  • 11 * # of model serving endpoints
  • # of clusters * (16 + (27 * # of Spark jobs) + (32 * # of Spark apps))

If traces are configured to be ingested, the number of spans is:

  • # of jobs * (1 + tasks per job)
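
For example, a hypothetical workspace with 20 jobs (4 tasks each), 5 model serving endpoints, and 3 clusters each running 10 Spark jobs and 2 Spark apps would produce:

    # Hypothetical sizing example for the formulas above.
    jobs, endpoints, clusters = 20, 5, 3
    spark_jobs, spark_apps = 10, 2  # per cluster
    tasks_per_job = 4

    metric_datapoints = (
        8 * jobs
        + 11 * endpoints
        + clusters * (16 + 27 * spark_jobs + 32 * spark_apps)
    )
    spans = jobs * (1 + tasks_per_job)

    print(metric_datapoints)  # 1265
    print(spans)              # 100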

Log ingestion volume varies, with one log line reported per:

  • Job run
  • New entry in each of the following system tables:
    • system.serving.endpoint_usage
    • system.billing.usage
    • system.access.audit

Feature sets

When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after activation.

In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.

All metrics that aren't categorized into any feature set are considered to be the default and are always reported.

A metric inherits the feature set of a subgroup, which in turn inherits the feature set of a group. Also, the feature set defined on the metric level overrides the feature set defined on the subgroup level, which in turn overrides the feature set defined on the group level.

Spark Streaming Metrics

Metric name | Metric key | Description
Streaming Batch Duration | databricks.cluster.spark.streaming.statistics.batch_duration | Time interval configured for each streaming batch
Streaming Receivers | databricks.cluster.spark.streaming.statistics.num_receivers | Total number of receivers configured for the streaming job
Streaming Active Receivers | databricks.cluster.spark.streaming.statistics.num_active_receivers | Number of receivers actively ingesting data
Streaming Inactive Receivers | databricks.cluster.spark.streaming.statistics.num_inactive_receivers | Number of receivers that are currently inactive
Streaming Completed Batches | databricks.cluster.spark.streaming.statistics.num_total_completed_batches.count | Total number of batches that have been fully processed
Streaming Retained Completed Batches | databricks.cluster.spark.streaming.statistics.num_retained_completed_batches.count | Number of completed batches retained in memory for monitoring or debugging
Streaming Active Batches | databricks.cluster.spark.streaming.statistics.num_active_batches | Number of streaming batches currently being processed
Streaming Processed Records | databricks.cluster.spark.streaming.statistics.num_processed_records.count | Total number of records processed across all batches
Streaming Received Records | databricks.cluster.spark.streaming.statistics.num_received_records.count | Total number of records received from all sources
Streaming Avg Input Rate | databricks.cluster.spark.streaming.statistics.avg_input_rate | Average number of records received per second across batches
Streaming Avg Scheduling Delay | databricks.cluster.spark.streaming.statistics.avg_scheduling_delay | Average delay between batch creation and start of processing
Streaming Avg Processing Time | databricks.cluster.spark.streaming.statistics.avg_processing_time | Average time taken to process each batch
Streaming Avg Total Delay | databricks.cluster.spark.streaming.statistics.avg_total_delay | Average total delay from data ingestion to processing completion

Spark RDD Metrics

Metric name | Metric key | Description
RDD Count | databricks.cluster.spark.rdd_count.gauge | Total number of Resilient Distributed Datasets currently tracked by the Spark application
RDD Partitions | databricks.cluster.spark.rdd.num_partitions | Total number of partitions across all Resilient Distributed Datasets
RDD Cached Partitions | databricks.cluster.spark.rdd.num_cached_partitions | Number of Resilient Distributed Dataset partitions currently cached in memory or on disk
RDD Memory Used | databricks.cluster.spark.rdd.memory_used | Amount of memory used to store Resilient Distributed Dataset data
RDD Disk Used | databricks.cluster.spark.rdd.disk_used | Amount of disk space used to store Resilient Distributed Dataset data

Spark Job Metrics

Metric name | Metric key | Description
Job Status | databricks.cluster.spark.job.status | Current status of the job (e.g., running, succeeded, failed)
Job Duration | databricks.cluster.spark.job.duration | Total time taken by the job from start to finish
Job Total Tasks | databricks.cluster.spark.job.total_tasks | Total number of tasks planned for the job
Job Active Tasks | databricks.cluster.spark.job.active_tasks | Number of tasks currently executing within the job
Job Skipped Tasks | databricks.cluster.spark.job.skipped_tasks | Number of tasks skipped due to earlier failures or optimizations
Job Failed Tasks | databricks.cluster.spark.job.failed_tasks | Number of tasks that failed during job execution
Job Completed Tasks | databricks.cluster.spark.job.completed_tasks | Total number of tasks that have successfully completed
Job Active Stages | databricks.cluster.spark.job.active_stages | Number of stages currently running in a Spark job
Job Completed Stages | databricks.cluster.spark.job.completed_stages | Total number of stages that have successfully completed
Job Skipped Stages | databricks.cluster.spark.job.skipped_stages | Number of stages skipped due to earlier failures or optimizations
Job Failed Stages | databricks.cluster.spark.job.failed_stages | Number of stages that failed during job execution
Job Count | databricks.cluster.spark.job_count.gauge | Total number of Spark jobs submitted

Databricks Job Metrics

Metric name | Metric key | Description
Job Run Duration | databricks.job.duration.run
Job Success Rate | databricks.job.success_rate
Job Runs Count | databricks.job.runs

Databricks Resource Utilization Metrics

Metric name | Metric key | Description
Cluster CPU System Percentage | databricks.compute.cpu.system | Percentage of time the CPU spent in system mode
Cluster CPU User Percentage | databricks.compute.cpu.user | Percentage of time the CPU spent in userland
Cluster CPU Wait Percentage | databricks.compute.cpu.wait | Percentage of time the CPU spent waiting for I/O
Cluster CPU Total Percentage | databricks.compute.cpu.total | Percentage of time the CPU spent in total (including system and user time)
Cluster Memory Usage Percentage | databricks.compute.memory.used | Percentage of the compute's memory used during the time period (including memory used by background processes running on the compute)
Cluster Memory Swap Percentage | databricks.compute.memory.swap | Percentage of memory usage attributed to swap
Cluster Network Sent Bytes | databricks.compute.network.sent | Number of bytes sent in network traffic
Cluster Network Received Bytes | databricks.compute.network.received | Number of bytes received in network traffic

Databricks Model Serving Endpoint Metrics

Metric name | Metric key | Description
Model Serving Endpoint Memory Usage Percentage | databricks.model_endpoint.mem_usage_percentage
Model Serving Endpoint CPU Usage Percentage | databricks.model_endpoint.cpu_usage_percentage
Model Serving Endpoint Request Count Total | databricks.model_endpoint.request_count_total
Model Serving Endpoint Request 5xx Count Total | databricks.model_endpoint.request_5xx_count_total
Model Serving Endpoint Provisioned Concurrent Requests Total | databricks.model_endpoint.provisioned_concurrent_requests_total
Model Serving Endpoint Request 4xx Count Total | databricks.model_endpoint.request_4xx_count_total
Model Serving Endpoint GPU Usage Percentage | databricks.model_endpoint.gpu_usage_percentage
Model Serving Endpoint GPU Memory Usage Percentage | databricks.model_endpoint.gpu_memory_usage_percentage
Model Serving Endpoint Average Request Latency | databricks.model_endpoint.request_latency_ms_avg
Model Serving Endpoint P99 Request Latency | databricks.model_endpoint.request_latency_ms_p99
Model Serving Endpoint P95 Request Latency | databricks.model_endpoint.request_latency_ms_p95

Databricks Job Metrics (detailed)

Metric name | Metric key | Description
Job Setup Duration | databricks.job.duration.setup
Job Execution Duration | databricks.job.duration.execution
Job Cleanup Duration | databricks.job.duration.cleanup
Job Queue Duration | databricks.job.duration.queue

Spark Executor Metrics

Metric name | Metric key | Description
Executor RDD Blocks | databricks.cluster.spark.executor.rdd_blocks | Number of Resilient Distributed Dataset blocks stored in memory or disk by the executor
Executor Memory Used | databricks.cluster.spark.executor.memory_used | Amount of memory currently used by the executor for execution and storage tasks
Executor Disk Used | databricks.cluster.spark.executor.disk_used | Disk space used by the Spark executor
Executor Active Tasks | databricks.cluster.spark.executor.active_tasks | Number of tasks currently executing on the executor
Executor Failed Tasks | databricks.cluster.spark.executor.failed_tasks | Number of failed tasks on the Spark executor
Executor Completed Tasks | databricks.cluster.spark.executor.completed_tasks | Number of completed tasks on the executor
Executor Total Tasks | databricks.cluster.spark.executor.total_tasks | Total number of tasks executed by the executor
Executor Duration | databricks.cluster.spark.executor.total_duration.count | Total time the executor has spent running tasks
Executor Input Bytes | databricks.cluster.spark.executor.total_input_bytes.count | Total number of bytes read by the executor from its input sources
Executor Shuffle Read | databricks.cluster.spark.executor.total_shuffle_read.count | Total data read by the executor during shuffle operations (from other executors)
Executor Shuffle Write | databricks.cluster.spark.executor.total_shuffle_write.count | Total data written by the executor during shuffle operations (to other executors)
Executor Max Memory | databricks.cluster.spark.executor.max_memory | Maximum amount of memory allocated to the executor by Spark
Executor Alive Count | databricks.cluster.spark.executor.alive_count.gauge | Number of executors currently alive in the Spark application
Executor Dead Count | databricks.cluster.spark.executor.dead_count.gauge | Number of dead executors in the Spark application

Databricks Job Cost Metrics

Metric name | Metric key | Description
Job Cost (Approx) | databricks.job.cost

Spark Stage Metrics

Metric name | Metric key | Description
Stage Active Tasks | databricks.cluster.spark.job.stage.num_active_tasks | Number of tasks currently running in the stage
Stage Completed Tasks | databricks.cluster.spark.job.stage.num_complete_tasks | Number of tasks that have successfully completed in the stage
Stage Failed Tasks | databricks.cluster.spark.job.stage.num_failed_tasks | Number of tasks that failed during execution in the stage
Stage Killed Tasks | databricks.cluster.spark.job.stage.num_killed_tasks | Number of tasks that were killed (e.g., due to job cancellation or speculative execution)
Stage Executor Run Time | databricks.cluster.spark.job.stage.executor_run_time | Total time executors spent running tasks in the stage
Stage Input Bytes | databricks.cluster.spark.job.stage.input_bytes | Total number of bytes read from input sources in the stage
Stage Input Records | databricks.cluster.spark.job.stage.input_records | Total number of records read from input sources in the stage
Stage Output Bytes | databricks.cluster.spark.job.stage.output_bytes | Total number of bytes written to output destinations in the stage
Stage Output Records | databricks.cluster.spark.job.stage.output_records | Total number of records written to output destinations in the stage
Stage Shuffle Read Bytes | databricks.cluster.spark.job.stage.shuffle_read_bytes | Total bytes read from other executors during shuffle operations
Stage Shuffle Read Records | databricks.cluster.spark.job.stage.shuffle_read_records | Total records read from other executors during shuffle operations
Stage Shuffle Write Bytes | databricks.cluster.spark.job.stage.shuffle_write_bytes | Total bytes written to other executors during shuffle operations
Stage Shuffle Write Records | databricks.cluster.spark.job.stage.shuffle_write_records | Total records written to other executors during shuffle operations
Stage Memory Bytes Spilled | databricks.cluster.spark.job.stage.memory_bytes_spilled | Amount of data spilled to memory due to shuffle or aggregation operations
Stage Disk Bytes Spilled | databricks.cluster.spark.job.stage.disk_bytes_spilled | Amount of data spilled to disk due to insufficient memory during task execution

Related tags

Analytics, Python, Data Processing/Analytics, Databricks, Infrastructure Observability