Databricks Workspace extension

Remotely monitor your Databricks workspaces.

Overview

With the Dynatrace Databricks Workspace extension, you can remotely monitor your Databricks workspaces.

This extension works in harmony with the OneAgent-based Databricks extension, but is also ideal for workspaces and clusters where the OneAgent cannot be installed, such as Databricks serverless compute.

Use cases

  • Gather Databricks Job Run metrics, including success rate and job duration.
  • For Databricks Jobs running on all-purpose and job compute clusters, understand the cost of these jobs (currently, only Azure Databricks is supported).
  • Ingest job and task run information as traces for further analysis.
  • Gather health metrics and detailed usage information from your Databricks model serving endpoints.
  • Ingest billing data from Databricks to understand usage across workspaces, SKU & product category, jobs, and more.

Requirements

The extension calls Databricks REST API versions 2.2, 2.1, and 2.0, depending on the endpoint.

The following system tables are queried when ingesting model serving endpoint data:

  • system.serving.endpoint_usage
  • system.serving.served_entities

And the following when ingesting billing and cost data:

  • system.billing.usage
  • system.billing.list_prices
  • system.lakeflow.jobs
  • system.access.workspaces_latest

To query any of the above system table data, the workspace must also have Unity Catalog enabled (system tables are governed by Unity Catalog) and a SQL warehouse available to execute the queries (see Activation and setup).

For the databricks.job.cost metric, currently only Azure Databricks workspaces are supported.

Activation and setup

  1. Install Dynatrace Environment ActiveGate.

  2. Ensure connectivity between this ActiveGate and your Databricks workspace URL.

  3. Create a Databricks access token or a service principal for your Databricks workspace with access to all of the API scopes and/or system tables listed under Requirements. The following commands grant the required system table permissions to a service principal:

    GRANT USE SCHEMA ON SCHEMA system.access TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.access.workspaces_latest TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.billing TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.billing.usage TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.billing.list_prices TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.serving TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.serving.endpoint_usage TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.serving.served_entities TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.lakeflow TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.lakeflow.jobs TO `<service principal application id>`;
  4. Create a Dynatrace access token with the openTelemetryTrace.ingest scope.

  5. If ingesting billing and/or model serving endpoint data, provide the ID of an existing SQL warehouse used to execute system table queries. Ensure that the warehouse is up and running before enabling the extension, as there may be some startup delay depending on the type of warehouse.

  6. Create a new monitoring configuration in Dynatrace, using the URL, Dynatrace token, and either the Databricks access token or the client ID/secret associated with the service principal.
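Before creating the monitoring configuration, it can help to confirm step 2 from the ActiveGate host. The stdlib-only sketch below checks that the workspace URL answers an authenticated Jobs API 2.2 call (the workspace URL and token shown are placeholders, and `check_connectivity` is an illustrative helper, not part of the extension):

```python
import urllib.error
import urllib.request

def build_jobs_list_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated request against the Databricks Jobs API (version 2.2)."""
    url = f"{workspace_url.rstrip('/')}/api/2.2/jobs/list"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

def check_connectivity(workspace_url: str, token: str) -> bool:
    """Return True if the workspace answers the jobs/list call with HTTP 200."""
    req = build_jobs_list_request(workspace_url, token)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

A 401/403 response indicates a token or scope problem rather than a network one; a timeout points at connectivity between the ActiveGate and the workspace.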

Details

This extension remotely queries the Databricks APIs and system tables using the provided Databricks URL and access token or OAuth client.

With that information, it calculates and reports the various metrics selected from the feature sets in the Dynatrace monitoring configuration.

If trace ingestion is configured, the extension transforms the data from the Databricks APIs into OpenTelemetry traces with the job as the parent span and the tasks in that job as child spans.

Only once a job is completed will its metrics and trace be ingested. This means that data about a job is not ingested while it is running.

Ingesting any system table data incurs an associated Databricks cost, because it requires executing queries against a SQL warehouse. The cost varies with the warehouse type (Serverless, Pro, or Classic) and your specific billing model.

Note that for the Databricks Cost Management dashboard, billing data is reported with a delay of up to 3 hours due to the system table refresh rate. If you don't see data in the dashboard as expected, expand the selected timeframe.

Licensing and cost

If all the feature sets are enabled, the number of metric datapoints is:

  7 × number of jobs
  11 × number of model serving endpoints

If traces are configured to be ingested, the number of spans is:

  number of jobs × (1 + tasks per job)
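As a worked example with hypothetical numbers (10 jobs, 3 tasks per job, 2 model serving endpoints), the formulas above give:

```python
def metric_datapoints(jobs: int, endpoints: int) -> int:
    # 7 datapoints per job + 11 per model serving endpoint (all feature sets enabled)
    return 7 * jobs + 11 * endpoints

def span_count(jobs: int, tasks_per_job: int) -> int:
    # one parent span per job run plus one child span per task
    return jobs * (1 + tasks_per_job)

print(metric_datapoints(jobs=10, endpoints=2))  # 7*10 + 11*2 = 92
print(span_count(jobs=10, tasks_per_job=3))     # 10 * (1 + 3) = 40
```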

Log ingestion will vary, with log lines reported per:

  • job run
  • each new table entry in:
    • system.serving.endpoint_usage
    • system.billing.usage

Feature sets

When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension must collect at least one metric after activation.

In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.

All metrics that aren't categorized into any feature set are considered to be the default and are always reported.

A metric inherits the feature set of its subgroup, which in turn inherits the feature set of its group. A feature set defined at the metric level overrides one defined at the subgroup level, which in turn overrides one defined at the group level.
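This precedence can be sketched as a simple resolution function (the function and feature set names are illustrative, not part of the extension's configuration):

```python
from typing import Optional

def effective_feature_set(metric_fs: Optional[str],
                          subgroup_fs: Optional[str],
                          group_fs: Optional[str]) -> str:
    """Most specific level wins: metric, then subgroup, then group.

    Metrics with no feature set at any level fall into the default set,
    which is always reported.
    """
    for fs in (metric_fs, subgroup_fs, group_fs):
        if fs is not None:
            return fs
    return "default"
```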

Databricks Job Metrics (detailed)

| Metric name | Metric key | Description |
|---|---|---|
| Job Setup Duration | databricks.job.duration.setup | |
| Job Execution Duration | databricks.job.duration.execution | |
| Job Cleanup Duration | databricks.job.duration.cleanup | |
| Job Queue Duration | databricks.job.duration.queue | |

Databricks Job Cost Metrics

| Metric name | Metric key | Description |
|---|---|---|
| Job Cost (Approx) | databricks.job.cost | |

Databricks Job Metrics

| Metric name | Metric key | Description |
|---|---|---|
| Job Run Duration | databricks.job.duration.run | |
| Job Success Rate | databricks.job.success_rate | |
Related tags

Analytics · Python · Data Processing/Analytics · Databricks · Infrastructure Observability