Remotely monitor your Databricks workspaces.
With the Dynatrace Databricks Workspace extension, you can remotely monitor your Databricks workspaces.
This extension works in harmony with the OneAgent-based Databricks extension, but is also ideal for workspaces and clusters where the OneAgent cannot be installed, such as Databricks serverless compute.
Databricks API version 2.2 is used for the APIs below:
API version 2.1 is used for the following:
API version 2.0 is used for the following:
The following system tables are queried when ingesting model serving endpoint data:
And billing & cost data:
To query any of the above system table data, the workspace must also have:
The databricks.job.cost metric is currently supported only for Azure Databricks workspaces.
Install Dynatrace Environment ActiveGate.
Ensure connectivity between this ActiveGate and your Databricks workspace URL.
Create a Databricks access token or a service principal for your Databricks workspace with access to all of the API scopes and/or system tables listed under Requirements. The following commands grant all of the required system table permissions to the service principal:
```sql
GRANT USE SCHEMA ON SCHEMA system.access TO `<service principal application id>`;
GRANT SELECT ON TABLE system.access.workspaces_latest TO `<service principal application id>`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `<service principal application id>`;
GRANT SELECT ON TABLE system.billing.usage TO `<service principal application id>`;
GRANT SELECT ON TABLE system.billing.list_prices TO `<service principal application id>`;
GRANT USE SCHEMA ON SCHEMA system.serving TO `<service principal application id>`;
GRANT SELECT ON TABLE system.serving.endpoint_usage TO `<service principal application id>`;
GRANT SELECT ON TABLE system.serving.served_entities TO `<service principal application id>`;
GRANT USE SCHEMA ON SCHEMA system.lakeflow TO `<service principal application id>`;
GRANT SELECT ON TABLE system.lakeflow.jobs TO `<service principal application id>`;
```
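The statements above all follow the same pattern, so if you prefer to script the setup, a small helper can generate them for a given service principal application ID. This is a hypothetical sketch (the helper name and the schema-to-table mapping are ours; the mapping mirrors the grants listed above):

```python
# Sketch: generate the GRANT statements above for a given service
# principal application ID. The schema -> tables mapping mirrors the
# grants listed in this section; run the output in a SQL warehouse.
REQUIRED_TABLES = {
    "system.access": ["workspaces_latest"],
    "system.billing": ["usage", "list_prices"],
    "system.serving": ["endpoint_usage", "served_entities"],
    "system.lakeflow": ["jobs"],
}

def grant_statements(app_id: str) -> list[str]:
    stmts = []
    for schema, tables in REQUIRED_TABLES.items():
        stmts.append(f"GRANT USE SCHEMA ON SCHEMA {schema} TO `{app_id}`;")
        for table in tables:
            stmts.append(f"GRANT SELECT ON TABLE {schema}.{table} TO `{app_id}`;")
    return stmts

for stmt in grant_statements("<service principal application id>"):
    print(stmt)
```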
Create a Dynatrace access token with the openTelemetryTrace.ingest scope.
If ingesting billing and/or model serving endpoint data, provide the ID of an existing SQL warehouse used to execute system table queries. Ensure that the warehouse is up and running before enabling the extension, as there may be some startup delay depending on the type of warehouse.
Create a new monitoring configuration in Dynatrace, using the URL, Dynatrace token, and either the Databricks access token or the client ID/secret associated with the service principal.
This extension remotely queries the Databricks APIs and system tables using the provided Databricks URL and access token or OAuth client.
With that information, it calculates and reports the various metrics selected from the feature sets in the Dynatrace monitoring configuration.
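For the system table feature sets, the queries run through the configured SQL warehouse. As a rough sketch of what such a request could look like against the Databricks SQL Statement Execution API (the endpoint and field names follow the public API; `WAREHOUSE_ID`, the helper name, and the sample query are illustrative assumptions, not the extension's actual internals):

```python
import json

# Sketch: build a request body for the Databricks SQL Statement
# Execution API (POST /api/2.0/sql/statements), the kind of call used
# to query system tables via a SQL warehouse.
# WAREHOUSE_ID is a placeholder for your warehouse's ID.
def build_statement_request(warehouse_id: str, sql: str, wait: str = "30s") -> dict:
    return {
        "warehouse_id": warehouse_id,
        "statement": sql,
        "wait_timeout": wait,  # wait synchronously up to this long
    }

body = build_statement_request(
    "WAREHOUSE_ID",
    "SELECT * FROM system.billing.usage LIMIT 10",
)
print(json.dumps(body, indent=2))
# The body would then be POSTed to
#   https://<workspace-url>/api/2.0/sql/statements
# with an Authorization: Bearer <token> header.
```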
If trace ingestion is configured, the extension transforms the data from the Databricks APIs into OpenTelemetry traces with the job as the parent span and the tasks in that job as child spans.
A job's metrics and trace are ingested only once the job completes; no data about a job is reported while it is still running.
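The job-to-trace mapping can be pictured as a transformation of a completed job run into a span tree. The following is an illustrative model only (field names like `run_name`, `tasks`, and `task_key` follow the Jobs API run object; the span dicts are a simplified stand-in for real OpenTelemetry spans):

```python
import uuid

# Illustrative model of the job -> trace transformation: one parent
# span for the job run, one child span per task in that run.
def run_to_spans(run: dict) -> list[dict]:
    trace_id = uuid.uuid4().hex
    parent = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": None,           # root span: the job itself
        "name": run["run_name"],
    }
    spans = [parent]
    for task in run.get("tasks", []):
        spans.append({
            "trace_id": trace_id,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": parent["span_id"],  # child of the job span
            "name": task["task_key"],
        })
    return spans

spans = run_to_spans({
    "run_name": "nightly-etl",
    "tasks": [{"task_key": "ingest"}, {"task_key": "transform"}],
})
# 1 job span + 2 task spans for this run
```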
Ingesting any system table data incurs a billing cost, as it involves executing queries against a SQL warehouse. The cost varies depending on the warehouse type (Serverless, Pro, or Classic) and your specific billing model.
Note that for the Databricks Cost Management dashboard, billing data is reported with a delay of up to 3 hours due to the system table refresh rate. If you don't see data in the dashboard as expected, expand the selected timeframe.
If all the feature sets are enabled, the number of metric datapoints is:
7 * # of Jobs
11 * # of model serving endpoints
If traces are configured to be ingested, the number of spans is:
# of Jobs * (1 + Tasks per Job)
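For a rough sizing estimate, the formulas above can be applied directly. A worked example for a hypothetical workspace with 50 jobs of 4 tasks each and 3 model serving endpoints:

```python
# Worked example of the sizing formulas above, for a hypothetical
# workspace: 50 jobs with 4 tasks each, 3 model serving endpoints.
jobs, tasks_per_job, endpoints = 50, 4, 3

metric_datapoints = 7 * jobs + 11 * endpoints  # all feature sets enabled
span_count = jobs * (1 + tasks_per_job)        # if trace ingestion is on

print(metric_datapoints)  # 383
print(span_count)         # 250
```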
Log ingestion will vary, with log lines reported per:
When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after activation.
In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.
All metrics that aren't categorized into any feature set are considered to be the default and are always reported.
A metric inherits the feature set of a subgroup, which in turn inherits the feature set of a group. Also, the feature set defined on the metric level overrides the feature set defined on the subgroup level, which in turn overrides the feature set defined on the group level.
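In other words, the most specific level that defines a feature set wins, and metrics with no feature set at any level fall into the always-reported default. A minimal sketch of that precedence rule (the function name is ours, for illustration only):

```python
# Sketch of the feature-set precedence described above: metric level
# overrides subgroup level, which overrides group level; a metric with
# no feature set at any level falls into the always-reported default.
def effective_feature_set(group=None, subgroup=None, metric=None) -> str:
    return metric or subgroup or group or "default"

print(effective_feature_set(group="jobs_group"))                    # jobs_group
print(effective_feature_set(group="jobs_group", metric="billing"))  # billing
print(effective_feature_set())                                      # default
```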
| Metric name | Metric key | Description |
|---|---|---|
| Job Setup Duration | databricks.job.duration.setup | — |
| Job Execution Duration | databricks.job.duration.execution | — |
| Job Cleanup Duration | databricks.job.duration.cleanup | — |
| Job Queue Duration | databricks.job.duration.queue | — |
| Metric name | Metric key | Description |
|---|---|---|
| Job Cost (Approx) | databricks.job.cost | — |
| Metric name | Metric key | Description |
|---|---|---|
| Job Run Duration | databricks.job.duration.run | — |
| Job Success Rate | databricks.job.success_rate | — |