Remotely monitor your Databricks workspaces.
With the Dynatrace Databricks Workspace extension, you can remotely monitor your Databricks workspaces.
This extension works in harmony with the OneAgent-based Databricks extension, but is also ideal for workspaces and clusters where the OneAgent cannot be installed, such as Databricks serverless compute.
Databricks API version 2.2 is used for the APIs below:
API version 2.1 is used for the following:
API version 2.0 is used for the following:
The following system tables are queried when ingesting model serving endpoint data:
Billing & cost data:
Cluster resource utilization:
Audit logs:
To query any of the above system table data, the workspace must also have:
To capture metrics from the Spark API, the token or Service Principal must have Can attach to permissions on the clusters you want to monitor (a sketch for granting this follows these requirements).
For the databricks.job.cost metric, currently only Azure Databricks workspaces are supported.
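As an example, the following sketch grants Can attach to on a single cluster to a service principal through the Databricks Permissions API. It is a minimal illustration using Python and the `requests` library; the workspace URL, admin token, cluster ID, and application ID are placeholders you would replace with your own values:

```python
import requests

# Hypothetical placeholders -- replace with your workspace URL, an admin
# personal access token, the cluster ID, and the service principal's
# application ID.
WORKSPACE_URL = "https://<workspace>.cloud.databricks.com"
ADMIN_TOKEN = "<admin-personal-access-token>"
CLUSTER_ID = "<cluster-id>"
SP_APPLICATION_ID = "<service-principal-application-id>"

# PATCH adds "Can attach to" for the service principal without replacing
# the cluster's existing access control list (PUT would replace it).
resp = requests.patch(
    f"{WORKSPACE_URL}/api/2.0/permissions/clusters/{CLUSTER_ID}",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={
        "access_control_list": [
            {
                "service_principal_name": SP_APPLICATION_ID,
                "permission_level": "CAN_ATTACH_TO",
            }
        ]
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

Repeat per cluster you want to monitor; PATCH is used deliberately so existing permissions on the cluster are preserved.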
Install a Dynatrace Environment ActiveGate.
Ensure connectivity between this ActiveGate and your Databricks workspace URL.
Create a Databricks access token or a Service principal for your Databricks workspace with access to all of the API scopes and/or system tables listed under the Requirements. The following commands can be used to grant all of the required system table permissions to the service principal:
```sql
GRANT USE SCHEMA ON SCHEMA system.access TO `<service principal application id>`;
GRANT SELECT ON TABLE system.access.workspaces_latest TO `<service principal application id>`;
GRANT SELECT ON TABLE system.access.audit TO `<service principal application id>`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `<service principal application id>`;
GRANT SELECT ON TABLE system.billing.usage TO `<service principal application id>`;
GRANT SELECT ON TABLE system.billing.list_prices TO `<service principal application id>`;
GRANT USE SCHEMA ON SCHEMA system.serving TO `<service principal application id>`;
GRANT SELECT ON TABLE system.serving.endpoint_usage TO `<service principal application id>`;
GRANT SELECT ON TABLE system.serving.served_entities TO `<service principal application id>`;
GRANT USE SCHEMA ON SCHEMA system.lakeflow TO `<service principal application id>`;
GRANT SELECT ON TABLE system.lakeflow.jobs TO `<service principal application id>`;
GRANT USE SCHEMA ON SCHEMA system.compute TO `<service principal application id>`;
GRANT SELECT ON TABLE system.compute.clusters TO `<service principal application id>`;
GRANT SELECT ON TABLE system.compute.node_timeline TO `<service principal application id>`;
```
If using a Service principal, note that you will also need to grant it Can view access to all model serving endpoints and jobs that you want to monitor. This can be done via the API endpoints in the links above, or manually in the UI.
Create a Dynatrace access token with the openTelemetryTrace.ingest scope.
If ingesting billing and/or model serving endpoint data, provide the ID of an existing SQL warehouse used to execute system table queries. Ensure that the warehouse is up and running before enabling the extension, as there may be some startup delay depending on the type of warehouse (a quick check is sketched after these steps).
Create a new monitoring configuration in Dynatrace, using the URL, Dynatrace token, and either the Databricks access token or the client ID/secret associated with the service principal.
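As referenced in the SQL warehouse step above, a quick way to confirm the warehouse is running before enabling the extension is to poll the Databricks SQL Warehouses API. This is a minimal sketch with placeholder values, not part of the extension itself:

```python
import requests

# Hypothetical placeholders for your workspace URL, token, and warehouse ID.
WORKSPACE_URL = "https://<workspace>.cloud.databricks.com"
TOKEN = "<databricks-access-token>"
WAREHOUSE_ID = "<sql-warehouse-id>"

headers = {"Authorization": f"Bearer {TOKEN}"}

# Check the warehouse state; the extension expects it to be RUNNING.
resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/sql/warehouses/{WAREHOUSE_ID}",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
state = resp.json().get("state")
print(f"Warehouse state: {state}")

# Optionally start it if stopped (startup time varies by warehouse type).
if state == "STOPPED":
    requests.post(
        f"{WORKSPACE_URL}/api/2.0/sql/warehouses/{WAREHOUSE_ID}/start",
        headers=headers,
        timeout=30,
    ).raise_for_status()
```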
This extension remotely queries the Databricks APIs and system tables using the provided Databricks URL and access token or OAuth client.
With that information, it calculates and reports the various metrics selected from the feature sets in the Dynatrace monitoring configuration.
If trace ingestion is configured, the extension transforms the data from the Databricks APIs into OpenTelemetry traces with the job as the parent span and the tasks in that job as child spans.
Only once a job is completed will its metrics and trace be ingested. This means that data about a job is not ingested while it is running.
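Conceptually, the mapping resembles the following sketch using the OpenTelemetry Python SDK: one parent span per completed job run and one child span per task. The job and task names here are hypothetical, and the real extension derives span timestamps and statuses from Jobs API data rather than live timing:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Illustrative span topology only; the extension performs this
# transformation internally and exports to Dynatrace, not the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("databricks.jobs.sketch")

# One parent span per completed job run, one child span per task in it.
job_run = {"job_name": "nightly_etl", "tasks": ["extract", "transform", "load"]}

with tracer.start_as_current_span(job_run["job_name"]) as job_span:
    job_span.set_attribute("databricks.job.name", job_run["job_name"])
    for task_name in job_run["tasks"]:
        with tracer.start_as_current_span(task_name) as task_span:
            task_span.set_attribute("databricks.task.name", task_name)
```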
When ingesting any system table data, there is an associated billing cost, as doing so executes queries against a SQL warehouse. The cost varies depending on the warehouse type (Serverless, Pro, or Classic) and your specific billing model.
Note that for the Databricks Cost Management dashboard, billing data is reported with a delay of up to 3 hours due to the system table refresh rate. If you don't see data in the dashboard as expected, expand the selected timeframe.
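If you want to verify billing data freshness yourself, one option is to run a small query against system.billing.usage through the Databricks SQL Statement Execution API. A sketch with placeholder values; it assumes your token has the system table grants listed above:

```python
import requests

# Hypothetical placeholders.
WORKSPACE_URL = "https://<workspace>.cloud.databricks.com"
TOKEN = "<databricks-access-token>"
WAREHOUSE_ID = "<sql-warehouse-id>"

# Ask for the most recent usage record to see how fresh billing data is.
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/sql/statements",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "warehouse_id": WAREHOUSE_ID,
        "statement": "SELECT MAX(usage_end_time) FROM system.billing.usage",
        "wait_timeout": "30s",
    },
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
print(body["status"]["state"], body.get("result", {}).get("data_array"))
```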
If all the feature sets are enabled, the number of metric datapoints is:

- 8 * # of jobs
- 11 * # of model serving endpoints
- # of clusters * (16 + (27 * # Spark jobs) + (32 * # Spark apps))

If traces are configured to be ingested, the number of spans is:

- # of jobs * (1 + tasks per job)
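To put these formulas together, here is a small sketch that estimates both figures for hypothetical workspace counts:

```python
# Rough datapoint/span estimate per collection, using the formulas above.
def estimate_load(jobs, endpoints, clusters, spark_jobs, spark_apps, tasks_per_job):
    datapoints = (
        8 * jobs
        + 11 * endpoints
        + clusters * (16 + 27 * spark_jobs + 32 * spark_apps)
    )
    spans = jobs * (1 + tasks_per_job)
    return datapoints, spans

# Example (hypothetical counts): 20 jobs, 3 endpoints, 5 clusters, each with
# 10 Spark jobs and 2 Spark apps, 4 tasks per job -> (1943, 100).
print(estimate_load(jobs=20, endpoints=3, clusters=5,
                    spark_jobs=10, spark_apps=2, tasks_per_job=4))
```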
Log ingestion will vary, with log lines reported per:
When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after activation.
In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.
All metrics that aren't categorized into any feature set are considered to be the default and are always reported.
A metric inherits the feature set of a subgroup, which in turn inherits the feature set of a group. Also, the feature set defined on the metric level overrides the feature set defined on the subgroup level, which in turn overrides the feature set defined on the group level.
| Metric name | Metric key | Description |
|---|---|---|
| Streaming Batch Duration | databricks.cluster.spark.streaming.statistics.batch_duration | Time interval configured for each streaming batch |
| Streaming Receivers | databricks.cluster.spark.streaming.statistics.num_receivers | Total number of receivers configured for the streaming job |
| Streaming Active Receivers | databricks.cluster.spark.streaming.statistics.num_active_receivers | Number of receivers actively ingesting data |
| Streaming Inactive Receivers | databricks.cluster.spark.streaming.statistics.num_inactive_receivers | Number of receivers that are currently inactive |
| Streaming Completed Batches | databricks.cluster.spark.streaming.statistics.num_total_completed_batches.count | Total number of batches that have been fully processed |
| Streaming Retained Completed Batches | databricks.cluster.spark.streaming.statistics.num_retained_completed_batches.count | Number of completed batches retained in memory for monitoring or debugging |
| Streaming Active Batches | databricks.cluster.spark.streaming.statistics.num_active_batches | Number of streaming batches currently being processed |
| Streaming Processed Records | databricks.cluster.spark.streaming.statistics.num_processed_records.count | Total number of records processed across all batches |
| Streaming Received Records | databricks.cluster.spark.streaming.statistics.num_received_records.count | Total number of records received from all sources |
| Streaming Avg Input Rate | databricks.cluster.spark.streaming.statistics.avg_input_rate | Average number of records received per second across batches |
| Streaming Avg Scheduling Delay | databricks.cluster.spark.streaming.statistics.avg_scheduling_delay | Average delay between batch creation and start of processing |
| Streaming Avg Processing Time | databricks.cluster.spark.streaming.statistics.avg_processing_time | Average time taken to process each batch |
| Streaming Avg Total Delay | databricks.cluster.spark.streaming.statistics.avg_total_delay | Average total delay from data ingestion to processing completion |
| Metric name | Metric key | Description |
|---|---|---|
| RDD Count | databricks.cluster.spark.rdd_count.gauge | Total number of Resilient Distributed Datasets currently tracked by the Spark application |
| RDD Partitions | databricks.cluster.spark.rdd.num_partitions | Total number of partitions across all Resilient Distributed Datasets |
| RDD Cached Partitions | databricks.cluster.spark.rdd.num_cached_partitions | Number of Resilient Distributed Dataset partitions currently cached in memory or disk |
| RDD Memory Used | databricks.cluster.spark.rdd.memory_used | Amount of memory used to store Resilient Distributed Dataset data |
| RDD Disk Used | databricks.cluster.spark.rdd.disk_used | Amount of disk space used to store Resilient Distributed Dataset data |
| Metric name | Metric key | Description |
|---|---|---|
| Job Status | databricks.cluster.spark.job.status | Current status of the job (e.g., running, succeeded, failed) |
| Job Duration | databricks.cluster.spark.job.duration | Total time taken by the job from start to finish |
| Job Total Tasks | databricks.cluster.spark.job.total_tasks | Total number of tasks planned for the job |
| Job Active Tasks | databricks.cluster.spark.job.active_tasks | Number of tasks currently executing within the job |
| Job Skipped Tasks | databricks.cluster.spark.job.skipped_tasks | Number of tasks skipped due to earlier failures or optimizations |
| Job Failed Tasks | databricks.cluster.spark.job.failed_tasks | Number of tasks that failed during job execution |
| Job Completed Tasks | databricks.cluster.spark.job.completed_tasks | Total number of tasks that have successfully completed |
| Job Active Stages | databricks.cluster.spark.job.active_stages | Number of stages currently running in a Spark job |
| Job Completed Stages | databricks.cluster.spark.job.completed_stages | Total number of stages that have successfully completed |
| Job Skipped Stages | databricks.cluster.spark.job.skipped_stages | Number of stages skipped due to earlier failures or optimizations |
| Job Failed Stages | databricks.cluster.spark.job.failed_stages | Number of stages that failed during job execution |
| Job Count | databricks.cluster.spark.job_count.gauge | Total number of Spark jobs submitted |
| Metric name | Metric key | Description |
|---|---|---|
| Job Run Duration | databricks.job.duration.run | — |
| Job Success Rate | databricks.job.success_rate | — |
| Job Runs Count | databricks.job.runs | — |
| Metric name | Metric key | Description |
|---|---|---|
| Cluster CPU System Percentage | databricks.compute.cpu.system | Percentage of time the CPU spent in system mode. |
| Cluster CPU User Percentage | databricks.compute.cpu.user | Percentage of time the CPU spent in userland. |
| Cluster CPU Wait Percentage | databricks.compute.cpu.wait | Percentage of time the CPU spent waiting for I/O. |
| Cluster CPU Total Percentage | databricks.compute.cpu.total | Percentage of time the CPU spent in total (including system and user time). |
| Cluster Memory Usage Percentage | databricks.compute.memory.used | Percentage of the compute's memory that was used during the time period (including memory used by background processes running on the compute). |
| Cluster Memory Swap Percentage | databricks.compute.memory.swap | Percentage of memory usage attributed to memory swap. |
| Cluster Network Sent Bytes | databricks.compute.network.sent | The number of bytes sent out in network traffic. |
| Cluster Network Received Bytes | databricks.compute.network.received | The number of received bytes from network traffic. |
| Metric name | Metric key | Description |
|---|---|---|
| Model Serving Endpoint Memory Usage Percentage | databricks.model_endpoint.mem_usage_percentage | — |
| Model Serving Endpoint CPU Usage Percentage | databricks.model_endpoint.cpu_usage_percentage | — |
| Model Serving Endpoint Request Count Total | databricks.model_endpoint.request_count_total | — |
| Model Serving Endpoint Request 5xx Count Total | databricks.model_endpoint.request_5xx_count_total | — |
| Model Serving Endpoint Provisioned Concurrent Requests Total | databricks.model_endpoint.provisioned_concurrent_requests_total | — |
| Model Serving Endpoint Request 4xx Count Total | databricks.model_endpoint.request_4xx_count_total | — |
| Model Serving Endpoint GPU Usage Percentage | databricks.model_endpoint.gpu_usage_percentage | — |
| Model Serving Endpoint GPU Memory Usage Percentage | databricks.model_endpoint.gpu_memory_usage_percentage | — |
| Model Serving Endpoint Average Request Latency | databricks.model_endpoint.request_latency_ms_avg | — |
| Model Serving Endpoint P99 Request Latency | databricks.model_endpoint.request_latency_ms_p99 | — |
| Model Serving Endpoint P95 Request Latency | databricks.model_endpoint.request_latency_ms_p95 | — |
| Metric name | Metric key | Description |
|---|---|---|
| Job Setup Duration | databricks.job.duration.setup | — |
| Job Execution Duration | databricks.job.duration.execution | — |
| Job Cleanup Duration | databricks.job.duration.cleanup | — |
| Job Queue Duration | databricks.job.duration.queue | — |
| Metric name | Metric key | Description |
|---|---|---|
| Executor RDD Blocks | databricks.cluster.spark.executor.rdd_blocks | Number of Resilient Distributed Dataset blocks stored in memory or disk by the executor |
| Executor Memory Used | databricks.cluster.spark.executor.memory_used | The amount of memory currently used by the executor for execution and storage tasks |
| Executor Disk Used | databricks.cluster.spark.executor.disk_used | Disk used by the Spark executor |
| Executor Active Tasks | databricks.cluster.spark.executor.active_tasks | Total number of tasks that are currently executing on the specified executor within the Databricks Cluster |
| Executor Failed Tasks | databricks.cluster.spark.executor.failed_tasks | Number of failed tasks on the Spark executor |
| Executor Completed Tasks | databricks.cluster.spark.executor.completed_tasks | Number of tasks completed on the Spark executor |
| Executor Total Tasks | databricks.cluster.spark.executor.total_tasks | Total number of tasks executed by the executor |
| Executor Duration | databricks.cluster.spark.executor.total_duration.count | Total time the executor has spent running tasks |
| Executor Input Bytes | databricks.cluster.spark.executor.total_input_bytes.count | Total number of bytes read by the executor from its input sources |
| Executor Shuffle Read | databricks.cluster.spark.executor.total_shuffle_read.count | Total data read by the executor during shuffle operations (from other executors) |
| Executor Shuffle Write | databricks.cluster.spark.executor.total_shuffle_write.count | Total data written by the executor during shuffle operations (to other executors) |
| Executor Max Memory | databricks.cluster.spark.executor.max_memory | The maximum amount of memory allocated to the executor by Spark |
| Executor Alive Count | databricks.cluster.spark.executor.alive_count.gauge | Number of executors currently alive on the Databricks cluster |
| Executor Dead Count | databricks.cluster.spark.executor.dead_count.gauge | Number of dead executors on the Spark application |
| Metric name | Metric key | Description |
|---|---|---|
| Job Cost (Approx) | databricks.job.cost | — |
| Metric name | Metric key | Description |
|---|---|---|
| Stage Active Tasks | databricks.cluster.spark.job.stage.num_active_tasks | Number of tasks currently running in the stage |
| Stage Completed Tasks | databricks.cluster.spark.job.stage.num_complete_tasks | Number of tasks that have successfully completed in the stage |
| Stage Failed Tasks | databricks.cluster.spark.job.stage.num_failed_tasks | Number of tasks that failed during execution in the stage |
| Stage Killed Tasks | databricks.cluster.spark.job.stage.num_killed_tasks | Number of tasks that were killed (e.g., due to job cancellation or speculative execution) |
| Stage Executor Run Time | databricks.cluster.spark.job.stage.executor_run_time | Total time executors spent running tasks in the stage |
| Stage Input Bytes | databricks.cluster.spark.job.stage.input_bytes | Total number of bytes read from input sources in the stage |
| Stage Input Records | databricks.cluster.spark.job.stage.input_records | Total number of records read from input sources in the stage |
| Stage Output Bytes | databricks.cluster.spark.job.stage.output_bytes | Total number of bytes written to output destinations in the stage |
| Stage Output Records | databricks.cluster.spark.job.stage.output_records | Total number of records written to output destinations in the stage |
| Stage Shuffle Read Bytes | databricks.cluster.spark.job.stage.shuffle_read_bytes | Total bytes read from other executors during shuffle operations |
| Stage Shuffle Read Records | databricks.cluster.spark.job.stage.shuffle_read_records | Total records read from other executors during shuffle operations |
| Stage Shuffle Write Bytes | databricks.cluster.spark.job.stage.shuffle_write_bytes | Total bytes written to other executors during shuffle operations |
| Stage Shuffle Write Records | databricks.cluster.spark.job.stage.shuffle_write_records | Total records written to other executors during shuffle operations |
| Stage Memory Bytes Spilled | databricks.cluster.spark.job.stage.memory_bytes_spilled | In-memory size of data spilled during shuffle or aggregation operations |
| Stage Disk Bytes Spilled | databricks.cluster.spark.job.stage.disk_bytes_spilled | Amount of data spilled to disk due to insufficient memory during task execution |