Monitor your Databricks clusters via multiple APIs.
Use this extension if you have Databricks clusters that you would like to monitor in Dynatrace.
Use the Databricks OneAgent extension to collect metrics from the embedded Ganglia instance, the Apache Spark APIs, and/or the Databricks API on your Databricks cluster.
Databricks Runtime v13+ no longer supports Ganglia. If this applies to your Databricks Runtime installation, use the Spark and Databricks API configuration option.
To activate this extension, you need to:
Define in the configuration which metrics you'd like to collect from your Databricks clusters.
Set up a global init script on your Databricks cluster to download the Dynatrace OneAgent.
Start or restart your Databricks cluster to enable the Dynatrace OneAgent and this extension.
See below for activation details.
Ensure the EEC (Extension Execution Controller) is enabled on each host. This can be done globally in your Dynatrace environment settings.
From inside your Databricks workspace, create a Databricks API token:
User Settings > Create API Token
Copy your Databricks workspace URL (a quick way to verify the token and URL is sketched after these steps).
Copy the Linux OneAgent installation wget command from
Deploy Dynatrace > Start Installation > Linux > Enter or Create PaaS Token
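Before wiring the Databricks URL and token into the init script, you can optionally verify them against the Databricks Clusters API. A minimal sketch, using the placeholder URL and token values from this guide:

```bash
# Optional sanity check: list clusters with the copied workspace URL and API token (placeholders).
# A JSON cluster list in the response means both values are usable by the extension.
curl -sS "https://adb-XXXXXXXXX.XX.azuredatabricks.net/api/2.0/clusters/list" \
  -H "Authorization: Bearer dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX"
```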
Databricks clusters can go up and down quickly, causing multiple host entities within Dynatrace. Databricks reuses IP addresses, so if you'd like to have the same host entities for your clusters, you can add the following flag to the OneAgent installation command in your global init script:
--set-host-id-source="ip-addresses"
Example:
/bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
Set up a global init script on your Databricks cluster.
If your Databricks cluster does not have network access to your Dynatrace Cluster or ActiveGate, the Dynatrace-OneAgent-Linux.sh file can be manually uploaded to your Databricks DBFS, and the script below can be modified to use that location instead of the wget command (a sketch of this variant follows the script).
#!/usr/bin/env bash
wget -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
/bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
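If you use the DBFS approach mentioned above, the wget download can be replaced with a copy from DBFS, which is mounted on cluster nodes under /dbfs. A minimal sketch, assuming the installer was uploaded to a hypothetical /FileStore/dynatrace/ path:

```bash
#!/usr/bin/env bash
# DBFS variant: the installer was uploaded beforehand (the /FileStore/dynatrace/ path is illustrative),
# so the wget step is replaced with a copy from the DBFS mount point.
cp /dbfs/FileStore/dynatrace/Dynatrace-OneAgent-Linux.sh ./Dynatrace-OneAgent-Linux.sh
/bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
```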
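Global init scripts can be added through the Databricks admin UI, or registered programmatically through the Databricks global init scripts REST API. A minimal sketch, assuming the script above is saved locally as dynatrace-init.sh and that DATABRICKS_HOST and DATABRICKS_TOKEN hold your workspace URL and API token (names are illustrative):

```bash
# Register the init script as a global init script via POST /api/2.0/global-init-scripts.
# The script body must be base64-encoded; dynatrace-init.sh is an illustrative file name.
curl -sS -X POST "${DATABRICKS_HOST}/api/2.0/global-init-scripts" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"name\": \"dynatrace-oneagent\", \"enabled\": true, \"script\": \"$(base64 -w0 dynatrace-init.sh)\"}"
```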
Configure the OneAgent extension in the Dynatrace cluster.
Select which feature sets of metrics you'd like to capture.
Start (or restart, if you're using an existing all-purpose compute cluster) your Databricks clusters and ensure the OneAgent is connected.
Verify that metrics show up on the host screen of your Databricks cluster's driver node. All the metrics will be attached to that host entity.
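You can also confirm that data is flowing with the Dynatrace Metrics API v2. A minimal sketch, assuming an API token with the metrics.read scope and one of the metric keys listed further below:

```bash
# Query one of the extension's metric keys via the Metrics API v2 (placeholders shown).
curl -sS "https://<TENANT>.live.dynatrace.com/api/v2/metrics/query?metricSelector=databricks.spark.executor.active_tasks" \
  -H "Authorization: Api-Token <metrics.read API-TOKEN>"
```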
Create a Dynatrace API token with the ReadConfig permission.
Set up a global init script on your Databricks cluster.
#!/usr/bin/env bash
wget -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
/bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
# token with 'ReadConfig' permissions
wget -O custom_python_databricks_ganglia.zip "https://<TENANT>.live.dynatrace.com/api/config/v1/extensions/custom.python.databricks_ganglia/binary" --header="Authorization: Api-Token <ReadConfig API-TOKEN>"
unzip custom_python_databricks_ganglia.zip -d /opt/dynatrace/oneagent/plugin_deployment/
# Add Databricks Workspace URL Environment Variable
cat <<EOF | sudo tee /etc/databricks_env
DB_WS_URL=https://adb-XXXXXXXXX.XX.azuredatabricks.net
DB_WS_TOKEN=dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX
EOF
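Once a cluster with this init script is up, you can confirm from a web terminal or notebook shell that the extension was unpacked and the environment file was written. A minimal sketch, using the paths from the script above:

```bash
# Verify the extension landed next to the OneAgent and the workspace URL/token file exists.
ls /opt/dynatrace/oneagent/plugin_deployment/
cat /etc/databricks_env
```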
Create a Dynatrace API token with the entities.read and entities.write permissions (an API-based way to create this token is sketched after these steps).
Configure the Databricks OneAgent extension in the Dynatrace cluster.
Start (or restart, if you're using an existing all-purpose compute cluster) your Databricks clusters and ensure the OneAgent is connected.
Verify that metrics are showing up on the included dashboard.
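As an alternative to creating the token in the UI, the token from the earlier step can be created with the Dynatrace API tokens endpoint. A minimal sketch, assuming you already have a token with the apiTokens.write scope:

```bash
# Create a Dynatrace API token with the entities.read and entities.write scopes
# via POST /api/v2/apiTokens (the calling token needs the apiTokens.write scope).
curl -sS -X POST "https://<TENANT>.live.dynatrace.com/api/v2/apiTokens" \
  -H "Authorization: Api-Token <apiTokens.write API-TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"name": "databricks-extension", "scopes": ["entities.read", "entities.write"]}'
```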
When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after activation.
In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.
All metrics that aren't categorized into any feature set are considered to be the default and are always reported.
A metric inherits the feature set of a subgroup, which in turn inherits the feature set of a group. Also, the feature set defined on the metric level overrides the feature set defined on the subgroup level, which in turn overrides the feature set defined on the group level.
| Metric name | Metric key | Description |
|---|---|---|
| Databricks Cluster Upsizing Time | databricks.cluster.upsizing_time | Time spent upsizing cluster |
| Metric name | Metric key | Description |
|---|---|---|
| Executor RDD Blocks | databricks.spark.executor.rdd_blocks | Number of Resilient Distributed Dataset blocks stored in memory or disk by the executor |
| Executor Memory Used | databricks.spark.executor.memory_used | The amount of memory currently used by the executor for execution and storage tasks |
| Executor Disk Used | databricks.spark.executor.disk_used | Disk used by the Spark executor |
| Executor Active Tasks | databricks.spark.executor.active_tasks | Total number of tasks that are currently executing on the specified executor within the Databricks Cluster |
| Executor Failed Tasks | databricks.spark.executor.failed_tasks | Number of failed tasks on the Spark executor |
| Executor Completed Tasks | databricks.spark.executor.completed_tasks | Number of tasks completed by the Spark executor |
| Executor Total Tasks | databricks.spark.executor.total_tasks | Total number of tasks executed by the executor |
| Executor Duration | databricks.spark.executor.total_duration.count | Time taken by Spark executor to complete a task |
| Executor Input Bytes | databricks.spark.executor.total_input_bytes.count | Total number of Bytes read by a Spark task from its input source |
| Executor Shuffle Read | databricks.spark.executor.total_shuffle_read.count | Total data read by the executor during shuffle operations (from other executors) |
| Executor Shuffle Write | databricks.spark.executor.total_shuffle_write.count | Total data written by the executor during shuffle operations (to other executors) |
| Executor Max Memory | databricks.spark.executor.max_memory | The maximum amount of memory allocated to the executor by Spark |
| Executor Alive Count | databricks.spark.executor.alive_count.gauge | Number of executors currently alive on the Databricks cluster |
| Executor Dead Count | databricks.spark.executor.dead_count.gauge | Number of dead executors in the Spark application |
| Metric name | Metric key | Description |
|---|---|---|
| CPU User % | databricks.hardware.cpu.usr | Percentage of CPU time spent on user processes |
| CPU Nice % | databricks.hardware.cpu.nice | Percentage of CPU time used by processes that have a positive niceness, meaning a lower priority than other tasks |
| CPU System % | databricks.hardware.cpu.sys | Percentage of CPU time spent on system processes |
| CPU IOWait % | databricks.hardware.cpu.iowait | Percentage of time the CPU spends idle while waiting for I/O operations to complete |
| CPU IRQ % | databricks.hardware.cpu.irq | Percentage of CPU time spent handling hardware interrupt requests (IRQs) |
| CPU Steal % | databricks.hardware.cpu.steal | Percentage of time a virtual CPU waits for a physical CPU while the hypervisor is servicing another virtual processor |
| CPU Idle % | databricks.hardware.cpu.idle | Percentage of CPU time spent idle |
| Memory Used | databricks.hardware.mem.used | Total memory currently in use, including buffers and cache |
| Memory Total | databricks.hardware.mem.total | Total physical memory installed on the system |
| Memory Free | databricks.hardware.mem.free | Portion of memory that is completely unused and available |
| Memory Buff/Cache | databricks.hardware.mem.buff_cache | Memory used by the system for buffers and cache to improve performance |
| Memory Shared | databricks.hardware.mem.shared | Memory shared between processes |
| Memory Available | databricks.hardware.mem.available | Total amount of memory available for use by the system |
| Metric name | Metric key | Description |
|---|---|---|
| Job Status | databricks.spark.job.status | Current status of the job (e.g., running, succeeded, failed) |
| Job Duration | databricks.spark.job.duration | Total time taken by the job from start to finish |
| Job Total Tasks | databricks.spark.job.total_tasks | Total number of tasks planned for the job |
| Job Active Tasks | databricks.spark.job.active_tasks | Number of tasks currently executing within the job |
| Job Skipped Tasks | databricks.spark.job.skipped_tasks | Number of tasks skipped due to earlier failures or optimizations |
| Job Failed Tasks | databricks.spark.job.failed_tasks | Number of tasks that failed during job execution |
| Job Completed Tasks | databricks.spark.job.completed_tasks | Total number of tasks that have successfully completed |
| Job Active Stages | databricks.spark.job.active_stages | Number of stages currently running in a Spark job |
| Job Completed Stages | databricks.spark.job.completed_stages | Total number of stages that have successfully completed |
| Job Skipped Stages | databricks.spark.job.skipped_stages | Number of stages skipped due to earlier failures or optimizations |
| Job Failed Stages | databricks.spark.job.failed_stages | Number of stages that failed during job execution |
| Job Count | databricks.spark.job_count.gauge | Total number of Spark jobs submitted |
| Metric name | Metric key | Description |
|---|---|---|
| Stage Active Tasks | databricks.spark.job.stage.num_active_tasks | Number of tasks currently running in the stage |
| Stage Completed Tasks | databricks.spark.job.stage.num_complete_tasks | Number of tasks that have successfully completed in the stage |
| Stage Failed Tasks | databricks.spark.job.stage.num_failed_tasks | Number of tasks that failed during execution in the stage |
| Stage Killed Tasks | databricks.spark.job.stage.num_killed_tasks | Number of tasks that were killed (e.g., due to job cancellation or speculative execution) |
| Stage Executor Run Time | databricks.spark.job.stage.executor_run_time | Total time executors spent running tasks in the stage |
| Stage Input Bytes | databricks.spark.job.stage.input_bytes | Total number of bytes read from input sources in the stage |
| Stage Input Records | databricks.spark.job.stage.input_records | Total number of records read from input sources in the stage |
| Stage Output Bytes | databricks.spark.job.stage.output_bytes | Total number of bytes written to output destinations in the stage |
| Stage Output Records | databricks.spark.job.stage.output_records | Total number of records written to output destinations in the stage |
| Stage Shuffle Read Bytes | databricks.spark.job.stage.shuffle_read_bytes | Total bytes read from other executors during shuffle operations |
| Stage Shuffle Read Records | databricks.spark.job.stage.shuffle_read_records | Total records read from other executors during shuffle operations |
| Stage Shuffle Write Bytes | databricks.spark.job.stage.shuffle_write_bytes | Total bytes written to other executors during shuffle operations |
| Stage Shuffle Write Records | databricks.spark.job.stage.shuffle_write_records | Total records written to other executors during shuffle operations |
| Stage Memory Bytes Spilled | databricks.spark.job.stage.memory_bytes_spilled | Amount of data spilled to memory due to shuffle or aggregation operations |
| Stage Disk Bytes Spilled | databricks.spark.job.stage.disk_bytes_spilled | Amount of data spilled to disk due to insufficient memory during task execution |
| Metric name | Metric key | Description |
|---|---|---|
| Application Count | databricks.spark.application_count.gauge | Number of Spark applications running on the Databricks cluster |
| Metric name | Metric key | Description |
|---|---|---|
| RDD Count | databricks.spark.rdd_count.gauge | Total number of Resilient Distributed Datasets currently tracked by the Spark application |
| RDD Partitions | databricks.spark.rdd.num_partitions | Total number of partitions across all Resilient Distributed Datasets |
| RDD Cached Partitions | databricks.spark.rdd.num_cached_partitions | Number of Resilient Distributed Dataset partitions currently cached in memory or disk |
| RDD Memory Used | databricks.spark.rdd.memory_used | Amount of memory used to store Resilient Distributed Dataset data |
| RDD Disk Used | databricks.spark.rdd.disk_used | Amount of disk space used to store Resilient Distributed Dataset data |
| Metric name | Metric key | Description |
|---|---|---|
| Streaming Batch Duration | databricks.spark.streaming.statistics.batch_duration | Time interval configured for each streaming batch |
| Streaming Receivers | databricks.spark.streaming.statistics.num_receivers | Total number of receivers configured for the streaming job |
| Streaming Active Receivers | databricks.spark.streaming.statistics.num_active_receivers | Number of receivers actively ingesting data |
| Streaming Inactive Receivers | databricks.spark.streaming.statistics.num_inactive_receivers | Number of receivers that are currently inactive |
| Streaming Completed Batches | databricks.spark.streaming.statistics.num_total_completed_batches.count | Total number of batches that have been fully processed |
| Streaming Retained Completed Batches | databricks.spark.streaming.statistics.num_retained_completed_batches.count | Number of completed batches retained in memory for monitoring or debugging |
| Streaming Active Batches | databricks.spark.streaming.statistics.num_active_batches | Number of streaming batches currently being processed |
| Streaming Processed Records | databricks.spark.streaming.statistics.num_processed_records.count | Total number of records processed across all batches |
| Streaming Received Records | databricks.spark.streaming.statistics.num_received_records.count | Total number of records received from all sources |
| Streaming Avg Input Rate | databricks.spark.streaming.statistics.avg_input_rate | Average number of records received per second across batches |
| Streaming Avg Scheduling Delay | databricks.spark.streaming.statistics.avg_scheduling_delay | Average delay between batch creation and start of processing |
| Streaming Avg Processing Time | databricks.spark.streaming.statistics.avg_processing_time | Average time taken to process each batch |
| Streaming Avg Total Delay | databricks.spark.streaming.statistics.avg_total_delay | Average total delay from data ingestion to processing completion |