Databricks extension

  • Latest Dynatrace
  • Extension
  • Published Oct 27, 2025

Monitor your Databricks clusters via multiple APIs.

Get started

Overview

Use this extension if you have Databricks clusters for which you would like to:

  • Monitor job statuses and other important job and cluster level metrics
  • Analyze uptime and autoscaling issues

Use the Databricks OneAgent extension to collect metrics from your embedded Ganglia instance, the Apache Spark APIs, and/or the Databricks API on your Databricks cluster.

Databricks Runtime v13+ no longer supports Ganglia. If this applies to your Databricks Runtime installation, use the Spark and Databricks API configuration option.
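
Both data sources can also be queried by hand, which is a quick way to confirm connectivity before configuring the extension. The sketch below is illustrative only: it assumes the driver's Spark UI is reachable on its default port 4040 and that you already have a Databricks personal access token; the exact endpoints the extension calls internally may differ.

    #!/usr/bin/env bash
    # Illustrative connectivity checks only; adjust host, workspace URL, and token to your environment.

    # Spark REST API exposed by the driver's Spark UI (default port 4040):
    curl -s "http://<DRIVER-IP>:4040/api/v1/applications"

    # Databricks REST API (list clusters in the workspace):
    curl -s "https://adb-XXXXXXXXX.XX.azuredatabricks.net/api/2.0/clusters/list" \
      --header "Authorization: Bearer dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX"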

Use cases

  • Monitor job, cluster, and infrastructure metrics
  • Detect long upscaling times
  • Detect and filter driver and worker types

Activation and setup

To activate this extension, you need to:

  1. Define in the configuration which metrics you'd like to collect from your Databricks clusters.

  2. Set up a global init script on your Databricks cluster to download the Dynatrace OneAgent.

  3. Start or restart your Databricks cluster to enable the Dynatrace OneAgent and this extension.

See below for activation details.

  1. Ensure the EEC is enabled on each host.

    To do this globally:

    1. Go to Settings > Preferences > Extension Execution Controller.
    2. Turn on the first two options.
  2. From inside your Databricks cluster, create a Databricks API token:

    User Settings > Create API Token
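
    If you prefer to script this step, a token can also be created with the Databricks Token API instead of the UI. This is only a sketch; it assumes you can already authenticate with an existing token or another supported credential.

    #!/usr/bin/env bash
    # Sketch: create a Databricks personal access token via the Token API.
    # Requires an existing credential (e.g., another token) to authenticate the call.
    curl -s -X POST "https://adb-XXXXXXXXX.XX.azuredatabricks.net/api/2.0/token/create" \
      --header "Authorization: Bearer <EXISTING-TOKEN>" \
      --data '{"comment": "dynatrace-databricks-extension", "lifetime_seconds": 7776000}'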

  3. Copy your Databricks URL.

  4. Copy the Linux OneAgent Installation wget command from

    Deploy Dynatrace > Start Installation > Linux > Enter or Create PaaS Token

    Databricks clusters can go up and down quickly, causing multiple host entities within Dynatrace. Databricks reuses IP addresses, so if you'd like to have the same host entities for your clusters, you can add the following flag to the OneAgent installation command in your global init script:

    --set-host-id-source="ip-addresses"

    Example:

    /bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""

Configuration for Spark and Databricks API

  1. Set up a global init script on your Databricks cluster.

    • Change the Dynatrace tenant and API token values.
    • Change the host group for the OneAgent installation.

    If your Databricks cluster does not have network access to your Dynatrace cluster or ActiveGate, the Dynatrace-OneAgent-Linux.sh file can be manually uploaded to your Databricks DBFS, and the script below can be modified to use those locations instead of the wget command.

    #!/usr/bin/env bash
    # Download the OneAgent installer from your Dynatrace environment (PaaS token required)
    wget -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
    # Install OneAgent in infrastructure-only mode, keyed to IP addresses so restarted clusters reuse the same host entities
    /bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
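
    If you manage workspaces programmatically, the same script can be registered through the Databricks Global Init Scripts API instead of the workspace UI. This is only a sketch; it assumes a workspace admin token and that the script above is saved locally as dynatrace-init.sh.

    #!/usr/bin/env bash
    # Sketch: register the init script via the Global Init Scripts API (workspace admin token required).
    SCRIPT_B64=$(base64 -w0 dynatrace-init.sh)   # GNU coreutils; on macOS use: base64 -i dynatrace-init.sh
    curl -s -X POST "https://adb-XXXXXXXXX.XX.azuredatabricks.net/api/2.0/global-init-scripts" \
      --header "Authorization: Bearer dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
      --data "{\"name\": \"dynatrace-oneagent\", \"script\": \"${SCRIPT_B64}\", \"enabled\": true, \"position\": 0}"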
  2. Configure the OneAgent extension in the Dynatrace cluster.

    1. Go to Extensions > Databricks > Add Monitoring Configuration and select Databricks hosts.
    2. Turn on Call Spark API.
    3. Turn on Call Databricks API.
    4. Enter your Databricks URL and Databricks REST API token.
  3. Select which feature sets of metrics you'd like to capture.

  4. Start (or restart if you're using an existing All-purpose compute cluster) your Databricks clusters and ensure the OneAgent is connected.

  5. Verify that metrics show up on the host screen of your Databricks cluster's driver node. All the metrics will be attached to that host entity.
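
    Besides the host screen, ingestion can also be spot-checked with the Dynatrace Metrics API v2. The call below is only a sketch; it assumes a token with the metrics.read scope, and you can substitute any metric key from the feature sets you enabled.

    #!/usr/bin/env bash
    # Spot check: query one extension metric for the last hour (token needs the metrics.read scope).
    curl -s "https://<TENANT>.live.dynatrace.com/api/v2/metrics/query?metricSelector=databricks.spark.job_count.gauge&from=now-1h" \
      --header "Authorization: Api-Token <metrics.read API-TOKEN>"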

Configuration for Ganglia (legacy)

  1. Create a Dynatrace API token with ReadConfig permissions.

  2. Set up a global init script on your Databricks cluster.

    • Change the Dynatrace tenant and API token values.
    • Change the host group for the OneAgent installation.
    • Change the DB_WS_URL and DB_WS_TOKEN values (from the steps above).
    • NOTE: If your Databricks cluster does not have network access to your Dynatrace cluster or ActiveGate, the OneAgent.sh and extension ZIP file can be manually uploaded to your Databricks DBFS and the script below can be modified to use those locations instead of the wget commands.
    #!/usr/bin/env bash
    wget -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
    /bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
    # token with 'ReadConfig' permissions
    wget -O custom_python_databricks_ganglia.zip "https://<TENANT>.live.dynatrace.com/api/config/v1/extensions/custom.python.databricks_ganglia/binary" --header="Authorization: Api-Token <ReadConfig API-TOKEN>"
    unzip custom_python_databricks_ganglia.zip -d /opt/dynatrace/oneagent/plugin_deployment/
    # Add Databricks Workspace URL Environment Variable
    cat <<EOF | sudo tee /etc/databricks_env
    DB_WS_URL=https://adb-XXXXXXXXX.XX.azuredatabricks.net
    DB_WS_TOKEN=dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX
    EOF
  3. Create a Dynatrace API token with entities.read and entities.write permissions.
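
    If your Dynatrace setup is scripted, this token can also be created with the Dynatrace API Tokens API v2. The call below is only a sketch; it assumes an existing token with the apiTokens.write scope.

    #!/usr/bin/env bash
    # Sketch: create the token via the API Tokens API v2 (the calling token needs apiTokens.write).
    curl -s -X POST "https://<TENANT>.live.dynatrace.com/api/v2/apiTokens" \
      --header "Authorization: Api-Token <apiTokens.write API-TOKEN>" \
      --header "Content-Type: application/json" \
      --data '{"name": "databricks-extension", "scopes": ["entities.read", "entities.write"]}'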

  4. Configure the Databricks OneAgent extension in the Dynatrace cluster.

    1. Go to Extensions > Databricks > Add Monitoring Configuration.
    2. Select Databricks hosts.
    3. Turn on Call Ganglia API.
    4. Enter your Dynatrace URL and Databricks REST API token.
    5. Select which metrics you'd like to capture from Ganglia.
  5. Start (or restart, if you're using an existing all-purpose compute cluster) your Databricks clusters and ensure the OneAgent is connected.

  6. Verify that metrics are showing up on the included dashboard.

Feature sets

When you activate the extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after activation.

In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.

All metrics that aren't categorized into any feature set are considered to be the default and are always reported.

A metric inherits the feature set of a subgroup, which in turn inherits the feature set of a group. Also, the feature set defined on the metric level overrides the feature set defined on the subgroup level, which in turn overrides the feature set defined on the group level.

Databricks Cluster Timing Metrics
Metric name | Metric key | Description
Databricks Cluster Upsizing Time | databricks.cluster.upsizing_time | Time spent upsizing cluster
Spark Executor Metrics
Metric name | Metric key | Description
Executor RDD Blocks | databricks.spark.executor.rdd_blocks | Number of Resilient Distributed Dataset blocks stored in memory or disk by the executor
Executor Memory Used | databricks.spark.executor.memory_used | The amount of memory currently used by the executor for execution and storage tasks
Executor Disk Used | databricks.spark.executor.disk_used | Disk used by the Spark executor
Executor Active Tasks | databricks.spark.executor.active_tasks | Total number of tasks that are currently executing on the specified executor within the Databricks cluster
Executor Failed Tasks | databricks.spark.executor.failed_tasks | Number of failed tasks on the Spark executor
Executor Completed Tasks | databricks.spark.executor.completed_tasks | Number of completed tasks on the Spark application
Executor Total Tasks | databricks.spark.executor.total_tasks | Total number of tasks executed by the executor
Executor Duration | databricks.spark.executor.total_duration.count | Time taken by the Spark executor to complete a task
Executor Input Bytes | databricks.spark.executor.total_input_bytes.count | Total number of bytes read by a Spark task from its input source
Executor Shuffle Read | databricks.spark.executor.total_shuffle_read.count | Total data read by the executor during shuffle operations (from other executors)
Executor Shuffle Write | databricks.spark.executor.total_shuffle_write.count | Total data written by the executor during shuffle operations (to other executors)
Executor Max Memory | databricks.spark.executor.max_memory | The maximum amount of memory allocated to the executor by Spark
Executor Alive Count | databricks.spark.executor.alive_count.gauge | Number of executors currently alive on the Databricks cluster
Executor Dead Count | databricks.spark.executor.dead_count.gauge | Number of dead executors in the Spark application
Hardware Metrics
Metric name | Metric key | Description
CPU User % | databricks.hardware.cpu.usr | Percentage of CPU time spent on user processes
CPU Nice % | databricks.hardware.cpu.nice | Percentage of CPU time used by processes with a positive niceness, meaning a lower priority than other tasks
CPU System % | databricks.hardware.cpu.sys | Percentage of CPU time spent on system processes
CPU IOWait % | databricks.hardware.cpu.iowait | Percentage of time the CPU spends idle while waiting for I/O operations to complete
CPU IRQ % | databricks.hardware.cpu.irq | Percentage of CPU time spent handling hardware interrupt requests
CPU Steal % | databricks.hardware.cpu.steal | Percentage of time a virtual CPU waits for a physical CPU while the hypervisor is servicing another virtual processor
CPU Idle % | databricks.hardware.cpu.idle | Percentage of time the CPU is idle
Memory Used | databricks.hardware.mem.used | Total memory currently in use, including buffers and cache
Memory Total | databricks.hardware.mem.total | Total physical memory installed on the system
Memory Free | databricks.hardware.mem.free | Portion of memory that is completely unused and available
Memory Buff/Cache | databricks.hardware.mem.buff_cache | Memory used by the system for buffers and cache to improve performance
Memory Shared | databricks.hardware.mem.shared | Memory shared between processes
Memory Available | databricks.hardware.mem.available | Total amount of memory available for use by the system
Spark Job Metrics
Metric name | Metric key | Description
Job Status | databricks.spark.job.status | Current status of the job (e.g., running, succeeded, failed)
Job Duration | databricks.spark.job.duration | Total time taken by the job from start to finish
Job Total Tasks | databricks.spark.job.total_tasks | Total number of tasks planned for the job
Job Active Tasks | databricks.spark.job.active_tasks | Number of tasks currently executing within the job
Job Skipped Tasks | databricks.spark.job.skipped_tasks | Number of tasks skipped due to earlier failures or optimizations
Job Failed Tasks | databricks.spark.job.failed_tasks | Number of tasks that failed during job execution
Job Completed Tasks | databricks.spark.job.completed_tasks | Total number of tasks that have successfully completed
Job Active Stages | databricks.spark.job.active_stages | Number of stages currently running in a Spark job
Job Completed Stages | databricks.spark.job.completed_stages | Total number of stages that have successfully completed
Job Skipped Stages | databricks.spark.job.skipped_stages | Number of stages skipped due to earlier failures or optimizations
Job Failed Stages | databricks.spark.job.failed_stages | Number of stages that failed during job execution
Job Count | databricks.spark.job_count.gauge | Total number of Spark jobs submitted
Spark Stage Metrics
Metric name | Metric key | Description
Stage Active Tasks | databricks.spark.job.stage.num_active_tasks | Number of tasks currently running in the stage
Stage Completed Tasks | databricks.spark.job.stage.num_complete_tasks | Number of tasks that have successfully completed in the stage
Stage Failed Tasks | databricks.spark.job.stage.num_failed_tasks | Number of tasks that failed during execution in the stage
Stage Killed Tasks | databricks.spark.job.stage.num_killed_tasks | Number of tasks that were killed (e.g., due to job cancellation or speculative execution)
Stage Executor Run Time | databricks.spark.job.stage.executor_run_time | Total time executors spent running tasks in the stage
Stage Input Bytes | databricks.spark.job.stage.input_bytes | Total number of bytes read from input sources in the stage
Stage Input Records | databricks.spark.job.stage.input_records | Total number of records read from input sources in the stage
Stage Output Bytes | databricks.spark.job.stage.output_bytes | Total number of bytes written to output destinations in the stage
Stage Output Records | databricks.spark.job.stage.output_records | Total number of records written to output destinations in the stage
Stage Shuffle Read Bytes | databricks.spark.job.stage.shuffle_read_bytes | Total bytes read from other executors during shuffle operations
Stage Shuffle Read Records | databricks.spark.job.stage.shuffle_read_records | Total records read from other executors during shuffle operations
Stage Shuffle Write Bytes | databricks.spark.job.stage.shuffle_write_bytes | Total bytes written to other executors during shuffle operations
Stage Shuffle Write Records | databricks.spark.job.stage.shuffle_write_records | Total records written to other executors during shuffle operations
Stage Memory Bytes Spilled | databricks.spark.job.stage.memory_bytes_spilled | Amount of data spilled to memory due to shuffle or aggregation operations
Stage Disk Bytes Spilled | databricks.spark.job.stage.disk_bytes_spilled | Amount of data spilled to disk due to insufficient memory during task execution
default
Metric name | Metric key | Description
Application Count | databricks.spark.application_count.gauge | Number of Spark applications running on Databricks
Spark RDD Metrics
Metric name | Metric key | Description
RDD Count | databricks.spark.rdd_count.gauge | Total number of Resilient Distributed Datasets currently tracked by the Spark application
RDD Partitions | databricks.spark.rdd.num_partitions | Total number of partitions across all Resilient Distributed Datasets
RDD Cached Partitions | databricks.spark.rdd.num_cached_partitions | Number of Resilient Distributed Dataset partitions currently cached in memory or disk
RDD Memory Used | databricks.spark.rdd.memory_used | Amount of memory used to store Resilient Distributed Dataset data
RDD Disk Used | databricks.spark.rdd.disk_used | Amount of disk space used to store Resilient Distributed Dataset data
Spark Streaming Metrics
Metric name | Metric key | Description
Streaming Batch Duration | databricks.spark.streaming.statistics.batch_duration | Time interval configured for each streaming batch
Streaming Receivers | databricks.spark.streaming.statistics.num_receivers | Total number of receivers configured for the streaming job
Streaming Active Receivers | databricks.spark.streaming.statistics.num_active_receivers | Number of receivers actively ingesting data
Streaming Inactive Receivers | databricks.spark.streaming.statistics.num_inactive_receivers | Number of receivers that are currently inactive
Streaming Completed Batches | databricks.spark.streaming.statistics.num_total_completed_batches.count | Total number of batches that have been fully processed
Streaming Retained Completed Batches | databricks.spark.streaming.statistics.num_retained_completed_batches.count | Number of completed batches retained in memory for monitoring or debugging
Streaming Active Batches | databricks.spark.streaming.statistics.num_active_batches | Number of streaming batches currently being processed
Streaming Processed Records | databricks.spark.streaming.statistics.num_processed_records.count | Total number of records processed across all batches
Streaming Received Records | databricks.spark.streaming.statistics.num_received_records.count | Total number of records received from all sources
Streaming Avg Input Rate | databricks.spark.streaming.statistics.avg_input_rate | Average number of records received per second across batches
Streaming Avg Scheduling Delay | databricks.spark.streaming.statistics.avg_scheduling_delay | Average delay between batch creation and start of processing
Streaming Avg Processing Time | databricks.spark.streaming.statistics.avg_processing_time | Average time taken to process each batch
Streaming Avg Total Delay | databricks.spark.streaming.statistics.avg_total_delay | Average total delay from data ingestion to processing completion
Related tags
Analytics, Python, Data Processing/Analytics, Databricks, Infrastructure Observability