Monitor your Databricks clusters via multiple APIs.
Use this extension if you have Databricks clusters that you would like to monitor in Dynatrace.
Use the Databricks OneAgent extension to collect metrics from the embedded Ganglia instance, the Apache Spark APIs, and/or the Databricks API on your Databricks cluster.
Databricks Runtime v13+ no longer supports Ganglia. If this applies to your Databricks Runtime installation, use the Spark and Databricks API configuration option.
To activate this extension, you need to:
Define in the configuration which metrics you'd like to collect from your Databricks clusters.
Set up a global init script on your Databricks cluster to download the Dynatrace OneAgent.
Start or restart your Databricks cluster to enable the Dynatrace OneAgent and this extension.
See below for activation details.
Ensure the EEC (Extension Execution Controller) is enabled on each host. This can be done globally in your Dynatrace environment settings.
From inside your Databricks workspace, create a Databricks API token:
User Settings > Create API Token
Copy your Databricks workspace URL (a quick way to verify the token and URL is sketched after these steps).
Copy the Linux OneAgent installation wget command from
Deploy Dynatrace > Start Installation > Linux > Enter or Create PaaS Token
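Before wiring the Databricks URL and token into the init script, you can optionally verify them against the Databricks Clusters API. A minimal sketch, using the placeholder URL and token values from this guide:

```bash
# Optional sanity check: list clusters with the copied workspace URL and API token (placeholders).
# A JSON cluster list in the response means both values are usable by the extension.
curl -sS "https://adb-XXXXXXXXX.XX.azuredatabricks.net/api/2.0/clusters/list" \
  -H "Authorization: Bearer dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX"
```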
Databricks clusters can go up and down quickly, causing multiple host entities within Dynatrace. Databricks reuses IP addresses, so if you'd like to have the same host entities for your clusters, you can add the following flag to the OneAgent installation command in your global init script:
--set-host-id-source="ip-addresses"
Example:
/bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
Set up a global init script on your Databricks cluster.
If your Databricks cluster does not have network access to your Dynatrace Cluster or ActiveGate, the Dynatrace-OneAgent-Linux.sh file can be manually uploaded to your Databricks DBFS, and the script below can be modified to use that location instead of the wget command (a sketch of this variant follows the script).
#!/usr/bin/env bash
wget -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
/bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
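If you use the DBFS approach mentioned above, the wget download can be replaced with a copy from DBFS, which is mounted on cluster nodes under /dbfs. A minimal sketch, assuming the installer was uploaded to a hypothetical /FileStore/dynatrace/ path:

```bash
#!/usr/bin/env bash
# DBFS variant: the installer was uploaded beforehand (the /FileStore/dynatrace/ path is illustrative),
# so the wget step is replaced with a copy from the DBFS mount point.
cp /dbfs/FileStore/dynatrace/Dynatrace-OneAgent-Linux.sh ./Dynatrace-OneAgent-Linux.sh
/bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
```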
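Global init scripts can be added through the Databricks admin UI, or registered programmatically through the Databricks global init scripts REST API. A minimal sketch, assuming the script above is saved locally as dynatrace-init.sh and that DATABRICKS_HOST and DATABRICKS_TOKEN hold your workspace URL and API token (names are illustrative):

```bash
# Register the init script as a global init script via POST /api/2.0/global-init-scripts.
# The script body must be base64-encoded; dynatrace-init.sh is an illustrative file name.
curl -sS -X POST "${DATABRICKS_HOST}/api/2.0/global-init-scripts" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"name\": \"dynatrace-oneagent\", \"enabled\": true, \"script\": \"$(base64 -w0 dynatrace-init.sh)\"}"
```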
Configure the OneAgent extension in the Dynatrace cluster.
Select which feature sets of metrics you'd like to capture.
Start (or restart, if you're using an existing all-purpose compute cluster) your Databricks clusters and ensure the OneAgent is connected.
Verify that metrics show up on the host screen of your Databricks cluster's driver node. All the metrics will be attached to that host entity.
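You can also confirm that data is flowing with the Dynatrace Metrics API v2. A minimal sketch, assuming an API token with the metrics.read scope and one of the metric keys listed further below:

```bash
# Query one of the extension's metric keys via the Metrics API v2 (placeholders shown).
curl -sS "https://<TENANT>.live.dynatrace.com/api/v2/metrics/query?metricSelector=databricks.spark.executor.active_tasks" \
  -H "Authorization: Api-Token <metrics.read API-TOKEN>"
```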
Create a Dynatrace API token with the ReadConfig permission.
Set up a global init script on your Databricks cluster.
#!/usr/bin/env bash
wget -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
/bin/sh Dynatrace-OneAgent-Linux.sh --set-monitoring-mode=infra-only --set-app-log-content-access=true --set-host-id-source="ip-addresses" --set-host-group=""
# token with 'ReadConfig' permissions
wget -O custom_python_databricks_ganglia.zip "https://<TENANT>.live.dynatrace.com/api/config/v1/extensions/custom.python.databricks_ganglia/binary" --header="Authorization: Api-Token <ReadConfig API-TOKEN>"
unzip custom_python_databricks_ganglia.zip -d /opt/dynatrace/oneagent/plugin_deployment/
# Add Databricks Workspace URL Environment Variable
cat <<EOF | sudo tee /etc/databricks_env
DB_WS_URL=https://adb-XXXXXXXXX.XX.azuredatabricks.net
DB_WS_TOKEN=dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX
EOF
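Once a cluster with this init script is up, you can confirm from a web terminal or notebook shell that the extension was unpacked and the environment file was written. A minimal sketch, using the paths from the script above:

```bash
# Verify the extension landed next to the OneAgent and the workspace URL/token file exists.
ls /opt/dynatrace/oneagent/plugin_deployment/
cat /etc/databricks_env
```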
Create a Dynatrace API token with the entities.read and entities.write permissions (an API-based way to create this token is sketched after these steps).
Configure the Databricks OneAgent extension in the Dynatrace cluster.
Start (or restart, if you're using an existing all-purpose compute cluster) your Databricks clusters and ensure the OneAgent is connected.
Verify that metrics are showing up on the included dashboard.
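As an alternative to creating the token in the UI, the token from the earlier step can be created with the Dynatrace API tokens endpoint. A minimal sketch, assuming you already have a token with the apiTokens.write scope:

```bash
# Create a Dynatrace API token with the entities.read and entities.write scopes
# via POST /api/v2/apiTokens (the calling token needs the apiTokens.write scope).
curl -sS -X POST "https://<TENANT>.live.dynatrace.com/api/v2/apiTokens" \
  -H "Authorization: Api-Token <apiTokens.write API-TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"name": "databricks-extension", "scopes": ["entities.read", "entities.write"]}'
```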
When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after activation.
In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.
All metrics that aren't categorized into any feature set are considered to be the default and are always reported.
A metric inherits the feature set of a subgroup, which in turn inherits the feature set of a group. Also, the feature set defined on the metric level overrides the feature set defined on the subgroup level, which in turn overrides the feature set defined on the group level.
| Metric name | Metric key | Description |
|---|---|---|
| Databricks Cluster Upsizing Time | databricks.cluster.upsizing_time | Time spent upsizing cluster |
| Metric name | Metric key | Description |
|---|---|---|
| Executor RDD Blocks | databricks.spark.executor.rdd_blocks | Number of Resilient Distributed Dataset blocks stored in memory or disk by the executor |
| Executor Memory Used | databricks.spark.executor.memory_used | The amount of memory currently used by the executor for execution and storage tasks |
| Executor Disk Used | databricks.spark.executor.disk_used | Disk used by the Spark executor |
| Executor Active Tasks | databricks.spark.executor.active_tasks | Total number of tasks that are currently executing on the specified executor within the Databricks Cluster |
| Executor Failed Tasks | databricks.spark.executor.failed_tasks | Number of failed tasks on the Spark executor |
| Executor Completed Tasks | databricks.spark.executor.completed_tasks | Number of tasks completed by the Spark executor |
| Executor Total Tasks | databricks.spark.executor.total_tasks | Total number of tasks executed by the executor |
| Executor Duration | databricks.spark.executor.total_duration.count | Time taken by Spark executor to complete a task |
| Executor Input Bytes | databricks.spark.executor.total_input_bytes.count | Total number of Bytes read by a Spark task from its input source |
| Executor Shuffle Read | databricks.spark.executor.total_shuffle_read.count | Total data read by the executor during shuffle operations (from other executors) |
| Executor Shuffle Write | databricks.spark.executor.total_shuffle_write.count | Total data written by the executor during shuffle operations (to other executors) |
| Executor Max Memory | databricks.spark.executor.max_memory | The maximum amount of memory allocated to the executor by Spark |
| Executor Alive Count | databricks.spark.executor.alive_count.gauge | Number of executors currently alive on the Databricks cluster |
| Executor Dead Count | databricks.spark.executor.dead_count.gauge | Number of dead executors in the Spark application |
| Metric name | Metric key | Description |
|---|---|---|
| CPU User % | databricks.hardware.cpu.usr | Percentage of CPU time spent on user processes |
| CPU Nice % | databricks.hardware.cpu.nice | Percentage of CPU time used by processes that have a positive niceness, meaning a lower priority than other tasks |
| CPU System % | databricks.hardware.cpu.sys | Percentage of CPU time spent on system processes |
| CPU IOWait % | databricks.hardware.cpu.iowait | Percentage of time the CPU spends idle while waiting for I/O operations to complete |
| CPU IRQ % | databricks.hardware.cpu.irq | Percentage of CPU time spent handling hardware interrupt requests (IRQs) |
| CPU Steal % | databricks.hardware.cpu.steal | Percentage of time a virtual CPU waits for a physical CPU while the hypervisor is servicing another virtual processor |
| CPU Idle % | databricks.hardware.cpu.idle | Percentage of CPU time spent idle |
| Memory Used | databricks.hardware.mem.used | Total memory currently in use, including buffers and cache |
| Memory Total | databricks.hardware.mem.total | Total physical memory installed on the system |
| Memory Free | databricks.hardware.mem.free | Portion of memory that is completely unused and available |
| Memory Buff/Cache | databricks.hardware.mem.buff_cache | Memory used by the system for buffers and cache to improve performance |
| Memory Shared | databricks.hardware.mem.shared | Memory shared between processes |
| Memory Available | databricks.hardware.mem.available | Total amount of memory available for use by the system |
| Metric name | Metric key | Description |
|---|---|---|
| Job Status | databricks.spark.job.status | Current status of the job (e.g., running, succeeded, failed) |
| Job Duration | databricks.spark.job.duration | Total time taken by the job from start to finish |
| Job Total Tasks | databricks.spark.job.total_tasks | Total number of tasks planned for the job |
| Job Active Tasks | databricks.spark.job.active_tasks | Number of tasks currently executing within the job |
| Job Skipped Tasks | databricks.spark.job.skipped_tasks | Number of tasks skipped due to earlier failures or optimizations |
| Job Failed Tasks | databricks.spark.job.failed_tasks | Number of tasks that failed during job execution |
| Job Completed Tasks | databricks.spark.job.completed_tasks | Total number of tasks that have successfully completed |
| Job Active Stages | databricks.spark.job.active_stages | Number of stages currently running in a Spark job |
| Job Completed Stages | databricks.spark.job.completed_stages | Total number of stages that have successfully completed |
| Job Skipped Stages | databricks.spark.job.skipped_stages | Number of stages skipped due to earlier failures or optimizations |
| Job Failed Stages | databricks.spark.job.failed_stages | Number of stages that failed during job execution |
| Job Count | databricks.spark.job_count.gauge | Total number of Spark jobs submitted |
| Metric name | Metric key | Description |
|---|---|---|
| Stage Active Tasks | databricks.spark.job.stage.num_active_tasks | Number of tasks currently running in the stage |
| Stage Completed Tasks | databricks.spark.job.stage.num_complete_tasks | Number of tasks that have successfully completed in the stage |
| Stage Failed Tasks | databricks.spark.job.stage.num_failed_tasks | Number of tasks that failed during execution in the stage |
| Stage Killed Tasks | databricks.spark.job.stage.num_killed_tasks | Number of tasks that were killed (e.g., due to job cancellation or speculative execution) |
| Stage Executor Run Time | databricks.spark.job.stage.executor_run_time | Total time executors spent running tasks in the stage |
| Stage Input Bytes | databricks.spark.job.stage.input_bytes | Total number of bytes read from input sources in the stage |
| Stage Input Records | databricks.spark.job.stage.input_records | Total number of records read from input sources in the stage |
| Stage Output Bytes | databricks.spark.job.stage.output_bytes | Total number of bytes written to output destinations in the stage |
| Stage Output Records | databricks.spark.job.stage.output_records | Total number of records written to output destinations in the stage |
| Stage Shuffle Read Bytes | databricks.spark.job.stage.shuffle_read_bytes | Total bytes read from other executors during shuffle operations |
| Stage Shuffle Read Records | databricks.spark.job.stage.shuffle_read_records | Total records read from other executors during shuffle operations |
| Stage Shuffle Write Bytes | databricks.spark.job.stage.shuffle_write_bytes | Total bytes written to other executors during shuffle operations |
| Stage Shuffle Write Records | databricks.spark.job.stage.shuffle_write_records | Total records written to other executors during shuffle operations |
| Stage Memory Bytes Spilled | databricks.spark.job.stage.memory_bytes_spilled | Amount of data spilled to memory due to shuffle or aggregation operations |
| Stage Disk Bytes Spilled | databricks.spark.job.stage.disk_bytes_spilled | Amount of data spilled to disk due to insufficient memory during task execution |
| Metric name | Metric key | Description |
|---|---|---|
| Application Count | databricks.spark.application_count.gauge | Number of Spark applications running on the Databricks cluster |
| Metric name | Metric key | Description |
|---|---|---|
| RDD Count | databricks.spark.rdd_count.gauge | Total number of Resilient Distributed Datasets currently tracked by the Spark application |
| RDD Partitions | databricks.spark.rdd.num_partitions | Total number of partitions across all Resilient Distributed Datasets |
| RDD Cached Partitions | databricks.spark.rdd.num_cached_partitions | Number of Resilient Distributed Dataset partitions currently cached in memory or disk |
| RDD Memory Used | databricks.spark.rdd.memory_used | Amount of memory used to store Resilient Distributed Dataset data |
| RDD Disk Used | databricks.spark.rdd.disk_used | Amount of disk space used to store Resilient Distributed Dataset data |
| Metric name | Metric key | Description |
|---|---|---|
| Streaming Batch Duration | databricks.spark.streaming.statistics.batch_duration | Time interval configured for each streaming batch |
| Streaming Receivers | databricks.spark.streaming.statistics.num_receivers | Total number of receivers configured for the streaming job |
| Streaming Active Receivers | databricks.spark.streaming.statistics.num_active_receivers | Number of receivers actively ingesting data |
| Streaming Inactive Receivers | databricks.spark.streaming.statistics.num_inactive_receivers | Number of receivers that are currently inactive |
| Streaming Completed Batches | databricks.spark.streaming.statistics.num_total_completed_batches.count | Total number of batches that have been fully processed |
| Streaming Retained Completed Batches | databricks.spark.streaming.statistics.num_retained_completed_batches.count | Number of completed batches retained in memory for monitoring or debugging |
| Streaming Active Batches | databricks.spark.streaming.statistics.num_active_batches | Number of streaming batches currently being processed |
| Streaming Processed Records | databricks.spark.streaming.statistics.num_processed_records.count | Total number of records processed across all batches |
| Streaming Received Records | databricks.spark.streaming.statistics.num_received_records.count | Total number of records received from all sources |
| Streaming Avg Input Rate | databricks.spark.streaming.statistics.avg_input_rate | Average number of records received per second across batches |
| Streaming Avg Scheduling Delay | databricks.spark.streaming.statistics.avg_scheduling_delay | Average delay between batch creation and start of processing |
| Streaming Avg Processing Time | databricks.spark.streaming.statistics.avg_processing_time | Average time taken to process each batch |
| Streaming Avg Total Delay | databricks.spark.streaming.statistics.avg_total_delay | Average total delay from data ingestion to processing completion |