Databricks Workspace extension

Remotely monitor your Databricks workspaces.

Overview

With the Dynatrace Databricks Workspace extension, you can remotely monitor your Databricks workspaces.

This extension works in harmony with the OneAgent-based Databricks extension, but is also ideal for workspaces and clusters where the OneAgent cannot be installed, such as Databricks serverless compute.

Use cases

  • Gather Databricks Job Run metrics, including success rate and job duration.
  • Understand the cost of Databricks Jobs running on all-purpose and job compute clusters (currently, only Azure Databricks is supported).
  • Ingest job and task run information as traces for further analysis.
  • Gather health metrics and detailed usage information from your Databricks model serving endpoints.
  • Ingest billing data from Databricks to understand usage across workspaces, SKU & product category, jobs, and more.
  • Get rightsizing recommendations based on resource utilization metrics collected from your Databricks clusters.
  • Remotely collect Spark metrics from clusters for detailed information on jobs, tasks, stages, executors, and RDDs.
  • Ingest audit logs from your workspaces.

Requirements

The extension calls the Databricks REST APIs at versions 2.0, 2.1, and 2.2, depending on the endpoint.
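
For illustration, the job-run data behind the job metrics can be fetched from the Jobs API. A minimal sketch in Python, assuming a personal access token (the URL and token are placeholders):

    import requests

    WORKSPACE_URL = "https://<workspace-url>"
    TOKEN = "<databricks-access-token>"

    # List recently completed job runs, including their tasks.
    resp = requests.get(
        f"{WORKSPACE_URL}/api/2.2/jobs/runs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"completed_only": "true", "expand_tasks": "true"},
        timeout=30,
    )
    resp.raise_for_status()
    for run in resp.json().get("runs", []):
        print(run["run_id"], run["status"]["state"])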

The following system tables are queried, depending on the data being ingested:

  • Model serving endpoint data: system.serving.endpoint_usage, system.serving.served_entities
  • Billing & cost data: system.billing.usage, system.billing.list_prices, system.lakeflow.jobs, system.access.workspaces_latest
  • Cluster resource utilization: system.compute.clusters, system.compute.node_timeline
  • Audit logs: system.access.audit

To query any of the above system table data, the workspace must also have:

  • Unity Catalog enabled.
  • A SQL warehouse available to execute system table queries against.
    • Note that running these queries can incur additional costs in Databricks. To minimize this, we recommend using an existing active SQL warehouse if you have one.
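
These system-table queries are ordinary SQL statements executed on the configured warehouse. For illustration, a minimal sketch using the Databricks SQL Statement Execution API in Python (URL, token, and warehouse ID are placeholders):

    import requests

    WORKSPACE_URL = "https://<workspace-url>"
    TOKEN = "<databricks-access-token>"
    WAREHOUSE_ID = "<sql-warehouse-id>"

    # Submit a query against a system table and wait up to 30s for the result.
    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/sql/statements/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "warehouse_id": WAREHOUSE_ID,
            "statement": "SELECT count(*) FROM system.billing.usage",
            "wait_timeout": "30s",
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["status"]["state"])  # e.g. SUCCEEDED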

To capture metrics from the Spark API, the token or service principal must have the Can Attach To permission on the clusters you want to monitor.

For the databricks.job.cost metric, currently only Azure Databricks workspaces are supported.

Activation and setup

  1. Install Dynatrace Environment ActiveGate.

  2. Ensure connectivity between this ActiveGate and your Databricks workspace URL.
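
    To verify reachability from the ActiveGate host, any HTTP response from the workspace (even 401 Unauthorized) confirms connectivity, while a timeout or connection error points to a firewall or proxy issue. A minimal check in Python, with a placeholder URL:

    import requests

    WORKSPACE_URL = "https://<workspace-url>"

    # Any HTTP status code proves network-level reachability; an exception
    # (timeout, DNS, TLS) indicates a network or proxy problem.
    resp = requests.get(f"{WORKSPACE_URL}/api/2.0/clusters/list", timeout=10)
    print(resp.status_code)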

  3. Create a Databricks access token or a service principal for your Databricks workspace with access to all of the API scopes and system tables listed under Requirements. The following commands grant all of the required system table permissions to a service principal:

    GRANT USE SCHEMA ON SCHEMA system.access TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.access.workspaces_latest TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.access.audit TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.billing TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.billing.usage TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.billing.list_prices TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.serving TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.serving.endpoint_usage TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.serving.served_entities TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.lakeflow TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.lakeflow.jobs TO `<service principal application id>`;
    GRANT USE SCHEMA ON SCHEMA system.compute TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.compute.clusters TO `<service principal application id>`;
    GRANT SELECT ON TABLE system.compute.node_timeline TO `<service principal application id>`;

    If using a service principal, note that you will also need to do one of the following:

  4. Create a Dynatrace access token with the openTelemetryTrace.ingest scope.

  5. If ingesting billing and/or model serving endpoint data, provide the ID of an existing SQL warehouse used to execute system table queries. Ensure that the warehouse is up and running before enabling the extension, as there may be some startup delay depending on the type of warehouse.

  6. Create a new monitoring configuration in Dynatrace, using the URL, Dynatrace token, and either the Databricks access token or the client ID/secret associated with the service principal.
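
A service principal authenticates to Databricks through an OAuth client-credentials exchange rather than a static token. For reference, a minimal sketch of that flow in Python (all values are placeholders):

    import requests

    WORKSPACE_URL = "https://<workspace-url>"
    CLIENT_ID = "<service principal application id>"
    CLIENT_SECRET = "<service principal oauth secret>"

    # Databricks machine-to-machine OAuth: exchange the client credentials for
    # a short-lived bearer token used in the Authorization header of API calls.
    resp = requests.post(
        f"{WORKSPACE_URL}/oidc/v1/token",
        auth=(CLIENT_ID, CLIENT_SECRET),
        data={"grant_type": "client_credentials", "scope": "all-apis"},
        timeout=30,
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]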

Details

This extension remotely queries the Databricks APIs and system tables using the provided Databricks URL and access token or OAuth client.

With that information, it calculates and reports the various metrics selected from the feature sets in the Dynatrace monitoring configuration.

If trace ingestion is configured, the extension transforms the data from the Databricks APIs into OpenTelemetry traces with the job as the parent span and the tasks in that job as child spans.
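
As an illustration of that span structure, a sketch using the OpenTelemetry Python SDK with hypothetical job-run data (not the extension's actual code):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.trace import set_span_in_context

    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer("databricks.jobs")

    # Hypothetical job-run payload; timestamps in epoch milliseconds.
    job_run = {
        "run_name": "nightly-etl",
        "start_time": 1700000000000,
        "end_time": 1700000600000,
        "tasks": [
            {"task_key": "extract", "start_time": 1700000000000, "end_time": 1700000300000},
            {"task_key": "load", "start_time": 1700000300000, "end_time": 1700000600000},
        ],
    }

    MS = 1_000_000  # OpenTelemetry expects nanosecond timestamps

    # One parent span for the job run, one child span per task.
    job_span = tracer.start_span(job_run["run_name"], start_time=job_run["start_time"] * MS)
    parent = set_span_in_context(job_span)
    for task in job_run["tasks"]:
        child = tracer.start_span(task["task_key"], context=parent,
                                  start_time=task["start_time"] * MS)
        child.end(end_time=task["end_time"] * MS)
    job_span.end(end_time=job_run["end_time"] * MS)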

A job's metrics and trace are ingested only once the job completes; no data about a job is ingested while it is running.

Ingesting any system table data incurs a Databricks cost, because it involves executing queries against a SQL warehouse. The cost varies with the warehouse type (Serverless, Pro, or Classic) and your specific billing model.

Note that for the Databricks Cost Management dashboard, billing data is reported with a delay of up to 3 hours due to the system table refresh rate. If you don't see data in the dashboard as expected, expand the selected timeframe.

Licensing and cost

If all the feature sets are enabled, the number of metric data points is:

  • 8 * # of jobs
  • 11 * # of model serving endpoints
  • # of clusters * (16 + (27 * # of Spark jobs) + (32 * # of Spark apps))

If traces are configured to be ingested, the number of spans is:

  • # of jobs * (1 + tasks per job)
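
For example, a hypothetical workspace with 20 jobs (4 tasks each), 5 model serving endpoints, and 3 clusters each running 10 Spark jobs and 2 Spark apps would produce:

    # Hypothetical sizing example for the formulas above.
    jobs, endpoints, clusters = 20, 5, 3
    spark_jobs, spark_apps = 10, 2  # per cluster
    tasks_per_job = 4

    metric_datapoints = (
        8 * jobs
        + 11 * endpoints
        + clusters * (16 + 27 * spark_jobs + 32 * spark_apps)
    )
    spans = jobs * (1 + tasks_per_job)

    print(metric_datapoints)  # 1265
    print(spans)              # 100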

Log ingestion volume varies, with one log line reported per:

  • Job run
  • New entry in each of the following system tables:
    • system.serving.endpoint_usage
    • system.billing.usage
    • system.access.audit

Feature sets

When activating your extension using a monitoring configuration, you can limit monitoring to one of the feature sets. To work properly, the extension has to collect at least one metric after activation.

In highly segmented networks, feature sets can reflect the segments of your environment. Then, when you create a monitoring configuration, you can select a feature set and a corresponding ActiveGate group that can connect to this particular segment.

All metrics that aren't categorized into any feature set are considered to be the default and are always reported.

A metric inherits the feature set of a subgroup, which in turn inherits the feature set of a group. Also, the feature set defined on the metric level overrides the feature set defined on the subgroup level, which in turn overrides the feature set defined on the group level.

Spark Streaming Metrics

Metric name | Metric key | Description
Streaming Batch Duration | databricks.cluster.spark.streaming.statistics.batch_duration | Time interval configured for each streaming batch
Streaming Receivers | databricks.cluster.spark.streaming.statistics.num_receivers | Total number of receivers configured for the streaming job
Streaming Active Receivers | databricks.cluster.spark.streaming.statistics.num_active_receivers | Number of receivers actively ingesting data
Streaming Inactive Receivers | databricks.cluster.spark.streaming.statistics.num_inactive_receivers | Number of receivers that are currently inactive
Streaming Completed Batches | databricks.cluster.spark.streaming.statistics.num_total_completed_batches.count | Total number of batches that have been fully processed
Streaming Retained Completed Batches | databricks.cluster.spark.streaming.statistics.num_retained_completed_batches.count | Number of completed batches retained in memory for monitoring or debugging
Streaming Active Batches | databricks.cluster.spark.streaming.statistics.num_active_batches | Number of streaming batches currently being processed
Streaming Processed Records | databricks.cluster.spark.streaming.statistics.num_processed_records.count | Total number of records processed across all batches
Streaming Received Records | databricks.cluster.spark.streaming.statistics.num_received_records.count | Total number of records received from all sources
Streaming Avg Input Rate | databricks.cluster.spark.streaming.statistics.avg_input_rate | Average number of records received per second across batches
Streaming Avg Scheduling Delay | databricks.cluster.spark.streaming.statistics.avg_scheduling_delay | Average delay between batch creation and start of processing
Streaming Avg Processing Time | databricks.cluster.spark.streaming.statistics.avg_processing_time | Average time taken to process each batch
Streaming Avg Total Delay | databricks.cluster.spark.streaming.statistics.avg_total_delay | Average total delay from data ingestion to processing completion

Spark RDD Metrics

Metric name | Metric key | Description
RDD Count | databricks.cluster.spark.rdd_count.gauge | Total number of Resilient Distributed Datasets currently tracked by the Spark application
RDD Partitions | databricks.cluster.spark.rdd.num_partitions | Total number of partitions across all Resilient Distributed Datasets
RDD Cached Partitions | databricks.cluster.spark.rdd.num_cached_partitions | Number of Resilient Distributed Dataset partitions currently cached in memory or on disk
RDD Memory Used | databricks.cluster.spark.rdd.memory_used | Amount of memory used to store Resilient Distributed Dataset data
RDD Disk Used | databricks.cluster.spark.rdd.disk_used | Amount of disk space used to store Resilient Distributed Dataset data

Spark Job Metrics

Metric name | Metric key | Description
Job Status | databricks.cluster.spark.job.status | Current status of the job (e.g., running, succeeded, failed)
Job Duration | databricks.cluster.spark.job.duration | Total time taken by the job from start to finish
Job Total Tasks | databricks.cluster.spark.job.total_tasks | Total number of tasks planned for the job
Job Active Tasks | databricks.cluster.spark.job.active_tasks | Number of tasks currently executing within the job
Job Skipped Tasks | databricks.cluster.spark.job.skipped_tasks | Number of tasks skipped due to earlier failures or optimizations
Job Failed Tasks | databricks.cluster.spark.job.failed_tasks | Number of tasks that failed during job execution
Job Completed Tasks | databricks.cluster.spark.job.completed_tasks | Total number of tasks that have successfully completed
Job Active Stages | databricks.cluster.spark.job.active_stages | Number of stages currently running in a Spark job
Job Completed Stages | databricks.cluster.spark.job.completed_stages | Total number of stages that have successfully completed
Job Skipped Stages | databricks.cluster.spark.job.skipped_stages | Number of stages skipped due to earlier failures or optimizations
Job Failed Stages | databricks.cluster.spark.job.failed_stages | Number of stages that failed during job execution
Job Count | databricks.cluster.spark.job_count.gauge | Total number of Spark jobs submitted

Databricks Job Metrics

Metric name | Metric key | Description
Job Run Duration | databricks.job.duration.run
Job Success Rate | databricks.job.success_rate
Job Runs Count | databricks.job.runs

Databricks Resource Utilization Metrics

Metric name | Metric key | Description
Cluster CPU System Percentage | databricks.compute.cpu.system | Percentage of time the CPU spent in system mode
Cluster CPU User Percentage | databricks.compute.cpu.user | Percentage of time the CPU spent in userland
Cluster CPU Wait Percentage | databricks.compute.cpu.wait | Percentage of time the CPU spent waiting for I/O
Cluster CPU Total Percentage | databricks.compute.cpu.total | Percentage of time the CPU spent in total (including system and user time)
Cluster Memory Usage Percentage | databricks.compute.memory.used | Percentage of the compute's memory used during the time period (including memory used by background processes running on the compute)
Cluster Memory Swap Percentage | databricks.compute.memory.swap | Percentage of memory usage attributed to swap
Cluster Network Sent Bytes | databricks.compute.network.sent | Number of bytes sent in network traffic
Cluster Network Received Bytes | databricks.compute.network.received | Number of bytes received in network traffic

Databricks Model Serving Endpoint Metrics

Metric name | Metric key | Description
Model Serving Endpoint Memory Usage Percentage | databricks.model_endpoint.mem_usage_percentage
Model Serving Endpoint CPU Usage Percentage | databricks.model_endpoint.cpu_usage_percentage
Model Serving Endpoint Request Count Total | databricks.model_endpoint.request_count_total
Model Serving Endpoint Request 5xx Count Total | databricks.model_endpoint.request_5xx_count_total
Model Serving Endpoint Provisioned Concurrent Requests Total | databricks.model_endpoint.provisioned_concurrent_requests_total
Model Serving Endpoint Request 4xx Count Total | databricks.model_endpoint.request_4xx_count_total
Model Serving Endpoint GPU Usage Percentage | databricks.model_endpoint.gpu_usage_percentage
Model Serving Endpoint GPU Memory Usage Percentage | databricks.model_endpoint.gpu_memory_usage_percentage
Model Serving Endpoint Average Request Latency | databricks.model_endpoint.request_latency_ms_avg
Model Serving Endpoint P99 Request Latency | databricks.model_endpoint.request_latency_ms_p99
Model Serving Endpoint P95 Request Latency | databricks.model_endpoint.request_latency_ms_p95

Databricks Job Metrics (detailed)

Metric name | Metric key | Description
Job Setup Duration | databricks.job.duration.setup
Job Execution Duration | databricks.job.duration.execution
Job Cleanup Duration | databricks.job.duration.cleanup
Job Queue Duration | databricks.job.duration.queue

Spark Executor Metrics

Metric name | Metric key | Description
Executor RDD Blocks | databricks.cluster.spark.executor.rdd_blocks | Number of Resilient Distributed Dataset blocks stored in memory or disk by the executor
Executor Memory Used | databricks.cluster.spark.executor.memory_used | Amount of memory currently used by the executor for execution and storage tasks
Executor Disk Used | databricks.cluster.spark.executor.disk_used | Disk space used by the Spark executor
Executor Active Tasks | databricks.cluster.spark.executor.active_tasks | Number of tasks currently executing on the executor
Executor Failed Tasks | databricks.cluster.spark.executor.failed_tasks | Number of failed tasks on the Spark executor
Executor Completed Tasks | databricks.cluster.spark.executor.completed_tasks | Number of completed tasks on the executor
Executor Total Tasks | databricks.cluster.spark.executor.total_tasks | Total number of tasks executed by the executor
Executor Duration | databricks.cluster.spark.executor.total_duration.count | Total time the executor has spent running tasks
Executor Input Bytes | databricks.cluster.spark.executor.total_input_bytes.count | Total number of bytes read by the executor from its input sources
Executor Shuffle Read | databricks.cluster.spark.executor.total_shuffle_read.count | Total data read by the executor during shuffle operations (from other executors)
Executor Shuffle Write | databricks.cluster.spark.executor.total_shuffle_write.count | Total data written by the executor during shuffle operations (to other executors)
Executor Max Memory | databricks.cluster.spark.executor.max_memory | Maximum amount of memory allocated to the executor by Spark
Executor Alive Count | databricks.cluster.spark.executor.alive_count.gauge | Number of executors currently alive in the Spark application
Executor Dead Count | databricks.cluster.spark.executor.dead_count.gauge | Number of dead executors in the Spark application

Databricks Job Cost Metrics

Metric name | Metric key | Description
Job Cost (Approx) | databricks.job.cost

Spark Stage Metrics

Metric name | Metric key | Description
Stage Active Tasks | databricks.cluster.spark.job.stage.num_active_tasks | Number of tasks currently running in the stage
Stage Completed Tasks | databricks.cluster.spark.job.stage.num_complete_tasks | Number of tasks that have successfully completed in the stage
Stage Failed Tasks | databricks.cluster.spark.job.stage.num_failed_tasks | Number of tasks that failed during execution in the stage
Stage Killed Tasks | databricks.cluster.spark.job.stage.num_killed_tasks | Number of tasks that were killed (e.g., due to job cancellation or speculative execution)
Stage Executor Run Time | databricks.cluster.spark.job.stage.executor_run_time | Total time executors spent running tasks in the stage
Stage Input Bytes | databricks.cluster.spark.job.stage.input_bytes | Total number of bytes read from input sources in the stage
Stage Input Records | databricks.cluster.spark.job.stage.input_records | Total number of records read from input sources in the stage
Stage Output Bytes | databricks.cluster.spark.job.stage.output_bytes | Total number of bytes written to output destinations in the stage
Stage Output Records | databricks.cluster.spark.job.stage.output_records | Total number of records written to output destinations in the stage
Stage Shuffle Read Bytes | databricks.cluster.spark.job.stage.shuffle_read_bytes | Total bytes read from other executors during shuffle operations
Stage Shuffle Read Records | databricks.cluster.spark.job.stage.shuffle_read_records | Total records read from other executors during shuffle operations
Stage Shuffle Write Bytes | databricks.cluster.spark.job.stage.shuffle_write_bytes | Total bytes written to other executors during shuffle operations
Stage Shuffle Write Records | databricks.cluster.spark.job.stage.shuffle_write_records | Total records written to other executors during shuffle operations
Stage Memory Bytes Spilled | databricks.cluster.spark.job.stage.memory_bytes_spilled | Amount of data spilled to memory due to shuffle or aggregation operations
Stage Disk Bytes Spilled | databricks.cluster.spark.job.stage.disk_bytes_spilled | Amount of data spilled to disk due to insufficient memory during task execution

Related tags

Analytics, Python, Data Processing/Analytics, Databricks, Infrastructure Observability