Azure Managed Instance for Apache Cassandra Monitoring
From both a data and an infrastructure perspective, this Prometheus Extension 2.0 lets you monitor and analyze the activity of your Apache Cassandra clusters. It visualizes your cluster's health and shows metrics such as CPU, connectivity, request latency, suspension, and garbage collection time. Additionally, with Davis, it automatically detects performance problems and provides precise root cause analysis.
Prerequisites
- Azure Managed Instance for Apache Cassandra created and running.
- An Ubuntu virtual machine deployed inside the Azure Virtual Network where the managed instance is present.
- Prometheus server set up to scrape the Cassandra nodes, with the relabel config in place.
- Environment ActiveGate version 1.231+ with access to the Prometheus server
Setup
- Create an Ubuntu virtual machine in the same virtual network as your Azure Managed Instance for Apache Cassandra.
- Ensure Docker is installed on your virtual machine.
- Create a file named prometheus.yml on your virtual machine with the contents below. Add every Cassandra node IP address and port 9443 in the static_configs section. The IP addresses can be gathered from the Data Center section of the Azure Portal for your Cassandra cluster.

      static_configs:
        - targets: ["<Node_IP_1>:9443", "<Node_IP_2>:9443", "<Node_IP_N>:9443"]

  Full contents of prometheus.yml:

      global:
        scrape_interval: 15s
        scrape_timeout: 10s
        evaluation_interval: 15s

      alerting:
        alertmanagers:
          - static_configs:
              - targets: []
            scheme: http
            timeout: 10s

      scrape_configs:
        - job_name: prometheus
          scrape_interval: 15s
          scrape_timeout: 15s
          metrics_path: /metrics
          scheme: http
          static_configs:
            - targets:
                - localhost:9090

        - job_name: "mcac"
          scrape_interval: 15s
          scrape_timeout: 15s
          static_configs:
            - targets: ["<Node_IP_1>:9443", "<Node_IP_2>:9443", "<Node_IP_N>:9443"]
          honor_labels: true
          honor_timestamps: false
          scheme: https
          tls_config:
            insecure_skip_verify: true
          metric_relabel_configs:
            #drop metrics we can calculate from prometheus directly
            - source_labels: [__name__]
              regex: .*rate_(mean|1m|5m|15m)
              action: drop
            #save the original name for all metrics
            - source_labels: [__name__]
              regex: (collectd_mcac_.+)
              target_label: prom_name
              replacement: ${1}
            - source_labels: ["prom_name"]
              regex: .+_bucket_(\d+)
              target_label: le
              replacement: ${1}
            - source_labels: ["prom_name"]
              regex: .+_bucket_inf
              target_label: le
              replacement: +Inf
            - source_labels: ["prom_name"]
              regex: .*_histogram_p(\d+)
              target_label: quantile
              replacement: .${1}
            - source_labels: ["prom_name"]
              regex: .*_histogram_min
              target_label: quantile
              replacement: "0"
            - source_labels: ["prom_name"]
              regex: .*_histogram_max
              target_label: quantile
              replacement: "1"
            #Table Metrics *ALL* we can drop
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.table\.(\w+)
              action: drop
            #Table Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.table\.(\w+)\.(\w+)\.(\w+)
              target_label: table
              replacement: ${3}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.table\.(\w+)\.(\w+)\.(\w+)
              target_label: keyspace
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.table\.(\w+)\.(\w+)\.(\w+)
              target_label: __name__
              replacement: mcac_table_${1}
            #Keyspace Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.keyspace\.(\w+)\.(\w+)
              target_label: keyspace
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.keyspace\.(\w+)\.(\w+)
              target_label: __name__
              replacement: mcac_keyspace_${1}
            #ThreadPool Metrics (one type is repair.task so we just ignore the second part)
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.thread_pools\.(\w+)\.(\w+)\.(\w+).*
              target_label: pool_type
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.thread_pools\.(\w+)\.(\w+)\.(\w+).*
              target_label: pool_name
              replacement: ${3}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.thread_pools\.(\w+)\.(\w+)\.(\w+).*
              target_label: __name__
              replacement: mcac_thread_pools_${1}
            #ClientRequest Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.client_request\.(\w+)\.(\w+)$
              target_label: request_type
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.client_request\.(\w+)\.(\w+)$
              target_label: __name__
              replacement: mcac_client_request_${1}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.client_request\.(\w+)\.(\w+)\.(\w+)$
              target_label: cl
              replacement: ${3}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.client_request\.(\w+)\.(\w+)\.(\w+)$
              target_label: request_type
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.client_request\.(\w+)\.(\w+)\.(\w+)$
              target_label: __name__
              replacement: mcac_client_request_${1}_cl
            #Cache Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.cache\.(\w+)\.(\w+)
              target_label: cache_name
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.cache\.(\w+)\.(\w+)
              target_label: __name__
              replacement: mcac_cache_${1}
            #CQL Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.cql\.(\w+)
              target_label: __name__
              replacement: mcac_cql_${1}
            #Dropped Message Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.dropped_message\.(\w+)\.(\w+)
              target_label: message_type
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.dropped_message\.(\w+)\.(\w+)
              target_label: __name__
              replacement: mcac_dropped_message_${1}
            #Streaming Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.streaming\.(\w+)\.(.+)$
              target_label: peer_ip
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.streaming\.(\w+)\.(.+)$
              target_label: __name__
              replacement: mcac_streaming_${1}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.streaming\.(\w+)$
              target_label: __name__
              replacement: mcac_streaming_${1}
            #CommitLog Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.commit_log\.(\w+)
              target_label: __name__
              replacement: mcac_commit_log_${1}
            #Compaction Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.compaction\.(\w+)
              target_label: __name__
              replacement: mcac_compaction_${1}
            #Storage Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.storage\.(\w+)
              target_label: __name__
              replacement: mcac_storage_${1}
            #Batch Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.batch\.(\w+)
              target_label: __name__
              replacement: mcac_batch_${1}
            #Client Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.client\.(\w+)
              target_label: __name__
              replacement: mcac_client_${1}
            #BufferPool Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.buffer_pool\.(\w+)
              target_label: __name__
              replacement: mcac_buffer_pool_${1}
            #Index Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.index\.(\w+)
              target_label: __name__
              replacement: mcac_sstable_index_${1}
            #HintService Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.hinted_hand_off_manager\.([^\-]+)-(\w+)
              target_label: peer_ip
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.hinted_hand_off_manager\.([^\-]+)-(\w+)
              target_label: __name__
              replacement: mcac_hints_${1}
            #HintService Metrics
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.hints_service\.hints_delays\-(\w+)
              target_label: peer_ip
              replacement: ${1}
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.hints_service\.hints_delays\-(\w+)
              target_label: __name__
              replacement: mcac_hints_hints_delays
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.hints_service\.([^\-]+)
              target_label: __name__
              replacement: mcac_hints_${1}
            # Misc
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.memtable_pool\.(\w+)
              target_label: __name__
              replacement: mcac_memtable_pool_${1}
            - source_labels: ["mcac"]
              regex: com\.datastax\.bdp\.type\.performance_objects\.name\.cql_slow_log\.metrics\.queries_latency
              target_label: __name__
              replacement: mcac_cql_slow_log_query_latency
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.read_coordination\.(.*)
              target_label: read_type
              replacement: $1
            - source_labels: ["mcac"]
              regex: org\.apache\.cassandra\.metrics\.read_coordination\.(.*)
              target_label: __name__
              replacement: mcac_read_coordination_requests
            #GC Metrics
            - source_labels: ["mcac"]
              regex: jvm\.gc\.(\w+)\.(\w+)
              target_label: collector_type
              replacement: ${1}
            - source_labels: ["mcac"]
              regex: jvm\.gc\.(\w+)\.(\w+)
              target_label: __name__
              replacement: mcac_jvm_gc_${2}
            #JVM Metrics
            - source_labels: ["mcac"]
              regex: jvm\.memory\.(\w+)\.(\w+)
              target_label: memory_type
              replacement: ${1}
            - source_labels: ["mcac"]
              regex: jvm\.memory\.(\w+)\.(\w+)
              target_label: __name__
              replacement: mcac_jvm_memory_${2}
            - source_labels: ["mcac"]
              regex: jvm\.memory\.pools\.(\w+)\.(\w+)
              target_label: pool_name
              replacement: ${2}
            - source_labels: ["mcac"]
              regex: jvm\.memory\.pools\.(\w+)\.(\w+)
              target_label: __name__
              replacement: mcac_jvm_memory_pool_${2}
            - source_labels: ["mcac"]
              regex: jvm\.fd\.usage
              target_label: __name__
              replacement: mcac_jvm_fd_usage
            - source_labels: ["mcac"]
              regex: jvm\.buffers\.(\w+)\.(\w+)
              target_label: buffer_type
              replacement: ${1}
            - source_labels: ["mcac"]
              regex: jvm\.buffers\.(\w+)\.(\w+)
              target_label: __name__
              replacement: mcac_jvm_buffer_${2}
            #Append the prom types back to formatted names
            - source_labels: [__name__, "prom_name"]
              regex: (mcac_.*);.*(_micros_bucket|_bucket|_micros_count_total|_count_total|_total|_micros_sum|_sum|_stddev).*
              separator: ;
              target_label: __name__
              replacement: ${1}${2}
            - regex: prom_name
              action: labeldrop
- Start your Prometheus server Docker container. Be sure to change the path in the command below to point to the prometheus.yml file from above.

      docker run \
        -d \
        -p 9090:9090 \
        -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
        prom/prometheus

- If your virtual machine is not available from the internet, install a Dynatrace Environment ActiveGate on your Ubuntu VM. Recommended: Set the group property on the installation.
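
For clusters with many nodes, the targets entry in static_configs can be generated instead of typed by hand. The following is a minimal Python sketch, assuming the node IP addresses have already been copied from the Azure Portal. The `render_targets` helper and the example addresses are hypothetical and are not part of the extension or of Prometheus.

```python
def render_targets(node_ips, port=9443):
    """Return a static_configs 'targets' line for the given Cassandra node IPs.

    Port 9443 is the metrics port named in the setup steps above.
    """
    quoted = ", ".join(f'"{ip}:{port}"' for ip in node_ips)
    return f"- targets: [{quoted}]"


# Placeholder addresses; replace them with the IPs of your own nodes.
example_ips = ["10.0.0.4", "10.0.0.5", "10.0.0.6"]
print(render_targets(example_ips))
```

The printed line can be pasted under static_configs in the mcac job of prometheus.yml, keeping the indentation of the surrounding file.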
Enable and configure extension
- In the Dynatrace menu, go to Dynatrace Hub.
- Search for Azure Managed Instance for Apache Cassandra and enable the extension.
- Verify that the Prometheus endpoint publishes the Cassandra metrics. Use either of these queries:

      {__name__=~"mcac.*"}
      http://<Prometheus Server URL>:9090/api/v1/query?query=%7B__name__%3D%7E%22mcac.*%22%7D

- Add the endpoint of your Prometheus server to the Extension Monitoring Configuration:

      http://<Prometheus Server URL>:9090/api/v1

  The <Prometheus Server URL> does not need to be public. If you install your ActiveGate on the same VM or same VNet as the Prometheus server, localhost or a private IP can be used.
- Select the ActiveGate group on which to enable this extension.
- Add a Monitoring Configuration description and select the Feature Sets of the metrics you'd like to collect.
- A dashboard named Azure Managed Instance for Apache Cassandra Overview is provided with the extension.
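
The verification URL in the steps above is the PromQL selector `{__name__=~"mcac.*"}` percent-encoded into the `query` parameter. As a rough sketch of how that URL is formed, the Python below builds it with the standard library and decodes it again. The `build_query_url` helper is hypothetical, and `localhost` stands in for your own Prometheus server address. Python may leave some characters (such as `~`) unencoded or encode others (such as `*`) differently from the URL shown above; the server should decode both forms to the same query.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(server, promql='{__name__=~"mcac.*"}'):
    """Build an instant-query URL for a Prometheus server listening on port 9090."""
    return f"http://{server}:9090/api/v1/query?{urlencode({'query': promql})}"


url = build_query_url("localhost")
print(url)

# Decoding the query string returns the original PromQL selector.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```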
Metrics
Available metrics are listed below.
- Metric metadata and dimensions are available using the Data explorer after the extension is enabled.
- See Apache Cassandra Monitoring Documentation for more information about collected metrics.
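
The `mcac_*` series that Prometheus exposes are shaped by the metric_relabel_configs rules in prometheus.yml. To illustrate how the three table rules rewrite one series, here is a simplified Python sketch of the default `replace` action only. It is an approximation, not Prometheus code: Prometheus uses RE2 regular expressions, while this uses Python's `re` module, and the raw label values below are hypothetical examples rather than captured output.

```python
import re


def apply_replace(labels, source_labels, regex, target_label, replacement, separator=";"):
    """Simplified model of a relabel rule with the default 'replace' action."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, so use a full match.
    match = re.fullmatch(regex, value)
    if match is None:
        return labels
    # Convert ${1} / $1 references into Python's \g<1> template syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


# Regex and replacements copied from the "Table Metrics" rules in prometheus.yml.
TABLE_REGEX = r"org\.apache\.cassandra\.metrics\.table\.(\w+)\.(\w+)\.(\w+)"
TABLE_RULES = [("table", "${3}"), ("keyspace", "${2}"), ("__name__", "mcac_table_${1}")]


def relabel_table_metric(labels):
    """Apply the three table rules in order, as Prometheus applies rules sequentially."""
    for target_label, replacement in TABLE_RULES:
        labels = apply_replace(labels, ["mcac"], TABLE_REGEX, target_label, replacement)
    return labels


# Hypothetical raw series; real label values depend on your cluster.
raw = {
    "__name__": "collectd_mcac_example",
    "mcac": "org.apache.cassandra.metrics.table.read_latency.ks1.users",
}
print(relabel_table_metric(raw))
```

In this model the metric segment becomes part of the series name, while the keyspace and table segments become labels.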
Cluster node metrics
Metric Name | Metric Key | Description |
---|---|---|
Storage Load | com.dynatrace.extension.prometheus.azure_cassandra_storage_load | Size, in bytes, of the on-disk data size this node manages. |
Storage Exceptions | com.dynatrace.extension.prometheus.azure_cassandra_storage_exceptions.count | Number of internal exceptions caught. In normal operation, this should be zero. |
Commit Log Pending Tasks | com.dynatrace.extension.prometheus.azure_cassandra_commit_log_pending_tasks | Number of commit log messages written but yet to be fsync'd. |
Commit Log Completed Tasks Total | com.dynatrace.extension.prometheus.azure_cassandra_commit_log_completed_tasks_total.count | Total number of commit log messages written since start/restart. |
Buffer Pool Size | com.dynatrace.extension.prometheus.azure_cassandra_buffer_pool_size | Size, in bytes, of the managed buffer pool. |
Buffer Pool Misses Total | com.dynatrace.extension.prometheus.azure_cassandra_buffer_pool_misses_total.count | The number of misses in the pool. The higher this is, the more allocations incurred. |
Client Connected Native Clients | com.dynatrace.extension.prometheus.azure_cassandra_client_connected_native_clients | Number of clients connected to this node's native protocol server. |
Client Auth Failure Total | com.dynatrace.extension.prometheus.azure_cassandra_client_auth_failure_total.count | Number of clients who experience authentication failures. |
Client Auth Success Total | com.dynatrace.extension.prometheus.azure_cassandra_client_auth_success_total.count | Number of clients who successfully authenticate. |
Storage Total Hints Total | com.dynatrace.extension.prometheus.azure_cassandra_storage_total_hints_total.count | Number of hint messages written to this node since start/restart. Includes one entry for each host to be hinted per hint. |
CQL Prepared Statements Executed Total | com.dynatrace.extension.prometheus.azure_cassandra_cql_prepared_statements_executed_total.count | Number of prepared statements executed. |
CQL Regular Statements Executed Total | com.dynatrace.extension.prometheus.azure_cassandra_cql_regular_statements_executed_total.count | Number of non-prepared statements executed. |
Dropped Messages Total | com.dynatrace.extension.prometheus.azure_cassandra_dropped_messages_total.count | Number of dropped messages. |
JVM GC Count | com.dynatrace.extension.prometheus.azure_cassandra_jvm_gc_count.count | Total number of collections that have occurred. |
JVM GC Time | com.dynatrace.extension.prometheus.azure_cassandra_jvm_gc_time.count | Approximate accumulated collection elapsed time in milliseconds. |
JVM Memory Used | com.dynatrace.extension.prometheus.azure_cassandra_jvm_memory_used | Amount of used memory in bytes. |
JVM Memory Usage | func:com.dynatrace.extension.prometheus.azure_cassandra_jvm_memory_usage | Ratio of used memory to maximum memory. |
Thread Pools Active Tasks | com.dynatrace.extension.prometheus.azure_cassandra_thread_pools_active_tasks | Number of tasks being actively worked on by this pool. |
Thread Pools Total Blocked Tasks Total | com.dynatrace.extension.prometheus.azure_cassandra_thread_pools_total_blocked_tasks_total.count | Number of tasks that were blocked due to queue saturation. |
Thread Pools Completed Tasks | com.dynatrace.extension.prometheus.azure_cassandra_thread_pools_completed_tasks | Number of tasks completed. |
Client Request Latency Total | com.dynatrace.extension.prometheus.azure_cassandra_client_request_latency_total.count | Latency of client requests. |
Client Request Failures Total | com.dynatrace.extension.prometheus.azure_cassandra_client_request_failures_total.count | Number of transaction failures encountered. |
Client Request Unavailables Total | com.dynatrace.extension.prometheus.azure_cassandra_client_request_unavailables_total.count | Number of unavailable exceptions encountered. |
Cache Hit Rate | func:com.dynatrace.extension.prometheus.azure_cassandra_cache_hit_rate | All-time cache hit rate. |
Cache Capacity | com.dynatrace.extension.prometheus.azure_cassandra_cache_capacity | Cache capacity in bytes. |
Cache Misses Total | com.dynatrace.extension.prometheus.azure_cassandra_cache_misses_total.count | Total number of cache misses. |
Cache Size | com.dynatrace.extension.prometheus.azure_cassandra_cache_size | Total size of occupied cache, in bytes. |
Keyspace metrics
Metric Name | Metric Key | Description |
---|---|---|
Keyspace All Memtables Live Data Size | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_all_memtables_live_data_size | Total amount of live data stored in the memtables (2i and pending flush memtables included) that resides off-heap, excluding any data structure overhead. |
Keyspace Bloom Filter Disk Space Used | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_bloom_filter_disk_space_used | Disk space used by bloom filter (in bytes). |
Keyspace Live Disk Space Used | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_live_disk_space_used | Disk space used by SSTables belonging to this table (in bytes). |
Keyspace Memtable Columns Count | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_memtable_columns_count.gauge | Total number of columns present in the memtable. |
Keyspace Memtable Live Data Size | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_memtable_live_data_size | Total amount of live data stored in the memtable, excluding any data structure overhead. |
Keyspace Memtable Switch Count | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_memtable_switch_count.gauge | Number of times that flush has resulted in the memtable being switched out. |
Keyspace Pending Compaction | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_pending_compaction | Estimated number of compactions remaining to perform. |
Keyspace Pending Flushes | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_pending_flushes | Estimated number of flush tasks pending for this table. |
Keyspace Read Total Latency Total | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_read_total_latency_total.count | Read latency. |
Keyspace Total Disk Space Used | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_total_disk_space_used | Total disk space used by SSTables belonging to this table, including obsolete ones waiting for GC. |
Keyspace Write Total Latency Total | com.dynatrace.extension.prometheus.azure_cassandra_keyspace_write_total_latency_total.count | Write latency. |
Table metrics
Metric Name | Metric Key | Description |
---|---|---|
Table Bloom Filter Disk Space Used | com.dynatrace.extension.prometheus.azure_cassandra_table_bloom_filter_disk_space_used | Disk space used by bloom filter (in bytes). |
Table Bloom Filter False Positives | com.dynatrace.extension.prometheus.azure_cassandra_table_bloom_filter_false_positives | Number of false positives on table's bloom filter. |
Table Bloom Filter False Ratio | func:com.dynatrace.extension.prometheus.azure_cassandra_table_bloom_filter_false_ratio | False positive ratio of table's bloom filter. |
Table Bytes Flushed Total | com.dynatrace.extension.prometheus.azure_cassandra_table_bytes_flushed_total.count | Total number of bytes flushed since server start/restart. |
Table Compaction Bytes Written Total | com.dynatrace.extension.prometheus.azure_cassandra_table_compaction_bytes_written_total.count | Total number of bytes compacted since server start/restart. |
Table Compression Ratio | func:com.dynatrace.extension.prometheus.azure_cassandra_table_compression_ratio | Current compression ratio for all SSTables. |
Table Dropped Mutations Total | com.dynatrace.extension.prometheus.azure_cassandra_table_dropped_mutations_total.count | Number of dropped mutations on this table. |
Table Estimated Partition Count | com.dynatrace.extension.prometheus.azure_cassandra_table_estimated_partition_count.gauge | Approximate number of keys in table. |
Table Key Cache Hit Rate | func:com.dynatrace.extension.prometheus.azure_cassandra_table_key_cache_hit_rate | Key cache hit rate for this table. |
Table Live Disk Space Used Total | com.dynatrace.extension.prometheus.azure_cassandra_table_live_disk_space_used_total | Disk space used by SSTables belonging to this table (in bytes). |
Table Live SSTable Count | com.dynatrace.extension.prometheus.azure_cassandra_table_live_ss_table_count.gauge | Number of SSTables on disk for this table. |
Table Memtable Columns Count | com.dynatrace.extension.prometheus.azure_cassandra_table_memtable_columns_count.gauge | Total number of columns present in the memtable. |
Table Memtable Live Data Size | com.dynatrace.extension.prometheus.azure_cassandra_table_memtable_live_data_size | Total amount of live data stored in the memtable, excluding any data structure overhead. |
Table Memtable Switch Count Total | com.dynatrace.extension.prometheus.azure_cassandra_table_memtable_switch_count_total.count | Number of times that flush has resulted in the memtable being switched out. |
Table Pending Compactions | com.dynatrace.extension.prometheus.azure_cassandra_table_pending_compactions | Estimate of number of pending compactions for this table. |
Table Pending Flushes Total | com.dynatrace.extension.prometheus.azure_cassandra_table_pending_flushes_total.count | Estimate of number of pending flushes for this table. |
Table Read Total Latency Total | com.dynatrace.extension.prometheus.azure_cassandra_table_read_total_latency_total.count | Read latency for this table. |
Table Row Cache Hit Total | com.dynatrace.extension.prometheus.azure_cassandra_table_row_cache_hit_total.count | Number of table row cache hits. |
Table Row Cache Miss Total | com.dynatrace.extension.prometheus.azure_cassandra_table_row_cache_miss_total.count | Number of table row cache misses. |
Table Total Disk Space Used Total | com.dynatrace.extension.prometheus.azure_cassandra_table_total_disk_space_used_total | Total disk space used by SSTables belonging to this table, including obsolete ones waiting for GC. |
Table Write Total Latency Total | com.dynatrace.extension.prometheus.azure_cassandra_table_write_total_latency_total.count | Write latency for this table. |