Alert on common Kubernetes/OpenShift issues
Dynatrace version 1.254+
ActiveGate version 1.253+
To alert on common Kubernetes platform issues, follow the instructions below.
Configure
There are three ways to configure alerts for common Kubernetes/OpenShift issues.
Configuring an alert at a different level is only intended to simplify configuring multiple entities at once; it does not change the alert's behavior.
For example, a workload CPU usage saturation alert enabled at the Kubernetes cluster level still evaluates and raises problems for each Kubernetes workload separately.
For further details on the settings hierarchy, see the Settings documentation.
View alerts
You can view alerts:

- On the Problems page.
- In the Events section of a cluster details page. Select the event to navigate to Data Explorer for more information about the metric that generated the event.
Available alerts
See below for a list of available alerts.
Cluster alerts
Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.254 | Resource | CPU-request saturation on cluster | CPU-request saturation exceeds the specified threshold. | 5 minutes | Node CPU requests / Node CPU allocatable | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Memory-request saturation on cluster | Memory-request saturation exceeds the specified threshold. | 5 minutes | Node memory requests / Node memory allocatable | Kubernetes Classic, Kubernetes app |
1.258 | Resource | Pod saturation on cluster | Cluster pod-saturation exceeds the specified threshold. | 5 minutes | Sum of ready pods / Sum of allocatable pods | Kubernetes Classic, Kubernetes app |
1.254 | Availability | Cluster not ready | Readyz endpoint indicates that this cluster is not ready. | 5 minutes | Cluster readyz metric | Kubernetes Classic, Kubernetes app |
1.258 | Availability | Monitoring not available | Dynatrace API monitoring is not available. | 5 minutes | — | Kubernetes Classic, Kubernetes app |
Detect cluster CPU-request saturation
builtin:kubernetes.node.requests_cpu:splitBy():sum/builtin:kubernetes.node.cpu_allocatable:splitBy():sum*100.0
timeseries o1=sum(dt.kubernetes.container.requests_cpu, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {}| join [timeseries operand=sum(dt.kubernetes.node.cpu_allocatable, rollup: avg), nonempty:true, by: {}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Detect cluster memory-request saturation
builtin:kubernetes.node.requests_memory:splitBy():sum/builtin:kubernetes.node.memory_allocatable:splitBy():sum*100.0
timeseries o1=sum(dt.kubernetes.container.requests_memory, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {}| join [timeseries operand=sum(dt.kubernetes.node.memory_allocatable, rollup: avg), nonempty:true, by: {}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Detect cluster pod-saturation
(builtin:kubernetes.node.pods:filter(and(eq(pod_condition,Ready))):splitBy():sum/builtin:kubernetes.node.pods_allocatable:splitBy():sum):default(0.0)*100.0
timeseries o1=sum(dt.kubernetes.pods, rollup: avg), nonempty:true, filter: {((pod_condition=="Ready"))}, by: {}| join [timeseries operand=sum(dt.kubernetes.node.pods_allocatable, rollup: avg), nonempty:true, by: {}], on: {interval}, fields: {o2=operand}| fieldsAdd result=if(isNull(o1[]/o2[]), 0.0, else: o1[]/o2[])* 100.0| fieldsRemove {o1,o2}
Detect cluster readiness issues
builtin:kubernetes.cluster.readyz:splitBy():sum
timeseries {sum(dt.kubernetes.cluster.readyz, rollup: avg)}, by: {}
Detect monitoring issues
This alert has no associated metric expression or DQL query.
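In each Detect entry above, the first line is the metric expression used by Kubernetes Classic and the second line is the equivalent DQL query used by the Kubernetes app. To check how close a cluster currently is to an alert threshold, you can run the alert's DQL query, for example in a Notebook, and add a threshold filter. The following sketch reuses the cluster CPU-request saturation query from above; the `peak_saturation` field, the `arrayMax` step, and the 90% threshold are illustrative additions, not the configured default.

```
timeseries o1=sum(dt.kubernetes.container.requests_cpu, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {}
| join [timeseries operand=sum(dt.kubernetes.node.cpu_allocatable, rollup: avg), nonempty:true, by: {}], on: {interval}, fields: {o2=operand}
// CPU-request saturation in percent, as the alert computes it
| fieldsAdd saturation = o1[] / o2[] * 100.0
| fieldsRemove {o1, o2}
// Illustrative preview step: keep only results whose peak exceeds the example 90% threshold
| fieldsAdd peak_saturation = arrayMax(saturation)
| filter peak_saturation >= 90
```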
Alert | Setting | Value |
---|---|---|
Readiness Issues | sample period in minutes | 3 |
Readiness Issues | observation period in minutes | 5 |
Monitoring Issues | sample period in minutes | 15 |
Monitoring Issues | observation period in minutes | 30 |
Namespace alerts
Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.254 | Resource | CPU-limit quota saturation | CPU-limit quota saturation exceeds the specified threshold. | 5 minutes | Sum of resource quota CPU used / Sum of resource quota CPU limits | Kubernetes Classic, Kubernetes app |
1.254 | Resource | CPU-request quota saturation | CPU-request quota saturation exceeds the specified threshold. | 5 minutes | Sum of resource quota CPU used / Sum of resource quota CPU requests | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Memory-limit quota saturation | Memory-limit quota saturation exceeds the specified threshold. | 5 minutes | Sum of resource quota memory used / Sum of resource quota memory limits | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Memory-request quota saturation | Memory-request quota saturation exceeds the specified threshold. | 5 minutes | Sum of resource quota memory used / Sum of resource quota memory requests | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Pod quota saturation | Pod quota saturation exceeds the specified threshold. | 5 minutes | Sum of resource quota pods used / Sum of resource quota pods limit | Kubernetes Classic, Kubernetes app |
Detect namespace CPU-limit quota saturation
builtin:kubernetes.resourcequota.limits_cpu_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.limits_cpu:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.limits_cpu_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.limits_cpu, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Detect namespace CPU-request quota saturation
builtin:kubernetes.resourcequota.requests_cpu_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.requests_cpu:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.requests_cpu_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.requests_cpu, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Detect namespace memory-limit quota saturation
builtin:kubernetes.resourcequota.limits_memory_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.limits_memory:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.limits_memory_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.limits_memory, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Detect namespace memory-request quota saturation
builtin:kubernetes.resourcequota.requests_memory_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.requests_memory:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.requests_memory_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.requests_memory, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Detect namespace pod quota saturation
builtin:kubernetes.resourcequota.pods_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.pods:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.pods_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.pods, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
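To preview which namespaces would currently exceed a given CPU-limit quota saturation threshold, you can extend the corresponding DQL query above with a threshold filter. A minimal sketch, assuming an illustrative 90% threshold (the actual threshold is whatever you configure for the alert):

```
timeseries {o1=sum(dt.kubernetes.resourcequota.limits_cpu_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.limits_cpu, rollup: avg)}, by: {k8s.namespace.name}
| fieldsAdd saturation = o1[] / o2[] * 100.0
| fieldsRemove {o1, o2}
// Keep only namespaces whose peak saturation in the query timeframe exceeds the example threshold
| fieldsAdd peak_saturation = arrayMax(saturation)
| filter peak_saturation >= 90
| sort peak_saturation desc
```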
Node alerts
Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.254 | Resource | CPU-request saturation on node | CPU-request saturation exceeds the specified threshold. | 5 minutes | Sum of node CPU requests / Sum of node CPU allocatable | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Memory-request saturation on node | Memory-request saturation exceeds the specified threshold. | 5 minutes | Sum of node memory requests / Sum of node memory allocatable | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Pod saturation on node | Pod saturation exceeds the specified threshold. | 5 minutes | Sum of running pods on node / Node pod limit | Kubernetes Classic, Kubernetes app |
1.254 | Availability | Node not ready | Node is not ready. | 5 minutes | Node condition metric filtered by 'not ready' | Kubernetes Classic, Kubernetes app |
1.264 | Error | Problematic node condition | Node has one or more problematic conditions out of the following: ContainerRuntimeProblem, ContainerRuntimeUnhealthy, CorruptDockerOverlay2, DiskPressure, FilesystemCorruptionProblem, FrequentContainerdRestart, FrequentDockerRestart, FrequentGcfsSnapshotterRestart, FrequentGcfsdRestart, FrequentKubeletRestart, FrequentUnregisterNetDevice, GcfsSnapshotterMissingLayer, GcfsSnapshotterUnhealthy, GcfsdUnhealthy, KernelDeadlock, KubeletProblem, KubeletUnhealthy, MemoryPressure, NetworkUnavailable, OutOfDisk, PIDPressure, ReadonlyFilesystem. | 5 minutes | Node conditions metric | Kubernetes Classic, Kubernetes app |
Detect node CPU-request saturation
builtin:kubernetes.node.requests_cpu:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum/builtin:kubernetes.node.cpu_allocatable:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum*100.0
timeseries o1=sum(dt.kubernetes.container.requests_cpu, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}| join [timeseries operand=sum(dt.kubernetes.node.cpu_allocatable, rollup: avg), nonempty:true, by: {dt.kubernetes.node.system_uuid,k8s.node.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Detect node memory-request saturation
builtin:kubernetes.node.requests_memory:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum/builtin:kubernetes.node.memory_allocatable:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum*100.0
timeseries o1=sum(dt.kubernetes.container.requests_memory, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}| join [timeseries operand=sum(dt.kubernetes.node.memory_allocatable, rollup: avg), nonempty:true, by: {dt.kubernetes.node.system_uuid,k8s.node.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Detect node pod-saturation
builtin:kubernetes.node.pods:filter(and(eq(pod_phase,Running))):splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum/builtin:kubernetes.node.pods_allocatable:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum*100.0
timeseries o1=sum(dt.kubernetes.pods, rollup: avg), nonempty:true, filter: {((pod_phase=="Running"))}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}| join [timeseries operand=sum(dt.kubernetes.node.pods_allocatable, rollup: avg), nonempty:true, by: {dt.kubernetes.node.system_uuid,k8s.node.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Detect node readiness issues
builtin:kubernetes.node.conditions:filter(and(eq(node_condition,Ready),ne(condition_status,True))):splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum
timeseries {sum(dt.kubernetes.node.conditions, rollup: avg)}, filter: {((node_condition=="Ready")AND(condition_status!=true))}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}
Detect problematic node conditions
builtin:kubernetes.node.conditions:filter(and(or(eq(node_condition,ContainerRuntimeProblem),eq(node_condition,ContainerRuntimeUnhealthy),eq(node_condition,CorruptDockerOverlay2),eq(node_condition,DiskPressure),eq(node_condition,FilesystemCorruptionProblem),eq(node_condition,FrequentContainerdRestart),eq(node_condition,FrequentDockerRestart),eq(node_condition,FrequentGcfsSnapshotterRestart),eq(node_condition,FrequentGcfsdRestart),eq(node_condition,FrequentKubeletRestart),eq(node_condition,FrequentUnregisterNetDevice),eq(node_condition,GcfsSnapshotterMissingLayer),eq(node_condition,GcfsSnapshotterUnhealthy),eq(node_condition,GcfsdUnhealthy),eq(node_condition,KernelDeadlock),eq(node_condition,KubeletProblem),eq(node_condition,KubeletUnhealthy),eq(node_condition,MemoryPressure),eq(node_condition,NetworkUnavailable),eq(node_condition,OutOfDisk),eq(node_condition,PIDPressure),eq(node_condition,ReadonlyFilesystem)),eq(condition_status,True))):splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum
timeseries {sum(dt.kubernetes.node.conditions, rollup: avg)}, filter: {(((node_condition=="ContainerRuntimeProblem")OR(node_condition=="ContainerRuntimeUnhealthy")OR(node_condition=="CorruptDockerOverlay2")OR(node_condition=="DiskPressure")OR(node_condition=="FilesystemCorruptionProblem")OR(node_condition=="FrequentContainerdRestart")OR(node_condition=="FrequentDockerRestart")OR(node_condition=="FrequentGcfsSnapshotterRestart")OR(node_condition=="FrequentGcfsdRestart")OR(node_condition=="FrequentKubeletRestart")OR(node_condition=="FrequentUnregisterNetDevice")OR(node_condition=="GcfsSnapshotterMissingLayer")OR(node_condition=="GcfsSnapshotterUnhealthy")OR(node_condition=="GcfsdUnhealthy")OR(node_condition=="KernelDeadlock")OR(node_condition=="KubeletProblem")OR(node_condition=="KubeletUnhealthy")OR(node_condition=="MemoryPressure")OR(node_condition=="NetworkUnavailable")OR(node_condition=="OutOfDisk")OR(node_condition=="PIDPressure")OR(node_condition=="ReadonlyFilesystem"))AND(condition_status==true))}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}
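To see which nodes are currently reporting one of the problematic conditions, you can run a reduced version of the DQL query above and split it by condition. The sketch below checks only a subset of the conditions (DiskPressure, MemoryPressure, PIDPressure, KernelDeadlock) to keep it short, and the `samples_active` field is an illustrative addition; extend the filter with the remaining conditions as needed.

```
timeseries condition_active = sum(dt.kubernetes.node.conditions, rollup: avg), filter: {((node_condition=="DiskPressure")OR(node_condition=="MemoryPressure")OR(node_condition=="PIDPressure")OR(node_condition=="KernelDeadlock"))AND(condition_status==true)}, by: {k8s.node.name, node_condition}
// A node/condition pair remains here only if the condition was active at some point in the timeframe
| fieldsAdd samples_active = arraySum(condition_active)
| filter samples_active > 0
```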
Alert | Setting | Value |
---|---|---|
Readiness Issues | sample period in minutes | 3 |
Readiness Issues | observation period in minutes | 5 |
Node Problematic Condition | sample period in minutes | 3 |
Node Problematic Condition | observation period in minutes | 5 |
Persistent volume claim alerts
Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.294 | Resource contention | Kubernetes PVC: Low disk space % | Available disk space for a persistent volume claim is below the threshold. | 5 minutes | Volume stats available bytes / Volume stats capacity bytes | Kubernetes Classic, Kubernetes app |
1.294 | Resource contention | Kubernetes PVC: Low disk space | Available disk space for a persistent volume claim is below the threshold. | 5 minutes | Kubelet volume stats available bytes metric | Kubernetes Classic, Kubernetes app |
Detect low disk space (%)
builtin:kubernetes.persistentvolumeclaim.available:splitBy(k8s.namespace.name,k8s.persistent_volume_claim.name):avg/builtin:kubernetes.persistentvolumeclaim.capacity:splitBy(k8s.namespace.name,k8s.persistent_volume_claim.name):avg*100.0
timeseries {o1=avg(dt.kubernetes.persistentvolumeclaim.available), o2=avg(dt.kubernetes.persistentvolumeclaim.capacity)}, by: {k8s.namespace.name, k8s.persistentvolumeclaim.name} | fieldsAdd result=o1[]/o2[]* 100.0 | fieldsRemove {o1,o2}
Detect low disk space (MiB)
builtin:kubernetes.persistentvolumeclaim.available:splitBy(k8s.namespace.name,k8s.persistent_volume_claim.name):avg
timeseries {avg(dt.kubernetes.persistentvolumeclaim.available)}, by: {k8s.namespace.name,k8s.persistentvolumeclaim.name}
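To list persistent volume claims that are running low on space, you can extend the percentage-based query above with a threshold filter. A sketch, assuming an illustrative 10% free-space threshold; the `lowest_free_percent` field is an addition for preview purposes:

```
timeseries {o1=avg(dt.kubernetes.persistentvolumeclaim.available), o2=avg(dt.kubernetes.persistentvolumeclaim.capacity)}, by: {k8s.namespace.name, k8s.persistentvolumeclaim.name}
| fieldsAdd free_percent = o1[] / o2[] * 100.0
| fieldsRemove {o1, o2}
// Keep only claims that dropped below the example threshold at some point in the timeframe
| fieldsAdd lowest_free_percent = arrayMin(free_percent)
| filter lowest_free_percent < 10
| sort lowest_free_percent asc
```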
Workload alerts
Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.264 | Resource | CPU usage close to limits | The CPU usage exceeds the threshold in terms of the defined CPU limit. | 5 minutes | Sum of workload CPU usage / Sum of workload CPU limits | Kubernetes Classic, Kubernetes app |
1.254 | Error | Container restarts | Observed container restarts exceed the specified threshold. | 5 minutes | Container restarts metric | Kubernetes Classic, Kubernetes app |
1.264 | Resource | High CPU throttling | The CPU throttling to usage ratio exceeds the specified threshold. | 5 minutes | Sum of workload CPU throttled / Sum of workload CPU usage | Kubernetes Classic, Kubernetes app |
1.268 | Error | Job failure event | Events with reason 'BackoffLimitExceeded', 'DeadlineExceeded', or 'PodFailurePolicy' have been detected. | 60 minutes | Event metric filtered by reason and workload kind | Kubernetes Classic, Kubernetes app |
1.264 | Resource | Memory usage close to limits | The memory usage (working set memory) exceeds the threshold in terms of the defined memory limit. | 5 minutes | Sum of workload working set memory / Sum of workload memory limits | Kubernetes Classic, Kubernetes app |
1.268 | Error | Out-of-memory kills | Out-of-memory kills have been observed for pods of this workload. | 5 minutes | Out-of-memory kills metric | Kubernetes Classic, Kubernetes app |
1.268 | Error | Backoff event | Events with reason 'BackOff' have been detected for pods of this workload. Check for pods with status 'ImagePullBackOff' or 'CrashLoopBackOff'. | 15 minutes | Event metric filtered by reason | Kubernetes Classic, Kubernetes app |
1.268 | Error | Pod eviction event | Events with reason 'Evicted' have been detected for pods of this workload. | 60 minutes | Event metric filtered by reason | Kubernetes Classic, Kubernetes app |
1.268 | Error | Preemption event | Events with reasons 'Preempted' or 'Preempting' have been detected for pods of this workload. | 60 minutes | Event metric filtered by reason | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Pods stuck in pending | Workload has pending pods. | 5 minutes | Pods metric filtered by phase 'Pending' | Kubernetes Classic, Kubernetes app |
1.260 | Resource | Pods stuck in terminating | Workload has pods stuck in terminating. | 5 minutes | Pods metric filtered by status 'Terminating' | Kubernetes Classic, Kubernetes app |
1.260 | Error | Deployment stuck | Deployment is stuck and therefore is no longer progressing. | 5 minutes | Workload condition metric filtered by 'not progressing' | Kubernetes Classic, Kubernetes app |
1.258 | Error | Not all pods ready | Workload has pods that are not ready. | 5 minutes | Sum of non-failed pods - Sum of non-failed and ready pods | Kubernetes Classic, Kubernetes app |
1.254 | Error | No pod ready | Workload does not have any ready pods. | 5 minutes | Sum of non-failed pods - Sum of non-failed and non-ready pods | Kubernetes Classic, Kubernetes app |
Detect CPU usage saturation
(builtin:kubernetes.workload.cpu_usage:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.limits_cpu:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0
timeseries o1=sum(dt.kubernetes.container.cpu_usage, rollup: avg), nonempty:true, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| join [timeseries operand=sum(dt.kubernetes.container.limits_cpu, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=if(isNull(o1[]/o2[]), 0.0, else: o1[]/o2[])* 100.0| fieldsRemove {o1,o2}
Detect container restarts
builtin:kubernetes.container.restarts:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.container.restarts, default:0.0, rollup: avg)}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Detect high CPU throttling
(builtin:kubernetes.workload.cpu_throttled:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.cpu_usage:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0
timeseries {o1=sum(dt.kubernetes.container.cpu_throttled, rollup: avg), o2=sum(dt.kubernetes.container.cpu_usage, rollup: avg)}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| fieldsAdd result=if(isNull(o1[]/o2[]), 0.0, else: o1[]/o2[])* 100.0| fieldsRemove {o1,o2}
Detect job failure events
builtin:kubernetes.events:filter(and(or(eq(k8s.event.reason,BackoffLimitExceeded),eq(k8s.event.reason,DeadlineExceeded),eq(k8s.event.reason,PodFailurePolicy)),or(eq(k8s.workload.kind,job),eq(k8s.workload.kind,cronjob)))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.events, default:0.0, rollup: avg)}, filter: {(((k8s.event.reason=="BackoffLimitExceeded")OR(k8s.event.reason=="DeadlineExceeded")OR(k8s.event.reason=="PodFailurePolicy"))AND((k8s.workload.kind=="job")OR(k8s.workload.kind=="cronjob")))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Detect memory usage saturation
(builtin:kubernetes.workload.memory_working_set:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.limits_memory:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0
timeseries o1=sum(dt.kubernetes.container.memory_working_set, rollup: avg), nonempty:true, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| join [timeseries operand=sum(dt.kubernetes.container.limits_memory, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=if(isNull(o1[]/o2[]), 0.0, else: o1[]/o2[])* 100.0| fieldsRemove {o1,o2}
Detect out-of-memory kills
builtin:kubernetes.container.oom_kills:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.container.oom_kills, default:0.0, rollup: avg)}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Detect pod backoff events
builtin:kubernetes.events:filter(and(eq(k8s.event.reason,BackOff))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.events, default:0.0, rollup: avg)}, filter: {((k8s.event.reason=="BackOff"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Detect pod eviction events
builtin:kubernetes.events:filter(and(eq(k8s.event.reason,Evicted))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.events, default:0.0, rollup: avg)}, filter: {((k8s.event.reason=="Evicted"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Detect pod preemption events
builtin:kubernetes.events:filter(or(eq(k8s.event.reason,Preempted),eq(k8s.event.reason,Preempting))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.events, default:0.0, rollup: avg)}, filter: {((k8s.event.reason=="Preempted")OR(k8s.event.reason=="Preempting"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Detect pods stuck in pending
builtin:kubernetes.pods:filter(and(eq(pod_phase,Pending))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum
timeseries {sum(dt.kubernetes.pods, rollup: avg)}, filter: {((pod_phase=="Pending"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Detect pods stuck in terminating
builtin:kubernetes.pods:filter(and(eq(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum
timeseries {sum(dt.kubernetes.pods, rollup: avg)}, filter: {((pod_status=="Terminating"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Detect stuck deployments
builtin:kubernetes.workload.conditions:filter(and(eq(workload_condition,Progressing),eq(condition_status,False))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum
timeseries {sum(dt.kubernetes.workload.conditions, rollup: avg)}, filter: {((workload_condition=="Progressing")AND(condition_status==false))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Detect workloads with non-ready pods
builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),ne(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum-builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),eq(pod_condition,Ready),ne(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries o1=sum(dt.kubernetes.pods, rollup: avg), filter: {((pod_phase!="Failed")AND(pod_phase!="Succeeded")AND(k8s.workload.kind!="job")AND(k8s.workload.kind!="cronjob")AND(pod_status!="Terminating"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| join [timeseries operand=sum(dt.kubernetes.pods, default:0.0, rollup: avg), nonempty:true, filter: {((pod_phase!="Failed")AND(pod_phase!="Succeeded")AND(k8s.workload.kind!="job")AND(k8s.workload.kind!="cronjob")AND(pod_condition=="Ready")AND(pod_status!="Terminating"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]-o2[]| fieldsRemove {o1,o2}
Detect workloads without ready pods
builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum-builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),ne(pod_condition,Ready))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries o1=sum(dt.kubernetes.pods, rollup: avg), filter: {((pod_phase!="Failed")AND(pod_phase!="Succeeded")AND(k8s.workload.kind!="job")AND(k8s.workload.kind!="cronjob"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| join [timeseries operand=sum(dt.kubernetes.pods, default:0.0, rollup: avg), nonempty:true, filter: {((pod_phase!="Failed")AND(pod_phase!="Succeeded")AND(k8s.workload.kind!="job")AND(k8s.workload.kind!="cronjob")AND(pod_condition!="Ready"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]-o2[]| fieldsRemove {o1,o2}
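To preview which workloads would trigger the container restarts alert with its default threshold of 1 restart (see the settings table below), you can total the restarts metric per workload over the query timeframe. A sketch based on the container restarts DQL query above; the `total_restarts` field is an illustrative addition:

```
timeseries restarts = sum(dt.kubernetes.container.restarts, default:0.0, rollup: avg), by: {k8s.namespace.name, k8s.workload.kind, k8s.workload.name}
// Total restarts per workload over the query timeframe
| fieldsAdd total_restarts = arraySum(restarts)
| filter total_restarts >= 1
| sort total_restarts desc
```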
Alert | Setting | Value |
---|---|---|
Container Restarts | threshold | 1 |
Container Restarts | sample period in minutes | 3 |
Container Restarts | observation period in minutes | 5 |
Deployment Stuck | sample period in minutes | 3 |
Deployment Stuck | observation period in minutes | 5 |
Pending Pods | threshold | 1 |
Pending Pods | sample period in minutes | 10 |
Pending Pods | observation period in minutes | 15 |
Pod Stuck In Terminating | sample period in minutes | 10 |
Pod Stuck In Terminating | observation period in minutes | 15 |
Workload Without Ready Pods | sample period in minutes | 10 |
Workload Without Ready Pods | observation period in minutes | 15 |
OOM Kills | alert | always |
Job Failure Events | alert | always |
Pod Backoff Events | alert | always |
Pod Eviction Events | alert | always |
Pod Preemption Events | alert | always |