Dynatrace version 1.254+
ActiveGate version 1.253+
To alert on common Kubernetes platform issues, follow the instructions below.
There are three ways to configure alerts for common Kubernetes/OpenShift issues.
Configuring an alert on a different level is only intended to simplify the configuration of multiple entities at once. It does not change the behavior of an alert.
For example, enabling a workload CPU usage saturation alert will still evaluate and raise problems for each Kubernetes workload separately, even if it has been configured on the Kubernetes cluster level.
For further details on the settings hierarchy, see Settings documentation.
Manually closed problems will show up again after 60 days if their root cause remains unresolved.
You can view alerts on the Problems page and in the Events section of a cluster details page.
In the Events section, select the event to navigate to Data Explorer for more information about the metric that generated the event.
See below for a list of available alerts.
Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.254 | Resource | CPU-request saturation on cluster | CPU-request saturation exceeds the specified threshold. | 10 minutes | Node CPU requests / Node CPU allocatable | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Memory-request saturation on cluster | Memory-request saturation exceeds the specified threshold. | 10 minutes | Node memory requests / Node memory allocatable | Kubernetes Classic, Kubernetes app |
1.258 | Resource | Pod saturation on cluster | Cluster pod-saturation exceeds the specified threshold. | 10 minutes | Sum of ready pods / Sum of allocatable pods | Kubernetes Classic, Kubernetes app |
1.254 | Availability | Cluster not ready | Readyz endpoint indicates that this cluster is not ready. | 10 minutes | Cluster readyz metric | Kubernetes Classic, Kubernetes app |
1.258 | Availability | Monitoring not available | Dynatrace API monitoring is not available. | 10 minutes | | Kubernetes Classic, Kubernetes app |

CPU-request saturation on cluster (metric expression, then DQL):
builtin:kubernetes.node.requests_cpu:splitBy():sum/builtin:kubernetes.node.cpu_allocatable:splitBy():sum*100.0
timeseries o1=sum(dt.kubernetes.container.requests_cpu, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {}| join [timeseries operand=sum(dt.kubernetes.node.cpu_allocatable, rollup: avg), nonempty:true, by: {}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Memory-request saturation on cluster (metric expression, then DQL):
builtin:kubernetes.node.requests_memory:splitBy():sum/builtin:kubernetes.node.memory_allocatable:splitBy():sum*100.0
timeseries o1=sum(dt.kubernetes.container.requests_memory, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {}| join [timeseries operand=sum(dt.kubernetes.node.memory_allocatable, rollup: avg), nonempty:true, by: {}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Pod saturation on cluster (metric expression, then DQL):
(builtin:kubernetes.node.pods:filter(and(eq(pod_condition,Ready))):splitBy():sum/builtin:kubernetes.node.pods_allocatable:splitBy():sum):default(0.0)*100.0
timeseries o1=sum(dt.kubernetes.pods, rollup: avg), nonempty:true, filter: {((pod_condition=="Ready"))}, by: {}| join [timeseries operand=sum(dt.kubernetes.node.pods_allocatable, rollup: avg), nonempty:true, by: {}], on: {interval}, fields: {o2=operand}| fieldsAdd result=if(isNull(o1[]/o2[]), 0.0, else: o1[]/o2[])* 100.0| fieldsRemove {o1,o2}
Cluster not ready (metric expression, then DQL):
builtin:kubernetes.cluster.readyz:splitBy():sum
timeseries {sum(dt.kubernetes.cluster.readyz, rollup: avg)}, by: {}
Monitoring not available (metric expression, then DQL):
(no metric expression)
(no DQL)

Alert | Setting | Value |
---|---|---|
Readiness Issues | sample period in minutes | 3 |
Readiness Issues | observation period in minutes | 5 |
Monitoring Issues | sample period in minutes | 15 |
Monitoring Issues | observation period in minutes | 30 |

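The sample and observation periods listed above can be read as a sliding-window rule. The following Python sketch illustrates that reading under a stated assumption: an alert opens once at least "sample period" one-minute samples within the most recent "observation period" samples violate the condition. The function name `evaluate_alert` and its parameters are hypothetical, and the sketch does not model de-alerting.

```python
from collections import deque
from typing import Iterable, List


def evaluate_alert(samples: Iterable[bool], sample_period: int,
                   observation_period: int) -> List[bool]:
    """Return, for each one-minute sample, whether the alert condition is met.

    samples: True where the one-minute sample violates the alert condition.
    Assumption (not taken from the source): the condition is met when at
    least `sample_period` of the last `observation_period` samples violate.
    """
    # Keep only the most recent `observation_period` samples.
    window = deque(maxlen=observation_period)
    states = []
    for violating in samples:
        window.append(violating)
        # sum() counts the True entries currently in the window.
        states.append(sum(window) >= sample_period)
    return states


if __name__ == "__main__":
    # Readiness Issues defaults: sample period 3, observation period 5.
    minutes = [True, True, False, True, False, False, False, False]
    print(evaluate_alert(minutes, sample_period=3, observation_period=5))
```

With the Readiness Issues defaults, two violating minutes are not enough; the condition is met only once a third violating minute falls inside the same five-minute window.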
Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.254 | Resource | CPU-limit quota saturation | CPU-limit quota saturation exceeds the specified threshold. | 10 minutes | Sum of resource quota CPU used / Sum of resource quota CPU limits | Kubernetes Classic, Kubernetes app |
1.254 | Resource | CPU-request quota saturation | CPU-request quota saturation exceeds the specified threshold. | 10 minutes | Sum of resource quota CPU used / Sum of resource quota CPU requests | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Memory-limit quota saturation | Memory-limit quota saturation exceeds the specified threshold. | 10 minutes | Sum of resource quota memory used / Sum of resource quota memory limits | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Memory-request quota saturation | Memory-request quota saturation exceeds the specified threshold. | 10 minutes | Sum of resource quota memory used / Sum of resource quota memory requests | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Pod quota saturation | Pod quota saturation exceeds the specified threshold. | 10 minutes | Sum of resource quota pods used / Sum of resource quota pods limit | Kubernetes Classic, Kubernetes app |

CPU-limit quota saturation (metric expression, then DQL):
builtin:kubernetes.resourcequota.limits_cpu_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.limits_cpu:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.limits_cpu_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.limits_cpu, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
CPU-request quota saturation (metric expression, then DQL):
builtin:kubernetes.resourcequota.requests_cpu_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.requests_cpu:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.requests_cpu_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.requests_cpu, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Memory-limit quota saturation (metric expression, then DQL):
builtin:kubernetes.resourcequota.limits_memory_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.limits_memory:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.limits_memory_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.limits_memory, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Memory-request quota saturation (metric expression, then DQL):
builtin:kubernetes.resourcequota.requests_memory_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.requests_memory:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.requests_memory_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.requests_memory, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Pod quota saturation (metric expression, then DQL):
builtin:kubernetes.resourcequota.pods_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.pods:splitBy(k8s.namespace.name):sum*100.0
timeseries {o1=sum(dt.kubernetes.resourcequota.pods_used, rollup: avg), o2=sum(dt.kubernetes.resourcequota.pods, rollup: avg)}, by: {k8s.namespace.name}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
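The quota saturation expressions above all follow one pattern: sum the used value and the quota value per namespace, divide, and multiply by 100. The Python sketch below reproduces that arithmetic on in-memory records. The function name `quota_saturation` and the record layout are assumptions made for illustration, and returning `None` for a zero quota is a choice of this sketch rather than documented Dynatrace behavior.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def quota_saturation(
    records: Iterable[Tuple[str, float, float]]
) -> Dict[str, Optional[float]]:
    """Compute quota saturation in percent per namespace.

    records: (namespace, used, quota) tuples, one per resource quota
    (hypothetical layout). Values are summed per namespace before dividing,
    mirroring the splitBy(k8s.namespace.name):sum step of the expressions.
    """
    used = defaultdict(float)
    quota = defaultdict(float)
    for namespace, used_value, quota_value in records:
        used[namespace] += used_value
        quota[namespace] += quota_value
    # Avoid division by zero: report None when no quota is defined.
    return {
        ns: (used[ns] / quota[ns] * 100.0 if quota[ns] else None)
        for ns in used
    }


if __name__ == "__main__":
    sample = [("shop", 1.5, 4.0), ("shop", 0.5, 4.0), ("infra", 3.0, 4.0)]
    print(quota_saturation(sample))
```

An alert would then compare each namespace's percentage against the configured threshold.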
Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.254 | Resource | CPU-request saturation on node | CPU-request saturation exceeds the specified threshold. | 10 minutes | Sum of node CPU requests / Sum of node CPU allocatable | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Memory-request saturation on node | Memory-request saturation exceeds the specified threshold. | 10 minutes | Sum of node memory requests / Sum of node memory allocatable | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Pod saturation on node | Pod saturation exceeds the specified threshold. | 10 minutes | Sum of running pods on node / Node pod limit | Kubernetes Classic, Kubernetes app |
1.254 | Availability | Node not ready | Node is not ready. | 10 minutes | Node condition metric filtered by 'not ready' | Kubernetes Classic, Kubernetes app |
1.264 | Error | Problematic node condition | Node has one or more problematic conditions out of the following: ContainerRuntimeProblem, ContainerRuntimeUnhealthy, CorruptDockerOverlay2, DiskPressure, FilesystemCorruptionProblem, FrequentContainerdRestart, FrequentDockerRestart, FrequentGcfsSnapshotterRestart, FrequentGcfsdRestart, FrequentKubeletRestart, FrequentUnregisterNetDevice, GcfsSnapshotterMissingLayer, GcfsSnapshotterUnhealthy, GcfsdUnhealthy, KernelDeadlock, KubeletProblem, KubeletUnhealthy, MemoryPressure, NetworkUnavailable, OutOfDisk, PIDPressure, ReadonlyFilesystem | 10 minutes | Nodes condition metric | Kubernetes Classic, Kubernetes app |

CPU-request saturation on node (metric expression, then DQL):
builtin:kubernetes.node.requests_cpu:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum/builtin:kubernetes.node.cpu_allocatable:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum*100.0
timeseries o1=sum(dt.kubernetes.container.requests_cpu, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}| join [timeseries operand=sum(dt.kubernetes.node.cpu_allocatable, rollup: avg), nonempty:true, by: {dt.kubernetes.node.system_uuid,k8s.node.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Memory-request saturation on node (metric expression, then DQL):
builtin:kubernetes.node.requests_memory:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum/builtin:kubernetes.node.memory_allocatable:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum*100.0
timeseries o1=sum(dt.kubernetes.container.requests_memory, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}| join [timeseries operand=sum(dt.kubernetes.node.memory_allocatable, rollup: avg), nonempty:true, by: {dt.kubernetes.node.system_uuid,k8s.node.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Pod saturation on node (metric expression, then DQL):
builtin:kubernetes.node.pods:filter(and(eq(pod_phase,Running))):splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum/builtin:kubernetes.node.pods_allocatable:splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum*100.0
timeseries o1=sum(dt.kubernetes.pods, rollup: avg), nonempty:true, filter: {((pod_phase=="Running"))}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}| join [timeseries operand=sum(dt.kubernetes.node.pods_allocatable, rollup: avg), nonempty:true, by: {dt.kubernetes.node.system_uuid,k8s.node.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]/o2[]* 100.0| fieldsRemove {o1,o2}
Node not ready (metric expression, then DQL):
builtin:kubernetes.node.conditions:filter(and(eq(node_condition,Ready),ne(condition_status,True))):splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum
timeseries {sum(dt.kubernetes.node.conditions, rollup: avg)}, filter: {((node_condition=="Ready")AND(condition_status!=true))}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}
Problematic node condition (metric expression, then DQL):
builtin:kubernetes.node.conditions:filter(and(or(eq(node_condition,ContainerRuntimeProblem),eq(node_condition,ContainerRuntimeUnhealthy),eq(node_condition,CorruptDockerOverlay2),eq(node_condition,DiskPressure),eq(node_condition,FilesystemCorruptionProblem),eq(node_condition,FrequentContainerdRestart),eq(node_condition,FrequentDockerRestart),eq(node_condition,FrequentGcfsSnapshotterRestart),eq(node_condition,FrequentGcfsdRestart),eq(node_condition,FrequentKubeletRestart),eq(node_condition,FrequentUnregisterNetDevice),eq(node_condition,GcfsSnapshotterMissingLayer),eq(node_condition,GcfsSnapshotterUnhealthy),eq(node_condition,GcfsdUnhealthy),eq(node_condition,KernelDeadlock),eq(node_condition,KubeletProblem),eq(node_condition,KubeletUnhealthy),eq(node_condition,MemoryPressure),eq(node_condition,NetworkUnavailable),eq(node_condition,OutOfDisk),eq(node_condition,PIDPressure),eq(node_condition,ReadonlyFilesystem)),eq(condition_status,True))):splitBy(dt.kubernetes.node.system_uuid,k8s.node.name):sum
timeseries {sum(dt.kubernetes.node.conditions, rollup: avg)}, filter: {(((node_condition=="ContainerRuntimeProblem")OR(node_condition=="ContainerRuntimeUnhealthy")OR(node_condition=="CorruptDockerOverlay2")OR(node_condition=="DiskPressure")OR(node_condition=="FilesystemCorruptionProblem")OR(node_condition=="FrequentContainerdRestart")OR(node_condition=="FrequentDockerRestart")OR(node_condition=="FrequentGcfsSnapshotterRestart")OR(node_condition=="FrequentGcfsdRestart")OR(node_condition=="FrequentKubeletRestart")OR(node_condition=="FrequentUnregisterNetDevice")OR(node_condition=="GcfsSnapshotterMissingLayer")OR(node_condition=="GcfsSnapshotterUnhealthy")OR(node_condition=="GcfsdUnhealthy")OR(node_condition=="KernelDeadlock")OR(node_condition=="KubeletProblem")OR(node_condition=="KubeletUnhealthy")OR(node_condition=="MemoryPressure")OR(node_condition=="NetworkUnavailable")OR(node_condition=="OutOfDisk")OR(node_condition=="PIDPressure")OR(node_condition=="ReadonlyFilesystem"))AND(condition_status==true))}, by: {dt.kubernetes.node.system_uuid,k8s.node.name}

Alert | Setting | Value |
---|---|---|
Readiness Issues | sample period in minutes | 3 |
Readiness Issues | observation period in minutes | 5 |
Node Problematic Condition | sample period in minutes | 3 |
Node Problematic Condition | observation period in minutes | 5 |

Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.294 | RESOURCE_CONTENTION | Kubernetes PVC: Low disk space % | Available disk space for a persistent volume claim is below the threshold. | 10 minutes | Volume stats available bytes / Volume stats capacity bytes | Kubernetes Classic, Kubernetes app |
1.294 | RESOURCE_CONTENTION | Kubernetes PVC: Low disk space | Available disk space for a persistent volume claim is below the threshold. | 10 minutes | Kubelet volume stats available bytes metric | Kubernetes Classic, Kubernetes app |

Kubernetes PVC: Low disk space % (metric expression, then DQL):
builtin:kubernetes.persistentvolumeclaim.available:splitBy(k8s.namespace.name,k8s.persistent_volume_claim.name):avg/builtin:kubernetes.persistentvolumeclaim.capacity:splitBy(k8s.namespace.name,k8s.persistent_volume_claim.name):avg*100.0
timeseries {o1=avg(dt.kubernetes.persistentvolumeclaim.available), o2=avg(dt.kubernetes.persistentvolumeclaim.capacity)}, by: {k8s.namespace.name, k8s.persistentvolumeclaim.name} | fieldsAdd result=o1[]/o2[]* 100.0 | fieldsRemove {o1,o2}
Kubernetes PVC: Low disk space (metric expression, then DQL):
builtin:kubernetes.persistentvolumeclaim.available:splitBy(k8s.namespace.name,k8s.persistent_volume_claim.name):avg
timeseries {avg(dt.kubernetes.persistentvolumeclaim.available)}, by: {k8s.namespace.name,k8s.persistentvolumeclaim.name}
Dynatrace version | Problem type | Problem title | Problem description | De-alerts after | Calculation | Supported in |
---|---|---|---|---|---|---|
1.264 | Resource | CPU usage close to limits | The CPU usage exceeds the threshold in terms of the defined CPU limit. | 10 minutes | Sum of workload CPU usage / Sum of workload CPU limits | Kubernetes Classic, Kubernetes app |
1.254 | Error | Container restarts | Observed container restarts exceed the specified threshold. | 15 minutes | Container restarts metric | Kubernetes Classic, Kubernetes app |
1.264 | Resource | High CPU throttling | The CPU throttling to usage ratio exceeds the specified threshold. | 10 minutes | Sum of workload CPU throttled / Sum of workload CPU usage | Kubernetes Classic, Kubernetes app |
1.268 | Error | Job failure event | Events with reason 'BackoffLimitExceeded', 'DeadlineExceeded', or 'PodFailurePolicy' have been detected. | 60 minutes | Event metric filtered by reason and workload kind | Kubernetes Classic, Kubernetes app |
1.264 | Resource | Memory usage close to limits | The memory usage (working set memory) exceeds the threshold in terms of the defined memory limit. | 10 minutes | Sum of workload working set memory / Sum of workload memory limits | Kubernetes Classic, Kubernetes app |
1.268 | Error | Out-of-memory kills | Out-of-memory kills have been observed for pods of this workload. | 15 minutes | Out-of-memory kills metric | Kubernetes Classic, Kubernetes app |
1.268 | Error | Backoff event | Events with reason 'BackOff' have been detected for pods of this workload. Check for pods with status 'ImagePullBackOff' or 'CrashLoopBackOff'. | 15 minutes | Event metric filtered by reason | Kubernetes Classic, Kubernetes app |
1.268 | Error | Pod eviction event | Events with reason 'Evicted' have been detected for pods of this workload. | 60 minutes | Event metric filtered by reason | Kubernetes Classic, Kubernetes app |
1.268 | Error | Preemption event | Events with reasons 'Preempted' or 'Preempting' have been detected for pods of this workload. | 60 minutes | Event metric filtered by reason | Kubernetes Classic, Kubernetes app |
1.254 | Resource | Pods stuck in pending | Workload has pending pods. | 10 minutes | Pods metric filtered by phase 'Pending' | Kubernetes Classic, Kubernetes app |
1.260 | Resource | Pods stuck in terminating | Workload has pods stuck in terminating. | 10 minutes | Pods metric filtered by status 'Terminating' | Kubernetes Classic, Kubernetes app |
1.260 | Error | Deployment stuck | Deployment is stuck and therefore is no longer progressing. | 10 minutes | Workload condition metric filtered by 'not progressing' | Kubernetes Classic, Kubernetes app |
1.258 | Error | Not all pods ready | Workload has pods that are not ready. | 10 minutes | Sum of non-failed pods - Sum of non-failed and ready pods | Kubernetes Classic, Kubernetes app |
1.254 | Error | No pod ready | Workload does not have any ready pods. | 10 minutes | Sum of non-failed pods - Sum of non-failed and non-ready pods | Kubernetes Classic, Kubernetes app |

CPU usage close to limits (metric expression, then DQL):
(builtin:kubernetes.workload.cpu_usage:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.limits_cpu:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0
timeseries o1=sum(dt.kubernetes.container.cpu_usage, rollup: avg), nonempty:true, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| join [timeseries operand=sum(dt.kubernetes.container.limits_cpu, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=if(isNull(o1[]/o2[]), 0.0, else: o1[]/o2[])* 100.0| fieldsRemove {o1,o2}
Container restarts (metric expression, then DQL):
builtin:kubernetes.container.restarts:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.container.restarts, default:0.0, rollup: avg)}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
High CPU throttling (metric expression, then DQL):
(builtin:kubernetes.workload.cpu_throttled:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.cpu_usage:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0
timeseries {o1=sum(dt.kubernetes.container.cpu_throttled, rollup: avg), o2=sum(dt.kubernetes.container.cpu_usage, rollup: avg)}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| fieldsAdd result=if(isNull(o1[]/o2[]), 0.0, else: o1[]/o2[])* 100.0| fieldsRemove {o1,o2}
Job failure event (metric expression, then DQL):
builtin:kubernetes.events:filter(and(or(eq(k8s.event.reason,BackoffLimitExceeded),eq(k8s.event.reason,DeadlineExceeded),eq(k8s.event.reason,PodFailurePolicy)),or(eq(k8s.workload.kind,job),eq(k8s.workload.kind,cronjob)))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.events, default:0.0, rollup: avg)}, filter: {(((k8s.event.reason=="BackoffLimitExceeded")OR(k8s.event.reason=="DeadlineExceeded")OR(k8s.event.reason=="PodFailurePolicy"))AND((k8s.workload.kind=="job")OR(k8s.workload.kind=="cronjob")))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Memory usage close to limits (metric expression, then DQL):
(builtin:kubernetes.workload.memory_working_set:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.limits_memory:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0
timeseries o1=sum(dt.kubernetes.container.memory_working_set, rollup: avg), nonempty:true, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| join [timeseries operand=sum(dt.kubernetes.container.limits_memory, rollup: avg), nonempty:true, filter: {(dt.kubernetes.container.type=="app")}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=if(isNull(o1[]/o2[]), 0.0, else: o1[]/o2[])* 100.0| fieldsRemove {o1,o2}
Out-of-memory kills (metric expression, then DQL):
builtin:kubernetes.container.oom_kills:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.container.oom_kills, default:0.0, rollup: avg)}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Backoff event (metric expression, then DQL):
builtin:kubernetes.events:filter(and(eq(k8s.event.reason,BackOff))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.events, default:0.0, rollup: avg)}, filter: {((k8s.event.reason=="BackOff"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Pod eviction event (metric expression, then DQL):
builtin:kubernetes.events:filter(and(eq(k8s.event.reason,Evicted))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.events, default:0.0, rollup: avg)}, filter: {((k8s.event.reason=="Evicted"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Preemption event (metric expression, then DQL):
builtin:kubernetes.events:filter(or(eq(k8s.event.reason,Preempted),eq(k8s.event.reason,Preempting))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries {sum(dt.kubernetes.events, default:0.0, rollup: avg)}, filter: {((k8s.event.reason=="Preempted")OR(k8s.event.reason=="Preempting"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Pods stuck in pending (metric expression, then DQL):
builtin:kubernetes.pods:filter(and(eq(pod_phase,Pending))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum
timeseries {sum(dt.kubernetes.pods, rollup: avg)}, filter: {((pod_phase=="Pending"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Pods stuck in terminating (metric expression, then DQL):
builtin:kubernetes.pods:filter(and(eq(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum
timeseries {sum(dt.kubernetes.pods, rollup: avg)}, filter: {((pod_status=="Terminating"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Deployment stuck (metric expression, then DQL):
builtin:kubernetes.workload.conditions:filter(and(eq(workload_condition,Progressing),eq(condition_status,False))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum
timeseries {sum(dt.kubernetes.workload.conditions, rollup: avg)}, filter: {((workload_condition=="Progressing")AND(condition_status==false))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}
Not all pods ready (metric expression, then DQL):
builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),ne(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum-builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),eq(pod_condition,Ready),ne(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries o1=sum(dt.kubernetes.pods, rollup: avg), filter: {((pod_phase!="Failed")AND(pod_phase!="Succeeded")AND(k8s.workload.kind!="job")AND(k8s.workload.kind!="cronjob")AND(pod_status!="Terminating"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| join [timeseries operand=sum(dt.kubernetes.pods, default:0.0, rollup: avg), nonempty:true, filter: {((pod_phase!="Failed")AND(pod_phase!="Succeeded")AND(k8s.workload.kind!="job")AND(k8s.workload.kind!="cronjob")AND(pod_condition=="Ready")AND(pod_status!="Terminating"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]-o2[]| fieldsRemove {o1,o2}
No pod ready (metric expression, then DQL):
builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum-builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),ne(pod_condition,Ready))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0)
timeseries o1=sum(dt.kubernetes.pods, rollup: avg), filter: {((pod_phase!="Failed")AND(pod_phase!="Succeeded")AND(k8s.workload.kind!="job")AND(k8s.workload.kind!="cronjob"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}| join [timeseries operand=sum(dt.kubernetes.pods, default:0.0, rollup: avg), nonempty:true, filter: {((pod_phase!="Failed")AND(pod_phase!="Succeeded")AND(k8s.workload.kind!="job")AND(k8s.workload.kind!="cronjob")AND(pod_condition!="Ready"))}, by: {k8s.namespace.name,k8s.workload.kind,k8s.workload.name}], on: {interval}, fields: {o2=operand}| fieldsAdd result=o1[]-o2[]| fieldsRemove {o1,o2}
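The last two expressions both subtract one pod count from another, but they yield different values: the first leaves the number of pods that are not ready, the second leaves the number of pods that are ready. The Python sketch below mirrors that arithmetic for the pods of a single workload. The `Pod` class and `readiness_values` function are hypothetical names, the workload is assumed not to be a job or cron job (the expressions filter those out), and how the results are compared against a threshold is not covered.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Pod:
    phase: str                 # e.g. "Running", "Pending", "Failed", "Succeeded"
    ready: bool                # state of the pod condition Ready
    terminating: bool = False  # True when the pod status is Terminating


def readiness_values(pods: Iterable[Pod]) -> Tuple[int, int]:
    """Return (not_ready_count, ready_count) for one workload's pods."""
    # Both expressions ignore pods in phase Failed or Succeeded.
    active = [p for p in pods if p.phase not in ("Failed", "Succeeded")]

    # "Not all pods ready": terminating pods are excluded on both sides,
    # then the ready pods are subtracted from the total.
    live = [p for p in active if not p.terminating]
    not_ready_count = len(live) - sum(1 for p in live if p.ready)

    # "No pod ready": the non-ready pods are subtracted from the total,
    # which leaves the number of ready pods.
    ready_count = len(active) - sum(1 for p in active if not p.ready)
    return not_ready_count, ready_count


if __name__ == "__main__":
    workload = [
        Pod("Running", ready=True),
        Pod("Running", ready=False),
        Pod("Pending", ready=False),
        Pod("Failed", ready=False),
        Pod("Running", ready=True, terminating=True),
    ]
    print(readiness_values(workload))
```

Under this reading, a non-zero first value corresponds to "Not all pods ready", and a second value of zero corresponds to "No pod ready".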

Alert | Setting | Value |
---|---|---|
Container Restarts | threshold | 1 |
Container Restarts | sample period in minutes | 3 |
Container Restarts | observation period in minutes | 5 |
Deployment Stuck | sample period in minutes | 3 |
Deployment Stuck | observation period in minutes | 5 |
Pending Pods | threshold | 1 |
Pending Pods | sample period in minutes | 10 |
Pending Pods | observation period in minutes | 15 |
Pod Stuck In Terminating | sample period in minutes | 10 |
Pod Stuck In Terminating | observation period in minutes | 15 |
Workload Without Ready Pods | sample period in minutes | 10 |
Workload Without Ready Pods | observation period in minutes | 15 |
Oom Kills | alert | always |
Job Failure Events | alert | always |
Pod Backoff Events | alert | always |
Pod Eviction Events | alert | always |
Pod Preemption Events | alert | always |