Anomaly detectors

Dynatrace supports anomaly detection and alerting for Kubernetes entities. This section provides a complete list of available alerts as well as default settings for new environments.

Available alerts

Cluster

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect cluster readiness issues | 1.254 | Availability | Cluster not ready | Readyz endpoint indicates that this cluster is not ready. | Cluster readyz metric |
| Detect cluster CPU-request saturation | 1.254 | Resource | CPU-request saturation on cluster | CPU-request saturation exceeds the specified threshold. | Node CPU requests / Node CPU allocatable |
| Detect cluster memory-request saturation | 1.254 | Resource | Memory-request saturation on cluster | Memory-request saturation exceeds the specified threshold. | Node memory requests / Node memory allocatable |
| Detect cluster pod-saturation | 1.258 | Resource | Pod saturation on cluster | Cluster pod-saturation exceeds the specified threshold. | Sum of ready pods / Sum of allocatable pods |
| Detect monitoring issues | 1.258 | Availability | Monitoring not available | Dynatrace API monitoring is not available. | |
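
The ratio-based calculations in this table share one pattern: a requested or used quantity is divided by the allocatable quantity, and the result is compared against a configured threshold. The following is a minimal Python sketch of that pattern, not Dynatrace's implementation. The function names, the input values, and the 80 % threshold are hypothetical and chosen only for illustration.

```python
def saturation_percent(used: float, capacity: float) -> float:
    """Return used / capacity as a percentage; 0.0 when capacity is not positive."""
    if capacity <= 0:
        return 0.0
    return used / capacity * 100.0


def exceeds_threshold(used: float, capacity: float, threshold_percent: float) -> bool:
    """True when the saturation is strictly above the threshold (assumed comparison)."""
    return saturation_percent(used, capacity) > threshold_percent


# Hypothetical cluster: 13 CPU cores requested out of 16 allocatable.
print(saturation_percent(13.0, 16.0))                         # 81.25
print(exceeds_threshold(13.0, 16.0, threshold_percent=80.0))  # True
```

The same two functions apply to memory requests (bytes requested / bytes allocatable) and to pod saturation (ready pods / allocatable pods).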

Node

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect node readiness issues | 1.254 | Availability | Node not ready | Node is not ready. | Node condition metric filtered by 'not ready' |
| Detect problematic node conditions | 1.264 | Error | Problematic node condition | Node has one or more problematic conditions out of the following: ContainerRuntimeUnhealthy, DiskPressure, FrequentContainerdRestart, FrequentDockerRestart, FrequentKubeletRestart, KernelDeadlock, KubeletUnhealthy, MemoryPressure, NetworkUnavailable, OutOfDisk, PIDPressure, ReadonlyFilesystem, ContainerRuntimeProblem, CorruptDockerOverlay2, FilesystemCorruptionProblem, FrequentGcfsdRestart, FrequentGcfsSnapshotterRestart, FrequentUnregisterNetDevice, GcfsdUnhealthy, GcfsSnapshotterMissingLayer, GcfsSnapshotterUnhealthy, KubeletProblem | Nodes condition metric |
| Detect node CPU-request saturation | 1.254 | Resource | CPU-request saturation on node | CPU-request saturation exceeds the specified threshold. | Sum of node CPU requests / Sum of node CPU allocatable |
| Detect node memory-request saturation | 1.254 | Resource | Memory-request saturation on node | Memory-request saturation exceeds the specified threshold. | Sum of node memory requests / Sum of node memory allocatable |
| Detect node pod-saturation | 1.254 | Resource | Pod saturation on node | Pod saturation exceeds the specified threshold. | Sum of running pods on node / Node pod limit |
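
To illustrate the "Problematic node condition" alert, the sketch below checks a node's conditions against the condition types listed in the problem description. It assumes the standard Kubernetes representation of node conditions (entries with `type` and `status` fields, as found under `status.conditions` of a Node object). The function name is hypothetical, and treating only status `"True"` as problematic is an assumption; the source does not say how `"Unknown"` is handled.

```python
# Condition types taken from the "Problematic node condition" description above.
PROBLEMATIC_CONDITIONS = frozenset({
    "ContainerRuntimeUnhealthy", "DiskPressure", "FrequentContainerdRestart",
    "FrequentDockerRestart", "FrequentKubeletRestart", "KernelDeadlock",
    "KubeletUnhealthy", "MemoryPressure", "NetworkUnavailable", "OutOfDisk",
    "PIDPressure", "ReadonlyFilesystem", "ContainerRuntimeProblem",
    "CorruptDockerOverlay2", "FilesystemCorruptionProblem", "FrequentGcfsdRestart",
    "FrequentGcfsSnapshotterRestart", "FrequentUnregisterNetDevice",
    "GcfsdUnhealthy", "GcfsSnapshotterMissingLayer", "GcfsSnapshotterUnhealthy",
    "KubeletProblem",
})


def problematic_conditions(conditions: list[dict]) -> list[str]:
    """Return the listed condition types whose status is "True", sorted by name."""
    return sorted(
        c["type"]
        for c in conditions
        if c.get("type") in PROBLEMATIC_CONDITIONS and c.get("status") == "True"
    )


# Hypothetical node: ready, but under memory pressure.
node_conditions = [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "True"},
    {"type": "DiskPressure", "status": "False"},
]
print(problematic_conditions(node_conditions))  # ['MemoryPressure']
```

`Ready` is deliberately absent from the set: node readiness is covered by the separate "Detect node readiness issues" alert.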

Namespace

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect namespace CPU-request quota saturation | 1.254 | Resource | CPU-request quota saturation | CPU-request quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU requests |
| Detect namespace CPU-limit quota saturation | 1.254 | Resource | CPU-limit quota saturation | CPU-limit quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU limits |
| Detect namespace memory-request quota saturation | 1.254 | Resource | Memory-request quota saturation | Memory-request quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory requests |
| Detect namespace memory-limit quota saturation | 1.254 | Resource | Memory-limit quota saturation | Memory-limit quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory limits |
| Detect namespace pod quota saturation | 1.254 | Resource | Pod quota saturation | Pod quota saturation exceeds the specified threshold. | Sum of resource quota pods used / Sum of resource quota pods limit |

Workload

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect container restarts | 1.254 | Error | Container restarts | Observed container restarts exceed the specified threshold. | Container restarts metric |
| Detect stuck deployments | 1.260 | Error | Deployment stuck | Deployment is stuck and therefore is no longer progressing. | Workload condition metric filtered by 'not progressing' |
| Detect pods stuck in pending | 1.254 | Resource | Pods stuck in pending | Workload has pending pods. | Pods metric filtered by phase 'Pending' |
| Detect pods stuck in terminating | 1.260 | Resource | Pods stuck in terminating | Workload has pods stuck in terminating. | Pods metric filtered by status 'Terminating' |
| Detect workloads without ready pods | 1.254 | Error | No pod ready | Workload does not have any ready pods. | Sum of non-failed pods - Sum of non-failed and non-ready pods |
| Detect workloads with non-ready pods | 1.258 | Error | Not all pods ready | Workload has pods that are not ready. | Sum of non-failed pods - Sum of non-failed and non-ready pods |
| Detect memory usage saturation | 1.264 | Resource | Memory usage close to limits | The memory usage (working set memory) exceeds the threshold in terms of the defined memory limit. | Sum of workload working set memory / Sum of workload memory limits |
| Detect CPU usage saturation | 1.264 | Resource | CPU usage close to limits | The CPU usage exceeds the threshold in terms of the defined CPU limit. | Sum of workload CPU usage / Sum of workload CPU limits |
| Detect high CPU throttling | 1.264 | Resource | High CPU throttling | The CPU throttling to usage ratio exceeds the specified threshold. | Sum of workload CPU throttled / Sum of workload CPU usage |
| Detect out-of-memory kills | 1.268 | Error | Out-of-memory kills | Out-of-memory kills have been observed for pods of this workload. | Out-of-memory kills metric |
| Detect job failure events | 1.268 | Error | Job failure event | Events with reason 'BackoffLimitExceeded', 'DeadlineExceeded', or 'PodFailurePolicy' have been detected. | Event metric filtered by reason and workload kind |
| Detect pod backoff events | 1.268 | Error | Backoff event | Events with reason 'BackOff' have been detected for pods of this workload. Check for pods with status 'ImagePullBackOff' or 'CrashLoopBackOff'. | Event metric filtered by reason |
| Detect pod eviction events | 1.268 | Error | Pod eviction event | Events with reason 'Evicted' have been detected for pods of this workload. | Event metric filtered by reason |
| Detect pod preemption events | 1.268 | Error | Preemption event | Events with reasons 'Preempted' or 'Preempting' have been detected for pods of this workload. | Event metric filtered by reason |
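
The two ready-pod alerts share the calculation "Sum of non-failed pods - Sum of non-failed and non-ready pods", which yields the number of ready pods. The table does not state how that number is evaluated, so the sketch below makes an assumption: "No pod ready" applies when the result is zero, and "Not all pods ready" applies when the result is lower than the number of non-failed pods. The function names and return values are hypothetical.

```python
def ready_pods(non_failed: int, non_failed_not_ready: int) -> int:
    """Calculation from the table: non-failed pods minus non-failed, non-ready pods."""
    return non_failed - non_failed_not_ready


def classify_workload(non_failed: int, non_failed_not_ready: int) -> str:
    """Map the ready-pod count to a problem title (assumed evaluation rules)."""
    ready = ready_pods(non_failed, non_failed_not_ready)
    if non_failed > 0 and ready == 0:
        return "No pod ready"
    if ready < non_failed:
        return "Not all pods ready"
    return "OK"


print(classify_workload(non_failed=3, non_failed_not_ready=3))  # No pod ready
print(classify_workload(non_failed=3, non_failed_not_ready=1))  # Not all pods ready
print(classify_workload(non_failed=3, non_failed_not_ready=0))  # OK
```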

Persistent volume claim alerts

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect low disk space (MB) | 1.262 | Resource | Kubernetes PVC: Low disk space | Available disk space for a persistent volume claim is below the threshold. | Kubelet volume stats available bytes metric |
| Detect low disk space (%) | 1.262 | Resource | Kubernetes PVC: Low disk space % | Available disk space for a persistent volume claim is below the threshold. | Volume stats available bytes / Volume stats capacity bytes |
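
Unlike the saturation alerts, the low-disk-space alerts trigger when a value falls below the threshold. The sketch below works through the percentage variant. It assumes that the default threshold of 3 listed under "Default settings for new environments" is a percentage; the function names and volume sizes are hypothetical.

```python
def available_percent(available_bytes: int, capacity_bytes: int) -> float:
    """Available bytes / capacity bytes, expressed as a percentage."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return available_bytes / capacity_bytes * 100.0


def is_low_disk_space(available_bytes: int, capacity_bytes: int,
                      threshold_percent: float = 3.0) -> bool:
    """True when the available share is strictly below the threshold (assumed comparison)."""
    return available_percent(available_bytes, capacity_bytes) < threshold_percent


GIB = 2**30
# Hypothetical 64 GiB volume with 1 GiB left.
print(available_percent(1 * GIB, 64 * GIB))   # 1.5625
print(is_low_disk_space(1 * GIB, 64 * GIB))   # True
print(is_low_disk_space(16 * GIB, 64 * GIB))  # False
```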

Default settings for new environments

This section lists all alerts that are enabled by default, along with their settings.

Deviating default values

Default values for new environments may differ from the values applied when resetting alert configurations in the Dynatrace Classic interface.

Cluster

| Alert | Setting | Value |
| --- | --- | --- |
| Readiness Issues | sample period in minutes | 3 |
| | observation period in minutes | 5 |
| Monitoring Issues | sample period in minutes | 15 |
| | observation period in minutes | 30 |

Node

| Alert | Setting | Value |
| --- | --- | --- |
| Readiness Issues | sample period in minutes | 3 |
| | observation period in minutes | 5 |
| Node Problematic Condition | sample period in minutes | 3 |
| | observation period in minutes | 5 |

Persistent volume claim

| Alert | Setting | Value |
| --- | --- | --- |
| Low Disk Space Critical Percentage | threshold | 3 |
| | sample period in minutes | 3 |
| | observation period in minutes | 5 |

Workload

| Alert | Setting | Value |
| --- | --- | --- |
| Container Restarts | threshold | 1 |
| | sample period in minutes | 3 |
| | observation period in minutes | 5 |
| Deployment Stuck | sample period in minutes | 3 |
| | observation period in minutes | 5 |
| Pending Pods | threshold | 1 |
| | sample period in minutes | 10 |
| | observation period in minutes | 15 |
| Pod Stuck In Terminating | sample period in minutes | 10 |
| | observation period in minutes | 15 |
| Workload Without Ready Pods | sample period in minutes | 10 |
| | observation period in minutes | 15 |
| Oom Kills | alert | always |
| Job Failure Events | alert | always |
| Pod Backoff Events | alert | always |
| Pod Eviction Events | alert | always |
| Pod Preemption Events | alert | always |
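
Most default settings above pair a sample period with an observation period. This page does not spell out how the two interact. A common reading, assumed here, is that a problem is raised when the condition is violated for at least "sample period" minutes within the last "observation period" minutes. The following Python sketch illustrates that reading only; the function name and the per-minute input format are hypothetical.

```python
def should_alert(violations_per_minute: list[bool],
                 sample_period: int,
                 observation_period: int) -> bool:
    """Assumed rule: alert when at least `sample_period` of the most recent
    `observation_period` one-minute samples are violations.

    `violations_per_minute` is ordered oldest first; True marks a violating minute.
    """
    if sample_period <= 0 or observation_period <= 0:
        raise ValueError("periods must be positive")
    window = violations_per_minute[-observation_period:]
    return sum(window) >= sample_period


# Default for node readiness issues: sample period 3, observation period 5.
print(should_alert([True, False, True, True, False], 3, 5))   # True: 3 of the last 5
print(should_alert([True, True, False, False, False], 3, 5))  # False: 2 of the last 5
```

Under this reading, a larger gap between the two periods tolerates more flapping, while equal values would require the violation to hold for the whole window.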