Anomaly detectors
Dynatrace supports anomaly detection and alerting for Kubernetes entities. This section provides a complete list of available alerts as well as default settings for new environments.
Available alerts
Cluster
Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
---|---|---|---|---|---|
Detect cluster readiness issues | 1.254 | Availability | Cluster not ready | Readyz endpoint indicates that this cluster is not ready. | Cluster readyz metric |
Detect cluster CPU-request saturation | 1.254 | Resource | CPU-request saturation on cluster | CPU-request saturation exceeds the specified threshold. | Node CPU requests / Node CPU allocatable |
Detect cluster memory-request saturation | 1.254 | Resource | Memory-request saturation on cluster | Memory-request saturation exceeds the specified threshold. | Node memory requests / Node memory allocatable |
Detect cluster pod-saturation | 1.258 | Resource | Pod saturation on cluster | Cluster pod-saturation exceeds the specified threshold. | Sum of ready pods / Sum of allocatable pods |
Detect monitoring issues | 1.258 | Availability | Monitoring not available | Dynatrace API monitoring is not available. | |
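The saturation alerts in the table above all follow the same pattern: a ratio of requested to allocatable resources is compared against a threshold. The following is a minimal sketch of that arithmetic for cluster CPU-request saturation. The function names, the node values, and the threshold are hypothetical and are not part of any Dynatrace API; the sketch only illustrates the "Node CPU requests / Node CPU allocatable" calculation.

```python
def request_saturation(requested: float, allocatable: float) -> float:
    """Return requested / allocatable as a fraction (for example 0.625)."""
    if allocatable <= 0:
        raise ValueError("allocatable must be positive")
    return requested / allocatable


def is_saturated(requested: float, allocatable: float, threshold_percent: float) -> bool:
    """True when the saturation, in percent, exceeds the given threshold."""
    return request_saturation(requested, allocatable) * 100 > threshold_percent


# Hypothetical cluster with three nodes; CPU values are in millicores.
node_cpu_requests = [1500, 3200, 2800]
node_cpu_allocatable = [4000, 4000, 4000]

cluster_saturation = request_saturation(
    sum(node_cpu_requests), sum(node_cpu_allocatable)
)
print(f"Cluster CPU-request saturation: {cluster_saturation:.1%}")
```

The same ratio applies to memory-request saturation when the CPU values are swapped for memory values.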
Node
Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
---|---|---|---|---|---|
Detect node readiness issues | 1.254 | Availability | Node not ready | Node is not ready. | Node condition metric filtered by 'not ready' |
Detect problematic node conditions | 1.264 | Error | Problematic node condition | Node has one or more problematic conditions out of the following: ContainerRuntimeUnhealthy, DiskPressure, FrequentContainerdRestart, FrequentDockerRestart, FrequentKubeletRestart, KernelDeadlock, KubeletUnhealthy, MemoryPressure, NetworkUnavailable, OutOfDisk, PIDPressure, ReadonlyFilesystem, ContainerRuntimeProblem, CorruptDockerOverlay2, FilesystemCorruptionProblem, FrequentGcfsdRestart, FrequentGcfsSnapshotterRestart, FrequentUnregisterNetDevice, GcfsdUnhealthy, GcfsSnapshotterMissingLayer, GcfsSnapshotterUnhealthy, KubeletProblem | Nodes condition metric |
Detect node CPU-request saturation | 1.254 | Resource | CPU-request saturation on node | CPU-request saturation exceeds the specified threshold. | Sum of node CPU requests / Sum of node CPU allocatable |
Detect node memory-request saturation | 1.254 | Resource | Memory-request saturation on node | Memory-request saturation exceeds the specified threshold. | Sum of node memory requests / Sum of node memory allocatable |
Detect node pod-saturation | 1.254 | Resource | Pod saturation on node | Pod saturation exceeds the specified threshold. | Sum of running pods on node / Node pod limit |
Namespace
Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
---|---|---|---|---|---|
Detect namespace CPU-request quota saturation | 1.254 | Resource | CPU-request quota saturation | CPU-request quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU requests |
Detect namespace CPU-limit quota saturation | 1.254 | Resource | CPU-limit quota saturation | CPU-limit quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU limits |
Detect namespace memory-request quota saturation | 1.254 | Resource | Memory-request quota saturation | Memory-request quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory requests |
Detect namespace memory-limit quota saturation | 1.254 | Resource | Memory-limit quota saturation | Memory-limit quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory limits |
Detect namespace pod quota saturation | 1.254 | Resource | Pod quota saturation | Pod quota saturation exceeds the specified threshold. | Sum of resource quota pods used / Sum of resource quota pods limit |
Workload
Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
---|---|---|---|---|---|
Detect container restarts | 1.254 | Error | Container restarts | Observed container restarts exceed the specified threshold. | Container restarts metric |
Detect stuck deployments | 1.260 | Error | Deployment stuck | Deployment is stuck and therefore is no longer progressing. | Workload condition metric filtered by 'not progressing' |
Detect pods stuck in pending | 1.254 | Resource | Pods stuck in pending | Workload has pending pods. | Pods metric filtered by phase 'Pending' |
Detect pods stuck in terminating | 1.260 | Resource | Pods stuck in terminating | Workload has pods stuck in terminating. | Pods metric filtered by status 'Terminating' |
Detect workloads without ready pods | 1.254 | Error | No pod ready | Workload does not have any ready pods. | Sum of non-failed pods - Sum of non-failed and non-ready pods |
Detect workloads with non-ready pods | 1.258 | Error | Not all pods ready | Workload has pods that are not ready. | Sum of non-failed pods - Sum of non-failed and non-ready pods |
Detect memory usage saturation | 1.264 | Resource | Memory usage close to limits | The memory usage (working set memory) exceeds the threshold in terms of the defined memory limit. | Sum of workload working set memory / Sum of workload memory limits |
Detect CPU usage saturation | 1.264 | Resource | CPU usage close to limits | The CPU usage exceeds the threshold in terms of the defined CPU limit. | Sum of workload CPU usage / Sum of workload CPU limits |
Detect high CPU throttling | 1.264 | Resource | High CPU throttling | The CPU throttling to usage ratio exceeds the specified threshold. | Sum of workload CPU throttled / Sum of workload CPU usage |
Detect out-of-memory kills | 1.268 | Error | Out-of-memory kills | Out-of-memory kills have been observed for pods of this workload. | Out-of-memory kills metric |
Detect job failure events | 1.268 | Error | Job failure event | Events with reason 'BackoffLimitExceeded', 'DeadlineExceeded', or 'PodFailurePolicy' have been detected. | Event metric filtered by reason and workload kind |
Detect pod backoff events | 1.268 | Error | Backoff event | Events with reason 'BackOff' have been detected for pods of this workload. Check for pods with status 'ImagePullBackOff' or 'CrashLoopBackOff'. | Event metric filtered by reason |
Detect pod eviction events | 1.268 | Error | Pod eviction event | Events with reason 'Evicted' have been detected for pods of this workload. | Event metric filtered by reason |
Detect pod preemption events | 1.268 | Error | Preemption event | Events with reasons 'Preempted' or 'Preempting' have been detected for pods of this workload. | Event metric filtered by reason |
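The "No pod ready" and "Not all pods ready" alerts list the same calculation, which amounts to counting the ready pods among the non-failed ones. The sketch below shows one way the two alerts could differ in how they interpret that count. The `Pod` class and the comparison rules are assumptions made for illustration; the table does not state how each alert evaluates the result.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    phase: str   # Kubernetes pod phase, e.g. "Pending", "Running", "Failed"
    ready: bool  # whether the pod reports the Ready condition


def non_failed_count(pods: list[Pod]) -> int:
    """Number of pods whose phase is not 'Failed'."""
    return sum(1 for p in pods if p.phase != "Failed")


def ready_pod_count(pods: list[Pod]) -> int:
    """Sum of non-failed pods minus sum of non-failed and non-ready pods."""
    non_failed_non_ready = sum(
        1 for p in pods if p.phase != "Failed" and not p.ready
    )
    return non_failed_count(pods) - non_failed_non_ready


def no_pod_ready(pods: list[Pod]) -> bool:
    """Assumed rule: the workload has no ready pod at all."""
    return ready_pod_count(pods) == 0


def not_all_pods_ready(pods: list[Pod]) -> bool:
    """Assumed rule: at least one non-failed pod is not ready."""
    return ready_pod_count(pods) < non_failed_count(pods)


# Hypothetical workload: one ready pod, one starting pod, one failed pod.
workload = [Pod("Running", True), Pod("Running", False), Pod("Failed", False)]
print(ready_pod_count(workload), no_pod_ready(workload), not_all_pods_ready(workload))
```

Under these assumed rules, a workload with one of two non-failed pods ready would trigger only the "Not all pods ready" condition.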
Persistent volume claim
Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
---|---|---|---|---|---|
Detect low disk space (MB) | 1.262 | Resource | Kubernetes PVC: Low disk space | Available disk space for a persistent volume claim is below the threshold. | Kubelet volume stats available bytes metric |
Detect low disk space (%) | 1.262 | Resource | Kubernetes PVC: Low disk space % | Available disk space for a persistent volume claim is below the threshold. | Volume stats available bytes / Volume stats capacity bytes |
Default settings for new environments
This section lists the alerts that are enabled by default, along with their settings.
Default values for new environments may differ from the values applied when resetting alert configurations in the Dynatrace Classic interface.
Cluster
Alert | Setting | Value |
---|---|---|
Readiness Issues | sample period in minutes | 3 |
Readiness Issues | observation period in minutes | 5 |
Monitoring Issues | sample period in minutes | 15 |
Monitoring Issues | observation period in minutes | 30 |
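The tables in this section pair a sample period with an observation period but do not define how the two interact. A common reading, assumed here, is that an alert is raised when the condition is violated in at least "sample period" of the last "observation period" one-minute samples. The sketch below implements that sliding-window rule; the `WindowedAlert` class is hypothetical and is not a Dynatrace API.

```python
from collections import deque


class WindowedAlert:
    """Sliding-window evaluation under the assumed 'N of the last M minutes' rule."""

    def __init__(self, sample_period: int, observation_period: int) -> None:
        if not 0 < sample_period <= observation_period:
            raise ValueError("need 0 < sample_period <= observation_period")
        self.sample_period = sample_period
        # Keep only the most recent `observation_period` one-minute samples.
        self.window: deque[bool] = deque(maxlen=observation_period)

    def observe(self, violating: bool) -> bool:
        """Record one sample and report whether the alert condition is met."""
        self.window.append(violating)
        return sum(self.window) >= self.sample_period


# Cluster readiness defaults from the table: sample period 3, observation period 5.
alert = WindowedAlert(sample_period=3, observation_period=5)
for minute, violating in enumerate([True, False, True, False, True], start=1):
    print(minute, alert.observe(violating))
```

With these inputs, the condition is first met at minute 5, when three of the last five samples are violating.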
Node
Alert | Setting | Value |
---|---|---|
Readiness Issues | sample period in minutes | 3 |
Readiness Issues | observation period in minutes | 5 |
Node Problematic Condition | sample period in minutes | 3 |
Node Problematic Condition | observation period in minutes | 5 |
Persistent volume claim
Alert | Setting | Value |
---|---|---|
Low Disk Space Critical Percentage | threshold | 3 |
Low Disk Space Critical Percentage | sample period in minutes | 3 |
Low Disk Space Critical Percentage | observation period in minutes | 5 |
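As a worked example of the percentage-based low disk space alert, the sketch below divides available bytes by capacity bytes and compares the result with the default threshold of 3 from the table above. The function names and the volume sizes are hypothetical, and treating the threshold as a percentage of capacity is an assumption based on the alert name.

```python
def available_percent(available_bytes: int, capacity_bytes: int) -> float:
    """Volume stats available bytes / volume stats capacity bytes, in percent."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100.0 * available_bytes / capacity_bytes


def low_disk_space(
    available_bytes: int, capacity_bytes: int, threshold_percent: float = 3.0
) -> bool:
    """True when the available share of the volume is below the threshold."""
    return available_percent(available_bytes, capacity_bytes) < threshold_percent


MIB = 2**20
GIB = 2**30

# Hypothetical 10 GiB persistent volume claim with 200 MiB left.
print(f"{available_percent(200 * MIB, 10 * GIB):.2f}% available")
print(low_disk_space(200 * MIB, 10 * GIB))
```

A volume in this state would still need to satisfy the sample and observation periods before a problem is opened.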
Workload
Alert | Setting | Value |
---|---|---|
Container Restarts | threshold | 1 |
Container Restarts | sample period in minutes | 3 |
Container Restarts | observation period in minutes | 5 |
Deployment Stuck | sample period in minutes | 3 |
Deployment Stuck | observation period in minutes | 5 |
Pending Pods | threshold | 1 |
Pending Pods | sample period in minutes | 10 |
Pending Pods | observation period in minutes | 15 |
Pod Stuck In Terminating | sample period in minutes | 10 |
Pod Stuck In Terminating | observation period in minutes | 15 |
Workload Without Ready Pods | sample period in minutes | 10 |
Workload Without Ready Pods | observation period in minutes | 15 |
OOM Kills | alert | always |
Job Failure Events | alert | always |
Pod Backoff Events | alert | always |
Pod Eviction Events | alert | always |
Pod Preemption Events | alert | always |