Anomaly detectors

  • Latest Dynatrace
  • Reference
  • Published Jan 19, 2024

Dynatrace supports anomaly detection and alerting for Kubernetes entities. This section provides a complete list of available alerts as well as default settings for new environments.

Available alerts

Cluster

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect cluster readiness issues | 1.254 | Availability | Cluster not ready | Readyz endpoint indicates that this cluster is not ready. | Cluster readyz metric |
| Detect cluster CPU-request saturation | 1.254 | Resource | CPU-request saturation on cluster | CPU-request saturation exceeds the specified threshold. | Node CPU requests / Node CPU allocatable |
| Detect cluster memory-request saturation | 1.254 | Resource | Memory-request saturation on cluster | Memory-request saturation exceeds the specified threshold. | Node memory requests / Node memory allocatable |
| Detect cluster pod-saturation | 1.258 | Resource | Pod saturation on cluster | Cluster pod-saturation exceeds the specified threshold. | Sum of ready pods / Sum of allocatable pods |
| Detect monitoring issues | 1.258 | Availability | Monitoring not available | Dynatrace API monitoring is not available. | |
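
The saturation calculations in this table are plain ratios compared against a threshold. As a rough illustration only (this is not Dynatrace's implementation), the sketch below computes cluster CPU-request saturation from per-node values. The `NodeResources` type, its field names, the summing across nodes, and the example threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeResources:
    """Hypothetical per-node snapshot; field names are illustrative only."""
    cpu_requests_millicores: int
    cpu_allocatable_millicores: int


def cluster_cpu_request_saturation(nodes):
    """Return summed CPU requests divided by summed allocatable CPU.

    Returns 0.0 when no CPU is allocatable, to avoid dividing by zero.
    """
    requested = sum(n.cpu_requests_millicores for n in nodes)
    allocatable = sum(n.cpu_allocatable_millicores for n in nodes)
    return requested / allocatable if allocatable else 0.0


def exceeds_threshold(saturation, threshold):
    """True when the saturation ratio is above the configured threshold."""
    return saturation > threshold
```

For two nodes that each request 1500 of 2000 allocatable millicores, the ratio is 0.75, which would exceed a hypothetical threshold of 0.7. The memory-request and pod-saturation rows follow the same pattern with different inputs.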

Node

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect node readiness issues | 1.254 | Availability | Node not ready | Node is not ready. | Node condition metric filtered by 'not ready' |
| Detect problematic node conditions | 1.264 | Error | Problematic node condition | Node has one or more problematic conditions out of the following: ContainerRuntimeUnhealthy, DiskPressure, FrequentContainerdRestart, FrequentDockerRestart, FrequentKubeletRestart, KernelDeadlock, KubeletUnhealthy, MemoryPressure, NetworkUnavailable, OutOfDisk, PIDPressure, ReadonlyFilesystem, ContainerRuntimeProblem, CorruptDockerOverlay2, FilesystemCorruptionProblem, FrequentGcfsdRestart, FrequentGcfsSnapshotterRestart, FrequentUnregisterNetDevice, GcfsdUnhealthy, GcfsSnapshotterMissingLayer, GcfsSnapshotterUnhealthy, KubeletProblem. | Nodes condition metric |
| Detect node CPU-request saturation | 1.254 | Resource | CPU-request saturation on node | CPU-request saturation exceeds the specified threshold. | Sum of node CPU requests / Sum of node CPU allocatable |
| Detect node memory-request saturation | 1.254 | Resource | Memory-request saturation on node | Memory-request saturation exceeds the specified threshold. | Sum of node memory requests / Sum of node memory allocatable |
| Detect node pod-saturation | 1.254 | Resource | Pod saturation on node | Pod saturation exceeds the specified threshold. | Sum of running pods on node / Node pod limit |
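
To make the "problematic node conditions" check concrete, here is a minimal sketch that filters a node's conditions against the list given in the table. It assumes the conditions are available as a mapping from condition type to the Kubernetes status string (`"True"`, `"False"`, or `"Unknown"`); the function name and input shape are hypothetical and do not reflect how Dynatrace evaluates its node condition metric.

```python
# Condition types copied from the "Detect problematic node conditions" row.
PROBLEMATIC_CONDITIONS = frozenset({
    "ContainerRuntimeUnhealthy", "DiskPressure", "FrequentContainerdRestart",
    "FrequentDockerRestart", "FrequentKubeletRestart", "KernelDeadlock",
    "KubeletUnhealthy", "MemoryPressure", "NetworkUnavailable", "OutOfDisk",
    "PIDPressure", "ReadonlyFilesystem", "ContainerRuntimeProblem",
    "CorruptDockerOverlay2", "FilesystemCorruptionProblem",
    "FrequentGcfsdRestart", "FrequentGcfsSnapshotterRestart",
    "FrequentUnregisterNetDevice", "GcfsdUnhealthy",
    "GcfsSnapshotterMissingLayer", "GcfsSnapshotterUnhealthy",
    "KubeletProblem",
})


def problematic_conditions(node_conditions):
    """Return the sorted condition types that are listed as problematic and currently "True".

    node_conditions is a hypothetical mapping of condition type -> status string.
    """
    return sorted(
        condition
        for condition, status in node_conditions.items()
        if condition in PROBLEMATIC_CONDITIONS and status == "True"
    )
```

A node reporting `{"Ready": "True", "MemoryPressure": "True", "DiskPressure": "False"}` would yield `["MemoryPressure"]`; `Ready` is ignored here because readiness is covered by its own alert.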

Namespace

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect namespace CPU-request quota saturation | 1.254 | Resource | CPU-request quota saturation | CPU-request quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU requests |
| Detect namespace CPU-limit quota saturation | 1.254 | Resource | CPU-limit quota saturation | CPU-limit quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU limits |
| Detect namespace memory-request quota saturation | 1.254 | Resource | Memory-request quota saturation | Memory-request quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory requests |
| Detect namespace memory-limit quota saturation | 1.254 | Resource | Memory-limit quota saturation | Memory-limit quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory limits |
| Detect namespace pod quota saturation | 1.254 | Resource | Pod quota saturation | Pod quota saturation exceeds the specified threshold. | Sum of resource quota pods used / Sum of resource quota pods limit |
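
The namespace calculations divide summed quota usage by the summed quota itself. The sketch below shows one way to express that, assuming each quota is given as a dictionary with `used` and `hard` sections keyed by Kubernetes ResourceQuota resource names such as `requests.cpu` or `pods`, with values already parsed into plain numbers. The input shape and function name are hypothetical.

```python
def quota_saturation(quotas, resource):
    """Return summed usage divided by summed hard quota for one resource.

    quotas: list of dicts shaped like {"used": {...}, "hard": {...}} (hypothetical shape).
    Returns None when no quota is defined for the resource.
    """
    used = sum(q["used"].get(resource, 0) for q in quotas)
    hard = sum(q["hard"].get(resource, 0) for q in quotas)
    return used / hard if hard else None
```

With two quotas that each allow 10 pods and currently use 3 and 2, `quota_saturation(quotas, "pods")` gives 0.25. A namespace without a quota for the resource returns `None` rather than a ratio, since there is nothing to saturate.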

Workload

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect container restarts | 1.254 | Error | Container restarts | Observed container restarts exceed the specified threshold. | Container restarts metric |
| Detect stuck deployments | 1.260 | Error | Deployment stuck | Deployment is stuck and therefore is no longer progressing. | Workload condition metric filtered by 'not progressing' |
| Detect pods stuck in pending | 1.254 | Resource | Pods stuck in pending | Workload has pending pods. | Pods metric filtered by phase 'Pending' |
| Detect pods stuck in terminating | 1.260 | Resource | Pods stuck in terminating | Workload has pods stuck in terminating. | Pods metric filtered by status 'Terminating' |
| Detect workloads without ready pods | 1.254 | Error | No pod ready | Workload does not have any ready pods. | Sum of non-failed pods - Sum of non-failed and non-ready pods |
| Detect workloads with non-ready pods | 1.258 | Error | Not all pods ready | Workload has pods that are not ready. | Sum of non-failed pods - Sum of non-failed and non-ready pods |
| Detect memory usage saturation | 1.264 | Resource | Memory usage close to limits | The memory usage (working set memory) exceeds the threshold in terms of the defined memory limit. | Sum of workload working set memory / Sum of workload memory limits |
| Detect CPU usage saturation | 1.264 | Resource | CPU usage close to limits | The CPU usage exceeds the threshold in terms of the defined CPU limit. | Sum of workload CPU usage / Sum of workload CPU limits |
| Detect high CPU throttling | 1.264 | Resource | High CPU throttling | The CPU throttling to limits ratio exceeds the specified threshold. | Sum of workload CPU throttled / Sum of workload CPU limits |
| Detect out-of-memory kills | 1.268 | Error | Out-of-memory kills | Out-of-memory kills have been observed for pods of this workload. | Out-of-memory kills metric |
| Detect job failure events | 1.268 | Error | Job failure event | Events with reason 'BackoffLimitExceeded', 'DeadlineExceeded', or 'PodFailurePolicy' have been detected. | Event metric filtered by reason and workload kind |
| Detect pod backoff events | 1.268 | Error | Backoff event | Events with reason 'BackOff' have been detected for pods of this workload. Check for pods with status 'ImagePullBackOff' or 'CrashLoopBackOff'. | Event metric filtered by reason |
| Detect pod eviction events | 1.268 | Error | Pod eviction event | Events with reason 'Evicted' have been detected for pods of this workload. | Event metric filtered by reason |
| Detect pod preemption events | 1.268 | Error | Preemption event | Events with reasons 'Preempted' or 'Preempting' have been detected for pods of this workload. | Event metric filtered by reason |
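
The two readiness alerts share one calculation, which amounts to counting ready pods: non-failed pods minus those that are non-failed but not ready. The sketch below illustrates how the same count could lead to either problem title. How the two alerts actually distinguish their conditions is not stated in the table, so the comparison logic here is an assumption, as are the function names.

```python
def ready_pod_count(non_failed, non_failed_not_ready):
    """Ready pods = non-failed pods minus non-failed pods that are not ready."""
    return non_failed - non_failed_not_ready


def readiness_problem(non_failed, non_failed_not_ready):
    """Return a problem title from the table, or None when all pods are ready.

    Assumption: a workload with no non-failed pods at all raises neither problem.
    """
    if non_failed == 0:
        return None
    ready = ready_pod_count(non_failed, non_failed_not_ready)
    if ready == 0:
        return "No pod ready"
    if ready < non_failed:
        return "Not all pods ready"
    return None
```

For a workload with three non-failed pods, three not-ready pods map to "No pod ready", one not-ready pod maps to "Not all pods ready", and zero not-ready pods raise nothing.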

Persistent volume claim alerts

| Alert name | Dynatrace version | Problem type | Problem title | Problem description | Calculation |
| --- | --- | --- | --- | --- | --- |
| Detect low disk space (MB) | 1.262 | Resource | Kubernetes PVC: Low disk space | Available disk space for a persistent volume claim is below the threshold. | Kubelet volume stats available bytes metric |
| Detect low disk space (%) | 1.262 | Resource | Kubernetes PVC: Low disk space % | Available disk space for a persistent volume claim is below the threshold. | Volume stats available bytes / Volume stats capacity bytes |
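
Unlike the saturation alerts, the persistent volume claim alerts fire when a value drops below the threshold. A minimal sketch of the percentage variant follows; the function names are hypothetical, and the threshold is treated as a percentage of capacity, which is an assumption based on the alert name.

```python
def available_percent(available_bytes, capacity_bytes):
    """Return available space as a percentage of capacity (0.0 if capacity is unknown)."""
    if capacity_bytes <= 0:
        return 0.0
    return 100 * available_bytes / capacity_bytes


def is_low_disk_space(available_bytes, capacity_bytes, threshold_percent):
    """True when the available percentage is below the threshold."""
    return available_percent(available_bytes, capacity_bytes) < threshold_percent
```

A volume with 20 of 1000 bytes free has 2 % available, which is below a threshold of 3, so `is_low_disk_space(20, 1000, 3)` is true.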

Default settings for new environments

This section lists the alerts that are enabled by default, together with their settings.

Deviating default values

Default values for new environments may differ from the values applied when resetting alert configurations in the Dynatrace Classic interface.

Cluster

| Alert | Setting | Value |
| --- | --- | --- |
| Readiness Issues | sample period in minutes | 3 |
| Readiness Issues | observation period in minutes | 5 |
| Monitoring Issues | sample period in minutes | 15 |
| Monitoring Issues | observation period in minutes | 30 |
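
The page does not define how the sample period and observation period interact. A common reading, treated here as an assumption, is that a problem is raised when the condition is violated in at least "sample period" one-minute samples within a sliding window of "observation period" minutes. The class below sketches that reading; its name and interface are hypothetical.

```python
from collections import deque


class SlidingWindowAlert:
    """Raise an alert when enough violating samples fall inside the observation window."""

    def __init__(self, sample_period, observation_period):
        if not 0 < sample_period <= observation_period:
            raise ValueError("sample period must be between 1 and the observation period")
        self.sample_period = sample_period
        # Keep only the most recent `observation_period` one-minute samples.
        self.window = deque(maxlen=observation_period)

    def observe(self, violating):
        """Record one sample and return True if the alert condition is currently met."""
        self.window.append(bool(violating))
        return sum(self.window) >= self.sample_period
```

With the cluster readiness defaults (3 and 5), the samples violating, ok, violating, ok, violating would trigger on the fifth minute, because three of the last five samples are violating. One more healthy sample pushes the oldest violation out of the window and clears the condition.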

Node

| Alert | Setting | Value |
| --- | --- | --- |
| Readiness Issues | sample period in minutes | 3 |
| Readiness Issues | observation period in minutes | 5 |
| Node Problematic Condition | sample period in minutes | 3 |
| Node Problematic Condition | observation period in minutes | 5 |

PVC

| Alert | Setting | Value |
| --- | --- | --- |
| Low Disk Space Critical Percentage | threshold | 3 |
| Low Disk Space Critical Percentage | sample period in minutes | 3 |
| Low Disk Space Critical Percentage | observation period in minutes | 5 |

Workload

| Alert | Setting | Value |
| --- | --- | --- |
| Container Restarts | threshold | 1 |
| Container Restarts | sample period in minutes | 3 |
| Container Restarts | observation period in minutes | 5 |
| Deployment Stuck | sample period in minutes | 3 |
| Deployment Stuck | observation period in minutes | 5 |
| Pending Pods | threshold | 1 |
| Pending Pods | sample period in minutes | 10 |
| Pending Pods | observation period in minutes | 15 |
| Pod Stuck In Terminating | sample period in minutes | 10 |
| Pod Stuck In Terminating | observation period in minutes | 15 |
| Workload Without Ready Pods | sample period in minutes | 10 |
| Workload Without Ready Pods | observation period in minutes | 15 |
| OOM Kills | alert | always |
| Job Failure Events | alert | always |
| Pod Backoff Events | alert | always |
| Pod Eviction Events | alert | always |
| Pod Preemption Events | alert | always |
