Anomaly detectors

Dynatrace supports anomaly detection and alerting for Kubernetes entities. This section provides a complete list of available alerts as well as default settings for new environments.

Available alerts

Cluster

Alert name	Dynatrace version	Problem type	Problem title	Problem description	Calculation
Detect cluster readiness issues	1.254	Availability	Cluster not ready	Readyz endpoint indicates that this cluster is not ready.	Cluster readyz metric
Detect cluster CPU-request saturation	1.254	Resource	CPU-request saturation on cluster	CPU-request saturation exceeds the specified threshold.	Node CPU requests / Node CPU allocatable
Detect cluster memory-request saturation	1.254	Resource	Memory-request saturation on cluster	Memory-request saturation exceeds the specified threshold.	Node memory requests / Node memory allocatable
Detect cluster pod-saturation	1.258	Resource	Pod saturation on cluster	Cluster pod-saturation exceeds the specified threshold.	Sum of ready pods / Sum of allocatable pods
Detect monitoring issues	1.258	Availability	Monitoring not available	Dynatrace API monitoring is not available.
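
The saturation alerts above all compare a ratio against a threshold. The following is a minimal sketch of that comparison for cluster CPU-request saturation. It assumes per-node values in millicores and a hypothetical threshold of 90 percent; the function name, the sample values, and the threshold are illustrative and are not taken from Dynatrace.

```python
def request_saturation(requests, allocatable):
    """Return total requests as a percentage of total allocatable capacity."""
    total_allocatable = sum(allocatable)
    if total_allocatable == 0:
        return 0.0  # no capacity reported, nothing to evaluate
    return 100.0 * sum(requests) / total_allocatable

# Hypothetical per-node CPU values in millicores
node_cpu_requests = [1500, 2500, 3200]
node_cpu_allocatable = [4000, 4000, 4000]

THRESHOLD_PERCENT = 90.0  # illustrative threshold, not a Dynatrace default

saturation = request_saturation(node_cpu_requests, node_cpu_allocatable)
alert = saturation > THRESHOLD_PERCENT
print(f"saturation={saturation:.1f}% alert={alert}")
```

With these sample values, the requests add up to 7200 of 12000 allocatable millicores, so the saturation stays below the illustrative threshold.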

Node

Alert name	Dynatrace version	Problem type	Problem title	Problem description	Calculation
Detect node readiness issues	1.254	Availability	Node not ready	Node is not ready.	Node condition metric filtered by 'not ready'
Detect problematic node conditions	1.264	Error	Problematic node condition	Node has one or more problematic conditions out of the following: ContainerRuntimeUnhealthy, DiskPressure, FrequentContainerdRestart, FrequentDockerRestart, FrequentKubeletRestart, KernelDeadlock, KubeletUnhealthy, MemoryPressure, NetworkUnavailable, OutOfDisk, PIDPressure, ReadonlyFilesystem, ContainerRuntimeProblem, CorruptDockerOverlay2, FilesystemCorruptionProblem, FrequentGcfsdRestart, FrequentGcfsSnapshotterRestart, FrequentUnregisterNetDevice, GcfsdUnhealthy, GcfsSnapshotterMissingLayer, GcfsSnapshotterUnhealthy, KubeletProblem	Nodes condition metric
Detect node CPU-request saturation	1.254	Resource	CPU-request saturation on node	CPU-request saturation exceeds the specified threshold.	Sum of node CPU requests / Sum of node CPU allocatable
Detect node memory-request saturation	1.254	Resource	Memory-request saturation on node	Memory-request saturation exceeds the specified threshold.	Sum of node memory requests / Sum of node memory allocatable
Detect node pod-saturation	1.254	Resource	Pod saturation on node	Pod saturation exceeds the specified threshold.	Sum of running pods on node / Node pod limit
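
As an illustration of the problematic node condition check, the sketch below filters a node's conditions for types from the list above whose status is "True". The input shape follows the `type` and `status` fields of Kubernetes node conditions. The function name and the sample node are hypothetical, and only a subset of the listed condition types is included.

```python
# Subset of the condition types listed in the table above; extend as needed.
PROBLEMATIC_TYPES = {
    "DiskPressure",
    "MemoryPressure",
    "PIDPressure",
    "NetworkUnavailable",
    "KernelDeadlock",
    "ReadonlyFilesystem",
}

def find_problematic_conditions(conditions):
    """Return the sorted types of listed conditions that are currently active."""
    return sorted(
        c["type"]
        for c in conditions
        if c["type"] in PROBLEMATIC_TYPES and c["status"] == "True"
    )

# Hypothetical node conditions, shaped like Kubernetes status.conditions entries
node_conditions = [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "False"},
    {"type": "DiskPressure", "status": "True"},
    {"type": "PIDPressure", "status": "True"},
]

active = find_problematic_conditions(node_conditions)
print(active)
```

The `Ready` condition is deliberately excluded here because node readiness is covered by a separate alert.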

Namespace

Alert name	Dynatrace version	Problem type	Problem title	Problem description	Calculation
Detect namespace CPU-request quota saturation	1.254	Resource	CPU-request quota saturation	CPU-request quota saturation exceeds the specified threshold.	Sum of resource quota CPU used / Sum of resource quota CPU requests
Detect namespace CPU-limit quota saturation	1.254	Resource	CPU-limit quota saturation	CPU-limit quota saturation exceeds the specified threshold.	Sum of resource quota CPU used / Sum of resource quota CPU limits
Detect namespace memory-request quota saturation	1.254	Resource	Memory-request quota saturation	Memory-request quota saturation exceeds the specified threshold.	Sum of resource quota memory used / Sum of resource quota memory requests
Detect namespace memory-limit quota saturation	1.254	Resource	Memory-limit quota saturation	Memory-limit quota saturation exceeds the specified threshold.	Sum of resource quota memory used / Sum of resource quota memory limits
Detect namespace pod quota saturation	1.254	Resource	Pod quota saturation	Pod quota saturation exceeds the specified threshold.	Sum of resource quota pods used / Sum of resource quota pods limit
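
The namespace calculations sum over all resource quotas in a namespace before dividing. A rough sketch of that aggregation follows. It assumes each quota is represented as a dictionary with `used` and `hard` maps keyed by Kubernetes quota resource names, with values already converted to plain numbers. The function name and the sample data are hypothetical.

```python
def quota_saturation(quotas, resource):
    """Return used / hard in percent for one resource, summed over all quotas.

    Returns None when no quota defines a hard limit for the resource.
    """
    used = sum(q["used"].get(resource, 0) for q in quotas)
    hard = sum(q["hard"].get(resource, 0) for q in quotas)
    if hard == 0:
        return None
    return 100.0 * used / hard

# Hypothetical quotas in one namespace (CPU in millicores)
namespace_quotas = [
    {"hard": {"requests.cpu": 4000, "pods": 20}, "used": {"requests.cpu": 3000, "pods": 5}},
    {"hard": {"requests.cpu": 4000}, "used": {"requests.cpu": 3000}},
]

for name in ("requests.cpu", "pods", "limits.memory"):
    print(name, quota_saturation(namespace_quotas, name))
```

Returning None for a resource without a quota keeps "no quota defined" distinct from "quota defined but unused".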

Workload

Alert name	Dynatrace version	Problem type	Problem title	Problem description	Calculation
Detect container restarts	1.254	Error	Container restarts	Observed container restarts exceed the specified threshold.	Container restarts metric
Detect stuck deployments	1.260	Error	Deployment stuck	Deployment is stuck and therefore is no longer progressing.	Workload condition metric filtered by 'not progressing'
Detect pods stuck in pending	1.254	Resource	Pods stuck in pending	Workload has pending pods.	Pods metric filtered by phase 'Pending'
Detect pods stuck in terminating	1.260	Resource	Pods stuck in terminating	Workload has pods stuck in terminating.	Pods metric filtered by status 'Terminating'
Detect workloads without ready pods	1.254	Error	No pod ready	Workload does not have any ready pods.	Sum of non-failed pods - Sum of non-failed and non-ready pods
Detect workloads with non-ready pods	1.258	Error	Not all pods ready	Workload has pods that are not ready.	Sum of non-failed pods - Sum of non-failed and non-ready pods
Detect memory usage saturation	1.264	Resource	Memory usage close to limits	The memory usage (working set memory) exceeds the threshold in terms of the defined memory limit.	Sum of workload working set memory / Sum of workload memory limits
Detect CPU usage saturation	1.264	Resource	CPU usage close to limits	The CPU usage exceeds the threshold in terms of the defined CPU limit.	Sum of workload CPU usage / Sum of workload CPU limits
Detect high CPU throttling	1.264	Resource	High CPU throttling	The CPU throttling to usage ratio exceeds the specified threshold.	Sum of workload CPU throttled / Sum of workload CPU usage
Detect out-of-memory kills	1.268	Error	Out-of-memory kills	Out-of-memory kills have been observed for pods of this workload.	Out-of-memory kills metric
Detect job failure events	1.268	Error	Job failure event	Events with reason 'BackoffLimitExceeded', 'DeadlineExceeded', or 'PodFailurePolicy' have been detected.	Event metric filtered by reason and workload kind
Detect pod backoff events	1.268	Error	Backoff event	Events with reason 'BackOff' have been detected for pods of this workload. Check for pods with status 'ImagePullBackOff' or 'CrashLoopBackOff'.	Event metric filtered by reason
Detect pod eviction events	1.268	Error	Pod eviction event	Events with reason 'Evicted' have been detected for pods of this workload.	Event metric filtered by reason
Detect pod preemption events	1.268	Error	Preemption event	Events with reasons 'Preempted' or 'Preempting' have been detected for pods of this workload.	Event metric filtered by reason
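
The two ready-pod alerts share the same calculation, which yields the number of ready pods. The sketch below shows one way that difference could map to the two problem titles. How Dynatrace actually interprets the result is an assumption here, as is the choice to skip workloads that have no non-failed pods. The function name is hypothetical.

```python
def classify_ready_pods(non_failed, non_failed_non_ready):
    """Map the ready-pod calculation to a problem title, or None if healthy.

    ready = non-failed pods - (non-failed and non-ready pods)
    """
    ready = non_failed - non_failed_non_ready
    if non_failed == 0:
        return None  # assumption: nothing to evaluate without non-failed pods
    if ready == 0:
        return "No pod ready"
    if ready < non_failed:
        return "Not all pods ready"
    return None

print(classify_ready_pods(3, 3))  # no pod is ready
print(classify_ready_pods(3, 1))  # some pods are not ready
print(classify_ready_pods(3, 0))  # all pods are ready
```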

Persistent volume claim alerts

Alert name	Dynatrace version	Problem type	Problem title	Problem description	Calculation
Detect low disk space (MB)	1.262	Resource	Kubernetes PVC: Low disk space	Available disk space for a persistent volume claim is below the threshold.	Kubelet volume stats available bytes metric
Detect low disk space (%)	1.262	Resource	Kubernetes PVC: Low disk space %	Available disk space for a persistent volume claim is below the threshold.	Volume stats available bytes / Volume stats capacity bytes
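
Both persistent volume claim alerts fire when available space drops below a threshold, once as an absolute value and once as a share of capacity. A minimal sketch of the two checks follows. It assumes that 1 MB is counted as 1024 * 1024 bytes and uses hypothetical threshold values; the function names and sample volume are illustrative.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: MB is treated as MiB

def low_disk_space_mb(available_bytes, threshold_mb):
    """True if the available space in MB is below the threshold."""
    return available_bytes / BYTES_PER_MB < threshold_mb

def low_disk_space_percent(available_bytes, capacity_bytes, threshold_percent):
    """True if available / capacity, in percent, is below the threshold."""
    if capacity_bytes == 0:
        return False  # no capacity reported, nothing to evaluate
    return 100.0 * available_bytes / capacity_bytes < threshold_percent

# Hypothetical volume: 10 GiB capacity with 256 MiB still available (2.5 percent)
capacity = 10 * 1024**3
available = 256 * 1024**2

print(low_disk_space_mb(available, threshold_mb=100))
print(low_disk_space_percent(available, capacity, threshold_percent=3))
```

The sample volume passes the absolute check but fails the percentage check, which shows why the two alerts can disagree on large volumes.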

Default settings for new environments

This section lists the alerts that are enabled by default in new environments, along with their settings.

Deviating default values

Default values for new environments may differ from the values applied when resetting alert configurations in the Dynatrace Classic interface.

Cluster

Alert	Setting	Value
Readiness Issues	sample period in minutes	3
Readiness Issues	observation period in minutes	5
Monitoring Issues	sample period in minutes	15
Monitoring Issues	observation period in minutes	30

Node

Alert	Setting	Value
Readiness Issues	sample period in minutes	3
Readiness Issues	observation period in minutes	5
Node Problematic Condition	sample period in minutes	3
Node Problematic Condition	observation period in minutes	5

Persistent volume claim

Alert	Setting	Value
Low Disk Space Critical Percentage	threshold	3
Low Disk Space Critical Percentage	sample period in minutes	3
Low Disk Space Critical Percentage	observation period in minutes	5

Workload

Alert	Setting	Value
Container Restarts	threshold	1
Container Restarts	sample period in minutes	3
Container Restarts	observation period in minutes	5
Deployment Stuck	sample period in minutes	3
Deployment Stuck	observation period in minutes	5
Pending Pods	threshold	1
Pending Pods	sample period in minutes	10
Pending Pods	observation period in minutes	15
Pod Stuck In Terminating	sample period in minutes	10
Pod Stuck In Terminating	observation period in minutes	15
Workload Without Ready Pods	sample period in minutes	10
Workload Without Ready Pods	observation period in minutes	15
OOM Kills	alert	always
Job Failure Events	alert	always
Pod Backoff Events	alert	always
Pod Eviction Events	alert	always
Pod Preemption Events	alert	always
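
The sample period and observation period settings can be read as a sliding-window rule: alert when the condition is violated for at least the sample period within the most recent observation period. That reading is an assumption, as is the use of one sample per minute. The sketch below illustrates it with a hypothetical function and input series.

```python
from collections import deque

def evaluate_window(violations, sample_period, observation_period):
    """For each one-minute sample, report whether the alert condition holds.

    The condition holds when at least `sample_period` of the last
    `observation_period` samples are violations.
    """
    window = deque(maxlen=observation_period)  # keeps only the newest samples
    results = []
    for violated in violations:
        window.append(bool(violated))
        results.append(sum(window) >= sample_period)
    return results

# Hypothetical per-minute readiness checks (True = node not ready)
samples = [True, False, True, True, False, False, False, False]

# 3 and 5 are the readiness defaults listed in the tables above
states = evaluate_window(samples, sample_period=3, observation_period=5)
print(states)
```

With a sample period of 3 and an observation period of 5, a single bad minute does not raise an alert; the condition first holds at the fourth sample, when three of the samples seen so far are violations.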