Assess and troubleshoot cluster health

  • Latest Dynatrace
  • Tutorial
  • 2-min read
  • Published Dec 18, 2023

The Kubernetes app streamlines cluster health assessment, letting you monitor health signals and metrics across your environment. It gives you clear insight into cluster health so that you can identify and address issues and keep your clusters running efficiently.

Davis AI health status

Get a quick health overview using the Davis AI health status on top of the cluster list. Davis AI automatically assesses and aggregates the health of Kubernetes clusters, nodes, namespaces, and workloads. This feature visualizes the current health state at a high level, enabling you to easily identify both healthy and unhealthy clusters, nodes, namespaces, and workloads.

Assess cluster health

Troubleshoot unhealthy resources

In the example below, we observe that all clusters are currently healthy, but some nodes, namespaces, and workloads are in an unhealthy state.

To troubleshoot these unhealthy Kubernetes objects

  1. Select the red numbers displayed within the health status area. The corresponding table then lists the unhealthy objects, with additional detail about the problems they are facing. For instance, you might notice a node showing a BackOff warning with the last termination reason OOMKilled (out-of-memory killed), which indicates that a container exceeded its memory limit and Kubernetes is delaying restarts.

    Clusters section in the Kubernetes app

  2. Select the node from the list.

    Kubernetes app: Nodes > node details

    The Davis AI health status at the top signals the issue with conditions. Below it is a detailed breakdown of the node's resource utilization. Pay attention to the Memory section: if memory usage exceeds the allocated requests, the node is potentially under resource strain. Kubernetes may report memory pressure at the node level and, to maintain node stability, begin evicting pods to free memory. This typically happens when pods consume more memory than their reserved requests account for. Separately, a container that exceeds its own memory limit is terminated with OOMKilled.

  3. To identify which pods have been out-of-memory killed, refer to the Events section for this node.

    Kubernetes app: Events

  4. Select the relevant line in the Events section and choose Show full value to reveal the full details of the event. In this example, you can see the pod named oom-kill-deployment was terminated with OOMKilled after exceeding its memory limit.

    Full event

  5. Go to the Workloads list and use the filter bar to search for the oom-kill-deployment workload. Select this workload to display details. In the Workload utilization section, you can quickly spot misconfigurations of resource requests.

    Kubernetes app: Workloads
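
The same check can be roughly approximated outside the app. The sketch below is a minimal illustration, not part of the Kubernetes app or any Dynatrace API. It assumes you have pod data in the JSON shape produced by `kubectl get pods -o json`, and it reports containers whose last termination reason was OOMKilled together with their configured memory request and limit. The helper names (`parse_memory`, `oom_killed_containers`), the sample pod name, the namespace, and the resource values are all hypothetical.

```python
# Hypothetical sample in the shape of `kubectl get pods -o json` output.
# Names and values are made up for illustration.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "oom-kill-deployment-example", "namespace": "demo"},
            "spec": {"containers": [{
                "name": "app",
                "resources": {"requests": {"memory": "32Mi"},
                              "limits": {"memory": "64Mi"}},
            }]},
            "status": {"containerStatuses": [{
                "name": "app",
                "restartCount": 7,
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }]},
        },
        {
            "metadata": {"name": "healthy-pod", "namespace": "demo"},
            "spec": {"containers": [{"name": "web", "resources": {}}]},
            "status": {"containerStatuses": [{
                "name": "web", "restartCount": 0, "lastState": {},
            }]},
        },
    ]
}

# Common binary and decimal suffixes for Kubernetes memory quantities.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
          "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory(quantity):
    """Convert a memory quantity such as '64Mi' to bytes (None stays None)."""
    if quantity is None:
        return None
    # Try two-letter suffixes first, then one-letter ones.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * _UNITS[suffix])
    return int(float(quantity))  # no suffix: plain bytes


def oom_killed_containers(pod_list):
    """Return one record per container whose last termination was OOMKilled."""
    findings = []
    for pod in pod_list.get("items", []):
        # Map container name -> its configured resources block.
        resources = {c["name"]: c.get("resources", {})
                     for c in pod["spec"]["containers"]}
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") != "OOMKilled":
                continue
            res = resources.get(status["name"], {})
            findings.append({
                "namespace": pod["metadata"].get("namespace"),
                "pod": pod["metadata"]["name"],
                "container": status["name"],
                "restarts": status.get("restartCount", 0),
                "request_bytes": parse_memory(res.get("requests", {}).get("memory")),
                "limit_bytes": parse_memory(res.get("limits", {}).get("memory")),
            })
    return findings


if __name__ == "__main__":
    for finding in oom_killed_containers(SAMPLE):
        print(finding)
```

To try this against a real cluster, you could replace `SAMPLE` with the parsed output of `kubectl get pods --all-namespaces -o json`. A container that is repeatedly OOMKilled with a limit close to its request is a hint that the memory settings are too tight for the workload, which is the kind of misconfiguration the Workload utilization section is meant to surface.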

Related tags
Infrastructure Observability, Kubernetes (new), Kubernetes