Monitoring issues troubleshooting
This guide provides general troubleshooting steps and guidance for common issues encountered when using Dynatrace with Kubernetes. It covers how to access debug logs, use the troubleshoot subcommand, or generate a support archive.
Pods stuck in Terminating state after upgrade
If your CSI driver and OneAgent pods get stuck in Terminating state after upgrading from Dynatrace Operator version 0.9.0, you need to manually delete the pods that are stuck.
Run the command below.
kubectl delete pod -n dynatrace --selector=app.kubernetes.io/component=csi-driver,app.kubernetes.io/name=dynatrace-operator,app.kubernetes.io/version=0.9.0 --force --grace-period=0
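To verify the cleanup worked, you can check that the CSI driver pods are re-created and become ready (namespace and label taken from the command above):
kubectl get pods -n dynatrace -l app.kubernetes.io/component=csi-driver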
Unable to retrieve the complete list of server APIs
Dynatrace Operator
unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
If the Dynatrace Operator pod logs this error, you need to identify and fix the problematic services. To identify them:
- Check available resources.
kubectl api-resources
- If the command returns this error, list all the API services and make sure there aren't any services whose AVAILABLE column is False.
kubectl get apiservice
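A quick way to spot the problematic entries, assuming the standard kubectl output columns, is to filter out the healthy ones:
kubectl get apiservice | grep -v True
Anything left besides the header reports AVAILABLE as False; fix the backing service or, if it is no longer needed, remove the API service (for example with kubectl delete apiservice <name>).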
CrashLoopBackOff: Downgrading OneAgent is not supported, please uninstall the old version first
Dynatrace Operator
If you get this error, the OneAgent version installed on your host is later than the version you're trying to run.
Solution: First uninstall OneAgent from the host, and then select your desired version in the Dynatrace web UI or in DynaKube. To uninstall OneAgent, connect to the host and run the uninstall.sh script. (The default location is /opt/dynatrace/oneagent/agent/uninstall.sh.)
For CSI driver deployments, use the following steps instead (example commands are sketched after the list):
- Delete the DynaKube custom resources.
- Delete the CSI driver manifest.
- Delete the /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com directory from all Kubernetes nodes.
- Reapply the CSI driver and DynaKube custom resources.
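A minimal sketch of these steps, assuming the default dynatrace namespace, a CSI driver deployed from kubernetes-csi.yaml, and a DynaKube saved in dynakube.yaml (adjust names and paths to your setup):
kubectl delete dynakube --all -n dynatrace
kubectl delete -f kubernetes-csi.yaml
# run on every node, for example over SSH:
rm -rf /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com
kubectl apply -f kubernetes-csi.yaml
kubectl apply -f dynakube.yaml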
Crash loop on pods when installing OneAgent
Application-only monitoring
If you get a crash loop on the pods when you install OneAgent, you need to increase the CPU and memory available to the pods.
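For example, a hedged sketch of raising the requests and limits on the affected container in your application's pod spec (the values are illustrative; size them for your workload):
resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi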
Deployment seems successful but the dynatrace-oneagent container doesn't show up as ready
DaemonSet
kubectl get ds/dynatrace-oneagent --namespace=kube-system
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE-SELECTOR                 AGE
dynatrace-oneagent   1         1         0       1            0           beta.kubernetes.io/os=linux   14m
kubectl logs -f dynatrace-oneagent-abcde --namespace=kube-system
09:46:18 Started agent deployment as Docker image, PID 1234.
09:46:18 Agent installer can only be downloaded from secure location. Your installer URL should start with 'https': REPLACE_WITH_YOUR_URL
Replace the REPLACE_WITH_YOUR_URL value in the dynatrace-oneagent.yml DaemonSet with the Dynatrace OneAgent installer URL.
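After editing the manifest, reapply it so the DaemonSet picks up the new URL (file name taken from the step above; add your namespace if the manifest doesn't set one):
kubectl apply -f dynatrace-oneagent.yml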
Deployment seems successful, but the dynatrace-oneagent image can't be pulled
DaemonSet
oc get pods
NAME                       READY   STATUS         RESTARTS   AGE
dynatrace-oneagent-abcde   0/1     ErrImagePull   0          3s
oc logs -f dynatrace-oneagent-abcde
Error from server (BadRequest): container "dynatrace-oneagent" in pod "dynatrace-oneagent-abcde" is waiting to start: image can't be pulled
This is typically the case if the dynatrace service account hasn't been allowed to pull images from the RHCC.
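A possible fix, assuming the registry pull secret is named redhat-connect (a placeholder for your setup) and the service account is dynatrace, is to link the secret for image pulls and then delete the failing pod so it is re-created:
oc secrets link dynatrace redhat-connect --for=pull
oc delete pod dynatrace-oneagent-abcde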
Deployment seems successful, but the dynatrace-oneagent container doesn't produce meaningful logs
DaemonSet
kubectl get pods --namespace=kube-system
NAME                       READY   STATUS              RESTARTS   AGE
dynatrace-oneagent-abcde   0/1     ContainerCreating   0          3s
kubectl logs -f dynatrace-oneagent-abcde --namespace=kube-system
Error from server (BadRequest): container "dynatrace-oneagent" in pod "dynatrace-oneagent-abcde" is waiting to start: ContainerCreating
This is typically the case if the container hasn't yet fully started. Simply wait a few more seconds.
Deployment seems successful, but the dynatrace-oneagent container isn't running
DaemonSet
oc process -f dynatrace-oneagent-template.yml ONEAGENT_INSTALLER_SCRIPT_URL="[oneagent-installer-script-url]" | oc apply -f -
daemonset "dynatrace-oneagent" created
Please note that quotes are needed to protect the special shell characters in the OneAgent installer URL.
oc get pods
No resources found.
This is typically the case if the dynatrace service account hasn't been configured to run privileged pods.
oc describe ds/dynatrace-oneagent
Name:           dynatrace-oneagent
Image(s):       dynatrace/oneagent
Selector:       name=dynatrace-oneagent
Node-Selector:  <none>
Labels:         template=dynatrace-oneagent
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
FirstSeen   LastSeen   Count   From            SubObjectPath   Type      Reason         Message
---------   --------   -----   ----            -------------   ----      ------         -------
6m          3m         17      {daemon-set }                   Warning   FailedCreate   Error creating: pods "dynatrace-oneagent-" is forbidden: unable to validate against any security context constraint: [spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used]
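A common remediation, assuming the service account is named dynatrace in the project where the DaemonSet runs (names are placeholders for your deployment), is to allow it to use the privileged SCC and then re-create the DaemonSet pods:
oc adm policy add-scc-to-user privileged -z dynatrace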
Deployment was successful, but monitoring data isn't available in Dynatrace
DaemonSet
kubectl get pods --namespace=kube-system
NAME                       READY   STATUS    RESTARTS   AGE
dynatrace-oneagent-abcde   1/1     Running   0          1m
This is typically caused by a timing issue that occurs when application containers start before OneAgent is fully installed on the system. As a consequence, some parts of your application run uninstrumented. To be on the safe side, make sure OneAgent is fully integrated before you start your application containers. If your application is already running, restarting its containers has the same effect.
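For example, to re-create the containers of an already-running workload so they start with OneAgent in place (deployment and namespace names are placeholders):
kubectl rollout restart deployment/<your-app> --namespace=<your-namespace>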
No pods scheduled on control-plane nodes
DaemonSet
Kubernetes version 1.24+
Taints on master and control plane nodes changed in Kubernetes 1.24+, and the OneAgent DaemonSet is missing the appropriate tolerations in the DynaKube custom resource.
To add the necessary tolerations, edit the DynaKube YAML as follows.
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Exists
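In the DynaKube custom resource, these tolerations belong under the OneAgent section for your deployment mode; a minimal sketch, assuming classicFullStack mode and the v1beta1 API (the API version and field layout may differ for your Operator version):
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: dynakube
  namespace: dynatrace
spec:
  apiUrl: https://<your-environment-id>.live.dynatrace.com/api
  oneAgent:
    classicFullStack:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists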
Error when applying the custom resource on GKE
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.dynatrace.com": Post "https://dynatrace-webhook.dynatrace.svc:443/validate?timeout=2s": context deadline exceeded
If you are getting this error when trying to apply the custom resource on your GKE cluster, the firewall is blocking requests from the Kubernetes API to the Dynatrace Webhook because the required port (8443) is blocked by default.
The default allowed ports (443 and 10250) on GCP refer to the ports exposed by your nodes and pods, not the ports exposed by any Kubernetes services. For example, if the cluster control plane attempts to access a service on port 443 such as the Dynatrace webhook, but the service is implemented by a pod using port 8443, this is blocked by the firewall.
To fix this, add a firewall rule to explicitly allow ingress to port 8443.
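A hedged example of such a rule (the rule name, network, node tag, and control-plane CIDR are placeholders; take the CIDR from your GKE cluster details):
gcloud compute firewall-rules create allow-dynatrace-webhook \
  --network=<your-cluster-network> \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8443 \
  --source-ranges=<control-plane-cidr> \
  --target-tags=<node-tag>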
For more information about this issue, see API request that triggers admission webhook timing out.
CannotPullContainerError
If you get errors like this on your pods when installing Dynatrace OneAgent, your Docker download rate limit has been exceeded.
CannotPullContainerError: inspect image has been retried [X] time(s): httpReaderSeeker: failed open: unexpected status code
For details, consult the Docker documentation.
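One common mitigation, sketched here with placeholder values, is to pull with an authenticated Docker Hub account: create an image pull secret and reference it via imagePullSecrets from the workloads that pull the affected images.
kubectl create secret docker-registry dockerhub-pull \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<user> \
  --docker-password=<access-token> \
  --namespace=dynatrace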
Limit log timeframe
cloudNativeFullStack applicationMonitoring
Dynatrace Operator version 0.10.0+
If there's DiskPressure on your nodes, you can configure the CSI driver log garbage collection interval to lower the storage usage of the CSI driver. By default, logs are kept for 7 days before they are deleted from the file system. To edit this timeframe, select one of the options below, depending on your deployment mode.
Be careful when setting this value; you might need the logs to investigate problems.
- Edit the manifests of the CSI driver DaemonSet (kubernetes-csi.yaml, openshift-csi.yaml) by replacing the placeholders (<your_value>) with your value.
apiVersion: apps/v1
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
        ...
        - name: provisioner
          ...
          env:
            - name: MAX_UNMOUNTED_VOLUME_AGE
              value: <your_value> # defined in days, must be a plain number. `0` means logs are immediately deleted. If not set, defaults to `7`.
- Apply the changes.
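For example, assuming you edited kubernetes-csi.yaml:
kubectl apply -f kubernetes-csi.yaml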