Set resource limits for Dynatrace Operator components


Properly configured resource limits ensure optimal performance and stability of Dynatrace Operator components while preventing resource contention in your Kubernetes cluster. This guide helps you understand how to set appropriate resource limits based on your environment size and usage patterns.

Default resource limits baseline

The provided default resource limits have been validated through performance testing. These defaults performed well in the following environment:

  • 25 nodes (e2-standard-32 node type on Google Kubernetes Engine)
  • 20 DynaKubes
  • 2,500 namespaces
  • 5,000 pods

Resource consumption factors

The following five key indicators influence resource consumption across different Dynatrace Operator components:

Indicator                   | Dynatrace Operator | Webhook    | CSI driver
----------------------------|--------------------|------------|-----------
Namespaces                  | Applicable         | Applicable | –
Nodes                       | Applicable         | –          | –
DynaKubes                   | Applicable         | Applicable | Applicable
Pods                        | –                  | Applicable | Applicable
Number of OneAgent versions | –                  | –          | Applicable

Understanding the impact indicators

  • Namespaces: More namespaces increase the workload for the Operator and webhook as they need to monitor and manage resources across all namespaces.
    • Impact:
      • Increases CPU/memory usage of Operator
      • Increases CPU usage of webhook
  • Nodes: Additional nodes require more resources because the Operator keeps a list of all available nodes on the Kubernetes cluster and verifies that they match the available hosts on the Dynatrace server.
    • Impact:
      • Increases CPU/memory usage of Operator
  • DynaKubes: Each DynaKube resource represents a separate Dynatrace deployment that needs individual management.
    • Impact:
      • Increases CPU/memory usage of Operator
      • Increases CPU/memory usage of webhook
      • Increases CPU/memory usage of CSI driver provisioner
  • Pods: The webhook processes admission requests for every pod, while the CSI driver handles volume mounting for pods using OneAgent.
    • Impact:
      • Increases CPU usage of webhook
      • Increases CPU/memory usage of CSI driver server/liveness-probe/registrar
  • OneAgent versions: The CSI driver needs to manage and provide access to different OneAgent versions, requiring additional storage and processing resources.
    • Impact:
      • Increases CPU/memory usage of CSI driver provisioner

Minimize the impact of a large number of nodes

By default, Dynatrace Operator monitors host availability in your cluster to detect the expected removal of host OneAgent pods, especially in scaling scenarios. This monitoring is not necessary in serverless environments.

When to disable host availability detection

In Operator versions earlier than 1.6.0, this functionality is always active and can't be turned off. It can be turned off in Operator versions 1.6.3 and 1.7.3, or any newer version.

You can reduce the Operator's resource consumption by disabling host availability detection if your DynaKubes:

  • only use application-only monitoring modes (such as applicationMonitoring) and
  • do not use any of the host-based monitoring features.
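
For illustration, a DynaKube similar to the following sketch would qualify, because it uses only the applicationMonitoring mode and no host-based features. The apiVersion can differ between Operator releases, and the name, namespace, and apiUrl shown here are placeholders:

apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: dynakube                # placeholder
  namespace: dynatrace
spec:
  apiUrl: https://<environment-id>.live.dynatrace.com/api   # placeholder
  oneAgent:
    applicationMonitoring: {}   # application-only mode; no host OneAgents deployed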

How to disable host availability detection

To disable host availability detection, set the following Helm value:

operator:
  hostAvailabilityDetection: false

This optimization can reduce CPU and memory usage of the Operator, especially in large clusters with many nodes.

Cluster-wide setting: operator.hostAvailabilityDetection affects all DynaKubes managed by the Operator. Only disable this if you are certain that none of your DynaKubes require host-based monitoring. Disabling it when host OneAgents are required can cause false-positive host missing warnings during node scaling or other node-related operations.

Customize resource limits

While the default resource limits should be sufficient for most use cases, you can customize them based on your specific needs.

  • Dynatrace Operator

    operator:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 100m
        memory: 128Mi
  • Webhook

    webhook:
      requests:
        cpu: 300m
        memory: 128Mi
      limits:
        cpu: 300m
        memory: 128Mi
  • CSI driver

    csidriver:
      csiInit:
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 50m
            memory: 100Mi
      server:
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 50m
            memory: 100Mi
      provisioner:
        resources:
          requests:
            cpu: 300m
            memory: 100Mi
          limits:
            cpu: 300m
            memory: 100Mi
      registrar:
        resources:
          requests:
            cpu: 20m
            memory: 30Mi
          limits:
            cpu: 20m
            memory: 30Mi
      livenessprobe:
        resources:
          requests:
            cpu: 20m
            memory: 30Mi
          limits:
            cpu: 20m
            memory: 30Mi
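
These per-component settings can be combined into a single Helm values file. The following is a minimal sketch; the file name values.yaml, the release name, and the chart reference in the comment are placeholders that depend on how you installed the Operator:

# values.yaml - combined resource overrides for Dynatrace Operator components
operator:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 100m
    memory: 128Mi
webhook:
  requests:
    cpu: 300m
    memory: 128Mi
  limits:
    cpu: 300m
    memory: 128Mi
csidriver:
  provisioner:
    resources:
      requests:
        cpu: 300m
        memory: 100Mi
      limits:
        cpu: 300m
        memory: 100Mi
  # csiInit, server, registrar, and livenessprobe follow the same
  # resources pattern with the values shown above.
# Apply with, for example:
#   helm upgrade <release-name> <chart-ref> -n dynatrace -f values.yaml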

Scaling resource limits for different environments

The default resource requests/limits are designed for medium-scale environments. Use the following guidelines to adjust limits based on your environment size:

These are starting recommendations. Always monitor actual resource usage in your environment and adjust accordingly.

Pod Quality of Service Classes: Some components have their limits and requests set to the same value to ensure a Guaranteed Pod Quality of Service. When scaling the limits of such components, always scale the requests proportionally as well.

Proportional fairness: Low CPU requests on a container can cause throttling due to the node's CPU management policy. The requests serve as a minimum guarantee; on a heavily utilized node, containers with smaller requests are therefore throttled more than containers with larger requests, regardless of their limits. When scaling the limits, always consider scaling the requests as well.

Large environments (> 50 nodes, > 10,000 pods)

Increase the default requests/limits by 50–100%:

  • Dynatrace Operator:
    • Requests: CPU 100m, Memory 128Mi
    • Limits: CPU 200m, Memory 256Mi
  • Webhook:
    • Requests: CPU 600m, Memory 256Mi
    • Limits: CPU 600m, Memory 256Mi
  • CSI driver provisioner:
    • Requests: CPU 600m, Memory 200Mi
    • Limits: CPU 600m, Memory 200Mi
  • CSI driver server:
    • Requests: CPU 100m, Memory 200Mi
    • Limits: CPU 100m, Memory 200Mi
  • CSI driver liveness-probe:
    • Requests: CPU 30m, Memory 50Mi
    • Limits: CPU 30m, Memory 50Mi
  • CSI driver registrar:
    • Requests: CPU 30m, Memory 50Mi
    • Limits: CPU 30m, Memory 50Mi
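
Expressed as Helm values, the large-environment numbers above map onto the same keys as the defaults. A sketch:

operator:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
webhook:
  requests:
    cpu: 600m
    memory: 256Mi
  limits:
    cpu: 600m
    memory: 256Mi
csidriver:
  provisioner:
    resources:
      requests:
        cpu: 600m
        memory: 200Mi
      limits:
        cpu: 600m
        memory: 200Mi
  server:
    resources:
      requests:
        cpu: 100m
        memory: 200Mi
      limits:
        cpu: 100m
        memory: 200Mi
  livenessprobe:
    resources:
      requests:
        cpu: 30m
        memory: 50Mi
      limits:
        cpu: 30m
        memory: 50Mi
  registrar:
    resources:
      requests:
        cpu: 30m
        memory: 50Mi
      limits:
        cpu: 30m
        memory: 50Mi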

Enterprise environments (> 100 nodes, > 25,000 pods)

Increase the default requests/limits by 100–200%:

  • Dynatrace Operator:
    • Requests: CPU 200m, Memory 256Mi
    • Limits: CPU 400m, Memory 512Mi
  • Webhook:
    • Requests: CPU 1000m, Memory 512Mi
    • Limits: CPU 1000m, Memory 512Mi
  • CSI driver provisioner:
    • Requests: CPU 900m, Memory 300Mi
    • Limits: CPU 900m, Memory 300Mi
  • CSI driver server:
    • Requests: CPU 150m, Memory 300Mi
    • Limits: CPU 150m, Memory 300Mi
  • CSI driver liveness-probe:
    • Requests: CPU 50m, Memory 70Mi
    • Limits: CPU 50m, Memory 70Mi
  • CSI driver registrar:
    • Requests: CPU 50m, Memory 70Mi
    • Limits: CPU 50m, Memory 70Mi
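
The corresponding Helm values sketch for the enterprise numbers follows the same structure as the large-environment example:

operator:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 400m
    memory: 512Mi
webhook:
  requests:
    cpu: 1000m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 512Mi
csidriver:
  provisioner:
    resources:
      requests:
        cpu: 900m
        memory: 300Mi
      limits:
        cpu: 900m
        memory: 300Mi
  server:
    resources:
      requests:
        cpu: 150m
        memory: 300Mi
      limits:
        cpu: 150m
        memory: 300Mi
  # livenessprobe and registrar follow the same resources pattern
  # with the CPU 50m / Memory 70Mi values listed above.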