Adaptive traffic management for Dynatrace SaaS

Dynatrace Full-Stack Monitoring offers a variety of features, including distributed tracing for applications via the patented PurePath® technology. Each monitored application or microservice continuously produces distributed traces, containing code-level and business insights, that are sent to Dynatrace.

Depending on the number of application transactions, OneAgent captures end-to-end traces every minute up to a peak trace volume, which is defined per environment by your license. When the volume of transactions is high, the number of traces that OneAgent could capture might exceed the peak trace volume available in your environment. When this happens, OneAgent starts sampling new incoming traces that have a root span. The intelligent Adaptive traffic management mechanism makes this sampling as effective as possible, which prevents overages and saves network bandwidth.

The resulting capture rate is defined as the OneAgent capture rate. While not all possible traces might be captured, any trace that is captured represents a full end-to-end transaction.

Each OneAgent-monitored process captures a limited number of new PurePath traces every minute

How is Adaptive traffic management different from other sampling mechanisms?

In typical applications, the distribution of requests is not even. It's rather a combination of: a large number of unique URLs, a medium number of important requests, and, finally, a few kinds of requests that make up the majority of the traffic (for example, image requests or status checks).

With Adaptive traffic management, OneAgent first calculates a list of top requests starting each minute, from which it then captures:

  • Most traces of unique and rare requests.
  • A significant but lower volume of highly frequent requests.

Because the sampling is not random, all important data is captured while maintaining a statistically valid sample set.

The following table represents a top-request calculation example, along with the respective capture rates.

| Request | Number of requests processed by the application | Capture factor | Captured distributed traces |
| --- | --- | --- | --- |
| URI A | 900 | 1/2 | 450 |
| URI B | 440 | 1/2 | 220 |
| URI C | 250 | 1 | 250 |
| URI D | 60 | 1 | 60 |
| …50 other URIs | 100 | 1 | 100 |
| Total: | 1750 | | 1080 |

In this example, OneAgent captures a bit more than 1,000 requests/min, in line with the configured target number of requests. Depending on the capture factor, requests to a URI are captured every time (URIs C, D, and the 50 other URIs) or only 50% of the time (URIs A and B). Even in the latter case, requests are traced end-to-end by OneAgent over 600 times/minute.
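The arithmetic behind the example table can be reproduced with a short Python sketch. It only multiplies each request count by its capture factor; it does not model how OneAgent actually chooses the capture factors. The names `EXAMPLE`, `captured_traces`, and `summarize` are illustrative and not part of any Dynatrace API.

```python
from fractions import Fraction

# Request counts and capture factors copied from the example table.
# The tuple layout and all names here are illustrative only.
EXAMPLE = [
    ("URI A", 900, Fraction(1, 2)),
    ("URI B", 440, Fraction(1, 2)),
    ("URI C", 250, Fraction(1)),
    ("URI D", 60, Fraction(1)),
    ("50 other URIs", 100, Fraction(1)),
]


def captured_traces(processed: int, capture_factor: Fraction) -> int:
    """Traces captured for one request type at the given capture factor."""
    return int(processed * capture_factor)


def summarize(rows):
    """Return (processed, captured, captured for rows that are sampled)."""
    processed = sum(count for _, count, _ in rows)
    captured = sum(captured_traces(count, f) for _, count, f in rows)
    sampled = sum(captured_traces(count, f) for _, count, f in rows if f < 1)
    return processed, captured, sampled


if __name__ == "__main__":
    processed, captured, sampled = summarize(EXAMPLE)
    print(f"processed={processed} captured={captured} "
          f"captured_from_sampled_uris={sampled}")
```

The third value is the number of traces still captured for the sampled URIs (A and B), which is the figure the text refers to as "over 600 times/minute".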

You can see the effect of Adaptive traffic management in the distributed trace list. If OneAgent is sampling and not all requests are captured, the captured traces indicate that similar requests were not captured, with the message [amount] more like this in the distributed trace list.

In this way, OneAgent reduces the data sent to your environment, ensuring that the amount of captured traces stays within the limits of your Dynatrace agreement.

Most sampling mechanisms manage traffic on the agent in an isolated and uncoordinated manner. Dynatrace, on the other hand, manages the peak trace volume at the environment level, so the available volume is dynamically shared between all monitored applications. In a sense, low-volume applications share their unused trace volume with high-volume applications that need it. Note that the peak trace volume is available for all traces sent by OneAgent code modules or via OneAgent Trace API.

Dynatrace automatically manages the peak trace volume based on your license. To learn more about Adaptive traffic management, see either Adaptive traffic management with Dynatrace Platform Subscription (DPS) or Adaptive traffic management with classic licensing below.

Adaptive traffic management with Dynatrace Platform Subscription (DPS)

In Adaptive traffic management with the latest version of Dynatrace Platform Subscription (DPS), the peak trace volume is measured in Byte/minute and is calculated based on the number of Gibibytes that contribute to your environment's GiB-hour.

Each environment can process a minimum trace volume of 14 Mebibyte/min. For each contributing Gibibyte, the environment peak trace volume is increased by 45 Kibibyte/min. You can calculate the peak trace volume using the metric expression below.

dsfm:billing.fullstack.maximum_included_trace_volume_per_minute=(builtin:billing.full_stack_monitoring.usage:last:splitBy():last*4)*(45*1024)

Every 15 minutes, the peak trace volume is calculated and automatically adjusted based on the average contributing Gibibyte over the prior 15 minutes.
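As a rough illustration of the calculation described above, the following Python sketch derives a peak trace volume from the contributing Gibibytes. It assumes that the 14 Mebibyte/min minimum acts as a floor rather than a base that is added on top, which the text does not state explicitly. The function names are hypothetical, and the sketch does not reproduce the metric expression itself.

```python
KIB = 1024
MIB = 1024 * KIB

MIN_TRACE_VOLUME = 14 * MIB       # minimum trace volume, bytes per minute
PER_CONTRIBUTING_GIB = 45 * KIB   # increase per contributing GiB, bytes per minute


def dps_peak_trace_volume(contributing_gib: float) -> float:
    """Peak trace volume in bytes per minute for a DPS environment.

    Assumption: the minimum is a floor, so small environments get
    14 MiB/min and larger ones get 45 KiB/min per contributing GiB.
    """
    if contributing_gib < 0:
        raise ValueError("contributing_gib must be non-negative")
    return max(MIN_TRACE_VOLUME, contributing_gib * PER_CONTRIBUTING_GIB)


def dps_peak_from_samples(gib_samples) -> float:
    """Peak trace volume from the average contributing GiB of a window.

    The samples stand in for measurements over the prior 15 minutes.
    """
    samples = list(gib_samples)
    if not samples:
        raise ValueError("at least one sample is required")
    return dps_peak_trace_volume(sum(samples) / len(samples))


if __name__ == "__main__":
    for gib in (100, 500, 1024):
        volume = dps_peak_trace_volume(gib)
        print(f"{gib} GiB -> {volume / MIB:.2f} MiB/min")
```

Under the floor assumption, environments below roughly 319 contributing Gibibytes (14 MiB divided by 45 KiB) stay at the minimum.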

With Dynatrace Platform Subscription (DPS), all features are available, including data-heavy ones such as bind variables capture.

Monitoring

To monitor your environment trace capture rate and volume ingress, use the preset dashboard for Adaptive traffic management with Dynatrace Platform Subscription (DPS).

To open the dashboard, go to Dashboards or Dashboards Classic (latest Dynatrace) and select the OneAgent Traces - Adaptive traffic management dashboard.

DPS License

Monitoring dashboard for ATMv3 in DPS

| Tile | Description |
| --- | --- |
| Dynatrace process rate | Percent of full-service calls processed by OneAgent over all full-service calls received by the environment. It represents the environment's health. If the value is continuously below 90%, contact a Dynatrace product expert via live chat within your environment. |
| OneAgent capture rate | Percent of traces captured end-to-end by OneAgent over all traces received by the environment. Values below 100% indicate that sampling was applied to bring the trace volume back within licensed limits. |
| Processed and received full-service calls | Indicates the number of received full-service calls (blue), the peak trace volume (red), and the overall potentially traceable service calls (green) processed by OneAgent over time. |
| Size of full-service calls | Amount of data per service call [1]. Typical values are around 2–3 Kibibytes per environment. Excessive use of data-heavy features like bind variables can increase it and lead to a larger overall trace volume [2]. |
| Trace Ingress/contributing GiB | Amount of trace volume per contributing Gibibyte. Typical values are around 45 Kibibytes. |
| Trace volume ingress | Indicates the captured trace volume [2] (green) and the licensed volume (red) over time [3]. |
| Trace volume used | Percent of used trace volume [2] over the licensed volume [3]. |

[1] Service call data includes request attributes, span attributes, HTTP headers, and bind variables.

[2] The trace volume is roughly equal to [amount] full-service calls * Size of full-service calls.

[3] Values above the licensed limit indicate overages, for which sampling is triggered. After sampling is applied, values return to within licensed limits and the OneAgent capture rate will be below 100%.
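The relationships between these tiles can be sketched in a few lines of Python. This is only an approximation of what the tiles describe, not how the dashboard computes its values. All function names are hypothetical, and the sample numbers in the usage section are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def trace_volume_bytes(full_service_calls: int, avg_call_size_bytes: float) -> float:
    """Approximate trace volume: number of calls times average call size."""
    return full_service_calls * avg_call_size_bytes


def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def exceeds_license(volume_bytes: float, licensed_bytes: float) -> bool:
    """True when the volume is above the licensed limit, which triggers sampling."""
    return volume_bytes > licensed_bytes


if __name__ == "__main__":
    # Made-up sample: 10,000 calls/min at 2.5 KiB each against a 14 MiB/min limit.
    volume = trace_volume_bytes(10_000, 2.5 * KIB)
    licensed = 14 * MIB
    print(f"trace volume used: {percent(volume, licensed):.1f}%")
    print(f"sampling expected: {exceeds_license(volume, licensed)}")
    # Capture rate as captured traces over received traces (made-up counts).
    print(f"capture rate: {percent(600, 1_000):.1f}%")
```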

Adaptive traffic management with classic licensing

In Adaptive traffic management with classic licensing there are two versions, Version 2 and Version 3. Depending on the version, the peak trace volume is measured either in Full-service call/minute or Byte/minute.

For both versions, the environment peak trace volume is calculated based on the active host units in your environment. Each environment can process a minimum peak trace volume equivalent to approximately 20 host units. For each active host unit, the environment peak trace volume is increased by a version-specific fixed value. Every 15 minutes, the peak trace volume is calculated and automatically adjusted based on the current number of active host units.

| Specifications | Version 2 | Version 3 |
| --- | --- | --- |
| Unit of measurement | Full-service call/minute | Byte/minute |
| Min. trace volume | 5000 Full-service call/min | 14 Mebibyte/min |
| Peak trace volume | [amount] active host units × 250 Full-service call/min | [amount] active host units × 720 Kibibyte/min |
| Metric expression | dsfm:server.service_calls.maximum_allowed_per_minute = 250 * dsfm:billing.hostunit.connected:splitBy():last | dsfm:billing.fullstack.maximum_included_trace_volume_per_minute=(dsfm:billing.hostunit.connected:splitBy():last*16)*(720*1024) |
| Data-heavy features | 🟡 Partial availability | 🟢 All available |
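The two peak trace volume formulas from the specifications can be written as a minimal Python sketch. It assumes that the minimum trace volume acts as a floor, consistent with the minimum being described as the equivalent of roughly 20 host units. The function names are hypothetical, and the sketch does not reproduce the metric expressions.

```python
KIB = 1024
MIB = 1024 * KIB


def v2_peak_calls_per_min(active_host_units: float) -> float:
    """Version 2: 250 full-service calls/min per active host unit.

    Assumption: the 5000 calls/min minimum is a floor.
    """
    return max(5000, 250 * active_host_units)


def v3_peak_bytes_per_min(active_host_units: float) -> float:
    """Version 3: 720 KiB/min per active host unit.

    Assumption: the 14 MiB/min minimum is a floor.
    """
    return max(14 * MIB, 720 * KIB * active_host_units)


if __name__ == "__main__":
    for host_units in (10, 20, 100):
        calls = v2_peak_calls_per_min(host_units)
        mib = v3_peak_bytes_per_min(host_units) / MIB
        print(f"{host_units} host units: V2 {calls} calls/min, V3 {mib:.1f} MiB/min")
```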

Version comparison

The trace volume that your environment can process is similar in both versions of Adaptive traffic management with classic licensing. A full-service call typically needs 2–3 Kibibytes in trace volume. For example, a moderate environment of 50 hosts with 32 GB each (= 100 host units) can process up to 25,000 full-service calls per minute in Version 2 (around 49–73 Mebibytes). The same environment can process up to 70.3 Mebibytes of traces per minute in Version 3.

  • If full-service calls are small, your environment will benefit from the higher capture rate of Version 3, because more small calls fit into the same overall trace data volume.
  • Data-heavy features, such as bind variables capture, are only available in Version 3. Data-heavy features can produce a lot more data per full-service call and thus increase the overall trace data volume.
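To make the comparison concrete, the sketch below estimates how many full-service calls fit into the Version 3 volume of one host unit for a given average call size, and the call size at which both versions allow the same number of calls. It uses only the per-host-unit values from the specifications and ignores the environment minimums. The names are hypothetical.

```python
KIB = 1024

V2_CALLS_PER_HOST_UNIT = 250          # full-service calls per minute
V3_BYTES_PER_HOST_UNIT = 720 * KIB    # bytes per minute


def v3_calls_per_host_unit(avg_call_size_bytes: float) -> float:
    """Calls per minute that fit into the Version 3 volume of one host unit."""
    if avg_call_size_bytes <= 0:
        raise ValueError("avg_call_size_bytes must be positive")
    return V3_BYTES_PER_HOST_UNIT / avg_call_size_bytes


def break_even_call_size_bytes() -> float:
    """Average call size at which Version 2 and Version 3 allow equal call counts."""
    return V3_BYTES_PER_HOST_UNIT / V2_CALLS_PER_HOST_UNIT


if __name__ == "__main__":
    for size_kib in (2, 2.5, 3):
        calls = v3_calls_per_host_unit(size_kib * KIB)
        print(f"{size_kib} KiB per call: V3 fits {calls:.0f} calls/min "
              f"per host unit, V2 allows {V2_CALLS_PER_HOST_UNIT}")
    print(f"break-even size: {break_even_call_size_bytes() / KIB:.2f} KiB")
```

Below the break-even size, Version 3 allows more calls per host unit than Version 2; above it, fewer.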

Monitoring

To monitor your environment trace capture rate and volume ingress, based on your version of the Adaptive traffic management with classic licensing, you can use the preset dashboard.

To open the dashboard, go to Dashboards or Dashboards Classic (latest Dynatrace) and select the OneAgent Traces - Adaptive traffic management dashboard.

Classic licensing

Monitoring dashboard for ATMv2 and ATMv3 w/ Classic License

Version 3

| Tile | Description |
| --- | --- |
| Dynatrace process rate | Percent of full-service calls processed by OneAgent over all full-service calls received by the environment. It represents the environment's health. If the value is continuously below 90%, contact a Dynatrace product expert via live chat within your environment. |
| OneAgent capture rate | Percent of traces captured end-to-end by OneAgent over all traces received by the environment. Values below 100% indicate that sampling was applied to bring the trace volume back within licensed limits. |
| Processed and received full-service calls | Indicates the number of received full-service calls (blue), the peak trace volume (red), and the overall potentially traceable service calls (green) processed by OneAgent over time. |
| Size of full-service calls | Amount of data per service call [1]. Typical values are around 2–3 Kibibytes per environment. Excessive use of data-heavy features like bind variables can increase it and lead to a larger overall trace volume [2]. |
| Trace Ingress/in use HU | Amount of trace volume per active host unit. Typical values are around 720 Kibibytes. |
| Trace volume ingress | Indicates the captured trace volume [2] (green) and the licensed volume (red) over time [3]. |
| Trace volume used | Percent of used trace volume [2] over the licensed volume [3]. |

[1] Service call data includes request attributes, span attributes, HTTP headers, and bind variables.

[2] The trace volume is roughly equal to [amount] full-service calls * Size of full-service calls.

[3] Values above the licensed limit indicate overages, for which sampling is triggered. After sampling is applied, values return to within licensed limits and the OneAgent capture rate will be below 100%.

Version 2

| Tile | Description |
| --- | --- |
| Dynatrace process rate | Percent of full-service calls processed by OneAgent over all full-service calls received by the environment. It represents the environment's health. If the value is continuously below 90%, contact a Dynatrace product expert via live chat within your environment. |
| OneAgent capture rate | Percent of traces captured end-to-end by OneAgent over all traces received by the environment. Values below 100% indicate that sampling was applied to bring the trace volume back within licensed limits. |
| Processed and received full-service calls | Indicates the number of received full-service calls (blue), the peak trace volume (red), and the overall potentially traceable service calls (green) processed by OneAgent over time [1]. |
| Size of full-service calls | Amount of data per service call [2] over time. Typical values are around 2–3 Kibibytes per environment. |
| FSC/HU | Number of full-service calls per active host unit. Typical values are around 250. |

[1] Values above the licensed limit indicate overages, for which sampling is triggered. After sampling is applied, values return to within licensed limits and the OneAgent capture rate will be below 100%.

[2] Service call data includes request attributes, span attributes, HTTP headers, and bind variables.

Frequently asked questions

In short, Adaptive traffic management does not affect your charts or analysis.

The shaping of traffic is accounted for transparently and done in a way that ensures statistical validity while capturing rare requests with high probability. All charts show the total real number of requests that your application processes, as does all ad-hoc analysis you might perform. You will not see a difference in charts or service call analysis data unless you're looking at a single distributed trace. Indeed, the only place where this traffic shaping is visible is in the distributed traces list, which displays a message like [number of traces] more like this.

Adaptive traffic management focuses only on the number of traces. It modifies neither service settings nor (global) request settings. Depending on the capture rate and sampling, a low-volume or unique request might not be captured. Service settings such as request naming rules and key request settings apply only to captured traces.

Sampling can affect service monitoring metrics in a few cases, because these metrics are based on captured traces. The following are some known effects.

  • For low-frequency requests in high-volume environments, sampling and a low capture rate can impact the accuracy of metrics. Due to the low frequency of the requests, traces might be captured in a lower volume or not be captured at all. Consequently, some metric values can't be collected. Note that this is reflected in service metric calculations to avoid distortions in charts.
  • Because every single request is accounted for in charts with high resolution and in short timeframes, for high-volume services, sampling and a low capture rate might impact the accuracy of metrics such as request count or error count. Conversely, the accuracy will statistically be better in charts with low resolutions and long timeframes.

If the OneAgent capture rate is below 100%, sampling has been applied because the number of traces that OneAgent could capture has exceeded the licensed limit. You can increase the capture rate by not capturing traces and service calls of lower relevance. Start by looking at the following:

  • Excessive custom services
    Custom services with poor configurations can lead to a high number of full-service calls and increase the trace ingress volume. If custom services are consuming a considerable amount of the trace volume, revisit the configuration to reduce the number of captured custom-service calls.

  • Background activity services
    In certain environments, background activity produces a lot of service calls but adds little value. Currently, you can only disable this feature completely and not case by case. For more information, contact a Dynatrace product expert via live chat within your environment.

  • Many very small service calls in Version 2 (classic licensing)
    If there are many small (below 2.5 KiB) service calls in your environment, we recommend that you switch from Version 2 to Version 3. Version 3 focuses on the trace volume in terms of bytes instead of the number of service calls, leading to a higher capture rate.

  • High number of low-value traces
    In all environments, there are transactions for which traces are of lower value. You can exclude from capture:

    • Specific transactions such as ping and health check traces. To disable the settings, go to Settings and select Server-side service monitoring > Deep monitoring.
    • Entire processes such as log forwarders written in a runtime for which deep monitoring by OneAgent is available, by disabling tracing. To modify the settings, go to the process-level Settings > Deep monitoring.
  • Unused assigned host units
    See Dynatrace is not using all available host units in my classic licensing. What can I do?

You can learn which version of Adaptive traffic management is active in your environment by looking at the preset dashboard. If your environment is using:

  • The latest DPS license
    You're always on the DPS-specific version.
  • Classic licensing or earlier DPS
    Look for the Trace volume ingress and the Processed and received full-service calls tile. If a red line is in the Trace volume ingress tile, then your environment is on Version 3, otherwise on Version 2.

Dynatrace is not using all available host units in my classic licensing. What can I do?

Classic licensing

In both versions of Adaptive traffic management with classic licensing, the calculation of peak trace volume is based on the currently active host units by default. To use all the host units assigned to the environment, contact a Dynatrace product expert via live chat and provide a rationale.

This is an environment-wide change, so you need Dynatrace administrator permissions to turn this feature on.

Classic licensing Dynatrace version 1.264+

Please contact a Dynatrace product expert via live chat within your environment.

This is an environment-wide change, so you need Dynatrace administrator permissions to turn on this feature.

Full-service call

Server-side call that starts: a distributed trace, a service call at a deep monitored tier, or a custom service call. A single distributed trace can contain multiple full-service calls.

| Full-service call | |
| --- | --- |
| Applicable | All requests for web request services and web services (except for external ones), RMI services, messaging services, and custom services are full-service calls. |
| Not applicable | External calls (such as database calls, external web requests, or generally any opaque service call) are not full-service calls, and so aren't counted against your traffic limit. |

The minimum number of full-service calls per minute in a given environment is 5,000 (the equivalent of 20 host units). Each process can start between 50 and 50,000 full-service calls per minute.

Active host units

Host units currently in use and connected to the environment (not the host units assigned to the environment).