A distributed application under heavy load may generate a massive amount of observability data. This data incurs generation, processing, transmission, and storage costs. However, it's often possible to use sampling—where you use only a relatively small portion of the observability data and drop the rest—to reduce costs and still effectively monitor your application.
In OpenTelemetry, there are two main sampling methods:
Head sampling is done within your application by the OpenTelemetry SDK, and typically involves saving a random sample of transactions.
Head sampling is simple and effective, but it has important limitations. For example, because the sampling decision needs to be made at the start of the transaction, it can't be affected by anything that happens after that point.
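For example, many OpenTelemetry SDKs can enable head sampling without code changes through the standard sampling environment variables. The following is a minimal sketch, shown as a hypothetical container spec excerpt; the sampler name and ratio are illustrative:

```yaml
# Hypothetical container spec excerpt: enable SDK head sampling
# through the standard OpenTelemetry environment variables.
env:
  - name: OTEL_TRACES_SAMPLER
    value: parentbased_traceidratio   # honor the parent's sampling decision; otherwise sample by trace ID
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.2"                      # keep roughly 20% of root traces
```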
Tail sampling is used to make sampling decisions based on information unknown at the start of the transaction.
In OpenTelemetry, tail sampling is typically done with the Collector by temporarily storing the full set of monitoring data until a transaction is completed. The Collector then decides to either save or drop the transaction data based on a set of sampling policies.
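As a minimal sketch (the full example further below adds the policies used in this guide), the tail_sampling processor buffers spans in memory and applies its policies once a trace is considered complete; the values shown here are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 30s    # wait this long after a trace's first span before deciding
    num_traces: 50000     # upper bound on traces buffered in memory while waiting
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
```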
Because tail sampling is typically not random, it's important to ensure that any calculated metrics are unbiased. This can be done by calculating metrics from the full set of transactions, as shown below, or from a separate, randomly sampled stream.
The following configuration example shows how to configure a Collector instance to sample trace data and export it as an OTLP request to Dynatrace. To ensure their accuracy, it uses the spanmetrics connector to compute service metrics from traces before sampling. The configuration uses the transform, filter, and tail_sampling processors, and the spanmetrics connector:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  transform:
    metric_statements:
      - context: metric
        statements:
          # Get count from the histogram. The new metric name will be <histogram_name>_count
          - extract_count_metric(true) where type == METRIC_DATA_TYPE_HISTOGRAM
          # Get sum from the histogram. The new metric name will be <histogram_name>_sum
          - extract_sum_metric(true) where type == METRIC_DATA_TYPE_HISTOGRAM
  filter:
    metrics:
      metric:
        # The Dynatrace OTLP metrics ingest doesn't currently support histograms
        - type == METRIC_DATA_TYPE_HISTOGRAM
  transform/spanmetrics:
    metric_statements:
      - context: metric
        statements:
          # Map the units to something that explicitly counts them in Dynatrace.
          - set(unit, "{requests}") where IsMatch(name, "^requests.duration_count")
          - set(unit, "{requests}") where IsMatch(name, "^requests.calls")
  tail_sampling:
    # This configuration keeps errors, traces longer than 500ms, and 20% of all remaining traces.
    # Adjust with policies of your choice.
    policies:
      - name: policy1-keep-errors
        type: status_code
        status_code: {status_codes: [ERROR, UNSET]}
      - name: policy2-keep-slow-traces
        type: latency
        latency: {threshold_ms: 500}
      - name: policy3-keep-random-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 20}
    decision_wait: 30s

connectors:
  spanmetrics:
    aggregation_temporality: "AGGREGATION_TEMPORALITY_DELTA"
    namespace: "requests"
    metrics_flush_interval: 15s

exporters:
  otlphttp:
    endpoint: ${env:DT_ENDPOINT}
    headers:
      Authorization: Api-Token ${env:DT_API_TOKEN}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlphttp]
    traces/spanmetrics:
      receivers: [otlp]
      processors: []
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      processors: [transform, filter, transform/spanmetrics]
      exporters: [otlphttp]
```
Validate your settings to avoid any configuration issues.
For our configuration, we set up the following components.
Under receivers, we specify the standard otlp receiver as the active receiver component for our Collector instance and configure it to accept OTLP requests over gRPC and HTTP.
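Applications that should report to this Collector can then be pointed at these endpoints, for example through the SDK's standard exporter environment variable (the hostname below is a placeholder for wherever your Collector runs):

```yaml
# Hypothetical application container snippet: send OTLP data to the Collector configured above.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector:4318   # HTTP receiver from the config above; use port 4317 for gRPC
```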
Under processors, we specify:
- transform to compute the desired sum and count values of the histograms. For details, see Compute histogram summaries.
- filter to drop the existing histogram metrics (based on type) and avoid histogram-related error messages.
- transform/spanmetrics to set appropriate units on span metrics.
- tail_sampling to sample distributed traces based on properties of the trace.

Under connectors, we specify the spanmetrics connector to compute service metrics from spans (see the annotated excerpt below).
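For reference, this is the connector excerpt from the configuration above, annotated with assumptions about its output; the metric names mentioned in the comments are inferred from the IsMatch patterns used in the transform/spanmetrics processor:

```yaml
connectors:
  spanmetrics:
    aggregation_temporality: "AGGREGATION_TEMPORALITY_DELTA"  # delta temporality, as required by Dynatrace metrics ingest
    namespace: "requests"        # prefixes the generated metrics, e.g. requests.calls and requests.duration
    metrics_flush_interval: 15s  # how often the computed metrics are emitted to the metrics pipeline
```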
Under exporters, we specify the default otlphttp exporter and configure it with our Dynatrace API URL and the required authentication token. For this purpose, we set the following two environment variables and reference them in the configuration values for endpoint and Authorization.
- DT_ENDPOINT contains the base URL of the Dynatrace API endpoint (for example, https://{your-environment-id}.live.dynatrace.com/api/v2/otlp)
- DT_API_TOKEN contains the API token
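One hypothetical way to supply these two variables is shown below for a Docker Compose setup; the image tag, file paths, and service name are illustrative and not part of the example above:

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest   # contrib distribution includes tail_sampling and spanmetrics
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./collector-config.yaml:/etc/otelcol/config.yaml
    environment:
      DT_ENDPOINT: https://{your-environment-id}.live.dynatrace.com/api/v2/otlp
      DT_API_TOKEN: ${DT_API_TOKEN}   # passed through from the host environment or a secret store
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
```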
Under service, we assemble three pipelines:
- traces assembles the OTLP receiver, tail_sampling processor, and otlphttp exporter to send sampled spans to Dynatrace.
- traces/spanmetrics uses the same OTLP receiver and the spanmetrics connector to compute service metrics from received spans, without sampling, and forwards the computed metrics to the metrics pipeline.
- metrics uses the transform, filter, and transform/spanmetrics processors to format metrics for Dynatrace metric ingest before sending them to Dynatrace using the otlphttp exporter.

OpenTelemetry and OneAgent use incompatible approaches to sampling that should not be mixed. If a distributed trace, which may span multiple applications and services, only partially uses either method, the result is likely to be inconsistent data and incomplete distributed traces. Each distributed trace should be sampled by only one of the methods to ensure it's captured in its entirety.
Dynatrace trace-derived metrics are calculated from trace data after it's ingested into Dynatrace.
If OpenTelemetry traces are sampled, the trace-derived metrics are calculated only from the sampled subset of trace data. This means that some trace-derived metrics might be biased or incorrect.
For example, a probabilistic sampler that saves 5% of traffic will result in a throughput metric that shows only 5% of the actual throughput. If you use OpenTelemetry tail-based sampling to also capture 100% of slow or error traces, your service metrics will not only show incorrect throughput but also biased error rates and response times.
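As a hypothetical illustration of that bias, suppose the true error rate is 1% over 10,000 requests (100 errors, 9,900 successes) and the sampling policies keep every error trace plus 20% of the rest:

$$
\text{sampled error rate} = \frac{100}{100 + 0.2 \times 9\,900} = \frac{100}{2\,080} \approx 4.8\%,
\qquad
\text{true error rate} = \frac{100}{10\,000} = 1\%
$$

Averaged response times shift in the same way, because slow traces are deliberately over-represented in the sampled set.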
To mitigate this, if you want to sample OpenTelemetry traces, you should calculate service metrics before sampling and use those metrics rather than the trace-derived metrics calculated by Dynatrace. If you're using the Collector for sampling, trace-derived metrics should be calculated by the Collector before applying sampling, or by the SDK. This can be done with the spanmetrics connector, as shown in the example above.
Data is ingested using the OpenTelemetry protocol (OTLP) via the Dynatrace OTLP APIs and is subject to the API's limits and restrictions. For more information, see: