Sample data

A distributed application under heavy load may generate a massive amount of observability data. This data incurs generation, processing, transmission, and storage costs. However, it's often possible to use sampling—where you use only a relatively small portion of the observability data and drop the rest—to reduce costs and still effectively monitor your application.

In OpenTelemetry, there are two main sampling methods:

  • Head sampling is done within your application by the OpenTelemetry SDK, and typically involves saving a random sample of transactions.

    Head sampling is simple and effective, but it has important limitations. For example, because the sampling decision needs to be made at the start of the transaction, it can't be affected by anything that happens after that point, such as an error or unusually high latency. A minimal SDK-level example follows this list.

  • Tail sampling makes sampling decisions based on information that isn't available at the start of the transaction, such as the transaction's final status or total duration.

    In OpenTelemetry, tail sampling is typically done with the Collector by temporarily storing the full set of monitoring data until a transaction is completed. The Collector then decides to either save or drop the transaction data based on a set of sampling policies.

    Because tail sampling typically is not random, it's important to ensure that any calculated metrics are unbiased. This can be done by calculating metrics from the full set of transactions, as shown below, or from a separate, randomly sampled stream.
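
For illustration, here is a minimal head-sampling setup using the OpenTelemetry Python SDK. This is a sketch, assuming the opentelemetry-sdk package is installed; adjust the sampling ratio to your needs:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep a random 20% of traces. The decision is made once, for the root span,
# and ParentBased makes all child spans follow that decision.
sampler = ParentBased(root=TraceIdRatioBased(0.2))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("checkout"):
    pass  # roughly 1 in 5 of these traces is recorded and exported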

The following example shows how to configure a Collector instance to sample trace data and send it to Dynatrace as OTLP requests. It uses the spanmetrics connector to compute service metrics from traces before sampling, so the metrics remain accurate.

Prerequisites

  • An OpenTelemetry Collector build that includes the tail_sampling processor, the spanmetrics connector, and the transform and filter processors (all are part of the Collector Contrib distribution).
  • A Dynatrace API token with the relevant ingest scopes (such as openTelemetryTrace.ingest and metrics.ingest).

Demo configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  transform:
    metric_statements:
      - context: metric
        statements:
          # Get count from the histogram. The new metric name will be <histogram_name>_count
          - extract_count_metric(true) where type == METRIC_DATA_TYPE_HISTOGRAM
          # Get sum from the histogram. The new metric name will be <histogram_name>_sum
          - extract_sum_metric(true) where type == METRIC_DATA_TYPE_HISTOGRAM
  filter:
    metrics:
      metric:
        # The Dynatrace OTLP metrics ingest doesn't currently support histograms
        - type == METRIC_DATA_TYPE_HISTOGRAM
  transform/spanmetrics:
    metric_statements:
      - context: metric
        statements:
          # Map the units to something that explicitly counts them in Dynatrace.
          - set(unit, "{requests}") where IsMatch(name, "^requests.duration_count")
          - set(unit, "{requests}") where IsMatch(name, "^requests.calls")
  tail_sampling:
    # This configuration keeps errors, traces longer than 500ms, and 20% of all remaining traces.
    # Adjust with policies of your choice.
    policies:
      - name: policy1-keep-errors
        type: status_code
        status_code: {status_codes: [ERROR, UNSET]}
      - name: policy2-keep-slow-traces
        type: latency
        latency: {threshold_ms: 500}
      - name: policy3-keep-random-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 20}
    decision_wait: 30s
connectors:
  spanmetrics:
    aggregation_temporality: "AGGREGATION_TEMPORALITY_DELTA"
    namespace: "requests"
    metrics_flush_interval: 15s
exporters:
  otlphttp:
    endpoint: ${env:DT_ENDPOINT}
    headers:
      Authorization: "Api-Token ${env:DT_API_TOKEN}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlphttp]
    traces/spanmetrics:
      receivers: [otlp]
      processors: []
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      processors: [transform, filter, transform/spanmetrics]
      exporters: [otlphttp]

Configuration validation

Validate your settings to avoid any configuration issues. For example, if you run the Collector binary directly, you can check the configuration file with the validate subcommand (the binary name depends on your distribution): otelcol-contrib validate --config=config.yaml

Components

Our demo configuration uses the following components.

Receiver

Under receivers, we specify the standard otlp receiver as the active receiver component for our Collector instance and configure it to accept OTLP requests over gRPC (port 4317) and HTTP (port 4318).
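
To verify that the receiver accepts data, you can send a test span from the OpenTelemetry Python SDK. This is a sketch, assuming the Collector is reachable on localhost and the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export to the Collector's OTLP gRPC endpoint configured above.
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

with trace.get_tracer("smoke-test").start_as_current_span("test-span"):
    pass

provider.shutdown()  # flushes the span before the process exits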

Processor

Under processors, we specify the following processors for the trace and metrics pipelines:

  • tail_sampling buffers the spans of each trace for the configured decision_wait period (30 seconds) and then applies the sampling policies: keep traces with errors, keep traces slower than 500 milliseconds, and keep a 20% random sample of all remaining traces.
  • transform extracts count and sum metrics from histogram metrics.
  • filter removes the histogram metrics themselves, as the Dynatrace OTLP metrics ingest doesn't currently support histograms.
  • transform/spanmetrics sets the unit of the computed service metrics so Dynatrace explicitly counts them as requests.

Connector

Under connectors, we specify the spanmetrics connector to compute service metrics from spans. With the configured requests namespace, it emits request counts and duration histograms (requests.calls and requests.duration) with delta temporality, flushing them every 15 seconds.

Exporter

Under exporters, we specify the default otlphttp exporter and configure it with our Dynatrace API URL and the required authentication token.

For this purpose, we set the following two environment variables and reference them in the configuration values for endpoint and Authorization:

  • DT_ENDPOINT contains the base URL of the Dynatrace OTLP API endpoint (for example, https://{your-environment-id}.live.dynatrace.com/api/v2/otlp)
  • DT_API_TOKEN contains the API token

Service pipelines

Under service, we assemble three pipelines:

  • traces assembles the OTLP receiver, tail sampling processor, and otlphttp exporter to send sampled spans to Dynatrace.
  • traces/spanmetrics uses the same OTLP receiver and the spanmetrics connector to compute service metrics from received spans, without sampling, and forwards the computed metrics to the metrics pipeline.
  • metrics uses the transform, filter, and transform/spanmetrics processors to format metrics for Dynatrace metric ingest before sending metrics to Dynatrace using the otlphttp exporter.

OpenTelemetry sampling considerations

Mixed-mode sampling

OpenTelemetry and OneAgent use incompatible approaches to sampling that should not be mixed. If a distributed trace, which may span multiple applications and services, only partially uses either method, the likely outcome is inconsistent data and incomplete distributed traces. Each distributed trace should be sampled by only one of the methods to ensure it's captured in its entirety.

Trace-derived service metrics

Dynatrace trace-derived metrics are calculated from trace data after it's ingested into Dynatrace.

If OpenTelemetry traces are sampled, the trace-derived metrics are calculated only from the sampled subset of trace data. This means that some trace-derived metrics might be biased or incorrect.

For example, a probabilistic sampler that saves 5% of traffic will result in a throughput metric that shows 5% of the actual throughput. If you use OpenTelemetry tail-based sampling to also capture 100% of slow or error traces, your service metrics will not only show incorrect throughput, but will also incorrectly bias error rates and response times.
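
To make the bias concrete, here is a back-of-the-envelope calculation with hypothetical traffic numbers, using the policies described above (keep all error traces, plus a 5% random sample of the rest):

# Hypothetical workload: 1,000 requests, 20 of which (2%) are errors.
total, errors = 1000, 20

# Tail sampling keeps all error traces plus 5% of the remaining ones.
kept_errors = errors
kept_ok = (total - errors) * 0.05    # 49 traces
kept_total = kept_errors + kept_ok   # 69 traces

print(kept_total / total)            # 0.069 -> throughput appears as ~7% of actual
print(kept_errors / kept_total)      # ~0.29 -> error rate appears as ~29%, not 2%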

To mitigate this, if you want to sample OpenTelemetry traces, you should calculate service metrics before sampling and use those metrics rather than the trace-derived metrics calculated by Dynatrace. If you're using the Collector for sampling, trace-derived metrics should be calculated by the Collector before applying sampling, or by the SDK. This can be done with the spanmetrics connector as shown in the example above.

Limits and limitations

Data is ingested using the OpenTelemetry protocol (OTLP) via the Dynatrace OTLP APIs and is subject to the API's limits and restrictions; see the Dynatrace OTLP API documentation for details.