Dynatrace detects four types of anomalies for services: response time degradations, increases in failure rate, service load drops, and service load spikes. Each anomaly type is detected independently and triggers its own problems and alerts.
To adjust the global configuration of anomaly detection for services
- Go to Settings.
- Select Anomaly detection > Services. Here you can configure detection for each anomaly type.
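If you prefer to manage these settings as code, they are also exposed through the Settings 2.0 API under the builtin:anomaly-detection.services schema. Below is a minimal Python sketch for reading the current global configuration; the endpoint and schema ID follow the Settings 2.0 API, while the environment URL and token are placeholders you must supply.

```python
import requests

ENV = "https://{your-environment-id}.live.dynatrace.com"  # placeholder environment URL
TOKEN = "dt0c01.your-api-token"  # placeholder; the token needs the settings.read scope

# Read the environment-wide (global) service anomaly detection settings.
resp = requests.get(
    f"{ENV}/api/v2/settings/objects",
    headers={"Authorization": f"Api-Token {TOKEN}"},
    params={
        "schemaIds": "builtin:anomaly-detection.services",
        "scopes": "environment",  # global scope; a service entity ID narrows it
    },
)
resp.raise_for_status()
for obj in resp.json().get("items", []):
    print(obj["objectId"], obj["value"])
```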
Response time degradation
This type of anomaly detection observes the response time of your services and triggers an alert if a metric violates the specified thresholds. Dynatrace can detect degradation based on automatic baselining or fixed thresholds.
Dynatrace evaluates degradation for two categories—all responses and the slowest 10%—and triggers an alert if any response time violates the threshold.
To configure response time degradation detection
For automatic baselining:
- Turn on Detect response time degradation and select automatically from the list.
- Set degradation values in the remaining fields. Violation of any criterion triggers an alert.
- optional To avoid over-alerting, define an actions/min rate below which a service is considered a low-load service. Services with lower load rates are excluded from evaluation.
- optional To avoid accidental alerts, define how long a service must remain in an abnormal state to trigger an alert.
For fixed thresholds:
- Turn on Detect response time degradation and select using fixed thresholds from the list.
- Set degradation values in the remaining fields. Violation of any criterion triggers an alert.
- optional To avoid over-alerting, define an actions/min rate below which a service is considered a low-load service. Services with lower load rates are excluded from evaluation.
- optional To avoid accidental alerts, define how long a service must remain in an abnormal state to trigger an alert.
- From the Sensitivity list, select the sensitivity of the threshold:
- Low: High statistical confidence is used, so brief violations (for example, due to a surge in load) won't trigger alerts.
- Medium: Reasonable statistical confidence is used, so not every single violation triggers an alert.
- High: No statistical confidence is used. Each violation triggers an alert.
For fixed thresholds, the problem impact includes the fixed threshold and the amount by which the threshold was exceeded.
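For configuration-as-code setups, a fixed-threshold response time configuration might be written back through the same Settings 2.0 endpoint. The schema ID below is real, but the field names inside value are illustrative assumptions; fetch GET /api/v2/settings/schemas/builtin:anomaly-detection.services to confirm the exact structure in your environment.

```python
import requests

ENV = "https://{your-environment-id}.live.dynatrace.com"  # placeholder
TOKEN = "dt0c01.your-api-token"  # placeholder; the token needs the settings.write scope

# Field names inside "value" are assumptions -- verify them against the schema.
payload = [{
    "schemaId": "builtin:anomaly-detection.services",
    "scope": "environment",
    "value": {
        "responseTime": {
            "enabled": True,
            "detectionMode": "fixed",  # fixed thresholds instead of automatic baselining
            "fixedDetection": {
                "responseTimeAllMs": 500,         # threshold for all responses
                "responseTimeSlowest10Ms": 1000,  # threshold for the slowest 10%
                "sensitivity": "MEDIUM",          # LOW | MEDIUM | HIGH
                "minutesAbnormalState": 5,        # sustained-violation guard
                "requestsPerMinute": 10,          # low-load exclusion threshold
            },
        },
    },
}]

resp = requests.post(
    f"{ENV}/api/v2/settings/objects",
    headers={"Authorization": f"Api-Token {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # one validation result per posted object
```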
Failure rate increase
This type of anomaly detection observes the failure rate of your services and triggers an alert if the rate exceeds the specified thresholds. Dynatrace can detect failure rate increase based on automatic baselining or fixed thresholds.
To configure increased failure rate detection
For automatic baselining:
- Turn on Detect increase in failure rate and select automatically from the list.
- Specify the relative % and absolute % values above which alerts should be sent out. Both thresholds must be violated to trigger an alert.
- optional To avoid over-alerting, define an actions/min rate below which a service is considered a low-load service. Services with lower load rates are excluded from evaluation.
- optional To avoid accidental alerts, define how long a service must remain in an abnormal state to trigger an alert.
For fixed thresholds:
- Turn on Detect increase in failure rate and select using fixed thresholds from the list.
- Specify the absolute % value above which alerts should be sent out.
- optional To avoid over-alerting, define an actions/min rate below which a service is considered a low-load service. Services with lower load rates are excluded from evaluation.
- optional To avoid accidental alerts, define how long a service must remain in an abnormal state to trigger an alert.
- From the Sensitivity list, select the sensitivity of the threshold:
- Low: High statistical confidence is used, so brief violations (for example, due to a surge in load) won't trigger alerts.
- Medium: Reasonable statistical confidence is used, so not every single violation triggers an alert.
- High: No statistical confidence is used. Each violation triggers an alert.
For fixed thresholds, the problem impact includes the fixed threshold and the amount by which the threshold was exceeded.
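The automatic (baseline-relative) variant of failure rate detection pairs a relative and an absolute increase, and both must be violated. A hypothetical value payload with assumed field names, posted exactly like the response time sketch above:

```python
# Hypothetical payload; field names are assumptions -- check the
# builtin:anomaly-detection.services schema before using them.
failure_rate_value = {
    "failureRate": {
        "enabled": True,
        "detectionMode": "auto",
        "autoDetection": {
            # Both conditions must hold before a problem is raised.
            "relativeIncreasePercent": 50,  # at least 50% above the baseline failure rate
            "absoluteIncreasePercent": 5,   # and at least 5 percentage points in absolute terms
            "requestsPerMinute": 10,        # exclude low-load services
            "minutesAbnormalState": 5,      # require a sustained abnormal state
        },
    },
}
```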
Service load drops
This type of anomaly detection learns the normal behavior of your service load over a period of 7 days and triggers an alert if the load drops significantly.
To configure service load drop detection
- Turn on Detect service load drops.
- Specify the observed load threshold below which alerts are triggered.
- optional To avoid accidental alerts, define how long a service must remain in an abnormal state to trigger an alert.
Service load spikes
This type of anomaly detection learns the normal behavior of your service load over a period of 7 days and triggers an alert if the load increases significantly.
To configure service load spike detection
- Turn on Detect service load spikes.
- Specify the observed load threshold above which alerts are triggered.
- optional To avoid accidental alerts, define how long a service must remain in an abnormal state to trigger an alert.
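Load drops and load spikes share the same shape: a threshold relative to the load observed during the 7-day learning period, plus the abnormal-state duration. A hypothetical combined value payload (field names assumed, posted as in the earlier sketches):

```python
# Hypothetical payload; field names are assumptions.
load_value = {
    "loadDrops": {
        "enabled": True,
        "loadDropPercent": 50,    # alert when load falls below 50% of the learned normal
        "minutesAbnormalState": 5,
    },
    "loadSpikes": {
        "enabled": True,
        "loadSpikePercent": 200,  # alert when load exceeds 200% of the learned normal
        "minutesAbnormalState": 5,
    },
}
```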
Reference period
Davis automatically generates baselines during a recent reference period. The default reference period is the past 7 days.
If the monitoring data collected during the reference period is no longer valid (for example, you've deployed a new version of your application that includes major changes and are now receiving a high number of alerts), select Reset to establish a new baseline. Davis will purge the previous reference period and immediately begin collecting data for a new one.
Thresholds for a specific service
As an alternative to defining thresholds globally across your entire environment, you can provide fine-tuned thresholds for individual services. Service-level thresholds override global thresholds for the service, while global settings still apply to other services. You can revert to globally defined thresholds at any time.
To change threshold settings for a specific service
- Go to Services.
- Select the service you want to configure.
- In the browse menu (…), select Edit.
- Select Anomaly detection.
- Turn off Use global anomaly detection settings.
- Set the service-level thresholds in the same manner as described above for global settings.
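In the Settings 2.0 API, this service-level override corresponds to writing the same schema at the service's entity scope instead of environment; deleting that object reverts the service to the global settings. A sketch, with a hypothetical service entity ID and assumed value fields:

```python
import requests

ENV = "https://{your-environment-id}.live.dynatrace.com"  # placeholder
TOKEN = "dt0c01.your-api-token"  # placeholder; the token needs the settings.write scope

service_override = [{
    "schemaId": "builtin:anomaly-detection.services",
    "scope": "SERVICE-0123456789ABCDEF",  # hypothetical service entity ID
    "value": {
        # Assumed fields: this object shadows the global one for this service only.
        "responseTime": {"enabled": True, "detectionMode": "auto"},
    },
}]

resp = requests.post(
    f"{ENV}/api/v2/settings/objects",
    headers={"Authorization": f"Api-Token {TOKEN}"},
    json=service_override,
)
resp.raise_for_status()
```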
Thresholds for a specific web request
By understanding the baseline performance of individual service requests, Davis can intelligently analyze highly divergent response times of multiple requests within the same service. For example, while the service request orange.jsf might have a median response time of 200 ms, the request orange-booking-payment.jsf within the same service might be faster, with a median response time of 25 ms. Davis understands such differences and raises an alert if either request begins to respond more slowly than its established baseline.
There are, however, use cases for which parameterization of automatic baselining algorithms might be beneficial at the request level. These include:
- To set lower response-time thresholds for specific, business-critical service requests.
- To set higher response-time thresholds for volatile service requests that are non-business critical.
- To set a fixed threshold rather than a relative threshold for requests that have a defined SLA.
- To set a fixed threshold rather than a relative threshold for requests that have a defined timeout point.
- To disable alerting for specific, non-business critical requests.
Each of these use cases can be achieved by setting request-level thresholds for a key request.
To configure thresholds for a key request
- Go to Services.
- Select the service you want to configure.
- In the browse menu (…), select Edit.
- Select Anomaly detection.
- Scroll down to the Set thresholds on key requests section and expand the menu of the required key request.
- Turn off Use service or global settings.
- Set the request-level thresholds in the same manner as described above for global settings.
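If your environment also exposes key request overrides through Settings 2.0, the natural pattern would be the same schema written at the key request's entity scope. This is an assumption rather than documented behavior here, so confirm the supported scopes via GET /api/v2/settings/schemas/builtin:anomaly-detection.services before relying on it:

```python
# Assumption: key request thresholds live at the SERVICE_METHOD entity scope.
key_request_override = {
    "schemaId": "builtin:anomaly-detection.services",
    "scope": "SERVICE_METHOD-0123456789ABCDEF",  # hypothetical key request entity ID
    "value": {
        "responseTime": {
            "enabled": True,
            "detectionMode": "fixed",
            # e.g. a fixed threshold for an SLA-bound request (field names assumed)
            "fixedDetection": {"responseTimeAllMs": 100},
        },
    },
}
```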