Within large environments, certain aspects of your system may consistently trigger alerts that are unnecessary because they relate to non-severe known issues that don't require a human response. Such alert noise may come from non-critical components or build machines that are low on resources, but aren't in a critical state.
To reduce such alert noise and avoid alert spamming, the Dynatrace AI causation engine automatically detects regularly occurring issues that originate from sub-optimal, though acceptable, conditions. Dynatrace detects such frequent issues by reviewing the problem patterns of monitored entities within specified observation periods of one day and one week.
When the same problem is detected multiple times within these periods, Dynatrace evaluates the problem based on the actual severity of a threshold breach combined with the duration of the problem. It then compares these values with the severities and durations of past problem alerts on the same entity and alerts only if the severity of the problem has increased. The following diagram illustrates this process.
Problems that are less severe and have a shorter duration than previous alerts are considered frequent issues, and alerts for them are suppressed. For details on event severities, see event types.
With this approach to detecting and handling frequent issues, you still receive alerts for problems that increase in severity over time, while alert spamming is avoided.
The overview page of an entity that is subject to frequent issues displays a Frequent issue message.
The diagram below shows the classification of issues.
The goal of the evaluation process is to classify an incoming event as yellow or red.
The evaluation process is independent for every event type and every monitored entity. It begins with two sets of historic events:
The evaluation then proceeds as follows:
After initial evaluation, every yellow event is evaluated again with a 1-minute interval until it shifts to red or is deactivated.
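The re-evaluation loop can be pictured with a minimal Python sketch. The function `reevaluate` and the callables `classify` and `is_active` are hypothetical names introduced for illustration only; they don't correspond to any Dynatrace API. The sleep function is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable


def reevaluate(
    classify: Callable[[], str],
    is_active: Callable[[], bool],
    interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Re-check a yellow event once per interval.

    Returns "red" if the event escalates, or "deactivated" if the
    event is closed before that happens.
    """
    while True:
        sleep(interval_s)  # wait one interval before each re-check
        if not is_active():
            return "deactivated"
        if classify() == "red":
            return "red"


# Simulated run: the event stays yellow twice, then shifts to red.
outcomes = iter(["yellow", "yellow", "red"])
result = reevaluate(lambda: next(outcomes), lambda: True, sleep=lambda _s: None)
print(result)
```

Passing the classification and the activity check as callables keeps the loop independent of how those decisions are actually made.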
The following example illustrates the evaluation process.
For the sake of simplicity, this example only considers the 24-hour set. In this example, the event type is CPU saturation on a host.
Historic events for the last 24 hours have the following durations and severities:
Event1: 45 seconds, 95.5%
Event2: 15 seconds, 99%
Event3: 35 seconds, 98%
Event4: 30 seconds, 97%
Event5: 60 seconds, 96%
The sorted sets look like this:
Duration: {Event2, Event4, Event3, Event1, Event5}
Severity: {Event1, Event5, Event4, Event3, Event2}
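As a minimal sketch, the two orderings can be reproduced in Python by sorting the historic events in ascending order, once by duration and once by severity. The `Event` type and its field names are assumptions made for this illustration and are not part of any Dynatrace API.

```python
from typing import NamedTuple


class Event(NamedTuple):
    name: str
    duration_s: int       # duration in seconds
    severity_pct: float   # CPU saturation in percent


history = [
    Event("Event1", 45, 95.5),
    Event("Event2", 15, 99.0),
    Event("Event3", 35, 98.0),
    Event("Event4", 30, 97.0),
    Event("Event5", 60, 96.0),
]

# Ascending order: shortest / least severe first.
by_duration = sorted(history, key=lambda e: e.duration_s)
by_severity = sorted(history, key=lambda e: e.severity_pct)

print("Duration:", [e.name for e in by_duration])
print("Severity:", [e.name for e in by_severity])
```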
A new event arrives: EventNEW, with a duration of 28 seconds and a severity of 95%. It takes the following positions in the sorted sets:
Duration: {Event2, EventNEW, Event4, Event3, Event1, Event5}
Severity: {EventNEW, Event1, Event5, Event4, Event3, Event2}
The subsets, consisting of the events to the right of EventNEW in each sorted set, look like this:
Duration: {Event4, Event3, Event1, Event5}
Severity: {Event1, Event5, Event4, Event3, Event2}
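Taking the events "to the right" of the new event amounts to keeping every historic event whose value exceeds that of the new event. The sketch below assumes a strict comparison for ties, which the example doesn't address; `Event` and `right_of` are hypothetical names used only for illustration.

```python
from typing import Callable, List, NamedTuple


class Event(NamedTuple):
    name: str
    duration_s: int       # duration in seconds
    severity_pct: float   # CPU saturation in percent


history = [
    Event("Event1", 45, 95.5),
    Event("Event2", 15, 99.0),
    Event("Event3", 35, 98.0),
    Event("Event4", 30, 97.0),
    Event("Event5", 60, 96.0),
]
new_event = Event("EventNEW", 28, 95.0)


def right_of(events: List[Event], new: Event,
             key: Callable[[Event], float]) -> List[Event]:
    """Historic events that sort after `new` in ascending order of `key`."""
    # Assumption: an event with an equal value is not counted as "to the right".
    return [e for e in sorted(events, key=key) if key(e) > key(new)]


longer = right_of(history, new_event, key=lambda e: e.duration_s)
more_severe = right_of(history, new_event, key=lambda e: e.severity_pct)

print("Duration:", [e.name for e in longer])
print("Severity:", [e.name for e in more_severe])
```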
The following events appear in both subsets and form the reference set: {Event1, Event3, Event4, Event5}.
The size of the reference set is 4. The condition is resolved as yellow.
The total duration of the reference set is 170 seconds (45 + 35 + 30 + 60). The condition is resolved as red.
Because one condition is resolved as yellow, EventNEW is classified as yellow and doesn't trigger an alert.
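The reference set and its two measurements can be checked with a short Python sketch. The data layout and variable names are assumptions for illustration. The thresholds that resolve each condition as yellow or red aren't stated in this example, so the sketch stops at the size and total duration.

```python
# Historic events: name -> (duration in seconds, severity in percent)
history = {
    "Event1": (45, 95.5),
    "Event2": (15, 99.0),
    "Event3": (35, 98.0),
    "Event4": (30, 97.0),
    "Event5": (60, 96.0),
}
new_duration_s, new_severity_pct = 28, 95.0  # EventNEW

# Events to the right of EventNEW in each sorted set.
longer = {name for name, (d, _) in history.items() if d > new_duration_s}
more_severe = {name for name, (_, s) in history.items() if s > new_severity_pct}

# The reference set contains the events present in both subsets.
reference_set = longer & more_severe

size = len(reference_set)
total_duration_s = sum(history[name][0] for name in reference_set)

print(sorted(reference_set))  # ['Event1', 'Event3', 'Event4', 'Event5']
print(size)                   # 4
print(total_duration_s)       # 170
```

Event2 is the only event left out: it is more severe than EventNEW but shorter, so it appears in the severity subset only.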