As dynamic system architectures grow in complexity and scale, IT teams face mounting pressure to quickly detect and react to business-critical incidents across their multi-cloud environments. Incidents might affect one or more IT components, ultimately leading to large-scale outages that take down critical business services and applications. Such services and applications (for example, accounting systems or web shops) consist of many different components that depend on each other to work reliably and to deliver excellent user experience. If a critical component fails, the ripple effect negatively influences many other dependent components, triggering a large-scale incident.
Concepts such as Service Level Objectives (SLOs) establish a framework of trust between individual components and shift responsibility toward the teams building and owning those components. While SLOs are a great way to gain a machine-processable understanding of the normal operating boundaries of individual services, they don't deliver a deeper understanding of a problem's cause or the best remediation.
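To make the idea of "machine-processable operating boundaries" concrete, here is a minimal sketch of an availability SLO expressed as code. It is not Dynatrace-specific; the target value and request counts are purely illustrative.

```python
# Minimal sketch (not Dynatrace-specific): an SLO as a machine-checkable boundary.
# The SLO target and request counts are illustrative values.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Service Level Indicator: fraction of successful requests."""
    return successful_requests / total_requests if total_requests else 1.0

SLO_TARGET = 0.995  # 99.5% availability over the evaluation window

sli = availability_sli(successful_requests=99_120, total_requests=99_800)
margin = sli - SLO_TARGET  # negative value means the SLO is violated

print(f"SLI: {sli:.4%}, SLO met: {sli >= SLO_TARGET}, margin: {margin:.4%}")
```

A check like this tells you *that* the service left its normal boundaries, but nothing about *why*, which is the gap root cause analysis fills.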
Root cause analysis aims to fill this gap by using all available context information to evaluate an incident and determine its precise root cause. It is equally important to evaluate the impact of an incident, allowing the operations team to quickly triage the incident and reduce the mean time to repair (MTTR).
This page introduces the following core concepts:
Incidents and problems
Events
Root cause analysis
Fault tree analysis
Impact analysis
Problem lifecycle
Problem timing
Duplicate problems
An incident is an abnormality in your environment—in an application, infrastructure, and so on. A problem is an entity that represents the incident in Dynatrace. A problem is the result of automatic Davis® AI root cause and impact analysis of an incident. Problems (just like the underlying incidents) can span multiple dependent components, sharing the same impact and root cause.
A problem might result from single or multiple events occurring simultaneously within the same topology, which is often the case in complex environments. To prevent over-alerting, Davis correlates all events with the same root cause into a single problem.
All detected problems are listed in the problem feed, and Davis automatically updates them in real time with all incoming events and findings.
Events represent different types of singular anomalies, such as a metric breaching a threshold, a baseline degradation, or a point-in-time event like a process crash. Dynatrace also detects and processes informational events such as new software deployments, configuration changes, and other event types.
Dynatrace ingests and stores events from multiple sources. Those events can trigger root cause analysis, an automation, or serve as raw data for dashboards and reports. Together with logs, metrics, and traces, events provide the input for root cause and impact analysis.
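As a mental model, an event can be thought of as a small, structured record attached to a topology entity. The sketch below is illustrative only; the field names are assumptions for this example, not the actual Dynatrace event schema.

```python
# Illustrative only: a simplified event record as it might be ingested alongside
# logs, metrics, and traces. Field names are assumptions, not the Dynatrace schema.
from dataclasses import dataclass

@dataclass
class Event:
    event_type: str   # e.g. "DEPLOYMENT", "PROCESS_CRASH", "RESPONSE_TIME_DEGRADATION"
    entity_id: str    # the topology entity the event was raised on
    start_time: int   # epoch milliseconds
    is_abnormal: bool # most events are informational; only a few indicate unhealthy states

deployment = Event("DEPLOYMENT", "SERVICE-BOOKING", 1_700_000_000_000, is_abnormal=False)
slowdown = Event("RESPONSE_TIME_DEGRADATION", "SERVICE-VERIFICATION",
                 1_700_000_120_000, is_abnormal=True)
```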
Most events don't indicate abnormal or unhealthy states, therefore only a small fraction of events are considered within problems.
Root cause analysis uses all the available context information—such as topology, transaction, and code-level information—to identify events that share the same root cause and impact.
To identify the root cause of problems, time correlation alone is not sufficient. Dynatrace follows a context-aware approach, detecting interdependent events across time, processes, hosts, services, applications, and both vertical and horizontal topological monitoring perspectives. This approach combines multiple standalone anomalies into a single consistent problem, dramatically reducing the alert load.
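One simplified way to picture context-aware correlation is to merge events only when they are both close in time and connected in the dependency topology. The sketch below illustrates that idea; the topology data and the 5-minute proximity window are assumptions for this example, not the actual Davis algorithm.

```python
# A minimal sketch of context-aware correlation: group events into one problem
# only if they are topologically connected AND close in time.
# This is a simplification, not the actual Davis algorithm.

# Hypothetical horizontal dependencies: caller -> callees
TOPOLOGY = {
    "application": ["service-1"],
    "service-1": ["service-2", "service-3"],
}

def connected(a: str, b: str) -> bool:
    return b in TOPOLOGY.get(a, []) or a in TOPOLOGY.get(b, [])

def correlate(events, max_gap_ms=300_000):
    """Merge events into problems when they are topologically connected
    and start within max_gap_ms of each other."""
    problems = []
    for event in sorted(events, key=lambda e: e["start"]):
        for problem in problems:
            if any(connected(event["entity"], e["entity"])
                   and abs(event["start"] - e["start"]) <= max_gap_ms
                   for e in problem):
                problem.append(event)
                break
        else:
            problems.append([event])  # unrelated anomaly: opens its own problem
    return problems
```

Grouping by topology rather than by time alone is what keeps independent anomalies from being merged into the same alert.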
The image below illustrates how Davis analyzes all horizontal and vertical dependencies of a problem. In this example, the application shows abnormal behavior, while the underlying vertical stack operates normally. The analysis follows the transactions of the application, detecting a dependency on a service (Service 1) that also shows abnormal behavior. In turn, all of the service's dependencies behave abnormally as well and are part of the same problem.
Davis includes all relevant abnormalities and ranks all root cause contributors to determine which one has the primary negative impact. You can drill down to the component level and analyze the root cause down to the source code level. For example, Dynatrace can show failing methods within your service code or high GC activity on underlying Java processes.
A problem is rarely a one-time event. Problems often appear in regular patterns and are symptoms of larger issues within your environment. If multiple entities that depend on an affected component experience issues around the same time, all those entities are included in the root cause analysis.
Time correlation alone, however, is not enough to identify the root cause of many application and service problems. Consider, for example, a simple time correlation in which the Booking service calls the Verification service, and the Verification service experiences a slowdown. The first event in the problem evolution is a slowdown of the Verification service. Subsequently, the Booking service experiences a slowdown too, caused by its dependency on the Verification service. In this case, time correlation works well in detecting the root cause of the problem: a slowdown of the Verification service. However, this is an oversimplified description of what happens in real applications.
What if the events in the problem evolution sequence are more nuanced and open to interpretation? What if, for example, the Booking service has a long history of performance problems? Without the complete context, it becomes impossible to decisively conclude that the slowdown of the Booking service is caused by the slowdown of the Verification service. There is a possibility that the Booking service is experiencing another performance issue, unrelated to the Verification service.
Davis root cause analysis uses all related monitoring data to identify interdependencies between the problem and other components, considering anomalies that occurred around the same time and within a dependent topology. That is, all topological dependencies, vertical and horizontal, are part of the analysis.
The Davis context model is built on known dependency information from Smartscape, OneAgent, and cloud integration. Davis uses this information to quickly conduct a fault tree analysis to analyze millions of dependencies and arrive at the most probable root cause of a problem.
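The sketch below illustrates the general idea of walking a dependency graph to find root cause candidates: abnormal entities whose own dependencies are healthy are the most likely contributors. The graph, the set of abnormal entities, and the scoring rule are assumptions for this example, not the actual Davis fault tree analysis.

```python
# Sketch only: walking dependencies from an impacted entity to collect root cause
# candidates. Graph data and the "abnormal leaf" rule are illustrative, not the
# actual Davis fault tree analysis.

# entity -> entities it depends on (horizontal and vertical dependencies)
DEPENDS_ON = {
    "application": ["service-1"],
    "service-1": ["service-2", "process-A"],
    "service-2": ["process-B"],
    "process-B": ["host-1"],
}

ABNORMAL = {"application", "service-1", "service-2", "process-B", "host-1"}

def root_cause_candidates(entry_point: str) -> list[str]:
    """Depth-first walk: abnormal entities with no abnormal dependencies of
    their own are the most likely root cause contributors."""
    candidates, stack, seen = [], [entry_point], set()
    while stack:
        entity = stack.pop()
        if entity in seen:
            continue
        seen.add(entity)
        abnormal_deps = [d for d in DEPENDS_ON.get(entity, []) if d in ABNORMAL]
        if entity in ABNORMAL and not abnormal_deps:
            candidates.append(entity)  # abnormal leaf: likely root cause
        stack.extend(abnormal_deps)
    return candidates

print(root_cause_candidates("application"))  # ['host-1']
```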
The impact of a problem is equally important as its root cause, since both represent essential information for triaging and remediating the underlying incident—the problems that most threaten your company's business have the highest priority for resolution.
Impact analysis identifies which applications' entry-point services are affected by an incident and the size of the overall "blast radius" in terms of the total number of affected entities. Impact analysis also delivers the number of affected SLOs and the number of potentially affected real users.
Ultimately, impact analysis strives to determine how badly the incident affects your business.
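A rough way to picture the "blast radius" is to count everything that transitively depends on the root cause entity. The sketch below uses an invented reverse-dependency graph purely for illustration; it is not how Dynatrace computes impact internally.

```python
# Rough sketch of a "blast radius" count: everything that (transitively) depends
# on the root cause entity is potentially affected. Graph data is illustrative.

# entity -> entities that depend on it (reverse dependency edges)
DEPENDENTS = {
    "host-1": ["process-B"],
    "process-B": ["service-2"],
    "service-2": ["service-1"],
    "service-1": ["application"],
}

def blast_radius(root_cause: str) -> set[str]:
    affected, stack = set(), [root_cause]
    while stack:
        entity = stack.pop()
        for dependent in DEPENDENTS.get(entity, []):
            if dependent not in affected:
                affected.add(dependent)
                stack.append(dependent)
    return affected

affected = blast_radius("host-1")
print(len(affected), affected)  # 4 affected entities, up to the user-facing application
```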
Dynatrace opens a problem upon receiving the first indicator of an incident, which is typically a single event representing abnormal behavior, such as a service slowdown, node saturation, or a workload crash and restart.
A problem automatically follows a lifecycle and remains in the active state as long as at least one affected entity is still in an unhealthy or abnormal state, typically indicated by an active event.
In the following scenario, a performance incident in the infrastructure layer is the root cause of the problem.
The Dynatrace Davis® root-cause engine collects all events that belong to the same incident. As a result, Davis generates a problem that references all incident-relevant information, such as the individual events detected on the impacted topology graph.
The following shows how two individual events are analyzed within one problem generated by Davis.
Consider an example of a metric event configured to use a five-minute sliding window where three one-minute samples need to violate the threshold to raise an event. In that case, the metric starts violating the threshold three minutes before the timestamp when the event is raised.
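The sketch below reproduces that timing logic: with a five-minute sliding window and three violating one-minute samples required, the event is raised only after the third violating sample, so the metric has already been in violation for roughly three minutes by the event timestamp. The threshold and sample values are illustrative.

```python
# Sketch of a "3 violating samples within a 5-minute sliding window" rule.
# Values are illustrative. The event is raised at the third violating sample,
# so the metric started violating roughly three minutes before the event timestamp.
from collections import deque

THRESHOLD = 100.0        # e.g. response time in ms
WINDOW = 5               # sliding window of five 1-minute samples
VIOLATIONS_NEEDED = 3

window = deque(maxlen=WINDOW)
samples = [80, 90, 120, 130, 140, 95]  # one sample per minute

for minute, value in enumerate(samples):
    window.append(value > THRESHOLD)
    if sum(window) >= VIOLATIONS_NEEDED:
        first_violation = minute - VIOLATIONS_NEEDED + 1
        print(f"event raised after minute {minute}'s sample; "
              f"first violating sample was minute {first_violation}")
        break
```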
The detection and analysis of a large-scale incident always requires a balance between fast alerting and having a complete picture of the situation. Data collection is asynchronous by nature due to different observation windows, different synthetic check schedules, and different latencies of information originating from various data sources.
The asynchronous processing of incoming information leads to situations where root-cause detection identifies two problems that share the same root cause.
In such cases, Davis flags one of the problems as a duplicate (dt.davis.is_duplicate=true). Duplicates could be avoided entirely by waiting for a longer period (for example, 30 minutes) to receive all possible information, but this would delay alerts to the operations teams.
Due to the criticality of real-time alerting, potentially incomplete information and problem duplicates are accepted to allow the fastest possible alerting.
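If your downstream automation should ignore flagged duplicates, it can filter on that property. The sketch below uses mock problem records; only the dt.davis.is_duplicate property name comes from the documentation above, the record structure is an assumption.

```python
# Illustrative only: ignoring flagged duplicates in downstream automation.
# The problem records are mock data; only the dt.davis.is_duplicate property
# name is taken from the text above.
problems = [
    {"displayId": "P-1001", "properties": {"dt.davis.is_duplicate": "false"}},
    {"displayId": "P-1002", "properties": {"dt.davis.is_duplicate": "true"}},
]

actionable = [
    p for p in problems
    if p["properties"].get("dt.davis.is_duplicate") != "true"
]
print([p["displayId"] for p in actionable])  # ['P-1001']
```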
After Davis AI detects a problem, the problem enters the Processing state. This means that Davis AI analyzes the event to determine whether it needs to be merged into a more significant problem. The Processing state aims to avoid duplicate issues and false alarms, so no alerts are sent until the state changes. The analysis and information gathering necessary for Davis AI to make a decision usually take up to 3 minutes.
Below, you can find an example of a problem in the Processing state.
If you want to receive alerts immediately after the problem is detected, you can use Custom alert with Metric events. In this case, the Davis AI event flags dt.davis.trigger_delay and dt.davis.analysis_time will be set to 0. The problem won't enter the Processing state, and no Davis AI analysis will be performed.
Alternatively, you can set the processing state per event source in Log events configuration.