Quickly triaging, investigating, and remediating incoming incidents is the core challenge for operations teams. Problems supports them by automatically analyzing complex incidents, collecting all the context, and presenting the root cause and impact within a consistent view.
Problems, backed by data from Grail and Davis® AI analysis, helps operational and site reliability teams reduce the mean time to repair (MTTR) by presenting every aspect of the incident.
This guide shows you how to use Problems to triage detected problems and investigate their root cause and impact.
This guide is written for:
You need to install Problems from Dynatrace Hub.
Set focus and triage
Activate auto refresh
Investigate and compare problems
Read event properties for additional information
Analyze problems with your own tools by exporting CSV
Check the root cause without leaving your context
Investigate all problem relevant logs
Visually notify and automate to speed up remediation
Visualize affected deployments to gather additional insights
By default, Problems shows:
To focus on your domain and triage problems that affect it, set filters. The two most common filters, Status and Category, have selectable settings to the left of the table for quick access. To set other filters, use the filter bar above the table.
Status: Active or Closed.
Filtering with the filter bar allows you to focus your feed on problems based on multiple criteria, such as status, number of affected entities, root cause entity, and more. Place your cursor in the input field to see all the available options. By default, filtering criteria are combined with AND logic. For each criterion, Davis provides a list of suggested values based on your problem feed.
For example, to see problems that are raised due to an increase of JavaScript errors and that persist for longer than 1 hour, use the following filter criteria:
Status=ACTIVE
Duration>1h
Category=Error
Name=JavaScript error rate increase
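The AND combination of these four criteria can be sketched in Python as follows. This only illustrates the filter semantics, not how Problems evaluates filters internally; the record layout (`status`, `duration`, `category`, `name`) and the sample data are assumptions made for this example.

```python
from datetime import timedelta

# Hypothetical problem records; the field names are assumptions for illustration.
problems = [
    {"name": "JavaScript error rate increase", "status": "ACTIVE",
     "category": "Error", "duration": timedelta(hours=2)},
    {"name": "JavaScript error rate increase", "status": "ACTIVE",
     "category": "Error", "duration": timedelta(minutes=20)},
    {"name": "Response time degradation", "status": "CLOSED",
     "category": "Slowdown", "duration": timedelta(hours=3)},
]

# One predicate per filter criterion from the example above.
criteria = [
    lambda p: p["status"] == "ACTIVE",
    lambda p: p["duration"] > timedelta(hours=1),
    lambda p: p["category"] == "Error",
    lambda p: p["name"] == "JavaScript error rate increase",
]


def matches_all(problem, predicates):
    """Return True only if every criterion holds (AND logic)."""
    return all(predicate(problem) for predicate in predicates)


filtered = [p for p in problems if matches_all(p, criteria)]
print(len(filtered))
```

Only the first sample record passes all four criteria; the second is too short-lived and the third fails on status, category, and name.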
The problem filter bar supports Boolean logic filters. This allows you to combine AND and OR criteria and create complex filters using parentheses to group Boolean terms. You can see a Boolean logic filter statement within the Problems app in the example below.
Segments are predefined filters used for quickly filtering the data to include only the relevant entries. In the context of Problems, you or your team can use a predefined set of team-specific segments to filter your problem tables instead of having to create your own problem filters.
The following example shows how to use segments to filter problems connected to easyTravel.
In addition, using segments in Problems allows you to:
Since problems are stored as events in Grail, segments created for filtering problems must define an event filter. For example, if you want to filter problems that were raised in a specific cloud region, you can create a segment with the following event filter:
cloud.region = "us-east-1c" AND event.kind = "DAVIS_PROBLEM"
Segment filters are directly applied to the problem Grail records. Consequently, no entity filters are applied to the problem unless the entity ID is chosen as a primary field of the filtered problem.
For more information on segments and how they work, see Segments.
To make sure you always catch incoming problems, use the refresh settings in the upper-right corner of Problems.
Select Off to turn off automatic refresh.
To see the details of a problem
The problem details page provides all available details about the problem and highlights the root cause entity with a red mark to guide your attention. The example below shows details of a problem with user action degradation, including the root cause entity (the easyTravelBusiness service) and a chart of the abnormal response time of that service.
All entities affected by the problem are listed in the Affected entities section, along with information about entity type and the number of events detected during the analysis.
If all the filters are applied and you still have multiple problems to investigate, you can select and compare the details of multiple problems.
In the table, use the checkboxes to select two or more problems.
Select Show details.
This preloads the details of all selected problems and adds controls to the upper-right corner of the problem details page so you can quickly switch between each selected problem.
Dynatrace receives events from multiple event sources, such as OneAgent, Synthetic, extensions, and ingestion APIs. Dynatrace accepts and understands various properties (also referred to as fields) of those events that provide additional information about the event.
Event sources can be customized to provide the information you need to analyze and remediate problems caused by the events. For example, linking the configuration that detected the event (dt.settings.schema_id and dt.settings.object_id) helps you to quickly adapt the threshold or baseline if such action is necessary.
Another example is adjusting the sensitivity of the anomaly detector that triggered the event by modifying the detector's configuration in the settings.
Since available event properties depend on the event's source, events that are not generated by anomaly detectors don't contain links to relevant event settings. If you want an event to link to a settings object, you can do so by attaching a dt.settings.object_id property to events ingested via API and/or extensions.
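As a minimal sketch, attaching the settings reference could look like the following Python helper, which only builds the event body and does not send it. The payload shape (a `title` plus a `properties` map), the helper name `build_event_payload`, and all values are assumptions for illustration; check the reference for your ingestion API or extension to confirm the actual body format and which property keys it accepts.

```python
import json


def build_event_payload(title, settings_object_id, settings_schema_id=None):
    """Build a hypothetical event body whose properties reference a settings object."""
    properties = {"dt.settings.object_id": settings_object_id}
    if settings_schema_id is not None:
        properties["dt.settings.schema_id"] = settings_schema_id
    return {"title": title, "properties": properties}


payload = build_event_payload(
    "Queue depth above threshold",  # hypothetical event title
    "example-object-id",            # placeholder, not a real settings object ID
    "example.schema.id",            # placeholder, not a real schema ID
)
print(json.dumps(payload, indent=2))
```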
Problems displays all event properties for each collected event in a table and provides intent links, such as direct navigation to an anomaly detector's configuration, as shown below.
Examples of powerful event properties include:
event.description: The event description supports Markdown-formatted text, enabling you to include links to resources that can help to remediate the problem.
dt.query: Allows you to rebuild the event's chart in a notebook or on a dashboard, or to copy the raw value of a property.
dt.entity.*: Allow you to navigate directly to the entities they reference.
dt.settings.object_id and dt.settings.schema_id: Reference the settings object and the settings schema.
To learn more about the semantics and syntax of event properties and how they can be used across Dynatrace, see Semantic Dictionary.
If integration gaps in your toolchain prevent you from using Dynatrace data effectively, you can export problem feed data in CSV format. You can later use this data in various tools, including spreadsheet programs, databases, and data analysis tools.
As illustrated below, you can export problem-related data from the problem feed table. You can also export it from Notebooks and Dashboards within all table visualizations.
You can export all loaded problems (up to a limit of 1000) or use the multi-select feature to choose specific problems. Additionally, the filter bar above the table allows you to filter through larger subsets of problems. The Select all checkbox helps you to export all problems in the filtered set of entries.
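Once exported, the data can be processed with any CSV-capable tool. The following Python sketch counts problems per category using only the standard library. The column names (`display_id`, `name`, `category`, `status`) and the rows are assumptions for illustration; use the headers that appear in your own export.

```python
import csv
import io
from collections import Counter

# Stand-in for an exported file; column names and rows are assumptions.
exported = io.StringIO(
    "display_id,name,category,status\n"
    "P-1,JavaScript error rate increase,Error,ACTIVE\n"
    "P-2,Response time degradation,Slowdown,CLOSED\n"
    "P-3,JavaScript error rate increase,Error,CLOSED\n"
)


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(exported))
by_category = count_by(rows, "category")
print(by_category.most_common())
```

To read a real export, replace the `io.StringIO` stand-in with `open("problems.csv", newline="")`, where the file name is again only a placeholder.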
Depending on your team's responsibility, you might want to focus your attention on Kubernetes clusters, cloud resources, and workloads of critical services. To minimize context switching, Dynatrace offers consistent root cause information across multiple apps. No matter where your investigation starts, you don't have to switch to Problems to see the root cause.
In the example below, the Kubernetes app displays information about a problem affecting a workload.
A Davis-analyzed problem highlights the root cause of an incident and shows all the incident-relevant log lines across multiple entities in the problem details.
To access the log lines that were collected during the incident, select the Logs tab. Additionally, you're able to see their log level across all entities affected by the problem, allowing you to save time on manual investigations and filtering logs of relevant entities separately.
The Logs tab also includes references to the affected entities and information about all related entities, such as parent hosts. To verify which entities are affected by the problem event, you can refer to all the event properties that start with the dt.entity. prefix.
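If you work with event properties outside the app, for example after copying them into a script, the prefix check can be sketched in Python as shown here. The helper name `affected_entities` and the sample property values are hypothetical; only the dt.entity. prefix comes from this guide.

```python
ENTITY_PREFIX = "dt.entity."


def affected_entities(event_properties):
    """Map the entity type (the key suffix after the prefix) to the entity ID."""
    return {
        key[len(ENTITY_PREFIX):]: value
        for key, value in event_properties.items()
        if key.startswith(ENTITY_PREFIX)
    }


# Hypothetical event properties with placeholder values.
sample = {
    "event.description": "Example description",
    "dt.entity.host": "HOST-EXAMPLE",
    "dt.entity.service": "SERVICE-EXAMPLE",
    "dt.settings.object_id": "example-object-id",
}
entities = affected_entities(sample)
print(entities)
```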
See how the Logs tab summarizes all problem-relevant logs in the image below.
The image below illustrates the further sorting of the log lines with the help of a DQL query.
Problems features a global problem indicator that shows the number of active problems within the environment and is always visible in the Dock. When the Dock is collapsed, a red dot is displayed next to the app icon instead of a number.
To personalize the indicator and the number of active problems it displays, select filters in Category and save the filter configuration by selecting the icon. The saved filter automatically applies to the global problem indicator, reducing the number of problems counted for the user, as shown below. Selecting the Default filter button restores the last saved configuration.
While a problem filter is active, the indicator number will only show active problems from your chosen categories. The indicator updates on a one-minute schedule, which means that after the filter is updated, it can take some time for the indicator to adapt.
You can also set up email notifications for filtered problems using your email address by selecting the icon, as shown below:
The email notification is your personal setting, so you can enable it without the need for configuration permissions or the risk of impacting other users within the same environment.
The email notification is directly triggered within OpenPipeline, meaning only simple filters can be applied. Workflows that query Problems through DQL can use the complete feature set of Grail queries, such as joining tables.
If you need to send out customized email messages or have more complex automation and integration needs, build a complete workflow that uses the problem trigger.
The Deployment perspective equips operations teams with deeper insight into the infrastructure and cloud resources impacted by large-scale incidents. The root cause analysis feature automatically collects and visualizes affected deployments and related resources.
The additional context provided by related resources allows you to:
Deployment view uses a diagram similar to a Unified Modeling Language (UML) deployment diagram and follows a top-down approach, starting with the largest container element at the top and becoming more detailed as you drill down. The deployment structure is visualized as collapsible cards with horizontally overlapping elements, for example, services running in multiple regions. In this case, cards representing such services are duplicated and shown in multiple deployment stacks.
The deployment containing the root cause is automatically expanded and tagged with a red root cause badge, while all other deployments are collapsed by default. The deployment hierarchy shows a maximum of 5 levels, counted upwards from the leaf nodes at the bottom of the diagram, as seen in the example below:
Interactivity is a crucial feature of the deployment view. Select any element to see findings on the right side, such as events related to the problem, along with a direct link to the selected entity. This structured approach allows you and your operations team to reduce the time needed to respond to incidents by navigating a familiar visual representation.
Not all incident-related elements show information on the right. Some elements, like the cloud region, are displayed for better context but don't necessarily have problem-relevant events.
Problems streamlines triage, analysis, and remediation of active incidents, which reduces MTTR. It allows you to focus on AI-detected problems and quickly navigate to their root cause.