Triage, investigate, and remediate incidents in the Problems app

Quickly triaging, investigating, and remediating incoming incidents is the core challenge for operations teams. The Problems app supports them by automatically analyzing complex incidents, collecting all the context, and presenting the root cause and impact in one consistent view.

The Problems app, backed by data from Grail and Davis® AI analysis, helps operations and site reliability teams reduce the mean time to repair (MTTR) by presenting every aspect of an incident in one place.

Aim and context

This guide shows you how to use the Problems app to triage detected problems and investigate their root cause and impact.

Target audience

This guide is written for:

  • Operations engineers
  • Pipeline engineers
  • Systems engineers
  • Site reliability engineers (SREs)
  • Build automation engineers

Prerequisites

You need to install Problems from Dynatrace Hub.

  1. In Dynatrace Hub, select Problems.
  2. Select Install.

Investigate and remediate active problems

Set focus and triage

By default, the Problems app shows:

  • A feed of all problems in the last 2 hours. To help operations teams spot open problems regardless of which filter is set, open problems remain on top of the feed no matter how long they have been open.
  • A problem chart at the top that visualizes past peaks in the number of problems. Select a peak on the chart to drill into it and investigate further.

Problems app - problem feed view

Filtering

To focus on your domain and triage problems that affect it, set filters. The two most common filters, Status and Category, have selectable settings to the left of the table for quick access. To set other filters, use the filter bar above the table.

  • Status: Can be Active or Closed.
    • If this is not set, all problems (active or closed) are listed.
    • If you select a status in the controls on the left, the corresponding filter is also displayed in the filter bar.
  • Category: Indicates the nature of the incident, such as slowdowns, errors, resource-related issues, or availability incidents.
    • If you select one or more categories in the controls on the left, the corresponding filters are also displayed in the filter bar.

Filtering with the filter bar allows you to focus your feed on problems based on multiple criteria, such as status, number of affected entities, root cause entity, and more. Place your cursor in the input field to see all the available options. Filtering criteria are combined with AND logic. For each criterion, Davis provides a list of suggested values, based on your problem feed.

For example, to see problems that were raised due to an increase in JavaScript errors and that have persisted for longer than 1 hour, use the following filter criteria:

  • Status=ACTIVE
  • Duration>1h
  • Category=Error
  • Name=JavaScript error rate increase
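
The AND combination of these criteria can be sketched in a few lines of Python. This is only an illustration of the logic, not how the Problems app is implemented: the `Problem` record, its field names, and the sample feed are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Problem:
    # Hypothetical record shape, used only for this illustration.
    status: str
    duration: timedelta
    category: str
    name: str


def matches_all(problem, criteria):
    """Return True only if every criterion holds (AND logic)."""
    return all(check(problem) for check in criteria)


# One predicate per filter criterion from the example above.
criteria = [
    lambda p: p.status == "ACTIVE",
    lambda p: p.duration > timedelta(hours=1),
    lambda p: p.category == "Error",
    lambda p: p.name == "JavaScript error rate increase",
]

# Made-up sample feed.
feed = [
    Problem("ACTIVE", timedelta(hours=3), "Error", "JavaScript error rate increase"),
    Problem("ACTIVE", timedelta(minutes=20), "Error", "JavaScript error rate increase"),
    Problem("CLOSED", timedelta(hours=5), "Slowdown", "Response time degradation"),
]

matching = [p for p in feed if matches_all(p, criteria)]
print(len(matching))
```

Only the first sample problem satisfies all four criteria. The second fails the duration check, and the third fails on status, category, and name.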

Activate auto refresh

To make sure you always catch incoming problems, use the refresh settings in the upper-right corner of the Problems app.

  • To automatically refresh the problem feed, open the refresh-rate selector and choose a refresh rate (or select Off to turn off automatic refresh).
  • To manually refresh the problem feed at any time, regardless of the automatic refresh setting, select the refresh button.

Investigate and compare problems

To see the details of a problem

  1. In the table, select the problem ID in the ID column.
  2. Review the details page.

The problem details page provides all available details about the problem and highlights the root cause entity with a red mark to guide your attention to the right things. The example below shows details of a problem with user action degradation, including the root cause entity (easyTravelBusiness service) and a chart of the abnormal response time of that service.

Problems app - problem details view

All entities affected by the problem are listed in the Affected entities section, along with information about the entity type and the number of events detected during the analysis.

  • As a suggestion for the starting point of the investigation, Davis marks the entity that it determined to be the root cause of the problem.
  • To review details about an affected entity, select it in the table.

Compare multiple problems

If you still have multiple problems to investigate after applying all the filters, you can select several problems and compare their details.

  1. In the table, use the checkboxes to select two or more problems.

  2. Select Show details.

    This preloads the details of all selected problems and adds controls to the upper-right corner of the problem details page so you can quickly switch between each selected problem.

Read event properties for additional information

Dynatrace receives events from multiple event sources, such as OneAgent, Synthetic, extensions, and ingestion APIs. Dynatrace accepts and understands various properties (also referred to as fields) of those events that provide additional information about the event.

Event sources can be customized to provide the information you need to analyze and remediate problems caused by the events. For example, linking the configuration that detected the event (dt.settings.schema_id and dt.settings.object_id) helps you to quickly adapt the threshold or baseline if such action is necessary.

Another example is adjusting the sensitivity of the anomaly detector that triggered the event by modifying the detector's configuration in the settings.

Because the available event properties depend on the event's source, events that are not generated by anomaly detectors don't contain links to relevant event settings. If you want an event to link to a settings object, attach a dt.settings.object_id property to events ingested via API or extensions.

The Problems app displays all event properties for each collected event in a table and provides intent links, such as direct navigation to an anomaly detector's configuration, as shown below.

Problems app offering a direct link to settings object

Examples of powerful event properties include:

  • Event description (event.description). The event description supports Markdown-formatted text, enabling you to include links to resources that can help to remediate the problem.
  • DQL query (dt.query) allows you to rebuild the event's chart in a notebook or on a dashboard, or to copy the raw value of the property.
  • Related entities (dt.entity.*) allow you to navigate directly to the entities that these properties reference.
  • Link to a settings object (dt.settings.object_id) and settings schema (dt.settings.schema_id).

To learn more about the semantics and syntax of event properties and how they can be used across Dynatrace, see Semantic Dictionary.
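
As a rough sketch of how the properties listed above could be attached to an event before it is ingested via API, the Python snippet below only builds and prints a JSON body. The top-level keys (`eventType`, `title`, `properties`) and the `CUSTOM_ALERT` value are assumptions about the ingest payload and should be checked against the API reference. The title, runbook URL, schema ID, and object ID are hypothetical placeholders.

```python
import json


def build_event_payload(title, description, schema_id, object_id):
    """Build a JSON-serializable event body that links to a settings object."""
    return {
        # Assumed top-level keys; verify against the ingestion API reference.
        "eventType": "CUSTOM_ALERT",
        "title": title,
        "properties": {
            # Markdown is supported in the description, so links can be included.
            "event.description": description,
            # Together, these two properties point at the settings object.
            "dt.settings.schema_id": schema_id,
            "dt.settings.object_id": object_id,
        },
    }


payload = build_event_payload(
    title="Queue depth above threshold",  # hypothetical
    description="See the [runbook](https://example.com/runbook) for remediation steps.",
    schema_id="hypothetical.schema.id",  # placeholder, not a real schema
    object_id="hypothetical-object-id",  # placeholder, not a real object
)

print(json.dumps(payload, indent=2))
```

Keeping the link and description inside `properties` means they travel with the event and appear alongside the other event properties.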

Analyze problems with your own tools by exporting CSV

If integration gaps in your tooling prevent you from using Dynatrace data effectively, you can export problem feed data in CSV format. You can then use this data in various tools, including spreadsheet programs, databases, and data analysis tools.

As illustrated below, you can export problem-related data from the problem feed table. You can also export it from any table visualization in Notebooks and Dashboards.

Export problems as CSV file

You can export all loaded problems (up to a limit of 1000) or use the multi-select feature to choose specific problems. Additionally, the filter bar above the table allows you to filter through larger subsets of problems. The Select all checkbox helps you to export all problems in the filtered set of entries.
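
Once exported, the CSV can be processed with standard tooling. The following minimal Python sketch counts problems per category. The column names (`ID`, `Name`, `Status`, `Category`) and the rows are hypothetical stand-ins, so adjust them to the header of your actual export.

```python
import csv
import io
from collections import Counter

# Stand-in for an exported file; column names and rows are hypothetical.
# For a real export, replace this with open("problems.csv", newline="").
exported = io.StringIO(
    "ID,Name,Status,Category\n"
    "P-1,JavaScript error rate increase,ACTIVE,Error\n"
    "P-2,Response time degradation,CLOSED,Slowdown\n"
    "P-3,Failure rate increase,ACTIVE,Error\n"
)


def count_by(rows, column):
    """Count how often each value of the given column occurs."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(exported))
by_category = count_by(rows, "Category")

for category, count in by_category.most_common():
    print(f"{category}: {count}")
```

The same `count_by` helper works for any other column in the export, such as the status column.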

Check the root cause without leaving your context

Depending on your team's responsibility, you might want to focus your attention on Kubernetes clusters, cloud resources, and workloads of critical services. To minimize context switching, Dynatrace offers consistent root cause information across multiple apps. No matter where your investigation starts, you don't have to switch to the Problems app to see the root cause.

In the example below, the Kubernetes app displays information about a problem affecting a workload.

Problem information integrated into the Kubernetes app

Investigate all problem-relevant logs

A Davis-analyzed problem highlights the root cause of an incident and shows all the incident-relevant log lines across multiple entities in the problem details.

To access the log lines that were collected during the incident, select the Logs tab. The tab also shows the log level of these lines across all entities affected by the problem, which saves you from investigating and filtering the logs of each relevant entity separately.

The Logs tab also includes references to the affected entities and information about all related entities, such as parent hosts. To verify which entities are affected by the problem event, you can refer to all the event properties that start with the dt.entity. prefix.
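
If you work with event properties outside the app, for example in exported data, the prefix check described above is straightforward to script. The sketch below assumes the properties are available as a flat Python dictionary. The sample event and its entity IDs are made up.

```python
def entity_properties(event_properties):
    """Return only the properties whose key starts with the dt.entity. prefix."""
    prefix = "dt.entity."
    return {
        key: value
        for key, value in event_properties.items()
        if key.startswith(prefix)
    }


# Hypothetical event properties; the IDs are placeholders.
event = {
    "event.name": "JavaScript error rate increase",
    "event.status": "ACTIVE",
    "dt.entity.host": "HOST-0000000000000001",
    "dt.entity.service": "SERVICE-0000000000000002",
}

for key, value in sorted(entity_properties(event).items()):
    print(f"{key} = {value}")
```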

The image below shows how the Logs tab summarizes all problem-relevant logs.

Davis problems app log count

The image below shows how the log lines can be sorted further with a DQL query.

Davis problems app log lines

Summary

The Problems app streamlines triage, analysis, and remediation of active incidents, which helps reduce the MTTR. It allows you to focus on AI-detected problems and quickly navigate to their root cause.

  • Data stored in Grail and queried with DQL makes it possible to slice and dice all problem-related information across large numbers of problems and events.
  • Integration with context-specific Dynatrace apps allows you to analyze problems without switching context.