Triggering an alert on every log record where the status is marked as “ERROR” can create thousands of alerts, often lacking context. This flood of alerts not only overwhelms monitoring teams but also makes it challenging to discern critical issues.
To address this challenge, Dynatrace integrates Log Events and Davis® AI. With Log Events and Davis® AI, you can reduce the number of alerts and put logs in the context of detected problems and topology.
This guide describes how to combine the capabilities of Log Events and Davis® AI to optimize your log monitoring and troubleshooting workflows. This integration simplifies log monitoring and helps you gain deeper insight into the root causes of issues.
This article is intended for entry-level site reliability engineers (SREs) who are responsible for maintaining the health of the digital ecosystem by proactively monitoring, diagnosing, and resolving issues that may impact the system's operation.
As the SRE responsible for maintaining the reliability and availability of the infrastructure, you are tasked with implementing a proactive monitoring and anomaly detection system for log events. This system aims to identify potential issues and trigger timely alerts for prompt resolution. To achieve this, you need to set up automated processes to monitor log events and detect anomalies, covering both the identification of specific error patterns and the detection of irregularities in log metrics.
By integrating log event monitoring with anomaly detection capabilities, you can ensure efficient infrastructure monitoring, enabling your team to maintain peak performance and mitigate any emerging issues. See the methods you can choose from to set up a monitoring procedure.
Make sure all of these are true before you start:
You have the storage:logs:read permission. For instructions, see Assign permissions in Grail.
To access permissions, go to the Settings menu in the upper-right corner of the Workflows app and select Authorization settings.
To fulfill the presented scenarios, you can choose either to open new problems based on log events, or detect anomalies based on log metrics.
Use log events when you are certain that a single log record should open a new Problem.
Use events based on log metrics when:
Your log records have a response_time attribute, and you want to open a problem when its value is above average.
Learn more by accessing the Metric events page.
Choose one of the following methods to fulfill the presented scenarios.
The team wants to open a new Problem when a specific log record is ingested. This process has the following steps:
Learn more by accessing the Set log alert page.
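To make the single-record case concrete, the following is a minimal sketch in plain Python, not Dynatrace configuration or API code. The LogRecord type, its status and content fields, and the should_open_problem helper are hypothetical names chosen for illustration. It shows why matching on a specific pattern, rather than on every "ERROR" record, keeps the number of opened problems small.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    # Hypothetical, simplified shape of a log record for this sketch.
    status: str
    content: str


def should_open_problem(record: LogRecord, phrase: str) -> bool:
    """Return True only for ERROR records whose content contains the phrase."""
    return record.status == "ERROR" and phrase in record.content


records = [
    LogRecord("ERROR", "Timeout while calling payment gateway"),
    LogRecord("ERROR", "Database connection pool exhausted"),
    LogRecord("INFO", "Database connection pool exhausted warning cleared"),
]

# Only the one record that is both an ERROR and matches the phrase qualifies.
matches = [r for r in records if should_open_problem(r, "connection pool exhausted")]
print(len(matches), "of", len(records), "records would open a problem")
```

Here two records have the ERROR status, but only one matches the specific phrase, so only one would open a problem.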
The team wants to open a new Problem when an anomaly is detected in the log metric. This process has the following steps:
Learn more by accessing the Create anomaly detection metric page.
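The metric-based case can be sketched in a similar way. The example below is a simplified stand-in written in plain Python: it flags a response_time value that lies well above the average of recent values. The is_anomalous function, the sigmas parameter, and the sample values are assumptions made for illustration, and the sketch does not reproduce how Davis® AI actually computes its baselines.

```python
from statistics import mean, pstdev


def is_anomalous(history: list[float], latest: float, sigmas: float = 3.0) -> bool:
    """Flag `latest` when it exceeds the historical mean by `sigmas` standard deviations.

    Returns False when there is no history to compare against.
    """
    if not history:
        return False
    baseline = mean(history)
    spread = pstdev(history)  # population standard deviation of the history
    return latest > baseline + sigmas * spread


# Hypothetical response_time values (in milliseconds) extracted from log records.
history = [120, 130, 125, 118, 127]

print(is_anomalous(history, 135))  # slightly above average, within normal spread
print(is_anomalous(history, 400))  # far above average
```

With these sample values, a reading of 135 stays within the normal spread, while 400 is flagged. A single alert on the aggregated metric replaces an alert on every slow record.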
Whichever method you choose, you can reduce the number of alerts generated from log records. Either method simplifies your monitoring procedures, improves operational efficiency, and helps you respond faster to important issues in your system.