Resolve team dependencies with Log Management and Analytics
Context
Your company operates considerable infrastructure to provide customers with a SaaS offering. You use Dynatrace to monitor both the production and pre-production parts of this infrastructure. Dynatrace monitors thousands of hosts and ingests around 10 TB of logs per day.
These logs contain valuable information about errors that occurred on the systems and may need to be addressed by the various development teams. However, the volume makes manual analysis impossible, and the large number of teams adds the complexity of bringing the right log lines to the right teams.
Use case scenario
In this use case, you create a Log Analysis Dashboard that takes care of identifying bugs from logs, as well as grouping, triaging, and distributing them to a bug tracker.
Request
Which team needs to fix which issue? With many teams, distributed systems, interconnected components, and large volumes of log data, this is a tricky question to answer. To clarify ambiguous responsibilities and interdependencies, representatives of these teams meet for a daily bug meeting. There, they review newly discovered problems and ensure that the ownership of each item is clearly defined.
- Each development team wants to know how the code they're responsible for behaved in the production and pre-production stages and whether there are any issues they need to fix.
- Teams want to detect bugs as early as possible, before they become visible to customers, because even with extensive QA processes, some errors are only discovered through real-life usage on infrastructure as close as possible to production.
Goal
The goal is to be able to collect logs grouped by their message, allow teams to easily react to bugs and problems, and integrate a bug ticket creation workflow.
- Create an internal tool (Log Analysis Dashboard) that takes care of identifying bugs from logs, grouping, triaging, and distributing them to your bug tracker.
- Based on preliminary triaging and vetting, issues are created in the correct projects and assigned to the right teams.
- For each individual issue that is found, the time required to collect and document all required information is reduced.
- Automatic grouping of similar log lines avoids duplicates. Once a bug is detected and reported to the ticketing system, this grouping also ensures that new instances don't need to be reviewed again.
Mechanism
You plan to use a DQL query in Logs on Grail to identify log messages at or above a certain severity level, group similar messages together to avoid duplicates, and, based on tags or group metadata, identify the teams that are likely the correct owners of each bug. Additionally, events from related process groups and hosts should be fetched to add context to the logs.
All teams have the same view of newly discovered bugs and can claim ownership in a self-service fashion. For all bugs, responsibilities and ownership can be clarified in a daily sync of less than half an hour. Without this common view, discussions would be dispersed and could easily take hours.
The solution is created using the following building blocks:
- Logs on Grail: To ingest and process log data.
- DQL: To filter log records that apply to your teams.
- AppEngine: To build a custom application for data collected by Dynatrace.
- EdgeConnect: To connect to an on-premises bug tracker.
1. Ingest log data
Log records are ingested into Grail.
- Log ingestion via OneAgent: Automatically discovers and ingests log and event data from a vast array of technologies. It supports multiple timestamp configurations, masking of sensitive data, detection of log rotation patterns, and much more.
- Generic log ingestion: Stream log records to Dynatrace and have them transformed into meaningful log messages (see the sketch below this list).
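As an illustration of the generic route, a log record can be pushed with a plain HTTP call to the log ingestion endpoint. The following is a minimal TypeScript sketch; the environment URL, API token, and attribute values are placeholders, and the token would need the log ingestion scope.

// Minimal sketch: send one log record to the generic log ingestion API.
// Environment URL, API token, and attribute values are placeholders for illustration.

const DT_ENVIRONMENT = "https://abc12345.live.dynatrace.com"; // placeholder environment URL
const DT_API_TOKEN = "dt0c01.placeholder";                    // placeholder token with log ingestion scope

async function ingestLogRecord(content: string, severity: string): Promise<void> {
  const response = await fetch(`${DT_ENVIRONMENT}/api/v2/logs/ingest`, {
    method: "POST",
    headers: {
      Authorization: `Api-Token ${DT_API_TOKEN}`,
      "Content-Type": "application/json; charset=utf-8",
    },
    // The payload is a JSON array of log events; arbitrary attributes can be added.
    body: JSON.stringify([
      {
        content,                          // the raw log line
        severity,                         // e.g. "error" or "warn"
        "log.source": "billing-service",  // example attribute
        timestamp: new Date().toISOString(),
      },
    ]),
  });
  if (!response.ok) {
    throw new Error(`Log ingestion failed with status ${response.status}`);
  }
}

// Example: await ingestLogRecord("Payment failed for order 4711", "error");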
2. Filter log content
Events are filtered by:
- Severity (critical and high)
- Timestamp (newest events)
DQL examples for:
- OneAgent logs:
fetch logs, timeframe:"2023-10-04T00:00:00Z/2023-10-04T04:00:00Z"
| filter startsWith(log.source, "/var/log/dynatrace/oneagent/java/")
| filter loglevel == "WARN" OR loglevel == "SEVERE"
- Process groups, where the process groups to be analyzed are already known:
fetch logs, timeframe:"2023-10-01T09:03:18Z/2023-10-02T09:03:18Z",
  scanLimitGBytes: -1
| filter dt.entity.process_group == "PROCESS_GROUP-9C77F43CE98EAA17" OR
  dt.entity.process_group == "PROCESS_GROUP-a3444fa7ad1c3d41" OR
  dt.entity.process_group == "PROCESS_GROUP-e50850ceefbd9c7f" OR
  dt.entity.process_group == "PROCESS_GROUP-689574f4e137425c" OR
  dt.entity.process_group == "PROCESS_GROUP-37925f9c37724091"
| filter ((loglevel == "WARN" OR loglevel == "WARNING" OR loglevel == "SEVERE" OR loglevel == "ERROR" OR loglevel == "CRITICAL" OR loglevel == "ALERT"))
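The second example hard-codes process group IDs and severity levels. A small helper like the following sketch could assemble the same DQL for any team's process groups; the function name and structure are illustrative, and the IDs come from the example above.

// Builds the step-2 DQL statement for an arbitrary set of process groups.
// Function name and structure are illustrative; the query mirrors the example above.

function buildBugCandidateQuery(processGroupIds: string[], timeframe: string): string {
  const processGroupFilter = processGroupIds
    .map((id) => `dt.entity.process_group == "${id}"`)
    .join(" OR\n  ");
  const severityFilter = ["WARN", "WARNING", "SEVERE", "ERROR", "CRITICAL", "ALERT"]
    .map((level) => `loglevel == "${level}"`)
    .join(" OR ");
  return [
    `fetch logs, timeframe:"${timeframe}", scanLimitGBytes: -1`,
    `| filter ${processGroupFilter}`,
    `| filter (${severityFilter})`,
  ].join("\n");
}

// Example: reproduces the query above for one team's process groups.
const query = buildBugCandidateQuery(
  ["PROCESS_GROUP-9C77F43CE98EAA17", "PROCESS_GROUP-a3444fa7ad1c3d41"],
  "2023-10-01T09:03:18Z/2023-10-02T09:03:18Z",
);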
3. Group log records
You can group the logs based on their message content in a simple backend service, running in your infrastructure and reachable through EdgeConnect by any Dynatrace app in your tenant(s).
- Save the metadata, such as timestamp, thread name, class name, and log level.
- Process the log message.
For the log message processing, split the message into tokens and categorize each one as a mandatory word (conventionally prefixed with ##), a word, a separator, or a non-relevant word (dynamic log parts such as IDs or timestamps). The grouping message of a log is then built as the class name (if present), plus the first four most relevant words, plus all remaining mandatory words (see the sketch after this list).
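A minimal TypeScript sketch of this grouping key construction follows. The token categories and the ## prefix come from the description above; the concrete regular expressions for separators and non-relevant tokens are assumptions made for illustration.

// Sketch of the grouping key construction described above.
// The regular expressions used for classification are assumptions.

type TokenKind = "mandatory" | "word" | "separator" | "irrelevant";

interface LogRecord {
  timestamp: string;
  threadName?: string;
  className?: string;
  loglevel: string;
  message: string;
}

// Classifies a single token of the log message.
function classify(token: string): TokenKind {
  if (token.startsWith("##")) return "mandatory";          // convention from the text
  if (/^[^A-Za-z0-9#]+$/.test(token)) return "separator";  // punctuation-only tokens
  if (/\d/.test(token)) return "irrelevant";               // IDs, timestamps, and other dynamic parts
  return "word";
}

// Builds the grouping message: class name (if present) + first four words + all mandatory words.
function groupingKey(record: LogRecord): string {
  const tokens = record.message.split(/\s+/).filter((t) => t.length > 0);
  const mandatory: string[] = [];
  const words: string[] = [];
  for (const token of tokens) {
    const kind = classify(token);
    if (kind === "mandatory") mandatory.push(token.replace(/^##/, ""));
    else if (kind === "word") words.push(token);
  }
  return [record.className, ...words.slice(0, 4), ...mandatory]
    .filter((part): part is string => Boolean(part))
    .join(" ");
}

// Example: log lines that differ only in the order ID produce the same grouping key.
const key = groupingKey({
  timestamp: "2023-10-04T01:23:45Z",
  className: "PaymentService",
  loglevel: "ERROR",
  message: "Failed to process ##payment for order 4711",
});
// key === "PaymentService Failed to process for payment"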
To summarize, the log records are grouped by:
- Tags
- Group metadata
- Related process group events
4. Create dashboard application
Create a Log Analysis dashboard application that consumes the grouped log data.
Dynatrace applications are self-contained and focus on specific use cases. The Dynatrace app isn't an isolated application; it interacts with the capabilities of the Dynatrace platform via APIs, or with publicly available third-party systems. Dynatrace applications can also interact with your on-premises systems via EdgeConnect, which you can run in your corporate network.
The log dashboard application will analyze error and warning logs that indicate bugs in the code or system configurations that need to be enhanced (for example, low CPU or low disk space). Its overview will show the grouped logs, for which tickets and comments can be created automatically or manually.
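To make the last step concrete, ticket creation from the app's backend could look roughly like the following sketch. The tracker host, team-to-project mapping, and Jira-style issue endpoint are assumptions, and the request would be routed through EdgeConnect only if the internal tracker host is covered by an EdgeConnect configuration.

// Sketch: create a bug ticket for a grouped log entry from the app's backend.
// Host, project mapping, and payload follow a Jira-style tracker and are assumptions;
// authentication is omitted for brevity.

interface BugGroup {
  groupingKey: string;   // grouping message built in step 3
  occurrences: number;
  owningTeam: string;    // derived from tags / group metadata
  sampleMessage: string;
}

async function createTicket(bug: BugGroup): Promise<string> {
  // Assumed to be reachable via EdgeConnect from the Dynatrace app's backend.
  const response = await fetch("https://bugtracker.internal.example.com/rest/api/2/issue", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      fields: {
        project: { key: bug.owningTeam },  // hypothetical team-to-project mapping
        issuetype: { name: "Bug" },
        summary: `[Log Analysis] ${bug.groupingKey}`,
        description: `${bug.occurrences} occurrences detected.\nSample message:\n${bug.sampleMessage}`,
      },
    }),
  });
  if (!response.ok) {
    throw new Error(`Ticket creation failed with status ${response.status}`);
  }
  const issue = await response.json();
  return issue.key; // e.g. "TEAM-1234" in a Jira-like tracker
}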