Latest Dynatrace
Early Adopter
Time is crucial when dealing with security incidents. This page shows how you can use Dynatrace to speed up your incident response in two phases: intrusion notification automation and instant queries.
This page is intended for Security teams analyzing security incidents, such as the Incident Response team.
In the following, we address a scenario in which identifying an attack, researching the scope, determining the responsible entity owners, and remediating the attack takes hours, sometimes even days.
The team wants to quickly
Efficiency: The team should be able to respond much faster to attacks.
Flexibility: The team should have more flexibility in their response to security incidents.
Combining the Dynatrace automation capabilities with insights into security-related data, our solution helps security teams react and respond faster to attacks. The team automatically scans all ingested logs for patterns that might indicate possible attacks. Because each attack is different, they make use of schemaless queries with instant responses to quickly identify the scope of an attack, thus reducing the required time from days to minutes.
Logs from your Dynatrace-monitored environment are ingested into Grail via log ingestion. When an attack is detected, a Dynatrace problem is created.
A workflow is automatically triggered by this type of problem. The workflow collects, processes, and enriches the data with context, and converts the resulting information into notifications on your desired channels.
For an example of how you can set up an attack notification automation, see Intrusion notification automation.
Based on the information received, you can immediately respond to discoveries and perform further investigations by running a sequence of DQL queries in Notebooks tailored to the attack type.
For details, see Instant queries.
Dynatrace version 1.283+
Set up log ingestion (ingests security incidents into Grail).
Set up ownership teams (allows the workflow to assign security incidents based on ownership of the affected entity).
Set up Jira for Workflows (allows the workflow to convert resulting findings into Jira tickets).
Set up Slack Connector (allows the workflow to send resulting findings to Slack channels).
While the current scenario uses Slack and Jira as notification channels, other integrations are also available. For details, see Workflows integrations.
Basic knowledge of how to
Make sure the following permissions are enabled.
Grail: storage:logs:read. For instructions, see Assign permissions in Grail.
Workflows: Permissions to access, view, write, and execute workflows. For details, see Authorization.
To access permissions, go to the Settings menu in the upper-right corner of the Workflows app and select Authorization settings.
The following example illustrates how you can implement an attack notification automation using Workflows. You can customize the workflow according to your needs.
Set the trigger
Determine ownership
Set up notification variables
Collect data and enrich with context
Extract successful requests
Replay against target
Send notifications
The automation needs to be triggered whenever an attack occurs.
In the Select trigger section, select and configure the Davis problem trigger. For details, see Create workflows in Dynatrace Workflows: Trigger.
Route notifications to the team responsible for the affected entities.
Select the Get owners action to create and configure this task. For details, see Ownership app: get_owners.
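As an illustration of the routing idea, the following minimal sketch collects the Slack channels of the owning teams and falls back to a default channel when none is declared. The shape of the owner objects (`name`, `slackChannels`) and the function name are hypothetical; check the actual output of the Get owners action in your environment before adapting it.

```javascript
// Collect the distinct Slack channels declared by a list of owning teams.
// Assumed (hypothetical) input shape: [{ name: "team-a", slackChannels: ["#team-a"] }, ...]
function slackChannelsForOwners(owners, fallbackChannel) {
  const channels = new Set();
  for (const owner of owners || []) {
    for (const channel of owner.slackChannels || []) {
      channels.add(channel);
    }
  }
  // Route to the fallback channel when no owner declares a channel.
  return channels.size > 0 ? [...channels] : [fallbackChannel];
}
```

Using a set keeps a channel from being notified twice when several owning teams share it.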
Configure variables, such as default values for Slack and Jira fields, that will be used in later steps of the notification process.
Select the Run JavaScript action to create and configure these tasks. For details, see Introduction to workflows: Action.
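A minimal sketch of what such a variables task might compute, written as plain JavaScript so that it can be tried outside Workflows. The channel name, Jira project key, issue type, and field names are hypothetical placeholders, and the wiring into a Run JavaScript action is not shown.

```javascript
// Hypothetical defaults; replace them with values that match your Slack and Jira setup.
const DEFAULTS = {
  slackChannel: "#security-incidents", // assumed fallback channel
  jiraProject: "SEC",                  // assumed Jira project key
  jiraIssueType: "Task",               // assumed issue type
};

// Merge per-incident overrides over the defaults. Empty values are ignored
// so that a missing override falls back to the default.
function buildNotificationVariables(overrides = {}) {
  const cleaned = Object.fromEntries(
    Object.entries(overrides).filter(
      ([, value]) => value !== undefined && value !== null && value !== ""
    )
  );
  return { ...DEFAULTS, ...cleaned };
}
```

Later tasks can then read a single object instead of repeating the same literals in every notification step.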
Query third-party services. For example:
Select the HTTP Request action to create and configure these tasks. For details, see Introduction to workflows: Action.
Query Dynatrace. For example:
Determine whether there were any successful logins from the attacker's IP address.
Find out additional traffic information from the attacker's IP address.
Select the Execute DQL Query action to create and configure these tasks. For details, see Introduction to workflows: Action.
Extract the successful requests from the total requests collected.
Select the Run JavaScript action to create and configure this task. For details, see Introduction to workflows: Action.
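The extraction step could look roughly like the following sketch. It assumes that each record is a flat object keyed by the field names used in the DQL queries on this page (for example `http.status_code`), and it treats every 2xx status as successful, which is broader than the exact `"200"` match used in the query examples. The function name is illustrative.

```javascript
// Keep only the records whose HTTP status code is in the 2xx range.
// The status code may arrive as a string ("200") or a number (200).
function extractSuccessfulRequests(records) {
  return records.filter((record) => {
    const status = Number(record["http.status_code"]);
    return Number.isInteger(status) && status >= 200 && status < 300;
  });
}
```

Records without a status code are dropped, because a missing value converts to a non-integer and fails the check.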
Replay the successful requests against the target entity to look for indicators of compromise. Custom code steps allow you to automate complex logic that you want to run for each detected attack. Depending on the detected attack and the affected systems, you might want to replay the attacks for more detailed analysis.
Select the Run JavaScript action to create and configure this task. For details, see Introduction to workflows: Action.
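One cautious way to prepare a replay is to first turn the logged requests into a de-duplicated plan and only then send them. The sketch below covers the planning part only; sending the requests is left out. It assumes records keyed by `http.method` and `http.target`, defaults a missing method to GET, and replays only GET and HEAD so that state-changing requests are not repeated. The function name and these choices are assumptions, not part of the Dynatrace workflow.

```javascript
// Turn logged requests into replay descriptors for a target base URL.
function buildReplayPlan(records, targetBaseUrl) {
  const base = new URL(targetBaseUrl);
  const replayable = new Set(["GET", "HEAD"]); // conservative assumption
  const seen = new Set();
  const plan = [];
  for (const record of records) {
    const method = String(record["http.method"] || "GET").toUpperCase();
    const target = record["http.target"];
    if (!replayable.has(method) || typeof target !== "string") continue;
    let url;
    try {
      // Resolve the logged path against the target base URL.
      url = new URL(target, base);
    } catch (error) {
      continue; // skip targets that cannot be parsed
    }
    // Skip logged targets that would resolve to a different host.
    if (url.origin !== base.origin) continue;
    const key = `${method} ${url}`;
    if (seen.has(key)) continue;
    seen.add(key);
    plan.push({ method, url: url.toString() });
  }
  return plan;
}
```

The origin check matters because a logged target such as `//other-host/path` would otherwise be resolved against a host you did not intend to contact.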
Notify the responsible team on Slack.
Select the Send message action to create and configure this task. For details, see Use Workflows with Slack.
Create a Jira ticket for the entity owner containing the collected information.
Select the Create issue action to create and configure this task. For details, see Create Jira issues with workflows.
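Both notifications can share one formatting step. The following sketch builds the Slack message text and the Jira summary and description from the collected findings. The shape of the `findings` object and all field names are hypothetical; adapt them to the results of your own workflow tasks.

```javascript
// Build the text for a Slack message and a Jira issue from collected findings.
// Assumed (hypothetical) input: { attackerIp, entityName, totalRequests, successfulRequests }
function formatNotification(findings) {
  const { attackerIp, entityName, totalRequests, successfulRequests } = findings;
  const summary = `Possible attack on ${entityName} from ${attackerIp}`;
  const lines = [
    summary,
    `Total requests: ${totalRequests}`,
    `Successful requests: ${successfulRequests.length}`,
    // List at most five targets to keep the message short.
    ...successfulRequests.slice(0, 5).map((r) => `- ${r["http.target"]}`),
  ];
  return {
    slackText: lines.join("\n"),
    jiraSummary: summary,
    jiraDescription: lines.slice(1).join("\n"),
  };
}
```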
After receiving the notification, the security team can immediately respond to discoveries and run additional DQL queries in Notebooks without knowing in advance where the relevant information is stored. In an emergency, this is crucial, because a fast response can help contain the attack.
The following are some examples of how you can query Grail in case of a web attack.
Query example:
fetch logs, scanLimitGBytes:-1
| filter log.source == "/var/log/nginx/access.log"
| filter net.peer.ip == "<<IP Address>>"
| sort timestamp desc
Query example:
fetch logs, scanLimitGBytes:-1
| filter log.source == "/var/log/nginx/access.log"
| filter net.peer.ip == "<<IP Address>>"
| filter http.status_code == "200"
| sort timestamp desc
Query example:
fetch logs, scanLimitGBytes:-1
| filter log.source == "/var/log/nginx/access.log"
| filter net.peer.ip == "<<IP Address>>"
| summarize requests=count(), by:{http.status_code, http.user_agent}
| sort http.status_code
Query example:
fetch logs, scanLimitGBytes:-1
| filter log.source == "/var/log/nginx/access.log"
// Successful requests only
| filter http.status_code == "200"
| filter net.peer.ip == "<<IP Address>>"
| summarize requests=count(), by:{http.target}
| sort requests desc
Query example:
fetch logs, scanLimitGBytes:-1
| filter log.source == "/var/log/nginx/access.log"
| fields timestamp, net.peer.ip, http.target, http.status_code, http.response_content_length, http.user_agent
| filter http.status_code == "200"
// Filter for a specific IP address
| filter net.peer.ip == "<<IP Address>>"
| sort toLong(http.response_content_length) desc
Query example:
fetch logs, scanLimitGBytes:-1
| filter log.source == "/var/log/nginx/access.log"
| fields payload = "<<PAYLOAD>>", timestamp, net.peer.ip, http.method, http.target, http.status_code, http.request.header.referrer, http.response_content_length, http.user_agent, content
| filter contains(content, payload)
Query example:
fetch logs, scanLimitGBytes: -1
// Search for logins
| filter log.source == "/var/log/sso.log"
// Search for successful logins from a given IP address
| filter contains(content, "user login successful") and contains(content, "<<IP Address>>")
| sort timestamp desc
Query example:
fetch logs, scanLimitGBytes: -1
// Search for logins
| filter log.source == "/var/log/sso.log"
// Search for logins from a given account id
| filter contains(content, "<<Account ID>>") and contains(content, "tenant: ")
| sort timestamp desc
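When a script fills the IP address placeholder in these queries, it is worth validating the value before inserting it into the query text. The following sketch builds the first query example on this page for a given address. The helper names are hypothetical, and the validation covers IPv4 only as a simplifying assumption.

```javascript
// Accept only dotted-quad IPv4 addresses with each octet in the range 0..255.
function isIPv4(value) {
  const parts = String(value).split(".");
  return (
    parts.length === 4 &&
    parts.every((part) => /^\d{1,3}$/.test(part) && Number(part) <= 255)
  );
}

// Build the "all requests from an IP address" query as multi-line DQL text.
function buildRequestsByIpQuery(ip, logSource = "/var/log/nginx/access.log") {
  if (!isIPv4(ip)) {
    throw new Error(`Not a valid IPv4 address: ${ip}`);
  }
  // Reject characters that could break out of the quoted string literal.
  if (/["\\\r\n]/.test(logSource)) {
    throw new Error("Log source contains unsupported characters");
  }
  return [
    "fetch logs, scanLimitGBytes:-1",
    `| filter log.source == "${logSource}"`,
    `| filter net.peer.ip == "${ip}"`,
    "| sort timestamp desc",
  ].join("\n");
}
```

Rejecting malformed input up front keeps a value taken from log data or a ticket from changing the meaning of the query.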
You can use the above instructions as building blocks to automate common steps in your incident response process. These can help you respond faster to security incidents, thus reducing their impact.