Latest Dynatrace
Business-critical services require thorough validation before being deployed to production, as potential faults can negatively impact overall productivity. Dynatrace tools, such as the Site Reliability Guardian and Workflows, can help you optimize your delivery process.
A product manager informs you that they wish to release a new feature. Your company utilizes flags to enable new features. This new feature is currently behind a flag.
To release the new feature, you'll do the following:
The goal of this tutorial is to teach you how to automatically detect negative impacts on your application after a new release.
In this tutorial, you'll learn how to
Alternatively, follow our tutorial. This lab has a GitHub Codespaces configuration that allows you to fully automate this use case.
Access to your Dynatrace tenant
Access to your GitHub account, a GitHub Repository, and a GitHub Personal Access Token (PAT)
A Dynatrace Platform API token
There are two sets of steps. The setup steps are extensive and include creating a new SRG, creating a workflow, and baselining. You only need to carry out these steps once. To automate the release validation, you'll repeat the second set of steps periodically. It includes running your guardian, enabling a new feature, and running a load test on it.
Before you can run this use case, you need to complete several setup steps:
After you send the event to Dynatrace, Grail will store all the related data, and your guardian will be able to query the results, validate them against your objectives, and save the result in Grail. You'll need three objectives instead of four. These are latency, traffic, and errors.
To create a new guardian with the three objectives
For a more detailed explanation, see Create Site Reliability Guardian - GitHub tutorial.
You can automatically create Site Reliability Guardians at scale using Dynatrace Configuration as Code.
You can trigger your Site Reliability Guardian automatically using a workflow.
To create a workflow for this guardian
Go to Three golden signals guardian.
You can automate from the All guardians page by hovering over the Three golden signals guardian or opening it. Select Automate.
This step creates a new workflow called Three golden signals.
In the Event trigger pane, select events.
In the Filter query box, copy and paste the following filter query and replace "your service name" with the name of your service:
event.type == "CUSTOM_INFO" and dt.entity.service.name == "your service name"
Select Query past events.
Go to the run validation task. This task will start your Site Reliability Guardian validation.
In the From field, replace event.timeframe.from with now-{{ event()['duration'] }}.
In the To field, replace event.timeframe.to with now.
Select Save next to the name of the workflow.
Now, you have a new workflow connected to the guardian of the same name, which is triggered whenever Dynatrace receives the right event.
For a more detailed explanation, see Automate the Site Reliability Guardian - GitHub tutorial.
For more information on automating the trigger for a guardian, see Site Reliability Guardian.
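To make the trigger more concrete, the following Python sketch builds the kind of CUSTOM_INFO event the workflow filter matches and shows how the From expression would resolve for it. It is only an illustration under stated assumptions: the payload shape follows the Dynatrace Events API v2 ingest format, while the title, the entity selector, and the "5m" duration value are placeholders that don't come from this tutorial. The HTTP call itself is left out.

```python
import json


def build_test_finished_event(service_name: str, duration: str) -> dict:
    """Build a CUSTOM_INFO event body (assumed Events API v2 ingest shape)."""
    return {
        "eventType": "CUSTOM_INFO",
        "title": "test finished",  # placeholder title, not prescribed by the tutorial
        # Assumed entity selector; adjust it to match your own service.
        "entitySelector": f'type(SERVICE),entityName.equals("{service_name}")',
        # The workflow reads this property through event()['duration'].
        "properties": {"duration": duration},
    }


def resolve_from_field(event_properties: dict) -> str:
    """Mimic how now-{{ event()['duration'] }} resolves for a given event."""
    return f"now-{event_properties['duration']}"


if __name__ == "__main__":
    event = build_test_finished_event("your service name", "5m")
    print(json.dumps(event, indent=2))
    print(resolve_from_field(event["properties"]))
```

With a duration of "5m", the guardian would validate the timeframe from now-5m to now, which is the window the load test just covered.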
Objectives set to Auto-adaptive thresholds in the Site Reliability Guardian require five runs to enable the baseline. In a real-life scenario, these test runs will likely run over hours, days, or weeks, providing Dynatrace ample time to gather sufficient usage data. To enable the baseline, you'll trigger five load tests one after another. After baselining, you can view the completed training runs by selecting Workflows and Executions. You should see five successful workflow executions.
You could use this DQL query to see the Site Reliability Guardian results in a notebook:
fetch bizevents
| filter event.provider == "dynatrace.site.reliability.guardian"
| filter event.type == "guardian.validation.finished"
| fieldsKeep guardian.id, validation.id, timestamp, guardian.name, validation.status, validation.summary, validation.from, validation.to
You could also view the SRG Status in the Site Reliability Guardian by opening the Three golden signals guardian. You should see five completed runs there as well.
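If you export the query results, a short script can confirm that enough runs exist for the baseline. The Python sketch below assumes each record is a dictionary keyed by the field names kept in the query above. The sample records and the "pass" status value are made up for illustration.

```python
from collections import Counter

REQUIRED_BASELINE_RUNS = 5  # the tutorial states that five runs enable the baseline


def count_validations(records, guardian_name):
    """Count finished validations per status for one guardian."""
    return Counter(
        record.get("validation.status")
        for record in records
        if record.get("guardian.name") == guardian_name
    )


def baseline_ready(records, guardian_name, required=REQUIRED_BASELINE_RUNS):
    """Return True once the guardian has at least `required` finished runs."""
    return sum(count_validations(records, guardian_name).values()) >= required


if __name__ == "__main__":
    # Hypothetical records shaped like the DQL result rows.
    sample = [
        {"guardian.name": "Three golden signals", "validation.status": "pass"}
        for _ in range(5)
    ]
    print(count_validations(sample, "Three golden signals"))
    print(baseline_ready(sample, "Three golden signals"))
```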
To automate the release validation, you need to complete a couple of steps. First, run the production SRG. Then, enable the new feature and run the load test.
Remember, the workflow is currently configured to listen for test finished events, but you could easily create additional workflows with different triggers, such as an On demand or Cron schedule trigger.
Different triggers allow you to continuously test your service, for example, in production.
After you train the guardian, run it by triggering a load test. In the Validation history panel of the Three golden signals guardian, select Refresh. Your Heatmap will show some errors.
You can add objectives that are informational-only and don't contribute to the pass/fail decision of the Site Reliability Guardian. They are useful for new services where you're trying to get an idea of the real-world values of your metrics.
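The effect of an informational-only objective can be pictured with a deliberately simplified model. The Python sketch below is not how the Site Reliability Guardian is implemented. The ObjectiveResult type, the two status values, and the extra objective name are assumptions introduced only to show that informational results are reported but ignored when the overall decision is made.

```python
from dataclasses import dataclass


@dataclass
class ObjectiveResult:
    name: str
    status: str  # "pass" or "fail" in this simplified model
    informational: bool = False


def overall_status(results):
    """Return "fail" if any non-informational objective failed, else "pass"."""
    counted = [r for r in results if not r.informational]
    return "fail" if any(r.status == "fail" for r in counted) else "pass"


if __name__ == "__main__":
    results = [
        ObjectiveResult("latency", "pass"),
        ObjectiveResult("traffic", "pass"),
        ObjectiveResult("errors", "pass"),
        # Hypothetical extra objective that is tracked for information only.
        ObjectiveResult("new metric", "fail", informational=True),
    ]
    print(overall_status(results))
```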
To set an objective as informational-only
Your team is ready to release their new feature. You need to enable the feature and run the load test in your development environment. The new feature is behind a flag and you need to change it.
To enable a new feature and run the load test
In your system, update the flag. This flag change notifies Dynatrace with a CUSTOM_INFO event that includes the new value.
In your system, change your service's flag value from off to on.
In your system, run the acceptance load test to see if the new feature has caused a regression.
Send an event to the affected service each time a flag changes. The flag changes your service's behavior, which your other services depend on.
Go to Dynatrace > Services, go to your service, and check the changes. Notice that the configuration for defaultValue changed to on.
In your system, wait for all the jobs to be completed.
Go to Dynatrace > Site Reliability Guardian and refresh your guardian's Validation history and Heatmap. Notice that the guardian has failed due to the high error rate.
In Dynatrace, go to Services, go to your service. You can see the increase in failure rate.
Scroll down until you see the Distributed traces list. Notice the failed requests.
Open one of the failed requests. The exception message and stack trace show something similar to the following:
exception.message PaymentService Fail Feature Flag Enabled
exception.stacktrace Error: PaymentService Fail Feature Flag Enabled
    at module.exports.charge (/usr/src/app/charge.js:21:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.chargeServiceHandler [as charge] (/usr/src/app/index.js:21:22)
exception.type Error
Because of the high error rate, roll back the flag change. In your system, inform Dynatrace that a configuration change is coming. You'll reset your flag to off.
In your system, change the flag value of your service from on to off.
In your system, apply the changes.
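The enable, test, validate, and roll back sequence above can be summarized as a small control loop. The Python sketch below is a hypothetical outline rather than part of the tutorial's tooling: the three callables stand in for whatever your system uses to change the flag, run the load test, and read the guardian result, and the "fail" status string is an assumed value.

```python
from typing import Callable


def validate_release(
    set_flag: Callable[[str], None],
    run_load_test: Callable[[], None],
    get_validation_status: Callable[[], str],
) -> str:
    """Enable the flag, run the load test, and roll back if validation fails."""
    set_flag("on")
    run_load_test()
    status = get_validation_status()
    if status == "fail":
        # Roll back: reset the flag to its previous value.
        set_flag("off")
    return status


if __name__ == "__main__":
    flag_changes = []
    result = validate_release(
        set_flag=flag_changes.append,         # records each flag value instead of calling a real system
        run_load_test=lambda: None,           # stand-in for the acceptance load test
        get_validation_status=lambda: "fail", # simulates the failed guardian run
    )
    print(result, flag_changes)
```

Passing the steps in as callables keeps the sketch independent of any particular flag system or load-testing tool.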
Looking back at the initial brief, it was your job to:
So, how did things turn out?
The techniques described here work with any metric from any source.
You're encouraged to use metrics from other devices and sources, including business-related metrics such as revenue.