Automate release validation

Latest Dynatrace

Business-critical services require thorough validation before being deployed to production, as potential faults can negatively impact overall productivity. Dynatrace tools, such as the Site Reliability Guardian and Workflows, can help you optimize your delivery process.

Scenario

A product manager informs you that they wish to release a new feature. Your company uses feature flags to enable new features, and this new feature is currently behind a flag.

To release the new feature, you'll do the following:

  1. Enable a flag in your development environment.
  2. Assess the impact of the change on your application.
  3. Observe the negative impact, gather the evidence of this negative impact, and turn off the flag.
  4. Decide if the feature can be released or not.
  5. Provide feedback on why you made that decision.

What will you learn

The goal of this tutorial is to teach you how to automatically detect negative impacts on your application after a new release.

In this tutorial, you'll learn how to:

  • Create a Site Reliability Guardian (SRG) to test and ensure the health of your microservices.
  • Create a workflow to trigger your guardian automatically.
  • Baseline the site reliability guardian to suggest and adjust thresholds based on current and past performance.
  • Detect which feature release is breaking your application.

Alternatively, follow our GitHub tutorial. The lab has a GitHub Codespaces configuration that lets you fully automate this use case.

Before you begin

Prerequisites

  • Access to your Dynatrace tenant

  • Access to your GitHub account, a GitHub Repository, and a GitHub Personal Access Token (PAT)

  • A Dynatrace Platform API token

Steps

There are two sets of steps. The setup steps are extensive: they include creating a new SRG, creating a workflow, and baselining the guardian. You only need to carry out these steps once. The release validation steps are the ones you'll repeat periodically: running your guardian, enabling a new feature, and running a load test against it.

Setup steps

Before you can run this use case, you need to complete several setup steps:

  1. Create a Site Reliability Guardian (SRG).
  2. Create a workflow to trigger the SRG automatically.
  3. Baseline your guardian.

Create a Site Reliability Guardian

After you send the event to Dynatrace, Grail stores all the related data, and your guardian can query the results, validate them against your objectives, and save the outcome back to Grail. Of the four golden signals, you'll need only three objectives: latency, traffic, and errors.

To create a new guardian with the three objectives

  1. Create a Site Reliability Guardian.
  2. Select a service.
  3. Select the Guardian name and change it to Three golden signals.
  4. Select the Saturation objective, and delete it.
  5. Select Save.

For a more detailed explanation, see Create Site Reliability Guardian - GitHub tutorial.

You can automatically create a Site Reliability Guardian to scale using Dynatrace Configuration as Code.

Create a workflow to trigger the Site Reliability Guardian automatically

You can trigger your Site Reliability Guardian automatically using a workflow.

To create a workflow for this guardian

  1. Go to the Three golden signals guardian.

  2. Select Automate. You can do this from the All guardians page by hovering over the Three golden signals guardian, or from within the guardian itself.

    This step creates a new workflow called Three golden signals.

  3. In the Event trigger pane, select events.

  4. In the Filter query box, copy and paste the following filter query, replacing "your service name" with the name of your service:

    event.type == "CUSTOM_INFO" and
    dt.entity.service.name == "your service name"
  5. Select Query past events.

  6. Go to the run validation task. This task will start your Site Reliability Guardian validation.

  7. In the From field, replace event.timeframe.from with now-{{ event()['duration'] }}.

  8. In the To field, replace event.timeframe.to with now.

  9. Select Save next to the name of the workflow.

Now, you have a new workflow connected to the guardian of the same name, which is triggered whenever Dynatrace receives the right event.

For a more detailed explanation, see Automate the Site Reliability Guardian - GitHub tutorial.

For more information on automating the trigger for a guardian, see Site Reliability Guardian.
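The workflow above fires on a CUSTOM_INFO event for your service. The following is a minimal sketch of sending such an event through the Dynatrace Events API v2 (`/api/v2/events/ingest`); the tenant URL, token, service name, and the `duration` property name are placeholders you'd adapt to your own setup:

```python
import json
import urllib.request


def build_test_finished_event(service_name: str, duration: str) -> dict:
    """Build a CUSTOM_INFO event payload matching the workflow's filter query."""
    return {
        "eventType": "CUSTOM_INFO",
        "title": f"Load test finished for {service_name}",
        # entitySelector resolves the service the guardian validates
        "entitySelector": f'type(SERVICE),entityName.equals("{service_name}")',
        "properties": {
            # consumed by the workflow's run validation task as {{ event()['duration'] }}
            "duration": duration,
        },
    }


def send_event(tenant_url: str, api_token: str, payload: dict) -> None:
    """POST the event to the Dynatrace Events API v2 (token needs the events.ingest scope)."""
    request = urllib.request.Request(
        f"{tenant_url}/api/v2/events/ingest",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(request)
```

With this in place, your load-test job can call `send_event` when a test finishes, which triggers the workflow and, in turn, the guardian validation.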

Baseline your guardian

Objectives set to Auto-adaptive thresholds in the Site Reliability Guardian require five runs to enable the baseline. In a real-life scenario, these test runs would likely run over hours, days, or weeks, giving Dynatrace ample time to gather sufficient usage data. To enable the baseline here, you'll trigger five load tests, one after another. After baselining, you can view the completed training runs by selecting Workflows and then Executions; you should see five successful workflow executions.
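The five baseline runs can be scripted back to back. Here is a minimal sketch, where `run_load_test` is a placeholder stub standing in for your real load generator (for example, a k6 or JMeter run):

```python
# Satisfy the five-run minimum for auto-adaptive baselining by
# triggering five load tests, one after another.
def run_load_test(run_number: int) -> str:
    # Placeholder: replace with a call to your load-testing tool, e.g.
    # subprocess.run(["k6", "run", "load-test.js"], check=True)
    return f"run {run_number} finished"


REQUIRED_BASELINE_RUNS = 5  # minimum runs to enable auto-adaptive thresholds

results = [run_load_test(n) for n in range(1, REQUIRED_BASELINE_RUNS + 1)]
```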

You could use this DQL query to see the Site Reliability Guardian results in a notebook:

fetch bizevents
| filter event.provider == "dynatrace.site.reliability.guardian"
| filter event.type == "guardian.validation.finished"
| fieldsKeep guardian.id, validation.id, timestamp, guardian.name, validation.status, validation.summary, validation.from, validation.to

You could also view the SRG Status in the Site Reliability Guardian by opening the Three golden signals guardian. You should see five completed runs there as well.
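The same guardian-results DQL can also be executed programmatically against Grail. The sketch below builds such a request; the endpoint path (`/platform/storage/query/v1/query:execute`) and bearer-token handling are assumptions to verify against your tenant's API reference:

```python
import json
import urllib.request

# The guardian-results query, trimmed to a few key fields
DQL = """fetch bizevents
| filter event.provider == "dynatrace.site.reliability.guardian"
| filter event.type == "guardian.validation.finished"
| fieldsKeep guardian.name, validation.status, validation.summary, timestamp"""


def build_query_request(tenant_url: str, platform_token: str) -> urllib.request.Request:
    """Build a POST request that runs the DQL query via the Grail query endpoint."""
    return urllib.request.Request(
        f"{tenant_url}/platform/storage/query/v1/query:execute",
        data=json.dumps({"query": DQL}).encode(),
        headers={
            "Authorization": f"Bearer {platform_token}",
            "Content-Type": "application/json",
        },
    )
```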

Automate the release validation steps

To automate the release validation, you need to complete two steps. First, run the production SRG. Then, enable a new feature and run the load test.

Run your trained guardian

Different triggers for production

Remember, the workflow is currently configured to listen for test finished events, but you could easily create additional workflows with different triggers, such as On demand or Cron schedule trigger.

Different triggers allow you to continuously test your service, for example, in production.

After you train the guardian, run it by triggering a load test. In the Validation history panel of the Three golden signals guardian, select Refresh. Your Heatmap will show some errors.

Informational-only objectives

It is possible to add an objective that is informational-only and doesn't contribute to the pass/fail decision of the Site Reliability Guardian. Such objectives are useful for new services where you're trying to get a sense of the real-world values of your metrics.

To set an objective as informational-only

  1. Select the objective to open the side panel.
  2. In the side panel, scroll down to the Define thresholds section.
  3. Select the No thresholds option.

Enable a new feature and run the load test

Your team is ready to release their new feature. You need to enable the feature and run the load test in your development environment. The new feature is behind a flag and you need to change it.

To enable a new feature and run the load test

  1. In your system, update the flag. This flag change triggers a notification for Dynatrace using a CUSTOM_INFO event of the change, including the new value.

  2. In your system, change your service's flag value from off to on.

  3. In your system, run the acceptance load test to see if the new feature has caused a regression.

  4. Send an event to the affected service each time a flag changes. The flag changes your service's behavior, which your other services depend on.

  5. In Dynatrace, go to Services, open your service, and check the changes. Notice that the configuration's defaultValue changed to on.

  6. In your system, wait for all the jobs to be completed.

  7. Go to Dynatrace > Site Reliability Guardian and refresh your guardian's Validation history, and Heatmap. Notice that the guardian has failed due to the high error rate.

  8. In Dynatrace, go to Services and open your service. You can see the increase in failure rate.

  9. Scroll down to the Distributed traces list. Notice the failed requests.

  10. Open one of the failed requests. The exception message and stack trace show something similar to the following:

    exception.message     PaymentService Fail Feature Flag Enabled
    exception.stacktrace  Error: PaymentService Fail Feature Flag Enabled
        at module.exports.charge (/usr/src/app/charge.js:21:11)
        at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
        at async Object.chargeServiceHandler [as charge] (/usr/src/app/index.js:21:22)
    exception.type        Error
  11. In your system, roll back the flag change because of the high error rate. Start by informing Dynatrace that a configuration change is coming; you'll reset your flag to off.

  12. In your system, change the flag value of your service from on to off.

  13. In your system, apply the changes.
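The decision logic in the steps above can be sketched as a small function: if the guardian's validation fails, the flag rolls back to off; otherwise the feature stays enabled. The status strings and flag values here are illustrative placeholders, not a Dynatrace API:

```python
def next_flag_value(validation_status: str, current_value: str) -> str:
    """Decide the flag value to apply after a guardian validation run."""
    if validation_status.upper() == "FAIL":
        return "off"  # roll back: the guardian detected a high error rate
    return current_value  # validation passed: keep the feature enabled
```

In a fully automated pipeline, this kind of check would sit between the guardian result and the flag system, so a failed validation reverts the release without manual intervention.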

Conclusion

Looking back at the initial brief, it was your job to:

  • Enable a flag in a development environment.
  • Judge the impact of that change on the application, if any.
  • If you observed an impact, gather the evidence and turn off the flag.
  • Decide if the feature can be released.
  • Provide feedback to the product managers on why you made the decision you did.

So, how did things turn out?

  • You have enabled a flag and sent contextual event information to Dynatrace.
  • You used OpenTelemetry and Dynatrace to make an evidence-based analysis of the new software quality.
  • You have automated the change analysis and noticed and remediated an impact.
  • You have protected users by automating this analysis in a development environment. You could repeat this setup in production, too, of course.
  • You have decided not to release based on evidence provided by OpenTelemetry and the Dynatrace Site Reliability Guardian.
  • You could provide the product manager with this evidence, including the stack trace and line of code, so that they can prioritize fixes.

The techniques described here work with any metric from any source.

You're encouraged to use metrics from other devices and sources, including business-related metrics such as revenue.