Premium HA for multi-data centers

Published Apr 20, 2020

Dynatrace Premium High Availability (Premium HA) is a self-contained, out-of-the-box solution that provides near-zero downtime and allows monitoring to continue without data loss in failover scenarios. It saves compute and storage costs by eliminating the need for separate standby disaster recovery hosts and the associated infrastructure to store and transfer backup data.

The additional nodes in the peered data center (DC) increase the computing capacity available to the cluster, but the increase is non-linear. For capacity planning, treat the nodes in the additional DC as redundant rather than as expanded capacity, because the additional DC holds a copy of all the Cassandra and Elasticsearch data from the initial DC.

Important notes

Premium HA requires a Premium High Availability license in Dynatrace classic licensing.
A Premium HA cluster requires a connection to Dynatrace Mission Control and is therefore only available for Managed online clusters.
The minimum number of nodes required for a Premium HA cluster is 6 (3 nodes per DC).
The maximum number of nodes supported for a Premium HA cluster is 30 (15 nodes per DC).
Migrating a cluster to Premium HA is a non-reversible operation.
Both DCs within a Premium HA cluster should be symmetrically sized.
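
The sizing rules above can be checked before a migration is planned. The following is a minimal sketch, not a Dynatrace tool: the function name is hypothetical, and it assumes that "symmetrically sized" can be approximated by an equal node count in both DCs (hardware sizing is not checked).

```python
from typing import List

MIN_NODES_PER_DC = 3   # from the notes above: 6 nodes minimum, 3 per DC
MAX_NODES_PER_DC = 15  # from the notes above: 30 nodes maximum, 15 per DC


def check_premium_ha_sizing(nodes_dc1: int, nodes_dc2: int) -> List[str]:
    """Return a list of sizing problems; an empty list means no problem was found.

    Hypothetical helper for illustration only. Symmetry is approximated
    by comparing node counts, which is an assumption of this sketch.
    """
    problems = []
    for name, count in (("DC 1", nodes_dc1), ("DC 2", nodes_dc2)):
        if count < MIN_NODES_PER_DC:
            problems.append(
                f"{name} has {count} nodes; at least {MIN_NODES_PER_DC} are required"
            )
        if count > MAX_NODES_PER_DC:
            problems.append(
                f"{name} has {count} nodes; at most {MAX_NODES_PER_DC} are supported"
            )
    if nodes_dc1 != nodes_dc2:
        problems.append("the two DCs do not have the same number of nodes")
    return problems
```

For example, `check_premium_ha_sizing(3, 3)` returns an empty list, while `check_premium_ha_sizing(2, 3)` reports both the undersized DC and the asymmetry.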

How to remedy segmented clusters

If one part of a cluster loses its connection to another part, this doesn't necessarily mean that the other part is unavailable. The problem may simply be a connectivity issue, so you need to determine which part of the cluster will act as the surviving one. Short network disconnections between data centers (up to 3 hours) are repaired automatically. For longer outages, to avoid data inconsistency, we recommend shutting down the server service on all nodes in the affected data center. You can start the services again once network connectivity is stable.

To handle the situation when one part of the cluster is unavailable, Dynatrace Mission Control tracks the health of all nodes and automatically designates one part of the cluster as primary (surviving). During the recovery, this designation is used to determine how to re-sync all parts of the cluster.
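
The three-hour guidance in this section can be written as a simple decision rule, for example in an operations runbook script. This is a sketch only: the function name and return strings are hypothetical, and treating exactly three hours as still covered by automatic repair is an assumption.

```python
from datetime import timedelta

# Disconnections up to this length are repaired automatically (see above).
AUTO_REPAIR_LIMIT = timedelta(hours=3)


def recommended_action(disconnect_duration: timedelta) -> str:
    """Map the length of a cross-DC disconnection to the documented guidance.

    Hypothetical helper; the inclusive boundary at exactly 3 hours is an
    assumption of this sketch.
    """
    if disconnect_duration <= AUTO_REPAIR_LIMIT:
        return "wait: the disconnection is expected to be repaired automatically"
    return (
        "stop: shut down the server service on all nodes in the affected DC "
        "and start it again once connectivity is stable"
    )
```

The rule only encodes the duration threshold. Which part of the cluster is treated as the surviving one is still determined by Mission Control, as described above.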

Data sharding and replication

Using virtual racks, Dynatrace High Availability stores three copies of all configuration data, metrics, and user sessions in each DC. This provides optimal performance and reliability in failover scenarios.

Raw transaction data (such as distributed traces, call stacks, and database statements) is distributed randomly across all DCs so that a statistically representative data set is always available on each DC.
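
To illustrate why random distribution leaves a representative data set in each DC, the following sketch places simulated traces at random and compares the share of one service per DC. It is a statistical illustration only and makes no claim about how Dynatrace implements the distribution; the trace representation (a service name per trace), the function names, and the DC names are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def distribute_randomly(
    traces: Sequence[str], dc_names: Sequence[str], rng: random.Random
) -> Dict[str, List[str]]:
    """Assign every trace to one DC chosen uniformly at random."""
    placement: Dict[str, List[str]] = {dc: [] for dc in dc_names}
    for trace in traces:
        placement[rng.choice(dc_names)].append(trace)
    return placement


def service_share(traces: Sequence[str], service: str) -> float:
    """Fraction of the given traces that belong to the named service."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if t == service) / len(traces)


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated workload: 20% of traces come from "checkout", 80% from "search".
    traces = ["checkout"] * 2000 + ["search"] * 8000
    placement = distribute_randomly(traces, ["dc-1", "dc-2"], rng)
    for dc, stored in placement.items():
        print(dc, len(stored), round(service_share(stored, "checkout"), 3))
```

With a sufficiently large number of traces, the share of each service in either DC stays close to its share in the full data set, which is the property the paragraph above relies on.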

Data is synchronized asynchronously between DCs. This removes the 10 ms latency requirement that otherwise applies to cross-DC clusters. Data synchronization is engineered to minimize bandwidth consumption between DCs and to prevent data loss in case of a DC outage.

During outages of less than three hours, Premium HA will automatically and transparently re-synchronize the data across DCs.

Telemetry data routing

You can use network zones to control the flow of telemetry data to the cluster nodes in the various DCs. While Premium HA implements various optimizations to reduce cross-DC traffic, we recommend, for the sake of data redundancy, that you allow ActiveGates to send data to both DCs. OneAgents and ActiveGates can be configured to prefer certain network zones while preserving their ability to fail over to another part of the cluster in case of a DC outage. Load balancers can be used for this purpose as well.

For active-passive deployments of applications, we recommend that you not disable ActiveGates in the passive portions of the deployment. This keeps all parts of the Dynatrace infrastructure in play in case of a disaster recovery scenario and enables failover without reconfiguration or rediscovery.

Technical details

Premium HA requires an OS that supports cgroups version 1.0 and systemd version 219 or later (for example, RHEL/CentOS 7+).
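
A host can be checked against these prerequisites with a short script. The sketch below is not a Dynatrace utility and its function names are hypothetical. It assumes that the first line of `systemctl --version` output starts with `systemd <version>` and that cgroups version 1 hierarchies appear in `/proc/mounts` with the filesystem type `cgroup` (version 2 uses `cgroup2`).

```python
import re
from typing import Optional

MIN_SYSTEMD_VERSION = 219  # from the requirement above


def parse_systemd_version(version_output: str) -> Optional[int]:
    """Extract the version number from `systemctl --version` output."""
    match = re.match(r"systemd\s+(\d+)", version_output.strip())
    return int(match.group(1)) if match else None


def has_cgroup_v1(proc_mounts: str) -> bool:
    """Return True if any mount entry has the filesystem type 'cgroup' (v1)."""
    for line in proc_mounts.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] == "cgroup":
            return True
    return False


def meets_os_requirements(version_output: str, proc_mounts: str) -> bool:
    """Combine both checks into a single yes/no answer."""
    version = parse_systemd_version(version_output)
    return (
        version is not None
        and version >= MIN_SYSTEMD_VERSION
        and has_cgroup_v1(proc_mounts)
    )
```

On a real host, the inputs could be obtained with `subprocess.run(["systemctl", "--version"], capture_output=True, text=True).stdout` and `open("/proc/mounts").read()`. Keeping the parsing separate from the data collection makes the checks easy to test with captured output.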

The nodes continue to communicate with each other over the usual ports as described earlier. The ports that need to be open between nodes in a single DC are the same ports that need to be open within the cluster when it spans two DCs.

The connections between nodes in different DCs need to be encrypted. Dynatrace does not create or install the required certificates, so you’ll need to do that manually. Round-trip network latency of up to 100 ms is supported. Bandwidth consumption depends on a variety of factors. For more information, please contact a Dynatrace product specialist.
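
The 100 ms round-trip limit can be checked roughly from one node against a node in the other DC. The sketch below uses the time needed to open a TCP connection as an approximation of one network round trip. That is an assumption: the measurement also includes name resolution and local overhead. The function names are hypothetical, and comparing the median of several samples against the limit is a choice made for this sketch, not a documented procedure.

```python
import socket
import statistics
import time
from typing import Sequence

MAX_RTT_MS = 100.0  # supported round-trip latency between DCs (see above)


def tcp_connect_time_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Measure how long it takes to open a TCP connection, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the connection is closed immediately; only the handshake is timed
    return (time.perf_counter() - start) * 1000.0


def within_latency_budget(
    samples_ms: Sequence[float], limit_ms: float = MAX_RTT_MS
) -> bool:
    """Return True if the median of the samples does not exceed the limit."""
    return statistics.median(samples_ms) <= limit_ms
```

A check would collect several samples, for example `[tcp_connect_time_ms(host, port) for _ in range(10)]`, against a port that is open between the DCs, and pass them to `within_latency_budget`. The host and port to probe depend on your environment and are not specified here.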

It is possible to migrate a single-DC cluster (or a DC-agnostic cross-DC cluster) to a dual-DC Premium HA cluster. For more information, please contact a Dynatrace product specialist.