Dynatrace Managed supports high availability deployments across single or multiple data centers (DCs). Premium High Availability (PHA) is a self-contained solution for two DCs that provides minimal downtime and lets monitoring continue without data loss in failover scenarios.
PHA reduces compute and storage costs by eliminating separate standby disaster recovery hosts and the infrastructure to store and transfer backup data.
Before you plan a PHA deployment, review the following requirements and limitations:
Regional fault tolerance means that the Managed Cluster keeps operating even if all Managed Cluster nodes in one location fail. To achieve it, distribute Managed Cluster nodes across separate physical locations using one of the following options:
The replication factor of three ensures that each location has all the metric and event data.
Build your Managed Cluster with additional capacity and possible node failure in mind. Managed Clusters that operate at 100 percent of their processing capacity have no headroom to compensate for a lost node, so they can drop data during a node failure.
Deployments planned for node failure should have a processing capacity one-third higher than their typical utilization.
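The one-third guideline can be expressed as a small calculation. The following is a minimal sketch, not a Dynatrace sizing tool: the function names and the load figures are hypothetical, and the load is in whatever unit you use to measure processing capacity.

```python
from fractions import Fraction

# Headroom taken from the guidance above: one-third above typical utilization.
HEADROOM = Fraction(1, 3)


def planned_capacity(typical_utilization):
    """Return the processing capacity to provision for a given typical load."""
    return typical_utilization * (1 + HEADROOM)


def meets_headroom_rule(provisioned_capacity, typical_utilization):
    """Check whether provisioned capacity covers typical load plus headroom."""
    return provisioned_capacity >= planned_capacity(typical_utilization)


if __name__ == "__main__":
    # Hypothetical load figures in arbitrary capacity units.
    print(planned_capacity(900))           # 1200
    print(meets_headroom_rule(1000, 900))  # False
```

Using `Fraction` keeps the one-third factor exact, so the comparison doesn't depend on floating-point rounding.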
For capacity planning in a PHA deployment, treat the nodes in the additional DC as redundant rather than as expanded capacity. The additional DC holds a copy of all Cassandra and Elasticsearch data from the initial DC.
If a node fails, NGINX automatically redirects all OneAgent traffic to the remaining working nodes. You only need to replace the failed node.
Dynatrace Managed distributes the entire configuration of the Managed Cluster and its environments across all nodes. The replicated data includes all events, user sessions, and metrics. The Managed Cluster maintains two copies of this data, so Dynatrace Managed can continue to operate after the loss of one node.
The loss of two or more nodes might affect Managed Cluster performance and availability. The impact depends on data distribution and the consistency level required for the data.
Log Monitoring event data replicates in the Elasticsearch store to achieve high availability and optimize storage cost. As a result, if a node goes down, another node has a copy. However, the failure of two nodes makes some log events unavailable. If the failed nodes come back online, the data becomes available again. Otherwise, Dynatrace Managed loses the data.
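To see why one failed node is harmless but two can hide some log events, consider the following sketch. It assumes each piece of log data has two copies on different nodes, as the paragraph above implies. The shard names, node names, and placement are hypothetical and don't reflect how Elasticsearch actually assigns replicas in a Managed Cluster.

```python
def unavailable_shards(placement, failed_nodes):
    """Return shards whose every replica sits on a failed node."""
    failed = set(failed_nodes)
    return sorted(
        shard for shard, nodes in placement.items()
        if set(nodes) <= failed
    )


# Hypothetical placement: each log shard has two copies on distinct nodes.
PLACEMENT = {
    "shard-a": ("node-1", "node-2"),
    "shard-b": ("node-2", "node-3"),
    "shard-c": ("node-1", "node-3"),
}

if __name__ == "__main__":
    print(unavailable_shards(PLACEMENT, ["node-1"]))            # []
    print(unavailable_shards(PLACEMENT, ["node-1", "node-2"]))  # ['shard-a']
```

With a single failure, every shard still has one reachable copy. With two failures, only the shards whose two copies both lived on the failed nodes become unavailable.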
Dynatrace Managed doesn't replicate raw transaction data, such as call stacks, database statements, and code-level visibility, across nodes. Dynatrace Managed evenly distributes this data across all nodes.
As a result, when a node fails, Dynatrace Managed can estimate the missing data. The estimate is possible because raw transaction data is typically short-lived. Dynatrace Managed also collects a high volume of raw data, so each node still has a large enough data set even if another node isn't available for some time.
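The following sketch illustrates why evenly distributed data supports estimation. It is not Dynatrace's estimation logic, which this page doesn't describe. It assumes a hypothetical cluster in which each node holds an equal share of trace durations, and it scales up what the surviving nodes hold by the fraction of nodes that are missing.

```python
def estimate_totals(samples_by_node, failed_node):
    """Scale up what surviving nodes hold to approximate cluster-wide totals."""
    surviving = {n: s for n, s in samples_by_node.items() if n != failed_node}
    observed = [d for samples in surviving.values() for d in samples]
    if not observed:
        raise ValueError("no surviving data to estimate from")
    # With an even distribution, each node holds about 1/N of the data.
    scale = len(samples_by_node) / len(surviving)
    return {
        "estimated_count": round(len(observed) * scale),
        "mean_duration_ms": sum(observed) / len(observed),
    }


# Hypothetical trace durations (ms), spread evenly across three nodes.
SAMPLES = {
    "node-1": [1, 4, 7, 10],
    "node-2": [2, 5, 8, 11],
    "node-3": [3, 6, 9, 12],
}

if __name__ == "__main__":
    print(estimate_totals(SAMPLES, "node-3"))
    # {'estimated_count': 12, 'mean_duration_ms': 6.0}
```

In this example the estimated count matches the true total of 12, and the estimated mean of 6.0 ms is close to the true mean of 6.5 ms, because the surviving nodes hold a representative slice of the data.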
Using virtual racks, PHA stores three copies of all configuration data, metrics, and user sessions in each DC. Three-copy storage provides optimal performance and reliability in failover scenarios.
PHA distributes raw transaction data, such as distributed traces, call stacks, and database statements, across all DCs. Cross-DC distribution makes a statistically representative data set available in each DC.
PHA synchronizes data asynchronously between DCs. Asynchronous synchronization eliminates the 10 ms latency requirement that applies to Managed Clusters that span DCs. Data synchronization minimizes bandwidth consumption between DCs and prevents data loss during a DC outage.
During outages of up to 72 hours, PHA automatically re-synchronizes data across DCs.
PHA uses network zones to route telemetry to Managed Cluster nodes in different DCs. Network zones let OneAgents and ActiveGates prefer local routes while retaining the ability to fail over to another DC during an outage. For setup guidance, go to Replicate nodes across data centers for PHA.
PHA handles DC outages of up to 72 hours automatically and uses Mission Control to designate the surviving part of the Managed Cluster during recovery. For failover behavior, go to Multi-data center failover. For recovery after longer outages, go to Data center disaster recovery from data center.
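The 72-hour boundary can be captured in a simple check, for example in a runbook script. This is a minimal sketch: the function name and return labels are hypothetical, and the only value taken from this page is the 72-hour window.

```python
from datetime import datetime, timedelta

# Window taken from the text above: outages up to 72 hours re-synchronize automatically.
AUTO_RESYNC_WINDOW = timedelta(hours=72)


def recovery_path(outage_start, outage_end):
    """Classify an outage by whether it fits in the automatic re-sync window."""
    duration = outage_end - outage_start
    if duration <= AUTO_RESYNC_WINDOW:
        return "automatic-resync"
    return "disaster-recovery"


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 0, 0)
    print(recovery_path(start, start + timedelta(hours=48)))  # automatic-resync
    print(recovery_path(start, start + timedelta(hours=96)))  # disaster-recovery
```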
You can migrate a Managed Cluster from a single DC, or a Managed Cluster that already spans DCs, to a dual-DC PHA deployment. For migration steps, go to Replicate nodes across data centers for PHA.
To prevent configuration, metrics, and logs data loss, deploy each node on a separate host. Deploy nodes on hardware with the same characteristics, especially disk, processor, and memory. Identical hardware minimizes performance degradation when some nodes are unavailable.
A hardware failure affects only the data on the failed machine. The failure doesn't affect metrics data or configuration because all nodes replicate them. Dynatrace Managed loses only the data stored on that node, such as distributed traces and session replays. However, other nodes still contain enough similar data for estimates.
Matching hardware and balanced workloads reduce performance degradation when nodes are unavailable.