The Premium High Availability (PHA) multi-data center failover mechanism detects Elasticsearch or Cassandra node outages longer than 15 minutes and shorter than 72 hours. If Mission Control (MC) detects that two or more Elasticsearch or Cassandra nodes in a data center (DC) are down for 15 minutes, it automatically stops the server processes in that DC. MC then marks the DC as unhealthy.
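The detection rule above can be sketched as a small shell function. This is only an illustration of the documented thresholds, not Mission Control's actual logic. The function name `dc_health` and its input format are hypothetical, and the sketch assumes the rule is evaluated separately for each component (Elasticsearch or Cassandra) within one DC.

```shell
#!/usr/bin/env bash
# Illustrative model of the documented rule; not Dynatrace code.

DOWN_THRESHOLD_MIN=15   # from the text: nodes must be down for 15 minutes
MIN_DOWN_NODES=2        # from the text: two or more nodes in one DC

# dc_health MINUTES_DOWN...
# Each argument is how long one node of a single component in the DC has
# been down, in whole minutes (0 means the node is up).
# Prints "unhealthy" when at least MIN_DOWN_NODES nodes have been down for
# DOWN_THRESHOLD_MIN minutes or more, otherwise "healthy".
dc_health() {
  local count=0 minutes
  for minutes in "$@"; do
    if [ "$minutes" -ge "$DOWN_THRESHOLD_MIN" ]; then
      count=$((count + 1))
    fi
  done
  if [ "$count" -ge "$MIN_DOWN_NODES" ]; then
    echo "unhealthy"
  else
    echo "healthy"
  fi
}

dc_health 0 20 16   # two nodes past the threshold
dc_health 0 20 5    # only one node past the threshold
```

In this model, a single long outage or two short outages leave the DC healthy; only two or more nodes that have each been down for the full 15 minutes mark the DC as unhealthy.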
Mission Control controls the PHA multi-data center failover mechanism and instructs Nodekeepers on what to do. Every minute, each Nodekeeper sends a health check to Mission Control. In the response, each Nodekeeper receives information about the health state of other nodes collected by Mission Control, along with some instructions. The instructions can include stopping or starting the server, or starting a Cassandra repair. Nodekeepers don't receive the instructions simultaneously. Nodekeepers should receive the instructions within a minute if there are no connection problems.
If one part of a Managed Cluster loses connection with another part, the disconnected part isn't necessarily unavailable. The problem might be a connectivity issue. PHA automatically repairs network disconnections of up to 72 hours between DCs. MC tracks node health and designates one part as primary, or surviving. During recovery, this designation determines how to re-sync all parts of the Managed Cluster.
The graphics in the following sections illustrate the PHA failover mechanism in case of Elasticsearch or Cassandra node outages.
The following graphic illustrates the PHA failover mechanism when two or more Elasticsearch nodes in a DC are down.
The following graphic illustrates the PHA failover mechanism when two or more Cassandra nodes in a DC are down.
Dynatrace ignores racks when a DC contains only one or two racks.
When you configure rack awareness for a Managed Cluster, make sure to account for it during unhealthy data center detection.
The following five rules apply when evaluating DC health:
The PHA failover mechanism is triggered when at least two Elasticsearch or Cassandra nodes are down. When a node isn't reachable by other nodes, it's automatically added to the list of down Elasticsearch or Cassandra nodes.
No. Dynatrace reacts only to Elasticsearch down status.
Dynatrace repairs only the nodes that were down.
No. The Elasticsearch or Cassandra nodes must be down for 15 minutes for the failover mechanism to start. Mission Control needs 30 minutes to consider the Managed Cluster fully recovered.
If the Managed Cluster isn't repaired within 72 hours of the failover, Mission Control marks it as not repaired. Such a Managed Cluster isn't reliable, and you should replicate it from a healthy DC.
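The time windows described above (15 minutes before failover starts, 72 hours as the repair limit) can be summarized in a short sketch. The function name `outage_class` and its labels are hypothetical, and the handling of the exact boundary values is an assumption, since the text doesn't specify whether the limits are inclusive.

```shell
#!/usr/bin/env bash
# Illustrative only: maps an outage duration to the documented outcome.

FAILOVER_AFTER_MIN=15            # from the text: failover starts after 15 minutes
REPAIR_LIMIT_MIN=$((72 * 60))    # from the text: 72-hour limit, in minutes

# outage_class MINUTES_DOWN
# Prints which documented outcome applies to an outage of that length.
outage_class() {
  local minutes=$1
  if [ "$minutes" -lt "$FAILOVER_AFTER_MIN" ]; then
    echo "no-failover"                 # too short to trigger the mechanism
  elif [ "$minutes" -lt "$REPAIR_LIMIT_MIN" ]; then
    echo "automatic-failover"          # handled by the PHA failover mechanism
  else
    echo "replicate-from-healthy-dc"   # beyond the limit: replicate the cluster
  fi
}

outage_class 10
outage_class 90
outage_class 5000
```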
If the primary DC fails and the Managed Cluster can't connect to MC, act as soon as possible.
Shut down any nodes that are still running in the failed DC.
The healthy secondary DC can continue to serve the Managed Cluster, provided that the failed DC is completely down.
After the Managed Cluster can connect to MC again, plan to restore or recreate the failed DC according to the PHA recovery procedures.
No. Without access to MC, the secondary DC doesn't autonomously detect that it needs to take over.
However, the Managed Cluster can continue to work from the healthy secondary DC if the failed DC is completely down.
No. There is no manual or local failover option that bypasses MC.
If MC isn't reachable, shut down any nodes still running in the failed DC and continue working from the healthy secondary DC.
After the Managed Cluster can connect to MC again, plan the failed DC restoration or recreation according to the PHA recovery procedures.
The message specifies which components are down (Elasticsearch, Cassandra, or both). First check if the machine is up and running. Then try to start Elasticsearch and Cassandra using the following commands:
/opt/dynatrace-managed/launcher/elasticsearch.sh start
/opt/dynatrace-managed/launcher/cassandra.sh start
If you notice any issues with the start process, check the following logs:
/var/opt/dynatrace-managed/log/elasticsearch/*
/var/opt/dynatrace-managed/log/cassandra/*
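When a start attempt fails, it usually helps to look at the most recently written files first. The helper below is a minimal sketch of that step; `latest_logs` is a hypothetical name, not a Dynatrace tool, and it takes the log directory as an argument rather than assuming any particular layout.

```shell
#!/usr/bin/env bash
# latest_logs DIR [COUNT]
# Prints the names of the COUNT most recently modified entries in DIR
# (default 3), newest first. Returns 1 if DIR doesn't exist.
latest_logs() {
  local dir=$1 count=${2:-3}
  if [ ! -d "$dir" ]; then
    echo "no such log directory: $dir" >&2
    return 1
  fi
  # ls -t sorts by modification time, newest first.
  # Good enough for a quick look; not safe for names containing newlines.
  ls -t "$dir" | head -n "$count"
}
```

For example, `latest_logs /var/opt/dynatrace-managed/log/cassandra 5` lists the five most recently modified entries in the Cassandra log directory.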
The automatic repair process runs only once, even if it fails. If the automatically triggered repair fails on a node, run the repair manually on that node. Dynatrace continues with repairs on the other nodes. Only after all repairs have finished does Dynatrace report which repairs failed and consider the repair complete.
In an unhealthy DC, check nodekeeper.0.0.log, nodekeeper-healthcheck.0.log, and repair-cassandra-data.log.
In a healthy DC, only server.log and audit.cluster.event.0.0.log are important.