Because node failover has the potential to reduce the performance of your cluster, you should consider how best to handle a failover situation. Using automated failover means that a cluster can fail over a node without user-intervention and without knowledge and identification of the issue that caused the node failure. It still requires you to initiate a rebalance in order to return the cluster to a healthy state.
If you choose manual failover to manage your cluster you need to monitor the cluster and identify when an issue occurs. If an issues does occur you then trigger a manual failover and rebalance operation. Although it requires more monitoring and manual intervention, there is a possibility that your cluster and data access may still degrade before you initiate failover and rebalance.
In the following sections the two alternatives and their issues are described in more detail.
Automatically failing components in any distributed system can cause problems. If you cannot identify the cause of failure, and you do not understand the load that will be placed on the remaining system, then automated failover can cause more problems than it is designed to solve. Some of the situations that might lead to problems include:
Avoiding Failover Chain-Reactions (Thundering Herd)
Imagine a scenario where a Couchbase Server cluster of five nodes is operating at 80-90% aggregate capacity in terms of network load. Everything is running well but at the limit of cluster capacity. Imagine a node fails and the software decides to automatically failover that node. It is unlikely that all of the remaining four nodes are be able to successfully handle the additional load.
The result is that the increased load could lead to another node failing and being automatically failed over. These failures can cascade and lead to the eventual loss of an entire cluster. Clearly having 1/5th of the requests not being serviced due to single node failure would be more desirable than none of the requests being serviced due to an entire cluster failure.
The solution in this case is to continue cluster operations with the single node failure, add a new server to the cluster to handle the missing capacity, mark the failed node for removal and then rebalance. This way there is a brief partial outage rather than an entire cluster being disabled.
One alternate preventative solution is to ensure there is excess capacity to handle unexpected node failures and allow replicas to take over.
Handling Failovers with Network Partitions
If you have a network partition across the nodes in a Couchbase cluster, automatic failover would lead to nodes on both sides of the partition to automatically failover. Each functioning section of the cluster would now have to assume responsibility for the entire document ID space. While there would be consistency for a document ID within each partial cluster, there would start to be inconsistency of data between the partial clusters. Reconciling those differences may be difficult, depending on the nature of your data and your access patterns.
Assuming one of the two partial clusters is large enough to cope with all traffic, the solution is to direct all traffic for the cluster to that single partial cluster. The separated nodes could then be re-added to the cluster to bring the cluster to its original size.
Handling Misbehaving Nodes
There are cases where one node loses connectivity to the cluster or functions as if it has lost connectivity to the cluster. If you enable it to automatically failover the rest of the cluster, that node is able to create a cluster-of-one. The result for your cluster is a similar partition situation we described previously.
In this case you should make sure there is spare node capacity in your cluster and failover the node with network issues. If you determine there is not enough capacity, add a node to handle the capacity after your failover the node with issues.
Performing manual failover through monitoring can take two forms, either by human monitoring or by using a system external to the Couchbase Server cluster. An external monitoring system can monitor both the cluster and the node environment and make a more information-driven decision. If you choose a manual failover solution, there are also issues you should be aware of. Although automated failover has potential issues, choosing to use manual or monitored failover is not without potential problems.
Human intervention
One option is to have a human operator respond to alerts and make a decision on what to do. Humans are uniquely capable of considering a wide range of data, observations and experiences to best resolve a situation. Many organizations disallow automated failover without human consideration of the implications. The drawback of using human intervention is that it will be slower to respond than using a computer-based monitoring system.
External monitoring
Another option is to have a system monitoring the cluster via the Management REST API. Such an external system is in a good position to failover nodes because it can take into account system components that are outside the scope of Couchbase Server.
For example monitoring software can observe that a network switch is failing and that there is a dependency on that switch by the Couchbase cluster. The system can determine that failing Couchbase Server nodes will not help the situation and will therefore not failover the node.
The monitoring system can also determine that components around Couchbase Server are functioning and that various nodes in the cluster are healthy. If the monitoring system determines the problem is only with a single node and remaining nodes in the cluster can support aggregate traffic, then the system may failover the node using the REST API or command-line tools.