Because failover has the potential to significantly reduce the performance of your cluster, you should consider how best to handle a failover situation.
Using automated failover implies that you are happy for a node to be failed over without user intervention, and without first identifying the issue that triggered the failover. It does not, however, negate the need to initiate a rebalance to return the cluster to a healthy state.
Manual failover requires constant monitoring of the cluster to identify when an issue occurs, followed by manually triggering a failover and rebalance operation. In addition to the extra monitoring and manual intervention this requires, there is also the possibility that your cluster and data access will have degraded significantly before the failover and rebalance are initiated.
In the following sections the two alternatives and their issues are described in more detail.
Automatically failing over components in any distributed system has the potential to cause problems. There are many examples of high-profile applications that have taken themselves offline through unchecked automated failover strategies. Some of the situations that might lead to pathological behavior in an automated failover solution include:
Scenario 1: Thundering herd
Imagine a scenario in which a Couchbase Server cluster of five nodes is operating at 80-90% aggregate capacity in terms of network load. Everything is running well, though at the limit. Now a node fails and the software decides to automatically fail over that node. It is unlikely that the remaining four nodes would be able to handle the additional load successfully.
The result is that the increased load could lead to another node failing and being automatically failed over. These failures can cascade, leading to the eventual loss of the entire cluster. Clearly, having one fifth of the requests go unserviced would be more desirable than having none of the requests serviced.
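To make the risk concrete, a rough back-of-the-envelope calculation (using an illustrative 85% utilization from the 80-90% range above, and assuming traffic is spread evenly) shows why the surviving nodes are pushed past their limit:

```python
# Illustrative calculation for the thundering-herd scenario; the 85%
# utilization figure is an assumption drawn from the 80-90% range above.
nodes = 5
utilization = 0.85                  # per-node utilization before the failure

# After one node is failed over, its share of the traffic is spread across
# the remaining nodes (assuming an even distribution of load).
survivors = nodes - 1
post_failover_utilization = utilization * nodes / survivors

print(f"Per-node utilization after failover: {post_failover_utilization:.0%}")
# Roughly 106%: the surviving nodes are driven beyond capacity, which is
# what triggers the cascade of further failures.
```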
The solution in this case would be to live with the single node failure, add a new server to the cluster, mark the failed node for removal, and then rebalance. This way there is a brief partial outage rather than the entire cluster being disabled.
An alternative, preventative solution is to ensure there is enough excess capacity to handle unexpected node failures and allow replicas to take over.
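As a sketch of what the recovery sequence for this scenario could look like against the Management REST API, the following uses the /controller/addNode and /controller/rebalance endpoints; the host names, credentials, and exact parameters are illustrative assumptions, so check the REST API reference for your release before relying on them.

```python
# Hypothetical recovery sequence for Scenario 1: add a replacement server,
# eject the failed node, and rebalance. Host names, credentials, and the
# exact endpoint parameters are assumptions for illustration only.
import requests

CLUSTER = "http://cb-node1.example.com:8091"   # any healthy node in the cluster
AUTH = ("Administrator", "password")

# 1. Add a replacement server to the cluster.
requests.post(f"{CLUSTER}/controller/addNode", auth=AUTH, data={
    "hostname": "cb-node6.example.com",
    "user": "Administrator",
    "password": "password",
}).raise_for_status()

# 2. Rebalance, ejecting the failed node so its data is redistributed
#    across the remaining servers.
known = ",".join(f"ns_1@cb-node{i}.example.com" for i in range(1, 7))
requests.post(f"{CLUSTER}/controller/rebalance", auth=AUTH, data={
    "knownNodes": known,
    "ejectedNodes": "ns_1@cb-node5.example.com",   # the failed node
}).raise_for_status()
```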
Scenario 2: Network partition
If a network partition occurs across the nodes within a Couchbase Server cluster, automatic failover would lead each side of the partition to fail over the other. Each section of the cluster would then assume responsibility for the entire document ID space. While a given document ID would remain consistent within each partial cluster, the data would begin to diverge between the partial clusters. Reconciling those differences may be difficult, depending on the nature of your data and your access patterns.
Assuming one of the two partial clusters is large enough to cope with all traffic, the solution would be to direct all traffic for the cluster to that partial cluster. The separated nodes could then be re-added to bring the cluster back up to its original size.
Scenario 3: Misbehaving node
If one node loses connectivity to the cluster (or "thinks" that it has lost connectivity), allowing it to automatically fail over the rest of the cluster would lead to that node forming a cluster of one. The result is a partition situation similar to the one described above.
In this case the solution is to take down the node that has connectivity issues and let the rest of the cluster handle the load (assuming there is spare capacity available).
Although automated failover has potential pitfalls, choosing manual or monitored failover is not without problems of its own.
If the cause of the failure has not been identified, and the load that will be placed on the remaining nodes is not well understood, then automated failover can cause more problems than it is designed to solve. An alternative is to use monitoring to drive the failover decision. Monitoring can take two forms: either a human operator, or a system external to the Couchbase Server cluster that can monitor both the cluster and the node environment and make a more informed decision.
One option is to have a human operator respond to alerts and make a decision on what to do. Humans are uniquely capable of considering a wide range of data, observations and experience to best resolve a situation. Many organizations disallow automated failover without human consideration of the implications.
Another option is to have a system monitoring the cluster via the Management REST API. Such an external system is in the best position to order the failover of nodes, because it can take into account system components that are outside the visibility of Couchbase Server.
For example, by observing that a network switch is failing intermittently, and that the Couchbase Server cluster depends on that switch, the management system may determine that failing over the Couchbase Server nodes will not help the situation.
If, however, everything around Couchbase Server and across the various nodes is healthy, the problem does indeed appear to be isolated to a single node, and the remaining nodes can support the aggregate traffic, then the management system may fail the node over using the REST API or command-line tools.
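As a rough illustration of this approach, the sketch below polls node status through the Management REST API and triggers a failover only when its (deliberately simplified) external checks pass. The endpoints and fields used (/pools/default, /controller/failOver, status, otpNode) reflect common usage of the API, but the details and the decision logic are assumptions; a real monitoring system would fold in the environmental checks described above.

```python
# Minimal sketch of an external monitor that decides whether to fail a node
# over via the Management REST API. Endpoints, field names, and credentials
# are assumptions for illustration; the decision logic is deliberately
# simplified and should be replaced with real environmental checks.
import requests

CLUSTER = "http://cb-node1.example.com:8091"
AUTH = ("Administrator", "password")

def unhealthy_nodes():
    """Return the otpNode identifiers of nodes the cluster reports as unhealthy."""
    pool = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()
    return [n["otpNode"] for n in pool["nodes"] if n["status"] != "healthy"]

def safe_to_fail_over():
    """Placeholder for the wider checks described above: shared-infrastructure
    health (for example, the network switch) and whether the surviving nodes
    can absorb the full traffic load."""
    return True

for otp_node in unhealthy_nodes():
    if safe_to_fail_over():
        # Fail over the suspect node; a rebalance is still needed afterwards
        # to return the cluster to a healthy state.
        requests.post(f"{CLUSTER}/controller/failOver",
                      auth=AUTH, data={"otpNode": otp_node}).raise_for_status()
```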