Automatically failing over components in a distributed system can be perilous. There are countless examples of high-profile applications that have taken themselves offline through unchecked automated failover strategies. Some of the situations that might lead to pathological behavior include:
Situation 1 - Thundering herd
Imagine a scenario where a Membase cluster of five nodes is being run at 80-90% aggregate capacity in terms of network load. Everything is running well, though at the limit. Now a node fails and the software decides to automatically fail it over. It is unlikely that the remaining four nodes could handle the additional load: each would have to absorb a quarter of the failed node's traffic, pushing per-node load to roughly 100-113% of capacity, which could lead to another node being automatically failed over. These failures can cascade, leading to the eventual loss of the entire cluster. Clearly, having one fifth of requests go unserviced would be more desirable than having none of them serviced.
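To see why the cascade is hard to stop once it starts, here is a back-of-the-envelope simulation in plain Python (not Membase code). It assumes total demand stays constant and that a failed node's traffic is redistributed evenly across the survivors:

```python
# Back-of-the-envelope simulation (plain Python, not Membase code).
# Assumption: total demand stays constant and a failed node's traffic
# is redistributed evenly across the surviving nodes.

def simulate_cascade(nodes, per_node_load):
    """per_node_load is each node's starting utilisation (1.0 = 100%)."""
    total_demand = nodes * per_node_load  # demand does not shrink with nodes
    nodes -= 1                            # the initial hardware failure
    while nodes > 0:
        load = total_demand / nodes
        print(f"{nodes} nodes at {load:.0%} each")
        if load <= 1.0:
            break                         # survivors can absorb the traffic
        nodes -= 1                        # an overloaded node fails over next

simulate_cascade(nodes=5, per_node_load=0.85)
# 4 nodes at 106% each   <- already over capacity after one failover
# 3 nodes at 142% each
# 2 nodes at 212% each
# 1 nodes at 425% each   <- the entire cluster is now gone
```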
As the data infrastructure architect of a large social networking site once said (paraphrased): "If half your servers are down and you're still serving half your users, you're doing really well."
The solution in this case would be to live with the single node failure, add a new server to the cluster, mark the failed node for removal, and then rebalance. This way there is a brief partial outage rather than the entire cluster being downed. Ensuring there is excess capacity to handle node failures, with replicas taking over, is another solution to this problem.
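One hypothetical safeguard, sketched below in Python (the class and its API are illustrative assumptions, not Membase features), is to cap automatic action: allow at most one automatic failover, then require an operator to add capacity, rebalance, and explicitly reset the counter before the system may act again.

```python
# Hypothetical safeguard (illustrative class, not part of Membase):
# permit at most one automatic failover, then insist on a human
# decision before any further automatic action.

class FailoverGuard:
    def __init__(self, max_auto_failovers=1):
        self.max_auto_failovers = max_auto_failovers
        self.auto_failovers = 0

    def may_auto_failover(self):
        return self.auto_failovers < self.max_auto_failovers

    def record_auto_failover(self):
        self.auto_failovers += 1

    def operator_reset(self):
        # Called by an operator after adding a node and rebalancing
        # has restored spare capacity to the cluster.
        self.auto_failovers = 0

guard = FailoverGuard()
if guard.may_auto_failover():
    # ... perform the failover here ...
    guard.record_auto_failover()
```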
Situation 2 - Network partition
If a network partition occurs within a Membase cluster, automatic failover would lead each side to decide to fail over the other. Each portion would then assume responsibility for the entire key space, so whilst there is consistency for a key within each partial cluster, the data between the two partial clusters would begin to diverge. Reconciling those differences may be difficult, depending on the nature of your data and your access patterns.
Assuming one of the two partial clusters is large enough to cope with all traffic, the solution would be to direct all traffic for the cluster to that partial cluster and then, once the partition heals, add the previously inaccessible machines back in to restore the original cluster size.
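A common defence against this kind of split-brain, sketched below as an assumption rather than Membase's actual algorithm, is a majority-quorum rule: a node may only participate in failing over its peers when its side of the partition still holds a strict majority of the configured cluster. At most one side can then ever act, and in an even split neither does.

```python
# Majority-quorum rule, sketched as an assumption (not Membase's
# actual algorithm): only the side of a partition holding a strict
# majority of the configured cluster may fail over unreachable peers.

def may_fail_over_peers(reachable_nodes, cluster_size):
    """reachable_nodes counts the local node plus the peers it can contact."""
    return reachable_nodes > cluster_size // 2

# In a 5-node cluster split 3/2, only the 3-node side passes the
# check, so at most one side ever claims the full key space:
assert may_fail_over_peers(reachable_nodes=3, cluster_size=5)
assert not may_fail_over_peers(reachable_nodes=2, cluster_size=5)
# In an even 2/2 split of a 4-node cluster, neither side acts:
assert not may_fail_over_peers(reachable_nodes=2, cluster_size=4)
```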
Situation 3 - Misbehaving node
If one node loses connectivity to the cluster (or "thinks" that it has), allowing it to automatically fail over the rest of the cluster would leave that node as a cluster of one. The same partition situation described above then arises again.
In this case the solution is to take down the node that has connectivity issues and let the rest of the cluster handle the load (assuming there is spare capacity available).
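A related guard, again an illustrative sketch rather than Membase behaviour, is for a node to treat itself as the suspect when it cannot reach most of its peers: instead of failing over everyone else, the isolated node stands down.

```python
# Illustrative self-quarantine rule (an assumption, not Membase
# behaviour): a node that cannot reach a majority of the cluster
# treats itself as the suspect and stops serving, rather than
# failing over the rest of the cluster.

def local_verdict(peers_reachable, cluster_size):
    reachable_total = peers_reachable + 1  # +1 counts the local node
    if reachable_total > cluster_size // 2:
        return "stay"        # we hold the majority; peers may really be down
    return "quarantine"      # we are the isolated minority; stand down

# A node cut off from a 5-node cluster reaches 0 of its 4 peers:
assert local_verdict(peers_reachable=0, cluster_size=5) == "quarantine"
# Each node on the healthy 4-node side still reaches 3 peers:
assert local_verdict(peers_reachable=3, cluster_size=5) == "stay"
```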