The lesson is that automatically failing over a node can cause problems if the cause of the failure and the load that will be placed on the remaining nodes are not well understood. The best solution is to let monitoring drive the failover decision. Monitoring can take two forms: a human operator or an external system.
Human intervention. One option is to have a human operator respond to alerts and decide what to do. Humans are uniquely capable of weighing a wide range of data, observations, and experience to resolve a situation. Many organizations disallow automated failover without human consideration of the implications. However, this is not always feasible, for large or small companies alike.
External monitoring. Another option is to have an external system monitor the cluster via our Management REST API. Such a system is in the best position to order the failover of nodes because it can take into account system components that are outside the scope of Membase's visibility. For example, by observing that a network switch is failing intermittently and that the Membase cluster depends on that switch, the management system may determine that failing over the Membase nodes will not help the situation. If, however, everything around Membase and across the various nodes is healthy, the problem does indeed look confined to a single node, and the remaining nodes can absorb the aggregate traffic, then the management system may fail that node over. Membase fully supports this model through its REST interface.
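The decision an external monitor makes can be sketched as a small function. This is only an illustration of the reasoning described above, not part of Membase: the function name, its parameters, and the simple per-node capacity model are all assumptions made for the example. Collecting node health and issuing the actual failover would go through the REST interface and is deliberately left out.

```python
from typing import Dict, Optional


def choose_node_to_fail_over(
    node_healthy: Dict[str, bool],
    network_healthy: bool,
    aggregate_ops_per_sec: float,
    node_capacity_ops_per_sec: float,
) -> Optional[str]:
    """Return the single node to fail over, or None if failover is unwise.

    Hypothetical helper: every name and the uniform per-node capacity
    model are assumptions for illustration only.
    """
    # A fault outside the cluster (for example, a bad network switch)
    # will not be fixed by failing nodes over, so leave the cluster alone.
    if not network_healthy:
        return None

    unhealthy = [name for name, ok in node_healthy.items() if not ok]

    # Act only when the problem is confined to exactly one node.
    if len(unhealthy) != 1:
        return None

    # The surviving nodes must be able to absorb all of the traffic.
    survivors = len(node_healthy) - 1
    if survivors * node_capacity_ops_per_sec < aggregate_ops_per_sec:
        return None

    return unhealthy[0]
```

For a three-node cluster in which only `node-b` is unhealthy, the network is healthy, and each node can serve 10,000 operations per second, a call with an aggregate load of 15,000 returns `node-b`; with an aggregate load of 25,000 it returns `None`, because the two surviving nodes could not carry the traffic.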