Automatic failover was introduced in Membase Server 1.7.1.
If a node (or a small number of nodes) in a cluster genuinely fails, for example through a hardware crash, and the remaining nodes have enough headroom to handle the additional load, automatic failover with an alert can increase system availability. Deciding that a node is down is non-trivial, however, especially in cloud environments where network latency is highly variable.
Because failing over a node at the wrong time can make a bad situation worse, we have placed several restrictions on the feature:
Automatic failover is off by default. We still consider it best practice to have an external system (either human or automated) monitor the Membase cluster, so that events such as network partitions do not lead to failovers that cause more harm than good.
Automatic failover is only available on clusters of at least 3 nodes. This is to prevent a network partition from causing both sides to fail each other over.
Automatic failover will fail over only 1 node before requiring administrative interaction. This is to prevent a cascading failure from taking the cluster completely out of operation.
There is a minimum 30-second delay before a node is failed over. The delay can be raised but not lowered below 30 seconds, because the software is hard-coded to perform multiple "pings" of a node that is perceived to be down. This is to prevent a slow node or a flaky network connection from triggering an inappropriate failover.
Email alerts can be configured for node failures. An alert can be sent both when an automatic failover occurs and when it does not.
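The restrictions above can be summarized as a small decision function. The following Python sketch is for illustration only and is not Membase's actual implementation: the names `ClusterState` and `may_auto_failover` are hypothetical, and only the three limits (3 nodes, 30 seconds, 1 failover) come from the text above.

```python
from dataclasses import dataclass

# Limits taken from the restrictions listed above.
MIN_CLUSTER_SIZE = 3      # feature is only available on clusters of at least 3 nodes
MIN_TIMEOUT_SECONDS = 30  # minimum delay before a node is failed over
MAX_AUTO_FAILOVERS = 1    # only 1 node is failed over before an administrator must act


@dataclass
class ClusterState:
    """Hypothetical snapshot of the settings and counters relevant to the decision."""
    enabled: bool                # automatic failover is off by default
    node_count: int              # number of nodes in the cluster
    failover_count: int          # automatic failovers since the last counter reset
    timeout_seconds: int = MIN_TIMEOUT_SECONDS


def may_auto_failover(state: ClusterState, seconds_unresponsive: float) -> bool:
    """Return True only if every restriction allows failing the node over."""
    if not state.enabled:
        return False
    if state.node_count < MIN_CLUSTER_SIZE:
        return False
    if state.failover_count >= MAX_AUTO_FAILOVERS:
        return False
    # The configured delay can be raised but never drops below the minimum.
    timeout = max(state.timeout_seconds, MIN_TIMEOUT_SECONDS)
    return seconds_unresponsive >= timeout
```

For example, `may_auto_failover(ClusterState(enabled=True, node_count=2, failover_count=0), 60)` is rejected because the cluster has fewer than 3 nodes, no matter how long the node has been unresponsive.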
To configure the feature, select Settings -> Automatic Failover in the UI, or use the REST API.
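As a rough illustration of the REST route, the sketch below builds (but does not send) a request that enables automatic failover. It assumes the endpoint `POST /settings/autoFailover` with the form fields `enabled` and `timeout` on administration port 8091, authenticated with HTTP Basic credentials. Verify these against the REST API reference for your server version. The function name, host, and credentials are placeholders.

```python
import base64
import urllib.parse
import urllib.request

MIN_TIMEOUT_SECONDS = 30  # minimum delay stated in the restrictions above


def build_autofailover_request(host, user, password, enabled, timeout=30, port=8091):
    """Build (without sending) a request that changes the automatic failover settings.

    Assumption: the settings live at POST /settings/autoFailover and accept
    the form fields 'enabled' and 'timeout'.
    """
    if timeout < MIN_TIMEOUT_SECONDS:
        raise ValueError("timeout must be at least 30 seconds")

    url = f"http://{host}:{port}/settings/autoFailover"
    body = urllib.parse.urlencode(
        {"enabled": "true" if enabled else "false", "timeout": timeout}
    ).encode("ascii")

    request = urllib.request.Request(url, data=body, method="POST")
    # HTTP Basic authentication with the administrator credentials.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request
```

Passing the returned object to `urllib.request.urlopen` would submit the change. Building the request separately makes it easy to inspect the URL and body before anything reaches the cluster.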
After a node has been automatically failed over, the administrator must reset the automatic failover counter before the feature will work again.
This should only be done after restoring the cluster to a healthy and balanced state.
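The reset step might be scripted along the following lines. This is a sketch under assumptions: it presumes the counter is reset with `POST /settings/autoFailover/resetCount` on port 8091, and that cluster status is available as a parsed JSON document with a top-level `balanced` flag and a per-node `status` of `healthy`, which is what `GET /pools/default` is assumed to return. Check both against the REST API reference for your version. The function names are hypothetical, and authentication is omitted for brevity.

```python
import urllib.request


def cluster_is_healthy_and_balanced(pool_info):
    """Check an already-parsed cluster status document.

    Assumption: 'pool_info' has a top-level 'balanced' flag and a 'nodes'
    list whose entries carry a 'status' field.
    """
    nodes = pool_info.get("nodes", [])
    all_healthy = bool(nodes) and all(n.get("status") == "healthy" for n in nodes)
    return all_healthy and pool_info.get("balanced") is True


def build_reset_count_request(host, pool_info, port=8091):
    """Build (without sending) the counter reset request.

    Refuses to build the request unless the cluster looks healthy and
    balanced, mirroring the advice in the text above.
    """
    if not cluster_is_healthy_and_balanced(pool_info):
        raise RuntimeError("restore and rebalance the cluster before resetting the counter")
    url = f"http://{host}:{port}/settings/autoFailover/resetCount"
    # A real request would also need administrator credentials.
    return urllib.request.Request(url, data=b"", method="POST")
```

Making the health check a precondition keeps a script from re-arming automatic failover while the cluster is still degraded.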