Due to the potential for problems when using automated failover (see Section 5.1.1.1, “Automated failover considerations”), there are a number of restrictions on the automatic failover functionality in Couchbase Server:
Automatic failover is disabled by default. This prevents Couchbase Server from using automatic failover without the functionality being explicitly enabled.
Automatic failover is only available on clusters of at least three nodes.
Automatic failover will only fail over one node before requiring administrative interaction. This is to prevent a cascading failure from taking the cluster completely out of operation.
There is a minimum 30-second delay before a node is failed over. This delay can be raised, but the software is hard coded to perform multiple "pings" of a node that is perceived to be down. This prevents a slow node or a flaky network connection from triggering an inappropriate failover.
If two or more nodes go down at the same time, that is, within the specified delay period, the automatic failover system will not fail over any nodes.
Email notifications can be configured so that, when a node fails, a message is sent both when an automatic failover occurs and when it does not.
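The restrictions above can be read as a single decision rule. The following sketch models that rule as a shell function so that the interaction between the restrictions is easier to follow. It is an illustration only and does not reflect Couchbase Server's internal implementation; the function name, its arguments, and the messages it prints are all hypothetical.

```shell
# Illustrative model of the documented restrictions. This is NOT the
# logic Couchbase Server runs internally; all names are hypothetical.
#
# Arguments:
#   $1  enabled         "true" or "false" (auto-failover is off by default)
#   $2  cluster_nodes   total number of nodes in the cluster
#   $3  nodes_down      nodes perceived down within the delay period
#   $4  failover_count  nodes already failed over automatically since an
#                       administrator last intervened
# Prints "failover" when one node would be failed over, otherwise
# "skip: <reason>".
would_auto_failover() {
    enabled=$1
    cluster_nodes=$2
    nodes_down=$3
    failover_count=$4

    if [ "$enabled" != "true" ]; then
        echo "skip: automatic failover is disabled"
    elif [ "$cluster_nodes" -lt 3 ]; then
        echo "skip: cluster has fewer than three nodes"
    elif [ "$failover_count" -ge 1 ]; then
        echo "skip: administrative interaction is required first"
    elif [ "$nodes_down" -ge 2 ]; then
        echo "skip: two or more nodes are down"
    elif [ "$nodes_down" -eq 1 ]; then
        echo "failover"
    else
        echo "skip: no node is down"
    fi
}
```

For example, would_auto_failover true 4 1 0 prints "failover", while the same call with two nodes down, or with a cluster of only two nodes, prints a "skip" message.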
To configure automatic failover through the Administration Web Console, see Section 6.6.2, “Auto-Failover Settings”. For information on using the REST API, see Section 8.9, “Retrieve Auto-Failover Settings”.
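As a rough illustration of the REST approach, the sketch below wraps the call that enables automatic failover in a small shell function. It assumes that the auto-failover settings endpoint accepts a POST with the form parameters enabled and timeout (in seconds); check the REST API section before relying on this. The function name, the CB_USER, CB_PASS, CB_HOST, and DRY_RUN variables, and the default credentials are placeholders invented for this example.

```shell
# Sketch only. Assumes POST /settings/autoFailover accepts the form
# parameters "enabled" and "timeout" (seconds). CB_USER, CB_PASS,
# CB_HOST and DRY_RUN are hypothetical variables for this example.
enable_auto_failover() {
    timeout=${1:-30}

    # Reject anything that is not a plain non-negative integer.
    case "$timeout" in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac

    # The documented minimum delay is 30 seconds, so refuse lower
    # values before contacting the server.
    if [ "$timeout" -lt 30 ]; then
        echo "timeout must be at least 30 seconds" >&2
        return 1
    fi

    # When DRY_RUN is set to a non-empty value, the command is printed
    # instead of being executed.
    ${DRY_RUN:+echo} curl -i \
        -u "${CB_USER:-cluster-username}:${CB_PASS:-cluster-password}" \
        -d "enabled=true&timeout=$timeout" \
        "http://${CB_HOST:-localhost:8091}/settings/autoFailover"
}
```

Running the function with DRY_RUN=1 set shows the exact curl command without sending it, which is a convenient way to review the request before applying it to a production cluster.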
Once an automatic failover has occurred, the Couchbase cluster relies on replicas to serve data. A rebalance should be initiated to return your cluster to a proper operational state. For more information, see Section 5.1.4, “Handling a Failover Situation”.
After a node has been automatically failed over, an internal counter is used to identify how many nodes have been failed over. This counter is used to prevent the automatic failover system from failing over additional nodes until the issue that caused the failover has been identified and rectified.
To re-enable automatic failover, the administrator must reset the counter manually.
Resetting the automatic failover counter should only be performed after restoring the cluster to a healthy and balanced state.
You can reset the counter using the REST API:
shell> curl -i -X POST -u cluster-username:cluster-password \
    http://localhost:8091/settings/autoFailover/resetCount
More information on using the REST API for this operation can be found in Section 8.11, “Resetting Auto-Failovers”.
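Before resetting, it can be useful to confirm that the counter is actually non-zero. The following sketch reads the current auto-failover settings and resets the counter only when needed. It assumes that the settings endpoint returns a flat JSON object containing a numeric count field, for example {"enabled":true,"timeout":30,"count":1}, and that the reset is issued as a POST request. The function names and the CB_USER, CB_PASS, and CB_HOST variables are hypothetical. The sketch does not check that the cluster is healthy and balanced; that remains the administrator's responsibility.

```shell
# Extract the numeric "count" field from auto-failover settings JSON
# read on standard input. Assumes a flat object such as
# {"enabled":true,"timeout":30,"count":1}.
auto_failover_count() {
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Reset the counter only if it is non-zero. Requires a running cluster;
# CB_USER, CB_PASS and CB_HOST are hypothetical placeholder variables.
reset_auto_failover_if_needed() {
    base="http://${CB_HOST:-localhost:8091}/settings/autoFailover"
    auth="${CB_USER:-cluster-username}:${CB_PASS:-cluster-password}"

    count=$(curl -s -u "$auth" "$base" | auto_failover_count)
    if [ "${count:-0}" -gt 0 ]; then
        curl -i -X POST -u "$auth" "$base/resetCount"
    else
        echo "counter is already zero; nothing to reset"
    fi
}
```

Calling reset_auto_failover_if_needed after a rebalance has completed issues the reset request only when a previous automatic failover has been recorded.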