The replica system within Couchbase Server enables the cluster to cope with the failure of one or more nodes without affecting your ability to access the stored data. If a problem occurs on one of the nodes, you can fail over that node. Failover removes the node from the cluster and enables, on the other nodes in the cluster, the replicas of the data that was stored on that node.
Because failing over a node enables the replica vBuckets for the data that node stored, the load on the nodes holding those replicas increases. Once the failover has occurred, cluster performance is degraded, and the number of replicas of the affected data is reduced by one.
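The effect described above can be illustrated with a small toy model. This is only a sketch, not how Couchbase Server implements its vBucket map: the `fail_over` and `active_counts` functions, the node names, and the six-vBucket layout are all hypothetical and chosen for illustration.

```python
from collections import Counter


def fail_over(vbucket_map, failed_node):
    """Return a new map with failed_node removed from every vBucket chain.

    vbucket_map maps a vBucket id to a list of node names. The first entry
    holds the active copy; the remaining entries hold replicas.
    """
    new_map = {}
    for vbucket_id, chain in vbucket_map.items():
        # Dropping the failed node promotes the next replica to active
        # whenever the failed node was first in the chain.
        new_map[vbucket_id] = [node for node in chain if node != failed_node]
    return new_map


def active_counts(vbucket_map):
    """Count how many active vBuckets each node serves."""
    return Counter(chain[0] for chain in vbucket_map.values() if chain)


# Hypothetical 3-node cluster with 6 vBuckets and one replica each.
cluster = {
    0: ["node_a", "node_b"],
    1: ["node_a", "node_c"],
    2: ["node_b", "node_c"],
    3: ["node_b", "node_a"],
    4: ["node_c", "node_a"],
    5: ["node_c", "node_b"],
}

after = fail_over(cluster, "node_a")
print(dict(active_counts(cluster)))  # two active vBuckets per node
print(dict(active_counts(after)))    # three each on the two surviving nodes
```

In this model, every vBucket that had a copy on the failed node is left with one copy fewer, and the surviving nodes each serve more active vBuckets. Both conditions persist until a rebalance is performed.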
To address this problem, once a node has been failed over, you should perform a rebalance as soon as possible. During a rebalance after a failover:
Data is redistributed between the nodes in the cluster
Replica vBuckets are recreated and enabled
Rebalancing should therefore take place as soon as possible after a failover to ensure that the health and performance of your cluster are maintained.
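The failover-then-rebalance sequence can be sketched as two REST requests. This is a minimal sketch under stated assumptions: it assumes the `/controller/failOver` endpoint (with an `otpNode` field) and the `/controller/rebalance` endpoint (with `knownNodes` and `ejectedNodes` fields) as provided by the Couchbase REST API; verify both against the REST reference for your version. The helper names, addresses, and node names are hypothetical, and the choice to list the failed node in `ejectedNodes` is an assumption. The requests are only built, not sent, and authentication is omitted.

```python
from urllib.parse import urlencode
from urllib.request import Request


def failover_request(base_url, otp_node):
    """Build (but do not send) a request that fails over one node."""
    body = urlencode({"otpNode": otp_node}).encode("ascii")
    return Request(base_url + "/controller/failOver", data=body, method="POST")


def rebalance_request(base_url, known_nodes, ejected_nodes):
    """Build (but do not send) a request that starts a rebalance."""
    body = urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(ejected_nodes),
    }).encode("ascii")
    return Request(base_url + "/controller/rebalance", data=body, method="POST")


# Hypothetical addresses and node names, for illustration only.
BASE = "http://10.0.0.1:8091"
FAILED = "ns_1@10.0.0.2"
ALL_NODES = ["ns_1@10.0.0.1", "ns_1@10.0.0.2", "ns_1@10.0.0.3"]

# Step 1: fail over the unresponsive node.
step1 = failover_request(BASE, FAILED)
# Step 2: rebalance, ejecting the failed node (assumed node lists).
step2 = rebalance_request(BASE, ALL_NODES, [FAILED])
```

To send either request, pass it to `urllib.request.urlopen` after adding an administrator credential header. Keeping request construction separate from sending makes the sequence easy to inspect before it is run against a live cluster.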
Failover should be used on a node that has become unresponsive or that cannot be reached due to a network or other issue. If you need to remove a node for administration purposes, you should use the remove and rebalance functionality instead. This will ensure that replicas and data remain intact. See Section 5.3.2, “Performing a Rebalance”.
Using failover on a live node (instead of using remove/rebalance) may introduce a small data-loss window as any data that has not yet been replicated may be lost when the failover takes place. You can still recover the data, but it will not be immediately available.
There are a number of considerations when planning, performing or responding to a failover situation:
Automated failover is available. This will automatically mark a node as failed over if the node has been identified as unresponsive or unavailable. However, there are deliberate limitations to the automated failover feature. More information on choosing whether to use automated or manual (monitored) failover is available in Section 5.1.1, “Choosing a Failover Solution”.
For information on how to enable and monitor automatic failover, see Section 5.1.2, “Using Automatic Failover”.
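As a rough illustration of what enabling automatic failover involves, the request below targets the `/settings/autoFailover` REST endpoint with its `enabled` and `timeout` fields. Treat this as a sketch under assumptions rather than the documented procedure: the helper name, address, and 60-second timeout are hypothetical, the allowed timeout range depends on your version, and authentication is omitted. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def auto_failover_request(base_url, enabled, timeout_seconds=30):
    """Build (but do not send) a request that changes auto-failover settings."""
    fields = {"enabled": "true" if enabled else "false"}
    if enabled:
        # Seconds a node must be unresponsive before it is failed over.
        fields["timeout"] = timeout_seconds
    body = urlencode(fields).encode("ascii")
    return Request(base_url + "/settings/autoFailover", data=body, method="POST")


# Hypothetical cluster address, for illustration only.
enable = auto_failover_request("http://10.0.0.1:8091", True, timeout_seconds=60)
disable = auto_failover_request("http://10.0.0.1:8091", False)
```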
Initiating a failover, whether automatically or manually, requires additional operations to return the cluster to full operational health. More information on handling a failover situation is provided in Section 5.1.4, “Handling a Failover Situation”.
Once the issue with the failed-over node has been addressed, you can add the node back to your cluster. The steps and considerations required for this operation are provided in Section 5.1.5, “Adding Back a Failed Node”.