If a node in a cluster is unable to serve data, you can fail over that node. Failover means that Couchbase Server removes the node from the cluster and makes the replicated data on other nodes available for client requests. Because Couchbase Server replicates data within a cluster, the cluster can handle the failure of one or more nodes without affecting your ability to access the stored data. In the event of a node failure, you can manually initiate failover for the node in Web Console and then resolve the underlying issue.
Alternatively, you can configure Couchbase Server to automatically remove a failed node from the cluster and have the cluster operate in a degraded mode. If you choose this automatic option, the workload on the functioning nodes that remain in the cluster will increase. You will still need to address the node failure, return a functioning node to the cluster, and then rebalance the cluster in order for it to function as it did before the node failure.
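To make the effect of failover concrete, the following is a toy model, not Couchbase Server's actual implementation. It assumes data is split into partitions, each with one active copy and one replica on different nodes. The names partition_map, fail_over, and active_load are hypothetical and exist only for this illustration. The sketch shows replicas taking over for a failed node and the resulting increase in workload on the remaining nodes.

```python
from collections import Counter


def fail_over(partition_map, failed_node):
    """Return a new map with failed_node removed from every partition.

    partition_map maps a partition id to an ordered list of node names.
    The first entry is the active copy that serves client requests; the
    remaining entries hold replicas. (Hypothetical model for illustration.)
    """
    new_map = {}
    for partition, nodes in partition_map.items():
        # Dropping the failed node promotes the first surviving replica
        # to active. An empty list would mean the data is unavailable.
        new_map[partition] = [n for n in nodes if n != failed_node]
    return new_map


def active_load(partition_map):
    """Count how many partitions each node actively serves."""
    return Counter(nodes[0] for nodes in partition_map.values() if nodes)


# Three nodes, three partitions, one replica each (assumed layout).
cluster = {
    0: ["node_a", "node_b"],
    1: ["node_b", "node_c"],
    2: ["node_c", "node_a"],
}

before = active_load(cluster)
after_map = fail_over(cluster, "node_b")
after = active_load(after_map)

print("active partitions before:", dict(before))
print("active partitions after: ", dict(after))
```

In this model every partition stays available after node_b is failed over, but node_c now serves two partitions instead of one, and partitions 0 and 1 have no replica left until a node is added and the cluster is rebalanced.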
Whether you manually fail over a node or have Couchbase Server perform automatic failover, you should determine the underlying cause of the failure. You should then set up functioning nodes, add them to the cluster, and rebalance the cluster. Keep in mind the following guidelines on replacing or adding nodes when you deal with node failure and failover scenarios:
If the node failed due to a hardware or system failure, you should add a new replacement node to the cluster and rebalance.
If the node failed because of capacity problems in your cluster, you should replace the node and also add further nodes to meet the capacity needs.
If the node failure was transient in nature and the failed node functions once again, you can add the node back to the cluster.
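The guidelines above can be summarized as a small decision helper. This is only a sketch of the reasoning in this section: FailureCause and recovery_steps are hypothetical names, and the returned strings describe administrative actions rather than calls to any Couchbase tool.

```python
from enum import Enum


class FailureCause(Enum):
    """Failure causes distinguished by the guidelines (hypothetical enum)."""
    HARDWARE = "hardware or system failure"
    CAPACITY = "capacity problems"
    TRANSIENT = "transient failure"


def recovery_steps(cause):
    """Return the ordered recovery actions for a failed-over node."""
    if cause is FailureCause.HARDWARE:
        steps = ["add a replacement node"]
    elif cause is FailureCause.CAPACITY:
        # Replacing the node alone would recreate the original shortage.
        steps = ["add a replacement node", "add additional nodes for capacity"]
    elif cause is FailureCause.TRANSIENT:
        steps = ["add the recovered node back"]
    else:
        raise ValueError(f"unknown failure cause: {cause!r}")
    # In every case the cluster is rebalanced once the nodes are in place.
    steps.append("rebalance the cluster")
    return steps


for cause in FailureCause:
    print(cause.value, "->", recovery_steps(cause))
```

Whatever the cause, the final step is the same: the cluster is rebalanced after the nodes are added.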
Be aware that failover is a distinct operation from removing a node and rebalancing. Typically you remove a functioning node from a cluster for maintenance or other reasons; in contrast, you perform a failover for a node that does not function.
When you remove a functioning node from a cluster, you use Web Console to indicate that the node will be removed, and then you rebalance the cluster so that data requests for the node can be handled by other nodes. Since the node you want to remove still functions, it is able to handle data requests until the rebalance completes. Once the rebalance completes, the other nodes in the cluster handle data requests. There is therefore no disruption in data service and no loss of data when you remove a node and then rebalance the cluster. If you need to remove a functioning node for administration purposes, you should use the remove and rebalance functionality, not failover. See Performing a Rebalance, Adding a Node to a Cluster.
If you try to fail over a functioning node, it may result in data loss. This is because failover immediately removes the node from the cluster, and any data that has been neither replicated to other nodes nor persisted to disk may be permanently lost.
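The choice between the two operations can be sketched as a function that picks the request to send depending on whether the node still functions. This is an illustration under stated assumptions, not a definitive client: the function name plan_node_removal is hypothetical, and the REST paths (/controller/rebalance, /controller/failOver), the form fields (knownNodes, ejectedNodes, otpNode), and the ns_1@host node naming are taken from the Couchbase REST API as generally documented and should be verified against the REST reference for your server version. The code only builds the request and does not contact a cluster.

```python
from urllib.parse import urlencode


def plan_node_removal(node, known_nodes, node_is_functioning):
    """Return (path, form_body) for taking `node` out of service.

    Paths and field names are assumptions; check them against the REST
    API reference for your Couchbase Server version before use.
    """
    if node_is_functioning:
        # Healthy node: eject it as part of a rebalance, so it keeps
        # serving requests until its data has moved to the other nodes.
        body = {"knownNodes": ",".join(known_nodes), "ejectedNodes": node}
        return "/controller/rebalance", urlencode(body)
    # Failed node: fail it over so that replicas on other nodes take over.
    return "/controller/failOver", urlencode({"otpNode": node})


# Hypothetical node names for illustration.
nodes = ["ns_1@10.0.0.1", "ns_1@10.0.0.2"]

healthy_path, healthy_body = plan_node_removal(nodes[1], nodes, True)
failed_path, failed_body = plan_node_removal(nodes[1], nodes, False)

print(healthy_path, healthy_body)
print(failed_path, failed_body)
```

Keeping the decision in one place makes it harder to fail over a healthy node by accident, which is the case that risks data loss.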
For more information about performing failover see the following resources:
Automated failover marks a node as failed over if the node has been identified as unresponsive or unavailable. The automated failover feature has some deliberate limitations. For more information on choosing between automated and manual failover, see Section 5.5.1, “Choosing a Failover Solution”.
For information on how to enable and monitor automatic failover, see Section 5.5.2, “Using Automatic Failover”.
Initiating a failover. Whether you use automatic or manual failover, you need to perform additional steps to bring the cluster back into a fully functioning state. Information on handling a failover is in Section 5.5.4, “Handling a Failover Situation”.
Adding nodes after failover. After you resolve the issue with the failed over node you can add the node back to your cluster. Information about this process is in Section 5.5.5, “Adding Back a Failed Over Node”.