Help me understand Replicas vs Active content in a cluster

I have inherited an SSO system that uses Couchbase as its storage. It was set up with 2 web frontends, each with a Couchbase instance. These instances are clustered and replication is working.

We’re experiencing problems when rebooting these servers, and I think it is related to how Couchbase treats the replicated data vs. our assumptions about how it treats it. The behavior we want is Active-Active HA: regardless of which cluster member is accessed, the query should succeed even if the data originated on a different server.

What appears to be happening is that the application queries one server after another until it finds the data. If a server is offline, the query times out and causes a problem. If I configure the application to point only to its local CB instance, it doesn’t see the data from the other server.

So, my question is… Will a server with replicated data respond to queries for that data as though it was in the “active” category?

Hi @CaptAwesome, can you clarify some details of your system? When you say “each with a couchbase instance”, do you mean two separate Couchbase clusters? Or two nodes within the same cluster?

Usually in Couchbase when we talk about “replicas” we mean data that is replicated between nodes within a single cluster, so that if one node goes down the replica data on another node will be promoted and made available.
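As a toy sketch of that promotion, the Python below models a vBucket map being rewritten on failover. The node names and the tiny four-vBucket layout are invented for illustration; a real bucket is sharded into 1024 vBuckets, and until failover runs, the replica copies exist on the other node but are not served.

```python
# Toy model of replica promotion within one Couchbase cluster.
# vbucket id -> (active node, [replica nodes]); layout is made up.
vbucket_map = {
    0: ("node1", ["node2"]),
    1: ("node1", ["node2"]),
    2: ("node2", ["node1"]),
    3: ("node2", ["node1"]),
}

def fail_over(vmap, failed_node):
    """Promote the first surviving replica wherever the failed node was active."""
    new_map = {}
    for vb, (active, replicas) in vmap.items():
        survivors = [r for r in replicas if r != failed_node]
        if active == failed_node:
            if not survivors:
                raise RuntimeError(f"vBucket {vb} lost: no replica to promote")
            # The replica becomes the new active copy for this vBucket.
            new_map[vb] = (survivors[0], survivors[1:])
        else:
            # Active copy survives; just drop the failed node from replicas.
            new_map[vb] = (active, survivors)
    return new_map

after = fail_over(vbucket_map, "node2")
print(after[2][0])  # node1 now serves vBucket 2
```

The key point the model shows: before `fail_over` runs, queries for vBuckets 2 and 3 still target the downed node, which is why replica data is not served automatically without failover.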

Alternately, sometimes HA configurations are implemented using independent Couchbase clusters connected to each other via XDCR.

Yeah… Sorry. I was using my words carelessly. We have one cluster of two nodes (for this). When one node goes down, the data isn’t promoted automatically.

I’ve been looking into the auto-failover which the previous admin didn’t use. It seems like it would be the answer but requires 3 servers and I only have 2 at the moment. I’m also concerned about how long it takes. When I failover my dev servers manually, it takes close to a minute. That’s for boxes doing almost no work.

Is there a configuration where all content, regardless of which server it originated on, is active on all cluster members at all times (replication delays notwithstanding)?

Failover options have improved over time; which version of Couchbase are you using? With the latest versions you can get fast failover in only a few seconds (after which you will need to manually rebalance once the failed server is back up, or replaced with a good server).
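If you do enable it, auto-failover is configured with `couchbase-cli`. The host, credentials, and 30-second timeout below are placeholders, and the minimum allowed timeout depends on your version, so check `couchbase-cli setting-autofailover --help` for the flags yours supports:

```shell
# Enable auto-failover with a 30-second detection timeout.
# Host and credentials are placeholders for your cluster.
couchbase-cli setting-autofailover -c 192.168.1.10:8091 \
  -u Administrator -p password \
  --enable-auto-failover 1 \
  --auto-failover-timeout 30
```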

The requirement for 3 nodes in the cluster is to avoid “split-brain”, where a network partition blocking communication between the nodes causes each one to think the other has failed, so both continue operating as if they were the only survivor.
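A rough way to see why 2 nodes can’t do this safely: auto-failover needs a majority of nodes to agree that a node is down. This is plain majority arithmetic, not Couchbase’s actual implementation:

```python
def quorum(cluster_size):
    """Minimum number of nodes that must agree before failing another node over."""
    return cluster_size // 2 + 1

# With 2 nodes, each side of a partition has 1 node, below the quorum of 2,
# so neither side can safely declare the other dead.
print(quorum(2))  # 2
# With 3 nodes, a partition of 2 can outvote the isolated node.
print(quorum(3))  # 2
```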

You mentioned “issues with rebooting these servers”. Can you say more? Is someone manually rebooting the servers for other reasons, or is the server rebooting due to some kind of fault?

I have a question along similar lines about replicas and failovers. Suppose we have 2 replicas for data and indexes in a 3-node cluster, with all nodes handling data and indexes. If one node goes down, does the cluster activate the available replicas on the remaining nodes and fully serve reads/writes (assuming two nodes can handle the load)? Or do we need a minimum number of nodes to satisfy the bucket’s num_replicas setting while one node is out? Also, what would the implications be if two nodes went down (other than performance)?

Should we just use one replica instead? I know the best answer is to try it and see for myself, and I would like to, but first I want to understand the basics correctly. Last time I tried two replicas while doing maintenance on each node, and removing a single node threw some warnings about not enough replicas being possible with one node out. I am not sure what that really meant. Appreciate your help in advance.

Hi @chetan.

I’m assuming that your 3-node cluster has the data service running on all three nodes. In that case, the data is evenly distributed across the 3 nodes, with each node responsible for 1/3 of the workload.

If a node goes down and is failed over, the data on that node will have replicas on other nodes; those replicas are promoted, and the cluster will continue to serve reads and writes.

You should choose the number of replicas based on your tolerance for failure. With one replica, you can tolerate the failure of any one node, and all the data will still be available. However, if you have one replica and lose two nodes, some of your data will be unavailable. With two replicas, all your data is still available even if you lose 2 nodes.
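That tolerance rule can be summed up in one line. This is a simplification that ignores rebalance timing and assumes each failure happens before replicas are rebuilt:

```python
def fully_available(num_replicas, failed_nodes):
    """All data remains available iff failures don't exceed the replica count."""
    return failed_nodes <= num_replicas

print(fully_available(1, 1))  # True: one replica tolerates one failed node
print(fully_available(1, 2))  # False: some data has no surviving copy
print(fully_available(2, 2))  # True: two replicas tolerate two failed nodes
```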

Of course, with two replicas on a 3-node cluster, as soon as you take out one node, you no longer have 2 replicas, so you will get warnings.

If you are confident that you will only ever lose one node at a time, then one replica should be sufficient.

Does that answer your questions?

Yes, it does, thanks for the prompt response.

One question though: do I have to run a rebalance to promote the replicas, or are they promoted automatically as part of failover?

Replicas are promoted as part of failover. See the documentation for more details on failover.

Promoting the replicas makes the data available, but with a failed node there may not be a replica of that data anymore.

Rebalance is the process of moving data around so that it is balanced across a cluster’s nodes, and that it is all replicated per the settings.
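For reference, a manual failover followed by a rebalance might look like this with `couchbase-cli`. The host addresses and credentials are placeholders, and flag names can vary between versions, so check `couchbase-cli --help` on your install:

```shell
# Fail over the downed node (placeholders for real hosts/credentials).
couchbase-cli failover -c 192.168.1.10:8091 -u Administrator -p password \
  --server-failover 192.168.1.11:8091

# After the node is recovered or replaced, redistribute data and replicas.
couchbase-cli rebalance -c 192.168.1.10:8091 -u Administrator -p password
```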
