Couchbase query fails when one data node goes down

rajib761 · August 30, 2021, 7:13pm

We have a 5-node cluster, three of which are data nodes. When we connect to the cluster, we list all three data nodes. My understanding is that once connected, the SDK builds the cluster map, so even if one node goes down later, my query will still return the data. That is not what happened. Here is what we saw:

1. We brought down one data node. Auto-failover kicked in. We also saw a rebalance kick in, which I was not expecting, because auto-rebalance was turned off.
2. We fired a query against the cluster, connecting with all three data nodes (including the one we brought down in step 1).
3. We ran the test twice:
   a. In one run, the query returned the data, but only after waiting for the rebalance to happen (our understanding was that rebalance would not start, since we had turned off auto-rebalancing).
   b. In the second run, the query failed with an error saying that the cluster refused the connection.

This is contrary to my understanding of how Couchbase works. My understanding was that

1. Rebalance should not happen automatically.
2. In both cases I should have got the query results back, since the cluster map knows which nodes are healthy.

Did something change in Couchbase?

rajib761 · August 30, 2021, 10:12pm

I found the reason behind this. The cluster map is refreshed every 2.5 seconds, so when the node went down it was probably a timing issue: the queries failed while the cluster map had not yet been updated. As for the rebalance, it did not actually happen. What happened was auto-failover, but because the popup header said "rebalance", we mistook it for a rebalance. The solution is to use getFromReplica for KV operations; for N1QL, we need a retry framework.
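For anyone landing here later, the retry idea for N1QL could be sketched roughly as below. This is a minimal sketch, not SDK code: `run_with_retry`, `TransientQueryError`, and the `run_query` callable are hypothetical names standing in for whatever your SDK call and its retryable error types are, and the delays are arbitrary. It only makes sense for queries that are safe to repeat (e.g. a plain SELECT).

```python
import time


class TransientQueryError(Exception):
    """Hypothetical stand-in for an SDK error that is safe to retry,
    e.g. a refused connection while the cluster map is still stale."""


def run_with_retry(run_query, max_attempts=5, base_delay=0.5,
                   max_delay=4.0, sleep=time.sleep):
    """Call run_query() until it succeeds or max_attempts is reached.

    run_query is a zero-argument callable (hypothetical) wrapping the
    actual query call. Waits between attempts grow exponentially from
    base_delay and are capped at max_delay.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except TransientQueryError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

With the defaults the waits are 0.5s, 1s, 2s and 4s, which in total is longer than the 2.5 second cluster map refresh mentioned above, so a query that failed only because of a stale map gets a chance to succeed on a later attempt.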

ingenthr · August 31, 2021, 12:28am

Indeed, failover is different from rebalance. However, if you have a sufficient number of replicas, you should be back to full availability within seconds of failover. This is something we test all the time. You may have fewer replicas afterward, though, so you need to rebalance to bring back some redundancy, or add a repaired/new node and then rebalance.

You can use getFromReplica if you are okay with the idea that in transient failure situations you may get an older copy of the data, yes. See the discussion of this in the docs. I should also say that with N1QL, SDK 3.x and later will automatically retry if it is safe to do so. You don't indicate which SDK you're using, but all of the modern SDKs retry by default until the timeout, with the ability to change the behavior to best effort if you see fit.
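The replica trade-off described above can be illustrated with a small sketch. It is not tied to any particular SDK: `get_active` and `get_replica` are hypothetical callables wrapping the SDK's normal read and its replica read, and `ActiveUnavailableError` is a placeholder for whatever error your SDK raises when the active copy cannot be reached. The point is that the caller is told when the value came from a replica and may therefore be stale.

```python
class ActiveUnavailableError(Exception):
    """Hypothetical stand-in for the error raised when the node
    holding the active copy of a document cannot be reached."""


def get_with_replica_fallback(key, get_active, get_replica):
    """Read key from the active copy, falling back to a replica.

    Returns (value, from_replica). When from_replica is True the value
    may be older than the latest write, so the caller has to decide
    whether that is acceptable for this read.
    """
    try:
        return get_active(key), False
    except ActiveUnavailableError:
        # Transient failure: accept a possibly stale copy.
        return get_replica(key), True
```

Returning the flag alongside the value keeps the staleness decision with the caller instead of hiding it inside the helper.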
