Graceful failover takes a lot of time

jacek.golnik · April 23, 2018, 7:42am

Greetings,

I’d like to use graceful failover in my Couchbase clusters (Community Edition 5.0.1 build 5003), but I noticed that it takes a lot of time, and the time depends heavily on the data size. I have two database clusters, both with 3 nodes. One has 137k documents in 2 buckets and the other has 910k documents (2 buckets with 900k and 10k documents). Every bucket has 1 replica.
For the smaller database graceful failover takes about 1.5 minutes, but for the larger database it takes about 9.5 minutes. This is the time from the start of graceful failover (calling the endpoint /controller/startGracefulFailover) to the moment when the node is no longer an active member of the cluster (it doesn’t manage any vBuckets).
When I want to rejoin my node to the cluster, I have to run the rebalance process (endpoint /controller/rebalance), of course after defining the recovery type for that node (I use delta recovery). That also takes a lot of time: about 4 minutes for the smaller cluster and about 8 minutes for the larger one.
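
For reference, the sequence of calls described above can be sketched roughly as follows. This is only an illustration and not my exact script: the base address, the node names and the helper functions are hypothetical, and the /controller/setRecoveryType endpoint with the otpNode, recoveryType, knownNodes and ejectedNodes parameters is what I understand the REST API to expect, so check it against the documentation for your version. The sketch only builds the requests and does not send them (authentication is left out).

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed admin address of any cluster node (hypothetical value).
BASE = "http://127.0.0.1:8091"


def build_request(path, params):
    """Build (but do not send) a form-encoded POST request."""
    data = urlencode(params).encode("ascii")
    return Request(BASE + path, data=data, method="POST")


def maintenance_requests(node, known_nodes):
    """Requests for one node: graceful failover, delta recovery, rebalance.

    In practice you wait for the failover to finish and do the maintenance
    before issuing the second and third request.
    """
    return [
        build_request("/controller/startGracefulFailover", {"otpNode": node}),
        build_request(
            "/controller/setRecoveryType",
            {"otpNode": node, "recoveryType": "delta"},
        ),
        build_request(
            "/controller/rebalance",
            {"knownNodes": ",".join(known_nodes), "ejectedNodes": ""},
        ),
    ]


if __name__ == "__main__":
    # Hypothetical node names for a 3-node cluster.
    nodes = ["ns_1@10.0.0.1", "ns_1@10.0.0.2", "ns_1@10.0.0.3"]
    for req in maintenance_requests(nodes[0], nodes):
        print(req.get_method(), req.full_url, req.data.decode("ascii"))
```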

So if I want to use graceful failover for planned maintenance of my cluster, I have to execute that procedure on every node, one after another. For the larger cluster (3 nodes) that adds up to about 52 minutes, not counting the time spent on the maintenance itself.
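
The estimate is just the per-node failover and rebalance times from above multiplied by the number of nodes. A small sketch of the arithmetic (the variable names are only for illustration):

```python
# Measured times for the larger cluster, in minutes (from the post above).
failover_min = 9.5
rebalance_min = 8.0
nodes = 3

# Each node is failed over and then rebalanced back in, one node at a time.
total_min = (failover_min + rebalance_min) * nodes
print(total_min)  # 52.5
```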

The Couchbase documentation (https://developer.couchbase.com/documentation/server/5.0/clustersetup/setup-failover-graceful.html) suggests that the process should not be time-consuming: “You do not have enough time to do a full removal and rebalancing of the node, and then add it back and rebalance again. Graceful failover saves the day!”

Is it possible to reduce those times? Is my procedure correct?
I wonder how long the process will take when the amount of data in the database increases 10 times. Will the time increase significantly?
Is the rebalance process limited to the vBuckets that belonged to the node that was gracefully failed over?

Thanks in advance

sebarys · April 26, 2018, 11:28am

Is anyone facing the same issue, or does anyone have any experience with graceful failover?

kamkor · May 10, 2018, 9:16am

Has anyone experienced similar problems? The documentation says that this process should be fast, but it takes a long time.

“Using graceful failover to remove the node and then Delta Node Recovery to add it back when the maintenance is complete is very quick, easy, and fairly non-intrusive.” - Couchbase SDKs

We have experienced the opposite of “very quick, easy, and fairly non-intrusive”. Either we are doing something really wrong or the HA aspects of Couchbase are really disappointing.

jacek.golnik · May 15, 2018, 12:10pm

Is there any difference in graceful failover functionality between the Community and Enterprise editions?

neilhoughton · August 2, 2018, 7:41am

Greetings,

I’m experiencing the same issue. My Couchbase version is 3.0.3 and I have a 5-node cluster.

The only thing I have noticed in the logs is that it appears to be running the XDCR processes constantly.

This makes it impossible to perform server patching during the time window.

Does anyone have any ideas about what is going on?
