Graceful failover takes a lot of time

Greetings,

I’d like to use graceful failover in my Couchbase clusters (Community Edition 5.0.1 build 5003), but I noticed that it takes a lot of time, and the time depends heavily on the data size. I have two database clusters, both with 3 nodes: one with 137k documents in 2 buckets, and a second one with 910k documents (2 buckets with 900k and 10k documents). Every bucket has 1 replica.
For the smaller database, graceful failover takes about 1.5 minutes, but for the larger database it takes about 9.5 minutes. This is the time from the start of graceful failover (calling the endpoint /controller/startGracefulFailover) to the moment when the node is no longer an active member of the cluster (it doesn’t manage any vBuckets).
When I want to rejoin my node to the cluster, I have to run the rebalance process (endpoint /controller/rebalance), after first defining the recovery type for that node (I use delta recovery). This also takes a lot of time: about 4 minutes for the smaller cluster and about 8 minutes for the larger one.
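For anyone who wants to compare against the same procedure, here is a rough sketch of the three REST calls in the order described above. It only builds the requests and does not send them. The host, node names, and credentials are placeholders, and the /controller/setRecoveryType step with otpNode and recoveryType=delta is my assumption about how the recovery type gets defined, since the post only names the other two endpoints.

```python
import base64
import urllib.parse
import urllib.request


def build_post(base_url, path, params, user, password):
    """Build (but do not send) a form-encoded POST with basic auth."""
    data = urllib.parse.urlencode(params).encode("ascii")
    req = urllib.request.Request(base_url + path, data=data, method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req


def maintenance_requests(base_url, otp_node, known_nodes, user, password):
    """Return the requests for one node's maintenance cycle, in order."""
    return [
        # 1. Gracefully fail the node over before maintenance.
        build_post(base_url, "/controller/startGracefulFailover",
                   {"otpNode": otp_node}, user, password),
        # 2. After maintenance, mark the node for delta recovery
        #    (assumed endpoint and parameters, see the note above).
        build_post(base_url, "/controller/setRecoveryType",
                   {"otpNode": otp_node, "recoveryType": "delta"},
                   user, password),
        # 3. Rebalance so the node becomes an active member again.
        build_post(base_url, "/controller/rebalance",
                   {"knownNodes": ",".join(known_nodes), "ejectedNodes": ""},
                   user, password),
    ]


if __name__ == "__main__":
    # Placeholder values; replace with a real cluster's host and node names.
    nodes = ["ns_1@node1", "ns_1@node2", "ns_1@node3"]
    for req in maintenance_requests("http://node1:8091", nodes[0], nodes,
                                    "Administrator", "password"):
        print(req.get_method(), req.full_url, req.data.decode())
```

To actually run a step, each request would be passed to urllib.request.urlopen, waiting for the previous step to finish before sending the next one.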

So if I want to use graceful failover for planned maintenance of my cluster, I have to run that procedure on every node. For the larger cluster (with 3 nodes) that adds up to about 52 minutes, not counting the time spent on the maintenance itself.
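The total above is just the per-node failover and rebalance times summed over all nodes. A small sketch of that arithmetic, using the timings quoted in this post (the function name is only for illustration):

```python
def total_maintenance_minutes(failover_min, rebalance_min, nodes):
    """Time spent on failover plus rebalance across all nodes, done one by one."""
    return (failover_min + rebalance_min) * nodes


# Larger cluster: 9.5 min failover + 8 min rebalance per node, 3 nodes.
large = total_maintenance_minutes(9.5, 8, 3)
# Smaller cluster: 1.5 min failover + 4 min rebalance per node, 3 nodes.
small = total_maintenance_minutes(1.5, 4, 3)

print(large)  # 52.5
print(small)  # 16.5
```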

The Couchbase documentation (https://developer.couchbase.com/documentation/server/5.0/clustersetup/setup-failover-graceful.html) suggests that the process should not be time-consuming: “You do not have enough time to do a full removal and rebalancing of the node, and then add it back and rebalance again. Graceful failover saves the day!”

Is it possible to reduce those times? Is my procedure correct?
I wonder how long the process will take if the amount of data in the database grows tenfold. Will the time increase significantly?
Is the rebalance process limited to the vBuckets that belonged to the node that was gracefully failed over?

Thanks in advance

Is anyone facing the same issue, or does anyone have experience with graceful failover?

Has anyone experienced similar problems? The documentation says that this process should be fast, but it takes a long time.

“Using graceful failover to remove the node and then Delta Node Recovery to add it back when the maintenance is complete is very quick, easy, and fairly non-intrusive.” (Couchbase SDKs)

We have experienced the opposite of “very quick, easy, and fairly non-intrusive”. Either we are doing something really wrong, or the HA aspects of Couchbase are really disappointing. :frowning:

Is there any difference in graceful failover functionality between the Community and Enterprise editions?

Greetings,

I’m experiencing the same issue. My Couchbase version is 3.0.3 and I have a 5-node cluster.

The only thing I have noticed in the logs is that the XDCR processes appear to be running constantly.

This makes it impossible to perform server patching within the maintenance window.

Does anyone have any further ideas about what is going on?