Greetings,
I’d like to use graceful failover in my Couchbase clusters (Community Edition 5.0.1 build 5003), but I noticed that it takes a lot of time, and the duration depends heavily on the data size. I have two database clusters, each with 3 nodes: one with 137k documents in 2 buckets, and a second with 910k documents (2 buckets with 900k and 10k documents). Every bucket has 1 replica.
For the smaller database, graceful failover takes about 1.5 minutes, but for the larger database it takes about 9.5 minutes. This is measured from the start of graceful failover (calling the endpoint /controller/startGracefulFailover) until the node is no longer an active member of the cluster (it no longer manages any vBuckets).
When I want to rejoin the node to the cluster, I have to run a rebalance (endpoint /controller/rebalance), after first setting the recovery type for that node (I use delta recovery). This also takes a lot of time: about 4 minutes for the smaller cluster and about 8 minutes for the larger one.
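The per-node sequence described above can be sketched roughly as follows. This is only an illustration under assumptions, not necessarily the exact calls I make: the /controller/setRecoveryType path and the form field names (otpNode, recoveryType, knownNodes, ejectedNodes) are what I understand the REST API to expect, and the helper names, node names, and credentials are hypothetical.

```python
import base64
import urllib.request
from urllib.parse import urlencode


def maintenance_steps(node, known_nodes):
    """Return (path, form fields) for one node's failover and recovery cycle."""
    return [
        # 1. Gracefully fail the node over; maintenance happens once this completes.
        ("/controller/startGracefulFailover", {"otpNode": node}),
        # 2. Mark the node for delta recovery (assumed endpoint and field names).
        ("/controller/setRecoveryType", {"otpNode": node, "recoveryType": "delta"}),
        # 3. Rebalance with the full node list and nothing ejected.
        ("/controller/rebalance",
         {"knownNodes": ",".join(known_nodes), "ejectedNodes": ""}),
    ]


def post(base_url, path, fields, user, password):
    """POST form fields to the cluster REST API using HTTP basic auth."""
    request = urllib.request.Request(
        base_url + path, data=urlencode(fields).encode(), method="POST"
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Each call only starts its operation, so a real script would need to wait for the failover (and later the rebalance) to finish before issuing the next request.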
So if I want to use graceful failover for planned maintenance of my cluster, I have to run this procedure on every node. For the larger cluster (3 nodes) that adds up to about 52 minutes, not counting the time spent on the maintenance itself.
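The total comes from simple arithmetic on the timings above. A small sketch, assuming the nodes are handled one at a time and each node takes the same time as the one I measured (the function name is just for illustration):

```python
def total_maintenance_minutes(failover_min, rebalance_min, node_count):
    """Total time when every node is failed over and recovered in sequence."""
    return node_count * (failover_min + rebalance_min)


# Larger cluster: 9.5 min failover + 8 min rebalance per node, 3 nodes.
larger = total_maintenance_minutes(9.5, 8, 3)
# Smaller cluster: 1.5 min failover + 4 min rebalance per node, 3 nodes.
smaller = total_maintenance_minutes(1.5, 4, 3)
print(larger, smaller)
```

This gives 52.5 minutes for the larger cluster and 16.5 minutes for the smaller one.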
The Couchbase documentation (https://developer.couchbase.com/documentation/server/5.0/clustersetup/setup-failover-graceful.html) suggests that the process should not be time-consuming: “You do not have enough time to do a full removal and rebalancing of the node, and then add it back and rebalance again. Graceful failover saves the day!”
Is it possible to reduce those times? Is my procedure correct?
I wonder how long the process will take if the amount of data in the database grows tenfold. Will the time increase significantly?
Is the rebalance process limited to the vBuckets that belonged to the node that was gracefully failed over?
Thanks in advance