Couchbase Rebalance large Cluster

abhi271 · May 5, 2020, 2:06pm

Hi, we are running Couchbase as a KV store and storing around 43TB of data in a 16-node cluster, with replication factor 1.
Whenever we need to scale up or down we have to rebalance, and rebalancing is becoming painful for us.
Once we start rebalancing, we suffer downtime of around 2 days. Our clients are not able to read from or write to CB.
Is there any way to avoid this?
Thanks

shivani_g · May 8, 2020, 3:23am

Hi,
Rebalance is a completely online operation. It could take long, but there should be no downtime. Can you explain what you mean by ‘downtime of around 2 days’?
Also, 5TB of data per node (including the replica data) is over what we currently recommend. How much RAM do you have allocated on each node?

abhi271 · May 11, 2020, 2:35am

Hi Shivani_g,
We are using Node of 32 GB RAM with 25GB allocated on each node.
A full rebalance takes approx. 2-2.5 days. CB does not allow reads or writes (which leads to application downtime) until rebalancing is 100% complete on at least one node. As soon as the first node is 100% rebalanced, reads and writes on CB resume while rebalancing continues on the other nodes.

Is there a specific config for rebalancing without downtime?

abhi271 · May 26, 2020, 6:37am

@shivani_g, any update on this?

shivani_g · May 26, 2020, 5:09pm

Based on the details you have provided, it seems your memory-to-data ratio is << 1%.
Couchbase recommends a memory-to-data ratio of at least 10% for operational (including rebalance) stability.
Do you see errors during rebalance? If you do, can you share the errors you see?

abhi271 · June 25, 2020, 11:52am

@shivani_g, is there any doc related to this? Also, is this a default feature in the community version as well, or do we need to configure it?
Thanks

shivani_g · June 25, 2020, 1:59pm

Rebalance is a community feature and there is no need to configure it. It is automatically configured. One rule of thumb to follow for a stable rebalance is that the ratio of memory to data on disk should be > 10%. If you go lower than that you can run into issues.

abhi271 · August 11, 2020, 11:19am

Hi @shivani_g, can you give an example of “the ratio of memory to data on disk should be > 10%”? I didn’t get this. Thanks

shivani_g · August 11, 2020, 2:11pm

E.g. if you have 32GB of RAM on a node allocated to the bucket, you should not be storing more than 320GB of data on that node (this includes replica data as well). Staying at or below 320GB ensures that you do not go below a 10% memory-to-disk ratio.

In your case, you have 43TB of data in 16 nodes, which means around 2.7TB of data per node. Each node has 25GB allocated to the bucket as per your comment. That is a memory-to-disk ratio of < 1%, which can cause rebalance instability, long rebalance durations, and significant impact on the front-end workload, as you are seeing.

You can either add more nodes to your cluster or use nodes with more RAM - 256GB RAM at least. Preferably 512GB RAM per node if you are going to use the same number of nodes.
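
The arithmetic behind the 10% rule of thumb can be sketched in a few lines of Python. This is only an illustrative calculation, not a Couchbase tool or API: the function names are made up for this example, 1TB is assumed to be 1000GB, and the 43TB figure is assumed to already include replica data (if it does not, the data size doubles with one replica).

```python
import math

# Rule of thumb from this thread: bucket RAM should be at least 10% of
# the data on disk per node (replica data included).
MIN_RATIO_PERCENT = 10


def memory_to_data_percent(bucket_ram_gb, data_per_node_gb):
    """Memory-to-data ratio for one node, as a percentage."""
    return bucket_ram_gb * 100 / data_per_node_gb


def max_data_per_node_gb(bucket_ram_gb, min_percent=MIN_RATIO_PERCENT):
    """Largest data size per node that keeps the ratio at or above min_percent."""
    return bucket_ram_gb * 100 / min_percent


def nodes_needed(total_data_gb, bucket_ram_gb, min_percent=MIN_RATIO_PERCENT):
    """Nodes required to hold total_data_gb at the given bucket RAM per node."""
    return math.ceil(total_data_gb * min_percent / (100 * bucket_ram_gb))


def ram_per_node_needed_gb(total_data_gb, node_count, min_percent=MIN_RATIO_PERCENT):
    """Bucket RAM per node required to keep the ratio with a fixed node count."""
    return total_data_gb * min_percent / (100 * node_count)


if __name__ == "__main__":
    total_gb = 43 * 1000        # 43TB, assuming 1TB = 1000GB
    nodes = 16
    bucket_ram_gb = 25          # bucket quota per node, from the thread

    per_node_gb = total_gb / nodes
    print("data per node (GB):", per_node_gb)
    print("current ratio (%):", memory_to_data_percent(bucket_ram_gb, per_node_gb))
    print("max data per node at 32GB RAM (GB):", max_data_per_node_gb(32))
    print("nodes needed at 25GB per node:", nodes_needed(total_gb, bucket_ram_gb))
    print("RAM per node needed with 16 nodes (GB):",
          ram_per_node_needed_gb(total_gb, nodes))
```

Under these assumptions the current ratio comes out just under 1%, and keeping 16 nodes would call for roughly 270GB of bucket RAM per node, which is in line with the 256GB to 512GB range suggested above.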

abhi271 · August 11, 2020, 4:37pm

Thanks @shivani_g for the clarification. Is there any official documentation regarding the 10% rule you mentioned? If you can share it, that would be great, so that I can take this forward in my implementation.

shivani_g · August 11, 2020, 4:58pm

There is no documentation around it currently. However, it is a Best Practice that we provide during Sizing engagements.
