Replace a node, and rebalance

edwardzhong · March 24, 2017, 7:53pm

Hi,

We are planning to replace a node with one that has more capacity. The replica count is set to 1.
We are wondering whether it is OK to remove the old node, add the new node, and then rebalance everything.

Is there any risk of losing data if we choose to do so?

Please advise.

best,
edward

nraboy · March 27, 2017, 5:32pm

Hey @edwardzhong,

You shouldn’t lose any data when you bring a node offline if your cluster has replication. However, if your resources are at capacity, you risk your cluster breaking under too much load. To be clear, I mean that if your cluster is already at, say, 80% load and you lose a node, the remaining nodes may not be able to handle the load.

You might want to consider bringing the new node online first, rebalancing, removing the old node, and then rebalancing again.

There are probably many other strategies you could use, but that is what I’d do.

Best,

edwardzhong · March 29, 2017, 5:59pm

@nraboy Thank you so much for your advice. I have now added a new node and started the rebalance, purely to make sure we have enough capacity. However, the rebalance has been running for about 48 hours: the old nodes are at 92-96% completion and the newly added node has reached 86%. It no longer seems to be moving forward and appears to be stuck.
Swap and CPU usage are in the single-digit percentages, and RAM usage is around 30-76%, so we should have enough resources. Any idea why this is running so slowly?

One thing I noticed is that there are warnings about one bucket: "Hard Out Of Memory Error. Bucket “xxx” on node 172.19.62.14 is full. All memory allocated to this bucket is used for metadata."
Does this matter during the rebalance?

Any input would be appreciated.

best,
edward

nraboy · March 29, 2017, 10:19pm

That is a new one for me. @don have you run into this before?

edwardzhong · March 29, 2017, 10:48pm

One clarification: I don’t think the “Hard Out of Memory Error” is caused by the rebalance. I intentionally reduced the RAM for this bucket so that we have more RAM to serve a higher-priority bucket.
So my question is whether this change would have any impact on the rebalance, or whether it is totally unrelated.

perry · March 31, 2017, 11:52am

Hi @edwardzhong, that “hard out of memory error” may not be caused by the rebalance, but it is likely causing the rebalance to pause or seem hung.

“Hard out of memory” indicates that write traffic is being rejected because there is not enough RAM to allocate for new incoming data. This applies to both your application traffic and the rebalance, so it’s very likely that the rebalance can’t proceed with moving data for that bucket.

There are a few options here, but in general I would recommend increasing the RAM quota for that bucket and seeing whether that allows the rebalance to proceed. After that, you may want to re-evaluate your overall sizing for this cluster to make sure that you have enough RAM to support what you’re trying to do.

Remember that Couchbase has a very tightly managed caching layer of its own; it doesn’t rely on the filesystem buffer cache. I believe this may explain why you are seeing relatively low RAM usage. If you have set the bucket quota to a certain amount (as you indicated you had), then Couchbase will only use that much RAM for caching that particular dataset, even if you have much more RAM available in the system as a whole. For example, if your node has 16GB of RAM available but your buckets are set to use only 1GB, then that’s all that Couchbase will use, and it may seem like the rest of the system resources aren’t being utilized. They’re not. Thankfully, you can both raise and lower the RAM quota for each bucket dynamically without restarting or affecting the system in any way; Couchbase will just start using more or less RAM for that bucket.
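
To put numbers on the 16GB / 1GB example, here is a small Python sketch. The function names and the bucket name are made up for illustration. The `/pools/default/buckets/<name>` path and the `ramQuotaMB` form field reflect my understanding of the Couchbase REST API for editing a bucket, so treat them as assumptions and check the documentation for your version. Nothing is sent to a server; the second function only builds the request.

```python
def cache_fraction(node_ram_mb, bucket_quotas_mb):
    """Fraction of a node's RAM that the bucket quotas allow to be used for caching."""
    return sum(bucket_quotas_mb.values()) / node_ram_mb


def bucket_quota_request(bucket, ram_quota_mb):
    """Build (path, form fields) for a quota change. Assumed endpoint; nothing is sent."""
    if ram_quota_mb <= 0:
        raise ValueError("quota must be positive")
    return f"/pools/default/buckets/{bucket}", {"ramQuotaMB": str(ram_quota_mb)}


# A 16GB node whose only bucket (hypothetical name) has a 1GB quota.
fraction = cache_fraction(16 * 1024, {"example_bucket": 1024})
print(f"{fraction:.2%}")  # prints 6.25%

# What raising that bucket to 4GB might look like as a request.
print(bucket_quota_request("example_bucket", 4096))
```

So seeing the machine at only a fraction of its RAM is expected when the quotas are small, regardless of how much memory the OS reports as free.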

On a related note, your initial thought process was correct: you can add a new node, mark an old node for removal, and then perform the rebalance. Couchbase will automatically move data only between those two nodes, and you can verify this in the logs by looking for a message that “this bucket is a swap-rebalance” (or something to that effect). You won’t lose any data, the cluster will stay at the same size, and the other nodes in the cluster that are not involved in that rebalance won’t be tasked with moving any data.
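
As a rough illustration of what "add one, mark one for removal, rebalance" amounts to when scripted, the sketch below builds the form fields for a single rebalance request. It assumes the REST rebalance call takes a comma-separated `knownNodes` list (every node, including the one being removed) and an `ejectedNodes` list; the helper names and the node addresses are hypothetical, and the new node would have to be added to the cluster before this request is made.

```python
def rebalance_form(current_nodes, added_nodes=(), removed_nodes=()):
    """Build the form fields for one rebalance that both adds and ejects nodes."""
    known = list(current_nodes) + [n for n in added_nodes if n not in current_nodes]
    unknown = [n for n in removed_nodes if n not in known]
    if unknown:
        raise ValueError(f"cannot eject nodes that are not in the cluster: {unknown}")
    return {
        "knownNodes": ",".join(known),           # all nodes, including those leaving
        "ejectedNodes": ",".join(removed_nodes),  # nodes marked for removal
    }


def is_swap(added_nodes, removed_nodes):
    """A swap replaces nodes one for one: same number added as removed."""
    return len(added_nodes) > 0 and len(added_nodes) == len(removed_nodes)


# Hypothetical node names for a two-node cluster getting one replacement.
current = ["ns_1@10.0.0.1", "ns_1@10.0.0.2"]
added = ["ns_1@10.0.0.3"]
removed = ["ns_1@10.0.0.1"]

print(is_swap(added, removed))
print(rebalance_form(current, added, removed))
```

Doing the add and the removal in the same rebalance is what keeps the cluster at the same size throughout.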

Finally, you mentioned that you were looking to increase the capacity of the nodes in your cluster. You are going about it in exactly the right way; I just wanted to make sure to explain the whole process and what to be aware of. When you add nodes with more RAM to a cluster, you won’t be able to make use of that RAM right away, not until all nodes have the same minimum amount. For example, let’s say you have 3 nodes with 8GB of RAM each and you want to increase them to 16GB. You can add/remove one node at a time (or multiple at a time), but as long as there is still at least one node in the cluster with only 8GB of RAM, all the nodes will use only that much. Once all the nodes have been swapped, the cluster will STILL be using only 8GB of RAM per node. At this point, you can raise the “cluster RAM quota” (also dynamically) from ~8GB per node to ~16GB per node (depending on how much headroom you want to give the OS, etc.), and then you can raise the bucket quotas and/or add more buckets.
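
The 8GB to 16GB example can be worked through in a few lines of Python. This is only arithmetic: the function name is made up, and the 20% OS headroom is an assumed figure chosen for illustration, not a Couchbase recommendation.

```python
def max_per_node_quota_mb(node_ram_mb, os_headroom_percent=20):
    """Largest per-node quota every node can honour: limited by the smallest node."""
    smallest = min(node_ram_mb)
    return smallest * (100 - os_headroom_percent) // 100


# Midway through the swap: one 8GB node is still in the cluster.
print(max_per_node_quota_mb([8192, 16384, 16384]))   # prints 6553

# After the last 8GB node is replaced, the quota can be raised.
print(max_per_node_quota_mb([16384, 16384, 16384]))  # prints 13107
```

The quota does not rise by itself once the last small node is gone; someone still has to raise the cluster RAM quota and then the bucket quotas.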

This page might help explain the architecture a bit more: https://developer.couchbase.com/documentation/server/4.6/architecture/managed-caching-layer-architecture.html

And this forum post as well: Cluster does not use the increased RAM quota - #4 by drigby

Hope that helps explain things, please let us know if you have any other questions.

Perry

edwardzhong · March 31, 2017, 6:46pm

@perry I really appreciate the detailed clarification. The rebalance finally completed. I then tried to remove a node. It showed ‘pending removal’, but after a couple of minutes this message disappeared and the server nodes list remained the same. No node was removed.

It seems the server tried, but failed. No error message. Any thoughts?

edwardzhong · March 31, 2017, 9:11pm

How long does it take for a node to change its state from “pending removal” to “removed”?

I clicked ‘rebalance’ while the state was ‘pending removal’, and it ran as a rebalance on the original number of nodes. Do I have to wait until the node being removed drops out of the active server node list by itself?

thanks
edward

perry · April 3, 2017, 10:19am

Hi @edwardzhong, apologies for the delay getting back to you.

Glad to hear we were able to help you make progress.

For removing a node, you do need to click rebalance after it is marked as “pending removal”, and then wait until the rebalance finishes. Once it is finished, the node will be automatically ejected from the cluster. The one nuance to keep in mind is that if the rebalance fails and you restart it, you need to make sure to re-remove that node. Otherwise the system will think that you’re actually bringing it back in.
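
If the removal is scripted rather than done in the UI, the nuance about re-removing the node can be handled by passing the removal list on every attempt. The sketch below is a minimal illustration with hypothetical names: `start_rebalance` and `wait_until_done` are callables you would supply (for example, thin wrappers around REST calls), and are not part of any Couchbase SDK.

```python
def rebalance_with_retries(start_rebalance, wait_until_done, known_nodes,
                           ejected_nodes, max_attempts=3):
    """Run a rebalance, re-sending the removal list each time it is restarted.

    start_rebalance(known_nodes, ejected_nodes) kicks off a rebalance.
    wait_until_done() blocks and returns True on success, False on failure.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        # The ejected nodes are passed on every attempt, so a restart
        # cannot silently bring the old node back into the cluster.
        start_rebalance(known_nodes, ejected_nodes)
        if wait_until_done():
            return attempt
    raise RuntimeError(f"rebalance did not finish after {max_attempts} attempts")


# Stand-ins that fail once and then succeed, to show the call pattern.
calls = []
outcomes = iter([False, True])
attempt = rebalance_with_retries(
    start_rebalance=lambda known, ejected: calls.append(list(ejected)),
    wait_until_done=lambda: next(outcomes),
    known_nodes=["node-a", "node-b", "node-c"],
    ejected_nodes=["node-a"],
)
print(attempt, calls)
```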

What version of Couchbase Server are you using?

edwardzhong · April 11, 2017, 6:14pm

Hi @perry, sorry for the late response.

We are using Couchbase 4.5.
The node removal and rebalance worked out completely. Your help is much appreciated!

best,
edward
