Rebalance stuck at 0% and does not cancel

ksafonov · January 19, 2016, 9:11am

Hi,

We’re using 2.5.1 Community edition. After one of our 24 nodes was failed over due to network problems, I added it back and started a rebalance. The rebalance has been at 0% for 4 days so far and does not respond to cancel. The UI logs are full of these (useless?) messages: “Metadata overhead warning. Over 62% of RAM allocated to bucket "XX" on node "XXXX" is taken up by keys and metadata.”

Please help as this is a serious problem for us.

ksafonov · January 19, 2016, 11:09am

Update: the UI popup says “Rebalancing 0 nodes” even though we have 24 nodes in the cluster.

ldoguin · January 19, 2016, 11:12am

There is no such thing as a useless log, even if it might be irrelevant to your current situation. @pvarley can surely help you on that.

ksafonov · January 19, 2016, 12:31pm

Sorry for saying that, we were just overwhelmed with the amount of such messages.

Update: command line client did not help us:

/opt/couchbase/bin/couchbase-cli rebalance-status --cluster=XXX:8091 --user=Administrator --password=XXX
(u'running', None)
/opt/couchbase/bin/couchbase-cli rebalance-stop --cluster=XXX:8091 --user=Administrator --password=XXX
SUCCESS: rebalance cluster stopped
/opt/couchbase/bin/couchbase-cli rebalance-status --cluster=XXX:8091 --user=Administrator --password=XXX
(u'running', None)
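
The SUCCESS line from rebalance-stop above evidently did not mean the rebalance had stopped, so a stop request is worth verifying by re-checking the status afterwards. Below is a minimal shell sketch of that check, using only the two couchbase-cli subcommands shown above. CLUSTER, CB_USER and CB_PASS are placeholder variables, the helper names are hypothetical, and treating any state other than "running" as stopped is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: extract the state word from couchbase-cli output
# such as "(u'running', None)".
status_word() {
    printf '%s\n' "$1" | sed -n "s/^(u\{0,1\}'\([a-zA-Z_]*\)'.*/\1/p"
}

# Request a stop, then re-check the status a few times instead of
# trusting the SUCCESS line. Returns 0 once the state is no longer
# "running", 1 if it is still running after all attempts.
stop_and_verify() {
    cli=/opt/couchbase/bin/couchbase-cli
    "$cli" rebalance-stop --cluster="$CLUSTER" --user="$CB_USER" --password="$CB_PASS"
    for attempt in 1 2 3 4 5; do
        out=$("$cli" rebalance-status --cluster="$CLUSTER" --user="$CB_USER" --password="$CB_PASS")
        [ "$(status_word "$out")" != "running" ] && return 0
        sleep 10
    done
    return 1
}
```

If stop_and_verify still returns non-zero after its retries, the cluster is probably in the stuck state described in this thread, and repeating the stop request is unlikely to help.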

cihangirb · January 19, 2016, 4:41pm

Hi @ksafonov, could you check the version number of your cluster? We don’t have a 2.5.1 release of the community edition, so you may be using an unfinished product. If we can identify the version, there may be a workaround we can suggest.
thanks
-cihan

ksafonov · January 20, 2016, 9:39am

Sorry, my mistake. The version is 2.2.0.

ksafonov · January 22, 2016, 12:41am

Hi guys, is there any workaround for our case? Rebalance is still at 0%…

ksafonov · January 25, 2016, 6:48am

Update:

There’s a message in UI logs probably related to my cancel attempts:
Server error during processing: ["web request failed",
 {path,"/controller/stopRebalance"},
 {type,exit},
 {what,
  {noproc,
   {gen_fsm,sync_send_event,
    [{global,ns_orchestrator},
     stop_rebalance]}}},
 {trace,
  [{gen_fsm,sync_send_event,2},
   {menelaus_web,handle_stop_rebalance,1},
   {request_throttler,do_request,3},
   {menelaus_web,loop,3},
   {mochiweb_http,headers,5},
   {proc_lib,init_p_do_apply,3}]}]

@ldoguin @cihangirb guys, any suggestions for us?
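
In Erlang, an exit reason of noproc means the process being called does not exist, and {global,ns_orchestrator} is a name looked up in the cluster-wide global registry. So the stop request in the trace above failed because the node handling it could not find a registered orchestrator process. Below is a hedged sketch of how one might check this on each node through the /diag/eval endpoint (the same endpoint used later in this thread). global:whereis_name/1 is standard Erlang; the helper names, the CB_USER and CB_PASS variables, and the assumed reply formats are illustrative only:

```shell
#!/bin/sh
# Ask one node where it believes the globally registered orchestrator is.
# $1 is a node hostname; CB_USER and CB_PASS are placeholder credentials.
orchestrator_pid() {
    curl -s -X POST -u "$CB_USER:$CB_PASS" "http://$1:8091/diag/eval" \
        --data 'global:whereis_name(ns_orchestrator).'
}

# Classify the reply text. Assumption: the endpoint returns the printed
# Erlang term, i.e. "undefined" when nothing is registered under that
# name, or a pid such as <0.123.0> when a process is registered.
orchestrator_state() {
    case "$1" in
        *undefined*) echo missing ;;
        *"<"*">"*)   echo registered ;;
        *)           echo unknown ;;
    esac
}
```

Running orchestrator_state "$(orchestrator_pid somehost)" against every node would show whether the nodes agree on an orchestrator. A "missing" answer would be consistent with the noproc error, though it does not by itself say why the registration was lost.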

alkondratenko · January 28, 2016, 5:50am

Kiril reached me via gtalk (I don’t know how he found me) and I agreed to help him. After looking at his logs I found that he was hit by that famous erlang master election thing. After searching my gmail history I found a similar case and advised him to restart ns_servers via erlang web shell, which helped “unstick” the rebalance.

He then had another issue with name resolution (at least one of the machines was unable to resolve another machine, likely the one being added). So I think the case is closed now.

ksafonov · January 28, 2016, 4:54pm

I happily confirm that Alexey’s advice helped us, and I would like to express my sincere gratitude to @alkondratenko for the support!

Hazok · March 2, 2016, 2:03am

What are the steps to resolve this issue? We encountered pretty much the exact same scenario on Couchbase 4.0.0 Community edition and are stuck with the exact same issue.

In our case we have 6 nodes in the cluster. One went down temporarily and auto-failover happened. When the node came back online we added it back to the cluster and rebalanced. Now we are stuck with “Rebalancing 0 nodes”, and neither stopping the rebalance nor restarting the nodes works.

Is this an issue since 2.5.1 through 4.0.0?

Hazok · March 2, 2016, 2:29am

I’ve now tried several ways to work around this, including:

Taking down the node that was previously failed over.
Adding this node back in.
Adding a completely new node to the cluster.

None of these work. There is no option in the UI that works. This seems like a critical bug to have from 2.5.1 through 4.0.0 server versions, since there appears to be no immediate workaround available to the user. Why is the “Stop Rebalance” button broken?

Hazok · March 2, 2016, 3:20am

So… I finally found a workaround: Crash the cluster.

Basically I took down nodes 1-by-1 until the number of downed nodes exceeded the number of replicas by 1, then brought the nodes back up.

I’m really curious what the technical details behind this bug are and what causes the rebalance to get stuck at 0 nodes. There have been reports of this bug in other forums as well.

We have been using the community edition for prototyping and were going to look into our company’s policy on enterprise edition licenses. Since this bug appears to be in the erlang layer, would it occur in the enterprise edition as well?

Very curious on this one since getting stuck in such a state with crashing the cluster as the only recourse is highly undesirable behavior. I wouldn’t want to go to production with an issue like this present.

kobusroux · May 19, 2016, 10:36am

Hello

Just had the same issue last night and I was wondering whether you could perhaps provide the commands to “restart ns_servers via erlang web shell”? I’m guessing the command will run on each node - how does this impact a running cluster?

Thanks!

robert_hamon_ · June 3, 2016, 11:58pm

I have the exact same issue… I have a cluster of 18 nodes, and node 004 had a network issue that forced me to do a hard failover on it. After a reboot, and with the network link fixed, the node was added back with “delta-recovery” and a rebalance was started.
After 3 days of nothing progressing (stuck at rebalancing 0 nodes) I figured I needed to fix it myself before it came crashing down hard.

So I ran the following command on all the cluster nodes in parallel:
curl -X POST -u Administrator:Password http://localhost:8091/diag/eval --data 'erlang:halt().'

At that point the nodes were all in a standby state and in the “data buckets” tabs, all buckets had a yellow pie in the “data nodes” column.
I quickly did a hard failover of my problematic node 004, and the cluster and all buckets went back to ready in less than a minute. I have over 9 billion items in there, so for sure there was no “warm up” that happened.

So I rebooted node 004 again and once it came back up, I tried a full recovery this time.
Rebalance is now in progress (I can see it progressing) and I’ll update later on success or failure.

update
I’m not sure when the rebalance finished, but the cluster is now healthy with all 18 nodes in.
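
The parallel halt described above could be scripted from one admin host roughly as follows. This is only a sketch under stated assumptions: NODES is a hypothetical space-separated list of hostnames, CB_USER and CB_PASS are placeholder credentials, port 8091 is assumed reachable on every node (the post ran the command against localhost on each node instead), and the helper names are made up. Because erlang:halt() stops the Erlang VM on the node, the sketch defaults to a dry run that only prints the commands:

```shell
#!/bin/sh
# Print the curl command that would be sent to one node ($1).
halt_cmd() {
    printf "curl -X POST -u %s:%s http://%s:8091/diag/eval --data 'erlang:halt().'\n" \
        "$CB_USER" "$CB_PASS" "$1"
}

# Send erlang:halt() to every node in NODES at roughly the same time.
# With DRY_RUN unset or set to 1, only print what would be run.
halt_all() {
    for node in $NODES; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            halt_cmd "$node"
        else
            curl -s -X POST -u "$CB_USER:$CB_PASS" "http://$node:8091/diag/eval" \
                --data 'erlang:halt().' &
        fi
    done
    wait   # block until all background requests have returned
}
```

Reviewing the dry-run output before setting DRY_RUN=0 seems prudent, since the post reports that all nodes went into a standby state immediately after the command ran.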
