Rebalance repeatedly fails with not_all_nodes_are_ready_yet

Hello, I have a 6-node cluster. One of the nodes is in the “pend” state, and the UI says I need to rebalance the cluster. Each time I attempt to rebalance, however, the rebalance fails, and the log shows a not_all_nodes_are_ready_yet error. If I wait a while and try again, I get the same error every time.

In addition to just rebalancing, I tried removing the node and then rebalancing, but the result is the same. I noticed that the rebalance always aborts about 60 seconds after I kick it off, and then 180 seconds after that an additional message shows up in the log: a ‘badmatch’ error, always on that same pending node. Here are the relevant log entries, newest first:

Control connection to memcached on ‘ns_1@X’ disconnected: {badmatch, {error, timeout}} ns_memcached004 ns_1@X 08:22:34 - Thu May 8, 2014

Rebalance exited with reason {not_all_nodes_are_ready_yet, ['ns_1@X']} ns_orchestrator002 ns_1@Y 08:19:34 - Thu May 8, 2014

Started rebalancing bucket dl_sessions ns_rebalancer000 ns_1@Y 08:18:33 - Thu May 8, 2014

Starting rebalance, KeepNodes = [‘ns_1@W’,‘ns_1@X’, ‘ns_1@Y’,‘ns_1@Z’, ‘ns_1@A’,‘ns_1@B’], EjectNodes = []
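To double-check the timing pattern, the intervals can be computed straight from the timestamps in the log entries above. This is only a small sketch: the variable names are made up for illustration, and it assumes all three entries are from the same day.

```python
from datetime import datetime

# Timestamps copied from the log entries above (time of day only;
# all three entries are dated Thu May 8, 2014).
FMT = "%H:%M:%S"
started = datetime.strptime("08:18:33", FMT)   # "Started rebalancing bucket dl_sessions"
aborted = datetime.strptime("08:19:34", FMT)   # "Rebalance exited with reason ..."
badmatch = datetime.strptime("08:22:34", FMT)  # "Control connection to memcached ... disconnected"

# Elapsed seconds between consecutive events.
start_to_abort = (aborted - started).total_seconds()
abort_to_badmatch = (badmatch - aborted).total_seconds()

print(f"start -> abort:    {start_to_abort:.0f} s")
print(f"abort -> badmatch: {abort_to_badmatch:.0f} s")
```

With these particular entries the first gap comes out at 61 seconds and the second at exactly 180 seconds, which matches the roughly-60-then-180 pattern described above and looks more like fixed timeouts than a random failure.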

I’m running Couchbase Server 2.2.0 (build-738-rel). Each node is a 64-bit EC2 r3.xlarge instance with ample unused RAM (30GB RAM, 17GB RAM quota, 9+GB RAM free) and disk space (74GB per node total, 9GB used, 60+GB free). Each server is running Ubuntu 13.10. My cluster has a single bucket with a cluster RAM quota of 102GB (<3GB used) and a disk quota of 443GB (<70GB used), with 1 replica and 1 production view. There are currently fewer than 500k documents, and the average object size is around 5KB.
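As a sanity check on those numbers, the cluster-wide totals can be compared against the per-node figures. A minimal sketch, using only the values quoted above (the variable names are just for illustration):

```python
# Per-node figures quoted in the post.
nodes = 6
ram_quota_per_node_gb = 17
disk_per_node_gb = 74

# Cluster-wide totals implied by the per-node figures.
cluster_ram_quota_gb = nodes * ram_quota_per_node_gb
cluster_disk_gb = nodes * disk_per_node_gb

print(f"implied cluster RAM quota: {cluster_ram_quota_gb} GB (reported: 102 GB)")
print(f"implied cluster disk:      {cluster_disk_gb} GB (reported: 443 GB)")
```

The implied RAM quota matches the reported 102GB exactly, and the implied disk total is within 1GB of the reported 443GB (probably rounding). That would be consistent with all six nodes, including the pending one, being counted in the cluster totals.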

The cluster is in our test environment, so I was able to temporarily turn off all traffic to the cluster and attempt a rebalance while it was completely idle, but got the same result. Even though it’s a test environment, I’d really like to figure out a way to get the cluster happy again without doing something too drastic so that we’d know what to do should it happen against our production clusters.

Thanks in advance for any help!

Hi dfb, I am having a similar issue. Do you have a fix yet?
I am running a 3-node cluster on AWS EC2. All nodes (Windows) are in the same network.