Unable to rebalance cluster after node failure

Hi,

We are facing a strange issue that I have not seen before on 2.0.1. We had a node failure a few hours ago. By the time we got the alert and logged into the console, it showed that the node could be added back.

Clicking Add Back and then rebalancing failed. Since then we have repeatedly tried removing the node and failing it over, but the rebalance keeps failing. Initially the rebalance would start, go up to 5%, and then fail. Once it reached 29% before failing.

Now it does not even start. Below are the 2 log entries I see on the console. Any suggestions?

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.31599.3143>,
{{badmatch,
[{'EXIT',
{timeout,
{gen_server,call,
[<18112.25473.1825>,had_backfill,
30000]}}}]},
[{ns_single_vbucket_mover,
'-wait_backfill_determination/1-fun-1-',
1}]}}}
ns_orchestrator002 ns_1@10.57.49.18 11:22:06 - Sun Jul 28, 2013

<0.31572.3143> exited with {unexpected_exit,
{'EXIT',<0.31599.3143>,
{{badmatch,
[{'EXIT',
{timeout,
{gen_server,call,
[<18112.25473.1825>,had_backfill,
30000]}}}]},
[{ns_single_vbucket_mover,
'-wait_backfill_determination/1-fun-1-',1}]}}}
ns_vbucket_mover000 ns_1@10.57.49.18 11:22:06 - Sun Jul 28, 2013

regards,
-Piyush

Update: for about 3-4 hours we tried to rebalance the remaining nodes in the cluster (7 of the original 8), and that kept failing.

I finally added the 8th node back to the cluster and rebalanced, and it worked. So in that sense our problem is solved, but I am not sure what happened. Why wouldn’t the cluster rebalance with the remaining 7 nodes?

Any explanations/thoughts?

regards,
-Piyush

Can you confirm that your cluster can handle your entire working set with 7 nodes?
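
For a rough back-of-the-envelope check, something like the sketch below may help. It is only an illustration: the function name, the parameters, and the `headroom` fraction are all hypothetical, not Couchbase settings, and the working-set figure should include whatever replica data you expect to stay resident.

```python
def fits_after_node_loss(per_node_quota_mb, nodes_total, nodes_lost,
                         working_set_mb, headroom=0.8):
    """Rough check: does the working set fit in the RAM left after losing nodes?

    All parameters are hypothetical inputs you supply yourself; `headroom`
    is an assumed fraction of the quota you are willing to fill.
    """
    remaining = nodes_total - nodes_lost
    if remaining <= 0:
        raise ValueError("no nodes left in the cluster")
    # Usable memory is the per-node quota across the remaining nodes,
    # scaled down by the assumed headroom fraction.
    usable_mb = per_node_quota_mb * remaining * headroom
    return working_set_mb <= usable_mb


# Made-up numbers: 8 nodes with a 10000 MB quota each, one node lost.
print(fits_after_node_loss(10000, 8, 1, working_set_mb=50000))
```

If this comes back False for your real numbers, the 7 remaining nodes are likely too small for the working set, which would be worth ruling out first.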

For the time being, yes. However, at that point in time we were (for all practical purposes) running with reduced capacity.

But the working set will grow with time, and we need the 8th node back in. Hence I had to add it back sooner or later to get back to the original configuration.