[MB-4375] rebalance failing with retry_not_ready_vbuckets error if ns_server janitor sets the vbucket state from pending to dead when rebalance fails or stops Created: 24/Oct/11 Updated: 09/Jan/13 Resolved: 24/Jul/12
|Affects Version/s:||1.7 GA, 220.127.116.11, 1.8.0|
|Reporter:||Farshid Ghods||Assignee:||Farshid Ghods|
|Remaining Estimate:||Not Specified|
|Time Spent:||Not Specified|
|Original Estimate:||Not Specified|
When a rebalance fails or is stopped by the user, the vbuckets whose transfers were still in progress are left in the pending state.
The ns_server janitor runs every few seconds and changes the state of those vbuckets from pending to dead.
If the user then restarts the rebalance within 5 minutes, ep-engine tries to reuse the existing TAP stream and does not send TAP_VBUCKET_SET when restarting the takeover. Because the vbucket state is now dead, ep-engine does not start the vbucket transfer, and the rebalance gets stuck.
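The sequence described above can be sketched as a small state model. This is only an illustration of the reported behaviour, not ep-engine or ns_server code: the names `DestVBucket`, `janitor_pass` and `restart_takeover` are hypothetical, and the assumption that a fresh (non-reused) stream sends TAP_VBUCKET_SET and puts the vbucket back into pending is inferred from the description rather than stated in it.

```python
from dataclasses import dataclass
from typing import Optional

# The description's "sooner than 5 minutes" window, in seconds.
TAP_REUSE_WINDOW_S = 5 * 60


@dataclass
class DestVBucket:
    """Hypothetical view of one vbucket on the destination node."""
    state: str = "pending"                    # left pending by the interrupted rebalance
    stream_stopped_at: Optional[float] = None  # when the previous takeover stream stopped


def janitor_pass(vb: DestVBucket) -> None:
    """Model of the janitor: a leftover pending vbucket is marked dead."""
    if vb.state == "pending":
        vb.state = "dead"


def restart_takeover(vb: DestVBucket, now: float) -> bool:
    """Return True if the transfer can start, False if the rebalance is stuck."""
    reused = (
        vb.stream_stopped_at is not None
        and now - vb.stream_stopped_at < TAP_REUSE_WINDOW_S
    )
    if not reused:
        # Assumption: a fresh stream sends TAP_VBUCKET_SET, which moves the
        # vbucket back to pending before the transfer begins.
        vb.state = "pending"
    # A reused stream skips TAP_VBUCKET_SET, so a dead vbucket stays dead.
    return vb.state == "pending"


if __name__ == "__main__":
    vb = DestVBucket(stream_stopped_at=0.0)  # rebalance stopped at t=0
    janitor_pass(vb)                          # janitor runs a few seconds later
    print(restart_takeover(vb, now=60.0))     # restart after 1 minute
    print(restart_takeover(vb, now=600.0))    # restart after 10 minutes
```

In this model the restart inside the reuse window is the only path that leaves the vbucket dead, which matches the stuck rebalance reported here.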
|Comment by Farshid Ghods [ 24/Oct/11 ]|
|Comment by Matt Ingenthron [ 05/Mar/12 ]|
A user of 1.8.0 seems to have run into this issue. The specific message in the log is:
CRASH REPORT <0.2530.13> 2012-03-01 09:40:29
|Comment by Farshid Ghods [ 05/Mar/12 ]|
Do you have access to the user to grab diags from their cluster?
|Comment by Aleksey Kondratenko [ 05/Mar/12 ]|
Folks, retry_not_ready_vbuckets is actually a "voluntary crash". We do that in order to restart replication later. I.e. when replicating from some node, if some of the vbuckets we need to replicate from are not ready yet (i.e. we're the second replica and the first is not yet ready to be replicated from), we just don't replicate those vbuckets, but after 30 seconds we perform harakiri so that the supervisor restarts us and we check again. This was a quick fix made in the few days before 1.7.0, and I'm really sorry for not making the log messages clearer that it's not a problem at all. 1.8.1 will fix that.
So this message has nothing at all to do with the rebalance failing. May we ask for logs from the master node? The master node can be identified by looking at the user-visible logs: the server that logs the "rebalance failed" message is the master node for that failed rebalance.
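The "voluntary crash" behaviour described in this comment can be sketched roughly as follows. This is a simplified illustration, not the ns_server implementation (which is Erlang): `RetryNotReadyVBuckets`, `replicate_once` and `supervise` are hypothetical names, the supervisor is reduced to a plain loop, and the 30-second wait is injected as a `sleep` callable so the sketch runs instantly by default.

```python
from typing import Callable, Iterable, List, Set

# Delay mentioned in the comment before the replicator exits on purpose.
RETRY_DELAY_S = 30


class RetryNotReadyVBuckets(Exception):
    """Deliberate exit: some source vbuckets were not ready to replicate from."""


def replicate_once(wanted: Set[int], ready: Set[int],
                   sleep: Callable[[float], None]) -> List[int]:
    """Replicate the ready vbuckets; exit on purpose if any were skipped."""
    started = sorted(wanted & ready)   # vbuckets we can replicate right now
    skipped = sorted(wanted - ready)   # vbuckets that are not ready yet
    if skipped:
        sleep(RETRY_DELAY_S)
        # Not an error condition: it only asks the supervisor to restart us
        # so that readiness is checked again.
        raise RetryNotReadyVBuckets(skipped)
    return started


def supervise(wanted: Set[int], ready_over_time: Iterable[Set[int]],
              sleep: Callable[[float], None] = lambda seconds: None) -> int:
    """Restart the replicator after each voluntary exit; return the restart count."""
    restarts = 0
    for ready in ready_over_time:
        try:
            replicate_once(wanted, ready, sleep)
            return restarts
        except RetryNotReadyVBuckets:
            restarts += 1
    raise RuntimeError("vbuckets never became ready")


if __name__ == "__main__":
    # vbuckets 2 and 3 become ready on later checks
    print(supervise({1, 2, 3}, [{1}, {1, 2}, {1, 2, 3}]))
```

Passing `time.sleep` as `sleep` would make the sketch wait the full 30 seconds between checks. Each restart in this model corresponds to one occurrence of the crash message in the log, which is why the message alone does not indicate a failed rebalance.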