[MB-4375] rebalance failing with retry_not_ready_vbuckets error if ns_server janitor sets the vbucket state from pending to dead when rebalance fails or stops Created: 24/Oct/11  Updated: 09/Jan/13  Resolved: 24/Jul/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.7 GA, 1.8.0
Fix Version/s: 1.7.2
Security Level: Public

Type: Bug Priority: Major
Reporter: Farshid Ghods (Inactive) Assignee: Farshid Ghods (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

When a rebalance fails or is stopped by the user, the vbuckets whose transfers were in progress are left in the pending state.
The ns_server janitor, which runs every few seconds, then changes the state of those vbuckets from pending to dead.

If the user then restarts the rebalance within 5 minutes, ep-engine tries to reuse the existing TAP stream and does not send TAP_VBUCKET_SET when restarting the takeover. Because the vbucket state is now dead, ep-engine does not start the vbucket transfer, and the rebalance gets stuck.
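
The sequence described above can be illustrated with a toy state model. This is only a sketch of the reported behaviour, not ns_server or ep-engine code: the class and function names (DestinationVBucket, start_takeover, janitor_pass) are hypothetical, and it assumes that a takeover can proceed only while the destination vbucket is pending.

```python
from enum import Enum


class VBState(Enum):
    ACTIVE = "active"
    PENDING = "pending"
    DEAD = "dead"


class DestinationVBucket:
    """Toy model of one vbucket on the rebalance destination (hypothetical)."""

    def __init__(self):
        self.state = VBState.DEAD

    def on_tap_vbucket_set(self, new_state):
        # In this model, TAP_VBUCKET_SET is the only message that sets the state.
        self.state = new_state

    def janitor_pass(self):
        # Per the report: the janitor turns leftover pending vbuckets into dead.
        if self.state is VBState.PENDING:
            self.state = VBState.DEAD


def start_takeover(vb, reuse_tap_stream):
    """Return True if the transfer can proceed, False if it stalls."""
    if not reuse_tap_stream:
        # A fresh stream announces the pending state before sending data.
        vb.on_tap_vbucket_set(VBState.PENDING)
    # A reused stream skips TAP_VBUCKET_SET, so the state stays as it was.
    return vb.state is VBState.PENDING
```

In this model, a first start_takeover(vb, reuse_tap_stream=False) succeeds; after vb.janitor_pass() a restart with reuse_tap_stream=True returns False, which corresponds to the stuck rebalance, while a restart with a fresh stream succeeds again.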

Comment by Farshid Ghods (Inactive) [ 24/Oct/11 ]
Comment by Matt Ingenthron [ 05/Mar/12 ]
A user of 1.8.0 seems to have run into this issue. The specific message in the log is:

CRASH REPORT <0.2530.13> 2012-03-01 09:40:29
Crashing process
   initial_call {ebucketmigrator_srv,init,['Argument__1']}
   pid <0.2530.13>
   registered_name []
   ancestors ['ns_vbm_sup-default','single_bucket_sup-default',<0.1025.0>]
   messages []
   links [<0.1063.0>]
   dictionary []
   trap_exit false
   status running
   heap_size 4181
   stack_size 24
   reductions 218220
Comment by Farshid Ghods (Inactive) [ 05/Mar/12 ]
Hi Matt,

Do you have access to the user so we can grab diags from their cluster?
Comment by Aleksey Kondratenko [ 05/Mar/12 ]
Folks, retry_not_ready_vbuckets is actually a "voluntary crash". We do that in order to restart replication later. When replicating from some node, if some of the vbuckets we need to replicate from are not ready yet (i.e. we're the second replica and the first is not yet ready to be replicated from), we just don't replicate those vbuckets, but after 30 seconds we perform harakiri so that the supervisor restarts us and we check again. This was a quick fix made a few days before 1.7.0, and I'm really sorry for not making the log messages clearer that it's not a problem at all. 1.8.1 will fix that.

So this message has nothing at all to do with the rebalance failing. May we ask for logs from the master node? The master node can be identified by looking at the user-visible logs: the server that logs the "rebalance failed" message is the master node for that failed rebalance.
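
The retry behaviour described in this comment can be sketched roughly as follows. This is a simplified model and not the ebucketmigrator_srv implementation: the names run_replicator and supervise, the injected sleep callable, and the max_restarts limit are assumptions made for illustration, and the real process keeps replicating the ready vbuckets during the 30 seconds, which the sketch does not model.

```python
RETRY_DELAY_SECONDS = 30  # delay mentioned in the comment


class RetryNotReadyVBuckets(Exception):
    """Stand-in for the voluntary exit; it does not signal a real failure."""


def run_replicator(wanted, is_ready, sleep):
    """Take the ready vbuckets; exit on purpose if any had to be skipped."""
    ready = [vb for vb in wanted if is_ready(vb)]
    # Replication of the `ready` vbuckets would happen here.
    if len(ready) < len(wanted):
        sleep(RETRY_DELAY_SECONDS)
        # Voluntary crash so that the supervisor restarts us and we check again.
        raise RetryNotReadyVBuckets(sorted(set(wanted) - set(ready)))
    return ready


def supervise(wanted, is_ready, sleep, max_restarts=10):
    """Hypothetical supervisor loop: restart after each voluntary exit."""
    restarts = 0
    while True:
        try:
            return run_replicator(wanted, is_ready, sleep), restarts
        except RetryNotReadyVBuckets:
            restarts += 1
            if restarts > max_restarts:
                raise
```

Passing sleep as a parameter (for example time.sleep in real use) keeps the sketch testable without waiting. Each RetryNotReadyVBuckets raised here corresponds to one crash report like the one quoted earlier, which is why such reports alone do not indicate a failed rebalance.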
Generated at Thu Oct 23 12:32:56 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.