[MB-6384] inability to reach some node should not cause entire per-bucket supervisor to fail [was: Rebalance 5->4 nodes is failed with reason bulk_set_vbucket_state_failed] Created: 22/Aug/12  Updated: 10/Sep/12  Resolved: 24/Aug/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: None
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: regression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS, 64-bit, 4-core VMs, build #1620

Attachments: GZip Archive 10.3.3.58-8091-diag.txt.gz     GZip Archive 10.3.3.64-8091-diag.txt.gz     GZip Archive 10.3.3.68-8091-diag.txt.gz     GZip Archive 10.3.3.71-8091-diag.txt.gz     GZip Archive 10.3.3.73-8091-diag.txt.gz    

 Description   
1. Rebalance in 1->5 nodes
2. Load data (1M items); no views or ddocs are created
3. Start rebalance out
4. Create 3 ddocs, 2 views per ddoc
5. Rebalance fails

2012-08-22 18:18:41.623 ns_orchestrator:4:info:message(ns_1@10.3.3.58) - Starting rebalance, KeepNodes = ['ns_1@10.3.3.64','ns_1@10.3.3.68',
                                 'ns_1@10.3.3.58','ns_1@10.3.3.71'], EjectNodes = ['ns_1@10.3.3.73']

2012-08-22 18:18:41.933 ns_rebalancer:0:info:message(ns_1@10.3.3.58) - Started rebalancing bucket default
2012-08-22 18:18:42.512 ns_vbucket_mover:0:info:message(ns_1@10.3.3.58) - Bucket "default" rebalance does not seem to be swap rebalance
2012-08-22 18:18:45.428 ns_memcached:2:info:message(ns_1@10.3.3.73) - Shutting down bucket "default" on 'ns_1@10.3.3.73' for server shutdown
2012-08-22 18:18:45.747 ns_orchestrator:2:info:message(ns_1@10.3.3.58) - Rebalance exited with reason {{bulk_set_vbucket_state_failed,
                               [{'ns_1@10.3.3.64',
                                 {'EXIT',
                                  {killed,
                                   {gen_server,call,
                                    [{'janitor_agent-default',
                                      'ns_1@10.3.3.64'},
                                     {if_rebalance,<0.22908.39>,
                                      {update_vbucket_state,820,replica,
                                       undefined,'ns_1@10.3.3.58'}},
                                     infinity]}}}}]},
                              [{janitor_agent,bulk_set_vbucket_state,4},
                               {ns_vbucket_mover,
                                update_replication_post_move,3},
                               {ns_vbucket_mover,handle_info,2},
                               {gen_server,handle_msg,5},
                               {proc_lib,init_p_do_apply,3}]}


 Comments   
Comment by Aleksey Kondratenko [ 22/Aug/12 ]
Root cause is the problem in MB-6385. But it causes the per-bucket supervisor on .64 to fail, because .73 deletes the bucket, incorrectly assuming a server shutdown.
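The shutdown-vs-deletion distinction above is the core of the first fix ("don't shutdown bucket unless we're deleting it", in src/ns_memcached.erl). A minimal sketch of the idea, with hypothetical reason terms and helper names (not the actual ns_server code):

```erlang
%% Hypothetical sketch only; the real change lives in src/ns_memcached.erl.
%% The idea: a terminate triggered by an ordinary supervisor shutdown
%% (e.g. a node being rebalanced out) must NOT delete the bucket;
%% only an explicit bucket deletion may.
terminate({shutdown, {delete_bucket, Bucket}}, State) ->
    really_delete_bucket(Bucket, State);   %% hypothetical helper
terminate(_OtherReason, _State) ->
    ok.                                    %% plain shutdown: keep bucket data
```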
Comment by Farshid Ghods (Inactive) [ 22/Aug/12 ]
Regressions are marked as blockers.
Comment by Aleksey Kondratenko [ 24/Aug/12 ]
Should be done as well
Comment by Thuan Nguyen [ 25/Aug/12 ]
Integrated in github-ns-server-2-0 #453 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/453/])
    MB-6384: don't shutdown bucket unless we're deleting it (Revision 2e7b50a5c0faa23a1f5367536e75358e105a0d19)
MB-6384: changed replicators' supervision type to temporary (Revision b5ab81c848aef02d010062a5eb10361ed2965088)

     Result = SUCCESS
Aliaksey Kandratsenka :
Files :
* src/ns_memcached.erl

Aliaksey Kandratsenka :
Files :
* src/ns_vbm_new_sup.erl
* src/replication_changes.erl
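The second revision above changes the replicators' supervision type to temporary, which matches the ticket summary: an unreachable node should not bring down the whole per-bucket supervisor. A sketch of the difference in an OTP child spec (names are illustrative, not the actual ns_vbm_new_sup code):

```erlang
%% Illustrative child spec only. With restart type 'permanent', every
%% replicator crash is restarted; once the supervisor's restart
%% intensity is exceeded (e.g. a peer node stays unreachable), the
%% supervisor itself exits, killing its sibling children.
%% With 'temporary', a failed replicator is simply not restarted,
%% so one bad node cannot escalate into a supervisor-wide failure.
child_spec(ReplicatorId, StartMFA) ->
    {ReplicatorId,
     StartMFA,
     temporary,   %% was: permanent
     60000,       %% shutdown timeout, ms
     worker,
     [replicator]}.
```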
Comment by Iryna Mironava [ 10/Sep/12 ]
verified
Generated at Tue Jul 29 17:01:59 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.