[MB-5108] Rolling upgrade from 172 to latest 181 fails with failed rebalance {type,exit}, {what,{noproc, {gen_fsm,sync_send_event,}}}
Created: 18/Apr/12  Updated: 09/Jan/13  Resolved: 27/Apr/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.1-release-candidate
Fix Version/s: 1.8.1
Security Level: Public

Type: Bug
Priority: Blocker
Reporter: Karan Kumar (Inactive)
Assignee: Aleksey Kondratenko
Resolution: Fixed
Votes: 0
Labels: 1.8.1-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 10.04

Attachments:
    GZip Archive 10.3.121.92-8091-diag.txt.gz
    GZip Archive 10.3.121.93-8091-diag.txt.gz
    GZip Archive 10.3.121.94-8091-diag.txt.gz
    GZip Archive 10.3.121.95-8091-diag.txt.gz
    GZip Archive 10.3.121.97-8091-diag.txt.gz
    GZip Archive 10.3.121.98-8091-diag.txt.gz
    GZip Archive 1cdc7aec-3cf0-4614-bcbf-1902c99b111f-10.3.121.92-diag.txt.gz
    GZip Archive 1cdc7aec-3cf0-4614-bcbf-1902c99b111f-10.3.121.93-diag.txt.gz
    GZip Archive 1cdc7aec-3cf0-4614-bcbf-1902c99b111f-10.3.121.94-diag.txt.gz
    GZip Archive 1cdc7aec-3cf0-4614-bcbf-1902c99b111f-10.3.121.95-diag.txt.gz
    GZip Archive 1cdc7aec-3cf0-4614-bcbf-1902c99b111f-10.3.121.97-diag.txt.gz
    GZip Archive 1cdc7aec-3cf0-4614-bcbf-1902c99b111f-10.3.121.98-diag.txt.gz

 Description   
Failing test
upgradetests.MultipleNodeUpgradeTests.test_upgrade,initial_version=1.7.2,create_buckets=True,insert_data=True,start_upgraded_first=False,load_ratio=10,online_upgrade=True


2012-04-18 05:05:23,369 - root - INFO - adding node : 10.3.121.92:8091 to the cluster
2012-04-18 05:05:23,370 - root - INFO - adding remote node : 10.3.121.92 to this cluster @ : 10.3.121.98
2012-04-18 05:05:24,121 - root - INFO - added node : ns_1@10.3.121.92 to the cluster
2012-04-18 05:05:24,134 - root - INFO - rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.98%2Cns_1%4010.3.121.93%2Cns_1%4010.3.121.97%2Cns_1%4010.3.121.95
2012-04-18 05:05:24,140 - root - ERROR - http://10.3.121.98:8091/controller/rebalance error 500 reason: unknown ["Unexpected server error, request logged."]
2012-04-18 05:05:24,140 - root - ERROR - rebalance operation failed




INFO REPORT <0.6440.0> 2012-04-18 05:06:53
===============================================================================

ns_log: logging menelaus_web:19:Server error during processing: ["web request failed",
                                 {path,"/controller/rebalance"},
                                 {type,exit},
                                 {what,
                                  {noproc,
                                   {gen_fsm,sync_send_event,
                                    [{global,ns_orchestrator},
                                     {start_rebalance,
                                      ['ns_1@10.3.121.94','ns_1@10.3.121.92',
                                       'ns_1@10.3.121.98','ns_1@10.3.121.93',
                                       'ns_1@10.3.121.97','ns_1@10.3.121.95'],
                                      [],[]}]}}},
                                 {trace,
                                  [{gen_fsm,sync_send_event,2},
                                   {menelaus_web,do_handle_rebalance,3},
                                   {menelaus_web,loop,3},
                                   {mochiweb_http,headers,5},
                                   {proc_lib,init_p_do_apply,3}]}]




 Comments   
Comment by Karan Kumar (Inactive) [ 18/Apr/12 ]
This looks to be a regression in ns_server.
Comment by Aleksey Kondratenko [ 18/Apr/12 ]
Found this to be an issue in the old ns_server. Rebalance requests need to either be sent to a node running the new version, or the caller has to work around the issue by waiting and retrying. The commit that fixed it (for 1.8.0) is:

commit d45ccaab92158d4a4fc882d3216d1557b7b39816
Author: Aliaksey Kandratsenka <alk@tut.by>
Date: Tue Nov 29 15:10:09 2011 +0300

    wait for orchestrator presense for key operations. MB-4214 MB-4559
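
The wait-and-retry workaround mentioned above could look roughly like the following in a Python test. This is a minimal sketch, not the test framework's actual code: the helper names (`rebalance_params`, `post_rebalance`, `rebalance_with_retry`), the retry count, and the delay are all hypothetical, and the form fields simply mirror the "rebalance params" line in the test log. It treats an HTTP 500 from /controller/rebalance as "orchestrator not available yet" and tries again.

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def rebalance_params(known_nodes, ejected_nodes, user, password):
    """Encode the form body in the shape seen in the test log (assumed fields)."""
    return urllib.parse.urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(ejected_nodes),
        "user": user,
        "password": password,
    })


def post_rebalance(base_url, body):
    """POST the body to /controller/rebalance and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/controller/rebalance", data=body.encode("ascii"))
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A 500 here is the "Unexpected server error" case from the log.
        return err.code


def rebalance_with_retry(send, attempts=5, delay=2.0, sleep=time.sleep):
    """Call send() until it returns a status other than 500.

    Returns the last status seen; sleeps `delay` seconds between attempts.
    """
    status = None
    for attempt in range(attempts):
        status = send()
        if status != 500:
            return status
        if attempt < attempts - 1:
            sleep(delay)
    return status
```

A test would call it as `rebalance_with_retry(lambda: post_rebalance("http://10.3.121.98:8091", body))`. Passing the request as a callable keeps the retry loop independent of how the request is sent, so the same loop works if the request is redirected to a node running the new version.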
Comment by Aleksey Kondratenko [ 18/Apr/12 ]
Not a "bug".
Comment by Karan Kumar (Inactive) [ 26/Apr/12 ]
Still failing. The suggested workaround does not work.

In the test, we issue the rebalance call to the newly upgraded 1.8.1 node.

This results in the following rebalance failure:

Rebalance exited with reason {{case_clause,
                               {badrpc,
                                {'EXIT',
                                 {{badfun,#Fun<erl_eval.4.88154533>},
                                  [{erlang,apply,2},
                                   {rpc,'-handle_call_call/6-fun-0-',5}]}}}},
                              [{ns_vbm_sup,change_vbucket_filter,4},
                               {ns_vbm_sup,'-set_replicas/3-fun-2-',5},
                               {lists,foldl,3},
                               {ns_vbm_sup,set_replicas,3},
                               {ns_vbm_sup,'-set_replicas/2-fun-1-',3},
                               {lists,foreach,2},
                               {ns_vbm_sup,apply_changes,2},
                               {ns_vbucket_mover,sync_replicas,0}]}
Comment by Karan Kumar (Inactive) [ 26/Apr/12 ]
Waiting for the newly added node to become orchestrator does not solve this issue either.
Comment by Aleksey Kondratenko [ 26/Apr/12 ]
That's a different failure. Thanks for finding it.
Comment by Aleksey Kondratenko [ 27/Apr/12 ]
Fixed in http://review.couchbase.org/15366
Comment by Thuan Nguyen [ 28/Apr/12 ]
Integrated in github-ns-server-2-0 #342 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/342/])
    reimplemented backwards-compat for change_vbucket_filter. MB-5108 (Revision 023a90b14d2530823602f9c0c1c03dc86c33013e)
    forward-ported new change_filter code (023a90b14). MB-5108 (Revision f6d217bec4b036b617f6ccf19404ac5ef8b0b793)

    Result = SUCCESS
Aliaksey Kandratsenka :
Files :
* src/ns_vbm_sup.erl

Aliaksey Kandratsenka :
Files :
* src/ns_vbm_sup.erl
* src/cb_gen_vbm_sup.erl
Generated at Fri Oct 24 05:49:56 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.