[MB-8094] [Doc'd in 2.0.2] 1.8.1 & 2.0.2 mixed cluster: Rebalance exited with reason {badarg, [{ns_rebalancer, '-wait_for_memcached/3-lc$^0/1-0-',2}, {ns_rebalancer,wait_for_memcached, Created: 15/Apr/13  Updated: 15/May/13  Resolved: 08/May/13

Status: Closed
Project: Couchbase Server
Component/s: documentation, ns_server
Affects Version/s: 2.1.0
Fix Version/s: 2.1.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-new-rebalance-mixed-cluster/62/consoleFull
./testrunner -i /tmp/rebalance_in.ini get-logs=True,wait_timeout=180,GROUP=P0,EXCLUDE_GROUP=FROM_2_0,get-cbcollect-info=True -t rebalance.rebalancein.RebalanceInTests.rebalance_in_with_ops,nodes_in=3,GROUP=IN;P0
rebalance_in_with_ops (rebalance.rebalancein.RebalanceInTests) ...


mixed suite/cluster:
1.8.1-937-rel
10.3.3.92
10.3.3.94
10.3.3.93

2.0.2-764-rel
10.3.3.99
10.3.3.91
10.3.3.82
10.3.3.97

Add one 1.8.1 node (10.3.3.93) and two 2.0.2 nodes (10.3.3.82, 10.3.3.99) to 10.3.3.92 (1.8.1), then rebalance:

2013-04-14 09:06:38 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.3.3.92%2Cns_1%4010.3.3.82%2Cns_1%4010.3.3.99%2Cns_1%4010.3.3.93
2013-04-14 09:06:38 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance operation started
2013-04-14 09:06:38 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 0 %
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
2013-04-14 09:06:48 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] Latest logs from UI:
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.99', u'code': 0, u'text': u"Candidate got master heartbeat from node 'ns_1@10.3.3.92' which has lower priority. Will try to take over.", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1365956005954.0, u'type': u'info'}
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.82', u'code': 0, u'text': u"Candidate got master heartbeat from node 'ns_1@10.3.3.92' which has lower priority. Will try to take over.", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1365956005954.0, u'type': u'info'}
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.93', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.3.3.93\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1365956004357.0, u'type': u'info'}
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.92', u'code': 2, u'text': u"Rebalance exited with reason {badarg,\n [{ns_rebalancer,\n '-wait_for_memcached/3-lc$^0/1-0-',2},\n {ns_rebalancer,wait_for_memcached,3},\n {ns_rebalancer,'-rebalance/3-fun-0-',5},\n {lists,foreach,2},\n {ns_rebalancer,rebalance,3}]}\n", u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1365956004324.0, u'type': u'info'}
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.99', u'code': 0, u'text': u"Candidate got master heartbeat from node 'ns_1@10.3.3.92' which has lower priority. But I won't try to take over since rebalance seems to be running", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1365956003954.0, u'type': u'info'}
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.82', u'code': 0, u'text': u"Candidate got master heartbeat from node 'ns_1@10.3.3.92' which has lower priority. But I won't try to take over since rebalance seems to be running", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1365956003954.0, u'type': u'info'}
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.82', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.3.3.82\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1365956003367.0, u'type': u'info'}
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.99', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.3.3.99\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1365956003359.0, u'type': u'info'}
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.92', u'code': 0, u'text': u'Started rebalancing bucket default', u'shortText': u'message', u'module': u'ns_rebalancer', u'tstamp': 1365956003298.0, u'type': u'info'}
2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'node': u'ns_1@10.3.3.92', u'code': 4, u'text': u"Starting rebalance, KeepNodes = ['ns_1@10.3.3.92','ns_1@10.3.3.82',\n 'ns_1@10.3.3.99','ns_1@10.3.3.93'], EjectNodes = []\n", u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1365956003257.0, u'type': u'info'}
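The UI log entries above are printed newest first, which makes the sequence hard to follow. A minimal sketch for reading them oldest first is below. It assumes `tstamp` is milliseconds since the Unix epoch (the values are consistent with 14 April 2013, but the log does not state the unit), and the helper name `chronological` is hypothetical, not part of testrunner. The three entries are copied from the output above with their text abbreviated.

```python
from datetime import datetime, timezone


def chronological(entries):
    """Return UI log entries oldest first as (utc_time, node, module, text)."""
    ordered = sorted(entries, key=lambda e: e["tstamp"])
    return [
        (
            # Assumption: tstamp is milliseconds since the Unix epoch.
            datetime.fromtimestamp(e["tstamp"] / 1000.0, tz=timezone.utc),
            e["node"],
            e["module"],
            e["text"],
        )
        for e in ordered
    ]


# Entries taken from the log above; text shortened for readability.
entries = [
    {"node": "ns_1@10.3.3.99", "module": "mb_master",
     "tstamp": 1365956005954.0, "text": "Will try to take over."},
    {"node": "ns_1@10.3.3.92", "module": "ns_orchestrator",
     "tstamp": 1365956004324.0, "text": "Rebalance exited with reason {badarg, ...}"},
    {"node": "ns_1@10.3.3.92", "module": "ns_orchestrator",
     "tstamp": 1365956003257.0, "text": "Starting rebalance"},
]

for when, node, module, text in chronological(entries):
    print(when.strftime("%H:%M:%S"), node, module, text)
```

Read in this order, the 1.8.1 node (10.3.3.92) starts the rebalance, the rebalance exits about one second later, and only after that do the 2.0.2 nodes report that they will try to take over as master.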

 Comments   
Comment by Andrei Baranouski [ 15/Apr/13 ]
https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.82-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.91-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.92-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.93-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.94-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.97-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.99-diag.txt.gz
Comment by Aleksey Kondratenko [ 15/Apr/13 ]
We cannot retroactively fix 1.8.1.

The problem is that you're requesting rebalance too soon after joining nodes to the cluster, so the 1.8.1 node remains master and runs the rebalance.

Whoever does scripted rolling upgrades needs to add a delay of, let's say, 10 seconds between adding the first 2.0.2 node to the 1.8.1 cluster and requesting rebalance.
Comment by Maria McDuff (Inactive) [ 22/Apr/13 ]
By design.
Comment by Maria McDuff (Inactive) [ 22/Apr/13 ]
Karen, pls doc:

per Alk:
Whoever does scripted rolling upgrades needs to add a delay of, let's say, 10 seconds between adding the first 2.0.2 node to the 1.8.1 cluster and requesting rebalance.
Comment by Thuan Nguyen [ 29/Apr/13 ]
Integrated in win-ui-testing-P0 #46 (See [http://qa.hq.northscale.net/job/win-ui-testing-P0/46/])
    MB-8094: sleep 10sec before rebalance(mixed cluster) (Revision 2541a334a1aca3065e6c7a9475373fe28147c2fe)

     Result = SUCCESS
andrei :
Files :
* lib/tasks/task.py
Comment by kzeller [ 08/May/13 ]
Added to 2.0.2 RN as:

If you perform an online upgrade and rebalance with 1.8.1 and 2.0.2 nodes, the rebalance may fail
with the badarg error reported above. This is caused by requesting rebalance too quickly after adding a node.
To avoid this problem, script a delay of 10 seconds after you add a node and before you request rebalance.

added to Use Online Upgrades for Couchbase Server 1.8 to Couchbase Server 2.0:

Be aware that if you perform a scripted online upgrade from 1.8.x to 2.0 you should have a 10 second delay between adding a 2.0 node to the cluster and rebalancing. If you request rebalance too soon after adding a 2.0 node, the rebalance may fail.
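For scripted upgrades, the delay described above could look roughly like the sketch below. It is not the change committed to lib/tasks/task.py. `add_node` and `start_rebalance` are hypothetical callables that the caller supplies (for example, thin wrappers around whatever REST client the script already uses), and the 10-second default simply follows the suggestion in this ticket.

```python
import time


def add_nodes_then_rebalance(add_node, start_rebalance, new_nodes,
                             settle_seconds=10, sleep=time.sleep):
    """Add each node, wait for the cluster to settle, then request rebalance.

    add_node and start_rebalance are hypothetical caller-supplied callables;
    sleep is injectable so the delay can be replaced in tests.
    """
    for node in new_nodes:
        add_node(node)
    if new_nodes:
        # Per the comments in this ticket: give a newly added 2.0.x node time
        # to take over as master before the rebalance request is sent.
        sleep(settle_seconds)
    return start_rebalance()
```

Passing the two operations in as callables keeps the sketch independent of any particular client library. A fixed sleep is the simplest form of the workaround; the ticket does not describe a way to detect that mastership has actually moved.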
Comment by Maria McDuff (Inactive) [ 08/May/13 ]
andrei,

pls review the release note from karen.
Comment by Andrei Baranouski [ 15/May/13 ]
approve
Generated at Thu Jul 24 04:52:51 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.