[MB-7168] [Doc'd 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node) Created: 13/Nov/12  Updated: 11/Oct/13  Resolved: 10/Oct/13

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0, 2.0.1, 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Andrei Baranouski Assignee: Aleksey Kondratenko
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: GZip Archive 01774fe6-193d-4949-a7f5-eec408f72856-10.1.3.114-diag.txt.gz     GZip Archive 01774fe6-193d-4949-a7f5-eec408f72856-10.1.3.115-diag.txt.gz     GZip Archive 01774fe6-193d-4949-a7f5-eec408f72856-10.1.3.116-diag.txt.gz     GZip Archive 01774fe6-193d-4949-a7f5-eec408f72856-10.1.3.117-diag.txt.gz     GZip Archive 01774fe6-193d-4949-a7f5-eec408f72856-10.1.3.118-diag.txt.gz     GZip Archive 5286537d-1565-4e27-a147-8c8b8f064759-10.1.3.114-diag.txt.gz     GZip Archive 5286537d-1565-4e27-a147-8c8b8f064759-10.1.3.116-diag.txt.gz     GZip Archive 5286537d-1565-4e27-a147-8c8b8f064759-10.1.3.118-diag.txt.gz    
Flagged:
Release Note

 Description   
version=2.0.0-1947-rel
http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/448/consoleFull
/testrunner -i resources/jenkins/centos-64-5node-failover.ini get-logs=True,GROUP=BAT -t failovertests.FailoverTests.test_failover_firewall,replica=1,keys_count=100000,dgm_run=True,GROUP=BAT

steps:
1. 3 nodes in cluster: 10.1.3.114, 10.1.3.118, 10.1.3.116
2. Enabled firewall on 10.1.3.118: /sbin/iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
3. Node marked unhealthy and failed over:
[2012-11-12 11:28:37,102] - [failovertests:260] INFO - node 10.1.3.118:8091 is 'unhealthy' as expected
[2012-11-12 11:29:07,544] - [rest_client:849] INFO - fail_over successful
[2012-11-12 11:29:07,545] - [failovertests:278] INFO - failed over node : ns_1@10.1.3.118
[2012-11-12 11:29:07,545] - [failovertests:292] INFO - 10 seconds sleep after failover before invoking rebalance...
4. [2012-11-12 11:29:17,545] - [rest_client:883] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.118&user=Administrator&knownNodes=ns_1%4010.1.3.114%2Cns_1%4010.1.3.118%2Cns_1%4010.1.3.116

result:
[2012-11-12 11:29:17,562] - [rest_client:890] INFO - rebalance operation started
[2012-11-12 11:29:17,569] - [rest_client:986] INFO - rebalance percentage : 0 %
[2012-11-12 11:29:19,573] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:21,577] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:23,584] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:25,589] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:27,593] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:29,599] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:31,603] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:33,607] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:35,612] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:37,616] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:39,621] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:41,626] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:43,630] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:45,635] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:47,640] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:49,644] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:51,648] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:53,652] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:55,657] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:57,662] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:29:59,667] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:30:01,671] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:30:03,676] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:30:05,682] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:30:07,698] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:30:09,703] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:30:11,708] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:30:13,712] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:30:15,716] - [rest_client:986] INFO - rebalance percentage : 0.0 %
[2012-11-12 11:30:17,721] - [rest_client:971] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed

[rebalance:info,2012-11-12T11:49:03.581,ns_1@10.1.3.114:<0.20240.1>:ns_rebalancer:rebalance:258]Waiting for bucket "default" to be ready on ['ns_1@10.1.3.114',
                                             'ns_1@10.1.3.116']
[ns_server:info,2012-11-12T11:49:09.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
                                        {dict,3,16,16,8,80,48,
                                         {[],[],[],[],[],[],[],[],[],[],[],[],
                                          [],[],[],[]},
                                         {{[],[],[],[],
                                           [['ns_1@10.1.3.114'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.116'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.118'|0.0]],
                                           [],[],[],[],[],[],[]}}},
                                        ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
                                        [],
                                        ['ns_1@10.1.3.118']}
[ns_server:info,2012-11-12T11:49:19.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
                                        {dict,3,16,16,8,80,48,
                                         {[],[],[],[],[],[],[],[],[],[],[],[],
                                          [],[],[],[]},
                                         {{[],[],[],[],
                                           [['ns_1@10.1.3.114'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.116'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.118'|0.0]],
                                           [],[],[],[],[],[],[]}}},
                                        ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
                                        [],
                                        ['ns_1@10.1.3.118']}
[ns_server:info,2012-11-12T11:49:23.120,ns_1@10.1.3.114:<0.20319.1>:compaction_daemon:try_to_cleanup_indexes:439]Cleaning up indexes for bucket `default`
[ns_server:info,2012-11-12T11:49:23.121,ns_1@10.1.3.114:<0.20319.1>:compaction_daemon:spawn_bucket_compactor:404]Compacting bucket default with config:
[{database_fragmentation_threshold,{30,undefined}},
 {view_fragmentation_threshold,{30,undefined}}]
[ns_server:info,2012-11-12T11:49:24.348,ns_1@10.1.3.114:ns_config_rep<0.358.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.1.3.116'

[ns_server:info,2012-11-12T11:49:29.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
                                        {dict,3,16,16,8,80,48,
                                         {[],[],[],[],[],[],[],[],[],[],[],[],
                                          [],[],[],[]},
                                         {{[],[],[],[],
                                           [['ns_1@10.1.3.114'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.116'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.118'|0.0]],
                                           [],[],[],[],[],[],[]}}},
                                        ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
                                        [],
                                        ['ns_1@10.1.3.118']}
[ns_server:warn,2012-11-12T11:49:29.546,ns_1@10.1.3.114:capi_set_view_manager-default<0.15508.1>:capi_set_view_manager:handle_info:345]Remote server node {'capi_ddoc_replication_srv-default','ns_1@10.1.3.118'} process down: noconnection
[ns_server:warn,2012-11-12T11:49:29.546,ns_1@10.1.3.114:xdc_rdoc_replication_srv<0.470.0>:xdc_rdoc_replication_srv:handle_info:128]Remote server node {xdc_rdoc_replication_srv,'ns_1@10.1.3.118'} process down: noconnection
[user:warn,2012-11-12T11:49:29.546,ns_1@10.1.3.114:ns_node_disco<0.351.0>:ns_node_disco:handle_info:168]Node 'ns_1@10.1.3.114' saw that node 'ns_1@10.1.3.118' went down.
[error_logger:error,2012-11-12T11:49:29.546,ns_1@10.1.3.114:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Node 'ns_1@10.1.3.118' not responding **
** Removing (timedout) connection **

[ns_server:info,2012-11-12T11:49:30.536,ns_1@10.1.3.114:ns_config_rep<0.358.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.1.3.116'

[ns_server:info,2012-11-12T11:49:39.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
                                        {dict,3,16,16,8,80,48,
                                         {[],[],[],[],[],[],[],[],[],[],[],[],
                                          [],[],[],[]},
                                         {{[],[],[],[],
                                           [['ns_1@10.1.3.114'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.116'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.118'|0.0]],
                                           [],[],[],[],[],[],[]}}},
                                        ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
                                        [],
                                        ['ns_1@10.1.3.118']}
[ns_server:info,2012-11-12T11:49:49.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
                                        {dict,3,16,16,8,80,48,
                                         {[],[],[],[],[],[],[],[],[],[],[],[],
                                          [],[],[],[]},
                                         {{[],[],[],[],
                                           [['ns_1@10.1.3.114'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.116'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.118'|0.0]],
                                           [],[],[],[],[],[],[]}}},
                                        ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
                                        [],
                                        ['ns_1@10.1.3.118']}
[ns_server:info,2012-11-12T11:49:53.135,ns_1@10.1.3.114:<0.20445.1>:compaction_daemon:try_to_cleanup_indexes:439]Cleaning up indexes for bucket `default`
[ns_server:info,2012-11-12T11:49:53.136,ns_1@10.1.3.114:<0.20445.1>:compaction_daemon:spawn_bucket_compactor:404]Compacting bucket default with config:
[{database_fragmentation_threshold,{30,undefined}},
 {view_fragmentation_threshold,{30,undefined}}]
[ns_server:info,2012-11-12T11:49:59.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
                                        {dict,3,16,16,8,80,48,
                                         {[],[],[],[],[],[],[],[],[],[],[],[],
                                          [],[],[],[]},
                                         {{[],[],[],[],
                                           [['ns_1@10.1.3.114'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.116'|0.0]],
                                           [],
                                           [['ns_1@10.1.3.118'|0.0]],
                                           [],[],[],[],[],[],[]}}},
                                        ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
                                        [],
                                        ['ns_1@10.1.3.118']}
[user:info,2012-11-12T11:50:03.582,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {not_all_nodes_are_ready_yet,
                                 ['ns_1@10.1.3.114','ns_1@10.1.3.116']}






 Comments   
Comment by Andrei Baranouski [ 13/Nov/12 ]
almost all tests fail due to rebalance failure after failing over nodes on version=2.0.0-1949-rel
http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/449/consoleFull
Comment by Farshid Ghods (Inactive) [ 19/Nov/12 ]
if this is an automated test, please rerun and confirm the behavior twice
Comment by Steve Yen [ 19/Nov/12 ]
per bug-scrub2
Comment by Andrei Baranouski [ 19/Nov/12 ]
3 tests are constantly failing for the same reason: http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/457/consoleFull (build 1954)
Comment by Iryna Mironava [ 20/Nov/12 ]
can be easily reproduced on Windows; more info is in the duplicate bug MB-7205 (reproduced manually)
Comment by Farshid Ghods (Inactive) [ 20/Nov/12 ]
so this failure occurs when the other node is unavailable?

Iryna, when you reproduce this on Windows manually, does the rebalance succeed the second time?
do you see any data loss? does rebalance fail immediately?
Comment by Iryna Mironava [ 20/Nov/12 ]
it does not succeed when retried within some time window; after about 15 minutes it succeeds.
there is no data loss.
rebalance fails in the first 1-5 mins of rebalancing
Comment by Aleksey Kondratenko [ 20/Nov/12 ]
I cannot see any clear evidence of what caused this, but it looks bad enough. The 60-second timeout to query vbucket states on both nodes was hit here.

I also see that the janitor pass immediately preceding the rebalance also failed, but in a different phase: waiting for vbucket change requests to complete, which also failed on both nodes. That timeout is 30 seconds.


I'd like the diags to be grabbed _immediately_ after the rebalance fails so that I can see the state of janitor_agent and ns_memcached on each node, i.e. without doing any cleanup. May I have that?
Comment by Aleksey Kondratenko [ 20/Nov/12 ]
See above
Comment by Andrei Baranouski [ 21/Nov/12 ]
logs immediately after rebalance failure
Comment by Andrei Baranouski [ 21/Nov/12 ]
rebalance is successful the second time
Comment by Aleksey Kondratenko [ 21/Nov/12 ]
It's not quite immediate, sadly; it's a few seconds after the problem. And the backtraces on .118 (the node where we failed waiting) don't show anything bad, sadly.
Comment by Farshid Ghods (Inactive) [ 21/Nov/12 ]
given that rebalance succeeded the second time, moving this to 2.0.1
Comment by Farshid Ghods (Inactive) [ 09/Jan/13 ]
Andrei,

is this test failing with the latest 2.0.1 build?
Comment by Andrei Baranouski [ 10/Jan/13 ]
yes, 3 tests constantly fail in the failover job: http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-failover-tests/501/consoleFull
Comment by Andrei Baranouski [ 10/Jan/13 ]
2.0.1-120-rel
Comment by Farshid Ghods (Inactive) [ 17/Jan/13 ]
per bug scrub

retrying rebalance after a few minutes should work.
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Aliaksey,

I know this was discussed before, but we want to confirm what the test does against your conclusion.

we put the node behind the firewall, then wait until the node is marked as unhealthy by ns_server,
then fail over this node and click on rebalance to eject the node (ejectedNodes = 10.1.3.118).

we want to know why, when the node was already failed over from the cluster, we wait for it to be ready as part of rebalance (or did the timeouts happen on the existing nodes?).

Comment by Aleksey Kondratenko [ 22/Jan/13 ]
This is quite subtle to explain fully, but the main problem is that we have to wait until the failover actually completes internally, and that is subject to timeouts in certain areas.

So if you keep trying for 2-3 minutes it should eventually work.
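For the automated test, one way to follow this advice is a bounded retry loop against the REST API. A rough sketch only (not the actual testrunner code; it assumes plain HTTP calls to /controller/rebalance and /pools/default/rebalanceProgress, with the node names and credentials taken from the log above):

    import time
    import requests  # sketch only: raw REST calls instead of testrunner's rest_client

    BASE = "http://10.1.3.114:8091"            # any node that stays in the cluster
    AUTH = ("Administrator", "password")

    def try_rebalance(known_nodes, ejected_nodes):
        """Start a rebalance and poll until it stops; return True on clean completion."""
        requests.post(BASE + "/controller/rebalance", auth=AUTH,
                      data={"knownNodes": ",".join(known_nodes),
                            "ejectedNodes": ",".join(ejected_nodes)})
        time.sleep(1)  # give the orchestrator a moment to register the rebalance
        while True:
            prog = requests.get(BASE + "/pools/default/rebalanceProgress", auth=AUTH).json()
            if prog.get("status") != "running":
                # on failure the progress reply carries an errorMessage, as seen in the log
                return "errorMessage" not in prog
            time.sleep(2)

    # keep retrying for up to ~3 minutes, as suggested above
    deadline = time.time() + 180
    while not try_rebalance(["ns_1@10.1.3.114", "ns_1@10.1.3.116", "ns_1@10.1.3.118"],
                            ["ns_1@10.1.3.118"]) and time.time() < deadline:
        time.sleep(10)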
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
>>but the main problem is that we have to wait until the failover actually completes internally, and that is subject to timeouts in certain areas.
so when the failover REST API returns, it does not mean that the failover process has completed internally.
is failover a synchronous call or asynchronous? In case it's asynchronous, is there a way for a test to check before initiating a rebalance, so that we have a test that is more deterministic?
Comment by Aleksey Kondratenko [ 22/Jan/13 ]
It is best-effort sync. If it fails to complete synchronously within a reasonably short timeout, it'll silently become async.

There's no way to detect that right now.
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
per bug scrub - assigning to Jin
Comment by Jin Lim [ 24/Jan/13 ]
A few things came out of the engineering talk regarding this issue:
1) This is a good catch in that it confirms we need a better way of handling the API, since it cannot 100% guarantee the completion of failover.
2) However, it is not critical or a blocker, since the symptom is far more likely (highly probable) while running an automated test case.
3) The only feasible approach (per the ns_server team) for now is to wait and retry.

Based on this, and the fact that the fix would require changes across components (ep-engine, etc.), we may want to consider making this a future enhancement.
Assigning this to Yaseen for his input here. Please assign it back to Jin or Dipti afterwards. Thanks.
Comment by Jin Lim [ 24/Jan/13 ]
Please see the above comment.
Comment by Farshid Ghods (Inactive) [ 24/Jan/13 ]
Thanks, Jin, for confirming this. I think this expected behavior can be included in the release notes as well so that users and the support team can be aware of the issue and the suggested workaround.

Andrei,
can you then modify the test accordingly.

Comment by Farshid Ghods (Inactive) [ 28/Jan/13 ]
per bug scrub :

please revise the test accordingly, and after running the test a few times, can you propose how long a customer should wait before kicking off the rebalance again?
Comment by Jin Lim [ 28/Jan/13 ]
Please assign it back to Jin after having enough information for recommendation. Will follow up with the doc team.
Comment by Andrei Baranouski [ 29/Jan/13 ]
with a 90 sec timeout after failover before rebalance, the tests passed
will launch with 60 sec
Comment by Farshid Ghods (Inactive) [ 30/Jan/13 ]
can you try with a lower timeout (10 seconds, for instance)?
Comment by Andrei Baranouski [ 31/Jan/13 ]
10 secs was the initial state for this ticket.

[2012-11-12 11:29:07,545] - [failovertests:292] INFO - 10 seconds sleep after failover before invoking rebalance...

with 60 secs the tests passed

will try with 30 sec
Comment by Jin Lim [ 11/Feb/13 ]
any update on the test w/ 30 sec?
Comment by Andrei Baranouski [ 11/Feb/13 ]
30 sec, build 151, tests passed http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-failover-tests/536/
Comment by Farshid Ghods (Inactive) [ 11/Feb/13 ]
Karen,

could you please add to the release notes that the user needs to wait for 30 seconds before they attempt to run a rebalance operation after failing over a node.
The comments on this bug should explain under what conditions this 30-second delay is needed.
Comment by kzeller [ 13/Feb/13 ]
Jin,


Can you summarize the situation when someone should wait 30 seconds to reattempt rebalance?


Thanks!
Comment by Jin Lim [ 11/Mar/13 ]
The failover REST API is a sync operation with a timeout. When it fails to complete the failover process within the timeout period, it internally switches to an async operation (continues the failover to completion) and returns immediately. A subsequent rebalance in this case would fail because the failover process is still running. The user can wait between 30 seconds and a minute and reattempt the rebalance.
Comment by Jin Lim [ 11/Mar/13 ]
The failover REST API timeout is 30 seconds.
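To illustrate the recommended workaround in script form (a sketch only; it assumes the standard /controller/failOver and /controller/rebalance REST endpoints and reuses the node names and credentials from the logs above):

    import time
    import requests  # sketch only, not the actual test code

    BASE = "http://10.1.3.114:8091"
    AUTH = ("Administrator", "password")

    # 1. Fail over the unreachable node. The call may return before the failover
    #    has fully completed internally (best-effort sync, as described above).
    requests.post(BASE + "/controller/failOver", auth=AUTH,
                  data={"otpNode": "ns_1@10.1.3.118"})

    # 2. Give the internal failover time to finish: 30-60 seconds, per the comment above.
    time.sleep(60)

    # 3. Rebalance the failed-over node out of the cluster.
    requests.post(BASE + "/controller/rebalance", auth=AUTH,
                  data={"knownNodes": "ns_1@10.1.3.114,ns_1@10.1.3.116,ns_1@10.1.3.118",
                        "ejectedNodes": "ns_1@10.1.3.118"})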
Comment by kzeller [ 11/Mar/13 ]
<para>
A cluster rebalance may exit and produce the error
{not_all_nodes_are_ready_yet} if you perform the rebalance right
after failing over a node in the cluster. You may need to
wait 30 seconds after the node failover before you
attempt the cluster rebalance.
</para>
<para>This is because the failover REST API is a synchronous operation with a timeout. If it fails to
complete the failover process by the timeout, the operation internally switches into an
asynchronous operation. It will immediately return and continue the failover in the background, which will cause
rebalance to fail since the failover operation is still running.</para>
Comment by kzeller [ 11/Mar/13 ]
Added to RN as :

<para>
A cluster rebalance may exit and produce the error
{not_all_nodes_are_ready_yet} if you perform the rebalance right
after failing over a node in the cluster. You may need to
wait 30 seconds after the node failover before you
attempt the cluster rebalance.
</para>
<para>This is because the failover REST API is a synchronous operation with a timeout. If it fails to
complete the failover process by the timeout, the operation internally switches into an
asynchronous operation. It will immediately return and continue the failover in the background, which will cause
rebalance to fail since the failover operation is still running.</para>
Comment by Andrei Baranouski [ 10/Apr/13 ]
Jin, I got the same error with 30 seconds waiting after failover on build 2.0.2-761-rel
Should we update RN for 2.0.1/next 2.0.2 release?
http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-failover-tests/583/consoleFull

[2013-04-08 19:41:13,981] - [failovertests:148] INFO - failed over node : ns_1@10.1.3.115
[2013-04-08 19:41:13,981] - [failovertests:163] INFO - 30 seconds sleep after failover before invoking rebalance...
[2013-04-08 19:41:43,981] - [rest_client:834] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.115&user=Administrator&knownNodes=ns_1%4010.1.3.114%2Cns_1%4010.1.3.115%2Cns_1%4010.1.3.118%2Cns_1%4010.1.3.116
[2013-04-08 19:41:43,991] - [rest_client:838] INFO - rebalance operation started
[2013-04-08 19:41:44,004] - [rest_client:940] INFO - rebalance percentage : 0 %
[2013-04-08 19:41:46,009] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:41:48,014] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:41:50,018] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:41:52,023] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:41:54,029] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:41:56,033] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:41:58,039] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:00,043] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:02,048] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:04,056] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:06,061] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:08,066] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:10,071] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:12,077] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:14,084] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:16,089] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:18,096] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:20,102] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:22,108] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:24,113] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:26,119] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:28,125] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:30,130] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:32,137] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:34,143] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:36,152] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:38,157] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:40,163] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:42,168] - [rest_client:940] INFO - rebalance percentage : 0.0 %
[2013-04-08 19:42:44,174] - [rest_client:923] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
[2013-04-08 19:42:44,175] - [rest_client:924] INFO - Latest logs from UI:
[2013-04-08 19:42:44,273] - [rest_client:925] ERROR - {u'node': u'ns_1@10.1.3.114', u'code': 2, u'text': u"Rebalance exited with reason {not_all_nodes_are_ready_yet,\n
Comment by Andrei Baranouski [ 10/Apr/13 ]
it's happened twice in this run
Comment by Jin Lim [ 10/Apr/13 ]
Thanks for the update, Andrei. Before we update the RN first let's figure out how long a user should wait before retrying the rebalance. Can we upgrade the wait period to 1 minute and see if that is long enough? Thanks.
Comment by Andrei Baranouski [ 11/Apr/13 ]
set the timeout to 1 min: http://review.couchbase.org/#/c/25587/
before that, when we had a 60 sec timeout, the tests did not fail:
http://www.couchbase.com/issues/browse/MB-7168?focusedCommentId=49128&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-49128
Comment by Thuan Nguyen [ 17/Apr/13 ]
Integrated in ui-testing #42 (See [http://qa.hq.northscale.net/job/ui-testing/42/])
    
     Result = SUCCESS
Comment by Andrei Baranouski [ 22/Apr/13 ]
tested with a 60 sec timeout, all tests passed
Comment by Andrei Baranouski [ 22/Apr/13 ]
Jin, I think noting in the RN to wait 60 seconds after failover should be okay
Comment by Wayne Siu [ 29/Apr/13 ]
Karen, please update the release notes to suggest 60 sec (not 30). Also lowering the priority from Blocker to Critical as it's not a 2.0.2 release blocking issue.
Comment by kzeller [ 08/May/13 ]
Ok, changed to : A cluster rebalance may exit and produce the error
{not_all_nodes_are_ready_yet} if you perform the rebalance right
after failing over a node in the cluster. You may need to
wait 60 seconds after the node failover before you
attempt the cluster rebalance.

in RN 2.0.2
Comment by kzeller [ 08/May/13 ]
added to RN 2.0.2
Comment by Thuan Nguyen [ 09/May/13 ]
Integrated in win-ui-testing-P0 #56 (See [http://qa.hq.northscale.net/job/win-ui-testing-P0/56/])
    MB-7168: sleep 30 sec after reb falied when killed memcached (Revision 44f755b962b7987d5b972caf9a283baa95edaed1)

     Result = SUCCESS
andrei :
Files :
* pytests/swaprebalance.py
Comment by Wayne Siu [ 09/May/13 ]
Verified that the doc change has also gone to 2.0.1 RN.

http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-server-rn_2-0-0l.html

Comment by Wayne Siu [ 09/May/13 ]
Assigning to PM for the next step.
Comment by Anil Kumar [ 10/May/13 ]
discussed with ALK; this bug will be fixed in 2.1. We talked about having a UI alert, but since this happens only when a node dies completely, for now a Release Note as a Known Issue should be fine. thanks
Comment by Aleksey Kondratenko [ 10/May/13 ]
Updated ticket to reflect badness of this and must-have-ness for 2.1
Comment by kzeller [ 10/May/13 ]
added flag to include 2.0.1 release note to 2.0.2.
Comment by Thuan Nguyen [ 10/May/13 ]
Integrated in windows32_sanity_P0 #29 (See [http://qa.hq.northscale.net/job/windows32_sanity_P0/29/])
    
     Result = UNSTABLE
Comment by Thuan Nguyen [ 12/May/13 ]
Integrated in windows32_view_P0 #6 (See [http://qa.hq.northscale.net/job/windows32_view_P0/6/])
    
     Result = UNSTABLE
Comment by kzeller [ 11/Jul/13 ]
Removing RN flag until ID'd by QA/Eng for 2.2.0
Comment by Aleksey Kondratenko [ 21/Aug/13 ]
We've discussed this as part of scrum planning.

The thinking is that the UPR changes have the best chance of addressing this. Otherwise it's hard, and too late for 2.2.0.
Comment by Aleksey Kondratenko [ 26/Aug/13 ]
Removing this from our backlog. Given that this was moved out of 2.2.0 and some UPR work is planned for 3.0 (which will change a lot in this area), there's nothing we can do at this time; we'll wait until the UPR work gets clearer.
Comment by kzeller [ 26/Aug/13 ]
Added as Known issue for 2.2 in RN.

     * A cluster rebalance may exit and produce the error {not_all_nodes_are_ready_yet} if you perform the rebalance right after failing over a node in the cluster. You may need to wait 60 seconds after the node failover before you attempt the cluster rebalance.

      This is because the failover REST API is a synchronous operation with a timeout. If it fails to complete the failover process by the timeout, the operation internally switches into an asynchronous operation. It will immediately return and continue the failover in the background, which will cause rebalance to fail since the failover operation is still running.

      *Issues* : [MB-7168](http://www.couchbase.com/issues/browse/MB-7168)
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
MB-8039 fixed the problem for all but the master node.

Fixing it for the master node is possible work for 3.0, but will take at least a week.
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
MB-9321
Comment by Maria McDuff (Inactive) [ 11/Oct/13 ]
closing as dupes.