[MB-7508] Rebalance exited with reason {bulk_set_vbucket_state_failed, ...} after crasher in new_ns_replicas_builder (rebalance continued after killing memcached) Created: 09/Jan/13  Updated: 23/Apr/13  Resolved: 21/Jan/13

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 2.0.1
Fix Version/s: 2.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
build 2.0.1-120-rel
http://qa.hq.northscale.net/job/centos-32-2.0-swaprebalance-tests/250/consoleFull
/testrunner -i /tmp/swaprebalance-cent-32.ini get-logs=True,GROUP=P0 -t swaprebalance.SwapRebalanceFailedTests.test_failed_swap_rebalance,replica=1,num-buckets=4,num-swap=3,swap-orchestrator=True,GROUP=P0

steps:
1.rebalance params : password=password&ejectedNodes=ns_1%4010.3.2.146%2Cns_1%4010.3.2.152%2Cns_1%4010.3.2.145&user=Administrator&knownNodes=ns_1%4010.3.2.146%2Cns_1%4010.3.2.147%2Cns_1%4010.3.2.149%2Cns_1%4010.3.2.148%2Cns_1%4010.3.2.145%2Cns_1%4010.3.2.152
2. kill memcached when rebalance reached >40%
[2013-01-08 13:47:49,441] - [rest_client:881] INFO - rebalance percentage : 40.0627668879 %
[2013-01-08 13:47:51,444] - [rest_client:73] INFO - rebalance reached >40.0627668879% in 663.750247002 seconds
[2013-01-08 13:47:51,971] - [remote_util:116] INFO - connecting to 10.3.2.149 with username : root password : couchbase ssh_key:
[2013-01-08 13:47:52,196] - [remote_util:148] INFO - Connected
[2013-01-08 13:47:52,557] - [remote_util:1128] INFO - running command.raw sudo cat /proc/cpuinfo
[2013-01-08 13:47:52,657] - [remote_util:1157] INFO - command executed successfully
[2013-01-08 13:47:52,658] - [remote_util:1128] INFO - running command.raw sudo df -Th
[2013-01-08 13:47:52,774] - [remote_util:1157] INFO - command executed successfully
[2013-01-08 13:47:52,775] - [remote_util:1128] INFO - running command.raw sudo cat /proc/meminfo
[2013-01-08 13:47:52,882] - [remote_util:1157] INFO - command executed successfully
[2013-01-08 13:47:52,883] - [remote_util:1128] INFO - running command.raw hostname
[2013-01-08 13:47:52,991] - [remote_util:1157] INFO - command executed successfully
[2013-01-08 13:47:52,993] - [remote_util:1128] INFO - running command.raw ps -eo comm,pid | awk '$1 == "memcached" { print $2 }'
[2013-01-08 13:47:53,107] - [remote_util:1157] INFO - command executed successfully
[2013-01-08 13:47:53,108] - [swaprebalance:401] INFO - os:cmd("kill -9 3663 ")
[2013-01-08 13:47:53,130] - [rest_client:803] INFO - /diag/eval status: True content: [] command: os:cmd("kill -9 3663 ")
[2013-01-08 13:47:53,130] - [swaprebalance:403] INFO - killed 10.3.2.149:8091?? (True, '[]')
[2013-01-08 13:47:53,131] - [swaprebalance:404] INFO - sleep for 10 sec after kill memcached
...
[2013-01-08 13:49:04,037] - [swaprebalance:412] INFO - rebalance progress: 42.5532203758
[2013-01-08 13:49:05,058] - [swaprebalance:415] INFO - rebalance progress status:running
[2013-01-08 13:49:05,068] - [swaprebalance:417] INFO - rebalance is still running even after restarting memcached
[2013-01-08 13:49:05,069] - [swaprebalance:379] INFO - FAIL SWAP REBALANCE PHASE @ 60
[2013-01-08 13:49:05,079] - [rest_client:881] INFO - rebalance percentage : 42.5775868475 %
...
[2013-01-08 13:52:34,570] - [rest_client:881] INFO - rebalance percentage : 50.0 %
[2013-01-08 13:52:36,597] - [rest_client:866] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
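For reference, the kill in step 2 goes through ns_server's /diag/eval REST endpoint, which evaluates an Erlang expression on the receiving node. A minimal standalone sketch of that call (host, PID, and credentials are the ones from this run; using the requests library here is an assumption, testrunner goes through its own rest_client):

        import requests

        host = "10.3.2.149"   # node whose memcached we kill
        pid = 3663            # found via: ps -eo comm,pid | awk '$1 == "memcached" { print $2 }'

        # POST an Erlang expression to /diag/eval; os:cmd/1 shells out on that node.
        resp = requests.post("http://{0}:8091/diag/eval".format(host),
                             auth=("Administrator", "password"),
                             data='os:cmd("kill -9 {0} ")'.format(pid))
        print resp.status_code, resp.text   # the log above shows (True, '[]')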

[ns_server:info,2013-01-08T12:32:23.377,ns_1@10.3.2.145:<0.32321.1>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.2.145': [<<"replication_building_138_'ns_1@10.3.2.147'">>,
                                                      <<"replication_building_138_'ns_1@10.3.2.149'">>]
[error_logger:error,2013-01-08T12:32:23.377,ns_1@10.3.2.145:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.32321.1> terminating
** Last message in was {'EXIT',<20175.7759.0>,{badmatch,{error,closed}}}
** When Server state == {state,"default",138,'ns_1@10.3.2.145',
                               [{'ns_1@10.3.2.147',<20177.4979.0>},
                                {'ns_1@10.3.2.149',<20175.7759.0>}]}
** Reason for termination ==
** {{badmatch,[{<20177.4979.0>,{badmatch,{error,closed}}},
               {<20175.7759.0>,noproc}]},
    [{misc,sync_shutdown_many_i_am_trapping_exits,1},
     {misc,try_with_maybe_ignorant_after,2},
     {gen_server,terminate,6},
     {proc_lib,init_p_do_apply,3}]}

[ns_server:error,2013-01-08T12:32:23.378,ns_1@10.3.2.145:<0.32285.1>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<0.32321.1>,
                                    {{badmatch,
                                      [{<20177.4979.0>,
                                        {badmatch,{error,closed}}},
                                       {<20175.7759.0>,noproc}]},
                                     [{misc,
                                       sync_shutdown_many_i_am_trapping_exits,
                                       1},
                                      {misc,try_with_maybe_ignorant_after,2},
                                      {gen_server,terminate,6},
                                      {proc_lib,init_p_do_apply,3}]}}]
[ns_server:error,2013-01-08T12:32:23.379,ns_1@10.3.2.145:<0.32285.1>:misc:try_with_maybe_ignorant_after:1444]Eating exception from ignorant after-block:
{error,{badmatch,[{<0.32321.1>,
                   {{badmatch,[{<20177.4979.0>,{badmatch,{error,closed}}},
                               {<20175.7759.0>,noproc}]},
                    [{misc,sync_shutdown_many_i_am_trapping_exits,1},
                     {misc,try_with_maybe_ignorant_after,2},
                     {gen_server,terminate,6},
                     {proc_lib,init_p_do_apply,3}]}}]},
       [{misc,sync_shutdown_many_i_am_trapping_exits,1},
        {misc,try_with_maybe_ignorant_after,2},
        {ns_single_vbucket_mover,mover,6},
        {proc_lib,init_p_do_apply,3}]}
[error_logger:error,2013-01-08T12:32:23.379,ns_1@10.3.2.145:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: new_ns_replicas_builder:init/1
    pid: <0.32321.1>
    registered_name: []
    exception exit: {{badmatch,[{<20177.4979.0>,{badmatch,{error,closed}}},
                                {<20175.7759.0>,noproc}]},
                     [{misc,sync_shutdown_many_i_am_trapping_exits,1},
                      {misc,try_with_maybe_ignorant_after,2},
                      {gen_server,terminate,6},
                      {proc_lib,init_p_do_apply,3}]}
      in function gen_server:terminate/6
    ancestors: [<0.32285.1>,<0.13837.1>,<0.13752.1>]
    messages: [{'EXIT',<0.32285.1>,shutdown}]
    links: [<0.32285.1>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 196418
    stack_size: 24
    reductions: 24429
  neighbours:

[views:debug,2013-01-08T12:32:23.379,ns_1@10.3.2.145:mc_couch_events<0.425.0>:capi_set_view_manager:handle_mc_couch_event:529]Got set_vbucket event for default/438. Updated state: dead (1)
[ns_server:info,2013-01-08T12:32:23.383,ns_1@10.3.2.145:janitor_agent-default<0.643.0>:janitor_agent:handle_info:676]Undoing temporary vbucket states caused by rebalance
[user:info,2013-01-08T12:32:23.383,ns_1@10.3.2.145:<0.2801.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {{bulk_set_vbucket_state_failed,
                               [{'ns_1@10.3.2.146',
                                 {'EXIT',
                                  {{{{unexpected_reason,
                                      {{badmatch,{error,closed}},
                                       [{mc_binary,quick_stats_recv,3},
                                        {mc_binary,quick_stats_loop,5},
                                        {mc_binary,quick_stats,5},
                                        {mc_client_binary,
                                         get_zero_open_checkpoint_vbuckets,3},
                                        {ebucketmigrator_srv,handle_call,3},
                                        {gen_server,handle_msg,5},
                                        {proc_lib,init_p_do_apply,3}]}},
                                     [{misc,executing_on_new_process,1},
                                      {tap_replication_manager,
                                       change_vbucket_filter,4},
                                      {tap_replication_manager,
                                       '-do_set_incoming_replication_map/3-lc$^5/1-5-',
                                       2},
                                      {tap_replication_manager,
                                       do_set_incoming_replication_map,3},
                                      {tap_replication_manager,handle_call,3},
                                      {gen_server,handle_msg,5},
                                      {proc_lib,init_p_do_apply,3}]},
                                    {gen_server,call,
                                     ['tap_replication_manager-default',
                                      {change_vbucket_replication,139,
                                       undefined},
                                      infinity]}},
                                   {gen_server,call,
                                    [{'janitor_agent-default',
                                      'ns_1@10.3.2.146'},
                                     {if_rebalance,<0.13837.1>,
                                      {update_vbucket_state,139,replica,
                                       undefined,undefined}},
                                     infinity]}}}}]},
                              [{janitor_agent,bulk_set_vbucket_state,4},
                               {ns_vbucket_mover,
                                update_replication_post_move,3},
                               {ns_vbucket_mover,on_move_done,2},
                               {gen_server,handle_msg,5},
                               {proc_lib,init_p_do_apply,3}]}

[ns_server:debug,2013-01-08T12:32:23.384,ns_1@10.3.2.145:compaction_daemon<0.467.0>:compaction_daemon:handle_info:367]Looks like vbucket mover inhibiting view compaction for for bucket "default" is dead. Canceling inhibition
[ns_server:debug,2013-01-08T12:32:23.383,ns_1@10.3.2.145:<0.13845.1>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.13837.1>} exited with reason {{bulk_set_vbucket_state_failed,
                                                                                       [{'ns_1@10.3.2.146',
                                                                                         {'EXIT',
                                                                                          {{{{unexpected_reason,
                                                                                              {{badmatch,
                                                                                                {error,
                                                                                                 closed}},
                                                                                               [{mc_binary,
                                                                                                 quick_stats_recv,
                                                                                                 3},
                                                                                                {mc_binary,
                                                                                                 quick_stats_loop,
                                                                                                 5},
                                                                                                {mc_binary,
                                                                                                 quick_stats,
                                                                                                 5},
                                                                                                {mc_client_binary,
                                                                                                 get_zero_open_checkpoint_vbuckets,
                                                                                                 3},
                                                                                                {ebucketmigrator_srv,
                                                                                                 handle_call,
                                                                                                 3},
                                                                                                {gen_server,
                                                                                                 handle_msg,
                                                                                                 5},
                                                                                                {proc_lib,
                                                                                                 init_p_do_apply,
                                                                                                 3}]}},
                                                                                             [{misc,
                                                                                               executing_on_new_process,
                                                                                               1},
                                                                                              {tap_replication_manager,
                                                                                               change_vbucket_filter,
                                                                                               4},
                                                                                              {tap_replication_manager,
                                                                                               '-do_set_incoming_replication_map/3-lc$^5/1-5-',
                                                                                               2},
                                                                                              {tap_replication_manager,
                                                                                               do_set_incoming_replication_map,
                                                                                               3},
                                                                                              {tap_replication_manager,
                                                                                               handle_call,
                                                                                               3},
                                                                                              {gen_server,
                                                                                               handle_msg,
                                                                                               5},
                                                                                              {proc_lib,
                                                                                               init_p_do_apply,
                                                                                               3}]},
                                                                                            {gen_server,
                                                                                             call,
                                                                                             ['tap_replication_manager-default',
                                                                                              {change_vbucket_replication,
                                                                                               139,
                                                                                               undefined},
                                                                                              infinity]}},
                                                                                           {gen_server,
                                                                                            call,
                                                                                            [{'janitor_agent-default',
                                                                                              'ns_1@10.3.2.146'},
                                                                                             {if_rebalance,
                                                                                              <0.13837.1>,
                                                                                              {update_vbucket_state,
                                                                                               139,
                                                                                               replica,
                                                                                               undefined,
                                                                                               undefined}},
                                                                                             infinity]}}}}]},
                                                                                      [{janitor_agent,
                                                                                        bulk_set_vbucket_state,
                                                                                        4},
                                                                                       {ns_vbucket_mover,
                                                                                        update_replication_post_move,
                                                                                        3},
                                                                                       {ns_vbucket_mover,
                                                                                        on_move_done,
                                                                                        2},
                                                                                       {gen_server,
                                                                                        handle_msg,
                                                                                        5},
                                                                                       {proc_lib,
                                                                                        init_p_do_apply,
                                                                                        3}]}
[ns_server:debug,2013-01-08T12:32:23.384,ns_1@10.3.2.145:<0.32356.1>:mc_connection:do_notify_vbucket_update:112]Signaled mc_couch_event: {set_vbucket,"default",438,dead,1}
[ns_server:debug,2013-01-08T12:32:23.387,ns_1@10.3.2.145:capi_set_view_manager-default<0.617.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[error_logger:error,2013-01-08T12:32:23.387,ns_1@10.3.2.145:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.32285.1>
    registered_name: []
    exception exit: {unexpected_exit,
                     {'EXIT',<0.13837.1>,
                      {{bulk_set_vbucket_state_failed,
                        [{'ns_1@10.3.2.146',
                          {'EXIT',
                           {{{{unexpected_reason,
                               {{badmatch,{error,closed}},
                                [{mc_binary,quick_stats_recv,3},
                                 {mc_binary,quick_stats_loop,5},
                                 {mc_binary,quick_stats,5},
                                 {mc_client_binary,
                                  get_zero_open_checkpoint_vbuckets,3},
                                 {ebucketmigrator_srv,handle_call,3},
                                 {gen_server,handle_msg,5},
                                 {proc_lib,init_p_do_apply,3}]}},
                              [{misc,executing_on_new_process,1},
                               {tap_replication_manager,
                                change_vbucket_filter,4},
                               {tap_replication_manager,
                                '-do_set_incoming_replication_map/3-lc$^5/1-5-',
                                2},
                               {tap_replication_manager,
                                do_set_incoming_replication_map,3},
                               {tap_replication_manager,handle_call,3},
                               {gen_server,handle_msg,5},
                               {proc_lib,init_p_do_apply,3}]},
                             {gen_server,call,
                              ['tap_replication_manager-default',
                               {change_vbucket_replication,139,undefined},
                               infinity]}},
                            {gen_server,call,
                             [{'janitor_agent-default','ns_1@10.3.2.146'},
                              {if_rebalance,<0.13837.1>,
                               {update_vbucket_state,139,replica,undefined,
                                undefined}},
                              infinity]}}}}]},
                       [{janitor_agent,bulk_set_vbucket_state,4},
                        {ns_vbucket_mover,update_replication_post_move,3},
                        {ns_vbucket_mover,on_move_done,2},
                        {gen_server,handle_msg,5},
                        {proc_lib,init_p_do_apply,3}]}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.13837.1>,<0.13752.1>]
    messages: [{'EXIT',<0.13837.1>,
                   {{bulk_set_vbucket_state_failed,
                     [{'ns_1@10.3.2.146',
                       {'EXIT',
                        {{{{unexpected_reason,
                            {{badmatch,{error,closed}},
                             [{mc_binary,quick_stats_recv,3},
                              {mc_binary,quick_stats_loop,5},
                              {mc_binary,quick_stats,5},
                              {mc_client_binary,
                               get_zero_open_checkpoint_vbuckets,3},
                              {ebucketmigrator_srv,handle_call,3},
                              {gen_server,handle_msg,5},
                              {proc_lib,init_p_do_apply,3}]}},
                           [{misc,executing_on_new_process,1},
                            {tap_replication_manager,change_vbucket_filter,4},
                            {tap_replication_manager,
                             '-do_set_incoming_replication_map/3-lc$^5/1-5-',
                             2},
                            {tap_replication_manager,
                             do_set_incoming_replication_map,3},
                            {tap_replication_manager,handle_call,3},
                            {gen_server,handle_msg,5},
                            {proc_lib,init_p_do_apply,3}]},
                          {gen_server,call,
                           ['tap_replication_manager-default',
                            {change_vbucket_replication,139,undefined},
                            infinity]}},
                         {gen_server,call,
                          [{'janitor_agent-default','ns_1@10.3.2.146'},
                           {if_rebalance,<0.13837.1>,
                            {update_vbucket_state,139,replica,undefined,
                             undefined}},
                           infinity]}}}}]},
                    [{janitor_agent,bulk_set_vbucket_state,4},
                     {ns_vbucket_mover,update_replication_post_move,3},
                     {ns_vbucket_mover,on_move_done,2},
                     {gen_server,handle_msg,5},
                     {proc_lib,init_p_do_apply,3}]}}]
    links: [<0.13837.1>,<0.32352.1>]
    dictionary: [{cleanup_list,[<0.32321.1>,<0.32322.1>]}]
    trap_exit: true
    status: running
    heap_size: 6765
    stack_size: 24
    reductions: 9680
  neighbours:




 Comments   
Comment by Andrei Baranouski [ 09/Jan/13 ]
no memcached crash
Comment by Andrei Baranouski [ 09/Jan/13 ]
https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.145-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.146-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.147-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.148-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.149-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.152-diag.txt.gz
Comment by Aleksey Kondratenko [ 09/Jan/13 ]
Indeed no memcached crash, but the errors indicate that memcached dropped the connection.
Comment by Farshid Ghods (Inactive) [ 10/Jan/13 ]
test code:

        for i in [1, 2, 3]:
            expected_progress = 20 * i
            self.log.info("FAIL SWAP REBALANCE PHASE @ {0}".format(expected_progress))
            RestHelper(rest).rebalance_reached(expected_progress)
            bucket = rest.get_buckets()[0].name
            pid = None
            if self.swap_orchestrator:
                # get PID via remote connection if master is a new node
                shell = RemoteMachineShellConnection(master)
                o, _ = shell.execute_command("ps -eo comm,pid | awk '$1 == \"memcached\" { print $2 }'")
                pid = o[0]
                shell.disconnect()
            else:
                for i in xrange(2):
                    try:
                        _mc = MemcachedClientHelper.direct_client(master, bucket)
                        pid = _mc.stats()["pid"]
                        break
                    except EOFError as e:
                        self.log.error("{0}. Retry in 1 sec".format(e))
                        time.sleep(1)
            if pid is None:
                self.fail("impossible to get a PID")
            command = "os:cmd(\"kill -9 {0} \")".format(pid)
            self.log.info(command)
            killed = rest.diag_eval(command)
            self.log.info("killed {0}:{1}?? {2} ".format(master.ip, master.port, killed))
            self.log.info("sleep for 10 sec after kill memcached")
            time.sleep(10)
            # we can't get stats from the new node when the rebalance fails
            if not self.swap_orchestrator:
                ClusterOperationHelper._wait_warmup_completed(self, [master], bucket, wait_time=600)
            i = 0
            # we expect that the rebalance will fail
            while rest._rebalance_progress_status() == "running" and i < 60:
                self.log.info("rebalance progress: {0}".format(rest._rebalance_progress()))
                time.sleep(1)
                i += 1
            self.log.info("rebalance progress status:{0}".format(rest._rebalance_progress_status()))
            if rest._rebalance_progress_status() == "running":
                self.log.info("rebalance is still running even after restarting memcached")
                continue
            knownNodes = rest.node_statuses()
            self.log.info("nodes are still in cluster: {0}".format([(node.ip, node.port) for node in knownNodes]))
            ejectedNodes = list(set(optNodesIds) & set([node.id for node in knownNodes]))
            rest.rebalance(otpNodes=[node.id for node in knownNodes],
                ejectedNodes=ejectedNodes)

        self.assertTrue(rest.monitorRebalance(),
            msg="rebalance operation failed after adding node {0}".format(toBeEjectedNodes))


In short, the flow is:
rebalance,
then kill a node,
wait until the node is warmed up,
wait until the rebalance fails because the node is down,
then restart the rebalance.
Comment by Chiyoung Seo [ 10/Jan/13 ]
Andrei,

Aleksey, Farshid, and I discussed the test case that you ran and identified a few things that we can improve in it.

Basically, we should wait long enough for the first rebalance to fail, instead of waiting 60 seconds in each of three iterations.

Therefore, the suggestion is

1) Remove the for loop in the code

2) Wait for one hour as the timeout for the first rebalance failure

3) If the rebalance status is not changed even after an hour, print out the error message and exit the test case.
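A rough sketch of that suggested flow (reusing the rest/self helpers from the test code above; the one-hour timeout follows the suggestion):

        import time

        # no for loop: kill memcached once, then wait up to one hour for the
        # first rebalance to fail before restarting it
        timeout = time.time() + 3600
        while rest._rebalance_progress_status() == "running":
            if time.time() > timeout:
                self.fail("rebalance status unchanged even after an hour")
            self.log.info("rebalance progress: {0}".format(rest._rebalance_progress()))
            time.sleep(10)
        self.log.info("first rebalance failed; restarting rebalance")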


Can you rerun the test case after adapting the test code?

Thanks,
Comment by Chiyoung Seo [ 10/Jan/13 ]
See my comments above.
Comment by Andrei Baranouski [ 11/Jan/13 ]
Hi,
Let me first clarify what happened in this case.

1. We started the rebalance and, when it reached >40%, killed memcached on one of the nodes.
Here we handle 2 situations: the rebalance either fails or continues without failure.

In our case, the second one happened:
[2013-01-08 13:47:49,441] - [rest_client:881] INFO - rebalance percentage : 40.0627668879 %
...
[2013-01-08 13:47:53,130] - [swaprebalance:403] INFO - killed 10.3.2.149:8091?? (True, '[]')
[2013-01-08 13:47:53,131] - [swaprebalance:404] INFO - sleep for 10 sec after kill memcached
...
[2013-01-08 13:49:04,037] - [swaprebalance:412] INFO - rebalance progress: 42.5532203758
[2013-01-08 13:49:05,058] - [swaprebalance:415] INFO - rebalance progress status:running
[2013-01-08 13:49:05,068] - [swaprebalance:417] INFO - rebalance is still running even after restarting memcached

2. When the rebalance reached >60% we were supposed to kill memcached a second time, but this never happened: the rebalance failed before that point:

[2013-01-08 13:49:05,079] - [rest_client:881] INFO - rebalance percentage : 42.5775868475 %
...
[2013-01-08 13:52:34,570] - [rest_client:881] INFO - rebalance percentage : 50.0 %
[2013-01-08 13:52:36,597] - [rest_client:866] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed

So the rebalance failed about 5 minutes after we killed memcached (13:47:53-13:52:36), and all this time the rebalance progress kept changing (~40%->50%).

I would not want to change the logic of this test, since it has worked this way for a long time. In addition, I have some questions about your recommendations:
1) Why do I need to remove the loop?
2) Why do we need to wait an hour for the first rebalance failure? In our case it turned out that we waited only about 5 minutes for the failure after killing memcached.
3) 'If the rebalance status is not changed even after an hour' - do you mean it is stuck?


All the logs, for clarity:

[2013-01-08 13:47:53,130] - [swaprebalance:403] INFO - killed 10.3.2.149:8091?? (True, '[]')
[2013-01-08 13:47:53,131] - [swaprebalance:404] INFO - sleep for 10 sec after kill memcached
[2013-01-08 13:48:03,182] - [rest_client:881] INFO - rebalance percentage : 40.4774970989 %
[2013-01-08 13:48:03,183] - [swaprebalance:412] INFO - rebalance progress: 40.4774970989
[2013-01-08 13:48:04,223] - [rest_client:881] INFO - rebalance percentage : 40.5263729542 %
[2013-01-08 13:48:04,224] - [swaprebalance:412] INFO - rebalance progress: 40.5263729542
[2013-01-08 13:48:05,254] - [rest_client:881] INFO - rebalance percentage : 40.550739426 %
[2013-01-08 13:48:05,254] - [swaprebalance:412] INFO - rebalance progress: 40.550739426
[2013-01-08 13:48:06,289] - [rest_client:881] INFO - rebalance percentage : 40.5995438254 %
[2013-01-08 13:48:06,289] - [swaprebalance:412] INFO - rebalance progress: 40.5995438254
[2013-01-08 13:48:07,311] - [rest_client:881] INFO - rebalance percentage : 40.623981753 %
[2013-01-08 13:48:07,311] - [swaprebalance:412] INFO - rebalance progress: 40.623981753
[2013-01-08 13:48:08,331] - [rest_client:881] INFO - rebalance percentage : 40.6727861524 %
[2013-01-08 13:48:08,331] - [swaprebalance:412] INFO - rebalance progress: 40.6727861524
[2013-01-08 13:48:09,366] - [rest_client:881] INFO - rebalance percentage : 40.7216620077 %
[2013-01-08 13:48:09,367] - [swaprebalance:412] INFO - rebalance progress: 40.7216620077
[2013-01-08 13:48:10,389] - [rest_client:881] INFO - rebalance percentage : 40.7704664071 %
[2013-01-08 13:48:10,390] - [swaprebalance:412] INFO - rebalance progress: 40.7704664071
[2013-01-08 13:48:11,404] - [rest_client:881] INFO - rebalance percentage : 40.7949043348 %
[2013-01-08 13:48:11,405] - [swaprebalance:412] INFO - rebalance progress: 40.7949043348
[2013-01-08 13:48:12,446] - [rest_client:881] INFO - rebalance percentage : 40.8437087342 %
[2013-01-08 13:48:12,447] - [swaprebalance:412] INFO - rebalance progress: 40.8437087342
[2013-01-08 13:48:13,468] - [rest_client:881] INFO - rebalance percentage : 40.8681466619 %
[2013-01-08 13:48:13,468] - [swaprebalance:412] INFO - rebalance progress: 40.8681466619
[2013-01-08 13:48:14,560] - [rest_client:881] INFO - rebalance percentage : 40.9169510613 %
[2013-01-08 13:48:14,560] - [swaprebalance:412] INFO - rebalance progress: 40.9169510613
[2013-01-08 13:48:15,708] - [rest_client:881] INFO - rebalance percentage : 40.9413889889 %
[2013-01-08 13:48:15,708] - [swaprebalance:412] INFO - rebalance progress: 40.9413889889
[2013-01-08 13:48:16,722] - [rest_client:881] INFO - rebalance percentage : 40.9657554607 %
[2013-01-08 13:48:16,723] - [swaprebalance:412] INFO - rebalance progress: 40.9657554607
[2013-01-08 13:48:17,737] - [rest_client:881] INFO - rebalance percentage : 40.9901933883 %
[2013-01-08 13:48:17,737] - [swaprebalance:412] INFO - rebalance progress: 40.9901933883
[2013-01-08 13:48:18,762] - [rest_client:881] INFO - rebalance percentage : 41.0389977877 %
[2013-01-08 13:48:18,763] - [swaprebalance:412] INFO - rebalance progress: 41.0389977877
[2013-01-08 13:48:19,810] - [rest_client:881] INFO - rebalance percentage : 41.0634357154 %
[2013-01-08 13:48:19,811] - [swaprebalance:412] INFO - rebalance progress: 41.0634357154
[2013-01-08 13:48:20,833] - [rest_client:881] INFO - rebalance percentage : 41.1122401148 %
[2013-01-08 13:48:20,833] - [swaprebalance:412] INFO - rebalance progress: 41.1122401148
[2013-01-08 13:48:21,848] - [rest_client:881] INFO - rebalance percentage : 41.1366780425 %
[2013-01-08 13:48:21,848] - [swaprebalance:412] INFO - rebalance progress: 41.1366780425
[2013-01-08 13:48:22,864] - [rest_client:881] INFO - rebalance percentage : 41.1611159701 %
[2013-01-08 13:48:22,865] - [swaprebalance:412] INFO - rebalance progress: 41.1611159701
[2013-01-08 13:48:23,881] - [rest_client:881] INFO - rebalance percentage : 41.1854824418 %
[2013-01-08 13:48:23,881] - [swaprebalance:412] INFO - rebalance progress: 41.1854824418
[2013-01-08 13:48:24,897] - [rest_client:881] INFO - rebalance percentage : 41.2099203695 %
[2013-01-08 13:48:24,897] - [swaprebalance:412] INFO - rebalance progress: 41.2099203695
[2013-01-08 13:48:25,928] - [rest_client:881] INFO - rebalance percentage : 41.2343582972 %
[2013-01-08 13:48:25,929] - [swaprebalance:412] INFO - rebalance progress: 41.2343582972
[2013-01-08 13:48:26,959] - [rest_client:881] INFO - rebalance percentage : 41.2587247689 %
[2013-01-08 13:48:26,959] - [swaprebalance:412] INFO - rebalance progress: 41.2587247689
[2013-01-08 13:48:28,008] - [rest_client:881] INFO - rebalance percentage : 41.2831626966 %
[2013-01-08 13:48:28,008] - [swaprebalance:412] INFO - rebalance progress: 41.2831626966
[2013-01-08 13:48:29,054] - [rest_client:881] INFO - rebalance percentage : 41.331967096 %
[2013-01-08 13:48:29,054] - [swaprebalance:412] INFO - rebalance progress: 41.331967096
[2013-01-08 13:48:30,070] - [rest_client:881] INFO - rebalance percentage : 41.3564050236 %
[2013-01-08 13:48:30,071] - [swaprebalance:412] INFO - rebalance progress: 41.3564050236
[2013-01-08 13:48:31,094] - [rest_client:881] INFO - rebalance percentage : 41.3808429513 %
[2013-01-08 13:48:31,094] - [swaprebalance:412] INFO - rebalance progress: 41.3808429513
[2013-01-08 13:48:32,111] - [rest_client:881] INFO - rebalance percentage : 41.405280879 %
[2013-01-08 13:48:32,112] - [swaprebalance:412] INFO - rebalance progress: 41.405280879
[2013-01-08 13:48:33,133] - [rest_client:881] INFO - rebalance percentage : 41.4540852784 %
[2013-01-08 13:48:33,133] - [swaprebalance:412] INFO - rebalance progress: 41.4540852784
[2013-01-08 13:48:34,152] - [rest_client:881] INFO - rebalance percentage : 41.5028896778 %
[2013-01-08 13:48:34,153] - [swaprebalance:412] INFO - rebalance progress: 41.5028896778
[2013-01-08 13:48:35,197] - [rest_client:881] INFO - rebalance percentage : 41.5273276054 %
[2013-01-08 13:48:35,197] - [swaprebalance:412] INFO - rebalance progress: 41.5273276054
[2013-01-08 13:48:36,216] - [rest_client:881] INFO - rebalance percentage : 41.5517655331 %
[2013-01-08 13:48:36,217] - [swaprebalance:412] INFO - rebalance progress: 41.5517655331
[2013-01-08 13:48:37,231] - [rest_client:881] INFO - rebalance percentage : 41.6006413884 %
[2013-01-08 13:48:37,232] - [swaprebalance:412] INFO - rebalance progress: 41.6006413884
[2013-01-08 13:48:38,244] - [rest_client:881] INFO - rebalance percentage : 41.6250793161 %
[2013-01-08 13:48:38,244] - [swaprebalance:412] INFO - rebalance progress: 41.6250793161
[2013-01-08 13:48:39,262] - [rest_client:881] INFO - rebalance percentage : 41.6495172437 %
[2013-01-08 13:48:39,263] - [swaprebalance:412] INFO - rebalance progress: 41.6495172437
[2013-01-08 13:48:40,285] - [rest_client:881] INFO - rebalance percentage : 41.6983930991 %
[2013-01-08 13:48:40,285] - [swaprebalance:412] INFO - rebalance progress: 41.6983930991
[2013-01-08 13:48:41,302] - [rest_client:881] INFO - rebalance percentage : 41.7228310267 %
[2013-01-08 13:48:41,302] - [swaprebalance:412] INFO - rebalance progress: 41.7228310267
[2013-01-08 13:48:42,323] - [rest_client:881] INFO - rebalance percentage : 41.7472689544 %
[2013-01-08 13:48:42,323] - [swaprebalance:412] INFO - rebalance progress: 41.7472689544
[2013-01-08 13:48:43,341] - [rest_client:881] INFO - rebalance percentage : 41.7717068821 %
[2013-01-08 13:48:43,341] - [swaprebalance:412] INFO - rebalance progress: 41.7717068821
[2013-01-08 13:48:44,356] - [rest_client:881] INFO - rebalance percentage : 41.7961448097 %
[2013-01-08 13:48:44,357] - [swaprebalance:412] INFO - rebalance progress: 41.7961448097
[2013-01-08 13:48:45,372] - [rest_client:881] INFO - rebalance percentage : 41.8205827374 %
[2013-01-08 13:48:45,373] - [swaprebalance:412] INFO - rebalance progress: 41.8205827374
[2013-01-08 13:48:46,391] - [rest_client:881] INFO - rebalance percentage : 41.8450206651 %
[2013-01-08 13:48:46,391] - [swaprebalance:412] INFO - rebalance progress: 41.8450206651
[2013-01-08 13:48:47,445] - [rest_client:881] INFO - rebalance percentage : 41.8694585927 %
[2013-01-08 13:48:47,446] - [swaprebalance:412] INFO - rebalance progress: 41.8694585927
[2013-01-08 13:48:48,460] - [rest_client:881] INFO - rebalance percentage : 41.918334448 %
[2013-01-08 13:48:48,460] - [swaprebalance:412] INFO - rebalance progress: 41.918334448
[2013-01-08 13:48:49,485] - [rest_client:881] INFO - rebalance percentage : 41.9672103034 %
[2013-01-08 13:48:49,485] - [swaprebalance:412] INFO - rebalance progress: 41.9672103034
[2013-01-08 13:48:50,506] - [rest_client:881] INFO - rebalance percentage : 42.0160861587 %
[2013-01-08 13:48:50,507] - [swaprebalance:412] INFO - rebalance progress: 42.0160861587
[2013-01-08 13:48:51,522] - [rest_client:881] INFO - rebalance percentage : 42.0405240864 %
[2013-01-08 13:48:51,523] - [swaprebalance:412] INFO - rebalance progress: 42.0405240864
[2013-01-08 13:48:52,537] - [rest_client:881] INFO - rebalance percentage : 42.0893284858 %
[2013-01-08 13:48:52,538] - [swaprebalance:412] INFO - rebalance progress: 42.0893284858
[2013-01-08 13:48:53,552] - [rest_client:881] INFO - rebalance percentage : 42.1381328852 %
[2013-01-08 13:48:53,553] - [swaprebalance:412] INFO - rebalance progress: 42.1381328852
[2013-01-08 13:48:54,591] - [rest_client:881] INFO - rebalance percentage : 42.1870087405 %
[2013-01-08 13:48:54,592] - [swaprebalance:412] INFO - rebalance progress: 42.1870087405
[2013-01-08 13:48:55,709] - [rest_client:881] INFO - rebalance percentage : 42.2358131399 %
[2013-01-08 13:48:55,710] - [swaprebalance:412] INFO - rebalance progress: 42.2358131399
[2013-01-08 13:48:56,723] - [rest_client:881] INFO - rebalance percentage : 42.2602510676 %
[2013-01-08 13:48:56,724] - [swaprebalance:412] INFO - rebalance progress: 42.2602510676
[2013-01-08 13:48:57,746] - [rest_client:881] INFO - rebalance percentage : 42.309055467 %
[2013-01-08 13:48:57,747] - [swaprebalance:412] INFO - rebalance progress: 42.309055467
[2013-01-08 13:48:58,768] - [rest_client:881] INFO - rebalance percentage : 42.3579313223 %
[2013-01-08 13:48:58,769] - [swaprebalance:412] INFO - rebalance progress: 42.3579313223
[2013-01-08 13:48:59,821] - [rest_client:881] INFO - rebalance percentage : 42.382297794 %
[2013-01-08 13:48:59,821] - [swaprebalance:412] INFO - rebalance progress: 42.382297794
[2013-01-08 13:49:00,878] - [rest_client:881] INFO - rebalance percentage : 42.4311021934 %
[2013-01-08 13:49:00,879] - [swaprebalance:412] INFO - rebalance progress: 42.4311021934
[2013-01-08 13:49:01,925] - [rest_client:881] INFO - rebalance percentage : 42.4799780487 %
[2013-01-08 13:49:01,926] - [swaprebalance:412] INFO - rebalance progress: 42.4799780487
[2013-01-08 13:49:02,950] - [rest_client:881] INFO - rebalance percentage : 42.5043445205 %
[2013-01-08 13:49:02,951] - [swaprebalance:412] INFO - rebalance progress: 42.5043445205
[2013-01-08 13:49:04,036] - [rest_client:881] INFO - rebalance percentage : 42.5532203758 %
[2013-01-08 13:49:04,037] - [swaprebalance:412] INFO - rebalance progress: 42.5532203758
[2013-01-08 13:49:05,058] - [swaprebalance:415] INFO - rebalance progress status:running
[2013-01-08 13:49:05,068] - [swaprebalance:417] INFO - rebalance is still running even after restarting memcached
[2013-01-08 13:49:05,069] - [swaprebalance:379] INFO - FAIL SWAP REBALANCE PHASE @ 60
[2013-01-08 13:49:05,079] - [rest_client:881] INFO - rebalance percentage : 42.5775868475 %
[2013-01-08 13:49:07,088] - [rest_client:881] INFO - rebalance percentage : 42.6752671023 %
[2013-01-08 13:49:09,134] - [rest_client:881] INFO - rebalance percentage : 42.772947357 %
[2013-01-08 13:49:11,151] - [rest_client:881] INFO - rebalance percentage : 42.8218232123 %
[2013-01-08 13:49:13,164] - [rest_client:881] INFO - rebalance percentage : 42.8949940834 %
[2013-01-08 13:49:15,179] - [rest_client:881] INFO - rebalance percentage : 42.9682364105 %
[2013-01-08 13:49:17,189] - [rest_client:881] INFO - rebalance percentage : 43.0414787376 %
[2013-01-08 13:49:19,208] - [rest_client:881] INFO - rebalance percentage : 43.1390875364 %
[2013-01-08 13:49:21,219] - [rest_client:881] INFO - rebalance percentage : 43.2124013194 %
[2013-01-08 13:49:23,229] - [rest_client:881] INFO - rebalance percentage : 43.3100815741 %
[2013-01-08 13:49:25,239] - [rest_client:881] INFO - rebalance percentage : 43.3588859735 %
[2013-01-08 13:49:27,252] - [rest_client:881] INFO - rebalance percentage : 43.4321283005 %
[2013-01-08 13:49:29,261] - [rest_client:881] INFO - rebalance percentage : 43.5054420835 %
[2013-01-08 13:49:31,272] - [rest_client:881] INFO - rebalance percentage : 43.5542464829 %
[2013-01-08 13:49:33,292] - [rest_client:881] INFO - rebalance percentage : 43.6519267377 %
[2013-01-08 13:49:35,303] - [rest_client:881] INFO - rebalance percentage : 43.7251690647 %
[2013-01-08 13:49:37,317] - [rest_client:881] INFO - rebalance percentage : 43.7740449201 %
[2013-01-08 13:49:39,326] - [rest_client:881] INFO - rebalance percentage : 43.8472872471 %
[2013-01-08 13:49:41,337] - [rest_client:881] INFO - rebalance percentage : 43.9205295742 %
[2013-01-08 13:49:43,346] - [rest_client:881] INFO - rebalance percentage : 43.9937719012 %
[2013-01-08 13:49:45,354] - [rest_client:881] INFO - rebalance percentage : 44.0425763006 %
[2013-01-08 13:49:47,368] - [rest_client:881] INFO - rebalance percentage : 44.1158186277 %
[2013-01-08 13:49:49,375] - [rest_client:881] INFO - rebalance percentage : 44.1890609548 %
[2013-01-08 13:49:51,387] - [rest_client:881] INFO - rebalance percentage : 44.2623032818 %
[2013-01-08 13:49:53,442] - [rest_client:881] INFO - rebalance percentage : 44.3355456089 %
[2013-01-08 13:49:55,451] - [rest_client:881] INFO - rebalance percentage : 44.3843500083 %
[2013-01-08 13:49:57,465] - [rest_client:881] INFO - rebalance percentage : 44.4576637913 %
[2013-01-08 13:49:59,481] - [rest_client:881] INFO - rebalance percentage : 44.5309061183 %
[2013-01-08 13:50:01,491] - [rest_client:881] INFO - rebalance percentage : 44.5797819737 %
[2013-01-08 13:50:03,508] - [rest_client:881] INFO - rebalance percentage : 44.6773907725 %
[2013-01-08 13:50:05,525] - [rest_client:881] INFO - rebalance percentage : 44.7750710272 %
[2013-01-08 13:50:07,544] - [rest_client:881] INFO - rebalance percentage : 44.8239468825 %
[2013-01-08 13:50:09,562] - [rest_client:881] INFO - rebalance percentage : 44.8971892096 %
[2013-01-08 13:50:11,570] - [rest_client:881] INFO - rebalance percentage : 44.9947980084 %
[2013-01-08 13:50:13,592] - [rest_client:881] INFO - rebalance percentage : 45.0680403354 %
[2013-01-08 13:50:15,606] - [rest_client:881] INFO - rebalance percentage : 45.1412826625 %
[2013-01-08 13:50:17,625] - [rest_client:881] INFO - rebalance percentage : 45.2145964455 %
[2013-01-08 13:50:19,653] - [rest_client:881] INFO - rebalance percentage : 45.336571716 %
[2013-01-08 13:50:21,665] - [rest_client:881] INFO - rebalance percentage : 45.3853761154 %
[2013-01-08 13:50:23,674] - [rest_client:881] INFO - rebalance percentage : 45.4586184425 %
[2013-01-08 13:50:25,699] - [rest_client:881] INFO - rebalance percentage : 45.5807366249 %
[2013-01-08 13:50:27,710] - [rest_client:881] INFO - rebalance percentage : 45.6539789519 %
[2013-01-08 13:50:29,729] - [rest_client:881] INFO - rebalance percentage : 45.727221279 %
[2013-01-08 13:50:31,739] - [rest_client:881] INFO - rebalance percentage : 45.7760971343 %
[2013-01-08 13:50:33,769] - [rest_client:881] INFO - rebalance percentage : 45.8737773891 %
[2013-01-08 13:50:35,779] - [rest_client:881] INFO - rebalance percentage : 45.9469482602 %
[2013-01-08 13:50:37,799] - [rest_client:881] INFO - rebalance percentage : 46.0202620432 %
[2013-01-08 13:50:39,813] - [rest_client:881] INFO - rebalance percentage : 46.0935043702 %
[2013-01-08 13:50:41,831] - [rest_client:881] INFO - rebalance percentage : 46.1666752414 %
[2013-01-08 13:50:43,841] - [rest_client:881] INFO - rebalance percentage : 46.2155510967 %
[2013-01-08 13:50:45,851] - [rest_client:881] INFO - rebalance percentage : 46.2887934238 %
[2013-01-08 13:50:47,860] - [rest_client:881] INFO - rebalance percentage : 46.3620357508 %
[2013-01-08 13:50:49,874] - [rest_client:881] INFO - rebalance percentage : 46.4352780779 %
[2013-01-08 13:50:51,883] - [rest_client:881] INFO - rebalance percentage : 46.4841539332 %
[2013-01-08 13:50:53,893] - [rest_client:881] INFO - rebalance percentage : 46.6062006597 %
[2013-01-08 13:50:55,901] - [rest_client:881] INFO - rebalance percentage : 46.6794429867 %
[2013-01-08 13:50:57,911] - [rest_client:881] INFO - rebalance percentage : 46.7526853138 %
[2013-01-08 13:50:59,921] - [rest_client:881] INFO - rebalance percentage : 46.8503655685 %
[2013-01-08 13:51:01,936] - [rest_client:881] INFO - rebalance percentage : 46.9236793515 %
[2013-01-08 13:51:03,947] - [rest_client:881] INFO - rebalance percentage : 46.9968502227 %
[2013-01-08 13:51:05,974] - [rest_client:881] INFO - rebalance percentage : 47.0701640056 %
[2013-01-08 13:51:07,984] - [rest_client:881] INFO - rebalance percentage : 47.118968405 %
[2013-01-08 13:51:10,010] - [rest_client:881] INFO - rebalance percentage : 47.2166486598 %
[2013-01-08 13:51:12,017] - [rest_client:881] INFO - rebalance percentage : 47.2898909868 %
[2013-01-08 13:51:14,032] - [rest_client:881] INFO - rebalance percentage : 47.3631333139 %
[2013-01-08 13:51:16,055] - [rest_client:881] INFO - rebalance percentage : 47.4364470969 %
[2013-01-08 13:51:18,069] - [rest_client:881] INFO - rebalance percentage : 47.4852514963 %
[2013-01-08 13:51:20,089] - [rest_client:881] INFO - rebalance percentage : 47.5340558957 %
[2013-01-08 13:51:22,103] - [rest_client:881] INFO - rebalance percentage : 47.6317361504 %
[2013-01-08 13:51:24,113] - [rest_client:881] INFO - rebalance percentage : 47.7293449492 %
[2013-01-08 13:51:26,122] - [rest_client:881] INFO - rebalance percentage : 47.8026587322 %
[2013-01-08 13:51:28,133] - [rest_client:881] INFO - rebalance percentage : 47.8759010593 %
[2013-01-08 13:51:30,146] - [rest_client:881] INFO - rebalance percentage : 47.9247054587 %
[2013-01-08 13:51:32,157] - [rest_client:881] INFO - rebalance percentage : 47.9735098581 %
[2013-01-08 13:51:34,168] - [rest_client:881] INFO - rebalance percentage : 48.0467521851 %
[2013-01-08 13:51:36,178] - [rest_client:881] INFO - rebalance percentage : 48.1445038958 %
[2013-01-08 13:51:38,205] - [rest_client:881] INFO - rebalance percentage : 48.2176747669 %
[2013-01-08 13:51:40,220] - [rest_client:881] INFO - rebalance percentage : 48.2909885499 %
[2013-01-08 13:51:42,231] - [rest_client:881] INFO - rebalance percentage : 48.3397929493 %
[2013-01-08 13:51:44,241] - [rest_client:881] INFO - rebalance percentage : 48.4131067323 %
[2013-01-08 13:51:46,268] - [rest_client:881] INFO - rebalance percentage : 48.4863490594 %
[2013-01-08 13:51:48,288] - [rest_client:881] INFO - rebalance percentage : 48.5595199305 %
[2013-01-08 13:51:50,299] - [rest_client:881] INFO - rebalance percentage : 48.6328337135 %
[2013-01-08 13:51:52,320] - [rest_client:881] INFO - rebalance percentage : 48.7060045846 %
[2013-01-08 13:51:54,330] - [rest_client:881] INFO - rebalance percentage : 48.7793183676 %
[2013-01-08 13:51:56,341] - [rest_client:881] INFO - rebalance percentage : 48.8769271664 %
[2013-01-08 13:51:58,357] - [rest_client:881] INFO - rebalance percentage : 48.9989738929 %
[2013-01-08 13:52:00,371] - [rest_client:881] INFO - rebalance percentage : 49.0722876759 %
[2013-01-08 13:52:02,386] - [rest_client:881] INFO - rebalance percentage : 49.1210920753 %
[2013-01-08 13:52:04,396] - [rest_client:881] INFO - rebalance percentage : 49.1943344023 %
[2013-01-08 13:52:06,404] - [rest_client:881] INFO - rebalance percentage : 49.2431388017 %
[2013-01-08 13:52:08,416] - [rest_client:881] INFO - rebalance percentage : 49.3163811288 %
[2013-01-08 13:52:10,429] - [rest_client:881] INFO - rebalance percentage : 49.4140613835 %
[2013-01-08 13:52:12,440] - [rest_client:881] INFO - rebalance percentage : 49.4873037106 %
[2013-01-08 13:52:14,450] - [rest_client:881] INFO - rebalance percentage : 49.5605460376 %
[2013-01-08 13:52:16,460] - [rest_client:881] INFO - rebalance percentage : 49.6338598206 %
[2013-01-08 13:52:18,469] - [rest_client:881] INFO - rebalance percentage : 49.7070306918 %
[2013-01-08 13:52:20,491] - [rest_client:881] INFO - rebalance percentage : 49.7802015629 %
[2013-01-08 13:52:22,501] - [rest_client:881] INFO - rebalance percentage : 49.8534438899 %
[2013-01-08 13:52:24,512] - [rest_client:881] INFO - rebalance percentage : 49.926686217 %
[2013-01-08 13:52:26,520] - [rest_client:881] INFO - rebalance percentage : 50.0 %
[2013-01-08 13:52:28,532] - [rest_client:881] INFO - rebalance percentage : 50.0 %
[2013-01-08 13:52:30,544] - [rest_client:881] INFO - rebalance percentage : 50.0 %
[2013-01-08 13:52:32,554] - [rest_client:881] INFO - rebalance percentage : 50.0 %
[2013-01-08 13:52:34,570] - [rest_client:881] INFO - rebalance percentage : 50.0 %
[2013-01-08 13:52:36,597] - [rest_client:866] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
Comment by Farshid Ghods (Inactive) [ 11/Jan/13 ]
Andrei,

Let's discuss this test case on Monday.
Comment by Farshid Ghods (Inactive) [ 11/Jan/13 ]
>> I would not want to change the logic of this test, since it has worked this way for a long time.
If the test logic is incorrect and has been passing, we still need to fix the test logic.


>> 2) Why do we need to wait an hour for the first rebalance failure? In our case it turned out that we waited only about 5 minutes for the failure after killing memcached.

Currently we only wait up to 60 seconds for the rebalance to fail or complete. That is not enough: the rebalance could continue even though memcached was shut down on one of the nodes. Consider a case where the rebalance has already rebalanced out node X from a 6-node cluster and is now removing node Y. If the test shuts down memcached on node X, that would not impact the rebalance operation and the rebalance will succeed.

So the test should wait for the rebalance to finish and accept either outcome (success or failure) after the memcached shutdown.

>>1) Remove the for loop in the code
I actually don't know why there is a for loop there. Can you take a look and see why we needed the for loop there?

Comment by Farshid Ghods (Inactive) [ 11/Jan/13 ]
>> 3) 'If the rebalance status is not changed even after an hour' - do you mean it is stuck?

Yes. I think we already have code in monitorRebalance that detects if rebalanceProgress is not changing, declares the rebalance stuck, and fails it.
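For illustration, that kind of stuck detection could look like the following sketch (a hypothetical helper; the real check lives in testrunner's monitorRebalance):

        import time

        def wait_rebalance_result(rest, timeout=3600, stuck_after=600):
            # fail if progress has not changed for stuck_after seconds
            last_progress = rest._rebalance_progress()
            last_change = time.time()
            deadline = time.time() + timeout
            while rest._rebalance_progress_status() == "running":
                if time.time() > deadline:
                    raise Exception("rebalance did not finish within {0}s".format(timeout))
                progress = rest._rebalance_progress()
                if progress != last_progress:
                    last_progress, last_change = progress, time.time()
                elif time.time() - last_change > stuck_after:
                    raise Exception("rebalance stuck at {0}%".format(progress))
                time.sleep(2)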
Comment by Andrei Baranouski [ 16/Jan/13 ]
test was changed http://review.couchbase.org/#/c/23987
Comment by Farshid Ghods (Inactive) [ 17/Jan/13 ]
Please close the ticket if the test passes after the changes are merged.
Comment by Andrei Baranouski [ 21/Jan/13 ]
all tests passed http://qa.hq.northscale.net/view/2.0.1/job/centos-32-2.0-swaprebalance-tests/263/