[MB-7115] Rebalance operation failed repeatedly while trying to rebalance in 5 nodes and rebalance out 3 nodes on a 5-node cluster; possible cause: "Unable to listen" on one of the nodes being rebalanced out. Created: 06/Nov/12  Updated: 04/Feb/13  Resolved: 20/Nov/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Abhinav Dangeti Assignee: Abhinav Dangeti
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1942-rel.rpm.manifest.xml
http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1944-rel.setup.exe.manifest.xml

Ubuntu 12.04 LTS: EC2 nodes
64-bit, 15 GB RAM


 Description   
Scenario:
- 10 node cluster with build 1942
- Rebalance out 5 nodes (completed successfully)
- Cluster is now 5 nodes.
- Add 5 nodes (with build 1944) and remove 3 nodes.
- Hit rebalance.
- Rebalance failed with reason:

Rebalance exited with reason {badmatch,
[{<0.26283.119>,
{{badmatch,{error,emfile}},
[{ns_replicas_builder_utils,
kill_a_bunch_of_tap_names,3},
{misc,try_with_maybe_ignorant_after,2},
{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]}}]}

- Tried rebalancing again, but it failed repeatedly:

Rebalance exited with reason {{{badmatch,[{<18058.14511.0>,noproc}]},
[{misc,sync_shutdown_many_i_am_trapping_exits,
1},
{misc,try_with_maybe_ignorant_after,2},
{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
[<0.11023.120>,
{shutdown_replicator,
'ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com'},
infinity]}}
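
For reference, emfile is the POSIX EMFILE error: the node's beam process has exhausted its open file descriptors, so any attempt to open a new socket fails. The {badmatch,{error,emfile}} above comes from code that pattern-matches on a successful open. A minimal sketch of that pattern (illustrative only, assuming typical gen_tcp usage; not the actual ns_server code):

%% Minimal sketch, not ns_server source: opening any socket while the
%% process is at its fd limit returns {error, emfile}, and code that
%% asserts success by pattern matching crashes with {badmatch,{error,emfile}}.
connect_example(Host, Port) ->
    {ok, Sock} = gen_tcp:connect(Host, Port, [binary, {active, false}]),
    Sock.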

Will shortly upload logs from one of the nodes that was in the cluster at the time of the rebalance failures.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Noticed this on one of the nodes being rebalanced out:
Unable to listen on 'ns_1@ec2-122-248-217-156.ap-southeast-1.compute.amazonaws.com'.

So failed over the node and tried rebalancing again; the rebalance still failed.

So added that node back but did not involve that particular node in the rebalance operation, and the rebalance succeeded.

 Comments   
Comment by Abhinav Dangeti [ 06/Nov/12 ]
Grabbed diags: https://s3.amazonaws.com/bugdb/MB-7115/C_rebalance_fails_ns-diag-20121107040344.txt.zip
Comment by Aliaksey Artamonau [ 07/Nov/12 ]
Diags from ec2-122-248-217-156.ap-southeast-1.compute.amazonaws.com are needed here.
Comment by Junyi Xie (Inactive) [ 07/Nov/12 ]
Looks like a rebalance issue with no direct connection to XDCR.
Comment by Thuan Nguyen [ 07/Nov/12 ]
Hit this bug again in build 2.0.0-1941. Bin created a 4-node Windows cluster with 5 million items, then created XDCR to one Ubuntu 10.04 64-bit node (ec2-50-16-18-63.compute-1.amazonaws.com). At the destination, added another Ubuntu node to ec2-50 and rebalanced. Rebalance failed with error:
Rebalance exited with reason {mover_failed,{badmatch,{error,emfile}}}

Looking at the diags, I see ebucketmigrator crashed:

[rebalance:info,2012-11-07T22:36:53.434,ns_1@10.108.30.25:<0.11871.3>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.46.215.52 (vbucket 677)
[ns_server:info,2012-11-07T22:36:53.445,ns_1@10.108.30.25:<0.11709.3>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.108.30.25': [<<"replication_building_677_'ns_1@10.46.215.52'">>]
[ns_server:debug,2012-11-07T22:36:53.458,ns_1@10.108.30.25:<0.11706.3>:ns_single_vbucket_mover:spawn_ebucketmigrator_mover:283]Spawned mover "default" 677 'ns_1@10.108.30.25' -> 'ns_1@10.46.215.52': <0.11874.3>
[ns_server:info,2012-11-07T22:36:53.461,ns_1@10.108.30.25:<0.11874.3>:ebucketmigrator_srv:init:490]Setting {"10.46.215.52",11209} vbucket 677 to state replica
[ns_server:error,2012-11-07T22:36:53.621,ns_1@10.108.30.25:<0.11876.3>:menelaus_web_alerts_srv:can_listen:349]gen_udp:open failed due to emfile
[user:info,2012-11-07T22:36:53.621,ns_1@10.108.30.25:<0.11876.3>:menelaus_web_alerts_srv:global_alert:87]IP address seems to have changed. Unable to listen on 'ns_1@10.108.30.25'.
[ns_server:info,2012-11-07T22:36:53.621,ns_1@10.108.30.25:ns_log<0.785.0>:ns_log:handle_cast:125]suppressing duplicate log menelaus_web_alerts_srv:1("IP address seems to have changed. Unable to listen on 'ns_1@10.108.30.25'.") because it's been seen 1 times in the past 9.049832 secs (last seen 9.049832 secs ago
[ns_server:info,2012-11-07T22:36:53.638,ns_1@10.108.30.25:ns_port_memcached<0.864.0>:ns_port_server:log:171]memcached<0.864.0>: Wed Nov 7 22:36:53.446940 UTC 3: TAP (Producer) eq_tapq:replication_building_677_'ns_1@10.46.215.52' - disconnected, keep alive for 300 seconds
memcached<0.864.0>: Wed Nov 7 22:36:53.454932 UTC 3: TAP (Producer) eq_tapq:replication_building_677_'ns_1@10.46.215.52' - Connection is closed by force.

[ns_server:debug,2012-11-07T22:36:53.659,ns_1@10.108.30.25:<0.11874.3>:ebucketmigrator_srv:kill_tapname:966]killing tap named: rebalance_677
[ns_server:info,2012-11-07T22:36:53.663,ns_1@10.108.30.25:<0.30736.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.31142.0>,
                                        {dict,2,16,16,8,80,48,
                                         {[],[],[],[],[],[],[],[],[],[],[],[],
                                          [],[],[],[]},
                                         {{[],[],
                                           [['ns_1@10.46.215.52'|0.322265625]],
                                           [],[],
                                           [['ns_1@10.108.30.25'|
                                             0.7740885416666666]],
                                           [],[],[],[],[],[],[],[],[],[]}}},
                                        ['ns_1@10.108.30.25',
                                         'ns_1@10.46.215.52'],
                                        [],[]}
[error_logger:error,2012-11-07T22:36:53.663,ns_1@10.108.30.25:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ebucketmigrator_srv:init/1
    pid: <0.11874.3>
    registered_name: []
    exception error: no match of right hand side value {error,emfile}
      in function ebucketmigrator_srv:connect/4
      in call from ebucketmigrator_srv:init/1
    ancestors: [<0.11706.3>,<0.31209.0>,<0.31142.0>]
    messages: []
    links: [#Port<0.142811>,<0.11706.3>,#Port<0.142808>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 987
    stack_size: 24
    reductions: 123486
  neighbours:

[ns_server:info,2012-11-07T22:36:53.665,ns_1@10.108.30.25:<0.11709.3>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.108.30.25': []
[rebalance:error,2012-11-07T22:36:53.666,ns_1@10.108.30.25:<0.31209.0>:ns_vbucket_mover:handle_info:252]<0.11706.3> exited with {mover_failed,{badmatch,{error,emfile}}}
[error_logger:error,2012-11-07T22:36:53.666,ns_1@10.108.30.25:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.11706.3>
    registered_name: []
    exception exit: {mover_failed,{badmatch,{error,emfile}}}
      in function ns_single_vbucket_mover:wait_for_mover/5
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.31209.0>,<0.31142.0>]
    messages: []
    links: [<0.31209.0>]
    dictionary: [{cleanup_list,[<0.11709.3>]}]
    trap_exit: true
    status: running
    heap_size: 75025
    stack_size: 24
    reductions: 13260
  neighbours:

[ns_server:info,2012-11-07T22:36:53.684,ns_1@10.108.30.25:janitor_agent-default<0.1056.0>:janitor_agent:handle_info:676]Undoing temporary vbucket states caused by rebalance
[ns_server:debug,2012-11-07T22:36:53.685,ns_1@10.108.30.25:<0.31219.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.31209.0>} exited with reason {mover_failed,
                                                                                      {badmatch,
                                                                                       {error,
                                                                                        emfile}}}
[user:info,2012-11-07T22:36:53.685,ns_1@10.108.30.25:<0.30736.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {mover_failed,{badmatch,{error,emfile}}}

[ns_server:debug,2012-11-07T22:36:53.774,ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:info,2012-11-07T22:36:53.778,ns_1@10.108.30.25:<0.12564.3>:diag_handler:log_all_tap_and_checkpoint_stats:127]logging tap & checkpoint stats
[ns_server:debug,2012-11-07T22:36:53.778,ns_1@10.108.30.25:ns_config_rep<0.797.0>:ns_config_rep:do_push_keys:317]Replicating some config keys ([counters,rebalance_status,rebalancer_pid]..)
[ns_server:debug,2012-11-07T22:36:53.797,ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:debug,2012-11-07T22:36:53.798,ns_1@10.108.30.25:ns_config_log<0.779.0>:ns_config_log:log_common:111]config change:
counters ->
[{rebalance_fail,1},{rebalance_start,1}]
[ns_server:debug,2012-11-07T22:36:53.804,ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:debug,2012-11-07T22:36:53.818,ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:debug,2012-11-07T22:36:53.818,ns_1@10.108.30.25:ns_config_log<0.779.0>:ns_config_log:log_common:111]config change:
rebalancer_pid ->
undefined
[ns_server:debug,2012-11-07T22:36:53.831,ns_1@10.108.30.25:ns_config_log<0.779.0>:ns_config_log:log_common:111]config change:
rebalance_status ->
{none,<<"Rebalance failed. See logs for detailed reason. You can try rebalance again.">>}
[ns_server:debug,2012-11-07T22:36:53.830,ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[error_logger:error,2012-11-07T22:36:53.669,ns_1@10.108.30.25:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.31209.0> terminating
** Last message in was {'EXIT',<0.11706.3>,
                               {mover_failed,{badmatch,{error,emfile}}}}
** When Server state == {state,"default",<0.31219.0>,
                               {dict,2,16,16,8,80,48,
                                     {[],[],[],[],[],[],[],[],[],[],[],[],[],
                                      [],[],[]},
                                     {{[],[],
                                       [['ns_1@

Link to collect_info from all nodes of the destination cluster: https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201211/2-ec2-1941reb-failed-no-match-right-hand-20121107.tgz
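
Note on the "Unable to listen" alert in the log above: it is raised by menelaus_web_alerts_srv:global_alert right after menelaus_web_alerts_srv:can_listen reports "gen_udp:open failed due to emfile", so the "IP address seems to have changed" message here is a secondary symptom of file-descriptor exhaustion rather than an actual address change. A rough sketch of what such a probe looks like (behaviour inferred from the log lines above; not copied from the ns_server source):

%% Hedged sketch: probe whether the node can still bind its address by
%% opening a throwaway UDP socket. With all fds exhausted, gen_udp:open/2
%% returns {error, emfile}, the probe fails, and the generic alert fires.
can_listen_probe(Address) ->
    case gen_udp:open(0, [{ip, Address}]) of
        {ok, Socket} ->
            gen_udp:close(Socket),
            true;
        {error, Reason} ->
            {false, Reason}   %% Reason = emfile here, not an address change
    end.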

Comment by Steve Yen [ 09/Nov/12 ]
Reassigning since Tony attached some more info.
Comment by Aleksey Kondratenko [ 09/Nov/12 ]
emfile in this case (at least in the collect_infos from Tony) seems to be caused by CLOSE_WAIT sockets. My leading guess is that it's those hung is_missing_rev requests that we're seeing in MB-7129. Only in this environment it's not killing the node entirely but rather exhausting file descriptors.
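A quick way to gauge socket pressure from an Erlang shell attached to the node (a diagnostic sketch, not something ns_server itself runs; OS-level CLOSE_WAIT counts would still need netstat/lsof on the box, since the VM only sees its own ports):

%% Diagnostic sketch: count the ports (sockets, files, pipes) this VM holds
%% against its own port limit. Every CLOSE_WAIT socket still held by the VM
%% also pins an OS file descriptor, which is what eventually yields emfile.
port_pressure() ->
    InUse = length(erlang:ports()),
    Limit = erlang:system_info(port_limit),
    io:format("ports in use: ~p / port limit: ~p~n", [InUse, Limit]),
    {InUse, Limit}.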
Comment by Ketaki Gangal [ 13/Nov/12 ]
Re-opening this for repro/observation on the XDCR-fixed changes.
Comment by Abhinav Dangeti [ 14/Nov/12 ]
When this issue originally occurred, the clusters were in an XDCR setup, but replication had been deleted before starting the rebalance operation; the failures occurred on the destination cluster.

To investigate whether this issue happens with or without XDCR:

case1: Tried reproducing this with 2 ongoing unidirectional replications on the destination cluster (7 nodes) - rebalanced in 3 nodes (1949) and rebalanced out 5 nodes (1944).
           The source was on 1944 (a 7-node cluster) and had a fixed number of items in both buckets, with no front-end load.
           Rebalance on the destination cluster failed for the same reason: the cluster was "unable to listen" on one of the nodes.

case2: Deleted replications on the source, cleaned the destination cluster, and loaded all nodes with 1949.
           Created a 5-node cluster (not part of any XDCR).
           Created 2 buckets and, with ongoing front-end load:
                    - rebalanced out 3 nodes, rebalanced in 5 nodes :: rebalance operation completed successfully.
                    - rebalanced out 5 nodes, rebalanced in 3 nodes :: rebalance operation completed successfully.
Comment by Steve Yen [ 15/Nov/12 ]
bug-scrub: 1949 didn't have the timeout fix? Please try again with 1950. thanks.
Comment by Steve Yen [ 20/Nov/12 ]
bug-scrub:

Ketaki reports it works.