Hit this bug again in build 2.0.0-1941. Bin created a 4-node Windows cluster with 5 million items, then set up XDCR to a single Ubuntu 10.04 64-bit node (ec2-50-16-18-63.compute-1.amazonaws.com). At the destination, another Ubuntu node was added to node ec2-50 and a rebalance was started. The rebalance failed with the error:
Rebalance exited with reason {mover_failed,{badmatch,{error,emfile}}}
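For context, emfile is how Erlang reports the POSIX EMFILE error, meaning the process has run out of file descriptors. A minimal sketch for checking the descriptor limit and current usage on a Linux node, assuming Python is available there. The helper names are hypothetical, and the /proc path is Linux-specific. Note that fd_limits() reports the limit of the Python process itself, which may differ from the limit the Couchbase processes were started with.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open descriptors of a process via /proc (Linux only).

    Pass the pid of the process of interest; reading another user's
    process usually requires root.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    if os.path.isdir("/proc/self/fd"):
        print(f"descriptors open in this process: {open_fd_count()}")
```

If the descriptor count of the Erlang VM or memcached sits close to the soft limit during rebalance, that would be consistent with the emfile failures seen here.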
Looking at the diags, I see that ebucketmigrator crashed:
[rebalance:info,2012-11-07T22:36:53.434,
ns_1@10.108.30.25:<0.11871.3>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for
ns_1@10.46.215.52 (vbucket 677)
[ns_server:info,2012-11-07T22:36:53.445,
ns_1@10.108.30.25:<0.11709.3>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on '
ns_1@10.108.30.25': [<<"replication_building_677_'
ns_1@10.46.215.52'">>]
[ns_server:debug,2012-11-07T22:36:53.458,
ns_1@10.108.30.25:<0.11706.3>:ns_single_vbucket_mover:spawn_ebucketmigrator_mover:283]Spawned mover "default" 677 '
ns_1@10.108.30.25' -> '
ns_1@10.46.215.52': <0.11874.3>
[ns_server:info,2012-11-07T22:36:53.461,
ns_1@10.108.30.25:<0.11874.3>:ebucketmigrator_srv:init:490]Setting {"10.46.215.52",11209} vbucket 677 to state replica
[ns_server:error,2012-11-07T22:36:53.621,
ns_1@10.108.30.25:<0.11876.3>:menelaus_web_alerts_srv:can_listen:349]gen_udp:open failed due to emfile
[user:info,2012-11-07T22:36:53.621,
ns_1@10.108.30.25:<0.11876.3>:menelaus_web_alerts_srv:global_alert:87]IP address seems to have changed. Unable to listen on '
ns_1@10.108.30.25'.
[ns_server:info,2012-11-07T22:36:53.621,
ns_1@10.108.30.25:ns_log<0.785.0>:ns_log:handle_cast:125]suppressing duplicate log menelaus_web_alerts_srv:1("IP address seems to have changed. Unable to listen on '
ns_1@10.108.30.25'.") because it's been seen 1 times in the past 9.049832 secs (last seen 9.049832 secs ago
[ns_server:info,2012-11-07T22:36:53.638,
ns_1@10.108.30.25:ns_port_memcached<0.864.0>:ns_port_server:log:171]memcached<0.864.0>: Wed Nov 7 22:36:53.446940 UTC 3: TAP (Producer) eq_tapq:replication_building_677_'
ns_1@10.46.215.52' - disconnected, keep alive for 300 seconds
memcached<0.864.0>: Wed Nov 7 22:36:53.454932 UTC 3: TAP (Producer) eq_tapq:replication_building_677_'
ns_1@10.46.215.52' - Connection is closed by force.
[ns_server:debug,2012-11-07T22:36:53.659,
ns_1@10.108.30.25:<0.11874.3>:ebucketmigrator_srv:kill_tapname:966]killing tap named: rebalance_677
[ns_server:info,2012-11-07T22:36:53.663,
ns_1@10.108.30.25:<0.30736.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.31142.0>,
{dict,2,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[],[]},
{{[],[],
[['
ns_1@10.46.215.52'|0.322265625]],
[],[],
[['
ns_1@10.108.30.25'|
0.7740885416666666]],
[],[],[],[],[],[],[],[],[],[]}}},
['
ns_1@10.108.30.25',
'
ns_1@10.46.215.52'],
[],[]}
[error_logger:error,2012-11-07T22:36:53.663,
ns_1@10.108.30.25:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ebucketmigrator_srv:init/1
pid: <0.11874.3>
registered_name: []
exception error: no match of right hand side value {error,emfile}
in function ebucketmigrator_srv:connect/4
in call from ebucketmigrator_srv:init/1
ancestors: [<0.11706.3>,<0.31209.0>,<0.31142.0>]
messages: []
links: [#Port<0.142811>,<0.11706.3>,#Port<0.142808>]
dictionary: []
trap_exit: false
status: running
heap_size: 987
stack_size: 24
reductions: 123486
neighbours:
[ns_server:info,2012-11-07T22:36:53.665,
ns_1@10.108.30.25:<0.11709.3>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on '
ns_1@10.108.30.25': []
[rebalance:error,2012-11-07T22:36:53.666,
ns_1@10.108.30.25:<0.31209.0>:ns_vbucket_mover:handle_info:252]<0.11706.3> exited with {mover_failed,{badmatch,{error,emfile}}}
[error_logger:error,2012-11-07T22:36:53.666,
ns_1@10.108.30.25:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_single_vbucket_mover:mover/6
pid: <0.11706.3>
registered_name: []
exception exit: {mover_failed,{badmatch,{error,emfile}}}
in function ns_single_vbucket_mover:wait_for_mover/5
in call from ns_single_vbucket_mover:mover_inner/6
in call from misc:try_with_maybe_ignorant_after/2
in call from ns_single_vbucket_mover:mover/6
ancestors: [<0.31209.0>,<0.31142.0>]
messages: []
links: [<0.31209.0>]
dictionary: [{cleanup_list,[<0.11709.3>]}]
trap_exit: true
status: running
heap_size: 75025
stack_size: 24
reductions: 13260
neighbours:
[ns_server:info,2012-11-07T22:36:53.684,
ns_1@10.108.30.25:janitor_agent-default<0.1056.0>:janitor_agent:handle_info:676]Undoing temporary vbucket states caused by rebalance
[ns_server:debug,2012-11-07T22:36:53.685,
ns_1@10.108.30.25:<0.31219.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.31209.0>} exited with reason {mover_failed,
{badmatch,
{error,
emfile}}}
[user:info,2012-11-07T22:36:53.685,
ns_1@10.108.30.25:<0.30736.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {mover_failed,{badmatch,{error,emfile}}}
[ns_server:debug,2012-11-07T22:36:53.774,
ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:info,2012-11-07T22:36:53.778,
ns_1@10.108.30.25:<0.12564.3>:diag_handler:log_all_tap_and_checkpoint_stats:127]logging tap & checkpoint stats
[ns_server:debug,2012-11-07T22:36:53.778,
ns_1@10.108.30.25:ns_config_rep<0.797.0>:ns_config_rep:do_push_keys:317]Replicating some config keys ([counters,rebalance_status,rebalancer_pid]..)
[ns_server:debug,2012-11-07T22:36:53.797,
ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:debug,2012-11-07T22:36:53.798,
ns_1@10.108.30.25:ns_config_log<0.779.0>:ns_config_log:log_common:111]config change:
counters ->
[{rebalance_fail,1},{rebalance_start,1}]
[ns_server:debug,2012-11-07T22:36:53.804,
ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:debug,2012-11-07T22:36:53.818,
ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:debug,2012-11-07T22:36:53.818,
ns_1@10.108.30.25:ns_config_log<0.779.0>:ns_config_log:log_common:111]config change:
rebalancer_pid ->
undefined
[ns_server:debug,2012-11-07T22:36:53.831,
ns_1@10.108.30.25:ns_config_log<0.779.0>:ns_config_log:log_common:111]config change:
rebalance_status ->
{none,<<"Rebalance failed. See logs for detailed reason. You can try rebalance again.">>}
[ns_server:debug,2012-11-07T22:36:53.830,
ns_1@10.108.30.25:capi_set_view_manager-default<0.1035.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[error_logger:error,2012-11-07T22:36:53.669,
ns_1@10.108.30.25:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.31209.0> terminating
** Last message in was {'EXIT',<0.11706.3>,
{mover_failed,{badmatch,{error,emfile}}}}
** When Server state == {state,"default",<0.31219.0>,
{dict,2,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]},
{{[],[],
[['ns_1@
Link to the collect_info from all nodes of the destination cluster:
https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201211/2-ec2-1941reb-failed-no-match-right-hand-20121107.tgz
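To see which components hit emfile in a diag log like the one above, a rough sketch follows. It assumes each log entry header ends in `module:function:line]`, as in the excerpt; the function name count_emfile_sources is hypothetical and not part of any Couchbase tool. Lines inside crash reports are attributed to the logger handler that printed them.

```python
import re
from collections import Counter

# Matches the "module:function:line]" tail of an ns_server log entry header.
SOURCE_RE = re.compile(r":(\w+):(\w+):(\d+)\]")


def count_emfile_sources(log_text):
    """Count lines mentioning emfile, grouped by the reporting module:function."""
    counts = Counter()
    current = None  # most recent header seen while scanning
    for line in log_text.splitlines():
        match = SOURCE_RE.search(line)
        if match:
            current = f"{match.group(1)}:{match.group(2)}"
        if "emfile" in line and current is not None:
            counts[current] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_emfile.py <path to an extracted diag log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for source, hits in count_emfile_sources(handle.read()).most_common():
            print(f"{hits:6d}  {source}")
```

The earliest entries in the output's underlying log are usually the most useful, since later emfile errors tend to be knock-on failures of the same descriptor exhaustion.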
Ketaki reports that it works.