[MB-6222] rebalance failed because memcached died on windows Created: 14/Aug/12  Updated: 09/Jan/13  Resolved: 31/Aug/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0-beta
Fix Version/s: 2.0-beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Jin Lim
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows server r2 2008
build #1583

Attachments: GZip Archive 10.1.3.146-8091-diag.txt.gz     GZip Archive 10.1.3.147-8091-diag.txt.gz     File error.1     File info.1    

 Description   
1 node, 1 bucket, 3 items, rebalance in 1 node

** Last message in was {#Port<0.18432>,{exit_status,255}}
** When Server state == {state,#Port<0.18432>,memcached,
                               {["114 3: Schedule cleanup of \"eq_tapq:anon_78\"",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_590_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"initial_vbucket_stream\" and vbucket 590",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_590_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_590_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_590_'ns_1@10.1.3.147' - Schedule the backfill for vbucket 590",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_589_'ns_1@10.1.3.147' - Connection is closed by force.",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_589 - Clear the tap queues by force",
                                 "114 3: Schedule cleanup of \"eq_tapq:rebalance_589\"",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_589 - disconnected",
                                 "114 3: TAP takeover is completed. Disconnecting tap stream <eq_tapq:rebalance_589>",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_589 - Sending TAP_VBUCKET_SET with vbucket 589 and state \"active\"",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_589 - VBucket <589> is going dead to complete vbucket takeover.",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_589 - Sending TAP_VBUCKET_SET with vbucket 589 and state \"pending\"",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_589 - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_589 - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_589_'ns_1@10.1.3.147' - disconnected, keep alive for 300 seconds",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_589_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"close_backfill\" and vbucket 589",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_589_'ns_1@10.1.3.147' - Backfill is completed with VBuckets 589, ",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_588 - Clear the tap queues by force",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_588_'ns_1@10.1.3.147' - Clear the tap queues by force",
                                 "114 3: Schedule cleanup of \"eq_tapq:rebalance_588\"",
                                 "114 3: Schedule cleanup of \"eq_tapq:anon_77\"",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_589_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"initial_vbucket_stream\" and vbucket 589",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_589_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_589_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_589_'ns_1@10.1.3.147' - Schedule the backfill for vbucket 589",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_588_'ns_1@10.1.3.147' - Connection is closed by force.",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_588 - disconnected",
                                 "114 3: TAP takeover is completed. Disconnecting tap stream <eq_tapq:rebalance_588>",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_588 - Sending TAP_VBUCKET_SET with vbucket 588 and state \"active\"",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_588 - VBucket <588> is going dead to complete vbucket takeover.",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_588 - Sending TAP_VBUCKET_SET with vbucket 588 and state \"pending\"",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_588 - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "114 3: TAP (Producer) eq_tapq:rebalance_588 - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "114 3: TAP (Producer) eq_tapq:replication_building_588_'ns_1@10.1.3.147' - disconnected, keep alive for 300 seconds",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_588_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"close_backfill\" and vbucket 588",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_588_'ns_1@10.1.3.147' - Backfill is completed with VBuckets 588, ",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_587 - Clear the tap queues by force"],
                                ["112 3: TAP (Producer) eq_tapq:replication_building_585_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "112 3: TAP (Producer) eq_tapq:replication_building_585_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"initial_vbucket_stream\" and vbucket 585",
                                 "112 3: Schedule cleanup of \"eq_tapq:anon_73\"",
                                 "112 3: Schedule cleanup of \"eq_tapq:rebalance_584\"",
                                 "112 3: TAP (Producer) eq_tapq:replication_building_584_'ns_1@10.1.3.147' - Clear the tap queues by force",
                                 "112 3: TAP (Producer) eq_tapq:rebalance_584 - Clear the tap queues by force",
                                 "112 3: TAP (Producer) eq_tapq:replication_building_585_'ns_1@10.1.3.147' - Backfill is completed with VBuckets 585, ",
                                 "112 3: TAP (Producer) eq_tapq:replication_building_585_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"close_backfill\" and vbucket 585",
                                 "112 3: TAP (Producer) eq_tapq:replication_building_585_'ns_1@10.1.3.147' - disconnected, keep alive for 300 seconds",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_585 - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_585 - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_585 - Sending TAP_VBUCKET_SET with vbucket 585 and state \"pending\"",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_585 - VBucket <585> is going dead to complete vbucket takeover.",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_585 - Sending TAP_VBUCKET_SET with vbucket 585 and state \"active\"",
                                 "113 3: TAP takeover is completed. Disconnecting tap stream <eq_tapq:rebalance_585>",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_585 - disconnected",
                                 "113 3: Schedule cleanup of \"eq_tapq:rebalance_585\"",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_585 - Clear the tap queues by force",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_585_'ns_1@10.1.3.147' - Connection is closed by force.",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_586_'ns_1@10.1.3.147' - Schedule the backfill for vbucket 586",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_586_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_586_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_586_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"initial_vbucket_stream\" and vbucket 586",
                                 "113 3: Schedule cleanup of \"eq_tapq:anon_74\"",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_585_'ns_1@10.1.3.147' - Clear the tap queues by force",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_586_'ns_1@10.1.3.147' - Backfill is completed with VBuckets 586, ",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_586_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"close_backfill\" and vbucket 586",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_586_'ns_1@10.1.3.147' - disconnected, keep alive for 300 seconds",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_586 - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_586 - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_586 - Sending TAP_VBUCKET_SET with vbucket 586 and state \"pending\"",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_586 - VBucket <586> is going dead to complete vbucket takeover.",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_586 - Sending TAP_VBUCKET_SET with vbucket 586 and state \"active\"",
                                 "113 3: TAP takeover is completed. Disconnecting tap stream <eq_tapq:rebalance_586>",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_586 - disconnected",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_586_'ns_1@10.1.3.147' - Connection is closed by force.",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_587_'ns_1@10.1.3.147' - Schedule the backfill for vbucket 587",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_587_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_587_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_587_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"initial_vbucket_stream\" and vbucket 587",
                                 "113 3: Schedule cleanup of \"eq_tapq:anon_75\"",
                                 "113 3: Schedule cleanup of \"eq_tapq:rebalance_586\"",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_586_'ns_1@10.1.3.147' - Clear the tap queues by force",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_586 - Clear the tap queues by force",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_587_'ns_1@10.1.3.147' - Backfill is completed with VBuckets 587, ",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_587_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"close_backfill\" and vbucket 587",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_587_'ns_1@10.1.3.147' - disconnected, keep alive for 300 seconds",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_587 - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_587 - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_587 - Sending TAP_VBUCKET_SET with vbucket 587 and state \"pending\"",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_587 - VBucket <587> is going dead to complete vbucket takeover.",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_587 - Sending TAP_VBUCKET_SET with vbucket 587 and state \"active\"",
                                 "113 3: TAP takeover is completed. Disconnecting tap stream <eq_tapq:rebalance_587>",
                                 "113 3: TAP (Producer) eq_tapq:rebalance_587 - disconnected",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_587_'ns_1@10.1.3.147' - Connection is closed by force.",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_588_'ns_1@10.1.3.147' - Schedule the backfill for vbucket 588",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_588_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"opaque_enable_auto_nack\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_588_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"enable_checkpoint_sync\" and vbucket 0",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_588_'ns_1@10.1.3.147' - Sending TAP_OPAQUE with command \"initial_vbucket_stream\" and vbucket 588",
                                 "113 3: Schedule cleanup of \"eq_tapq:anon_76\"",
                                 "113 3: Schedule cleanup of \"eq_tapq:rebalance_587\"",
                                 "113 3: TAP (Producer) eq_tapq:replication_building_587_'ns_1@10.1.3.147' - Clear the tap queues by force"]},
                               undefined,[],0,true}
** Reason for termination ==
** {abnormal,255}

[error_logger:error,2012-08-14T10:24:04.859,ns_1@10.1.3.146:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_port_server:init/1
    pid: <0.22831.0>
    registered_name: ns_port_memcached
    exception exit: {abnormal,255}
      in function gen_server:terminate/6
    ancestors: [<0.22830.0>,ns_port_sup,ns_server_sup,ns_server_cluster_sup,
                  <0.67.0>]
    messages: [{'EXIT',#Port<0.18432>,normal}]
    links: [<0.22830.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 121393
    stack_size: 24
    reductions: 109012
  neighbours:

 Comments   
Comment by Chiyoung Seo [ 15/Aug/12 ]
Iryna,

Can you please increase the log level (probably -vvv) in memcached and redirect all the memcached / ep_eninge logs to a separate file and test the rebalance again?

I didn't see any warning logs from memcached / ep_engine at the exit time:

[ns_server:info,2012-08-14T10:24:04.844,ns_1@10.1.3.146:<0.24034.0>:mc_connection:run_loop:190]mccouch connection was normally closed
[user:info,2012-08-14T10:24:04.844,ns_1@10.1.3.146:ns_port_memcached:ns_port_server:handle_info:107]Port server memcached on node 'ns_1@10.1.3.146' exited with status 255. Restarting.

Regarding the log level and config, please get more information from Karan or Farshid.
Comment by Jin Lim [ 31/Aug/12 ]
The memcached crash is no longer seen. The fix being merged here, http://review.couchbase.org/#/c/20383/, must have addressed this issue.

However, there is still a rebalance failure due to a different issue. Basically ns_server exited rebalance before completion. Please see snippets from the log below. QA team may want to open a new bug tracking this issue (first assign it to ns_server team for triaging). Thanks.

er:info,2012-08-31T15:00:53.098,ns_1@10.3.121.142:<0.6808.1>:diag_handler:log_all_tap_and_checkpoint_stats:130]end of logging tap & checkpoint stats
[ns_server:info,2012-08-31T15:00:58.442,ns_1@10.3.121.142:<0.857.0>:ns_orchestrator:handle_info:258]Skipping janitor in state rebalancing: {rebalancing_state,<0.714.1>,
                                        {dict,2,16,16,8,80,48,
                                         {[],[],[],[],[],[],[],[],[],[],[],[],
                                          [],[],[],[]},
                                         {{[],[],
                                           [['ns_1@10.3.121.142'|
                                             0.31380208333333337]],
                                           [['ns_1@10.3.2.123'|0.0]],
                                           [],[],[],[],[],[],[],[],[],[],[],
                                           []}}},
                                        ['ns_1@10.3.121.142',
                                         'ns_1@10.3.2.123'],
                                        [],[]}
[ns_server:error,2012-08-31T15:00:58.989,ns_1@10.3.121.142:<0.6678.1>:ns_replicas_builder:build_replicas_main:90]Got premature exit from one of ebucketmigrators: {'EXIT',<20019.17095.0>,
                                                  {badmatch,{error,timeout}}}
[ns_server:error,2012-08-31T15:00:59.005,ns_1@10.3.121.142:<0.6678.1>:misc:sync_shutdown_many_i_am_trapping_exits:1373]Shutdown of the following failed: [{<20019.17095.0>,
                                    {badmatch,{error,timeout}}}]
[error_logger:error,2012-08-31T15:00:59.005,ns_1@10.3.121.142:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ebucketmigrator_srv:init/1
    pid: <20019.17095.0>
    registered_name: []
    exception error: no match of right hand side value {error,timeout}
      in function mc_client_binary:cmd_binary_vocal_recv/5
      in call from mc_client_binary:set_vbucket/3
      in call from ebucketmigrator_srv:'-init/1-lc$^0/1-0-'/3
      in call from ebucketmigrator_srv:init/1
    ancestors: [<0.6678.1>,<0.6673.1>,<0.888.1>,<0.714.1>]
    messages: []
    links: [#Port<20019.19165>,<0.6678.1>,#Port<20019.19164>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 987
    stack_size: 24
    reductions: 1802
  neighbours:

[ns_server:info,2012-08-31T15:00:59.005,ns_1@10.3.121.142:<0.6678.1>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.142': [<<"replication_building_241_'ns_1@10.3.2.123'">>]
[ns_server:error,2012-08-31T15:00:59.005,ns_1@10.3.121.142:<0.6678.1>:misc:try_with_maybe_ignorant_after:1409]Eating exception from ignorant after-block:
{error,{badmatch,[{<20019.17095.0>,{badmatch,{error,timeout}}}]},
       [{misc,sync_shutdown_many_i_am_trapping_exits,1},
        {misc,try_with_maybe_ignorant_after,2},
        {misc,try_with_maybe_ignorant_after,2},
        {ns_replicas_builder,build_replicas_main,6},
        {proc_lib,init_p_do_apply,3}]}
[ns_server:info,2012-08-31T15:00:59.005,ns_1@10.3.121.142:<0.6673.1>:ns_single_vbucket_mover:mover_inner_old_style:194]Got exit message (parent is <0.888.1>). Exiting...
{'EXIT',<0.6678.1>,
        {replicator_died,{'EXIT',<20019.17095.0>,{badmatch,{error,timeout}}}}}

Rebalance exited with reason {exited,
{'EXIT',<0.6678.1>,
{replicator_died,
{'EXIT',<20019.17095.0>,
{badmatch,{error,timeout}}}}}}
Comment by Jin Lim [ 31/Aug/12 ]
server log (info.1 & error.1) attached.
Comment by Thuan Nguyen [ 31/Aug/12 ]
Integrated in github-ep-engine-2-0 #421 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/421/])
    MB-6222 Requeue failed bgfetch requests for retry (Revision 4882ef87ccd7ba80dad6f2299d733979bf9763d0)

     Result = SUCCESS
Jin Lim :
Files :
* src/vbucket.hh
* src/bgfetcher.cc
* src/vbucket.cc
* src/bgfetcher.hh
Generated at Mon Sep 01 22:13:21 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.