[MB-5343] memcached dropped connections: Rebalance failed due to replicator_died: exited (ns_single_vbucket_mover) Created: 21/May/12  Updated: 20/Jun/13  Resolved: 28/Jan/13

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 1.8.1-release-candidate
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Karan Kumar (Inactive) Assignee: Frank Weigel
Resolution: Fixed Votes: 0
Labels: 1.8.1-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 64 bit
181-831-rel

Attachments: GZip Archive 10.5.2.13-8091-diag.txt.gz     GZip Archive 10.5.2.14-8091-diag.txt.gz     GZip Archive 10.5.2.15-8091-diag.txt.gz     GZip Archive 10.5.2.16-8091-diag.txt.gz     GZip Archive 10.5.2.18-8091-diag.txt.gz     GZip Archive 10.5.2.19-8091-diag.txt.gz     GZip Archive 2ae93467-b468-4d0e-966e-7f14cdc3bb1f-10.1.3.82-diag.gz     GZip Archive 2ae93467-b468-4d0e-966e-7f14cdc3bb1f-10.1.3.83-diag.gz     GZip Archive 2ae93467-b468-4d0e-966e-7f14cdc3bb1f-10.1.3.84-diag.gz     GZip Archive 2ae93467-b468-4d0e-966e-7f14cdc3bb1f-10.1.3.85-diag.gz     GZip Archive 2ae93467-b468-4d0e-966e-7f14cdc3bb1f-10.1.3.86-diag.gz     GZip Archive 32326eea-21d4-4ef0-bf82-233b681a5336-10.1.3.111-diag.gz     GZip Archive 32326eea-21d4-4ef0-bf82-233b681a5336-10.1.3.112-diag.gz     GZip Archive 32326eea-21d4-4ef0-bf82-233b681a5336-10.1.3.119-diag.gz     GZip Archive 32326eea-21d4-4ef0-bf82-233b681a5336-10.1.3.120-diag.gz     GZip Archive 32326eea-21d4-4ef0-bf82-233b681a5336-10.1.3.121-diag.gz     GZip Archive f9bf72eb-4450-44a3-ae0e-f6328be40592-10.1.3.109-diag.gz     GZip Archive f9bf72eb-4450-44a3-ae0e-f6328be40592-10.1.3.111-diag.gz     GZip Archive f9bf72eb-4450-44a3-ae0e-f6328be40592-10.1.3.112-diag.gz     GZip Archive f9bf72eb-4450-44a3-ae0e-f6328be40592-10.1.3.119-diag.gz     GZip Archive f9bf72eb-4450-44a3-ae0e-f6328be40592-10.1.3.120-diag.gz     GZip Archive f9bf72eb-4450-44a3-ae0e-f6328be40592-10.1.3.121-diag.gz     GZip Archive f9bf72eb-4450-44a3-ae0e-f6328be40592-10.1.3.160-diag.gz    

 Description   
The failing test is:
rebalancetests.RebalanceInOutWithParallelLoad.test_load,get-logs:True,replica:2,num_nodes:7

[ns_server:info] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.15105.17>:ns_replicas_builder:kill_a_bunch_of_tap_names:209] Killed the following tap names on 'ns_1@10.1.3.109': [<<"replication_building_62_'ns_1@10.1.3.112'">>]
[ns_server:info] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.15104.17>:ns_single_vbucket_mover:mover_inner:88] Got exit message (parent is <0.13796.17>). Exiting...
{'EXIT',<0.15105.17>,{replicator_died,{'EXIT',<16541.21868.10>,normal}}}
[ns_server:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.15104.17>:ns_replicas_builder:sync_shutdown_many:147] Shutdown of the following failed: [{<0.15105.17>,
                                    {replicator_died,
                                     {'EXIT',<16541.21868.10>,normal}}}]
[error_logger:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.15105.17>
    registered_name: []
    exception exit: {replicator_died,{'EXIT',<16541.21868.10>,normal}}
      in function ns_replicas_builder:'-build_replicas_main/6-fun-0-'/1
      in call from ns_replicas_builder:observe_wait_all_done_tail/5
      in call from ns_replicas_builder:observe_wait_all_done/5
      in call from ns_replicas_builder:'-build_replicas_main/6-fun-1-'/8
      in call from ns_replicas_builder:try_with_maybe_ignorant_after/2
      in call from ns_replicas_builder:build_replicas_main/6
    ancestors: [<0.15104.17>,<0.13796.17>,<0.13763.17>]
    messages: [{'EXIT',<16541.21868.10>,normal}]
    links: [<0.15104.17>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 121393
    stack_size: 24
    reductions: 12423
  neighbours:

[ns_server:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.15104.17>:ns_replicas_builder:try_with_maybe_ignorant_after:68] Eating exception from ignorant after-block:
{error,{badmatch,[{<0.15105.17>,
                   {replicator_died,{'EXIT',<16541.21868.10>,normal}}}]},
       [{ns_replicas_builder,sync_shutdown_many,1},
        {ns_replicas_builder,try_with_maybe_ignorant_after,2},
        {ns_single_vbucket_mover,mover,6},
        {proc_lib,init_p_do_apply,3}]}
[rebalance:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.13796.17>:ns_vbucket_mover:handle_info:158] <0.15104.17> exited with {exited,
                             {'EXIT',<0.15105.17>,
                                 {replicator_died,
                                     {'EXIT',<16541.21868.10>,normal}}}}
[ns_server:info] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.9348.1>:ns_port_server:log:161] memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Schedule the backfill for vbucket 61
memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_VBUCKET_SET with vbucket 61 and state "pending"
memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 61
memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Backfill is completed with VBuckets 61,
memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_OPAQUE with command "close_backfill" and vbucket 61
memcached<0.9348.1>: Vbucket <61> is going dead.
memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_VBUCKET_SET with vbucket 61 and state "active"
memcached<0.9348.1>: TAP takeover is completed. Disconnecting tap stream <eq_tapq:rebalance_61>
memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Schedule the backfill for vbucket 62
memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 62

[error_logger:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.15104.17>
    registered_name: []
    exception exit: {exited,
                        {'EXIT',<0.15105.17>,
                            {replicator_died,
                                {'EXIT',<16541.21868.10>,normal}}}}
      in function ns_single_vbucket_mover:mover_inner/6
      in call from ns_replicas_builder:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.13796.17>,<0.13763.17>]
    messages: []
    links: [<0.13796.17>]
    dictionary: [{cleanup_list,[<0.15105.17>]}]
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 24
    reductions: 4014





 Comments   
Comment by Aleksey Kondratenko [ 21/May/12 ]
ep-engine closed the tap-producer side of the socket. We don't expect this for replica-building sockets. Here are the relevant logs from ebucketmigrator (run on the destination node):

[ns_server:info] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:init:181] Setting {"10.1.3.112",11209} vbucket 62 to state replica
[rebalance:debug] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:init:186] CheckpointIdsDict:
{dict,64,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[[0|1],[16|1],[32|1],[48|1]],
        [[3|1],[19|1],[35|1],[51|1]],
        [[6|1],[22|1],[38|1],[54|1]],
        [[9|1],[25|1],[41|1],[57|1]],
        [[12|1],[28|1],[44|1],[60|1]],
        [[15|1],[31|1],[47|1],[63|1]],
        [[2|1],[18|1],[34|1],[50|1]],
        [[5|1],[21|1],[37|1],[53|1]],
        [[8|1],[24|1],[40|1],[56|1]],
        [[11|1],[27|1],[43|1],[59|1]],
        [[14|1],[30|1],[46|1],[62|1]],
        [[1|1],[17|1],[33|1],[49|1]],
        [[4|1],[20|1],[36|1],[52|1]],
        [[7|1],[23|1],[39|1],[55|1]],
        [[10|1],[26|1],[42|1],[58|1]],
        [[13|1],[29|1],[45|1],[61|1]]}}}
[ns_server:debug] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:init:209] killing tap named: replication_building_62_'ns_1@10.1.3.112'
[rebalance:debug] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:init:247] upstream_sender pid: <0.21869.10>
[rebalance:info] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:process_upstream:447] Initial stream for vbucket 62
[rebalance:info] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:do_confirm_sent_messages:315] Got close ack!

And these are the logs from memcached on the source node (.109):

memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Schedule the backfill for vbucket 62
memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 62
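
To make this failure mode concrete, here is a minimal, hypothetical Python sketch (not the actual Erlang ebucketmigrator code) of why a producer-side socket close surfaces downstream as a "normal" exit, which ns_single_vbucket_mover then reports as replicator_died:

# Hypothetical sketch only -- the real component is the Erlang ebucketmigrator_srv.
# A consumer-side relay reads from the upstream TAP producer socket; when the
# producer (memcached/ep-engine) closes its side, recv() returns b'' and the
# relay stops cleanly. That is a "normal" exit from the relay's point of view,
# but the replica builder sees its replicator die before building finished and
# reports {replicator_died, {'EXIT', Pid, normal}}.
import socket

def relay_tap_stream(host, port):
    """Read the upstream TAP stream until the producer closes the socket."""
    with socket.create_connection((host, port)) as upstream:
        while True:
            chunk = upstream.recv(4096)
            if not chunk:
                # Producer closed the connection: a clean shutdown here, but a
                # premature termination from the replica builder's view.
                return "upstream_closed"
            # ...forward the chunk to the destination vbucket (omitted)...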
Comment by Aleksey Kondratenko [ 21/May/12 ]
Passing down to Chiyoung. See the message above.
Comment by Chiyoung Seo [ 21/May/12 ]
Karan,

Did you test it with 64 or 1024 vbuckets?

I'm running the same regression tests on my 6-VM cluster, but I would like to know whether I can set it to 64 vbuckets to reproduce this issue more quickly.
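
(For reference, the active vbucket count of a running cluster can be read from the bucket's vBucketServerMap over the REST API; a minimal Python sketch, with host, bucket name, and credentials as placeholder assumptions:)

# Minimal sketch: count the vbuckets by reading the bucket's vBucketServerMap
# over the REST API. The host, bucket name, and credentials are assumptions --
# substitute those of the cluster under test.
import base64
import json
import urllib.request

def vbucket_count(host="10.1.3.109", bucket="default",
                  user="Administrator", password="password"):
    req = urllib.request.Request(
        f"http://{host}:8091/pools/default/buckets/{bucket}")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        info = json.load(resp)
    return len(info["vBucketServerMap"]["vBucketMap"])

print(vbucket_count())  # expected: 64 or 1024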
Comment by Karan Kumar (Inactive) [ 21/May/12 ]
1024
Comment by Chiyoung Seo [ 22/May/12 ]
I was trying to reproduce this issue. Is the following command correct?

python testrunner -i my_vm.ini -t rebalancetests.RebalanceInOutWithParallelLoad.test_load,get-logs=True,replica=2,num_nodes=6 -p skip-cleanup=True
Comment by Karan Kumar (Inactive) [ 22/May/12 ]
Yes. Are you having issues?
Let me know.
Comment by Karan Kumar (Inactive) [ 07/Jun/12 ]
Hitting the issue on Windows as well, on 181-871-rel.
Attaching the logs.
Comment by Karan Kumar (Inactive) [ 07/Jun/12 ]
[ns_server:error] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30696.89>:ns_replicas_builder:build_replicas_main:99] Got premature exit from one of ebucketmigrators: {'EXIT',<16944.21895.27>,
                                                  normal}
[ns_server:info] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30696.89>:ns_replicas_builder:kill_a_bunch_of_tap_names:209] Killed the following tap names on 'ns_1@10.1.3.86': [<<"replication_building_291_'ns_1@10.1.3.87'">>,
                                                     <<"replication_building_291_'ns_1@10.1.3.85'">>]
[ns_server:info] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30695.89>:ns_single_vbucket_mover:mover_inner:88] Got exit message (parent is <0.25817.89>). Exiting...
{'EXIT',<0.30696.89>,{replicator_died,{'EXIT',<16944.21895.27>,normal}}}
[ns_server:error] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30695.89>:ns_replicas_builder:sync_shutdown_many:147] Shutdown of the following failed: [{<0.30696.89>,
                                    {replicator_died,
                                     {'EXIT',<16944.21895.27>,normal}}}]
[ns_server:error] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30695.89>:ns_replicas_builder:try_with_maybe_ignorant_after:68] Eating exception from ignorant after-block:
{error,{badmatch,[{<0.30696.89>,
                   {replicator_died,{'EXIT',<16944.21895.27>,normal}}}]},
       [{ns_replicas_builder,sync_shutdown_many,1},
        {ns_replicas_builder,try_with_maybe_ignorant_after,2},
        {ns_single_vbucket_mover,mover,6},
        {proc_lib,init_p_do_apply,3}]}


=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.30696.89>
    registered_name: []
    exception exit: {replicator_died,{'EXIT',<16944.21895.27>,normal}}
      in function ns_replicas_builder:'-build_replicas_main/6-fun-0-'/1
      in call from ns_replicas_builder:observe_wait_all_done_tail/5
      in call from ns_replicas_builder:observe_wait_all_done/5
      in call from ns_replicas_builder:'-build_replicas_main/6-fun-1-'/8
      in call from ns_replicas_builder:try_with_maybe_ignorant_after/2
      in call from ns_replicas_builder:build_replicas_main/6
    ancestors: [<0.30695.89>,<0.25817.89>,<0.25765.89>]
    messages: [{'EXIT',<16944.21895.27>,normal}]
    links: [<0.30695.89>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 514229
    stack_size: 24
    reductions: 53630
  neighbours:

=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.30764.89>
    registered_name: []
    exception exit: {exited,
                     {'EXIT',<0.25817.89>,
                      {exited,
                       {'EXIT',<0.30696.89>,
                        {replicator_died,{'EXIT',<16944.21895.27>,normal}}}}}}
      in function ns_single_vbucket_mover:mover_inner/6
      in call from ns_replicas_builder:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.25817.89>,<0.25765.89>]
    messages: [{'EXIT',<0.25817.89>,
                      {exited,
                          {'EXIT',<0.30696.89>,
                              {replicator_died,
                                  {'EXIT',<16944.21895.27>,normal}}}}}]
    links: [<0.25817.89>,<0.30766.89>]
    dictionary: [{cleanup_list,[<0.30765.89>]}]
    trap_exit: true
    status: running
    heap_size: 2584
    stack_size: 24
    reductions: 1767
  neighbours:
Comment by Karan Kumar (Inactive) [ 13/Jun/12 ]
Hitting this occasionally on 181-907-rel.
Comment by Chiyoung Seo [ 13/Jun/12 ]
Can you upload the latest diag logs?
Comment by Karan Kumar (Inactive) [ 13/Jun/12 ]
We have logs from June 7th.
Comment by Karan Kumar (Inactive) [ 14/Jun/12 ]
Hitting this on 181-910-rel (latest).

I will upload the logs.
@Chiyoung, does this look to be an issue similar to memcached dropping connections?

[ns_server:info] [2012-06-14 2:09:36] [ns_1@10.1.3.109:ns_port_memcached:ns_port_server:log:166] memcached<0.252.0>: Schedule cleanup of "eq_tapq:anon_23189"
memcached<0.252.0>: TAP (Producer) eq_tapq:replication_building_94_'ns_1@10.1.3.112' - Clear the tap queues by force
memcached<0.252.0>: TAP (Consumer) eq_tapq:anon_23186 - disconnected
memcached<0.252.0>: Schedule cleanup of "eq_tapq:anon_23186"
memcached<0.252.0>: TAP (Producer) eq_tapq:replication_ns_1@10.1.3.112 - disconnected, keep alive for 300 seconds
memcached<0.252.0>: TAP (Consumer) eq_tapq:anon_23187 - disconnected
memcached<0.252.0>: Schedule cleanup of "eq_tapq:anon_23187"

[ns_server:info] [2012-06-14 2:09:36] [ns_1@10.1.3.109:<0.19238.14>:ns_replicas_builder:kill_a_bunch_of_tap_names:219] Killed the following tap names on 'ns_1@10.1.3.112': [<<"replication_building_606_'ns_1@10.1.3.109'">>,
                                                      <<"replication_building_606_'ns_1@10.1.3.111'">>]
[ns_server:info] [2012-06-14 2:09:36] [ns_1@10.1.3.109:<0.19238.14>:ns_replicas_builder:build_replicas_main:143] Got exit: {'EXIT',<0.19237.14>,shutdown}
[ns_server:error] [2012-06-14 2:09:36] [ns_1@10.1.3.109:<0.19272.14>:ns_replicas_builder:build_replicas_main:109] Got premature exit from one of ebucketmigrators: {'EXIT',<16817.4853.4>,
                                                  normal}
Comment by Ketaki Gangal [ 15/Jun/12 ]
Seeing the same issue on swap rebalance on the 181-916 build.
Rebalance exited with reason {exited,
{'EXIT',<0.12083.14>,
{replicator_died,
{'EXIT',<15753.26224.12>,normal}}}}

Steps:
1. A large cluster (18 nodes) with 3 buckets (50M, 22M, and 0.5M items each).
2. Enable auto-failover.
3. Remove the master node (.105), add node .126, and rebalance (a REST-level sketch of this step follows below).

Output:
Rebalance fails with a "replicator_died" error.

Logs attached at:
https://s3.amazonaws.com/bugdb/jira/bug3-rebalance-181/bug3.tar
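
(For reference, a minimal Python sketch of the add-node/remove-node rebalance step over the REST API; the node addresses, otpNode names, and credentials below are assumptions, and the real test drives this through testrunner's helpers:)

# Minimal sketch of the swap-rebalance step over the REST API. knownNodes must
# list the otpNode name of every node in the cluster; only a few are shown here
# for brevity. All addresses and credentials are placeholder assumptions.
import base64
import urllib.parse
import urllib.request

BASE = "http://10.1.3.106:8091"   # any node that stays in the cluster
USER, PASSWORD = "Administrator", "password"

def post(path, params):
    req = urllib.request.Request(BASE + path,
                                 data=urllib.parse.urlencode(params).encode())
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return urllib.request.urlopen(req).read()

# 1. Add the incoming node (.126) to the cluster.
post("/controller/addNode",
     {"hostname": "10.1.3.126", "user": USER, "password": PASSWORD})

# 2. Rebalance, ejecting the outgoing master node (.105).
post("/controller/rebalance",
     {"knownNodes": "ns_1@10.1.3.105,ns_1@10.1.3.106,ns_1@10.1.3.126",
      "ejectedNodes": "ns_1@10.1.3.105"})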
Comment by Ketaki Gangal [ 15/Jun/12 ]
JIRA error: not able to delete the above comment. Please ignore.
Comment by Peter Wansch (Inactive) [ 05/Jul/12 ]
Let's see if we catch this in the 2.0 regression runs.
Comment by Andrei Baranouski [ 08/Aug/12 ]
Got the same error on build-1554 for this test:
./testrunner -i andrei.ini -t view.createdeleteview.CreateDeleteViewTests.rebalance_in_and_out_with_ddoc_ops,ddoc_ops=delete,test_with_view=True,num_ddocs=2,num_views_per_ddoc=3,items=2000

[error_logger:error] [2012-08-08 12:54:03] [ns_1@10.5.2.11:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.17517.2>
    registered_name: []
    exception error: {bulk_set_vbucket_state_failed,
                      [{'ns_1@10.5.2.19',
                        {'EXIT',
                         {{{normal,
                            {gen_server,call,
                             [<20615.11732.0>,
                              {set_state,[],"k",
                               [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,
                                18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,
                                33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
                                48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,
                                63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,
                                78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,
                                93,94,95,96,97,98,99,100,101,102,103,104,105,
                                106,108,109,110,111,112,113,114,115,116,117,
                                118,119,120,121,122,123,124,125,126,127]},
                              infinity]}},
                           {gen_server,call,


Logs attached.
Comment by Andrei Baranouski [ 08/Aug/12 ]
It seems to be another issue; I filed a separate bug: http://www.couchbase.com/issues/browse/MB-6160. Please skip the previous comment.
Comment by Aleksey Kondratenko [ 20/Jun/13 ]
Hm. Less than a year has passed, but I'm failing to locate the relevant commits.