[MB-5343] memcached dropped connections: Rebalance failed due to replicator_died: exited (ns_single_vbucket_mover) Created: 21/May/12 Updated: 28/Jan/13 Resolved: 28/Jan/13 |
|
| Status: | Resolved |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket |
| Affects Version/s: | 1.8.1-release-candidate |
| Fix Version/s: | 2.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Critical |
| Reporter: | Karan Kumar | Assignee: | Frank Weigel |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 1.8.1-release-notes | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 64 bit
181-831-rel |
||
| Attachments: |
|
| Description |
|
Failing test is:-
rebalancetests.RebalanceInOutWithParallelLoad.test_load,get-logs:True,replica:2,num_nodes:7 [ns_server:info] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.15105.17>:ns_replicas_builder:kill_a_bunch_of_tap_names:209] Killed the following tap names on 'ns_1@10.1.3.109': [<<"replication_building_62_'ns_1@10.1.3.112'">>] [ns_server:info] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.15104.17>:ns_single_vbucket_mover:mover_inner:88] Got exit message (parent is <0.13796.17>). Exiting... {'EXIT',<0.15105.17>,{replicator_died,{'EXIT',<16541.21868.10>,normal}}} [ns_server:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.15104.17>:ns_replicas_builder:sync_shutdown_many:147] Shutdown of the following failed: [{<0.15105.17>, {replicator_died, {'EXIT',<16541.21868.10>,normal}}}] [error_logger:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:error_logger:ale_error_logger_handler:log_report:72] =========================CRASH REPORT========================= crasher: initial call: erlang:apply/2 pid: <0.15105.17> registered_name: [] exception exit: {replicator_died,{'EXIT',<16541.21868.10>,normal}} in function ns_replicas_builder:'-build_replicas_main/6-fun-0-'/1 in call from ns_replicas_builder:observe_wait_all_done_tail/5 in call from ns_replicas_builder:observe_wait_all_done/5 in call from ns_replicas_builder:'-build_replicas_main/6-fun-1-'/8 in call from ns_replicas_builder:try_with_maybe_ignorant_after/2 in call from ns_replicas_builder:build_replicas_main/6 ancestors: [<0.15104.17>,<0.13796.17>,<0.13763.17>] messages: [{'EXIT',<16541.21868.10>,normal}] links: [<0.15104.17>] dictionary: [] trap_exit: true status: running heap_size: 121393 stack_size: 24 reductions: 12423 neighbours: [ns_server:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.15104.17>:ns_replicas_builder:try_with_maybe_ignorant_after:68] Eating exception from ignorant after-block: {error,{badmatch,[{<0.15105.17>, {replicator_died,{'EXIT',<16541.21868.10>,normal}}}]}, [{ns_replicas_builder,sync_shutdown_many,1}, {ns_replicas_builder,try_with_maybe_ignorant_after,2}, {ns_single_vbucket_mover,mover,6}, {proc_lib,init_p_do_apply,3}]} [rebalance:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.13796.17>:ns_vbucket_mover:handle_info:158] <0.15104.17> exited with {exited, {'EXIT',<0.15105.17>, {replicator_died, {'EXIT',<16541.21868.10>,normal}}}} [ns_server:info] [2012-05-21 8:17:31] [ns_1@10.1.3.109:<0.9348.1>:ns_port_server:log:161] memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Schedule the backfill for vbucket 61 memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0 memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0 memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_VBUCKET_SET with vbucket 61 and state "pending" memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 61 memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Backfill is completed with VBuckets 61, memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_OPAQUE with command "close_backfill" and vbucket 61 memcached<0.9348.1>: Vbucket <61> is going dead. memcached<0.9348.1>: TAP (Producer) eq_tapq:rebalance_61 - Sending TAP_VBUCKET_SET with vbucket 61 and state "active" memcached<0.9348.1>: TAP takeover is completed. Disconnecting tap stream <eq_tapq:rebalance_61> memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Schedule the backfill for vbucket 62 memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0 memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0 memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 62 [error_logger:error] [2012-05-21 8:17:31] [ns_1@10.1.3.109:error_logger:ale_error_logger_handler:log_report:72] =========================CRASH REPORT========================= crasher: initial call: ns_single_vbucket_mover:mover/6 pid: <0.15104.17> registered_name: [] exception exit: {exited, {'EXIT',<0.15105.17>, {replicator_died, {'EXIT',<16541.21868.10>,normal}}}} in function ns_single_vbucket_mover:mover_inner/6 in call from ns_replicas_builder:try_with_maybe_ignorant_after/2 in call from ns_single_vbucket_mover:mover/6 ancestors: [<0.13796.17>,<0.13763.17>] messages: [] links: [<0.13796.17>] dictionary: [{cleanup_list,[<0.15105.17>]}] trap_exit: true status: running heap_size: 987 stack_size: 24 reductions: 4014 |
| Comments |
| Comment by Aleksey Kondratenko [ 21/May/12 ] |
|
ep-engine closed tap-producer side of socket. We don't expect this for replica building sockets. Here are relevant logs from ebucketmigrator (run on destination node):
[ns_server:info] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:init:181] Setting {"10.1.3.112",11209} vbucket 62 to state replica [rebalance:debug] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:init:186] CheckpointIdsDict: {dict,64,16,16,8,80,48, {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}, {{[[0|1],[16|1],[32|1],[48|1]], [[3|1],[19|1],[35|1],[51|1]], [[6|1],[22|1],[38|1],[54|1]], [[9|1],[25|1],[41|1],[57|1]], [[12|1],[28|1],[44|1],[60|1]], [[15|1],[31|1],[47|1],[63|1]], [[2|1],[18|1],[34|1],[50|1]], [[5|1],[21|1],[37|1],[53|1]], [[8|1],[24|1],[40|1],[56|1]], [[11|1],[27|1],[43|1],[59|1]], [[14|1],[30|1],[46|1],[62|1]], [[1|1],[17|1],[33|1],[49|1]], [[4|1],[20|1],[36|1],[52|1]], [[7|1],[23|1],[39|1],[55|1]], [[10|1],[26|1],[42|1],[58|1]], [[13|1],[29|1],[45|1],[61|1]]}}} [ns_server:debug] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:init:209] killing tap named: replication_building_62_'ns_1@10.1.3.112' [rebalance:debug] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:init:247] upstream_sender pid: <0.21869.10> [rebalance:info] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:process_upstream:447] Initial stream for vbucket 62 [rebalance:info] [2012-05-21 8:17:31] [ns_1@10.1.3.112:<0.21868.10>:ebucketmigrator_srv:do_confirm_sent_messages:315] Got close ack! And this is logs from memcached on source node (.109) memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Schedule the backfill for vbucket 62 memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0 memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0 memcached<0.9348.1>: TAP (Producer) eq_tapq:replication_building_62_'ns_1@10.1.3.112' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 62 |
| Comment by Aleksey Kondratenko [ 21/May/12 ] |
| Passing down to Chiyoung. See message above |
| Comment by Chiyoung Seo [ 21/May/12 ] |
|
Karan,
Did you test it with 64 or 1024 vbuckets? I'm running the same regression tests on my 6VM cluster, but would like to know if I can set it to 64 vbuckets to reproduce this issue more quickly. |
| Comment by Karan Kumar [ 21/May/12 ] |
| 1024 |
| Comment by Chiyoung Seo [ 22/May/12 ] |
|
I was trying to reproduce this issue. Is the following command correct?
python testrunner -i my_vm.ini -t rebalancetests.RebalanceInOutWithParallelLoad.test_load,get-logs=True,replica=2,num_nodes=6 -p skip-cleanup=True |
| Comment by Karan Kumar [ 22/May/12 ] |
|
yes. Are you having issues?
Let me know. |
| Comment by Karan Kumar [ 07/Jun/12 ] |
|
Hitting the issue on windows as well. 181-871-rel
Attaching the logs. |
| Comment by Karan Kumar [ 07/Jun/12 ] |
|
[ns_server:error] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30696.89>:ns_replicas_builder:build_replicas_main:99] Got premature exit from one of ebucketmigrators: {'EXIT',<16944.21895.27>,
normal} [ns_server:info] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30696.89>:ns_replicas_builder:kill_a_bunch_of_tap_names:209] Killed the following tap names on 'ns_1@10.1.3.86': [<<"replication_building_291_'ns_1@10.1.3.87'">>, <<"replication_building_291_'ns_1@10.1.3.85'">>] [ns_server:info] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30695.89>:ns_single_vbucket_mover:mover_inner:88] Got exit message (parent is <0.25817.89>). Exiting... {'EXIT',<0.30696.89>,{replicator_died,{'EXIT',<16944.21895.27>,normal}}} [ns_server:error] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30695.89>:ns_replicas_builder:sync_shutdown_many:147] Shutdown of the following failed: [{<0.30696.89>, {replicator_died, {'EXIT',<16944.21895.27>,normal}}}] [ns_server:error] [2012-06-07 1:44:48] [ns_1@10.1.3.82:<0.30695.89>:ns_replicas_builder:try_with_maybe_ignorant_after:68] Eating exception from ignorant after-block: {error,{badmatch,[{<0.30696.89>, {replicator_died,{'EXIT',<16944.21895.27>,normal}}}]}, [{ns_replicas_builder,sync_shutdown_many,1}, {ns_replicas_builder,try_with_maybe_ignorant_after,2}, {ns_single_vbucket_mover,mover,6}, {proc_lib,init_p_do_apply,3}]} =========================CRASH REPORT========================= crasher: initial call: erlang:apply/2 pid: <0.30696.89> registered_name: [] exception exit: {replicator_died,{'EXIT',<16944.21895.27>,normal}} in function ns_replicas_builder:'-build_replicas_main/6-fun-0-'/1 in call from ns_replicas_builder:observe_wait_all_done_tail/5 in call from ns_replicas_builder:observe_wait_all_done/5 in call from ns_replicas_builder:'-build_replicas_main/6-fun-1-'/8 in call from ns_replicas_builder:try_with_maybe_ignorant_after/2 in call from ns_replicas_builder:build_replicas_main/6 ancestors: [<0.30695.89>,<0.25817.89>,<0.25765.89>] messages: [{'EXIT',<16944.21895.27>,normal}] links: [<0.30695.89>] dictionary: [] trap_exit: true status: running heap_size: 514229 stack_size: 24 reductions: 53630 neighbours: =========================CRASH REPORT========================= crasher: initial call: ns_single_vbucket_mover:mover/6 pid: <0.30764.89> registered_name: [] exception exit: {exited, {'EXIT',<0.25817.89>, {exited, {'EXIT',<0.30696.89>, {replicator_died,{'EXIT',<16944.21895.27>,normal}}}}}} in function ns_single_vbucket_mover:mover_inner/6 in call from ns_replicas_builder:try_with_maybe_ignorant_after/2 in call from ns_single_vbucket_mover:mover/6 ancestors: [<0.25817.89>,<0.25765.89>] messages: [{'EXIT',<0.25817.89>, {exited, {'EXIT',<0.30696.89>, {replicator_died, {'EXIT',<16944.21895.27>,normal}}}}}] links: [<0.25817.89>,<0.30766.89>] dictionary: [{cleanup_list,[<0.30765.89>]}] trap_exit: true status: running heap_size: 2584 stack_size: 24 reductions: 1767 neighbours: |
| Comment by Karan Kumar [ 13/Jun/12 ] |
| Hitting this occasionally on 181-907-rel |
| Comment by Chiyoung Seo [ 13/Jun/12 ] |
| Can you upload the latest diag logs? |
| Comment by Karan Kumar [ 13/Jun/12 ] |
| We have logs from June 7th. |
| Comment by Karan Kumar [ 14/Jun/12 ] |
|
Hitting this on 181-910-rel (latest)
I will upload the logs. @Chiyoung, does it look to be a issue similar to memcached dropping connection? [ns_server:info] [2012-06-14 2:09:36] [ns_1@10.1.3.109:ns_port_memcached:ns_port_server:log:166] memcached<0.252.0>: Schedule cleanup of "eq_tapq:anon_23189" memcached<0.252.0>: TAP (Producer) eq_tapq:replication_building_94_'ns_1@10.1.3.112' - Clear the tap queues by force memcached<0.252.0>: TAP (Consumer) eq_tapq:anon_23186 - disconnected memcached<0.252.0>: Schedule cleanup of "eq_tapq:anon_23186" memcached<0.252.0>: TAP (Producer) eq_tapq:replication_ns_1@10.1.3.112 - disconnected, keep alive for 300 seconds memcached<0.252.0>: TAP (Consumer) eq_tapq:anon_23187 - disconnected memcached<0.252.0>: Schedule cleanup of "eq_tapq:anon_23187" [ns_server:info] [2012-06-14 2:09:36] [ns_1@10.1.3.109:<0.19238.14>:ns_replicas_builder:kill_a_bunch_of_tap_names:219] Killed the following tap names on 'ns_1@10.1.3.112': [<<"replication_building_606_'ns_1@10.1.3.109'">>, <<"replication_building_606_'ns_1@10.1.3.111'">>] [ns_server:info] [2012-06-14 2:09:36] [ns_1@10.1.3.109:<0.19238.14>:ns_replicas_builder:build_replicas_main:143] Got exit: {'EXIT',<0.19237.14>,shutdown} [ns_server:error] [2012-06-14 2:09:36] [ns_1@10.1.3.109:<0.19272.14>:ns_replicas_builder:build_replicas_main:109] Got premature exit from one of ebucketmigrators: {'EXIT',<16817.4853.4>, normal} |
| Comment by Ketaki Gangal [ 15/Jun/12 ] |
|
Seeing same issue on Swap-Rebalance on 181-916 build.
Rebalance exited with reason {exited, {'EXIT',<0.12083.14>, {replicator_died, {'EXIT',<15753.26224.12>,normal}}}} Steps: 1.A large cluster (18 nodes), with 3 buckets ( 50M , 22M, 0.5 M items each). 2. Enable auto-failover 3. Remove master-node -105, add node 126 and Rebalance Output Rebalance fails with "replicator_died" error. Logs attached at - https://s3.amazonaws.com/bugdb/jira/bug3-rebalance-181/bug3.tar |
| Comment by Ketaki Gangal [ 15/Jun/12 ] |
| jira-error : Not able to delete above comment- Please ignore. |
| Comment by Peter Wansch [ 05/Jul/12 ] |
| Let's see if we catch in 2.0 regression runs. |
| Comment by Andrei Baranouski [ 08/Aug/12 ] |
|
got the same error on build-1554 for test:
./testrunner -i andrei.ini -t view.createdeleteview.CreateDeleteViewTests.rebalance_in_and_out_with_ddoc_ops,ddoc_ops=delete,test_with_view=True,num_ddocs=2,num_views_per_ddoc=3,items=2000 [error_logger:error] [2012-08-08 12:54:03] [ns_1@10.5.2.11:error_logger:ale_error_logger_handler:log_report:72] =========================CRASH REPORT========================= crasher: initial call: ns_single_vbucket_mover:mover/6 pid: <0.17517.2> registered_name: [] exception error: {bulk_set_vbucket_state_failed, [{'ns_1@10.5.2.19', {'EXIT', {{{normal, {gen_server,call, [<20615.11732.0>, {set_state,[],"k", [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17, 18,19,20,21,22,23,24,25,26,27,28,29,30,31,32, 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47, 48,49,50,51,52,53,54,55,56,57,58,59,60,61,62, 63,64,65,66,67,68,69,70,71,72,73,74,75,76,77, 78,79,80,81,82,83,84,85,86,87,88,89,90,91,92, 93,94,95,96,97,98,99,100,101,102,103,104,105, 106,108,109,110,111,112,113,114,115,116,117, 118,119,120,121,122,123,124,125,126,127]}, infinity]}}, {gen_server,call, attached logs |
| Comment by Andrei Baranouski [ 08/Aug/12 ] |
| It seems like it's another issue, fill a separate bug http://www.couchbase.com/issues/browse/MB-6160 . please, skip previous comment |