[MB-7572] test_add_back_failed_node: Rebalance exited with reason {badmatch, [{<0.25911.52>, {{badmatch,[{<20234.16808.17>,killed}]}, [{misc,sync_shutdown_many_i_am_trapping_exits,1
Created: 22/Jan/13  Updated: 22/Apr/13  Resolved: 28/Jan/13

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0.1
Fix Version/s: 2.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
build 139
http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/170/consoleFull
./testrunner -i /tmp/rebalance_regression.ini get-logs=True,wait_timeout=100 -t swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=2,num-buckets=3,num-swap=1,keys-count=1000000,swap-orchestrator=False


steps:
[2013-01-21 08:18:48,775] - [swaprebalance:472] INFO - failover node ns_1@10.3.121.94 and rebalance afterwards
[2013-01-21 08:19:19,849] - [rest_client:767] INFO - fail_over successful
[2013-01-21 08:19:20,071] - [rest_client:795] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.121.94&user=Administrator&knownNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.98%2Cns_1%4010.3.121.96%2Cns_1%4010.3.121.93%2Cns_1%4010.3.121.97%2Cns_1%4010.3.121.95
[2013-01-21 08:19:20,161] - [rest_client:799] INFO - rebalance operation started
[2013-01-21 08:19:20,337] - [rest_client:894] INFO - rebalance percentage : 0 %
...
[2013-01-21 09:11:14,394] - [rest_client:894] INFO - rebalance percentage : 68.451909784 %
[2013-01-21 09:11:16,519] - [rest_client:879] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed




2013-01-21 09:38:12.596 ns_vbucket_mover:0:critical:message(ns_1@10.3.121.93) - <0.25863.52> exited with {badmatch,
                             [{<0.25911.52>,
                               {{badmatch,[{<20234.16808.17>,killed}]},
                                [{misc,
                                     sync_shutdown_many_i_am_trapping_exits,1},
                                 {misc,try_with_maybe_ignorant_after,2},
                                 {gen_server,terminate,6},
                                 {proc_lib,init_p_do_apply,3}]}}]}
2013-01-21 09:38:12.774 ns_orchestrator:2:info:message(ns_1@10.3.121.93) - Rebalance exited with reason {badmatch,
                              [{<0.25911.52>,
                                {{badmatch,[{<20234.16808.17>,killed}]},
                                 [{misc,
                                   sync_shutdown_many_i_am_trapping_exits,1},
                                  {misc,try_with_maybe_ignorant_after,2},
                                  {gen_server,terminate,6},
                                  {proc_lib,init_p_do_apply,3}]}}]}


 Comments   
Comment by Andrei Baranouski [ 22/Jan/13 ]
https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.92-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.93-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.94-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.95-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.96-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.97-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.98-8091-diag.txt.gz

Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Andrei,

Does this exact same test pass against 2.0?
Comment by Andrei Baranouski [ 22/Jan/13 ]
yes, sure
Comment by Aliaksey Artamonau [ 22/Jan/13 ]
The log file was rotated long after the rebalance failure, so the relevant entries are gone. Please reproduce and upload diags grabbed shortly after the failure.
Comment by Andrei Baranouski [ 23/Jan/13 ]
Sorry, Aliaksey.

The following logs should be OK:

https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121.92-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-93-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-94-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-95-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-96-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-97-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-98-diag.txt.gz

Comment by Aliaksey Artamonau [ 23/Jan/13 ]
So here's what happened.

Node .93 builds a replica for vbucket 635 on node .95:

[ns_server:debug,2013-01-21T9:37:00.815,ns_1@10.3.121.93:<0.25911.52>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 635 into 'ns_1@10.3.121.95' is <20234.16808.17>

Relevant messages from the .95:

[ns_server:info,2013-01-21T9:37:00.723,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:509]Setting {"10.3.121.95",11209} vbucket 635 to state replica
[ns_server:info,2013-01-21T9:37:00.815,ns_1@10.3.121.95:ns_port_memcached<0.15884.7>:ns_port_server:log:171]memcached<0.15884.7>: Mon Jan 21 09:37:00.613175 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.93 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0

[ns_server:debug,2013-01-21T9:37:00.869,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:kill_tapname:1005]killing tap named: replication_building_635_'ns_1@10.3.121.95'
[rebalance:info,2013-01-21T9:37:00.876,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:568]Starting tap stream:
[{vbuckets,[635]},
 {checkpoints,[{635,0}]},
 {name,<<"replication_building_635_'ns_1@10.3.121.95'">>},
 {takeover,false}]
{{"10.3.121.96",11209},
 {"10.3.121.95",11209},
 [{vbuckets,[635]},
  {takeover,false},
  {suffix,"building_635_'ns_1@10.3.121.95'"},
  {username,"bucket-0"},
  {password,"password"}]}
[rebalance:debug,2013-01-21T9:37:00.887,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:605]upstream_sender pid: <0.16809.17>
[rebalance:debug,2013-01-21T9:37:00.888,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:handle_call:336]Suspended had_backfill waiter
{had_backfill,undefined,undefined,
              [{<18136.25927.52>,#Ref<18136.0.225.131688>}]}
[rebalance:info,2013-01-21T9:37:00.889,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_upstream:932]Initial stream for vbucket 635
[rebalance:debug,2013-01-21T9:37:00.890,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_downstream:896]Replied had_backfill: true to [{<18136.25927.52>,#Ref<18136.0.225.131688>}]
[ns_server:info,2013-01-21T9:37:00.925,ns_1@10.3.121.95:<0.16810.17>:ebucketmigrator_srv:init:509]Setting {"10.3.121.95",11209} vbucket 811 to state replica
[ns_server:info,2013-01-21T9:37:01.094,ns_1@10.3.121.95:ns_port_memcached<0.15884.7>:ns_port_server:log:171]memcached<0.15884.7>: Mon Jan 21 09:37:00.890055 PST 3: TAP (Consumer) eq_tapq:anon_3574 - Reset vbucket 635 was completed succecssfully.

Then we see a TAP_OPAQUE_CLOSE_BACKFILL:

[ns_server:debug,2013-01-21T9:37:11.320,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_upstream:946]seen backfill-close message

Then we attempt to shut down the ebucketmigrator. As part of this, to confirm that the downstream has received all the messages we've sent, we send an opaque message carrying (last seen sequence number + 1) and wait until the downstream replies. The timeout for the entire sequence of actions is set to 30 seconds.

[ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:confirm_sent_messages:727]Going to wait for reception of opaque message ack
[ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.17215.17>:ebucketmigrator_srv:confirm_sent_messages:722]Sending opaque message to confirm downstream reception
[ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.17215.17>:ebucketmigrator_srv:confirm_sent_messages:724]Sent fine
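
For illustration, here is a minimal Erlang sketch of that confirm step (hypothetical module and message names; the real ebucketmigrator_srv:confirm_sent_messages exchanges the opaque and its ack with memcached over the downstream connection rather than with an Erlang process):

-module(confirm_sketch).
-export([confirm_sent/3]).

%% Sketch only: send a marker ("opaque") message carrying last-seen
%% sequence number + 1 and wait for the downstream to acknowledge it,
%% bounded by a timeout. The downstream is modeled here as a plain
%% Erlang process.
confirm_sent(Downstream, LastSeenSeqno, TimeoutMs) ->
    Opaque = LastSeenSeqno + 1,
    Downstream ! {opaque, self(), Opaque},      %% "Sent fine"
    receive
        {opaque_ack, Opaque} ->                 %% downstream reception confirmed
            ok
    after TimeoutMs ->
            {error, timeout}                    %% no ack within the allotted time
    end.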

And exactly 30 seconds later we see that the ebucketmigrator got killed because the timeout expired:

[ns_server:error,2013-01-21T9:37:52.271,ns_1@10.3.121.93:<0.25911.52>:misc:sync_shutdown_many_i_am_trapping_exits:1416]Shutdown of the following failed: [{<20234.16808.17>,killed}]

So the problem seems to be that memcached on .95 didn't manage to reply to us within 30 seconds.
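
For reference, a rough Erlang sketch of how that timeout turns into the badmatch in the rebalance exit reason (hypothetical module, names, and timeout handling; the real logic lives in misc:sync_shutdown_many_i_am_trapping_exits/1 and its callers): the caller, which is trapping exits, asks each linked worker to shut down, kills any worker that does not exit in time, and then expects every exit reason to be 'shutdown'.

-module(shutdown_sketch).
-export([sync_shutdown_many/2]).

%% Sketch only: assumes the caller is linked to every pid in Pids and has
%% process_flag(trap_exit, true) set, as the name of the real
%% misc:sync_shutdown_many_i_am_trapping_exits/1 implies.
sync_shutdown_many(Pids, TimeoutMs) ->
    [exit(P, shutdown) || P <- Pids],
    Reasons = [{P, wait_exit(P, TimeoutMs)} || P <- Pids],
    BadShutdowns = [PR || {_P, R} = PR <- Reasons, R =/= shutdown],
    %% A worker killed after the timeout shows up here as {Pid, killed},
    %% which is what "Shutdown of the following failed: ..." reports and
    %% what the rebalance ultimately exits with as a badmatch.
    [] = BadShutdowns,
    ok.

wait_exit(P, TimeoutMs) ->
    receive
        {'EXIT', P, Reason} -> Reason
    after TimeoutMs ->
            exit(P, kill),                                %% give up and kill it
            receive {'EXIT', P, Reason1} -> Reason1 end   %% reason becomes 'killed'
    end.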
Comment by Aliaksey Artamonau [ 23/Jan/13 ]
Please see my comment above, Chiyoung.
Comment by Dipti Borkar [ 24/Jan/13 ]
Per bug-scrub: is this a regression?
Comment by Chiyoung Seo [ 24/Jan/13 ]
Unfortunately, the memcached log was truncated. Can you upload the separate memcached log file from node .95?
Comment by Andrei Baranouski [ 25/Jan/13 ]
I don't have the logs anymore; the cluster has been cleaned up.
Comment by Farshid Ghods (Inactive) [ 28/Jan/13 ]
Andrei,

Did the test pass in the next iteration?
Comment by Andrei Baranouski [ 28/Jan/13 ]
this test passed against build 144
Comment by Farshid Ghods (Inactive) [ 28/Jan/13 ]
please reopen if the test fails again