[MB-7572] test_add_back_failed_node: Rebalance exited with reason {badmatch, [{<0.25911.52>, {{badmatch,[{<20234.16808.17>,killed}]}, [{misc,sync_shutdown_many_i_am_trapping_exits,1 Created: 22/Jan/13 Updated: 22/Apr/13 Resolved: 28/Jan/13 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | 2.0.1 |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Andrei Baranouski | Assignee: | Andrei Baranouski |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
build 139
http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/170/consoleFull ./testrunner -i /tmp/rebalance_regression.ini get-logs=True,wait_timeout=100 -t swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=2,num-buckets=3,num-swap=1,keys-count=1000000,swap-orchestrator=False steps: [2013-01-21 08:18:48,775] - [swaprebalance:472] INFO - failover node ns_1@10.3.121.94 and rebalance afterwards [2013-01-21 08:19:19,849] - [rest_client:767] INFO - fail_over successful [2013-01-21 08:19:20,071] - [rest_client:795] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.121.94&user=Administrator&knownNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.98%2Cns_1%4010.3.121.96%2Cns_1%4010.3.121.93%2Cns_1%4010.3.121.97%2Cns_1%4010.3.121.95 [2013-01-21 08:19:20,161] - [rest_client:799] INFO - rebalance operation started [2013-01-21 08:19:20,337] - [rest_client:894] INFO - rebalance percentage : 0 % ... [2013-01-21 09:11:14,394] - [rest_client:894] INFO - rebalance percentage : 68.451909784 % [2013-01-21 09:11:16,519] - [rest_client:879] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed 2013-01-21 09:38:12.596 ns_vbucket_mover:0:critical:message(ns_1@10.3.121.93) - <0.25863.52> exited with {badmatch, [{<0.25911.52>, {{badmatch,[{<20234.16808.17>,killed}]}, [{misc, sync_shutdown_many_i_am_trapping_exits,1}, {misc,try_with_maybe_ignorant_after,2}, {gen_server,terminate,6}, {proc_lib,init_p_do_apply,3}]}}]} 2013-01-21 09:38:12.774 ns_orchestrator:2:info:message(ns_1@10.3.121.93) - Rebalance exited with reason {badmatch, [{<0.25911.52>, {{badmatch,[{<20234.16808.17>,killed}]}, [{misc, sync_shutdown_many_i_am_trapping_exits,1}, {misc,try_with_maybe_ignorant_after,2}, {gen_server,terminate,6}, {proc_lib,init_p_do_apply,3}]}}]} |
| Comments |
| Comment by Andrei Baranouski [ 22/Jan/13 ] |
|
https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.92-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.93-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.94-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.95-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.96-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.97-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.98-8091-diag.txt.gz |
| Comment by Farshid Ghods [ 22/Jan/13 ] |
|
andrei,
does this exact same test pass against 2.0 ? |
| Comment by Andrei Baranouski [ 22/Jan/13 ] |
| yes, sure |
| Comment by Aliaksey Artamonau [ 22/Jan/13 ] |
| Log file was rotated long past the rebalance failure. Please reproduce and upload diags grabbed shortly after the failure. |
| Comment by Andrei Baranouski [ 23/Jan/13 ] |
|
sorry, Aliaksey
following logs should be ok https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121.92-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-93-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-94-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-95-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-96-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-97-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-98-diag.txt.gz |
| Comment by Aliaksey Artamonau [ 23/Jan/13 ] |
|
So here what happened.
Node .93 builds a replica for vbucket 635 on node .95: [ns_server:debug,2013-01-21T9:37:00.815,ns_1@10.3.121.93:<0.25911.52>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 635 into 'ns_1@10.3.121.95' is <20234.16808.17> Relevant messages from the .95: [ns_server:info,2013-01-21T9:37:00.723,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:509]Setting {"10.3.121.95",11209} vbucket 635 to state replica [ns_server:info,2013-01-21T9:37:00.815,ns_1@10.3.121.95:ns_port_memcached<0.15884.7>:ns_port_server:log:171]memcached<0.15884.7>: Mon Jan 21 09:37:00.613175 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.93 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0 [ns_server:debug,2013-01-21T9:37:00.869,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:kill_tapname:1005]killing tap named: replication_building_635_'ns_1@10.3.121.95' [rebalance:info,2013-01-21T9:37:00.876,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:568]Starting tap stream: [{vbuckets,[635]}, {checkpoints,[{635,0}]}, {name,<<"replication_building_635_'ns_1@10.3.121.95'">>}, {takeover,false}] {{"10.3.121.96",11209}, {"10.3.121.95",11209}, [{vbuckets,[635]}, {takeover,false}, {suffix,"building_635_'ns_1@10.3.121.95'"}, {username,"bucket-0"}, {password,"password"}]} [rebalance:debug,2013-01-21T9:37:00.887,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:605]upstream_sender pid: <0.16809.17> [rebalance:debug,2013-01-21T9:37:00.888,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:handle_call:336]Suspended had_backfill waiter {had_backfill,undefined,undefined, [{<18136.25927.52>,#Ref<18136.0.225.131688>}]} [rebalance:info,2013-01-21T9:37:00.889,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_upstream:932]Initial stream for vbucket 635 [rebalance:debug,2013-01-21T9:37:00.890,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_downstream:896]Replied had_backfill: true to [{<18136.25927.52>,#Ref<18136.0.225.131688>}] [ns_server:info,2013-01-21T9:37:00.925,ns_1@10.3.121.95:<0.16810.17>:ebucketmigrator_srv:init:509]Setting {"10.3.121.95",11209} vbucket 811 to state replica [ns_server:info,2013-01-21T9:37:01.094,ns_1@10.3.121.95:ns_port_memcached<0.15884.7>:ns_port_server:log:171]memcached<0.15884.7>: Mon Jan 21 09:37:00.890055 PST 3: TAP (Consumer) eq_tapq:anon_3574 - Reset vbucket 635 was completed succecssfully. Then we see a TAP_OPAQUE_CLOSE_BACKFILL: [ns_server:debug,2013-01-21T9:37:11.320,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_upstream:946]seen backfill-close message Then we attempt to shutdown the ebucketmigrator. As part of this, to confirm all the messages that we've sent to downstream we send an opaque message with (last seen sequence number + 1) and wait until downstream replies. We set timeout to 30 seconds for the entire sequence of actions. [ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:confirm_sent_messages:727]Going to wait for reception of opaque message ack [ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.17215.17>:ebucketmigrator_srv:confirm_sent_messages:722]Sending opaque message to confirm downstream reception [ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.17215.17>:ebucketmigrator_srv:confirm_sent_messages:724]Sent fine And exactly 30 seconds later we see that the ebucketmigrator got killed since timeout expired: [ns_server:error,2013-01-21T9:37:52.271,ns_1@10.3.121.93:<0.25911.52>:misc:sync_shutdown_many_i_am_trapping_exits:1416]Shutdown of the following failed: [{<20234.16808.17>,killed}] So the problem seems to be that memcached on .95 didn't manage to reply to us in 30 seconds. |
| Comment by Aliaksey Artamonau [ 23/Jan/13 ] |
| Please see my comment above Chiyoung. |
| Comment by Dipti Borkar [ 24/Jan/13 ] |
| per bug-scrub: is this a regression? |
| Comment by Chiyoung Seo [ 24/Jan/13 ] |
| memcached log was truncated unfortunately. Can you upload the separated memcached log file from the node .95? |
| Comment by Andrei Baranouski [ 25/Jan/13 ] |
| don't have logs anymore. Cluster has been cleaned |
| Comment by Farshid Ghods [ 28/Jan/13 ] |
|
Andrei,
did test pass in the next iteration ? |
| Comment by Andrei Baranouski [ 28/Jan/13 ] |
| this test passed against 144 |
| Comment by Farshid Ghods [ 28/Jan/13 ] |
| please reopen if test fails again |