Details
Description
Test failing
swaprebalance.SwapRebalanceBasicTests.do_test,replica=2,num-buckets=1,load-ratio=5,num-swap=3
This test swaps 3 nodes at the same time.
2012-04-11 23:05:45,621 - root - INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.76%2Cns_1%4010.1.3.75&user=Administrator&knownNodes=ns_1%4010.1.3.75%2Cns_1%4010.1.3.78%2Cns_1%4010.1.3.77%2Cns_1%4010.1.3.79%2Cns_1%4010.1.3.76%2Cns_1%4010.1.3.74
2012-04-11 23:05:45,627 - root - INFO - rebalance operation started
2012-04-11 23:05:47,637 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:49,643 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:51,652 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:53,658 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:55,665 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:57,672 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:59,680 - root - ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
2012-04-11 23:06:01,683 - root - ERROR - rebalance progress code : -1
FAIL
Rebalance exited with reason {{case_clause,
{badrpc,
{'EXIT',
{timeout,
{gen_server,call,
[<16840.28072.2>,
start_vbucket_filter_change]}}}}},
[{ns_vbm_sup,change_vbucket_filter,4},
{ns_vbm_sup,'-set_replicas/3-fun-2-',5},
{lists,foldl,3},
{ns_vbm_sup,set_replicas,3},
{ns_vbm_sup,'-set_replicas/2-fun-1-',3},
{lists,foreach,2},
{ns_vbm_sup,apply_changes,2},
{ns_vbucket_mover,sync_replicas,0}]}
I am unable to attach logs due toMB-5051
swaprebalance.SwapRebalanceBasicTests.do_test,replica=2,num-buckets=1,load-ratio=5,num-swap=3
This test swaps 3 nodes at the same time.
2012-04-11 23:05:45,621 - root - INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.76%2Cns_1%4010.1.3.75&user=Administrator&knownNodes=ns_1%4010.1.3.75%2Cns_1%4010.1.3.78%2Cns_1%4010.1.3.77%2Cns_1%4010.1.3.79%2Cns_1%4010.1.3.76%2Cns_1%4010.1.3.74
2012-04-11 23:05:45,627 - root - INFO - rebalance operation started
2012-04-11 23:05:47,637 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:49,643 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:51,652 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:53,658 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:55,665 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:57,672 - root - INFO - rebalance percentage : 0.0 %
2012-04-11 23:05:59,680 - root - ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
2012-04-11 23:06:01,683 - root - ERROR - rebalance progress code : -1
FAIL
Rebalance exited with reason {{case_clause,
{badrpc,
{'EXIT',
{timeout,
{gen_server,call,
[<16840.28072.2>,
start_vbucket_filter_change]}}}}},
[{ns_vbm_sup,change_vbucket_filter,4},
{ns_vbm_sup,'-set_replicas/3-fun-2-',5},
{lists,foldl,3},
{ns_vbm_sup,set_replicas,3},
{ns_vbm_sup,'-set_replicas/2-fun-1-',3},
{lists,foreach,2},
{ns_vbm_sup,apply_changes,2},
{ns_vbucket_mover,sync_replicas,0}]}
I am unable to attach logs due to
-
- 10.3.121.113-8091-diag.txt.gz
- 12/Apr/12 1:55 PM
- 31 kB
- Thuan Nguyen
-
- 10.3.121.114-8091-diag.txt.gz
- 12/Apr/12 1:55 PM
- 30 kB
- Thuan Nguyen
-
- 10.3.121.115-8091-diag.txt.gz
- 12/Apr/12 1:55 PM
- 30 kB
- Thuan Nguyen
-
- 10.3.121.116-8091-diag.txt.gz
- 12/Apr/12 1:55 PM
- 30 kB
- Thuan Nguyen
-
- 10.3.121.117-8091-diag.txt.gz
- 12/Apr/12 1:55 PM
- 30 kB
- Thuan Nguyen
-
- 10.3.121.118-8091-diag.txt.gz
- 15/Apr/12 2:43 PM
- 1.01 MB
- Thuan Nguyen
-
- 10.3.121.119-8091-diag.txt.gz
- 15/Apr/12 2:43 PM
- 821 kB
- Thuan Nguyen
-
- 10.3.121.120-8091-diag.txt.gz
- 15/Apr/12 2:43 PM
- 946 kB
- Thuan Nguyen
-
- 10.3.121.121-8091-diag.txt.gz
- 15/Apr/12 2:43 PM
- 844 kB
- Thuan Nguyen
-
- 10.3.121.122-8091-diag.txt.gz
- 15/Apr/12 2:43 PM
- 806 kB
- Thuan Nguyen
-
- node_74.log.gz
- 12/Apr/12 2:09 AM
- 2.82 MB
- Karan Kumar
-
- node_75.log.gz
- 12/Apr/12 2:09 AM
- 1.75 MB
- Karan Kumar
-
- node_76.log.gz
- 12/Apr/12 2:09 AM
- 1.73 MB
- Karan Kumar
-
- node_77.log.gz
- 12/Apr/12 2:09 AM
- 1.86 MB
- Karan Kumar
-
- node_78.log.gz
- 12/Apr/12 2:09 AM
- 1.83 MB
- Karan Kumar
-
- node_79.log.gz
- 12/Apr/12 2:09 AM
- 1.83 MB
- Karan Kumar
Activity
- All
- Comments
- Work Log
- History
- Activity
- Gerrit Reviews
Hide
Permalink
Karan Kumar
added a comment -
Attaching logs from cbbrowse_logs
Show
Karan Kumar
added a comment - Attaching logs from cbbrowse_logs
Hide
Karan Kumar
added a comment -
These logs dont seem to contain the rebalance failure described above.
Show
Karan Kumar
added a comment - These logs dont seem to contain the rebalance failure described above.
Hide
Aliaksey Artamonau
added a comment -
Log was taken too late and all the useful information is missing from it.
Show
Aliaksey Artamonau
added a comment - Log was taken too late and all the useful information is missing from it.
Hide
Thuan Nguyen
added a comment -
I can repo this bug in failovertests.FailoverTests.test_failover_stop_server,replica=2,load_ratio=10 test
Show
Thuan Nguyen
added a comment - I can repo this bug in failovertests.FailoverTests.test_failover_stop_server,replica=2,load_ratio=10 test
Show
Karan Kumar
added a comment - Aliaksey
Tony saw the same errors and was able to capture logs.
Show
Karan Kumar
added a comment - Logs are also present on MB-5069
Hide
Thuan Nguyen
added a comment -
test normal failover 3 replica on couchbase server 1.8.1r-764.
rebalance failed with error
error_logger:error] [2012-04-14 20:20:13] [ns_1@10.3.121.118:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_vbucket_mover:init/1
pid: <0.6953.0>
registered_name: []
exception exit: {{case_clause,
{badrpc,
{'EXIT',
{timeout,
{gen_server,call,
[<16476.6180.0>,start_vbucket_filter_change]}}}}},
[{ns_vbm_sup,change_vbucket_filter,4},
{ns_vbm_sup,'-set_replicas/3-fun-2-',5},
{lists,foldl,3},
{ns_vbm_sup,set_replicas,3},
{ns_vbm_sup,'-set_replicas/2-fun-1-',3},
{lists,foreach,2},
{ns_vbm_sup,apply_changes,2},
{ns_vbucket_mover,sync_replicas,0}]}
in function gen_server:terminate/6
ancestors: [<0.6910.0>]
messages: [{'EXIT',<0.7860.0>,normal}]
links: [<0.6910.0>]
dictionary: [{bucket_name,"default"},
{i_am_master_mover,true},
{actual_changes,126},
{total_changes,126},
rebalance failed with error
error_logger:error] [2012-04-14 20:20:13] [ns_1@10.3.121.118:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_vbucket_mover:init/1
pid: <0.6953.0>
registered_name: []
exception exit: {{case_clause,
{badrpc,
{'EXIT',
{timeout,
{gen_server,call,
[<16476.6180.0>,start_vbucket_filter_change]}}}}},
[{ns_vbm_sup,change_vbucket_filter,4},
{ns_vbm_sup,'-set_replicas/3-fun-2-',5},
{lists,foldl,3},
{ns_vbm_sup,set_replicas,3},
{ns_vbm_sup,'-set_replicas/2-fun-1-',3},
{lists,foreach,2},
{ns_vbm_sup,apply_changes,2},
{ns_vbucket_mover,sync_replicas,0}]}
in function gen_server:terminate/6
ancestors: [<0.6910.0>]
messages: [{'EXIT',<0.7860.0>,normal}]
links: [<0.6910.0>]
dictionary: [{bucket_name,"default"},
{i_am_master_mover,true},
{actual_changes,126},
{total_changes,126},
Show
Thuan Nguyen
added a comment - test normal failover 3 replica on couchbase server 1.8.1r-764.
rebalance failed with error
error_logger:error] [2012-04-14 20:20:13] [ ns_1@10.3.121.118 :error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_vbucket_mover:init/1
pid: <0.6953.0>
registered_name: []
exception exit: {{case_clause,
{badrpc,
{'EXIT',
{timeout,
{gen_server,call,
[<16476.6180.0>,start_vbucket_filter_change]}}}}},
[{ns_vbm_sup,change_vbucket_filter,4},
{ns_vbm_sup,'-set_replicas/3-fun-2-',5},
{lists,foldl,3},
{ns_vbm_sup,set_replicas,3},
{ns_vbm_sup,'-set_replicas/2-fun-1-',3},
{lists,foreach,2},
{ns_vbm_sup,apply_changes,2},
{ns_vbucket_mover,sync_replicas,0}]}
in function gen_server:terminate/6
ancestors: [<0.6910.0>]
messages: [{'EXIT',<0.7860.0>,normal}]
links: [<0.6910.0>]
dictionary: [{bucket_name,"default"},
{i_am_master_mover,true},
{actual_changes,126},
{total_changes,126},
Hide
Aleksey Kondratenko
added a comment -
Found that problematic ebucketmigrator wasn't actually doing anything because only vbucket it was configured to replicate was not yet ready to be replicated from. Thus it ignored vbucket_filter_change requests which caused timeout.
Genuine bug.
Genuine bug.
Show
Aleksey Kondratenko
added a comment - Found that problematic ebucketmigrator wasn't actually doing anything because only vbucket it was configured to replicate was not yet ready to be replicated from. Thus it ignored vbucket_filter_change requests which caused timeout.
Genuine bug.
Show
Aliaksey Artamonau
added a comment - http://review.couchbase.org/14970
Hide
Aleksey Kondratenko
added a comment -
I'm also concerned why this is happening. Here's what I'm seeing in logs.
As part of rebalancing vbucket 17 we're creating few replication_building_17_<hostname> taps. One of them is on .122
[rebalance:info] [2012-04-14 20:20:00] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:init:214] Starting tap stream:
[{vbuckets,[17]},
{checkpoints,[{17,0}]},
{name,"replication_building_17_'ns_1@10.3.121.122'"},
{takeover,false}]
[ns_server:debug] [2012-04-14 20:20:00] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:init:217] killing tap named: replication_building_17_'ns_1@10.3.121.122'
[rebalance:debug] [2012-04-14 20:20:00] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:init:234] upstream_sender pid: <0.6173.0>
[rebalance:info] [2012-04-14 20:20:00] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:process_upstream:431] Initial stream for vbucket 17
Then on master node (who also happens to be source) we're seeing end of backfill for all target nodes and successfully complete takeover. Here's evidence:
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:init:214] Starting tap stream:
[{vbuckets,[17]},{checkpoints,[{17,0}]},{name,"rebalance_17"},{takeover,true}]
[ns_server:debug] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:init:217] killing tap named: rebalance_17
[rebalance:debug] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:init:234] upstream_sender pid: <0.7318.0>
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:process_upstream:431] Initial stream for vbucket 17
[ns_server:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.244.0>:ns_port_server:log:161] memcached<0.244.0>: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.121 - Connection is re-established. Rollback unacked messages...
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.121 - 16, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_building_17_'ns_1@10.3.121.119' - 17, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_building_17_'ns_1@10.3.121.122' - 17, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_building_17_'ns_1@10.3.121.120' - 17, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:rebalance_17 - 17, fill is completed with VBuckets
memcached<0.244.0>: Vbucket <17> is going dead.
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:terminate:270] Skipping close ack for successfull takover
[ns_server:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7315.0>:ns_replicas_builder:kill_a_bunch_of_tap_names:207] Killed the following tap names on 'ns_1@10.3.121.118': [<<"replication_building_17_'ns_1@10.3.121.119'">>,
<<"replication_building_17_'ns_1@10.3.121.120'">>,
<<"replication_building_17_'ns_1@10.3.121.122'">>]
[ns_server:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7315.0>:ns_replicas_builder:build_replicas_main:131] Got exit: {'EXIT',<0.7314.0>,shutdown}
We can see how replication_building_17_'ns_1@10.3.121.122' complete successfully in logs of .122
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:do_confirm_sent_messages:302] Got close ack!
Then few moments later we're starting replicating vbucket 17 from .122 and find that it still has last closed checkpoint 0.
[error_logger:info] [2012-04-14 20:20:01] [ns_1@10.3.121.122:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,'ns_vbm_sup-default'}
started: [{pid,<0.6180.0>},
{name,{child_id,[17],'ns_1@10.3.121.118'}},
{mfargs,
{ebucketmigrator_srv,start_link,
[{"10.3.121.122",11209},
{"10.3.121.118",11209},
[{username,"default"},
{password,[]},
{vbuckets,[17]},
{takeover,false},
{suffix,"ns_1@10.3.121.118"}]]}},
{restart_type,permanent},
{shutdown,60000},
{child_type,worker}]
[rebalance:debug] [2012-04-14 20:20:01] [ns_1@10.3.121.122:<0.6180.0>:ebucketmigrator_srv:init:183] CheckpointIdsDict:
{dict,10,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[[16|1]],
[[3|1]],
[[6|1]],
[[9|1]],
[[12|1]],
[[15|1]],
[],[],
[[8|1]],
[[11|1]],
[],
[[17|0]],
[],[],
[[10|1]],
[]}}}
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.122:<0.6180.0>:ebucketmigrator_srv:init:194] Some vbuckets were not yet ready to replicate from:
[17]
So something is wrong and our attempt to reliably build replica as part of rebalance is clearly not working. But I'm pretty sure ns_server saw backfill_completed true for this tap stream and you can also see that in log messages from memcached. So looks like some race and maybe we need to do more to ensure backfill is _really_ done.
As part of rebalancing vbucket 17 we're creating few replication_building_17_<hostname> taps. One of them is on .122
[rebalance:info] [2012-04-14 20:20:00] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:init:214] Starting tap stream:
[{vbuckets,[17]},
{checkpoints,[{17,0}]},
{name,"replication_building_17_'ns_1@10.3.121.122'"},
{takeover,false}]
[ns_server:debug] [2012-04-14 20:20:00] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:init:217] killing tap named: replication_building_17_'ns_1@10.3.121.122'
[rebalance:debug] [2012-04-14 20:20:00] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:init:234] upstream_sender pid: <0.6173.0>
[rebalance:info] [2012-04-14 20:20:00] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:process_upstream:431] Initial stream for vbucket 17
Then on master node (who also happens to be source) we're seeing end of backfill for all target nodes and successfully complete takeover. Here's evidence:
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:init:214] Starting tap stream:
[{vbuckets,[17]},{checkpoints,[{17,0}]},{name,"rebalance_17"},{takeover,true}]
[ns_server:debug] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:init:217] killing tap named: rebalance_17
[rebalance:debug] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:init:234] upstream_sender pid: <0.7318.0>
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:process_upstream:431] Initial stream for vbucket 17
[ns_server:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.244.0>:ns_port_server:log:161] memcached<0.244.0>: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.121 - Connection is re-established. Rollback unacked messages...
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.121 - 16, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_building_17_'ns_1@10.3.121.119' - 17, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_building_17_'ns_1@10.3.121.122' - 17, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_building_17_'ns_1@10.3.121.120' - 17, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:rebalance_17 - 17, fill is completed with VBuckets
memcached<0.244.0>: Vbucket <17> is going dead.
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7317.0>:ebucketmigrator_srv:terminate:270] Skipping close ack for successfull takover
[ns_server:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7315.0>:ns_replicas_builder:kill_a_bunch_of_tap_names:207] Killed the following tap names on 'ns_1@10.3.121.118': [<<"replication_building_17_'ns_1@10.3.121.119'">>,
<<"replication_building_17_'ns_1@10.3.121.120'">>,
<<"replication_building_17_'ns_1@10.3.121.122'">>]
[ns_server:info] [2012-04-14 20:20:01] [ns_1@10.3.121.118:<0.7315.0>:ns_replicas_builder:build_replicas_main:131] Got exit: {'EXIT',<0.7314.0>,shutdown}
We can see how replication_building_17_'ns_1@10.3.121.122' complete successfully in logs of .122
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.122:<0.6172.0>:ebucketmigrator_srv:do_confirm_sent_messages:302] Got close ack!
Then few moments later we're starting replicating vbucket 17 from .122 and find that it still has last closed checkpoint 0.
[error_logger:info] [2012-04-14 20:20:01] [ns_1@10.3.121.122:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,'ns_vbm_sup-default'}
started: [{pid,<0.6180.0>},
{name,{child_id,[17],'ns_1@10.3.121.118'}},
{mfargs,
{ebucketmigrator_srv,start_link,
[{"10.3.121.122",11209},
{"10.3.121.118",11209},
[{username,"default"},
{password,[]},
{vbuckets,[17]},
{takeover,false},
{suffix,"ns_1@10.3.121.118"}]]}},
{restart_type,permanent},
{shutdown,60000},
{child_type,worker}]
[rebalance:debug] [2012-04-14 20:20:01] [ns_1@10.3.121.122:<0.6180.0>:ebucketmigrator_srv:init:183] CheckpointIdsDict:
{dict,10,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[[16|1]],
[[3|1]],
[[6|1]],
[[9|1]],
[[12|1]],
[[15|1]],
[],[],
[[8|1]],
[[11|1]],
[],
[[17|0]],
[],[],
[[10|1]],
[]}}}
[rebalance:info] [2012-04-14 20:20:01] [ns_1@10.3.121.122:<0.6180.0>:ebucketmigrator_srv:init:194] Some vbuckets were not yet ready to replicate from:
[17]
So something is wrong and our attempt to reliably build replica as part of rebalance is clearly not working. But I'm pretty sure ns_server saw backfill_completed true for this tap stream and you can also see that in log messages from memcached. So looks like some race and maybe we need to do more to ensure backfill is _really_ done.
Show
Aleksey Kondratenko
added a comment - I'm also concerned why this is happening. Here's what I'm seeing in logs.
As part of rebalancing vbucket 17 we're creating few replication_building_17_<hostname> taps. One of them is on .122
[rebalance:info] [2012-04-14 20:20:00] [ ns_1@10.3.121.122 :<0.6172.0>:ebucketmigrator_srv:init:214] Starting tap stream:
[{vbuckets,[17]},
{checkpoints,[{17,0}]},
{name,"replication_building_17_' ns_1@10.3.121.122 '"},
{takeover,false}]
[ns_server:debug] [2012-04-14 20:20:00] [ ns_1@10.3.121.122 :<0.6172.0>:ebucketmigrator_srv:init:217] killing tap named: replication_building_17_' ns_1@10.3.121.122 '
[rebalance:debug] [2012-04-14 20:20:00] [ ns_1@10.3.121.122 :<0.6172.0>:ebucketmigrator_srv:init:234] upstream_sender pid: <0.6173.0>
[rebalance:info] [2012-04-14 20:20:00] [ ns_1@10.3.121.122 :<0.6172.0>:ebucketmigrator_srv:process_upstream:431] Initial stream for vbucket 17
Then on master node (who also happens to be source) we're seeing end of backfill for all target nodes and successfully complete takeover. Here's evidence:
[rebalance:info] [2012-04-14 20:20:01] [ ns_1@10.3.121.118 :<0.7317.0>:ebucketmigrator_srv:init:214] Starting tap stream:
[{vbuckets,[17]},{checkpoints,[{17,0}]},{name,"rebalance_17"},{takeover,true}]
[ns_server:debug] [2012-04-14 20:20:01] [ ns_1@10.3.121.118 :<0.7317.0>:ebucketmigrator_srv:init:217] killing tap named: rebalance_17
[rebalance:debug] [2012-04-14 20:20:01] [ ns_1@10.3.121.118 :<0.7317.0>:ebucketmigrator_srv:init:234] upstream_sender pid: <0.7318.0>
[rebalance:info] [2012-04-14 20:20:01] [ ns_1@10.3.121.118 :<0.7317.0>:ebucketmigrator_srv:process_upstream:431] Initial stream for vbucket 17
[ns_server:info] [2012-04-14 20:20:01] [ ns_1@10.3.121.118 :<0.244.0>:ns_port_server:log:161] memcached<0.244.0>: TAP (Producer) eq_tapq: replication_ns_1@10.3.121.121 - Connection is re-established. Rollback unacked messages...
memcached<0.244.0>: TAP (Producer) eq_tapq: replication_ns_1@10.3.121.121 - 16, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_building_17_' ns_1@10.3.121.119 ' - 17, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_building_17_' ns_1@10.3.121.122 ' - 17, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:replication_building_17_' ns_1@10.3.121.120 ' - 17, fill is completed with VBuckets
memcached<0.244.0>: TAP (Producer) eq_tapq:rebalance_17 - 17, fill is completed with VBuckets
memcached<0.244.0>: Vbucket <17> is going dead.
[rebalance:info] [2012-04-14 20:20:01] [ ns_1@10.3.121.118 :<0.7317.0>:ebucketmigrator_srv:terminate:270] Skipping close ack for successfull takover
[ns_server:info] [2012-04-14 20:20:01] [ ns_1@10.3.121.118 :<0.7315.0>:ns_replicas_builder:kill_a_bunch_of_tap_names:207] Killed the following tap names on ' ns_1@10.3.121.118 ': [<<"replication_building_17_' ns_1@10.3.121.119 '">>,
<<"replication_building_17_' ns_1@10.3.121.120 '">>,
<<"replication_building_17_' ns_1@10.3.121.122 '">>]
[ns_server:info] [2012-04-14 20:20:01] [ ns_1@10.3.121.118 :<0.7315.0>:ns_replicas_builder:build_replicas_main:131] Got exit: {'EXIT',<0.7314.0>,shutdown}
We can see how replication_building_17_' ns_1@10.3.121.122 ' complete successfully in logs of .122
[rebalance:info] [2012-04-14 20:20:01] [ ns_1@10.3.121.122 :<0.6172.0>:ebucketmigrator_srv:do_confirm_sent_messages:302] Got close ack!
Then few moments later we're starting replicating vbucket 17 from .122 and find that it still has last closed checkpoint 0.
[error_logger:info] [2012-04-14 20:20:01] [ ns_1@10.3.121.122 :error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,'ns_vbm_sup-default'}
started: [{pid,<0.6180.0>},
{name,{child_id,[17],' ns_1@10.3.121.118 '}},
{mfargs,
{ebucketmigrator_srv,start_link,
[{"10.3.121.122",11209},
{"10.3.121.118",11209},
[{username,"default"},
{password,[]},
{vbuckets,[17]},
{takeover,false},
{suffix," ns_1@10.3.121.118 "}]]}},
{restart_type,permanent},
{shutdown,60000},
{child_type,worker}]
[rebalance:debug] [2012-04-14 20:20:01] [ ns_1@10.3.121.122 :<0.6180.0>:ebucketmigrator_srv:init:183] CheckpointIdsDict:
{dict,10,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[[16|1]],
[[3|1]],
[[6|1]],
[[9|1]],
[[12|1]],
[[15|1]],
[],[],
[[8|1]],
[[11|1]],
[],
[[17|0]],
[],[],
[[10|1]],
[]}}}
[rebalance:info] [2012-04-14 20:20:01] [ ns_1@10.3.121.122 :<0.6180.0>:ebucketmigrator_srv:init:194] Some vbuckets were not yet ready to replicate from:
[17]
So something is wrong and our attempt to reliably build replica as part of rebalance is clearly not working. But I'm pretty sure ns_server saw backfill_completed true for this tap stream and you can also see that in log messages from memcached. So looks like some race and maybe we need to do more to ensure backfill is _really_ done.
Hide
Thuan Nguyen
added a comment -
Integrated in github-ns-server-2-0 #333 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/333/])
do enter gen_server loop when no vbuckets are ready.MB-5052 (Revision 0a51875d97c1e427d974bf6434a06701034ba53a)
Result = SUCCESS
Aliaksey Artamonau :
Files :
* src/ebucketmigrator_srv.erl
do enter gen_server loop when no vbuckets are ready.
Result = SUCCESS
Aliaksey Artamonau :
Files :
* src/ebucketmigrator_srv.erl
Show
Thuan Nguyen
added a comment - Integrated in github-ns-server-2-0 #333 (See [ http://qa.hq.northscale.net/job/github-ns-server-2-0/333/ ])
do enter gen_server loop when no vbuckets are ready. MB-5052 (Revision 0a51875d97c1e427d974bf6434a06701034ba53a)
Result = SUCCESS
Aliaksey Artamonau :
Files :
* src/ebucketmigrator_srv.erl
Hide
Chiyoung Seo
added a comment -
Alk, Aliaksey,
Let me give you some clarifications on the following Alk's comment:
"Then few moments later we're starting replicating vbucket 17 from .122 and find that it still has last closed checkpoint 0."
After we build the replica vbucket on the replica nodes through backfill, the last closed checkpoint id on each new replica vbucket will be still 0 until it receives the open checkpoint id from the master.
QE team,
When you have this issue, please grab the checkpoint stats from each node as well.
Let me give you some clarifications on the following Alk's comment:
"Then few moments later we're starting replicating vbucket 17 from .122 and find that it still has last closed checkpoint 0."
After we build the replica vbucket on the replica nodes through backfill, the last closed checkpoint id on each new replica vbucket will be still 0 until it receives the open checkpoint id from the master.
QE team,
When you have this issue, please grab the checkpoint stats from each node as well.
Show
Chiyoung Seo
added a comment - Alk, Aliaksey,
Let me give you some clarifications on the following Alk's comment:
"Then few moments later we're starting replicating vbucket 17 from .122 and find that it still has last closed checkpoint 0."
After we build the replica vbucket on the replica nodes through backfill, the last closed checkpoint id on each new replica vbucket will be still 0 until it receives the open checkpoint id from the master.
QE team,
When you have this issue, please grab the checkpoint stats from each node as well.
Hide
Chiyoung Seo
added a comment -
Unfortunately, the master doesn't provide the stat that tells us if the open checkpoint id is notified to the replica node after backfill. However, as the master sends the open checkpoint id to the replica node immediately after backfill is completed, we should not expect much delay between these two events. Aliaksey's change could be a workaround for this issue, but we probably need to address this issue in a better way later.
Show
Chiyoung Seo
added a comment - Unfortunately, the master doesn't provide the stat that tells us if the open checkpoint id is notified to the replica node after backfill. However, as the master sends the open checkpoint id to the replica node immediately after backfill is completed, we should not expect much delay between these two events. Aliaksey's change could be a workaround for this issue, but we probably need to address this issue in a better way later.
Hide
Farshid Ghods
added a comment -
can we _please_ have ns_server collect this as part of diags process. or give QE a hook that we can call to enable and disable this feature.
i think it doesn't make sense to file an ep-engine bug without having these stats over time and we have filed many bugs which chiyoung have not been able to diagnose due to this issue
i think it doesn't make sense to file an ep-engine bug without having these stats over time and we have filed many bugs which chiyoung have not been able to diagnose due to this issue
Show
Farshid Ghods
added a comment - can we _please_ have ns_server collect this as part of diags process. or give QE a hook that we can call to enable and disable this feature.
i think it doesn't make sense to file an ep-engine bug without having these stats over time and we have filed many bugs which chiyoung have not been able to diagnose due to this issue
Hide
Chiyoung Seo
added a comment -
After discussion with Alk, the possible workaround for this issue is that the ns-server would poll checkpoint stats from each replica node after backfill completion, to make sure that each replica vbucket receives the open checkpoint id from the master.
This will add a little more overhead because of polling checkpoint stats after backfill, but I don't think there is much delay between backfill_completion and open checkpoint id propagation.
Alk, please assign this back to me if we see the same issue with this workaround.
This will add a little more overhead because of polling checkpoint stats after backfill, but I don't think there is much delay between backfill_completion and open checkpoint id propagation.
Alk, please assign this back to me if we see the same issue with this workaround.
Show
Chiyoung Seo
added a comment - After discussion with Alk, the possible workaround for this issue is that the ns-server would poll checkpoint stats from each replica node after backfill completion, to make sure that each replica vbucket receives the open checkpoint id from the master.
This will add a little more overhead because of polling checkpoint stats after backfill, but I don't think there is much delay between backfill_completion and open checkpoint id propagation.
Alk, please assign this back to me if we see the same issue with this workaround.
Hide
Aleksey Kondratenko
added a comment -
Farshid, in this particular case we have all evidence. But in general I agree.
Show
Aleksey Kondratenko
added a comment - Farshid, in this particular case we have all evidence. But in general I agree.
Hide
Aleksey Kondratenko
added a comment -
Original issue was closed. Duplicate backfills is at MB-5136 to lessen confusion.
Show
Aleksey Kondratenko
added a comment - Original issue was closed. Duplicate backfills is at MB-5136 to lessen confusion.