Details
Description
Rebalancing from a 2 to a 3 node cluster. Started rebalance and allowed it to run for quite a while. Stopped the rebalance process. Restarted the process and allowed it to run until it was very, very close to completion. Stopped it a second time. Now when attempting to restart to finish up, getting multiple errors in logs. Diagnostics attached.
-
- ns-diag-20110103134811.txt
- 03/Jan/11 3:50 PM
- 5.60 MB
- James Phillips
-
Hide
- ns-diag-20110105101559.zip
- 05/Jan/11 12:18 PM
- 4.45 MB
- James Phillips
-
- ns-diag-20110105101559.txt 40.95 MB
-
- Capture.GIF
- 353 kB
- 03/Jan/11 3:57 PM
Activity
- All
- Comments
- Work Log
- History
- Activity
- Gerrit Reviews
Hide
Permalink
Thuan Nguyen
added a comment -
Tested on build basestar-191-g79a0550 on ubuntu 10.04 64bit and repeat step by step as in bug report. I can not reproduce this bug. So I think this issue is fixed. I will close this bug.
Show
Thuan Nguyen
added a comment - Tested on build basestar-191-g79a0550 on ubuntu 10.04 64bit and repeat step by step as in bug report. I can not reproduce this bug. So I think this issue is fixed. I will close this bug.
Hide
Sharon Barr
added a comment -
Tony,
Please try and reproduce this on 1.7 and close this bug if it is not reproducible.
Please try and reproduce this on 1.7 and close this bug if it is not reproducible.
Show
Sharon Barr
added a comment - Tony,
Please try and reproduce this on 1.7 and close this bug if it is not reproducible.
Hide
Chiyoung Seo
added a comment -
We reduced the time spent by vbucket deletion from memory hashtable by avoiding the hashtable visit during vbucket deletion (http://review.membase.org/4251).
We observed that the aggregated execution time of all steps performed in a single vbucket deletion on Windows is always less than 2 sec, but the wall clock execution time of a single vbucket deletion sometimes takes more than 5 sec under heavy CPU and memory usage, but much less frequently with the above fix . It seems to me that this execution time delay is related to the Windows scheduler.
If a rebalance fails in this scenario, simply restarting it resolves the failure.
We will investigate the above scheduling-related issue again soon.
We observed that the aggregated execution time of all steps performed in a single vbucket deletion on Windows is always less than 2 sec, but the wall clock execution time of a single vbucket deletion sometimes takes more than 5 sec under heavy CPU and memory usage, but much less frequently with the above fix . It seems to me that this execution time delay is related to the Windows scheduler.
If a rebalance fails in this scenario, simply restarting it resolves the failure.
We will investigate the above scheduling-related issue again soon.
Show
Chiyoung Seo
added a comment - We reduced the time spent by vbucket deletion from memory hashtable by avoiding the hashtable visit during vbucket deletion ( http://review.membase.org/4251 ).
We observed that the aggregated execution time of all steps performed in a single vbucket deletion on Windows is always less than 2 sec, but the wall clock execution time of a single vbucket deletion sometimes takes more than 5 sec under heavy CPU and memory usage, but much less frequently with the above fix . It seems to me that this execution time delay is related to the Windows scheduler.
If a rebalance fails in this scenario, simply restarting it resolves the failure.
We will investigate the above scheduling-related issue again soon.
Hide
Chiyoung Seo
added a comment -
It seems to me that this rebalance failure is caused by 1) quite slow vbucket deletion in memory hashtable and 2) frequent TAP ack timeout.
All of vbucket migrator errors were caused by TAP ack timeout events that consequently killed vbucket migrator processes. I observed that this frequent ack timeout issue also caused the slow replication problem in 1.6.5. Trond is current working on resolving this timeout issue.
Regarding the slow vbucket deletion in memory hashtable, I couldn't reproduce it on my Linux testing machines, but the timing stat from one of James's Windows machines shows that deleting a single vbucket from memory hashtable sometimes takes more than 2 sec:
del_vb_cmd (1415 total)
0 - 1us : ( 46.08%) 652 #########################################
8ms - 16ms : ( 50.32%) 60 ###
16ms - 32ms : ( 51.59%) 18 #
32ms - 65ms : ( 53.99%) 34 ##
65ms - 131ms : ( 56.68%) 38 ##
131ms - 262ms : ( 62.26%) 79 #####
262ms - 524ms : ( 70.60%) 118 #######
524ms - 1s : ( 78.37%) 110 ######
1s - 2s : ( 91.87%) 191 ############
2s - 4s : ( 98.09%) 88 #####
4s - 8s : ( 99.93%) 26 #
8s - 16s : (100.00%) 1
As Sean mentioned, this is the reason why the ns server got many vbucket deletion timeout events from the membase server. Note that vbucket deletion in disk is scheduled and executed asynchronously. I'm currently investigating this issue...
All of vbucket migrator errors were caused by TAP ack timeout events that consequently killed vbucket migrator processes. I observed that this frequent ack timeout issue also caused the slow replication problem in 1.6.5. Trond is current working on resolving this timeout issue.
Regarding the slow vbucket deletion in memory hashtable, I couldn't reproduce it on my Linux testing machines, but the timing stat from one of James's Windows machines shows that deleting a single vbucket from memory hashtable sometimes takes more than 2 sec:
del_vb_cmd (1415 total)
0 - 1us : ( 46.08%) 652 #########################################
8ms - 16ms : ( 50.32%) 60 ###
16ms - 32ms : ( 51.59%) 18 #
32ms - 65ms : ( 53.99%) 34 ##
65ms - 131ms : ( 56.68%) 38 ##
131ms - 262ms : ( 62.26%) 79 #####
262ms - 524ms : ( 70.60%) 118 #######
524ms - 1s : ( 78.37%) 110 ######
1s - 2s : ( 91.87%) 191 ############
2s - 4s : ( 98.09%) 88 #####
4s - 8s : ( 99.93%) 26 #
8s - 16s : (100.00%) 1
As Sean mentioned, this is the reason why the ns server got many vbucket deletion timeout events from the membase server. Note that vbucket deletion in disk is scheduled and executed asynchronously. I'm currently investigating this issue...
Show
Chiyoung Seo
added a comment - It seems to me that this rebalance failure is caused by 1) quite slow vbucket deletion in memory hashtable and 2) frequent TAP ack timeout.
All of vbucket migrator errors were caused by TAP ack timeout events that consequently killed vbucket migrator processes. I observed that this frequent ack timeout issue also caused the slow replication problem in 1.6.5. Trond is current working on resolving this timeout issue.
Regarding the slow vbucket deletion in memory hashtable, I couldn't reproduce it on my Linux testing machines, but the timing stat from one of James's Windows machines shows that deleting a single vbucket from memory hashtable sometimes takes more than 2 sec:
del_vb_cmd (1415 total)
0 - 1us : ( 46.08%) 652 #########################################
8ms - 16ms : ( 50.32%) 60 ###
16ms - 32ms : ( 51.59%) 18 #
32ms - 65ms : ( 53.99%) 34 ##
65ms - 131ms : ( 56.68%) 38 ##
131ms - 262ms : ( 62.26%) 79 #####
262ms - 524ms : ( 70.60%) 118 #######
524ms - 1s : ( 78.37%) 110 ######
1s - 2s : ( 91.87%) 191 ############
2s - 4s : ( 98.09%) 88 #####
4s - 8s : ( 99.93%) 26 #
8s - 16s : (100.00%) 1
As Sean mentioned, this is the reason why the ns server got many vbucket deletion timeout events from the membase server. Note that vbucket deletion in disk is scheduled and executed asynchronously. I'm currently investigating this issue...
Hide
James Phillips (Inactive)
added a comment -
more information...
As noted above, I left the cluster alone for the last couple days in case someone wanted to take a look. Today I came in and looked at it and there were a bunch of vbucket migrator errors being continuously reported in the log. I started a rebalance again. This time it seemed to start, but then there were more errors. See attached diag.
As noted above, I left the cluster alone for the last couple days in case someone wanted to take a look. Today I came in and looked at it and there were a bunch of vbucket migrator errors being continuously reported in the log. I started a rebalance again. This time it seemed to start, but then there were more errors. See attached diag.
Show
James Phillips (Inactive)
added a comment - more information...
As noted above, I left the cluster alone for the last couple days in case someone wanted to take a look. Today I came in and looked at it and there were a bunch of vbucket migrator errors being continuously reported in the log. I started a rebalance again. This time it seemed to start, but then there were more errors. See attached diag.
Hide
Sean Lynch
added a comment -
The delete command timed out talking to memcached, after 5 seconds. We can increase this timeout, but I don't think memcached is supposed to have any delays in responding to commands?
Show
Sean Lynch
added a comment - The delete command timed out talking to memcached, after 5 seconds. We can increase this timeout, but I don't think memcached is supposed to have any delays in responding to commands?
Hide
Chiyoung Seo
added a comment -
Sean,
I saw the bunch of the following erlang messages in the log file. Can you tell me why they happened? If you think they are not directly related to the ns server, please reassign the bug to me.
CRASH REPORT <0.11911.0> 2011-01-03 13:47:59
===============================================================================
Crashing process
initial_call {ns_janitor,cleanup,['Argument__1']}
pid <0.11911.0>
registered_name []
error_info
{exit,{{{badmatch,{error,timeout}},
[{mc_client_binary,cmd_binary_vocal_recv,5},
{mc_client_binary,delete_vbucket,2},
{ns_memcached,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
[{'ns_memcached-default','ns_1@10.2.1.101'},
{delete_vbucket,43},
30000]}},
[{gen_server,call,3},
{lists,foreach,2},
{ns_vbm_sup,start_replicas,4},
{lists,foreach,2},
{ns_vbm_sup,'-set_replicas/2-fun-1-',3},
{lists,foreach,2},
{proc_lib,init_p_do_apply,3}]}
ancestors [<0.110.0>,ns_server_sup,ns_server_cluster_sup,<0.60.0>]
messages []
links [<0.110.0>]
dictionary []
trap_exit false
status running
heap_size 17711
stack_size 24
reductions 2384777
INFO REPORT <0.110.0> 2011-01-03 13:47:59
===============================================================================
ns_1@10.2.1.100:ns_orchestrator:174: Janitor run completed for bucket "default" with reason {{{badmatch,
{error,
timeout}},
[{mc_client_binary,
cmd_binary_vocal_recv,
5},
{mc_client_binary,
delete_vbucket,
2},
{ns_memcached,
handle_call,
3},
{gen_server,
handle_msg,
5},
{proc_lib,
init_p_do_apply,
3}]},
{gen_server,
call,
[{'ns_memcached-default',
'ns_1@10.2.1.101'},
{delete_vbucket,
43},
30000]}}
INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100:ns_janitor:193: Killing replicators for vbucket 0 on master 'ns_1@10.2.1.102' because of {{'ns_1@10.2.1.102',
active},
{'ns_1@10.2.1.101',
dead}}
INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100:ns_janitor:193: Killing replicators for vbucket 1 on master 'ns_1@10.2.1.102' because of {{'ns_1@10.2.1.102',
active},
{'ns_1@10.2.1.101',
dead}}
INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100:ns_janitor:193: Killing replicators for vbucket 2 on master 'ns_1@10.2.1.102' because of {{'ns_1@10.2.1.102',
active},
{'ns_1@10.2.1.101',
dead}}
...
I saw the bunch of the following erlang messages in the log file. Can you tell me why they happened? If you think they are not directly related to the ns server, please reassign the bug to me.
CRASH REPORT <0.11911.0> 2011-01-03 13:47:59
===============================================================================
Crashing process
initial_call {ns_janitor,cleanup,['Argument__1']}
pid <0.11911.0>
registered_name []
error_info
{exit,{{{badmatch,{error,timeout}},
[{mc_client_binary,cmd_binary_vocal_recv,5},
{mc_client_binary,delete_vbucket,2},
{ns_memcached,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
[{'ns_memcached-default','ns_1@10.2.1.101'},
{delete_vbucket,43},
30000]}},
[{gen_server,call,3},
{lists,foreach,2},
{ns_vbm_sup,start_replicas,4},
{lists,foreach,2},
{ns_vbm_sup,'-set_replicas/2-fun-1-',3},
{lists,foreach,2},
{proc_lib,init_p_do_apply,3}]}
ancestors [<0.110.0>,ns_server_sup,ns_server_cluster_sup,<0.60.0>]
messages []
links [<0.110.0>]
dictionary []
trap_exit false
status running
heap_size 17711
stack_size 24
reductions 2384777
INFO REPORT <0.110.0> 2011-01-03 13:47:59
===============================================================================
ns_1@10.2.1.100:ns_orchestrator:174: Janitor run completed for bucket "default" with reason {{{badmatch,
{error,
timeout}},
[{mc_client_binary,
cmd_binary_vocal_recv,
5},
{mc_client_binary,
delete_vbucket,
2},
{ns_memcached,
handle_call,
3},
{gen_server,
handle_msg,
5},
{proc_lib,
init_p_do_apply,
3}]},
{gen_server,
call,
[{'ns_memcached-default',
'ns_1@10.2.1.101'},
{delete_vbucket,
43},
30000]}}
INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100:ns_janitor:193: Killing replicators for vbucket 0 on master 'ns_1@10.2.1.102' because of {{'ns_1@10.2.1.102',
active},
{'ns_1@10.2.1.101',
dead}}
INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100:ns_janitor:193: Killing replicators for vbucket 1 on master 'ns_1@10.2.1.102' because of {{'ns_1@10.2.1.102',
active},
{'ns_1@10.2.1.101',
dead}}
INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100:ns_janitor:193: Killing replicators for vbucket 2 on master 'ns_1@10.2.1.102' because of {{'ns_1@10.2.1.102',
active},
{'ns_1@10.2.1.101',
dead}}
...
Show
Chiyoung Seo
added a comment - Sean,
I saw the bunch of the following erlang messages in the log file. Can you tell me why they happened? If you think they are not directly related to the ns server, please reassign the bug to me.
CRASH REPORT <0.11911.0> 2011-01-03 13:47:59
===============================================================================
Crashing process
initial_call {ns_janitor,cleanup,['Argument__1']}
pid <0.11911.0>
registered_name []
error_info
{exit,{{{badmatch,{error,timeout}},
[{mc_client_binary,cmd_binary_vocal_recv,5},
{mc_client_binary,delete_vbucket,2},
{ns_memcached,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
[{'ns_memcached-default',' ns_1@10.2.1.101 '},
{delete_vbucket,43},
30000]}},
[{gen_server,call,3},
{lists,foreach,2},
{ns_vbm_sup,start_replicas,4},
{lists,foreach,2},
{ns_vbm_sup,'-set_replicas/2-fun-1-',3},
{lists,foreach,2},
{proc_lib,init_p_do_apply,3}]}
ancestors [<0.110.0>,ns_server_sup,ns_server_cluster_sup,<0.60.0>]
messages []
links [<0.110.0>]
dictionary []
trap_exit false
status running
heap_size 17711
stack_size 24
reductions 2384777
INFO REPORT <0.110.0> 2011-01-03 13:47:59
===============================================================================
ns_1@10.2.1.100 :ns_orchestrator:174: Janitor run completed for bucket "default" with reason {{{badmatch,
{error,
timeout}},
[{mc_client_binary,
cmd_binary_vocal_recv,
5},
{mc_client_binary,
delete_vbucket,
2},
{ns_memcached,
handle_call,
3},
{gen_server,
handle_msg,
5},
{proc_lib,
init_p_do_apply,
3}]},
{gen_server,
call,
[{'ns_memcached-default',
' ns_1@10.2.1.101 '},
{delete_vbucket,
43},
30000]}}
INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100 :ns_janitor:193: Killing replicators for vbucket 0 on master ' ns_1@10.2.1.102 ' because of {{' ns_1@10.2.1.102 ',
active},
{' ns_1@10.2.1.101 ',
dead}}
INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100 :ns_janitor:193: Killing replicators for vbucket 1 on master ' ns_1@10.2.1.102 ' because of {{' ns_1@10.2.1.102 ',
active},
{' ns_1@10.2.1.101 ',
dead}}
INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100 :ns_janitor:193: Killing replicators for vbucket 2 on master ' ns_1@10.2.1.102 ' because of {{' ns_1@10.2.1.102 ',
active},
{' ns_1@10.2.1.101 ',
dead}}
...
Hide
James Phillips (Inactive)
added a comment -
In addition to the failure to allow the rebalance to start up again, the cluster is now behaving very badly when trying to use it. See attached screen capture of monitoring screen when turning on a client to load data. This client was getting 4kops/sec before this rebalance failure. I've turned off the client, but the servers are still up 10.2.1.100/101/102. Administrator/j4958ph is the UI login. Administrator/northscale!23 is the windows login.
Show
James Phillips (Inactive)
added a comment - In addition to the failure to allow the rebalance to start up again, the cluster is now behaving very badly when trying to use it. See attached screen capture of monitoring screen when turning on a client to load data. This client was getting 4kops/sec before this rebalance failure. I've turned off the client, but the servers are still up 10.2.1.100/101/102. Administrator/j4958ph is the UI login. Administrator/northscale!23 is the windows login.
Show
Sharon Barr
added a comment - Chiyoung, please have an initial analysis.