[MB-3255] Rebalance fails to Restart, after stopping a rebalance operation near completion Created: 03/Jan/11  Updated: 31/May/11  Resolved: 03/May/11

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.6.5
Fix Version/s: 1.7 beta
Security Level: Public

Type: Bug Priority: Major
Reporter: James Phillips (Inactive) Assignee: Thuan Nguyen
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: 8h
Time Spent: Not Specified
Original Estimate: 8h
Environment: Win 64

Attachments: GIF File Capture.GIF     Text File ns-diag-20110103134811.txt     Zip Archive ns-diag-20110105101559.zip    

Rebalancing from a 2 to a 3 node cluster. Started rebalance and allowed it to run for quite a while. Stopped the rebalance process. Restarted the process and allowed it to run until it was very, very close to completion. Stopped it a second time. Now when attempting to restart to finish up, getting multiple errors in logs. Diagnostics attached.

Comment by Sharon Barr (Inactive) [ 03/Jan/11 ]
Chiyoung, please have an initial analysis.
Comment by James Phillips (Inactive) [ 03/Jan/11 ]
In addition to the failure to allow the rebalance to start up again, the cluster is now behaving very badly when trying to use it. See attached screen capture of monitoring screen when turning on a client to load data. This client was getting 4kops/sec before this rebalance failure. I've turned off the client, but the servers are still up Administrator/j4958ph is the UI login. Administrator/northscale!23 is the windows login.
Comment by Chiyoung Seo [ 03/Jan/11 ]

I saw the bunch of the following erlang messages in the log file. Can you tell me why they happened? If you think they are not directly related to the ns server, please reassign the bug to me.

CRASH REPORT <0.11911.0> 2011-01-03 13:47:59
Crashing process
   initial_call {ns_janitor,cleanup,['Argument__1']}
   pid <0.11911.0>
   registered_name []
   ancestors [<0.110.0>,ns_server_sup,ns_server_cluster_sup,<0.60.0>]
   messages []
   links [<0.110.0>]
   dictionary []
   trap_exit false
   status running
   heap_size 17711
   stack_size 24
   reductions 2384777

INFO REPORT <0.110.0> 2011-01-03 13:47:59

ns_1@ Janitor run completed for bucket "default" with reason {{{badmatch,

INFO REPORT <0.11992.0> 2011-01-03 13:48:00

ns_1@ Killing replicators for vbucket 0 on master 'ns_1@' because of {{'ns_1@',

INFO REPORT <0.11992.0> 2011-01-03 13:48:00

ns_1@ Killing replicators for vbucket 1 on master 'ns_1@' because of {{'ns_1@',

INFO REPORT <0.11992.0> 2011-01-03 13:48:00

ns_1@ Killing replicators for vbucket 2 on master 'ns_1@' because of {{'ns_1@',
Comment by Sean Lynch (Inactive) [ 04/Jan/11 ]
The delete command timed out talking to memcached, after 5 seconds. We can increase this timeout, but I don't think memcached is supposed to have any delays in responding to commands?
Comment by James Phillips (Inactive) [ 05/Jan/11 ]
more information...

As noted above, I left the cluster alone for the last couple days in case someone wanted to take a look. Today I came in and looked at it and there were a bunch of vbucket migrator errors being continuously reported in the log. I started a rebalance again. This time it seemed to start, but then there were more errors. See attached diag.
Comment by Chiyoung Seo [ 07/Jan/11 ]
It seems to me that this rebalance failure is caused by 1) quite slow vbucket deletion in memory hashtable and 2) frequent TAP ack timeout.

All of vbucket migrator errors were caused by TAP ack timeout events that consequently killed vbucket migrator processes. I observed that this frequent ack timeout issue also caused the slow replication problem in 1.6.5. Trond is current working on resolving this timeout issue.

Regarding the slow vbucket deletion in memory hashtable, I couldn't reproduce it on my Linux testing machines, but the timing stat from one of James's Windows machines shows that deleting a single vbucket from memory hashtable sometimes takes more than 2 sec:

 del_vb_cmd (1415 total)
    0 - 1us : ( 46.08%) 652 #########################################
    8ms - 16ms : ( 50.32%) 60 ###
    16ms - 32ms : ( 51.59%) 18 #
    32ms - 65ms : ( 53.99%) 34 ##
    65ms - 131ms : ( 56.68%) 38 ##
    131ms - 262ms : ( 62.26%) 79 #####
    262ms - 524ms : ( 70.60%) 118 #######
    524ms - 1s : ( 78.37%) 110 ######
    1s - 2s : ( 91.87%) 191 ############
    2s - 4s : ( 98.09%) 88 #####
    4s - 8s : ( 99.93%) 26 #
    8s - 16s : (100.00%) 1

As Sean mentioned, this is the reason why the ns server got many vbucket deletion timeout events from the membase server. Note that vbucket deletion in disk is scheduled and executed asynchronously. I'm currently investigating this issue...
Comment by Chiyoung Seo [ 18/Jan/11 ]
We reduced the time spent by vbucket deletion from memory hashtable by avoiding the hashtable visit during vbucket deletion (http://review.membase.org/4251).

We observed that the aggregated execution time of all steps performed in a single vbucket deletion on Windows is always less than 2 sec, but the wall clock execution time of a single vbucket deletion sometimes takes more than 5 sec under heavy CPU and memory usage, but much less frequently with the above fix . It seems to me that this execution time delay is related to the Windows scheduler.

If a rebalance fails in this scenario, simply restarting it resolves the failure.

We will investigate the above scheduling-related issue again soon.
Comment by Sharon Barr (Inactive) [ 31/Mar/11 ]
Please try and reproduce this on 1.7 and close this bug if it is not reproducible.
Comment by Thuan Nguyen [ 03/May/11 ]
Tested on build basestar-191-g79a0550 on ubuntu 10.04 64bit and repeat step by step as in bug report. I can not reproduce this bug. So I think this issue is fixed. I will close this bug.
Generated at Wed Nov 26 18:54:33 CST 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.