Rebalancing Problems / Extremely High I/O
Hi!
We're running a moderately large production Couchbase 1.8.1-938 (x86_64) cluster with 108 GB of RAM (70 GB active set) across 8 identical servers (m1.xlarge). Size on disk is ~150-170 GB. The cluster has been up for 2 weeks, and about 3 days ago we attempted to grow it from 6 nodes to 8 nodes. We've been struggling ever since to get the 2 new nodes up to date through rebalancing.
I've read over the documentation on rebalancing and understand that there's no way to know how long rebalancing will take, but common sense is telling me something is wrong. On one node, it transferred 4 GB to disk in 2 days. The filesystem can definitely handle more than that. We've striped (RAID-0) 4 devices into a single device with XFS.
The other telltale sign that something is wrong is that network traffic is through the roof.
Here is a short snippet from dstat on node3 (it's been up for the full duration):
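To put a number on that, here's a quick back-of-the-envelope sketch in Python. It only uses the figures above (4 GB written in 2 days). The even-split target is an assumption for illustration: it takes the low end of the on-disk size (~150 GB) and pretends each of the 8 nodes should end up with one eighth of it, ignoring how replicas are actually placed.

```python
# Rough rebalance throughput estimate from the numbers quoted above.
GIB = 1024 ** 3

transferred_bytes = 4 * GIB        # data written on the new node so far
elapsed_seconds = 2 * 24 * 3600    # 2 days

rate_bytes_per_sec = transferred_bytes / elapsed_seconds
rate_kib_per_sec = rate_bytes_per_sec / 1024

# Assumption (not a measured value): ~150 GB on disk spread evenly over 8 nodes.
target_bytes_per_node = 150 * GIB / 8
days_to_fill = target_bytes_per_node / rate_bytes_per_sec / 86400

print(f"average rate: {rate_kib_per_sec:.1f} KiB/s")
print(f"days to reach an even share: {days_to_fill:.1f}")
```

That works out to roughly 24 KiB/s on average, and a bit over 9 days for the node to reach an even share at that pace, which is why this looks wrong to me.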
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  8   5  66  20   0   1|1524k 1350k|   0     0 |  57k   37k| 3609  6196
 17  16  46  20   0   1|2000k   12k|2534k   52M|   0     0 |  15k  7116
 20  12  46  19   0   3| 784k 6430k|  89M   44M|   0     0 |  27k  9277
 14  10  53  19   0   4| 812k    0 | 150M 5487k|   0     0 |  31k  9531
 28   3  47  22   0   1|1268k 6144B|  30M 3358k|   0     0 |  11k  4207
 12   6  65  16   0   1| 956k    0 |  56M   39M|   0     0 |  23k   10k
 13   4  60  23   0   1|1416k   44k|2388k   44M|   0     0 |  14k  5903
  1   0  75  24   0   0| 340k 7940k|1321k 6704k|   0     0 | 6688  5624
Here's dstat from node8 (it's been up for 3 days). We don't see the outrageous bursts on this machine, but it has only stored 4 GB of data.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   1  99   0   0   0|1020B  268k|   0     0 |   0     0 | 2933  1805
  0   1  99   0   0   0|   0     0 |1647k 1605k|   0     0 | 3620  2122
  0   0  99   0   0   0|   0   104k|2062k 1958k|   0     0 | 4025  2386
  0   0 100   0   0   0|   0   196k| 812k  809k|   0     0 | 2737  1579
  1   0  99   0   0   0|   0    43k|1750k 1545k|   0     0 | 3278  2062
  0   1  99   0   0   0|   0   253k|1041k  932k|   0     0 | 2711  1652
  0   0 100   0   0   0|   0   354k| 868k  744k|   0     0 | 2391  1510
  2   1  96   2   0   0|   0  2429k|1025k 1037k|   0     0 | 3185  2405
  0   0 100   0   0   0|   0    71k|1510k 1368k|   0     0 | 2847  1733
  1   0  99   0   0   0|   0   183k|1199k 1202k|   0     0 | 2897  1637
  0   0 100   0   0   0|   0    91k| 739k  738k|   0     0 | 2625  1650
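For a rough side-by-side of the two samples, here's a small Python sketch that converts dstat's suffixed values to bytes and compares the peak of the net "send" column on each node. The helper name `to_bytes` is made up for illustration, and I'm assuming the B/k/M suffixes are powers of 1024; the values are copied from the two outputs above.

```python
# Convert dstat-style values ("52M", "5487k", "6144B", "0") to bytes.
# Assumption: suffixes are binary multiples (k = 1024, M = 1024**2).
UNITS = {"B": 1, "k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def to_bytes(value: str) -> int:
    value = value.strip()
    if value[-1] in UNITS:
        return int(float(value[:-1]) * UNITS[value[-1]])
    return int(value)

# net "send" column, copied from the node3 and node8 samples above
node3_send = ["0", "52M", "44M", "5487k", "3358k", "39M", "44M", "6704k"]
node8_send = ["0", "1605k", "1958k", "809k", "1545k", "932k",
              "744k", "1037k", "1368k", "1202k", "738k"]

peak3 = max(to_bytes(v) for v in node3_send)
peak8 = max(to_bytes(v) for v in node8_send)

print(f"node3 peak send per sample: {peak3 / 1024 ** 2:.1f} MiB")
print(f"node8 peak send per sample: {peak8 / 1024 ** 2:.1f} MiB")
print(f"ratio: {peak3 / peak8:.0f}x")
```

On these samples the node3 peak is about 27 times the node8 peak, which matches what we see by eye.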
These crazy traffic bursts only started happening after we kicked off a rebalance, and they continue even after we stop the operation. If we restart one of the couchbase-servers, the traffic immediately subsides and returns to normal levels of around 2-4M/s. We hate doing that though, because it seems to take forever before the restarted server goes from "Pend" to "OK".
Checking the TAP backfill queues shows they have come down since we stopped rebalancing. The backlog was high (around 400k) and is now around 280k, but it has barely budged over a day. Our last node (node 8) has 203K replica items and zero active items, despite being "OK". We have 10.5M items total in cache.
Any tips on what I should try next or look for?
I'd be happy to share a diag dump.
Regards,
Erik Osterman
Here are some screenshots of our cluster stats:
http://imgur.com/a/SoiG0