Rebalancing Problems / Extremely High I/O

Thu, 08/16/2012 - 22:54
osterman

Hi!

We're running a moderately large production Couchbase 1.8.1-938 (x86_64) cluster with 108 GB of RAM (70 GB active set) across 8 identical servers (m1.xlarge). Size on disk is ~150-170 GB. The cluster has been up for 2 weeks, and about 3 days ago we attempted to grow it from 6 nodes to 8 nodes. We've been struggling ever since to bring the 2 new nodes up to date through rebalancing.

I've read over the documentation on rebalancing and understand that there's no way to know how long rebalancing will take, but common sense tells me something is wrong. On one node, it transferred 4 GB to disk in 2 days. The filesystem can definitely handle more than that: we've striped (RAID-0) 4 devices into a single device with XFS.
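To put that figure in perspective, here is a rough back-of-the-envelope sketch of the average write rate implied by 4 GB in 2 days. The binary unit convention and the "even share" target (the low-end on-disk size divided across 8 nodes) are assumptions for illustration only, not numbers reported by Couchbase.

```python
# Figures quoted in the post; binary units (1 GB = 1024**3 bytes) are an assumption.
transferred_bytes = 4 * 1024**3
elapsed_seconds = 2 * 24 * 60 * 60

# Average sustained write rate over the two days, in KiB per second.
rate_kib_per_s = transferred_bytes / elapsed_seconds / 1024
print(f"average write rate: {rate_kib_per_s:.1f} KiB/s")

# Hypothetical target: an even share of the low-end on-disk size (150 GB over 8 nodes).
target_gb = 150 / 8
gb_per_day = 4 / 2
days_for_target = target_gb / gb_per_day
print(f"days to store {target_gb:.2f} GB at this pace: {days_for_target:.1f}")
```

That works out to roughly 24 KiB/s, far below what a 4-device RAID-0 stripe should sustain, which supports the suspicion that the disk is not the bottleneck.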

The other telltale sign that something is wrong is that network traffic is through the roof.

Here is a short snippet from dstat on node3 (it's been up for the full duration):

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  8   5  66  20   0   1|1524k 1350k|   0     0 |  57k   37k|3609  6196 
 17  16  46  20   0   1|2000k   12k|2534k   52M|   0     0 |  15k 7116 
 20  12  46  19   0   3| 784k 6430k|  89M   44M|   0     0 |  27k 9277 
 14  10  53  19   0   4| 812k    0 | 150M 5487k|   0     0 |  31k 9531 
 28   3  47  22   0   1|1268k 6144B|  30M 3358k|   0     0 |  11k 4207 
 12   6  65  16   0   1| 956k    0 |  56M   39M|   0     0 |  23k   10k
 13   4  60  23   0   1|1416k   44k|2388k   44M|   0     0 |  14k 5903 
  1   0  75  24   0   0| 340k 7940k|1321k 6704k|   0     0 |6688  5624 
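For anyone who wants to quantify the bursts rather than eyeball them, here is a minimal sketch that parses the node3 rows above and reports the peak network figures. The helper names are made up for this example, and it assumes dstat's suffixes are 1024-based (B, k, M, G) and that the third `|`-separated field is the net recv/send pair, as in the header shown.

```python
# Assumed dstat unit suffixes (1024-based).
UNITS = {"B": 1, "k": 1024, "M": 1024**2, "G": 1024**3}

def to_bytes(token):
    """Convert a dstat value such as '52M', '6144B' or '0' to bytes."""
    if token[-1] in UNITS:
        return float(token[:-1]) * UNITS[token[-1]]
    return float(token)

def net_recv_send(row):
    """Return (recv, send) in bytes from one dstat data row."""
    net_field = row.split("|")[2]  # fields: cpu | dsk | net | paging | system
    recv, send = net_field.split()
    return to_bytes(recv), to_bytes(send)

# Data rows copied from the node3 sample in the post.
NODE3_ROWS = [
    "  8   5  66  20   0   1|1524k 1350k|   0     0 |  57k   37k|3609  6196",
    " 17  16  46  20   0   1|2000k   12k|2534k   52M|   0     0 |  15k 7116",
    " 20  12  46  19   0   3| 784k 6430k|  89M   44M|   0     0 |  27k 9277",
    " 14  10  53  19   0   4| 812k    0 | 150M 5487k|   0     0 |  31k 9531",
    " 28   3  47  22   0   1|1268k 6144B|  30M 3358k|   0     0 |  11k 4207",
    " 12   6  65  16   0   1| 956k    0 |  56M   39M|   0     0 |  23k   10k",
    " 13   4  60  23   0   1|1416k   44k|2388k   44M|   0     0 |  14k 5903",
    "  1   0  75  24   0   0| 340k 7940k|1321k 6704k|   0     0 |6688  5624",
]

samples = [net_recv_send(row) for row in NODE3_ROWS]
peak_recv = max(recv for recv, _ in samples)
peak_send = max(send for _, send in samples)
print(f"peak recv: {peak_recv / 1024**2:.0f} MiB per sample")
print(f"peak send: {peak_send / 1024**2:.0f} MiB per sample")
```

The same functions can be pointed at the node8 rows for a side-by-side comparison.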

Here's dstat from node8 (it's been up for 3 days). We don't see the outrageous bursts on this machine, but it has only stored 4 GB of data.

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  1   1  99   0   0   0|1020B  268k|   0     0 |   0     0 |2933  1805 
  0   1  99   0   0   0|   0     0 |1647k 1605k|   0     0 |3620  2122 
  0   0  99   0   0   0|   0   104k|2062k 1958k|   0     0 |4025  2386 
  0   0 100   0   0   0|   0   196k| 812k  809k|   0     0 |2737  1579 
  1   0  99   0   0   0|   0    43k|1750k 1545k|   0     0 |3278  2062 
  0   1  99   0   0   0|   0   253k|1041k  932k|   0     0 |2711  1652 
  0   0 100   0   0   0|   0   354k| 868k  744k|   0     0 |2391  1510 
  2   1  96   2   0   0|   0  2429k|1025k 1037k|   0     0 |3185  2405 
  0   0 100   0   0   0|   0    71k|1510k 1368k|   0     0 |2847  1733 
  1   0  99   0   0   0|   0   183k|1199k 1202k|   0     0 |2897  1637 
  0   0 100   0   0   0|   0    91k| 739k  738k|   0     0 |2625  1650 

These crazy traffic bursts only started happening after we kicked off a rebalance, and they continue even after stopping the operation. If we restart one of the couchbase-servers, the traffic immediately subsides and returns to normal levels of around 2-4M/s. We hate doing that, though, because it seems to take forever before the restarted server goes from "Pend" to "OK".

The TAP backfill queue has come down since we stopped rebalancing. It was high (around 400k) and is now around 280k, but it has barely budged over a day. Our last node (node 8) has 203K replica items and zero active items despite being "OK". We have 10.5M items total in cache.
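As a rough illustration of how far behind node 8 is, the sketch below compares its item counts with what an even distribution would give. It assumes the 10.5M figure is the cluster-wide active item count and that items would spread evenly over all 8 nodes once rebalancing completes; both are assumptions, not something the stats confirm.

```python
# Counts quoted in the post.
total_items = 10_500_000
nodes = 8
node8_replica_items = 203_000
node8_active_items = 0

# Assumption: a finished rebalance spreads active items evenly over all nodes.
expected_active_per_node = total_items / nodes
replica_share = node8_replica_items / expected_active_per_node

print(f"expected active items per node: {expected_active_per_node:,.0f}")
print(f"node 8 active items: {node8_active_items}")
print(f"node 8 replica items vs. an even active share: {replica_share:.1%}")
```

Under those assumptions node 8 would eventually hold about 1.3M active items, so zero active items and roughly 15% of that number in replicas suggests the rebalance stalled early.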

Any tips on what I should try next or look for?

I'd be happy to share a diag dump.

Regards,

Erik Osterman

Fri, 08/17/2012 - 00:15
osterman

Here are some screenshots of our cluster stats:

http://imgur.com/a/SoiG0
