LOOOOONG rebalance - w/errors

Thu, 06/28/2012 - 14:19
ckilborn

I am going from 5 nodes to 9 - 4GB node size

7 of the nodes have SSD

Version: 1.8.0 community edition (build-55)

77M items

Disk usage 66.9 GB

RAM 18GB/36GB

CentOS 5, 64-bit

The rebalance just hangs: the progress bar for all nodes does not move.

I see this error when using cbbrowse_logs (it shows up for multiple nodes):

memcached<0.445.0>: Suspend eq_tapq:replication_ns_1@10.10.10.30 for 5.00 secs

Thu, 06/28/2012 - 16:44
katana

From your description of the symptoms, it appears that:
- Periodically, one or more of the nodes acting as a replication destination hits the 1 million item mark in its Disk Write Queue. When this happens, the node sends a backoff to the rest of the cluster so that it can drain the queue before resuming its part in the rebalance operation.
- Your resident ratio (the share of items held in memory versus on disk) is low. Rebalancing when far more data sits on disk than in memory is slow, because items have to be progressively read from disk into memory before they can be shuffled around for replication or rebalance. It would be useful to track the resident ratio (at the cluster level as well as on individual nodes) during a rebalance and see whether that number drops rapidly.
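
For reference, the resident ratio is simply the number of items resident in memory divided by the total number of items. Below is a minimal sketch of that arithmetic, assuming you read the two counts off the bucket statistics yourself. The function name resident_ratio is made up for illustration and is not a Couchbase tool.

```shell
#!/bin/sh
# resident_ratio RESIDENT_ITEMS TOTAL_ITEMS
# Hypothetical helper: prints the percentage of items held in RAM,
# rounded to one decimal place, or "n/a" when the total is zero.
resident_ratio() {
    awk -v r="$1" -v t="$2" 'BEGIN {
        if (t <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * r / t
    }'
}

# Usage: resident_ratio <items in memory> <total items>
```

Running it once per node before and during the rebalance would show whether the number is falling quickly.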

At the root of all this could also be fragmentation at the disk persistence layer (the sqlite files), which usually results in slow disk reads and writes and consequently a loooooooooong rebalance.

I'm sure you are asking: so what's the way out? It depends:

- If you can tolerate full downtime, bring all the nodes down, vacuum the sqlite files, and bring the nodes back up. Then add the additional nodes and rebalance. This should yield the best result.
- If you can only manage partial downtime, take a single node or a couple of nodes out, vacuum their files, and add them back to the cluster. In both this and the above scenario, the affected data will not be available to clients during the downtime.
- A third option is to wait for the next release of Couchbase, which has the capability to do a contained rebalance operation.

In the longer run, you may want to look at improving the resident ratio. Hope this helps.

Thu, 06/28/2012 - 16:52
ckilborn

Thanks for the response...

I don't see any "Disk Write Queue" in the GUI.

Adding more nodes, and therefore more memory, should improve the resident ratio, right?

How do I perform a vacuum of the sqlite files?

Just to be sure, I should:

1. Stop this rebalance
2. Remove the 4 new nodes (there isn't much on them)
3. Rebalance
4. Shut down the existing nodes
5. Vacuum the sqlite files
6. Bring up the nodes (I won't lose any data, right? :-) )
7. Add the new nodes
8. Rebalance

Look good?

Thu, 06/28/2012 - 22:34
katana

Go to Monitor -> Server Nodes, click on the node in question, and then expand the "Summary" category. Disk write queue will be one of the counters towards the lower right-hand corner. You can move your cursor to the counter and then click on the blue arrow to view the counter by server.

Yes, adding more nodes increases the overall RAM capacity of the cluster, which will definitely improve your resident ratio.

The data files are normally stored under /var/lib/Couchbase/data/<bucket_name>-data/, where <bucket_name> is the name of the bucket that you are going to vacuum. The files under this directory will be called <bucket_name>-0.mb, <bucket_name>-1.mb, <bucket_name>-2.mb, and <bucket_name>-3.mb.

To vacuum the files, run the following commands as the 'Couchbase' user to defragment each of the sqlite data files:

<Couchbase_install_location>/bin/sqlite3 <couchbase dir>/var/lib/Couchbase/data/<bucket_name>-data/<bucket_name>-0.mb 'VACUUM;'
<Couchbase_install_location>/bin/sqlite3 <couchbase dir>/var/lib/Couchbase/data/<bucket_name>-data/<bucket_name>-1.mb 'VACUUM;'
<Couchbase_install_location>/bin/sqlite3 <couchbase dir>/var/lib/Couchbase/data/<bucket_name>-data/<bucket_name>-2.mb 'VACUUM;'
<Couchbase_install_location>/bin/sqlite3 <couchbase dir>/var/lib/Couchbase/data/<bucket_name>-data/<bucket_name>-3.mb 'VACUUM;'
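
If you would rather not type one command per file, the same thing can be written as a loop. This is only a sketch: the function name vacuum_bucket and its three arguments are made up for illustration, the paths are the same placeholders as in the commands above, and it assumes the node is shut down as in the downtime options discussed earlier.

```shell
#!/bin/sh
# vacuum_bucket SQLITE_BIN DATA_DIR BUCKET
# Hypothetical helper: runs VACUUM on every <bucket>-N.mb file in DATA_DIR
# and stops at the first file that fails.
vacuum_bucket() {
    sqlite_bin=$1
    data_dir=$2
    bucket=$3
    for f in "$data_dir/$bucket"-[0-9]*.mb; do
        # If the glob matched nothing, the unexpanded pattern is returned; skip it.
        [ -f "$f" ] || continue
        echo "Vacuuming $f"
        "$sqlite_bin" "$f" 'VACUUM;' || return 1
    done
}

# Example, run as the 'Couchbase' user (placeholders left for you to fill in):
# vacuum_bucket <Couchbase_install_location>/bin/sqlite3 \
#     <couchbase dir>/var/lib/Couchbase/data/<bucket_name>-data <bucket_name>
```

Stopping at the first failure is deliberate, so that a problem such as a full disk is noticed before the remaining files are touched.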

As to the sequence of steps you've outlined, that looks more or less right.

In step 2, where you say "Remove the 4 new nodes", I assume you are referring to doing a "Remove Server" from the console. That would be the preferred method. The next best would be "Fail over". Just shutting down the nodes or performing a kill -9 may not be a good idea, however little data they hold.

Just a note of caution: the vacuum process needs 2x the disk space to create temp files etc., so make sure you have enough free disk space before you run the vacuum.
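
A quick way to check this before starting is to compare the size of the data directory with the free space on the same filesystem. The sketch below reads "2x" as meaning the free space should be at least as large as the data files themselves, which is an assumption. The function names are made up for illustration.

```shell
#!/bin/sh
# has_vacuum_headroom DATA_KB FREE_KB
# Hypothetical helper: succeeds when free space is at least the current size
# of the data files, so the disk can hold roughly 2x the data during VACUUM.
has_vacuum_headroom() {
    [ "$2" -ge "$1" ]
}

# check_vacuum_space DATA_DIR
# Measures the directory with du and the free space with df (both in KB).
check_vacuum_space() {
    data_kb=$(du -sk "$1" | awk '{print $1}')
    free_kb=$(df -Pk "$1" | awk 'NR == 2 {print $4}')
    if has_vacuum_headroom "$data_kb" "$free_kb"; then
        echo "OK: ${free_kb} KB free, data uses ${data_kb} KB"
    else
        echo "NOT ENOUGH: ${free_kb} KB free, data uses ${data_kb} KB"
        return 1
    fi
}

# Example (placeholder path as in the vacuum commands):
# check_vacuum_space <couchbase dir>/var/lib/Couchbase/data/<bucket_name>-data
```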
