Continuous Rebalance Failure, Memcached Taking Very High CPU

Below are the steps I followed:

1) Install Couchbase 2.1 on machine A
2) Create 4 buckets
3) Install Couchbase 2.1 on machine B
4) Add the Couchbase node on machine B to the cluster

Initially the rebalance runs fine, but after some time the memcached CPU usage on one of the machines becomes very high.
Partway through, the rebalance fails with the log below:

"Rebalance exited with reason {unexpected_exit,

-> If I retry the rebalance, it fails again and again.
-> If we run netstat on the memcached process, it shows 600+ connections to the beam.smp process, and most of them are in the CLOSE_WAIT state.
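
The per-state connection counts mentioned above can be tallied with a short script. This is only a minimal sketch, assuming Linux `netstat -tanp` output (seven whitespace-separated fields per TCP line, the last being "PID/Program name"). The function name `count_tcp_states` and the sample lines are hypothetical and are not taken from the affected machines.

```python
from collections import Counter


def count_tcp_states(netstat_output, program="memcached"):
    """Count TCP connection states for one program.

    Assumes Linux `netstat -tanp` formatting, where each TCP line has
    7 whitespace-separated fields and the last is "PID/Program name".
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP socket line.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        state, owner = fields[5], fields[6]
        # owner looks like "6112/memcached"; "-" means the owner is unknown.
        if owner.partition("/")[2] == program:
            states[state] += 1
    return states


# Hypothetical sample lines, for illustration only.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address    Foreign Address  State       PID/Program name
tcp        0      0 0.0.0.0:11210    0.0.0.0:*        LISTEN      6112/memcached
tcp        1      0 127.0.0.1:11210  127.0.0.1:40001  CLOSE_WAIT  6112/memcached
tcp        1      0 127.0.0.1:11210  127.0.0.1:40002  CLOSE_WAIT  6112/memcached
tcp        0      0 127.0.0.1:40003  127.0.0.1:11210  ESTABLISHED 6200/beam.smp
"""

if __name__ == "__main__":
    print(count_tcp_states(SAMPLE))
```

On a live node, the same function could be fed the text captured from `netstat -tanp` (run as root so the program column is filled in). A CLOSE_WAIT count that keeps growing between rebalance attempts would suggest that the process is not closing sockets after the peer has already closed its end.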

CPU usage of memcached (top command):
6112 couchbas 20 0 752m 214m 3480 S 405.3 2.8 11712:14 memcached

Hardware/OS Details
OS: CentOS 6.2 on both machines
CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz (8 cores) on both machines
RAM: 8GB on both machines

What can cause this rebalance failure? Please let us know the probable cause.


1 Answer

Rebalance is very disk I/O heavy, especially on the node being added.
In the Web Admin GUI, go to the "SERVER NODES" tab, find the server that is being added, and click the blue arrow. It should tell you which bucket is currently being rebalanced. From there, go to the "DATA BUCKETS" tab => Disk Queues => Average Age Active. That tells you how many seconds it takes to write your items to disk. Click "show by server" to see the time for each server. Is the time for the server you are adding the same as for the others?
What is your swappiness (vm.swappiness) set to? The default is 60.
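
If it helps, the current value can be read from `/proc/sys/vm/swappiness` on Linux. Here is a minimal sketch; the helper name `read_swappiness` is hypothetical, and the `path` parameter is only there so the function can be pointed at a different file.

```python
import os

SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the vm.swappiness value stored in `path` as an integer."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    # The /proc entry only exists on Linux, so check before reading it.
    if os.path.exists(SWAPPINESS_PATH):
        print("vm.swappiness =", read_swappiness())
    else:
        print("vm.swappiness is not available on this system")
```

The value is normally changed as root with `sysctl -w vm.swappiness=<value>`, which lasts until the next reboot unless it is also set in `/etc/sysctl.conf`.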

I could not find the statistic you asked about in the GUI.

Swappiness was set to 60. I changed it to 0, but that did not help either.
After failing continuously, the memcached process is in a bad state. It uses a lot of CPU (>600%)
and has a lot of TCP connections: 800+ on one node.

Is there any workaround or fix for this issue? Rebalancing empty buckets causes it, and the rebalance also takes a long time.
Does this depend on the network setup or the machine hardware? We are seeing this issue even with two machines that have identical hardware configurations.