[MB-7382] rebalance froze when node failed over and added back (observed mem used > high water mark for bucket) Created: 09/Dec/12  Updated: 29/May/13  Resolved: 10/Dec/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0
Fix Version/s: 2.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Abhinav Dangeti Assignee: Abhinav Dangeti
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1976-rel.deb.manifest.xml
12.04 Ubuntu LTS ec2

Attachments: Text File ec2-54-252-20-171.ap-southeast-2.compute.amazonaws.com.txt     Text File ec2-54-252-25-132.ap-southeast-2.compute.amazonaws.com.txt    

- 2 node cluster
- 2 buckets
- Bucket 'bkt' had a very high percentage of sets in its front end load.
- Failed over ec2-54-252-25-132.ap-southeast-2.compute.amazonaws.com, added it back, and started a rebalance.
- Rebalance froze at around 98%.
- Stopped the front-end loads; the disk write queue drained.
- Memory used on both nodes was greater than the high water mark.
- Restarted Couchbase Server, waited for warmup to complete, and retried the rebalance; it remained frozen at 50%.
- Rebooted the nodes, waited for warmup to complete, and retried the rebalance; it remained frozen at 50%.

Cluster diags:
1 https://s3.amazonaws.com/bugdb/MB-7382/ec2-54-252-25-132.ap-southeast-2.compute.amazonaws.com-8091-diag.txt.gz

2 https://s3.amazonaws.com/bugdb/MB-7382/ec2-54-252-20-171.ap-southeast-2.compute.amazonaws.com-8091-diag.txt.gz

Attached the cbstats "all" and "raw memory" output for both nodes.
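
For anyone reading the attached stats, the "mem used > high water mark" observation can be checked with a small script. This is only a sketch: it assumes the dump is plain "key: value" lines and that the relevant stat names are mem_used and ep_mem_high_wat. The sample values are made up for illustration and are not taken from the attachments.

```python
def parse_cbstats(text):
    """Parse 'key: value' lines into a dict; non-numeric values stay as strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'key: value'
        key, value = key.strip(), value.strip()
        if not key or not value:
            continue
        try:
            stats[key] = int(value)
        except ValueError:
            stats[key] = value
    return stats


def above_high_water_mark(stats):
    """True when mem_used exceeds ep_mem_high_wat (assumed stat names)."""
    return stats["mem_used"] > stats["ep_mem_high_wat"]


# Hypothetical sample, NOT real data from this ticket.
SAMPLE = """
 ep_max_data_size:   1000
 ep_mem_high_wat:    750
 ep_mem_low_wat:     600
 mem_used:           900
"""

if __name__ == "__main__":
    stats = parse_cbstats(SAMPLE)
    print("above high water mark:", above_high_water_mark(stats))
```

Running the same check against each node's dump would show whether both nodes are in the state described in the steps above.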

Comment by Chiyoung Seo [ 09/Dec/12 ]
The load rate from clients was too high, which left the cluster heavily overloaded during rebalance. Large backlogs built up in the replication queues, which pushed memory usage for the bucket "bkt" above 90% of its bucket quota. If memory usage is above 90% of the bucket quota, replication and vbucket takeover stop.

If the cluster is not set up with enough capacity, rebalance issues like this can occur.

Please set up the cluster with enough capacity.
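
As a rough illustration of the 90% rule described above, the check below flags a bucket whose memory usage would stall replication or vbucket takeover. It is a sketch under stated assumptions: the function names are hypothetical, the inputs are byte counts supplied by the caller, and the only fact taken from this ticket is the 90% figure. It is not the server's actual implementation.

```python
# Threshold from the comment above: "above 90% of bucket quota".
STALL_PERCENT = 90


def replication_would_stall(mem_used, bucket_quota, percent=STALL_PERCENT):
    """Return True if mem_used is strictly above percent% of bucket_quota.

    Integer arithmetic is used to avoid floating-point rounding at the boundary.
    """
    if bucket_quota <= 0:
        raise ValueError("bucket_quota must be positive")
    return mem_used * 100 > bucket_quota * percent


def headroom_bytes(mem_used, bucket_quota, percent=STALL_PERCENT):
    """Bytes left before the threshold is reached (0 if already at or past it)."""
    limit = bucket_quota * percent // 100
    return max(0, limit - mem_used)


if __name__ == "__main__":
    quota = 1024 * 1024 * 1024  # hypothetical 1 GiB bucket quota
    used = 980 * 1024 * 1024    # hypothetical current usage
    print("would stall:", replication_would_stall(used, quota))
    print("headroom bytes:", headroom_bytes(used, quota))
```

A check like this, run before starting a rebalance, would give an early hint that the cluster lacks the capacity to complete it.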
Comment by Chiyoung Seo [ 09/Dec/12 ]
Rebalance tests with two nodes are not a good fit for system tests. All of our customers use clusters of at least three nodes.
Comment by Abhinav Dangeti [ 09/Dec/12 ]
This was not part of a system test. It was the cluster where I was checking the status of deleted items; I just tried failing over one of the nodes and adding it back.
Comment by Abhinav Dangeti [ 29/May/13 ]
Closing for now; will reopen if needed or if this is seen again.
Generated at Thu Dec 18 03:36:52 CST 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.