[MB-7382] rebalance froze when node failed over and added back (observed mem used > high water mark for bucket) Created: 09/Dec/12 Updated: 10/Dec/12 Resolved: 10/Dec/12 |
|
| Status: | Resolved |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket |
| Affects Version/s: | 2.0 |
| Fix Version/s: | 2.0.1 |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Abhinav Dangeti | Assignee: | Abhinav Dangeti |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1976-rel.deb.manifest.xml
12.04 Ubuntu LTS ec2 |
||
| Attachments: |
|
| Description |
|
- 2 node cluster
- 2 buckets. Bucket 'bkt' had a very high percentage of sets in its front end load.
- Failed over ec2-54-252-25-132.ap-southeast-2.compute.amazonaws.com, added it back, and started a rebalance.
- Rebalance froze at around 98%.
- Stopped front end loads; the disk write queue drained.
- Mem used on both nodes was greater than the high water mark.
- Restarted couchbase server, waited for warm up to complete, retried rebalance; rebalance remained frozen at 50%.
- Rebooted nodes, waited for warm up to complete, retried rebalance; rebalance remained frozen at 50%.

Cluster diags:
1 https://s3.amazonaws.com/bugdb/MB-7382/ec2-54-252-25-132.ap-southeast-2.compute.amazonaws.com-8091-diag.txt.gz
2 https://s3.amazonaws.com/bugdb/MB-7382/ec2-54-252-20-171.ap-southeast-2.compute.amazonaws.com-8091-diag.txt.gz

Attached the cbstats 'all' and 'raw memory' output for both nodes. |
| Comments |
| Comment by Chiyoung Seo [ 09/Dec/12 ] |
|
The load rate from clients was too high, which left the cluster heavily overloaded during rebalance. There were large backlogs in the replication queues, which pushed the memory usage of bucket "bkt" above 90% of its bucket quota. When memory usage is above 90% of the bucket quota, replication and vbucket takeover stop. If the cluster is not set up with enough capacity, rebalance issues like this can occur. Please set up the cluster with enough capacity. |
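
The 90% rule described in the comment above can be sketched as a simple check. This is only an illustration under stated assumptions: the function name `replication_would_stall` and the sample numbers are hypothetical, and the 0.90 threshold is taken from the comment, not from the server source.

```python
# Threshold taken from the comment above: replication and vbucket takeover
# stop once memory usage exceeds 90% of the bucket quota.
REPLICATION_THRESHOLD = 0.90


def replication_would_stall(mem_used: int, bucket_quota: int,
                            threshold: float = REPLICATION_THRESHOLD) -> bool:
    """Return True if mem_used exceeds threshold * bucket_quota.

    Hypothetical helper for illustration only; not Couchbase code.
    """
    if bucket_quota <= 0:
        raise ValueError("bucket_quota must be positive")
    return mem_used > threshold * bucket_quota


# Made-up numbers for a 1 GiB bucket quota.
quota = 1024 * 1024 * 1024
print(replication_would_stall(int(0.95 * quota), quota))  # True
print(replication_would_stall(int(0.50 * quota), quota))  # False
```

With values like the first call, a rebalance would be expected to stay frozen until memory usage falls back under the threshold, which matches the behaviour reported in the description.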
| Comment by Chiyoung Seo [ 09/Dec/12 ] |
| Rebalance tests with two nodes wouldn't be good for system tests. All of our customers use at least a three-node cluster. |
| Comment by Abhinav Dangeti [ 09/Dec/12 ] |
| This was not part of a system test. This was the cluster where I was checking the status of deleted items; I just tried failing over one of the nodes and adding it back. |