[MB-6649] beam.smp memory usage grows to 2 GB when xdcr feature is enabled and rebalancing is in progress Created: 13/Sep/12 Updated: 26/Sep/12 Resolved: 19/Sep/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | cross-datacenter-replication |
| Affects Version/s: | 2.0-beta-2 |
| Fix Version/s: | 2.0-beta-2 |
| Security Level: | Public |
| Type: | Bug | Priority: | Critical |
| Reporter: | Abhinav Dangeti | Assignee: | Abhinav Dangeti |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | 2.0-beta-release-notes | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
2.0.0-1721-rel
Centos 4G RAM 64-bit machines 1024 vbuckets |
||
| Attachments: |
|
| Description |
|
- Created default buckets on a 2:2 cluster
[10.1.3.235, 10.1.3.236] : [10.1.3.237, 10.1.3.238]
- Set up bidirectional replication for the bucket, ran load on both the buckets.
- Swap rebalanced a node on both clusters
[10.1.3.235, 10.3.2.54] : [10.1.3.237, 10.3.2.55]
- Upon completion of rebalance, stopped load on default buckets.
- Created standard buckets on both the clusters.
- Set up unidirectional replication for the standard bucket from cluster 1 to cluster 2, ran load on cluster 1.
- Stopped load after a point.
- With replication still going on, rebalanced in the removed nodes on each cluster (to make it 3:3)
[10.1.3.235, 10.3.2.54, 10.1.3.236] : [10.1.3.237, 10.3.2.55, 10.1.3.238]
- During rebalance, no load is running on either cluster; however, replication is still going on.
- Heavy swap on the orchestrators of both the clusters
- Erlang using up a lot of memory ( > 2.5G )
- Rebalance gradually completed on cluster 2.
- Rebalance fails on cluster 1:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rebalance exited with reason {{badmatch,{error,timeout}},
{gen_server,call,
[{'ns_memcached-bucket','ns_1@10.1.3.235'},
{get_vbucket,835},
60000]}}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- If a re-rebalance is tried, rebalance fails again:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rebalance exited with reason {not_all_nodes_are_ready_yet,['ns_1@10.1.3.235']}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- This is probably because 10.1.3.235 is in "PEND" state on the UI.
- Uploading grabbed diags onto s3.
https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.235-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.236-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.3.2.54-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.237-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.238-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.3.2.55-8091-diag.txt.gz |
| Comments |
| Comment by Junyi Xie [ 13/Sep/12 ] |
|
It is good for QE to aggressively catch bugs, but IMHO it would be nice to investigate and test a little bit before simply dumping bugs to developers :-).
For example, in this case, is the heavy memory and swap usage caused by XDCR or by rebalance? It is pretty easy to verify: just re-test without the rebalance to see if the issue persists. Also, it is known that XDCR consumes resources at the destination (there are already a few bugs filed about it), so is this bug a duplicate of previous ones? |
| Comment by Abhinav Dangeti [ 13/Sep/12 ] |
|
This issue isn't seen with XDCR alone, or for that matter rebalance alone.
Verified that with just rebalance: memory usage is not high.
Verified that with just replication: memory usage is not high.
It's the combination of the two that sometimes causes the heavy memory usage and swap.
Also observed:
- Rebalance failed on cluster 1.
- After rebalance finished on cluster 2, beam.smp's resident memory usage is still at 1.1G (replication is still going on) |
| Comment by Junyi Xie [ 18/Sep/12 ] |
| Can you please verify with latest build? |
| Comment by Abhinav Dangeti [ 19/Sep/12 ] |
| Tried reproducing the same scenario on build 1744: beam.smp had a memory usage of between 300 and 500MB, and rebalance completed successfully on both clusters. |
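
The beam.smp memory figures quoted in the comments above (1.1G resident, 300 to 500MB) could be sampled on each node with a small script. The following is a minimal sketch, not the method used in this ticket: it assumes a Linux host with /proc mounted and that the Erlang VM appears under the process name "beam.smp"; the function names `parse_vm_rss_kb` and `rss_by_process_name` are hypothetical helpers introduced for illustration.

```python
import os
import re


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def rss_by_process_name(name, proc_root="/proc"):
    """Map pid -> resident memory (kB) for processes whose name equals `name`."""
    usage = {}
    if not os.path.isdir(proc_root):
        return usage  # not a Linux host, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited or is not readable
        name_match = re.search(r"^Name:\s+(.+)$", text, re.MULTILINE)
        rss = parse_vm_rss_kb(text)
        if name_match and name_match.group(1).strip() == name and rss is not None:
            usage[int(entry)] = rss
    return usage


if __name__ == "__main__":
    # Assumption: the Erlang VM process is named "beam.smp" on this node.
    for pid, rss_kb in sorted(rss_by_process_name("beam.smp").items()):
        print(f"pid {pid}: {rss_kb / 1024:.0f} MB resident")
```

Running this periodically on the orchestrator node during a rebalance with replication active, and again with only one of the two active, would give comparable numbers for the three cases discussed in the comments.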