[MB-6649] beam.smp memory usage grows to 2 GB when xdcr feature is enabled and rebalancing is in progress Created: 13/Sep/12  Updated: 26/Sep/12  Resolved: 19/Sep/12

Status: Closed
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.0-beta-2
Fix Version/s: 2.0-beta-2
Security Level: Public

Type: Bug Priority: Critical
Reporter: Abhinav Dangeti Assignee: Abhinav Dangeti
Resolution: Cannot Reproduce Votes: 0
Labels: 2.0-beta-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.0.0-1721-rel
Centos 4G RAM 64-bit machines
1024 vbuckets

Attachments: PNG File Screen Shot 2012-09-13 at 1.49.42 PM.png     PNG File Screen Shot 2012-09-13 at 2.21.56 PM.png     PNG File Screen Shot 2012-09-13 at 2.22.14 PM.png    

 Description   
- Created default buckets on a 2:2 cluster
        [10.1.3.235, 10.1.3.236] : [10.1.3.237, 10.1.3.238]
- Set up bidirectional replication for the bucket, ran load on both buckets.
- Swap rebalanced a node on both clusters
        [10.1.3.235, 10.3.2.54] : [10.1.3.237, 10.3.2.55]
- Upon completion of rebalance, stopped load on default buckets.
- Created standard buckets on both the clusters.
- Set up unidirectional replication for the standard bucket from cluster 1 to cluster 2, ran load on cluster 1.
- Stopped load after a point.
- With replication still going on, rebalanced the removed nodes back in on each cluster (to make it 3:3)
        [10.1.3.235, 10.3.2.54, 10.1.3.236] : [10.1.3.237, 10.3.2.55, 10.1.3.238]
- During rebalance, no load was running on either cluster; however, replication was still in progress.
        - Heavy swap on the orchestrators of both clusters
        - Erlang (beam.smp) using a lot of memory (> 2.5 GB)
- Rebalance gradually completed on cluster 2.
- Rebalance fails on cluster 1:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rebalance exited with reason {{badmatch,{error,timeout}},
{gen_server,call,
[{'ns_memcached-bucket','ns_1@10.1.3.235'},
{get_vbucket,835},
60000]}}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- If rebalance is retried, it fails again:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rebalance exited with reason {not_all_nodes_are_ready_yet,['ns_1@10.1.3.235']}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- This is probably because 10.1.3.235 is in "PEND" state on the UI.
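The beam.smp growth reported above (> 2.5 GB resident, heavy swap) can be tracked over time with a small sampling script. A minimal sketch, assuming a Linux host where beam.smp's PID is known; `resident_kb` is a hypothetical helper, and the current process is sampled below only as a stand-in:

```python
import os

def resident_kb(pid):
    """Return a process's resident set size in kB, parsed from /proc/<pid>/status."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # /proc reports VmRSS in kB
    return None

# Sample the current process as a stand-in for beam.smp's PID;
# in practice the PID would come from e.g. pgrep beam.smp.
print(resident_kb(os.getpid()))
```

Logging this once a minute during rebalance would show whether the growth is gradual or step-wise as vbuckets move.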

- Uploaded the collected diagnostics to S3:

https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.235-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.236-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.3.2.54-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.237-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.238-8091-diag.txt.gz
https://s3.amazonaws.com/bugdb/MB-6649/10.3.2.55-8091-diag.txt.gz
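For reference, the replication setup in the steps above is driven through the cluster REST interface. A minimal sketch of the request bodies involved; the endpoint paths (`/pools/default/remoteClusters`, `/controller/createReplication`) and parameter names are assumptions based on the 2.0-era REST API, and all hostnames/credentials are placeholders:

```python
import urllib.parse

def remote_cluster_payload(name, hostname, username, password):
    # Body for POST /pools/default/remoteClusters, which registers
    # the destination cluster reference (assumed endpoint).
    return urllib.parse.urlencode({
        "name": name,
        "hostname": hostname,
        "username": username,
        "password": password,
    })

def create_replication_payload(from_bucket, to_cluster, to_bucket):
    # Body for POST /controller/createReplication, which starts XDCR
    # for one bucket (assumed endpoint and parameter names).
    return urllib.parse.urlencode({
        "fromBucket": from_bucket,
        "toCluster": to_cluster,
        "toBucket": to_bucket,
        "replicationType": "continuous",
    })

# Unidirectional replication of the standard bucket, cluster 1 -> cluster 2,
# with placeholder names:
body = create_replication_payload("standard", "cluster2", "standard")
```

For the bidirectional case on the default bucket, the same `createReplication` call would be issued once from each cluster toward the other.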

 Comments   
Comment by Junyi Xie (Inactive) [ 13/Sep/12 ]
It is good for QE to aggressively catch bugs, but IMHO it would be nice to investigate and test a little before simply handing bugs off to developers :-).

For example, in this case, is the heavy memory and swap usage caused by XDCR or by rebalance? It is easy to verify: just re-test without the rebalance and see if the issue persists.

Also, it is known that XDCR consumes resources at the destination (there are already a few bugs filed about it), so is this bug a duplicate of the previous ones?
Comment by Abhinav Dangeti [ 13/Sep/12 ]
This issue isn't seen with XDCR alone, or for that matter with rebalance alone.
Verified that with just rebalance: memory usage is not high.
Verified that with just replication: memory usage is not high.

It's the combination of the two that sometimes causes the heavy memory usage and swap.

Also observed:
- Rebalance failed on cluster 1.
- After rebalance finished on cluster 2, beam.smp's resident memory usage was still at 1.1 GB (replication was still going on)
Comment by Junyi Xie (Inactive) [ 18/Sep/12 ]
Can you please verify with the latest build?
Comment by Abhinav Dangeti [ 19/Sep/12 ]
Tried reproducing the same scenario on build 1744: beam.smp's memory usage stayed between 300 and 500 MB, and rebalance completed successfully on both clusters.
Generated at Sat Nov 29 03:12:20 CST 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.