[MB-7383] active item resident ratio drops significantly when adding a 2.0 node to a 1.8.1 cluster for upgrade (sasl bucket) Created: 09/Dec/12  Updated: 02/Jan/13  Resolved: 27/Dec/12

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0
Fix Version/s: 2.0.1
Security Level: Public

Type: Bug
Priority: Blocker
Reporter: Farshid Ghods (Inactive)
Assignee: Chisheng Hong (Inactive)
Resolution: Cannot Reproduce
Votes: 0
Labels: 2.0.0-hotfix
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
scenario:
Add 2 x 2.0 nodes to a 3-node cluster with 2 buckets (default and sasl, where the active resident ratio is 60%). A few minutes after the upgrade process begins, the active item resident ratio on 10.3.2.43, which is an existing 1.8.1 node, drops from 60% -> 58 -> 48 -> 38 -> 15 -> 5, at which point we decided to stop the rebalance.

I grabbed diags from the node whose resident ratio dropped from 63 percent to 5 percent.
https://s3.amazonaws.com/bugdb/jira/systemtest/resident-ratio-drop-2853bb89.zip

Please open a bug ASAP and mention this there. Assign the bug to Chiyoung, in the hope that he or Mike can take a look at the cluster.

[stats:debug] [2012-12-08 17:43:52]
active item resident ratio is at 63% and everything looks normal

and then at
[stats:debug] [2012-12-08 17:45:32]
vb_active_perc_mem_resident 58

and at
[stats:debug] [2012-12-08 17:47:12]
vb_active_perc_mem_resident 49

and at
[stats:debug] [2012-12-08 18:03:51]
vb_active_perc_mem_resident 35

What happened between 17:45 and 17:47, or between 17:56 and 18:03, that pushed the resident ratio this low?
Whatever it is, there is a combination of 1.8.1 and 2.0 that is causing the issue.
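
To help narrow down which window deserves the closest look, here is a minimal sketch that turns the samples quoted above into a drop rate per interval. The timestamps and values are copied from this ticket; the names `SAMPLES` and `drop_rates` are hypothetical and are not part of any Couchbase tool, and the sketch assumes the samples have already been extracted from the diag by hand rather than parsed from the raw log.

```python
from datetime import datetime

# Samples copied from the ticket description:
# (timestamp of the [stats:debug] entry, vb_active_perc_mem_resident).
SAMPLES = [
    ("2012-12-08 17:43:52", 63),
    ("2012-12-08 17:45:32", 58),
    ("2012-12-08 17:47:12", 49),
    ("2012-12-08 18:03:51", 35),
]


def drop_rates(samples):
    """Return (start, end, points_dropped, points_per_minute) for each
    pair of consecutive samples."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), pct) for ts, pct in samples
    ]
    rates = []
    for (t0, p0), (t1, p1) in zip(parsed, parsed[1:]):
        minutes = (t1 - t0).total_seconds() / 60.0
        dropped = p0 - p1
        rates.append((t0, t1, dropped, dropped / minutes))
    return rates


if __name__ == "__main__":
    for t0, t1, dropped, rate in drop_rates(SAMPLES):
        print(
            f"{t0:%H:%M:%S} -> {t1:%H:%M:%S}: "
            f"-{dropped} points ({rate:.2f} points/min)"
        )
```

On these four samples the steepest drop, about 5.4 points per minute, falls in the 17:45:32 to 17:47:12 window. The last interval is almost 17 minutes long, so its average may hide shorter, steeper drops; more samples from the diag would be needed to confirm.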


collect info from other nodes:



 Comments   
Comment by Farshid Ghods (Inactive) [ 09/Dec/12 ]
existing 1.8.1 node : https://s3.amazonaws.com/bugdb/jira/MB-7383/10.3.2.122-1292012-1553-diag.zip
2.0 node : https://s3.amazonaws.com/bugdb/jira/MB-7383/10.3.2.41-1292012-161-diag.zip
existing 1.8.1 node : https://s3.amazonaws.com/bugdb/jira/MB-7383/10.3.2.43-1292012-1551-diag.zip
existing 1.8.1 node : https://s3.amazonaws.com/bugdb/jira/MB-7383/10.3.2.47-1292012-1558-diag.zip
Comment by Farshid Ghods (Inactive) [ 09/Dec/12 ]
2.0 node : 10.3.2.85 https://s3.amazonaws.com/bugdb/jira/MB-7383/10.3.2.85-1292012-1616-diag.zip
Comment by Chiyoung Seo [ 09/Dec/12 ]
Farshid,

This might be caused by the memory leak bug in 1.8.1. Can you please test it with the latest 1.8.1 patch (build 943?)?
Comment by Farshid Ghods (Inactive) [ 10/Dec/12 ]
From the email:


Chisheng,

Let's replace the ep.so file with the one available in the 1.8.1-943-rel build on all nodes in this cluster, then add another 2.0 node and rebalance the cluster again.

Please keep chiyoung in the loop after the experiment.
Comment by Farshid Ghods (Inactive) [ 10/Dec/12 ]
QE will update the ticket when results are available
Comment by Chisheng Hong (Inactive) [ 27/Dec/12 ]
https://github.com/couchbaselabs/couchbase-qe-docs/blob/master/system-tests/pine-cluster/12-10-2012.txt

Cannot repro this on an EC2 cluster for CentOS. The problem was caused by slow disk speed in the previous test: https://github.com/couchbaselabs/couchbase-qe-docs/blob/master/system-tests/pine-cluster/12-08-2012.txt
Generated at Fri Sep 19 12:53:11 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.