<!-- 
RSS generated by JIRA (5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9) at Sat May 25 21:09:03 CDT 2013

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary add field=key&field=summary to the URL of your request.
For example:
http://www.couchbase.com/issues/si/jira.issueviews:issue-xml/MB-6550/MB-6550.xml?field=key&field=summary
-->
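<!--
The generator note above describes restricting the returned fields by repeating the 'field' query parameter.
A minimal sketch of building such a URL follows. It assumes only the base URL and the parameter name given
in that note; the helper name build_issue_url is hypothetical and not part of JIRA.

```python
from urllib.parse import urlencode

# Base URL copied from the example in the generator note.
BASE_URL = "http://www.couchbase.com/issues/si/jira.issueviews:issue-xml/MB-6550/MB-6550.xml"


def build_issue_url(fields):
    """Return BASE_URL with one repeated 'field' parameter per requested field (hypothetical helper)."""
    if not fields:
        # No restriction requested: return the plain issue URL.
        return BASE_URL
    # A list of pairs lets urlencode emit the same parameter name more than once.
    query = urlencode([("field", name) for name in fields])
    return BASE_URL + "?" + query


print(build_issue_url(["key", "summary"]))
```
-->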
<rss version="0.92" >
<channel>
    <title>Couchbase</title>
    <link>http://www.couchbase.com/issues</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>5.2.4</version>
        <build-number>845</build-number>
        <build-date>26-12-2012</build-date>
    </build-info>

<item>
            <title>[MB-6550] [longevity] Rebalance hang after failover and remove node because of the memory leak on a couple of nodes</title>
                <link>http://www.couchbase.com/issues/browse/MB-6550</link>
                <project id="10010" key="MB">Couchbase Server</project>
                        <description>Cluster information:&lt;br/&gt;
- 11 CentOS 6.2 64-bit servers, each with a 4-core CPU&lt;br/&gt;
- Each server has 10 GB RAM and 150 GB disk.&lt;br/&gt;
- 8 GB RAM for Couchbase Server on each node (80% of total system memory)&lt;br/&gt;
- Disk format is ext3 on both data and root&lt;br/&gt;
- Each server has its own drive; no disk is shared with other servers.&lt;br/&gt;
- Load 9 million items to both buckets&lt;br/&gt;
- Cluster has 2 buckets, default (3GB) and saslbucket (3GB)&lt;br/&gt;
- Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11)&lt;br/&gt;
- Add one more doc d2 with 2 views to default bucket&lt;br/&gt;
&lt;br/&gt;
* Start a 10-node cluster with Couchbase Server 2.0.0-1663 installed&lt;br/&gt;
10.3.121.13&lt;br/&gt;
10.3.121.14&lt;br/&gt;
10.3.121.15&lt;br/&gt;
10.3.121.16&lt;br/&gt;
10.3.121.17&lt;br/&gt;
10.3.121.20&lt;br/&gt;
10.3.121.22&lt;br/&gt;
10.3.121.24&lt;br/&gt;
10.3.121.25&lt;br/&gt;
10.3.121.23&lt;br/&gt;
* Data path /data&lt;br/&gt;
* View path /data&lt;br/&gt;
&lt;br/&gt;
* In the last run, I did a swap rebalance, removing node 13 and adding node 26.&lt;br/&gt;
* Node 26 then failed due to a physical failure. I failed over node 26 and rebalanced.&lt;br/&gt;
* Rebalance failed with known issue &lt;a href=&quot;http://www.couchbase.com/issues/browse/MB-6497&quot; title=&quot;Not ready to replicate from vbuckets cause rebalance failure due to bad_replicas when replica count &amp;gt; 1&quot;&gt;&lt;strike&gt;MB-6497&lt;/strike&gt;&lt;/a&gt; at the end of rebalance saslbucket &lt;br/&gt;
* Node 22 went down because it ran out of disk space. I failed over node 22.&lt;br/&gt;
* Removed node 13. Rebalance started at 19:26:35 - Wed Sep 5, 2012&lt;br/&gt;
&lt;br/&gt;
Bucket &amp;quot;default&amp;quot; rebalance does not seem to be swap rebalance	ns_vbucket_mover000	&lt;a href=&apos;mailto:ns_1@10.3.121.14&apos;&gt;ns_1@10.3.121.14&lt;/a&gt;	19:26:35 - Wed Sep 5, 2012&lt;br/&gt;
&lt;br/&gt;
Rebalance has been hung since then, up to now: Thu Sep  6 19:25:29 PDT 2012&lt;br/&gt;
&lt;br/&gt;
CPU and beam stats&lt;br/&gt;
&lt;br/&gt;
10.3.121.15&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 2796m  Rm: 613m  CPU: 13.7  beam.smp&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 6091m  Rm: 4.2g  CPU: 9.8  memcached&lt;br/&gt;
10.3.121.13&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 1845m  Rm: 338m  CPU: 9.9  beam.smp&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 1230m  Rm: 1.0g  CPU: 2.0  memcached&lt;br/&gt;
10.3.121.23&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 2443m  Rm: 652m  CPU: 9.8  beam.smp&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 4969m  Rm: 3.4g  CPU: 7.9  memcached&lt;br/&gt;
10.3.121.24&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 3304m  Rm: 907m  CPU: 19.4  beam.smp&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 5440m  Rm: 4.0g  CPU: 3.9  memcached&lt;br/&gt;
10.3.121.14&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 3462m  Rm: 665m  CPU: 30.7  beam.smp&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 6329m  Rm: 4.1g  CPU: 5.1  memcached&lt;br/&gt;
10.3.121.16&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 2702m  Rm: 642m  CPU: 13.2  beam.smp&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 4845m  Rm: 3.5g  CPU: 5.0  memcached&lt;br/&gt;
10.3.121.17&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 4498m  Rm: 1.4g  CPU: 91.2  beam.smp&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 5359m  Rm: 3.6g  CPU: 1.7  memcached&lt;br/&gt;
10.3.121.20&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 3793m  Rm: 1.0g  CPU: 11.7  beam.smp&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Vm: 5356m  Rm: 3.7g  CPU: 1.7  memcached&lt;br/&gt;
&lt;br/&gt;
Swap stats in MB&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Total      Used      Free&lt;br/&gt;
10.3.121.15&lt;br/&gt;
Swap:         5199       1815       3384&lt;br/&gt;
10.3.121.13&lt;br/&gt;
Swap:         5199         10       5189&lt;br/&gt;
10.3.121.22&lt;br/&gt;
Swap:         5199         15       5184&lt;br/&gt;
10.3.121.14&lt;br/&gt;
Swap:         5199       2503       2696&lt;br/&gt;
10.3.121.23&lt;br/&gt;
Swap:         5199       1037       4162&lt;br/&gt;
10.3.121.24&lt;br/&gt;
Swap:         5199       1543       3656&lt;br/&gt;
10.3.121.17&lt;br/&gt;
Swap:         5199       2156       3043&lt;br/&gt;
10.3.121.16&lt;br/&gt;
Swap:         5199       1156       4043&lt;br/&gt;
10.3.121.20&lt;br/&gt;
Swap:         5199       1949       3250&lt;br/&gt;
&lt;br/&gt;
&lt;br/&gt;
Link to diags of all nodes&lt;br/&gt;
&lt;a href=&quot;https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/9nodes-1663-reb-hang-20120906.tgz&quot;&gt;https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/9nodes-1663-reb-hang-20120906.tgz&lt;/a&gt;&lt;br/&gt;
</description>
                <environment>centos 6.2 64bit</environment>
            <key id="19613">MB-6550</key>
            <summary>[longevity] Rebalance hang after failover and remove node because of the memory leak on a couple of nodes</summary>
                <type id="1" iconUrl="http://www.couchbase.com/issues/images/icons/issuetypes/bug.png">Bug</type>
                                <priority id="3" iconUrl="http://www.couchbase.com/issues/images/icons/priorities/major.png">Major</priority>
                    <status id="6" iconUrl="http://www.couchbase.com/issues/images/icons/statuses/closed.png">Closed</status>
                    <resolution id="1">Fixed</resolution>
                    <security id="10011">Public</security>
                        <assignee username="chiyoung">Chiyoung Seo</assignee>
                                <reporter username="thuan">Thuan Nguyen</reporter>
                        <labels>
                        <label>system-test</label>
                    </labels>
                <created>Thu, 6 Sep 2012 21:28:12 -0500</created>
                <updated>Wed, 9 Jan 2013 22:59:29 -0600</updated>
                    <resolved>Fri, 7 Sep 2012 14:59:17 -0500</resolved>
                            <version>2.0-beta</version>
                                <fixVersion>2.0-beta</fixVersion>
                                <component>couchbase-bucket</component>
                                <votes>0</votes>
                        <watches>0</watches>
                                                    <comments>
                    <comment id="38026" author="chiyoung" created="Fri, 7 Sep 2012 13:14:46 -0500"  >The memory usage on 10.3.121.14 and 10.3.121.15 is above 90% of their bucket quota even after most of the active and replica items were ejected. This is why the rebalance got stuck:&lt;br/&gt;
&lt;br/&gt;
Chiyoung-MacBook:ep-engine chiyoung$ ./management/cbstats 10.3.121.14:11210 raw memory&lt;br/&gt;
&amp;nbsp;ep_kv_size:                          2436606624&lt;br/&gt;
&amp;nbsp;ep_max_data_size:                    3145728000&lt;br/&gt;
&amp;nbsp;ep_mem_high_wat:                     2359296000&lt;br/&gt;
&amp;nbsp;ep_mem_low_wat:                      1887436800&lt;br/&gt;
&amp;nbsp;ep_mem_tracker_enabled:              true&lt;br/&gt;
&amp;nbsp;ep_oom_errors:                       0&lt;br/&gt;
&amp;nbsp;ep_overhead:                         221345920&lt;br/&gt;
&amp;nbsp;ep_tmp_oom_errors:                   0&lt;br/&gt;
&amp;nbsp;ep_value_size:                       2214922031&lt;br/&gt;
&amp;nbsp;mem_used:                            2831961568&lt;br/&gt;
&amp;nbsp;tcmalloc_current_thread_cache_bytes: 2281472&lt;br/&gt;
&amp;nbsp;tcmalloc_max_thread_cache_bytes:     4194304&lt;br/&gt;
&amp;nbsp;tcmalloc_unmapped_bytes:             7356416&lt;br/&gt;
&amp;nbsp;total_allocated_bytes:               5440249488&lt;br/&gt;
&amp;nbsp;total_fragmentation_bytes:           919716208&lt;br/&gt;
&amp;nbsp;total_free_bytes:                    2457600&lt;br/&gt;
&amp;nbsp;total_heap_bytes:                    6362423296&lt;br/&gt;
&lt;br/&gt;
Chiyoung-MacBook:ep-engine chiyoung$ ./management/cbstats 10.3.121.14:11210 all | grep resident&lt;br/&gt;
&amp;nbsp;ep_num_non_resident:                2427780&lt;br/&gt;
&amp;nbsp;vb_active_num_non_resident:         1005950&lt;br/&gt;
&amp;nbsp;vb_active_perc_mem_resident:        0&lt;br/&gt;
&amp;nbsp;vb_pending_num_non_resident:        0&lt;br/&gt;
&amp;nbsp;vb_pending_perc_mem_resident:       0&lt;br/&gt;
&amp;nbsp;vb_replica_num_non_resident:        1421830&lt;br/&gt;
&amp;nbsp;vb_replica_perc_mem_resident:       0&lt;br/&gt;
&lt;br/&gt;
&lt;br/&gt;
It seems to me that there is a serious memory leak on 14 and 15. In particular, ep_value_size (2214922031) indicates that most Blob value instances were not freed even after we ejected them. Those blob values are referenced in many places (hash table, flusher, tap replicator, etc.)&lt;br/&gt;
</comment>
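                    <!--
The comment above states that memory usage on 10.3.121.14 is above 90% of the bucket quota. A minimal sketch
of that arithmetic follows, using only the numbers quoted from the cbstats output. Treating ep_max_data_size
as the bucket quota on this node is an assumption, and the helper name fraction_of_quota is hypothetical.
The code only reproduces the ratios; it does not demonstrate why rebalance stalls at this level.

```python
# Values copied from the cbstats "raw memory" output for 10.3.121.14.
stats = {
    "mem_used": 2831961568,
    "ep_max_data_size": 3145728000,  # assumed here to be the bucket quota on this node
    "ep_mem_high_wat": 2359296000,
    "ep_mem_low_wat": 1887436800,
    "ep_value_size": 2214922031,
}


def fraction_of_quota(value, quota):
    """Return value as a fraction of quota (hypothetical helper)."""
    return value / quota


quota = stats["ep_max_data_size"]
used = fraction_of_quota(stats["mem_used"], quota)
high = fraction_of_quota(stats["ep_mem_high_wat"], quota)
low = fraction_of_quota(stats["ep_mem_low_wat"], quota)
values = fraction_of_quota(stats["ep_value_size"], quota)

print(f"mem_used / quota:        {used:.1%}")
print(f"high watermark / quota:  {high:.1%}")
print(f"low watermark / quota:   {low:.1%}")
print(f"ep_value_size / quota:   {values:.1%}")
print("mem_used above high watermark:", stats["mem_used"] > stats["ep_mem_high_wat"])
```
                    -->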
                    <comment id="38051" author="chiyoung" created="Fri, 7 Sep 2012 14:59:17 -0500"  >&lt;a href=&quot;http://review.couchbase.org/#/c/20632/&quot;&gt;http://review.couchbase.org/#/c/20632/&lt;/a&gt;</comment>
                    <comment id="38116" author="thuan" created="Sat, 8 Sep 2012 01:22:22 -0500"  >Integrated in github-ep-engine-2-0 #426 (See [&lt;a href=&quot;http://qa.hq.northscale.net/job/github-ep-engine-2-0/426/&quot;&gt;http://qa.hq.northscale.net/job/github-ep-engine-2-0/426/&lt;/a&gt;])&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;a href=&quot;http://www.couchbase.com/issues/browse/MB-6550&quot; title=&quot;[longevity] Rebalance hang after failover and remove node because of the memory leak on a couple of nodes&quot;&gt;&lt;strike&gt;MB-6550&lt;/strike&gt;&lt;/a&gt; Free bg-fetched items if the TAP connection is invalid. (Revision 25f4791191a3c3aca670781357b61559191a7f65)&lt;br/&gt;
&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Result = SUCCESS&lt;br/&gt;
Chiyoung Seo : &lt;br/&gt;
Files : &lt;br/&gt;
* src/tapconnmap.cc&lt;br/&gt;
</comment>
                    <comment id="38637" author="farshid" created="Wed, 12 Sep 2012 15:24:43 -0500"  >Is this a system test blocker? If so, please add the sblocker label.</comment>
                    <comment id="39215" author="kzeller" created="Mon, 17 Sep 2012 18:02:57 -0500"  >Beta RN: Fixed rebalance failure. Rebalance had stalled after performing failover and removing a node, due to a memory leak on cluster nodes.</comment>
                </comments>
                    <attachments>
                    <attachment id="14808" name="9nodes-1663-reb-hang-20120906_checkpoint.txt" size="2263255" author="thuan" created="Thu, 6 Sep 2012 21:28:12 -0500" />
                    <attachment id="14806" name="9nodes-1663-reb-hang-20120906_stats_all.txt" size="96284" author="thuan" created="Thu, 6 Sep 2012 21:28:12 -0500" />
                    <attachment id="14807" name="9nodes-1663-reb-hang-20120906_tap.txt" size="368999" author="thuan" created="Thu, 6 Sep 2012 21:28:12 -0500" />
                </attachments>
            <subtasks>
        </subtasks>
                <customfields>
                                                                        <customfield id="customfield_10180" key="com.atlassian.jira.ext.charting:firstresponsedate">
                <customfieldname>Date of First Response</customfieldname>
                <customfieldvalues>
                    <customfieldvalue>Fri, 7 Sep 2012 13:14:46 -0500</customfieldvalue>

                </customfieldvalues>
            </customfield>
                                                                                                                                                                                                            <customfield id="customfield_10081" key="com.pyxis.greenhopper.jira:gh-global-rank">
                <customfieldname>Rank</customfieldname>
                <customfieldvalues>
                    <customfieldvalue>4063</customfieldvalue>
                </customfieldvalues>
            </customfield>
                                                                                                                                                                                        <customfield id="customfield_10181" key="com.atlassian.jira.ext.charting:timeinstatus">
                <customfieldname>Time In Status</customfieldname>
                <customfieldvalues>
                    
                </customfieldvalues>
            </customfield>
                                                </customfields>
    </item>
</channel>
</rss>