<!-- 
RSS generated by JIRA (5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9) at Thu Jun 20 05:27:32 CDT 2013

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary add field=key&field=summary to the URL of your request.
For example:
http://www.couchbase.com/issues/si/jira.issueviews:issue-xml/MB-4461/MB-4461.xml?field=key&field=summary
-->
<rss version="0.92" >
<channel>
    <title>Couchbase</title>
    <link>http://www.couchbase.com/issues</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>5.2.4</version>
        <build-number>845</build-number>
        <build-date>26-12-2012</build-date>
    </build-info>

<item>
            <title>[MB-4461] replication cursor stuck from slave 1 to slave 2, hence high number of checkpoint items in slave 1</title>
                <link>http://www.couchbase.com/issues/browse/MB-4461</link>
                <project id="10010" key="MB">Couchbase Server</project>
                        <description>Two out of sixteen nodes are ejecting active items because their mem_used is above the high water mark. The other nodes are well below. Customer says that keys are of various sizes, but the larger ones should be spread out randomly across the different nodes. Number of keys on all nodes is roughly equal.&lt;br/&gt;
&lt;br/&gt;
The two problem nodes show ep_value_size much larger than a healthy node. However, looking at the sqlite data files, there&amp;#39;s no significant difference in size of the files on disk (as seen, for example, in */membase.log).&lt;br/&gt;
&lt;br/&gt;
FYI, the rise in data size seems to have started on these two nodes after a different node, 10.254.7.150, stopped responding to REST and membase was restarted (with &amp;#39;service membase-server restart&amp;#39;).&lt;br/&gt;
&lt;br/&gt;
The mbcollect_info data for these servers is in S3. The logs are named:&lt;br/&gt;
&lt;br/&gt;
membase 16: a good node, for comparison&lt;br/&gt;
membase 07 and membase 14: the trouble nodes that are ejecting items due to large memory usage&lt;br/&gt;
membase 11: the node that was restarted on Saturday&lt;br/&gt;
&lt;br/&gt;
&lt;br/&gt;
Can someone please take a look at this, and help me understand why the ep_value_size might be bloating up for these two nodes?&lt;br/&gt;
&lt;br/&gt;
Thanks,&lt;br/&gt;
&lt;br/&gt;
Tim&lt;br/&gt;
&lt;br/&gt;
</description>
                <environment>Ubuntu 10.04.3 LTS x86_64, Membase 1.7.2, Amazon m1.large instances, 16-node cluster.</environment>
            <key id="15607">MB-4461</key>
            <summary>replication cursor stuck from slave 1 to slave 2, hence high number of checkpoint items in slave 1</summary>
                <type id="1" iconUrl="http://www.couchbase.com/issues/images/icons/issuetypes/bug.png">Bug</type>
                                <priority id="2" iconUrl="http://www.couchbase.com/issues/images/icons/priorities/critical.png">Critical</priority>
                    <status id="5" iconUrl="http://www.couchbase.com/issues/images/icons/statuses/resolved.png">Resolved</status>
                    <resolution id="1">Fixed</resolution>
                    <security id="10011">Public</security>
                        <assignee username="mikew">Mike Wiederhold</assignee>
                                <reporter username="TimSmith">Tim Smith</reporter>
                        <labels>
                        <label>customer</label>
                    </labels>
                <created>Mon, 21 Nov 2011 20:55:06 -0600</created>
                <updated>Tue, 10 Apr 2012 21:03:20 -0500</updated>
                    <resolved>Thu, 26 Jan 2012 11:59:56 -0600</resolved>
                            <version>1.7.2</version>
                                <fixVersion>1.8.2</fixVersion>
                                <component>couchbase-bucket</component>
                                <votes>0</votes>
                        <watches>4</watches>
                                                    <comments>
                    <comment id="22754" author="farshid" created="Tue, 22 Nov 2011 20:04:17 -0600"  >Tim,&lt;br/&gt;
&lt;br/&gt;
I will have a look at the diags tomorrow morning, but it would be helpful to get this info from the customer:&lt;br/&gt;
&lt;br/&gt;
1- number of items per node&lt;br/&gt;
2- mem_used on those nodes that are not ejecting active items&lt;br/&gt;
3- current state of the cluster after it flushed out the active items</comment>
                    <comment id="22767" author="farshid" created="Wed, 23 Nov 2011 14:23:06 -0600"  >farshid-2:Downloads farshid$ egrep -ir &amp;quot;vb_113&amp;quot; membase14-checkpointstats-201111231920.txt&lt;br/&gt;
&amp;nbsp;vb_113:last_closed_checkpoint_i: 1323&lt;br/&gt;
&amp;nbsp;vb_113:num_checkpoint_items:     373590&lt;br/&gt;
&amp;nbsp;vb_113:num_checkpoints:          264&lt;br/&gt;
&amp;nbsp;vb_113:num_items_for_persistenc: 4&lt;br/&gt;
&amp;nbsp;vb_113:num_tap_cursors:          1&lt;br/&gt;
&amp;nbsp;vb_113:open_checkpoint_id:       1324&lt;br/&gt;
&amp;nbsp;vb_113:persisted_checkpoint_id:  1323&lt;br/&gt;
&amp;nbsp;vb_113:state:                    replica&lt;br/&gt;
&lt;br/&gt;
This shows that this node did not close the checkpoint for this vbucket, and hence we have 264 open checkpoints in memory.&lt;br/&gt;
&lt;br/&gt;
The replication cursor seems to be stuck on the replica node.</comment>
                    <comment id="22769" author="farshid" created="Wed, 23 Nov 2011 15:42:10 -0600"  >from Chiyoung:&lt;br/&gt;
&lt;br/&gt;
Keeping 2 checkpoints only applies to the master node. The slave node cannot create a new checkpoint for itself, but instead will receive checkpoint_start (_end) messages from the master.&lt;br/&gt;
&lt;br/&gt;
In this case, the cluster has two replicas (master A -&amp;gt; slave B -&amp;gt; slave C), and the replication cursor for C on B got stuck and didn&amp;#39;t move forward.</comment>
                    <comment id="22796" author="TimSmith" created="Mon, 28 Nov 2011 16:27:40 -0600"  >For the record, this has shown up on 1.7.2, not 1.7.1 as stated in the description.&lt;br/&gt;
&lt;br/&gt;
Tim</comment>
                    <comment id="23042" author="dipti" created="Fri, 16 Dec 2011 13:50:05 -0600"  >Post 1.8.0 Hot fix.   </comment>
                    <comment id="23091" author="perry" created="Tue, 20 Dec 2011 13:39:49 -0600"  >One thing that I had a question on regarding this.  Is the cursor expected to be stuck completely, or will it eventually clear itself out?  At the customer, we are seeing everything eventually resolve itself...I just want to make sure we&amp;#39;re looking at the same issue.</comment>
                    <comment id="23786" author="paf" created="Thu, 26 Jan 2012 10:24:56 -0600"  >Colleagues, let me thank you for a great product!&lt;br/&gt;
&lt;br/&gt;
We&amp;#39;re seeing something similar. Has there been any progress towards understanding this?&lt;br/&gt;
Our statistics show that the TAP cursors are really stuck:&lt;br/&gt;
$ /opt/membase/bin/mbstats 10.112.119.11:11210 tap GXH teligent|grep -E 10.112.119.12&lt;br/&gt;
..&lt;br/&gt;
&amp;nbsp;eq_tapq:replication_ns_1@10.112.119.12:total_backlog_size:        1879177&lt;br/&gt;
&amp;nbsp;eq_tapq:replication_ns_1@10.112.119.12:total_noops:               58613&lt;br/&gt;
..&lt;br/&gt;
$ ---few minutes passed---&lt;br/&gt;
$ /opt/membase/bin/mbstats 10.112.119.11:11210 tap GXH teligent|grep -E 10.112.119.12&lt;br/&gt;
..&lt;br/&gt;
&amp;nbsp;eq_tapq:replication_ns_1@10.112.119.12:total_backlog_size:        1882715&lt;br/&gt;
&amp;nbsp;eq_tapq:replication_ns_1@10.112.119.12:total_noops:               58644&lt;br/&gt;
..&lt;br/&gt;
&lt;br/&gt;
Only these two statistics changed.&lt;br/&gt;
&lt;br/&gt;
Is there any way to prod replication?&lt;br/&gt;
Or any way to learn its state?</comment>
                    <comment id="23789" author="chiyoung" created="Thu, 26 Jan 2012 11:59:56 -0600"  >Fixed in 1.8 release.</comment>
                    <comment id="23831" author="paf" created="Fri, 27 Jan 2012 03:11:59 -0600"  >Thanks for the great news, Chiyoung!&lt;br/&gt;
One small Q:&lt;br/&gt;
We understand there is no way to learn the state of replication &#8211; it should &amp;quot;just work&amp;quot;, and that was fixed, right?</comment>
                </comments>
                    <attachments>
                </attachments>
            <subtasks>
        </subtasks>
                <customfields>
                                                                        <customfield id="customfield_10180" key="com.atlassian.jira.ext.charting:firstresponsedate">
                <customfieldname>Date of First Response</customfieldname>
                <customfieldvalues>
                    <customfieldvalue>Tue, 22 Nov 2011 19:59:33 -0600</customfieldvalue>

                </customfieldvalues>
            </customfield>
                                                                                                                                                                                                                                <customfield id="customfield_10081" key="com.pyxis.greenhopper.jira:gh-global-rank">
                <customfieldname>Rank</customfieldname>
                <customfieldvalues>
                    <customfieldvalue>6028</customfieldvalue>
                </customfieldvalues>
            </customfield>
                                                                                                                                                                                        <customfield id="customfield_10181" key="com.atlassian.jira.ext.charting:timeinstatus">
                <customfieldname>Time In Status</customfieldname>
                <customfieldvalues>
                    
                </customfieldvalues>
            </customfield>
                                                                    </customfields>
    </item>
</channel>
</rss>