[MB-11731] Persistence to disk suffers from bucket compaction Created: 15/Jul/14  Updated: 26/Jul/14  Resolved: 26/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File compaction_vs_write_queue.png     PNG File disk_write_queue.png     PNG File drain_rate.png    
Issue Links:
Duplicate
duplicates MB-11799 Bucket compaction causes massive slow... Open
is duplicated by MB-11769 Major regression in write performance... Closed
Relates to
relates to MB-11732 Auto-compaction doesn't honor fragmen... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/357/artifact/
Is this a Regression?: Yes

 Description   
5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec, LAN

This is not a new problem; we have been observing it for many months.

From the attached charts you can see that the drain rates (and, correspondingly, the disk write queues) of the two buckets are antiphased: every 30-40 minutes one of the buckets drains faster.

On average the size of the disk write queue doesn't differ from 2.5.x, but peak values are slightly higher.


 Comments   
Comment by Pavel Paulau [ 15/Jul/14 ]
Actually the drain rate suffers from slower compaction; see also MB-11732.
Comment by Sundar Sridharan [ 15/Jul/14 ]
Artem, local testing reveals that spawning multiple database compactions in parallel makes compaction faster. Could you please explore whether we can somehow trigger multiple compactions in parallel, preferably on the same shard?
shardId = vbucketId % 4
thanks
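Purely as an illustration of that suggestion (this sketch is not from the ticket): with the shardId = vbucketId % 4 mapping above, the vbuckets of one shard could be selected and compacted concurrently. compact_vbucket is a hypothetical placeholder, not an actual ep-engine or ns_server API.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # matches shardId = vbucketId % 4 above

def group_by_shard(vbucket_ids):
    # Group vbucket ids by the shard they map to.
    shards = defaultdict(list)
    for vb in vbucket_ids:
        shards[vb % NUM_SHARDS].append(vb)
    return shards

def compact_shard_in_parallel(vbucket_ids, shard_id, compact_vbucket, workers=4):
    # Compact all vbuckets of one shard concurrently instead of one at a time.
    vbs = group_by_shard(vbucket_ids)[shard_id]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for vb in vbs:
            pool.submit(compact_vbucket, vb)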
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Pavel Paulau [ 21/Jul/14 ]
It's actually getting worse, see MB-11769 for details.

Also, I don't think that parallel compaction is the right solution for this particular issue.
We do need this feature but we cannot use it as a fix.
Comment by Pavel Paulau [ 21/Jul/14 ]
Promoting to blocker and assigning back to ep-engine team due to MB-11769.
Comment by Sundar Sridharan [ 21/Jul/14 ]
Pavel, could you please clarify what you meant by "it's actually getting worse" - did you mean a regression where a recent build (Jul 18th or later) shows a poorer drain rate than an earlier 3.0 build? thanks
Comment by Pavel Paulau [ 21/Jul/14 ]
Yes, when compaction starts the disk write queue grows even higher in recent builds.

MB-11769 has more details about difference.
Comment by Sundar Sridharan [ 21/Jul/14 ]
Pavel, I see that you are using build 988, which contains the writer limit of 8 threads. As one might expect, limiting writers seems to have an impact on drain rate. To confirm this behavior, if possible, could you please experiment with a larger setting for max_num_writers using
cbepctl set flush_param max_num_writers 12 (or a different number) to see if this improves the drain rate (note that this may come at the expense of increased CPU usage)?
thanks
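As a usage note (not from the ticket), the parameter would typically be applied on every data node. A minimal sketch, assuming the usual cbepctl <host>:11210 invocation and a hypothetical node list:

import subprocess

NODES = ["node1.example.com", "node2.example.com"]  # hypothetical node list
MAX_NUM_WRITERS = "12"  # value suggested above; tune as needed

for node in NODES:
    # flush_param settings are per node, so the command is issued against each one.
    subprocess.check_call(["cbepctl", "%s:11210" % node,
                           "set", "flush_param",
                           "max_num_writers", MAX_NUM_WRITERS])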
Comment by Pavel Paulau [ 21/Jul/14 ]
I will try that.

But the number of writers is still higher in 3.0; persistence being slower than in 2.5.1 doesn't make a lot of sense.
Comment by Sundar Sridharan [ 25/Jul/14 ]
fix uploaded for review at http://review.couchbase.org/39880 thanks
Comment by Chiyoung Seo [ 25/Jul/14 ]
I made several fixes for this issue and MB-11799 (I think both issues share the same root cause):

http://review.couchbase.org/#/c/39906/
http://review.couchbase.org/#/c/39907/
http://review.couchbase.org/#/c/39910/

We will provide the toy build for Pavel.
Comment by Chiyoung Seo [ 25/Jul/14 ]
The toy build is available in

http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent58-3.0.0-toy-couchstore-x86_64_3.0.0-785-toy.rpm
Comment by Pavel Paulau [ 26/Jul/14 ]
Closing as duplicate of MB-11799 in order to minimize noise.




[MB-11794] Creating 10 buckets causes memcached segmentation fault Created: 23/Jul/14  Updated: 26/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-998

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: Text File gdb.log    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares/396/artifact/
Is this a Regression?: Yes

 Comments   
Comment by Chiyoung Seo [ 23/Jul/14 ]
Sundar,

The backtrace indicates that it is mostly a regression from the vbucket-level lock change for the flusher, vb snapshot, compaction, and vbucket deletion tasks, which we made recently.
Comment by Pavel Paulau [ 23/Jul/14 ]
The same issue happened with a single bucket. The problem seems rather common.
Comment by Sundar Sridharan [ 24/Jul/14 ]
Found the root cause - cachedVBStates is not preallocated and is modified in a thread-unsafe manner. This regression shows up now because we have more parallelism with vbucket-level locking. Working on the fix.
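The real fix lives in ep-engine's C++ (see the review link below); purely to illustrate the pattern described above, here is a hedged Python sketch of preallocating the per-vbucket state container and serializing modifications so concurrent flusher/compaction/deletion tasks cannot race. VBStateCache is illustrative only, not the ep-engine data structure.

import threading

class VBStateCache:
    def __init__(self, num_vbuckets=1024):
        # Preallocate one slot per vbucket so concurrent writers never
        # resize the container while other threads are reading it.
        self._states = [None] * num_vbuckets
        self._lock = threading.Lock()

    def set_state(self, vbid, state):
        with self._lock:  # serialize modifications
            self._states[vbid] = state

    def get_state(self, vbid):
        with self._lock:
            return self._states[vbid]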
Comment by Sundar Sridharan [ 24/Jul/14 ]
fix uploaded for review at http://review.couchbase.org/#/c/39834/ thanks
Comment by Chiyoung Seo [ 24/Jul/14 ]
The fix was merged.




[MB-8207] moxi does not allow a noop before authentication on binary protocol Created: 07/May/13  Updated: 25/Jul/14  Due: 20/Jun/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Matt Ingenthron Assignee: Steve Yen
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
We had added a feature to spymemcached based on a Couchbase user's request to detect hung processes. This tries to complete a noop before doing auth.

It's fine against memcached/ep-engine in all cases, and it appears to be fine for ascii (where there is no authentication and it falls back to the version command), but moxi does not seem to allow auth after the noop. This may be because it's expecting the first command to wire it up to a downstream for the "gateway" moxi?

We're going to search for a workaround, but I wanted to make sure this issue was known.

See also:
https://code.google.com/p/spymemcached/issues/detail?id=272&thanks=272&ts=1364702110
and
https://github.com/mumoshu/play2-memcached/issues/17

 Comments   
Comment by Maria McDuff (Inactive) [ 08/May/13 ]
per bug triage, assigning to ronnie.
ronnie --- pls take a look. thanks.
Comment by Maria McDuff (Inactive) [ 08/Oct/13 ]
ronnie,

can you please update this bug based on 2.2.0 build 821?
thanks.
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Iryna,

can you verify in 3.0? if not resolved pls assign to Steve Y. Thanks.
Comment by Steve Yen [ 25/Jul/14 ]
Scrubbing through ancient moxi issues.

From the cproxy_protocol_b.c code, I see that moxi should be able to handle only VERSION and QUIT commands before doing an AUTH.

Rather than changing the "stabilized" moxi codebase, I'm marking this Won't Fix in the hope that our lithe, warm-blooded and fast-moving SDKs might be able to maneuver faster than moxi.
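For illustration only (this is not moxi's actual cproxy_protocol_b.c logic): a hedged sketch of a pre-auth command gate on the binary protocol, where VERSION, QUIT and the SASL commands are let through before authentication and anything else, including the NOOP that spymemcached sends first, is refused. Allowing the SASL opcodes is an assumption; the comment above only mentions VERSION and QUIT.

# memcached binary protocol opcodes (illustrative subset)
OP_QUIT, OP_NOOP, OP_VERSION = 0x07, 0x0A, 0x0B
OP_SASL_LIST_MECHS, OP_SASL_AUTH, OP_SASL_STEP = 0x20, 0x21, 0x22

ALLOWED_BEFORE_AUTH = {OP_VERSION, OP_QUIT,
                       OP_SASL_LIST_MECHS, OP_SASL_AUTH, OP_SASL_STEP}

def accept_command(opcode, authenticated):
    # After auth, everything is allowed; before auth, only the short list above,
    # so a leading NOOP would be rejected by such a gate.
    return authenticated or opcode in ALLOWED_BEFORE_AUTH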
Comment by Matt Ingenthron [ 25/Jul/14 ]
This one doesn't affect the SDKs, which talk directly to another service. In this case, the user was using spymemcached against moxi. Understood on the "won't fix", but just to be clear, moving to the Couchbase SDK just works around the "stabilized" moxi, as you call it.




[MB-11809] {UPR}:: Rebalance-in of 2 nodes is stuck when doing Ops Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by MB-11819 XDCR: Rebalance at destination hangs,... Resolved
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Build 1014, CentOS 6.x

VMs: 10.6.2.144-150

1. Create 7 node cluster
2. Create default bucket
3. Add 400 K items
4. Do mutations and rebalance-out 2 nodes
5. Do mutations and rebalance-in 2 nodes

Step 5 leads to rebalance being stuck

Test Case:: ./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceinout.RebalanceInOutTests.incremental_rebalance_out_in_with_mutation,init_num_nodes=3,items=400000,skip_cleanup=True,GROUP=IN_OUT;P0


 Comments   
Comment by Parag Agarwal [ 24/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11809/log.tar.gz
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Takeover request appears to be stuck. That's on node .147.

     {<19779.11046.0>,
      [{registered_name,'replication_manager-default'},
       {status,waiting},
       {initial_call,{proc_lib,init_p,5}},
       {backtrace,[<<"Program counter: 0x00007f1b1d12ffa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f1ad3083860 Return addr 0x00007f1b198ced78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.0.169038>">>,<<"y(1) infinity">>,
                   <<"y(2) {takeover,78}">>,<<"y(3) '$gen_call'">>,
                   <<"y(4) <0.11353.0>">>,<<"y(5) []">>,<<>>,
                   <<"0x00007f1ad3083898 Return addr 0x00007f1acbd79e70 (replication_manager:handle_call/3 + 2840)">>,
                   <<"y(0) infinity">>,<<"y(1) {takeover,78}">>,
                   <<"y(2) 'upr_replicator-default-ns_1@10.6.2.146'">>,
                   <<"y(3) Catch 0x00007f1b198ced78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007f1ad30838c0 Return addr 0x00007f1b198d3570 (gen_server:handle_msg/5 + 272)">>,
                   <<"y(0) [{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FM\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.149',\"789\"}]">>,
                   <<"(1) {state,\"default\",dcp,[{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FMN\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.1">>,
                   <<>>,
                   <<"0x00007f1ad30838d8 Return addr 0x00007f1b1d133ab0 (proc_lib:init_p_do_apply/3 + 56)">>,
                   <<"y(0) replication_manager">>,
                   <<"(1) {state,\"default\",dcp,[{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FMN\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.1">>,
                   <<"y(2) 'replication_manager-default'">>,
                   <<"y(3) <0.11029.0>">>,
                   <<"y(4) {dcp_takeover,'ns_1@10.6.2.146',78}">>,
                   <<"y(5) {<0.11528.0>,#Ref<0.0.0.169027>}">>,
                   <<"y(6) Catch 0x00007f1b198d3570 (gen_server:handle_msg/5 + 272)">>,
                   <<>>,
                   <<"0x00007f1ad3083918 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) Catch 0x00007f1b1d133ad0 (proc_lib:init_p_do_apply/3 + 88)">>,
                   <<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,42}]},
       {heap_size,610},
       {total_heap_size,2208},
       {links,[<19779.11029.0>]},
       {memory,18856},
       {message_queue_len,2},
       {reductions,17287},
       {trap_exit,true}]}
Comment by Mike Wiederhold [ 25/Jul/14 ]
http://review.couchbase.org/#/c/39894




[MB-11819] XDCR: Rebalance at destination hangs, missing replica items Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Aruna Piravi Assignee: Aruna Piravi
Resolution: Duplicate Votes: 0
Labels: rebalance-hang
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 172.23.106.45-7242014-208-diag.zip     Zip Archive 172.23.106.46-7242014-2010-diag.zip     Zip Archive 172.23.106.47-7242014-2011-diag.zip     Zip Archive 172.23.106.48-7242014-2013-diag.zip    
Issue Links:
Duplicate
duplicates MB-11809 {UPR}:: Rebalance-in of 2 nodes is st... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build
-------
3.0.0-1014

Scenario
------------
1. Uni-xdcr between 2-node clusters, default bucket
2. Load 30K items on source
3. Pause XDCR
4. Start "rebalance-out" of one node each from both clusters simultaneously.
5. Resume xdcr

Rebalance at the source proceeds to completion; rebalance on the destination hangs at 10%, see -

[2014-07-24 13:27:05,728] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.46'] at cluster 172.23.106.45
[2014-07-24 13:27:05,760] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.48'] at cluster 172.23.106.47
[2014-07-24 13:27:06,806] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:06,816] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:13,183] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:27:17,174] - [rest_client:1216] INFO - rebalance percentage : 24.21875 %
[2014-07-24 13:27:17,181] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:27,201] - [rest_client:1216] INFO - rebalance percentage : 33.59375 %
[2014-07-24 13:27:27,207] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:37,233] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 13:27:37,242] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:47,263] - [rest_client:1216] INFO - rebalance percentage : 53.90625 %
[2014-07-24 13:27:47,272] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:57,294] - [rest_client:1216] INFO - rebalance percentage : 60.8723958333 %
[2014-07-24 13:27:57,304] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:07,325] - [rest_client:1216] INFO - rebalance percentage : 100 %
[2014-07-24 13:28:30,222] - [task:411] INFO - rebalancing was completed with progress: 100% in 83.475001812 sec
[2014-07-24 13:28:30,223] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:28:30,229] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:40,252] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:50,280] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:00,301] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:10,342] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:20,363] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:30,389] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:40,410] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:50,437] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:00,458] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:10,480] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:20,504] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:30,523] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:40,546] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:50,569] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %

Testcase
--------------
./testrunner -i uni-xdcr.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,items=30000,rdirection=unidirection,ctopology=chain,replication_type=xmem,rebalance_out=source-destination,pause=source,GROUP=P1


Could the rebalance hang explain the missing replica items?

[2014-07-24 13:31:49,079] - [task:463] INFO - Saw curr_items 30000 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:31:49,103] - [data_helper:289] INFO - creating direct client 172.23.106.47:11210 default
[2014-07-24 13:31:49,343] - [data_helper:289] INFO - creating direct client 172.23.106.48:11210 default
[2014-07-24 13:31:49,536] - [task:463] INFO - Saw vb_active_curr_items 30000 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:31:49,559] - [data_helper:289] INFO - creating direct client 172.23.106.47:11210 default
[2014-07-24 13:31:49,811] - [data_helper:289] INFO - creating direct client 172.23.106.48:11210 default
[2014-07-24 13:31:50,001] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:31:55,045] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:32:00,080] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:32:05,113] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket

Logs
-------------
will attach cbcollect with xdcr trace logging.

 Comments   
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Do you have _any reason at all_ to believe that it's even remotely related to xdcr? Specifically, xdcr does nothing about upr replicas.
Comment by Aruna Piravi [ 24/Jul/14 ]
I, of course, _do_ know that replicas have nothing to do with xdcr. But I'm unsure whether xdcr and the parallel rebalance contributed to the hang.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
I cannot diagnose a stuck rebalance when logs are captured after cleanup.
Comment by Aruna Piravi [ 24/Jul/14 ]
And more on why I think so ---

Pls note from the logs below that there has been no progress in the rebalance at the destination _from_ the time we resumed xdcr. Until then it had progressed to 10%.

[2014-07-24 13:26:59,500] - [pauseResumeXDCR:92] INFO - ##### Pausing xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 13:26:59,541] - [rest_client:1757] INFO - Updated pauseRequested=true on bucket'default' on 172.23.106.45
[2014-07-24 13:26:59,968] - [task:517] WARNING - Not Ready: xdc_ops 1734 == 0 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:27:00,145] - [task:521] INFO - Saw replication_docs_rep_queue 0 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091',default bucket
[2014-07-24 13:27:00,339] - [task:517] WARNING - Not Ready: replication_active_vbreps 16 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091', default bucket
[2014-07-24 13:27:05,490] - [task:521] INFO - Saw xdc_ops 0 == 0 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:27:05,697] - [task:521] INFO - Saw replication_active_vbreps 0 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091',default bucket
[2014-07-24 13:27:05,728] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.46'] at source cluster 172.23.106.45
[2014-07-24 13:27:05,760] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.48'] at source cluster 172.23.106.47
[2014-07-24 13:27:05,761] - [xdcrbasetests:372] INFO - sleep for 5 secs. ...
[2014-07-24 13:27:06,733] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.46&user=Administrator&knownNodes=ns_1%40172.23.106.46%2Cns_1%40172.23.106.45
[2014-07-24 13:27:06,746] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 13:27:06,773] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.48&user=Administrator&knownNodes=ns_1%40172.23.106.47%2Cns_1%40172.23.106.48
[2014-07-24 13:27:06,796] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 13:27:06,806] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:06,816] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:10,823] - [pauseResumeXDCR:111] INFO - ##### Resume xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 13:27:10,860] - [rest_client:1757] INFO - Updated pauseRequested=false on bucket'default' on 172.23.106.45
[2014-07-24 13:27:11,101] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 894
[2014-07-24 13:27:11,102] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 13:27:12,043] - [task:521] INFO - Saw replication_active_vbreps 0 >= 0 expected on '172.23.106.45:8091',default bucket
[2014-07-24 13:27:12,260] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 869
[2014-07-24 13:27:12,261] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 13:27:13,142] - [task:521] INFO - Saw xdc_ops 4770 >= 0 expected on '172.23.106.47:8091',default bucket
[2014-07-24 13:27:13,183] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:27:17,174] - [rest_client:1216] INFO - rebalance percentage : 24.21875 %
[2014-07-24 13:27:17,181] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:27,201] - [rest_client:1216] INFO - rebalance percentage : 33.59375 %
[2014-07-24 13:27:27,207] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:37,233] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 13:27:37,242] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:47,263] - [rest_client:1216] INFO - rebalance percentage : 53.90625 %
[2014-07-24 13:27:47,272] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:57,294] - [rest_client:1216] INFO - rebalance percentage : 60.8723958333 %
[2014-07-24 13:27:57,304] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
Comment by Aruna Piravi [ 24/Jul/14 ]
Live cluster

http://172.23.106.45:8091/
http://172.23.106.47:8091/ <-- rebalance stuck
Comment by Aruna Piravi [ 24/Jul/14 ]
New logs attached.
Comment by Aruna Piravi [ 24/Jul/14 ]
Didn't try pausing replication from the source cluster. Wanted to leave the cluster in the same state.

.47 started receiving data through the resumed xdcr from 20:04:01. The last recorded rebalance progress on .47 was 8.7890625 % at 20:04:05. It could have stopped a few seconds before that.

[2014-07-24 20:03:55,538] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.46&user=Administrator&knownNodes=ns_1%40172.23.106.46%2Cns_1%40172.23.106.45
[2014-07-24 20:03:55,547] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 20:03:55,569] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.48&user=Administrator&knownNodes=ns_1%40172.23.106.47%2Cns_1%40172.23.106.48
[2014-07-24 20:03:55,578] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 20:03:55,584] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 20:03:55,592] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 20:03:59,629] - [pauseResumeXDCR:111] INFO - ##### Resume xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 20:03:59,665] - [rest_client:1757] INFO - Updated pauseRequested=false on bucket'default' on 172.23.106.45
[2014-07-24 20:03:59,799] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 1010
[2014-07-24 20:03:59,800] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 20:04:00,803] - [task:523] INFO - Saw replication_active_vbreps 0 >= 0 expected on '172.23.106.45:8091',default bucket
[2014-07-24 20:04:01,019] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 1082
[2014-07-24 20:04:01,020] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 20:04:01,877] - [task:523] INFO - Saw xdc_ops 4981 >= 0 expected on '172.23.106.47:8091',default bucket
[2014-07-24 20:04:01,888] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 20:04:05,894] - [rest_client:1216] INFO - rebalance percentage : 10.7421875 %
[2014-07-24 20:04:05,905] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:15,927] - [rest_client:1216] INFO - rebalance percentage : 19.53125 %
[2014-07-24 20:04:15,937] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:25,956] - [rest_client:1216] INFO - rebalance percentage : 26.7578125 %
[2014-07-24 20:04:25,964] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:35,995] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 20:04:36,007] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:46,030] - [rest_client:1216] INFO - rebalance percentage : 50.9114583333 %
[2014-07-24 20:04:46,037] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:56,060] - [rest_client:1216] INFO - rebalance percentage : 59.7005208333 %
[2014-07-24 20:04:56,068] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:05:06,087] - [rest_client:1216] INFO - rebalance percentage : 99.9348958333 %
[2014-07-24 20:05:06,096] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Same symptoms as MB-11809:

     {<0.4446.17>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,3}},
       {backtrace,[<<"Program counter: 0x00007fdb6c22ffa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007fdb1022d3a8 Return addr 0x00007fdb689ced78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.12.179156>">>,<<"y(1) infinity">>,
                   <<"y(2) {dcp_takeover,'ns_1@172.23.106.48',955}">>,
                   <<"y(3) '$gen_call'">>,<<"y(4) <0.147.17>">>,
                   <<"y(5) []">>,<<>>,
                   <<"0x00007fdb1022d3e0 Return addr 0x00007fdb1b1ed020 (janitor_agent:'-spawn_rebalance_subprocess/3-fun-0-'/3 + 200)">>,
                   <<"y(0) infinity">>,
                   <<"y(1) {dcp_takeover,'ns_1@172.23.106.48',955}">>,
                   <<"y(2) 'replication_manager-default'">>,
                   <<"y(3) Catch 0x00007fdb689ced78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007fdb1022d408 Return addr 0x00007fdb6c2338a0 (proc_lib:init_p/3 + 688)">>,
                   <<"y(0) <0.160.17>">>,<<>>,
                   <<"0x00007fdb1022d418 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) []">>,
                   <<"y(1) Catch 0x00007fdb6c2338c0 (proc_lib:init_p/3 + 720)">>,
                   <<"y(2) []">>,<<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,0}]},
       {heap_size,233},
       {total_heap_size,233},
       {links,[<0.160.17>,<0.186.17>]},
       {memory,2816},
       {message_queue_len,0},
       {reductions,29},
       {trap_exit,false}]}
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Aruna, consider pausing xdcr. It is likely unrelated to xdcr, given the MB-11809 reference above.
Comment by Aruna Piravi [ 25/Jul/14 ]
I paused xdcr last night. There has been no progress on the rebalance yet. Does that rule out xdcr completely?
Comment by Aruna Piravi [ 25/Jul/14 ]
Raising as a test blocker. ~10 tests failed due to this rebalance hang problem. Feel free to close if this is found to be a duplicate of MB-11809.
Comment by Mike Wiederhold [ 25/Jul/14 ]
Duplicate of MB-11809




[MB-11440] {XDCR SSL UPR}: Possible regression in replication rate compared to 2.5.1 Created: 17/Jun/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sangharsh Agarwal Assignee: Aleksey Kondratenko
Resolution: Done Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-814
XDCR -> UPR

Attachments: Zip Archive 10.3.4.186-6232014-1614-diag.zip     Zip Archive 10.3.4.187-6232014-1615-diag.zip     Zip Archive 10.3.4.188-6232014-1616-diag.zip     Zip Archive 10.3.4.189-6232014-1618-diag.zip     File revIDs.rtf     Text File revID_xmem.txt     Text File xmem2_revIDs.txt    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/049a07cd/10.1.3.93-diag.txt.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/5d80dcb2/10.1.3.93-6162014-1236-diag.zip
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/c864813f/10.1.3.93-6162014-1228-couch.tar.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/6766628f/10.1.3.94-6162014-1237-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8a3cd5c2/10.1.3.94-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/c5818561/10.1.3.94-6162014-1228-couch.tar.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/2e4bb369/10.1.3.95-6162014-1238-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/589f740c/10.1.3.95-diag.txt.gz

10.1.3.95 was failed over during the test.


[Destination]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/66037d6b/10.1.3.96-diag.txt.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/74f90c6a/10.1.3.96-6162014-1228-couch.tar.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/89d810de/10.1.3.96-6162014-1239-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8260baf2/10.1.3.97-6162014-1240-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8d1da3e3/10.1.3.97-6162014-1229-couch.tar.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/b59b07fc/10.1.3.97-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/1bf11bfc/10.1.3.99-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/2e4492f9/10.1.3.99-6162014-1242-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/b9febae4/10.1.3.99-6162014-1229-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/4beb391b/10.1.2.12-6162014-1229-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/630b4a6b/10.1.2.12-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/7642751c/10.1.2.12-6162014-1241-diag.zip
Is this a Regression?: Yes

 Description   
http://qa.hq.northscale.net/job/centos_x64--01_02--XDCR_SSL-P0/10/consoleFull

[Test]
./testrunner -i centos_x64--01_01--uniXDCR_biXDCR-P0.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,demand_encryption=1 -t xdcr.biXDCR.bidirectional.load_with_failover,replicas=1,items=10000,ctopology=chain,rdirection=bidirection,sasl_buckets=2,default_bucket=False,doc-ops=create-update-delete,doc-ops-dest=create-update,failover=source,timeout=180,GROUP=P1


[Test Error]
[2014-06-16 12:25:05,339] - [task:443] INFO - Saw ep_queue_size 0 == 0 expected on '10.1.3.99:8091',sasl_bucket_2 bucket
[2014-06-16 12:25:05,383] - [xdcrbasetests:1335] INFO - Waiting for Outbound mutation to be zero on cluster node: 10.1.3.96
[2014-06-16 12:25:05,555] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 461
[2014-06-16 12:25:05,702] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 519
[2014-06-16 12:25:05,703] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:15,862] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 454
[2014-06-16 12:25:16,003] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 485
[2014-06-16 12:25:16,004] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:26,165] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 433
[2014-06-16 12:25:26,260] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 485
[2014-06-16 12:25:26,260] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:36,408] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 463
[2014-06-16 12:25:36,561] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 491
[2014-06-16 12:25:36,562] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:46,739] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 422
[2014-06-16 12:25:46,884] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 454
[2014-06-16 12:25:46,884] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:57,032] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 415
[2014-06-16 12:25:57,166] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 475
[2014-06-16 12:25:57,167] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:07,316] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 396
[2014-06-16 12:26:07,467] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 458
[2014-06-16 12:26:07,468] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:17,627] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 388
[2014-06-16 12:26:17,765] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 442
[2014-06-16 12:26:17,765] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:27,914] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 354
[2014-06-16 12:26:28,050] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 431
[2014-06-16 12:26:28,050] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:38,198] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 343
[2014-06-16 12:26:38,343] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 432
[2014-06-16 12:26:38,344] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:48,496] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 362
[2014-06-16 12:26:48,643] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 415
[2014-06-16 12:26:48,644] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:58,798] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 320
[2014-06-16 12:26:58,942] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 373
[2014-06-16 12:26:58,942] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:09,110] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 308
[2014-06-16 12:27:09,257] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 354
[2014-06-16 12:27:09,257] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:19,414] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 269
[2014-06-16 12:27:19,568] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 352
[2014-06-16 12:27:19,568] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:29,659] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 255
[2014-06-16 12:27:29,767] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 271
[2014-06-16 12:27:29,768] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:39,936] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 262
[2014-06-16 12:27:40,079] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 271
[2014-06-16 12:27:40,079] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:50,239] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 226
[2014-06-16 12:27:50,386] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 274
[2014-06-16 12:27:50,387] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:28:00,555] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 235
[2014-06-16 12:28:00,740] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 255
[2014-06-16 12:28:00,740] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:28:10,744] - [xdcrbasetests:1354] ERROR - Timeout occurs while waiting for mutations to be replicated
..
..
..
[2014-06-16 12:28:15,721] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo451 =====
[2014-06-16 12:28:15,722] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:1
[2014-06-16 12:28:15,723] - [task:1203] ERROR - cas mismatch: Source cas:16542222942424, Destination cas:16940491943424, Error Count:2
[2014-06-16 12:28:15,724] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542222942424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,725] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16940491943424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,827] - [data_helper:289] INFO - creating direct client 10.1.3.99:11210 sasl_bucket_2
[2014-06-16 12:28:15,828] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2790 =====
[2014-06-16 12:28:15,828] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:3
[2014-06-16 12:28:15,829] - [task:1203] ERROR - cas mismatch: Source cas:16543128831424, Destination cas:16954143372424, Error Count:4
[2014-06-16 12:28:15,829] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543128831424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,830] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16954143372424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,850] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo523 =====
[2014-06-16 12:28:15,851] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:5
[2014-06-16 12:28:15,856] - [task:1203] ERROR - cas mismatch: Source cas:16542229414424, Destination cas:16940869390424, Error Count:6
[2014-06-16 12:28:15,856] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542229414424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,856] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16940869390424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,871] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo1286 =====
[2014-06-16 12:28:15,874] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:7
[2014-06-16 12:28:15,875] - [task:1203] ERROR - cas mismatch: Source cas:16542614464424, Destination cas:16945939925424, Error Count:8
[2014-06-16 12:28:15,875] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542614464424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,876] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16945939925424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,045] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo131 =====
[2014-06-16 12:28:16,045] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:9
[2014-06-16 12:28:16,046] - [task:1203] ERROR - cas mismatch: Source cas:16542224457424, Destination cas:16938972892424, Error Count:10
[2014-06-16 12:28:16,047] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542224457424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,047] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16938972892424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,125] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo933 =====
[2014-06-16 12:28:16,126] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:11
[2014-06-16 12:28:16,126] - [task:1203] ERROR - cas mismatch: Source cas:16542224736424, Destination cas:16943647131424, Error Count:12
[2014-06-16 12:28:16,127] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542224736424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,127] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16943647131424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,132] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2680 =====
[2014-06-16 12:28:16,133] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:13
[2014-06-16 12:28:16,133] - [task:1203] ERROR - cas mismatch: Source cas:16543137791424, Destination cas:16953636997424, Error Count:14
[2014-06-16 12:28:16,134] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543137791424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,135] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16953636997424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,174] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2096 =====
[2014-06-16 12:28:16,175] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:15
[2014-06-16 12:28:16,175] - [task:1203] ERROR - cas mismatch: Source cas:16543128601424, Destination cas:16950500665424, Error Count:16
[2014-06-16 12:28:16,176] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543128601424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,177] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16950500665424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,228] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2022 =====
[2014-06-16 12:28:16,228] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:17
[2014-06-16 12:28:16,229] - [task:1203] ERROR - cas mismatch: Source cas:16543131209424, Destination cas:16949985515424, Error Count:18
[2014-06-16 12:28:16,230] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543131209424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,230] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16949985515424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,240] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2146 =====
[2014-06-16 12:28:16,240] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:19
[2014-06-16 12:28:16,241] - [task:1203] ERROR - cas mismatch: Source cas:16543131719424, Destination cas:16950745984424, Error Count:20
[2014-06-16 12:28:16,241] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543131719424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,242] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16950745984424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,329] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo479 =====
[2014-06-16 12:28:16,330] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:21
[2014-06-16 12:28:16,331] - [task:1203] ERROR - cas mismatch: Source cas:16542264845424, Destination cas:16940633628424, Error Count:22
[2014-06-16 12:28:16,332] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542264845424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,333] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16940633628424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,343] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo700 =====
[2014-06-16 12:28:16,344] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:23
[2014-06-16 12:28:16,344] - [task:1203] ERROR - cas mismatch: Source cas:16542254202424, Destination cas:16941821299424, Error Count:24
[2014-06-16 12:28:16,345] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542254202424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,346] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16941821299424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,352] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo1316 =====
[2014-06-16 12:28:16,352] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:25
[2014-06-16 12:28:16,353] - [task:1203] ERROR - cas mismatch: Source cas:16542621575424, Destination cas:16946095632424, Error Count:26
[2014-06-16 12:28:16,353] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542621575424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,354] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16946095632424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,381] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo603 =====
[2014-06-16 12:28:16,382] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:27
[2014-06-16 12:28:16,382] - [task:1203] ERROR - cas mismatch: Source cas:16542268666424, Destination cas:16941311155424, Error Count:28
[2014-06-16 12:28:16,383] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542268666424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,383] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16941311155424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,501] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2957 =====
[2014-06-16 12:28:16,502] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:29
[2014-06-16 12:28:16,503] - [task:1203] ERROR - cas mismatch: Source cas:16543130010424, Destination cas:16954877533424, Error Count:30
[2014-06-16 12:28:16,504] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543130010424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,505] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16954877533424, 'flags': 0, 'expiration': 0}


[Test Steps]
1. Set up a 3-node source cluster and a 4-node destination cluster.
2. Create buckets sasl_bucket_1 and sasl_bucket_2 on each side.
3. Set up CAPI-mode bi-directional xdcr for each bucket.
4. Load 10000 items into each bucket on each side.
5. Fail over and rebalance one source node.
6. Perform 30% updates and 30% deletes on each bucket on the source cluster.
7. Perform 30% updates on each bucket on the destination cluster.
8. Verify items.
    a) Some items were left on the replication queue.
    b) As a result, not all mutations were replicated.


Created a separate bug for 3.0 XDCR over UPR rather than reusing MB-9707.
Outbound mutations not going to 0 means not all mutations were replicated, which was eventually confirmed by the metadata mismatch between the clusters.
After MB-9707, the test cases were modified to proceed even if replication_changes_left is not 0 after 5 minutes.
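For context on the timeout reported above, a hedged sketch (not the actual xdcrbasetests code) of a wait-for-zero-outbound-mutations loop; the 10-second poll and 5-minute cap mirror the log, while get_changes_left is a hypothetical stub for the stats call.

import time

def wait_for_outbound_mutations_zero(get_changes_left, timeout_secs=300, poll_secs=10):
    # Poll replication_changes_left until it reaches 0 or the timeout expires.
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        if get_changes_left() == 0:
            return True
        time.sleep(poll_secs)
    # Corresponds to "Timeout occurs while waiting for mutations to be replicated".
    return False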





 Comments   
Comment by Aruna Piravi [ 17/Jun/14 ]
Hi Sangharsh, Can you also pls indicate the number of items you find on source and destination so we get an idea what % is not replicated? Thanks.
Comment by Sangharsh Agarwal [ 17/Jun/14 ]
Aruna,

Here bi-directional XDCR takes place between A (the source cluster) and B (the destination cluster). The initial 10000 items were replicated successfully on either side. Mutations (the 30% updates mentioned in Step 7) from B -> A were not replicated completely.

Nodes in Cluster A -> 10.1.3.93 (Master), 10.1.3.94, 10.1.3.95 (Failover node)
Nodes in Cluster B -> 10.1.3.96, 10.1.3.97, 10.1.3.99, 10.1.2.12


In below Logs, Read "Source meta data" as Cluster A and "Dest meta data" as Cluster B.

[2014-06-16 12:28:16,501] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2957 =====
[2014-06-16 12:28:16,502] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:29
[2014-06-16 12:28:16,503] - [task:1203] ERROR - cas mismatch: Source cas:16543130010424, Destination cas:16954877533424, Error Count:30
[2014-06-16 12:28:16,504] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543130010424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,505] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16954877533424, 'flags': 0, 'expiration': 0}


So after step 7 the seqno of key loadTwo2957 becomes 2 on Cluster B but remains 1 on Cluster A.

Note: Item counts are not printed in the test logs, but they can be determined from the data files.



Comment by Aruna Piravi [ 17/Jun/14 ]
We do print the active and replica item counts in the logs -

[2014-06-16 12:28:11,546] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_1
[2014-06-16 12:28:11,641] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_1
[2014-06-16 12:28:11,721] - [task:443] INFO - Saw curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_1 bucket
[2014-06-16 12:28:11,734] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_1
[2014-06-16 12:28:11,821] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_1
[2014-06-16 12:28:11,895] - [task:443] INFO - Saw vb_active_curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_1 bucket
[2014-06-16 12:28:11,909] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_1
[2014-06-16 12:28:11,997] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_1
[2014-06-16 12:28:12,079] - [task:443] INFO - Saw vb_replica_curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_1 bucket
[2014-06-16 12:28:12,096] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_2
[2014-06-16 12:28:12,184] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_2
[2014-06-16 12:28:12,289] - [task:443] INFO - Saw curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_2 bucket
[2014-06-16 12:28:12,304] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_2
[2014-06-16 12:28:12,402] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_2
[2014-06-16 12:28:12,487] - [task:443] INFO - Saw vb_active_curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_2 bucket
[2014-06-16 12:28:12,501] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_2
[2014-06-16 12:28:12,592] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_2
[2014-06-16 12:28:12,672] - [task:443] INFO - Saw vb_replica_curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_2 bucket
[2014-06-16 12:28:12,773] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_1
[2014-06-16 12:28:12,865] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_1
[2014-06-16 12:28:12,974] - [task:1042] INFO - 20000 items will be verified on sasl_bucket_1 bucket
[2014-06-16 12:28:13,041] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_2
[2014-06-16 12:28:13,149] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_2
[2014-06-16 12:28:13,248] - [task:1042] INFO - 20000 items will be verified on sasl_bucket_2 bucket

So this is a case where the source and destination item counts are equal but some updates have not been propagated to the other side, as can be seen from the metadata info.
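For context on the rev_id verification above, a hedged sketch (not the actual testrunner task) of comparing per-key metadata between the clusters: equal item counts can still hide keys whose seqno/cas was never propagated. get_src_meta and get_dest_meta are hypothetical stand-ins for the direct-client metadata calls.

def compare_rev_ids(keys, get_src_meta, get_dest_meta):
    # Each meta is a dict like {'seqno': 1, 'cas': 16542222942424, 'deleted': 0, ...}.
    mismatches = []
    for key in keys:
        src, dest = get_src_meta(key), get_dest_meta(key)
        if src['seqno'] != dest['seqno'] or src['cas'] != dest['cas']:
            # The key exists on both sides, but its latest mutation never replicated.
            mismatches.append((key, src, dest))
    return mismatches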
Comment by Aleksey Kondratenko [ 17/Jun/14 ]
Have you tried the same thing but without SSL? Have you tried something simpler?
Comment by Sangharsh Agarwal [ 17/Jun/14 ]
Alk, the test passes without SSL. It is a regular test.
Comment by Aruna Piravi [ 19/Jun/14 ]
Sangharsh is right; I wasn't able to reproduce the problem with non-encrypted xdcr. However, I am seeing it with encrypted xdcr.

Attaching logs with master trace enabled.
Comment by Aruna Piravi [ 19/Jun/14 ]
.186, .187 = C1
.188, .189 = C2
Comment by Aleksey Kondratenko [ 20/Jun/14 ]
Looking just at /diag I see multiple memcached crashes. Do I understand correctly that you already filed Blocker bugs for every occurrence? If not then please make it your policy because memcached crashes in production are not acceptable.
Comment by Aleksey Kondratenko [ 20/Jun/14 ]
Another thing regarding the latest set of logs: especially if something is stuck (rebalance, replication, views, or anything else), I need collectinfos captured _during_ the test, not after cleanup. In this case they appear to have been captured after cleanup.
Comment by Aleksey Kondratenko [ 20/Jun/14 ]
See above. I need logs captured during replication that's not catching up.
Comment by Aruna Piravi [ 23/Jun/14 ]
Attaching logs captured when updates are not replicated.
Comment by Aruna Piravi [ 23/Jun/14 ]
Will check the memc crashes.
Comment by Aruna Piravi [ 23/Jun/14 ]
Seen with both capi and xmem protocols.
Comment by Aruna Piravi [ 23/Jun/14 ]
Hopefully this set of logs contain all the info you need. Thanks for your patience.
Comment by Aleksey Kondratenko [ 23/Jun/14 ]
These collectinfos don't contain trace logs.
Comment by Aruna Piravi [ 23/Jun/14 ]
They are present as 'ns_server.xdcr_trace.log' in all zip files.
Comment by Aleksey Kondratenko [ 23/Jun/14 ]
May I have the same test but with xmem instead of capi? Also, may I have the data files too, just in case?
Comment by Aleksey Kondratenko [ 23/Jun/14 ]
I see some evidence in the xdcr_trace logs that recent versions of the missing items _are_ being sent to the other side. The same evidence with xmem would exclude the capi layer, and then I'll be able to pass it to the ep-engine guys.
Comment by Aruna Piravi [ 23/Jun/14 ]
xmem cbcollect and data files - https://s3.amazonaws.com/bugdb/jira/MB-11440/xmem.tar
Comment by Aruna Piravi [ 23/Jun/14 ]
Attaching revIDs_xmem.txt = keys that are missing updates
Comment by Aleksey Kondratenko [ 23/Jun/14 ]
Please do another xmem run with a build that has http://review.couchbase.org/38728 (which I merged a few moments ago).
Comment by Sangharsh Agarwal [ 24/Jun/14 ]
Alk, Please find the logs after your merge on build 3.0.0-865:

[Test Logs]
https://friendpaste.com/5bHM8T5WTALfLptXwOyMAd

[Source]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/2a3ac6d9/10.1.3.93-6242014-238-couch.tar.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8105eeb6/10.1.3.93-6242014-242-diag.zip
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/daf50fef/10.1.3.93-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/4031c179/10.1.3.94-6242014-238-couch.tar.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/797b1631/10.1.3.94-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/e5b654de/10.1.3.94-6242014-243-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/3be890cf/10.1.3.95-diag.txt.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/9ce3aedb/10.1.3.95-6242014-244-diag.zip

[Destination]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/48bb6a6f/10.1.3.96-diag.txt.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/6382ff46/10.1.3.96-6242014-245-diag.zip
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/f1fec989/10.1.3.96-6242014-239-couch.tar.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/10ca436d/10.1.3.97-6242014-244-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/21438682/10.1.3.97-6242014-239-couch.tar.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/a5e2d504/10.1.3.97-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/01235b22/10.1.3.99-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/57abd92c/10.1.3.99-6242014-246-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/f79db0e4/10.1.3.99-6242014-239-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/2eba987e/10.1.2.12-6242014-247-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/b9dbd837/10.1.2.12-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/f9f8a5e4/10.1.2.12-6242014-239-couch.tar.gz
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
Sangharsh, please be aware that next time I'll bounce the ticket back to you if you don't start capturing collectinfos _during_ xdcr and _not after_ you've cleaned everything up.
Comment by Aruna Piravi [ 24/Jun/14 ]
Thanks, Sangharsh. Alk, please let us know if you need anything else. Thanks.
Comment by Aruna Piravi [ 24/Jun/14 ]
Ok, will get you new set of logs.
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
It is possible that we're dealing with two distinct bugs here. In particular, in the logs from Sangharsh I'm not seeing the second revision of loadTwo1906 even being considered for replication, which might be possible, for example, if some upr bit gets broken. But in yesterday's logs from Aruna I saw the expected revisions replicated.

So:

1) Aruna, please rerun in your environment and give me new logs please.

2) Sangharsh, please rerun and give me logs that are captured _before_ you clean up everything.
Comment by Aruna Piravi [ 24/Jun/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11440/xmem2.tar
Comment by Aruna Piravi [ 24/Jun/14 ]
xmem2_revIDs.txt attached
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
Seeing evidence that we do send those newer revisions but they are refused by ep-engine's conflict resolution:

I.e:

2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] ===== Verifying rev_ids failed for key: loadTwo1614 =====
2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:97
2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] cas mismatch: Source cas:7329801498721320, Destination cas:7329957630538902, Error Count:98
2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 7329801498721320, 'flags': 0, 'expiration': 0}
2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 7329957630538902, 'flags': 0, 'expiration': 0}


and corresponding traces:

./cbcollect_info_ns_1@10.3.4.186_20140624-190117/ns_server.xdcr_trace.log:399748:{"pid":"<0.32475.4>","type":"missing","ts":1403636156.392065,"startTS":1403636156.392047,"k":[["loadTwo2347",19,2],["loadTwo1784",20,2],["loadTwo1614",21,2],["loadTwo1166",22,2],["loadTwo89",23,2]],"loc":"xdc_vbucket_rep_worker:find_missing:238"}
./cbcollect_info_ns_1@10.3.4.186_20140624-190117/ns_server.xdcr_trace.log:399766:{"pid":"<0.32475.4>","type":"xmemSetMetas","ts":1403636156.39351,"ids":["loadTwo89","loadTwo1166","loadTwo1614","loadTwo1784","loadTwo2347"],"statuses":["key_eexists","key_eexists","key_eexists","key_eexists","key_eexists"],"startTS":1403636156.392241,"loc":"xdc_vbucket_rep_xmem:flush_docs:60"}

I'll wait for Sangharsh logs to see if he's facing same problem or not and then I'll pass this to ep-engine team.
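
For context, here is a minimal sketch (in Python, for illustration only) of the revision-comparison logic that produces the key_eexists responses above. The real conflict resolution lives in ep-engine; the field order used here (seqno first, then cas, then expiration and flags) is an assumption based on the metadata fields printed by task._check_key_revId.

# Illustrative sketch only -- not the actual ep-engine implementation.
def incoming_revision_wins(incoming, stored):
    """Compare two metadata dicts like the ones logged by task._check_key_revId."""
    for field in ("seqno", "cas", "expiration", "flags"):
        if incoming[field] != stored[field]:
            return incoming[field] > stored[field]
    return False  # identical metadata: keep the stored revision

source = {"deleted": 0, "seqno": 1, "cas": 7329801498721320, "flags": 0, "expiration": 0}
dest = {"deleted": 0, "seqno": 2, "cas": 7329957630538902, "flags": 0, "expiration": 0}

# The destination already holds seqno 2, so a replicated seqno-1 mutation loses
# the comparison and the set-with-meta is refused with key_eexists.
print(incoming_revision_wins(source, dest))  # False

The trace above shows all five keys coming back as key_eexists, which is consistent with the destination already holding these revisions.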
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
Actually, we can see that an earlier replication sends this document just fine.

Also, couch_dbdump of the corresponding vbuckets reveals that both sides _have_ revision 2.

Please double-check your testing code.
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
See above
Comment by Aruna Piravi [ 24/Jun/14 ]
Indeed, I see the same from views. This is a test code issue. I will investigate. Thanks for your time.

[root@centos-64-x64 tmp]# diff <(curl http://Administrator:password@10.3.4.186:8092/sasl_bucket_1/_design/dev_doc/_view/sasl1?full_set=true&inclusive_end=false&stale=false&connection_timeout=60000&limit=1000000&skip=0) <(curl http://Administrator:password@10.3.4.188:8092/sasl_bucket_1/_design/dev_doc/_view/sasl1?stale=false&inclusive_end=false&connection_timeout=60000&limit=100000&skip=0 )
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
  0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 1456k 0 1456k 0 0 2169k 0 --:--:-- --:--:-- --:--:-- 2170k
100 1456k 0 1456k 0 0 1070k 0 --:--:-- 0:00:01 --:--:-- 1072k
[root@centos-64-x64 tmp]#
Comment by Anil Kumar [ 17/Jul/14 ]
Test code issue.

Triage - Alk, Aruna, Anil, Wayne .. July 17th
Comment by Sangharsh Agarwal [ 21/Jul/14 ]
Re-running the test again to understand this problem, as it never appears with non-SSL XDCR but keeps appearing with the SSL XDCR failover case.
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
As this issue keeps coming up on recent builds, i.e. 973, I did some more investigation. Before making any changes to the test I would like to share some information on this bug, hoping it may help further investigation.

Aruna/Alk,
    Aruna, your previous observation (last comment) was correct. Eventually the updates are replicated successfully; that's why you are getting the right information from views. But the test never reads data from failed-over/rebalanced-out nodes, so it's difficult to justify the point that the test read data from the wrong server and got obsolete metadata information:

[Points to Highlight]

The problem always appears with SSL XDCR only. The same test always passes with non-SSL XDCR.

[Test Conditions]
It is found that the test is always reproducible if the failover side has 3 nodes and the other side has 4 nodes, i.e. after failover+rebalance there are 2 nodes left. For example, I tried this test with a 4-node cluster and the test passed.
Analysis of the test found that updates are replicated to the other side very slowly, which is what caused this issue.

[Test Steps]
1. Have 3 nodes Source cluster (S) , 4 nodes Destination cluster (D).
2. Create two buckets sasl_bucket_1, sasl_bucket_2.
3. Setup SSL Bi-directional XDCR (CAPI) for both buckets.
4. Load 10000 items into each bucket on Source; keys with prefix "loadOne".
5. Load 10000 items into each bucket on Source; keys with prefix "loadTwo".
6. Failover+rebalance one node at the Source cluster.
7. Perform updates (3000 items) and deletes (3000 items) on Source; keys with prefix "loadOne".
8. Perform updates (3000 items) on Destination; keys with prefix "loadTwo".
9. The test fails with a data mismatch error between the data on Source (S) and Destination (D). Keys from Destination (D, i.e. the non-failover side), i.e. "loadTwo", were not replicated by the time validation took place.


[Additional information]
1. Tests with a smaller number of items/updates pass successfully.
2. The test with a single bucket passes with the above-mentioned items/mutations.

[Workaround]
Add an additional 90-second sleep before verifying data, or increase the timeout to 5 minutes (from 3 minutes) when waiting for outbound mutations to reach zero; this ensures that all data is replicated from either side in bi-directional replication. But XDCR with UPR should be even faster than the previous XDCR. This test always passed with 2.5.1.
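
As an illustration of this workaround, a minimal sketch of waiting for the outbound mutation count to drain instead of using a fixed sleep; the REST endpoint and the replication_changes_left stat name are assumptions about the stats API, and the URLs/credentials are placeholders.

# Sketch only: poll the bucket stats until outbound XDCR mutations reach zero
# or the timeout expires.
import time
import requests  # assumed to be available in the test environment

def wait_for_outbound_mutations(rest_url, bucket, auth, timeout_secs=300, poll_secs=5):
    stats_url = "%s/pools/default/buckets/%s/stats" % (rest_url, bucket)
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        samples = requests.get(stats_url, auth=auth).json()["op"]["samples"]
        # "replication_changes_left" is assumed to be the outbound-mutations stat
        if samples.get("replication_changes_left", [0])[-1] == 0:
            return True
        time.sleep(poll_secs)
    return False

# e.g. wait up to 5 minutes on one cluster before verification:
# wait_for_outbound_mutations("http://10.1.3.93:8091", "sasl_bucket_1",
#                             ("Administrator", "password"))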
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
Cluster is live for debugging:

[Source]
10.1.3.96
10.1.3.97
10.1.2.12

10.1.3.97 node were failed-over.

[Destination]
10.1.3.93
10.1.3.94
10.1.3.95
10.1.3.99

[Test Logs]
https://s3.amazonaws.com/bugdb/jira/MB-11440/05bd3d96/test.log
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
Aruna,
    If analysis looks Ok to you, please assign to Dev for further investigation.
Comment by Aruna Piravi [ 22/Jul/14 ]
Hi Sangharsh,

I see some important info in the "workaround" section. You are raising a valid point. So if you waited 90s more before verification and all items are correct, we are still replicating when we are expected to be done replicating. And if this did not occur in 2.5.1 on the same VMs (we could compare mutation replication rates), we probably have a performance regression with encrypted xdcr in 3.0.

cc'ing Pavel for his input.

Thanks,
Aruna
Comment by Pavel Paulau [ 23/Jul/14 ]
My input:

If you want to report this issue as a performance regression then please make sure that all related functional bugs are resolved.
Also you need a reliable (and ideally simple) way to show the slowness, e.g. a set of results for 2.5.1 and a set of results for 3.0.0, using a consistent and reasonable environment.

We had many similar reports before the 2.5.x releases. It's super critical to minimize the level of noise.
Comment by Sangharsh Agarwal [ 23/Jul/14 ]
Logs are copied:

[Source]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/0ab776fe/10.1.3.96-7232014-345-diag.zip
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/cb30c533/10.1.3.96-8091-diag.txt.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/d0c15a08/10.1.3.96-7232014-332-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/90c3aa7c/10.1.2.12-8091-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/9eca3909/10.1.2.12-7232014-333-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/d4b275f5/10.1.2.12-7232014-351-diag.zip

10.1.3.97 (Failover Node): https://s3.amazonaws.com/bugdb/jira/MB-11440/9b5f28ad/10.1.3.97-7232014-344-diag.zip
10.1.3.97 (Failover Node) : https://s3.amazonaws.com/bugdb/jira/MB-11440/b0833ce4/10.1.3.97-7232014-332-couch.tar.gz
10.1.3.97 (Failover Node) : https://s3.amazonaws.com/bugdb/jira/MB-11440/e2e3f28f/10.1.3.97-8091-diag.txt.gz

[Destination]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/49b4dd7d/10.1.3.93-8091-diag.txt.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/e82f1a33/10.1.3.93-7232014-332-diag.zip
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/ef58cc41/10.1.3.93-7232014-331-couch.tar.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/46f07617/10.1.3.94-8091-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/c4462a8e/10.1.3.94-7232014-340-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/f478617e/10.1.3.94-7232014-332-couch.tar.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/33dfe115/10.1.3.95-7232014-336-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8a616c90/10.1.3.95-8091-diag.txt.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/d91fe96f/10.1.3.95-7232014-331-couch.tar.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/816ae8c0/10.1.3.99-8091-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/ac259ae9/10.1.3.99-7232014-332-couch.tar.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/ef17b4ec/10.1.3.99-7232014-348-diag.zip

Comment by Sangharsh Agarwal [ 23/Jul/14 ]
Aruna, There are some logs found which could be useful to analyze this issue:

[Test Logs]
https://s3.amazonaws.com/bugdb/jira/MB-11440/05bd3d96/test.log

[Failover Period]

2014-07-22 05:07:38 | INFO | MainProcess | Cluster_Thread | [task._failover_nodes] Failing over 10.1.3.97:8091
2014-07-22 05:07:40 | INFO | MainProcess | Cluster_Thread | [rest_client.fail_over] fail_over node ns_1@10.1.3.97 successful
2014-07-22 05:07:40 | INFO | MainProcess | Cluster_Thread | [task.execute] 0 seconds sleep after failover, for nodes to go pending....
2014-07-22 05:07:40 | INFO | MainProcess | test_thread | [biXDCR.load_with_failover] Failing over Source Non-Master Node 10.1.3.97:8091
2014-07-22 05:07:41 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.97&user=Administrator&knownNodes=ns_1%4010.1.3.97%2Cns_1%4010.1.3.96%2Cns_1%4010.1.2.12
2014-07-22 05:07:41 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance operation started
2014-07-22 05:07:41 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 0 %
2014-07-22 05:07:51 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 18.4769405082 %
2014-07-22 05:08:01 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 46.11833859 %
2014-07-22 05:08:11 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 71.3362916906 %
2014-07-22 05:08:21 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 99.121522694 %
2014-07-22 05:08:32 | INFO | MainProcess | Cluster_Thread | [task.check] rebalancing was completed with progress: 100% in 50.1381921768 sec

According to the logs on 10.1.3.96, failover was completed at 5:08:26.


I checked the xdcr finish time on the destination cluster:

Node         XDCR finish time (last timestamp in xdcr.log)
---------    ---------------------------------------------
10.1.3.93    5:11:26.898   on time as per data load
10.1.3.94    5:11:26.703   on time as per data load
10.1.3.95    5:13:58.970   delayed
10.1.3.99    5:17:04.459   replication finished last on this node


I can see many errors in the xdcr.log file on 10.1.3.99 which show checkpoint commit failures, as it was still retrying to commit on the failed-over node, i.e. 10.1.3.97.

[xdcr:debug,2014-07-22T5:10:03.573,ns_1@10.1.3.99:<0.3265.0>:xdc_vbucket_rep_ckpt:do_send_retriable_http_request:215]
Got http error doing POST to https://Administrator:password@10.1.3.97:18092/_commit_for_checkpoint. Will retry. Error:
{{tls_alert,"unknown ca"},
 [{lhttpc_client,send_request,1,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,199}]},
  {lhttpc_client,execute,9,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,151}]},
  {lhttpc_client,request,9,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,83}]}]}

..
..

[xdcr:error,2014-07-22T5:16:55.896,ns_1@10.1.3.99:<0.4040.0>:xdc_vbucket_rep_ckpt:send_post:197]
Checkpointing related POST to https://Administrator:password@10.1.3.97:18092/_commit_for_checkpoint failed:
{{tls_alert,"unknown ca"},
 [{lhttpc_client,send_request,1,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,199}]},
  {lhttpc_client,execute,9,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,151}]},
  {lhttpc_client,request,9,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,83}]}]}


10.1.3.99 first received this error at 5:10:03 but was still trying to commit a checkpoint on the failed-over node, even though the node failover had completed well before that time.
Comment by Sangharsh Agarwal [ 23/Jul/14 ]
Clusters are still live if you need them for investigation.
Comment by Aruna Piravi [ 23/Jul/14 ]
Sangharsh, 404 errors are ok and are expected. These are seen once per vbucket. So for every vbucket that receives a mutation, the source will try to reach the dest node which is now failed over. It will get this error and only then retry with the new IP. Different vbuckets may receive mutations at different times, so there's no strict time window for 404 errors due to a changed dest node.

We don't need logs at this point. What may help is a comparison with and without SSL against 2.5.1 and 3.0 on the same VMs. We don't have the best hardware, but this is the little we can do to see if this is indeed a performance problem.

Thanks!
Comment by Sangharsh Agarwal [ 23/Jul/14 ]
If you look at the test, there were only 3000 updates on the cluster; do you think it should take 7 minutes to get information about those mutations? In addition, why is the node retrying to communicate with the failed-over node for the next 7 minutes, while the master node, i.e. 10.1.3.96, corrected the CAPI requests to the remaining nodes within 2 minutes?

Along with the comparison between 2.5.1 and 3.0, please also compare the logs between non-SSL and SSL XDCR for this same test and same build.
Comment by Aruna Piravi [ 25/Jul/14 ]
Alk,

Can you pls look at the logs @ https://s3.amazonaws.com/bugdb/jira/MB-11440/05bd3d96/test.log

Scenario-

1. 3 * 4 bi-xdcr (ssl); the 3-node cluster is C1, the 4-node cluster is C2.
2. Failover one node from C1.
3. C2 takes longer (7 minutes) to send all mutations to C1. However, in the same test with non-ssl xdcr, C2 takes only 2 minutes to complete replication.

So the question here is why we are seeing this difference. Do you see anything notable on 10.1.3.99 that's causing the delay? Also, Sangharsh says the problem is specific to this setup (3*4).

Node         XDCR finish time (last timestamp in xdcr.log)
---------    ---------------------------------------------
10.1.3.93    5:11:26.898   on time as per data load
10.1.3.94    5:11:26.703   on time as per data load
10.1.3.95    5:13:58.970   delayed
10.1.3.99    5:17:04.459   replication finished last on this node


Thanks,
Aruna
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
1) test.log doesn't tell me anything

2) _please_ pleeeese do not re-purpose tickets like this. Originally this was a data loss bug and now suddenly you change it massively.

3) If you have reason to believe that 3.0 with ssl is much slower than 2.5.1 with ssl, I need evidence _and_ a fresh ticket.
Comment by Aruna Piravi [ 25/Jul/14 ]
Sangharsh,

1. Let's close this bug.
2. Open another one for ssl vs non-ssl in 2.5.1 and 3.0 (on the same VMs) as discussed earlier, which would still not be "reliable enough" for performance benchmarking/testing, OR simply leave it for Pavel to test on his physical machines. He does testing on ssl vs non-ssl anyhow. Showfast does not show any regression between ssl and non-ssl.
3. Please just increase the timeout in the test for now to allow replication to continue to completion.

Thanks




[MB-11825] Rebalance may fail if cluster_compat_mode:is_node_compatible times out waiting for ns_doctor:get_node Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: customer, rebalance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Is this a Regression?: No

 Description   
Saw this in CBSE-1301:

 <0.2025.3344> exited with {{function_clause,
                            [{new_ns_replicas_builder,handle_info,
                              [{#Ref<0.0.4447.107509>,
                                [stale,
                                 {last_heard,{1406,315410,842219}},
                                 {now,{1406,315410,210848}},
                                 {active_buckets,
                                  ["user_reg","sentence","telemetry","common",
                                   "notifications","playlists","users"]},
                                 {ready_buckets,

which caused rebalance to fail.

The reason is that new_ns_replicas_builder doesn't have the catch-all handle_info that's typical for gen_servers. This message occurs because of the following call chain:

* new_ns_replicas_builder:init/1

* ns_replicas_builder_utils:spawn_replica_builder/5

* ebucketmigrator_srv:build_args

* cluster_compat_mode:is_node_compatible

* ns_doctor:get_node

ns_doctor:get_node handles the timeout and returns an empty list. So if this happens, the actual reply may be delivered later and end up being handled by handle_info, which in this case is unable to handle it.

3.0 is mostly immune to this particular chain of calls due to this optimization:

commit 70badff90b03176b357cac4d03e40acc62f4861b
Author: Aliaksey Kandratsenka <alk@tut.by>
Date: Tue Oct 1 11:44:02 2013 -0700

    MB-9096: optimized is_node_compatible when cluster is compatible
    
    There's no need to check for particular node's compatibility with
    certain feature if entire cluster's mode is new enough.
    
    Change-Id: I9573e6b2049cb00d2adad709ba41ec5285d66a6b
    Reviewed-on: http://review.couchbase.org/29317
    Tested-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
    Reviewed-by: Artem Stemkovski <artem@couchbase.com>


 Comments   
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
http://review.couchbase.org/39908




[MB-11805] KV+ XDCR System test: Missing items in bi-xdcr only Created: 23/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build
-------
3.0.0-998

Clusters
-----------
C1 : http://172.23.105.44:8091/
C2 : http://172.23.105.54:8091/
Free for investigation. Not attaching data files.

Steps
--------
1a. Load on both clusters till vb_active_resident_items_ratio < 50.
1b. Setup bi-xdcr on "standardbucket", uni-xdcr on "standardbucket1"
2. Access phase with 50% gets, 50%deletes for 3 hrs
3. Rebalance-out 1 node at cluster1
4. Rebalance-in 1 node at cluster1
5. Failover and remove node at cluster1
6. Failover and add-back node at cluster1
7. Rebalance-out 1 node at cluster2
8. Rebalance-in 1 node at cluster2
9. Failover and remove node at cluster2
10. Failover and add-back node at cluster2
11. Soft restart all nodes in cluster1 one by one
Verify item count (see the sketch below).
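
A minimal sketch of the item-count verification (in Python, for illustration only); it assumes the standard per-bucket REST endpoint /pools/default/buckets/<bucket> and its basicStats.itemCount field, with cluster addresses taken from the description and credentials as placeholders.

# Sketch only: compare per-bucket item counts across the two clusters.
import requests  # assumed to be available

def item_count(node, bucket, auth=("Administrator", "password")):
    url = "http://%s:8091/pools/default/buckets/%s" % (node, bucket)
    return requests.get(url, auth=auth).json()["basicStats"]["itemCount"]

for bucket in ("standardbucket", "standardbucket1"):
    c1 = item_count("172.23.105.44", bucket)
    c2 = item_count("172.23.105.54", bucket)
    delta = abs(c1 - c2)
    print("%s: C1=%d C2=%d %s" % (bucket, c1, c2,
                                  "OK" if delta == 0 else "MISMATCH (%d)" % delta))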

Problem
-------------
standardbucket(C1) <---> standardbucket(C2)
On C1 - 57890744 items
On C2 - 57957032 items
standardbucket1(C1) ----> standardbucket1(C2)
On C1 - 14053020 items
On C2 - 14053020 items

Total number of missing items : 66,288

Bucket priority
-----------------------
Both standardbucket and standardbucket1 have high priority.


Attached
-------------
cbcollect and list of keys that are missing on vb0


Missing keys
-------------------
At least 50-60 keys are missing in every vbucket. Attaching all missing keys from vb0.

vb0
-------
{'C1_node:': u'172.23.105.44',
'vb': 0,
'C2_node': u'172.23.105.54',
'C1_key_count': 78831,
 'C2_key_count': 78929,
 'missing_keys': 98}

     id: 06FA8A8B-11_110 deleted, tombstone exists
     id: 06FA8A8B-11_1354 present, report a bug!
     id: 06FA8A8B-11_1426 present, report a bug!
     id: 06FA8A8B-11_2175 present, report a bug!
     id: 06FA8A8B-11_2607 present, report a bug!
     id: 06FA8A8B-11_2797 present, report a bug!
     id: 06FA8A8B-11_3871 deleted, tombstone exists
     id: 06FA8A8B-11_4245 deleted, tombstone exists
     id: 06FA8A8B-11_4537 present, report a bug!
     id: 06FA8A8B-11_662 deleted, tombstone exists
     id: 06FA8A8B-11_6960 present, report a bug!
     id: 06FA8A8B-11_7064 present, report a bug!
     id: 3600C830-80_1298 present, report a bug!
     id: 3600C830-80_1308 present, report a bug!
     id: 3600C830-80_2129 present, report a bug!
     id: 3600C830-80_4219 deleted, tombstone exists
     id: 3600C830-80_4389 deleted, tombstone exists
     id: 3600C830-80_7038 present, report a bug!
     id: 3FEF1B93-91_2890 present, report a bug!
     id: 3FEF1B93-91_2900 present, report a bug!
     id: 3FEF1B93-91_3004 present, report a bug!
     id: 3FEF1B93-91_3194 present, report a bug!
     id: 3FEF1B93-91_3776 deleted, tombstone exists
     id: 3FEF1B93-91_753 present, report a bug!
     id: 52D6D916-120_1837 present, report a bug!
     id: 52D6D916-120_3282 present, report a bug!
     id: 52D6D916-120_3312 present, report a bug!
     id: 52D6D916-120_3460 present, report a bug!
     id: 52D6D916-120_376 deleted, tombstone exists
     id: 52D6D916-120_404 deleted, tombstone exists
     id: 52D6D916-120_4926 present, report a bug!
     id: 52D6D916-120_5022 present, report a bug!
     id: 52D6D916-120_5750 present, report a bug!
     id: 52D6D916-120_594 deleted, tombstone exists
     id: 52D6D916-120_6203 present, report a bug!
     id: 5C12B75A-142_2889 present, report a bug!
     id: 5C12B75A-142_2919 present, report a bug!
     id: 5C12B75A-142_569 deleted, tombstone exists
     id: 73C89FDB-102_1013 present, report a bug!
     id: 73C89FDB-102_1183 present, report a bug!
     id: 73C89FDB-102_1761 present, report a bug!
     id: 73C89FDB-102_2232 present, report a bug!
     id: 73C89FDB-102_2540 present, report a bug!
     id: 73C89FDB-102_4092 deleted, tombstone exists
     id: 73C89FDB-102_4102 deleted, tombstone exists
     id: 73C89FDB-102_668 deleted, tombstone exists
     id: 87B03DB1-62_3369 present, report a bug!
     id: 8DA39D2B-131_1949 present, report a bug!
     id: 8DA39D2B-131_725 deleted, tombstone exists
     id: A2CC835C-00_2926 present, report a bug!
     id: A2CC835C-00_3022 present, report a bug!
     id: A2CC835C-00_3750 present, report a bug!
     id: A2CC835C-00_5282 present, report a bug!
     id: A2CC835C-00_5312 present, report a bug!
     id: A2CC835C-00_5460 present, report a bug!
     id: A2CC835C-00_6133 present, report a bug!
     id: A2CC835C-00_6641 present, report a bug!
     id: A5C9F867-33_1091 present, report a bug!
     id: A5C9F867-33_1101 present, report a bug!
     id: A5C9F867-33_1673 present, report a bug!
     id: A5C9F867-33_2320 present, report a bug!
     id: A5C9F867-33_2452 present, report a bug!
     id: A5C9F867-33_4010 deleted, tombstone exists
     id: A5C9F867-33_4180 deleted, tombstone exists
     id: CD7B0436-153_3638 present, report a bug!
     id: CD7B0436-153_828 present, report a bug!
     id: D94DA3B2-51_829 present, report a bug!
     id: DE161E9D-40_1235 present, report a bug!
     id: DE161E9D-40_1547 present, report a bug!
     id: DE161E9D-40_2014 present, report a bug!
     id: DE161E9D-40_2184 present, report a bug!
     id: DE161E9D-40_2766 present, report a bug!
     id: DE161E9D-40_3880 deleted, tombstone exists
     id: DE161E9D-40_3910 deleted, tombstone exists
     id: DE161E9D-40_4324 deleted, tombstone exists
     id: DE161E9D-40_4456 deleted, tombstone exists
     id: DE161E9D-40_6801 present, report a bug!
     id: DE161E9D-40_6991 present, report a bug!
     id: DE161E9D-40_7095 present, report a bug!
     id: DE161E9D-40_7105 present, report a bug!
     id: DE161E9D-40_940 present, report a bug!
     id: E9F46ECC-22_173 deleted, tombstone exists
     id: E9F46ECC-22_2883 present, report a bug!
     id: E9F46ECC-22_2913 present, report a bug!
     id: E9F46ECC-22_3017 present, report a bug!
     id: E9F46ECC-22_3187 present, report a bug!
     id: E9F46ECC-22_3765 deleted, tombstone exists
     id: E9F46ECC-22_5327 present, report a bug!
     id: E9F46ECC-22_5455 present, report a bug!
     id: E9F46ECC-22_601 deleted, tombstone exists
     id: E9F46ECC-22_6096 present, report a bug!
     id: E9F46ECC-22_6106 present, report a bug!
     id: E9F46ECC-22_6674 present, report a bug!
     id: E9F46ECC-22_791 present, report a bug!
     id: ECD6BE16-113_2961 present, report a bug!
     id: ECD6BE16-113_3065 present, report a bug!
     id: ECD6BE16-113_3687 present, report a bug!
     id: ECD6BE16-113_3717 present, report a bug!

74 undeleted key(s) present on C2(.54) compared to C1(.44)











 Comments   
Comment by Aruna Piravi [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11805/C1.tar
https://s3.amazonaws.com/bugdb/jira/MB-11805/C2.tar
Comment by Aruna Piravi [ 25/Jul/14 ]
[7/23/14 1:40:12 PM] Aruna Piraviperumal: hi Mike, I see some backfill stmts like in MB-11725 but that doesn't lead to any missing items
[7/23/14 1:40:13 PM] Aruna Piraviperumal: 172.23.105.47
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
 
172.23.105.50


172.23.105.59


172.23.105.62


172.23.105.45
/opt/couchbase/var/lib/couchbase/logs/memcached.log.27.txt:Tue Jul 22 16:02:46.470085 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-2ad6ab49733cf45595de9ee568c05798 - (vb 421) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.48


172.23.105.52
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.44
[7/23/14 1:56:12 PM] Michael Wiederhold: Having one of those isn't necessarily bad. Let me take a quick look
[7/23/14 2:02:49 PM] Michael Wiederhold: Ok this is good. I'll debug it a little bit more. Also, I don't necessarily expect that data loss will always occur because it's possible that the items could have already been replicated.
[7/23/14 2:03:38 PM] Aruna Piraviperumal: ok
[7/23/14 2:03:50 PM] Aruna Piraviperumal: I'm noticing data loss on standard bucket though
[7/23/14 2:04:19 PM] Aruna Piraviperumal: but no such disk snapshot logs found for 'standardbucket'
Comment by Mike Wiederhold [ 25/Jul/14 ]
For vbucket 0 in the logs I see that on the source side we have high seqno 102957, but on the destination we only have up to seqno 97705, so it appears that some items were not sent to the remote side. I also see in the logs that xdcr did request those items, as shown in the log messages below.

memcached<0.78.0>: Wed Jul 23 12:30:02.506513 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 95291 and end seqno 0
memcached<0.78.0>: Wed Jul 23 13:30:01.683760 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) stream created with start seqno 95291 and end seqno 102957
memcached<0.78.0>: Wed Jul 23 13:30:02.070134 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) Stream closing, 0 items sent from disk, 7666 items sent from memory, 102957 was last seqno sent
[ns_server:info,2014-07-23T13:30:10.753,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Wed Jul 23 13:30:10.552586 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 102957 and end seqno 0
Comment by Mike Wiederhold [ 25/Jul/14 ]
Alk,

See my comments above. Can you verify that all items were sent by the xdcr module correctly?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Let me quickly note that .tar is again in fact .tar.gz.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
missing:

a) data files (so that I can double-check your finding)

b) xdcr traces
Comment by Aruna Piravi [ 25/Jul/14 ]
1. For system tests, data files are huge, I did not attach them, the cluster is available.
2. xdcr traces were not enabled for this run, my apologies, but do we discard all the info we have in hand? Another complete run will take 3 days. I'm not sure we want to delay the investigation for that long.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
There's no way to investigate such delicate issue without having at least traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
If all files are large you can at least attach that vbucket 0 where you found discrepancies.
Comment by Aruna Piravi [ 25/Jul/14 ]
> There's no way to investigate such delicate issue without having at least traces.
If it is that important, it probably makes sense to enable traces by default rather than having to do diag/eval? Customer logs are not going to have traces by default.

>If all files are large you can at least attach that vbucket 0 where you found discrepancies.
 I can, if requested. The cluster was anyway left available.

Fine, let me do another run if there's no way to work around not having traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
>> > There's no way to investigate such delicate issue without having at least traces.

>> If it is that important, it probably makes sense to enable traces by default than having to do diag/eval? Customer logs are not going to have traces by default.

Not possible. We log potentially critical information. But _your_ tests are all semi-automated, right? So for your automation it makes sense indeed to always enable xdcr tracing.
Comment by Aruna Piravi [ 25/Jul/14 ]
System test is completely automated. Only the post-test verification is not. But enabling tracing is now a part of the framework.




[MB-11786] {UPR}:: Rebalance-out hangs due to indexing stuck Created: 22/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Seeing this issue in 991

1. Create 7 node cluster (10.6.2.144-150)
2. Create default Bucket
3. Add 1K items
4. Create 5 views and query
5. Rebalance out node 10.6.2.150

Step 4 and 5 are run in parallel

We see the rebalance hanging

I am seeing the following issue in the couchdb log on 10.6.2.150

[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]View merger, revision mismatch for design document `_design/ddoc1', wanted 5-3275804e, got 5-3275804e
[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]Uncaught error in HTTP request: {throw,{error,revision_mismatch}}

[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]View merger, revision mismatch for design document `_design/ddoc1', wanted 5-3275804e, got 5-3275804e
[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]Uncaught error in HTTP request: {throw,{error,revision_mismatch}}

Stacktrace: [{couch_index_merger,query_index,3,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_index_merger/src/couch_index_merger.erl"},
                  {line,75}]},
             {couch_httpd,handle_request,6,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couchdb/couch_httpd.erl"},
                  {line,222}]},
             {mochiweb_http,headers,5,


Will attach logs ASAP

Test Case:: ./testrunner -i ubuntu_x64--109_00--Rebalance-Out.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,dgm=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_with_queries,nodes_out=1,blob_generator=False,value_size=1024,GROUP=OUT;BASIC;P0;FROM_2_0

 Comments   
Comment by Parag Agarwal [ 22/Jul/14 ]
The cluster is live if you want to investigate 10.6.2.144-150.
Comment by Parag Agarwal [ 22/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11786/991_logs.tar.gz
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
We're waiting for the index to become updated.

I.e. I see a number of this:

     {<17674.13818.5>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,3}},
       {backtrace,[<<"Program counter: 0x00007f64917effa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f6493d4f070 Return addr 0x00007f648df8ed78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.9.246202>">>,<<"y(1) infinity">>,
                   <<"y(2) {if_rebalance,<0.12798.5>,{wait_index_updated,112}}">>,
                   <<"y(3) '$gen_call'">>,<<"y(4) <0.11899.5>">>,
                   <<"y(5) []">>,<<>>,
                   <<"0x00007f6493d4f0a8 Return addr 0x00007f6444879940 (janitor_agent:wait_index_updated/5 + 432)">>,
                   <<"y(0) infinity">>,
                   <<"y(1) {if_rebalance,<0.12798.5>,{wait_index_updated,112}}">>,
                   <<"y(2) {'janitor_agent-default','ns_1@10.6.2.144'}">>,
                   <<"y(3) Catch 0x00007f648df8ed78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007f6493d4f0d0 Return addr 0x00007f6444a49ea8 (ns_single_vbucket_mover:'-wait_index_updated/5-fun-0-'/5 + 104)">>,
                   <<>>,
                   <<"0x00007f6493d4f0d8 Return addr 0x00007f64917f38a0 (proc_lib:init_p/3 + 688)">>,
                   <<>>,
                   <<"0x00007f6493d4f0e0 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) []">>,
                   <<"y(1) Catch 0x00007f64917f38c0 (proc_lib:init_p/3 + 720)">>,
                   <<"y(2) []">>,<<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,2}]},
       {heap_size,610},
       {total_heap_size,1597},
       {links,[<17674.13242.5>]},
       {memory,13688},
       {message_queue_len,0},
       {reductions,806},
       {trap_exit,false}]}
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
And this:
     {<0.13891.5>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,5}},
       {backtrace,[<<"Program counter: 0x00007f64448ad040 (capi_set_view_manager:'-do_wait_index_updated/4-lc$^0/1-0-'/3 + 64)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f643e3ac948 Return addr 0x00007f64448abb90 (capi_set_view_manager:do_wait_index_updated/4 + 848)">>,
                   <<"y(0) #Ref<0.0.9.246814>">>,
                   <<"y(1) #Ref<0.0.9.246821>">>,
                   <<"y(2) #Ref<0.0.9.246820>">>,<<"y(3) []">>,<<>>,
                   <<"0x00007f643e3ac970 Return addr 0x00007f64917f3ab0 (proc_lib:init_p_do_apply/3 + 56)">>,
                   <<"y(0) {<0.13890.5>,#Ref<0.0.9.246813>}">>,<<>>,
                   <<"0x00007f643e3ac980 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) Catch 0x00007f64917f3ad0 (proc_lib:init_p_do_apply/3 + 88)">>,
                   <<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,5}]},
       {heap_size,987},
       {total_heap_size,1974},
       {links,[]},
       {memory,16808},
       {message_queue_len,0},
       {reductions,1425},
       {trap_exit,false}]}
Comment by Parag Agarwal [ 22/Jul/14 ]
Still seeing the issue in 3.0.0-1000 on CentOS 6.x and Ubuntu 12.04.
Comment by Sriram Melkote [ 22/Jul/14 ]
Sarath, can you please take a look?
Comment by Nimish Gupta [ 22/Jul/14 ]
The error in the http query will not hang the rebalance. The http query error is happening because the ddoc was updated.
I see there is an error in getting mutations for partition 127 from ep-engine:

[couchdb:info,2014-07-22T13:37:59.764,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:37:59.866,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:37:59.967,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.070,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.171,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.272,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.373,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.474,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.575,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.676,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.777,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.878,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.979,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...

There are a lot of these messages, continuing up until the logs were collected.
Comment by Sarath Lakshman [ 22/Jul/14 ]
Yes, ep-engine kept returning ETMPFAIL for partition 127's stream request. Hence, indexing never progressed.
The EP-Engine team should take a look.
Comment by Sarath Lakshman [ 22/Jul/14 ]
Tue Jul 22 13:52:14.041453 PDT 3: (default) UPR (Producer) eq_uprq:mapreduce_view: default _design/ddoc1 (prod/main) - (vb 127) Stream request failed because this vbucket is in backfill state
Tue Jul 22 13:52:14.143551 PDT 3: (default) UPR (Producer) eq_uprq:mapreduce_view: default _design/ddoc1 (prod/main) - (vb 127) Stream request failed because this vbucket is in backfill state

It seems that vbucket 127 is in backfill state and the backfill never completes.
Comment by Mike Wiederhold [ 25/Jul/14 ]
http://review.couchbase.org/39896




[MB-11824] [system test] [kv unix] rebalance hang at 0% when add a node to cluster Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Mike Wiederhold
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.4 64-bit

Attachments: Zip Archive 172.23.107.195-7252014-1342-diag.zip     Zip Archive 172.23.107.196-7252014-1345-diag.zip     Zip Archive 172.23.107.197-7252014-1349-diag.zip     Zip Archive 172.23.107.199-7252014-1352-diag.zip     Zip Archive 172.23.107.200-7252014-1356-diag.zip     Zip Archive 172.23.107.201-7252014-143-diag.zip     Zip Archive 172.23.107.202-7252014-1359-diag.zip     Zip Archive 172.23.107.203-7252014-146-diag.zip    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-1022-rel.rpm.manifest.xml
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.0-1022 on 8 nodes
1:172.23.107.195
2:172.23.107.196
3:172.23.107.197
4:172.23.107.199
5:172.23.107.200
6:172.23.107.202
7:172.23.107.201

8:172.23.107.203

Create a cluster of 7 nodes
Create 2 buckets: default and sasl-2 (no view)
Load 25+ M items into each bucket to bring the active resident ratio down to 80%.
Do updates, expirations and deletes on both buckets for 3 hours.
Then add node 203 to the cluster. Rebalance hangs at 0%.

Live cluster is available to debug


 Comments   
Comment by Mike Wiederhold [ 25/Jul/14 ]
Duplicate of MB-11809. We currently have two bug fixes in flight that fix rebalance-stuck issues (MB-11809 and MB-11786). Please run the tests with these changes merged before filing any other rebalance-stuck issues.




[MB-11665] {UPR}: During a 2 node rebalance-in scenario :: Java SDK (1.4.2) usage sees a drop in OPS (get/set) by 50% and high error rate Created: 07/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Parag Agarwal Assignee: Wei-Li Liu
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 172.23.107.174-177

Triage: Untriaged
Operating System: Centos 64-bit
Flagged:
Release Note
Is this a Regression?: Yes

 Description   
We have compared a run of 2.5.1-0194 vs 3.0.0-918, using Java SDK 1.4.2.

Common Scenario

1. Create a 2 node cluster
2. Create 1 default bucket
3. Add 15K items while doing gets and sets
4. Add 2 nodes and then rebalance
5. Run gets and sets again in parallel with the rebalance

Issue observed during step 5: ops drop by 50%, and the error rate is high most of the time, when compared to 2.5.1.

The comparative report is shared here

General Comparison Summary

https://docs.google.com/document/d/1PjQBdJvLFaK85OrrYzxOaZ54fklTXibj6yKVrrU-AOs/edit

3.0.0-918:: http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-3.0.0-918/Rb2In-HYBRID/07-03-14/068545/22bcef05a4f12ef3f9e7f69edcfc6aa4-MC.html

2.5.1-1094: http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-2.5.1-1094/Rb2In-HYBRID/06-24-14/083822/2f416c3207cf6c435ae631ae37da4861-MC.html
Attaching logs

We are trying to run more tests with different versions of the SDK, like 1.4.3 and 2.0.

https://s3.amazonaws.com/bugdb/jira/MB-11665/logs_3_0_0_918_SDK_142.tar.gz


 Comments   
Comment by Parag Agarwal [ 07/Jul/14 ]
Pavel: Please add your comments for such a scenario with libcouchbase
Comment by Pavel Paulau [ 07/Jul/14 ]
Not exactly the same scenario but I'm not seeing major drops/errors in my tests (using lcb based workload generator).
Comment by Parag Agarwal [ 08/Jul/14 ]
So Deepti posted results and we are not seeing issues with 1.4.3 for the same run. What is the difference between SDK 1.4.2 and 1.4.3?
Comment by Aleksey Kondratenko [ 08/Jul/14 ]
Given that the problem seems to be SDK-version specific and there's no evidence yet that it's something ns_server may cause, I'm bouncing this ticket back.
Comment by Matt Ingenthron [ 08/Jul/14 ]
Check the release notes for 1.4.3. We had an issue where there would be authentication problems, including timeouts and problems with retries. This was introduced in changes in 1.4.0 and fixed in 1.4.3. There's no direct evidence, but that sounds like a likely cause.
Comment by Matt Ingenthron [ 08/Jul/14 ]
Parag: not sure why you assigned this to me. I don't think there's any action for me. Reassigning back. I was just giving you additional information.
Comment by Wei-Li Liu [ 08/Jul/14 ]
Re-ran the test with the 1.4.2 SDK against the 3.0.0 server with just 4GB RAM per node (compared to my initial test with 16GB RAM per node).
The test result is much better: not seeing the errors, and the operations rate never drops significantly.
http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-3.0.0-918/Rb2In-HYBRID/07-08-14/074980/d5e2508529f1ad565ee38c9b8ab0c75b-MC.html
 
Comment by Parag Agarwal [ 08/Jul/14 ]
Sorry, Matt! Should we close this and note it in the release notes?
Comment by Matt Ingenthron [ 08/Jul/14 ]
Given that we believe it's an issue in a different project (JCBC), fixed and release noted there, I think we can just close this. The only other possible action, up to you and your management, is trying to verify this is the actual cause a bit more thoroughly.
Comment by Mike Wiederhold [ 25/Jul/14 ]
I haven't seen any activity on this in weeks, and the last test results look good, so I'm going to mark it as fixed. Please re-open if something still needs to be done for this ticket.




[MB-11822] numWorkers setting of 5 is treated as high priority but should be treated as low priority. Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Venu Uppalapati Assignee: Sundar Sridharan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
https://github.com/couchbase/ep-engine/blob/master/src/workload.h#L44-48
We currently use the priority conversion formula shown in the code linked above.
This assigns a numWorkers setting of 5 high priority, but the expectation is that <= 5 is low priority.
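For illustration only, a minimal sketch of the expected mapping (an assumption based on the description above, not the actual workload.h formula):

def bucket_priority(num_workers):
    # Expected boundary: a numWorkers setting of 5 or less is LOW priority,
    # anything above 5 is HIGH priority.
    return "LOW" if num_workers <= 5 else "HIGH"

assert bucket_priority(5) == "LOW"   # the boundary case this bug is about
assert bucket_priority(6) == "HIGH"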

 Comments   
Comment by Sundar Sridharan [ 25/Jul/14 ]
fix uploaded for review at http://review.couchbase.org/39891 thanks




[MB-9013] Moxi server restarts, exiting with code 139 Created: 30/Aug/13  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.0.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Luca Mazzocchi Assignee: Steve Yen
Resolution: Incomplete Votes: 0
Labels: restart
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: VMWare 5.1 Update 2
Centos
2 CPU
8 GB

Cluster with 2 nodes. We are using a memcached bucket.


 Description   
Periodically (every 2 hours) we see this message:

Port server moxi on node 'ns_1@couch-ep-1.eprice.lan' exited with status 139. Restarting. Messages: 2013-08-30 15:34:03: (cproxy_config.c.317) env: MOXI_SASL_PLAIN_USR (13)
2013-08-30 15:34:03: (cproxy_config.c.326) env: MOXI_SASL_PLAIN_PWD (12)

alternately on couch-ep-1 and couch-ep-2.

The memcached hit ratio drops, and the client (an e-commerce site) logs messages like "connection refused".


 Comments   
Comment by Maria McDuff (Inactive) [ 30/Aug/13 ]
Anil,
please decide which release this should go into.
Comment by Steve Yen [ 25/Jul/14 ]
(scrubbing through ancient moxi bugs on the path to 3.0)

Not sure what the exact cause of the 139 (sigsegv) was back then, but there was at least one crash fix after moxi 2.0.1 -- see MB-8102.

In the hope that that was the cause, marking this report/issue as incomplete.




[MB-8527] Moxi honors http_proxy environment variable Created: 27/Jun/13  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Bill Nash Assignee: Steve Yen
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Ubuntu 64-bit

 Description   

When set, Moxi honors http_proxy when attempting to connect to 127.0.0.1:8091.

In the example below, 10.12.78.99 is the IP address of my HTTP proxy, which I enabled to download and upgrade to CB 2.1.0. As it was still set at cluster start time, Moxi began attempting to honor it, consequently blocking all read and write attempts, even though the cluster was otherwise indicated to be healthy.

[pid 27410] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 45
[pid 27410] setsockopt(45, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 27410] fcntl(45, F_GETFL) = 0x2 (flags O_RDWR)
[pid 27410] fcntl(45, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 27410] connect(45, {sa_family=AF_INET, sin_port=htons(3128), sin_addr=inet_addr("10.12.78.99")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 27410] poll([{fd=45, events=POLLOUT}], 1, 1000) = 1 ([{fd=45, revents=POLLOUT}])
[pid 27410] getsockopt(45, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 27410] getpeername(45, {sa_family=AF_INET, sin_port=htons(3128), sin_addr=inet_addr("10.12.78.99")}, [16]) = 0
[pid 27410] getsockname(45, {sa_family=AF_INET, sin_port=htons(53608), sin_addr=inet_addr("10.12.54.42")}, [16]) = 0
[pid 27410] sendto(45, "GET http://127.0.0.1:8091/pools/"..., 171, MSG_NOSIGNAL, NULL, 0) = 171
[pid 27410] poll([{fd=45, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[pid 27410] poll([{fd=45, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[pid 27410] poll([{fd=45, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=45, revents=POLLIN}])
[pid 27410] poll([{fd=45, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=45, revents=POLLIN}])
[pid 27410] recvfrom(45, "HTTP/1.0 504 Gateway Time-out\r\nS"..., 16384, 0, NULL, NULL) = 1616

Issuing 'unset http_proxy' and restarting the cluster / killing moxi corrects the issue.

The error is mentioned in the babysitter logs:
babysitter.1:[ns_server:info,2013-06-27T11:38:19.083,babysitter_of_ns_1@127.0.0.1:<0.91.0>:ns_port_server:log:168]{moxi,"Atlas"}<0.91.0>: 2013-06-27 11:38:20: (agent_config.c.423) ERROR: parse JSON failed, from REST server: http://127.0.0.1:8091/pools/default/bucketsStreaming/Atlas, <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html><head> <meta http-equiv="Content-Type" CONTENT="text/html; charset=utf-8"> <title>ERROR: The requested URL could not be retrieved</title> <style type="text/css"><!-- %l body :lang(fa) { direction: rtl; font-size: 100%; font-family: Tahoma, Roya, sans-serif; float: right; } :lang(he) { direction: rtl; float: right; } --></style> </head><body> <div id="titles"> <h1>ERROR</h1> <h2>The requested URL could not be retrieved</h2> </div> <hr> <div id="content"> <p>The following error was encountered while trying to retrieve the URL: <a href="http://127.0.0.1:8091/pools/default/bucketsStreaming/Atlas">http://127.0.0.1:8091/pools/default/bucketsStreaming/Atlas&lt;/a&gt;&lt;/p> <blockquote id="error"> <p><b>Connection to 127.0.0.1 failed.</b></p> </blockquote>

I would suggest modifying moxi to never honor proxy settings, or to explicitly unset them at server start time.
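A minimal sketch of the second suggestion (clearing common proxy environment variables before the server/moxi processes start; illustrative only, not moxi or ns_server code):

import os

# Drop the usual proxy environment variables so child processes such as moxi
# never inherit them; popping a missing variable is a harmless no-op.
for var in ("http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY"):
    os.environ.pop(var, None)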

 Comments   
Comment by Maria McDuff (Inactive) [ 20/May/14 ]
Steve,

is this an easy fix?
Comment by Steve Yen [ 25/Jul/14 ]
Going through ancient moxi issues.

I'm worried about changing this behavior, as some users might actually be depending on moxi's current http_proxy behavior, especially those running standalone moxi as opposed to the moxi packaged with Couchbase.




[MB-8601] Log is not self-descriptive when Moxi crashes due to not having vbucket map Created: 12/Jul/13  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Larry Liu Assignee: Steve Yen
Resolution: Fixed Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
http://www.couchbase.com/issues/browse/MB-8431

This bug is closed as won't fix. But the log message is not self-descriptive and can be misinterpreted as moxi crashing. It makes sense to do the following:

1. Moxi should not crash while waiting for the vbucket map.
2. If moxi does crash due to the vbucket map, the log should be clear so the user does not panic.





 Comments   
Comment by Maria McDuff (Inactive) [ 20/May/14 ]
Steve,

this will be very helpful for 3.0.
Comment by Steve Yen [ 25/Jul/14 ]
http://review.couchbase.org/39895
Comment by Steve Yen [ 25/Jul/14 ]
http://review.couchbase.org/39897




[MB-11816] couchbase-cli failed to collect logs in cluster-wide collection Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 64-bit

Triage: Triaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-1022-rel.deb.manifest.xml
Is this a Regression?: Yes

 Description   
Install Couchbase Server 3.0.0-1022 on one Ubuntu 12.04 node.
Run cluster-wide collectinfo using couchbase-cli.
The collection fails:

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c localhost:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 127.0.0.1:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized
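The message matches Python getopt's wording for an unregistered long option. A minimal sketch of that failure mode (illustrative only, not the actual couchbase-cli code):

import getopt

argv = ["-c", "localhost:8091", "-u", "Administrator", "-p", "password", "--all-nodes"]

# If "all-nodes" is missing from the long-option list, getopt rejects it with
# the same "option --all-nodes not recognized" error shown above.
try:
    getopt.getopt(argv, "c:u:p:", ["nodes="])
except getopt.GetoptError as e:
    print("ERROR:", e)

# Once the flag is registered, parsing succeeds.
opts, rest = getopt.getopt(argv, "c:u:p:", ["nodes=", "all-nodes"])
print(opts)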

 Comments   
Comment by Bin Cui [ 24/Jul/14 ]
http://review.couchbase.org/#/c/39848/
Comment by Thuan Nguyen [ 25/Jul/14 ]
Verified on build 3.0.0-1028. This bug was fixed.




[MB-11818] couchbase-cli cluster-wide collectinfo failed to start collection for selected nodes Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install Couchbase Server 3.0.0-1022 on 4 nodes.
Run couchbase-cli to do cluster-wide collectinfo on one node.
The collection fails to start:

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=192.168.171.148
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.148
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.149


 Comments   
Comment by Bin Cui [ 25/Jul/14 ]
I am confused. Are you sure you want to use collect-logs-stop to start collecting?
Comment by Thuan Nguyen [ 25/Jul/14 ]
Oops, I copied the wrong command.
Here is the command that failed to start collectinfo:

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.149
NODES: ERROR: command: collect-logs-start: 192.168.171.148:8091, global name 'nodes' is not defined
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=@192.168.171.149
NODES: ERROR: command: collect-logs-start: 192.168.171.148:8091, global name 'nodes' is not defined
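"global name 'nodes' is not defined" is a Python NameError: the handler refers to a bare name that was never assigned instead of the value parsed from --nodes. A hypothetical sketch of the failure mode (not the actual couchbase-cli code):

def start_collection_broken(opts):
    # Refers to an undefined global; raises
    # NameError: global name 'nodes' is not defined (Python 2 wording).
    return {"nodes": nodes}

def start_collection_fixed(opts):
    # Use the value actually parsed from --nodes instead.
    return {"nodes": opts.get("nodes", "*")}

print(start_collection_fixed({"nodes": "ns_1@192.168.171.149"}))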
Comment by Bin Cui [ 25/Jul/14 ]
http://review.couchbase.org/#/c/39889/




[MB-10685] XDCR Stats: Negative values seen for mutation replication rate and data replication rate Created: 28/Mar/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-03-28 at 11.51.35 AM.png     PNG File Screen Shot 2014-03-28 at 11.51.54 AM.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
Seen in many scenarios where data outflow dips to 0 temporarily.

A negative mutation replication rate (measured in mutations/sec) or data replication rate (measured in B/KB) does not make sense; these values should be 0 or positive.
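As a sketch of the expectation only (not ns_server's actual stats code): a rate derived from cumulative counter samples can go negative when a counter resets or samples arrive out of order, so the derived per-second value should be clamped at zero.

def per_second_rate(prev_count, curr_count, interval_s):
    # Mutations/sec or bytes/sec from two cumulative counter samples;
    # clamp at 0 so a counter reset never shows up as a negative rate.
    return max(0.0, (curr_count - prev_count) / float(interval_s))

print(per_second_rate(1000, 1450, 5))   # 90.0
print(per_second_rate(1450, 1000, 5))   # 0.0 instead of -90.0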

 Comments   
Comment by Pavel Paulau [ 23/Apr/14 ]
Saw negative "outbound XDCR mutations" as well.

Can't agree with Minor status.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Aruna, Anil, Wayne .. July 17th
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
MB-7432 takes this into account already




[MB-9707] users may see incorrect "Outbound mutations" stat after topology change at source cluster (was: Rebalance in/out operation on Source cluster caused outbound replication mutations != 0 for long time while no write operation on source cluster) Created: 10/Dec/13  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication, test-execution
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sangharsh Agarwal Assignee: Aleksey Kondratenko
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.5.0 build 991

Attachments: File MB-9707-test_log.rtf     PNG File outboundmutations.png     PNG File Screen Shot 2014-01-22 at 11.36.06 AM.png     PNG File Snap-shot-2.png    
Issue Links:
Duplicate
duplicates MB-9745 When XDCR streams encounter any error... Closed
Relates to
relates to MB-9960 2.5 Release Note: users may see incor... Resolved
Triage: Triaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Maria, please check the title and my last comment, let me know if you need anything from me

 Description   
[Test case]
./testrunner -i ./xdcr.1.ini -t xdcr.rebalanceXDCR.Rebalance.swap_rebalance_out_master,items=1000,rdirection=unidirection,ctopology=chain,doc-ops=update-delete,rebalance=source

[Test Exception]
======================================================================
FAIL: swap_rebalance_out_master (xdcr.rebalanceXDCR.Rebalance)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/xdcr/rebalanceXDCR.py", line 328, in swap_rebalance_out_master
    elif self._replication_direction_str in "bidirection":
  File "pytests/xdcr/xdcrbasetests.py", line 714, in verify_results
    else:
  File "pytests/xdcr/xdcrbasetests.py", line 683, in verify_xdcr_stats
    timeout = max(120, end_time - time.time())
  File "pytests/xdcr/xdcrbasetests.py", line 661, in __wait_for_mutation_to_replicate
AssertionError: Timeout occurs while waiting for mutations to be replicated

----------------------------------------------------------------------

[Test Steps]

1. Create 2-2 nodes Source and Destination clusters.
2. Create default bucket on both the clusters.
3. Setup CAPI mode XDCR from source-destination.
4. Load 1000 items on source cluster.
5. Do swap-rebalance master node on source cluster.
6. After rebalance is finished, wait for rebalance_changes_left to reach 0 on the source side. --> Test failed here; rebalance_changes_left stays at 1 on the source cluster.
7. Verify items.

[Bug description]
Outbound replication mutations don't go to 0 after rebalance.


 Comments   
Comment by Sangharsh Agarwal [ 10/Dec/13 ]
[Bug description]
Outbound replication mutations (the replication_changes_left stat) don't go to 0 after rebalance. A snapshot of the outbound stats on the source side is attached.
Comment by Sangharsh Agarwal [ 10/Dec/13 ]
[Additional information]
There were also meta read operations on the destination side during the rebalance in/out operations.
Comment by Sangharsh Agarwal [ 10/Dec/13 ]
There are two issues to be investigated:

1. Outbound xdcr replication left as 1, while it should be zero eventually. -> Major issue.
2. During rebalance in/out operation on Source why there were meta reads from destination side while there is no mutation taken place on Source cluster except intra cluster re-balancing.
Comment by Junyi Xie (Inactive) [ 10/Dec/13 ]
First, what build are you using?

2 is expected. For 1, what is your timeout? Recently we checked in code to make the replicator wait 30 seconds before making a second try if an error happens. That may delay replication in some test cases.

I would suggest redoing the test manually and seeing how long it takes for the remaining items to be flushed. Also, 1K items seems a bit too small for general testing; 100K - 500K makes more sense to me.
Comment by Sangharsh Agarwal [ 10/Dec/13 ]
Build 2.5.0 991.

For 2, please explain why there are meta reads on the destination side.

For 1, the timeout is almost 20 minutes (there are several checks for ep_queue, curr_items, vb_active_times and then replication_changes_left). Here the outbound replication mutations stopped at 1; I waited 5-10 more minutes but it didn't come back to zero (as you can see from the straight line in the graph), and there were no other operations running in parallel, e.g. get, set, rebalance, etc. An additional point is that this problem occurs only in the case of rebalancing, on the cluster where the rebalance is performed, and not with any other operation.

In the verification steps we do the following:

1. First we wait for ep_queue to be 0 (750 seconds).
2. Second we wait for curr_items and vb_active_times to be as expected (750 seconds minus the time spent in step 1).
3. We wait for replication_changes_left to reach 0, but it is stuck at 1 (180-second timeout).
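For reference, a minimal sketch of this wait-for-zero polling (using a hypothetical get_stat helper, not the actual testrunner code):

import time

def wait_for_changes_left_zero(get_stat, timeout_s=180, poll_s=10):
    # Poll replication_changes_left until it reaches 0 or the timeout expires,
    # mirroring the verification step that is stuck at 1 in this report.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if get_stat("replication_changes_left") == 0:
            return True
        time.sleep(poll_s)
    return False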

Some tests with more than 1K items also failed because of this. I will also run the test manually tomorrow.
Comment by Junyi Xie (Inactive) [ 11/Dec/13 ]
For 2, rebalance (topology change) causes vbucket migration; the vb replicator starts on the new home node and checks whether there is anything to replicate. That scanning process triggers some traffic to the destination side. These are pure getMeta ops. If you have already replicated everything, no data will be replicated after rebalance.
Comment by Sangharsh Agarwal [ 13/Dec/13 ]
But it re-occurred on build 999. I added an 8-minute wait for mutations to reach 0, but it timed out this time as well.

Attaching a snapshot of another execution, where it is stuck with 16 outbound replications.
Comment by Sangharsh Agarwal [ 13/Dec/13 ]
Increasing the severity to blocker as many tests are failing because of this issue.
Comment by Sangharsh Agarwal [ 13/Dec/13 ]
Test is failing with 10K items also.
Comment by Junyi Xie (Inactive) [ 14/Dec/13 ]
I made a toybuild with tentative fixes. Please retest with http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent58-master-toy-junyi-x86_64_0.0.0-MB9663A-toy.rpm

Please update with results and logs

Thanks,
Comment by Sangharsh Agarwal [ 16/Dec/13 ]
Junyi,
   Is this fix also merged?
Comment by Sangharsh Agarwal [ 17/Dec/13 ]
I am still getting this error after installing this toy build.

[Source Cluster]
10.3.4.176 (Master node) -> https://s3.amazonaws.com/bugdb/jira/MB-9707/c5236067/mb9707_repro_2.zip
10.3.2.161 -> https://s3.amazonaws.com/bugdb/jira/MB-9707/a30d2f16/mb9707_repro_2.zip
172.23.106.21 -> New node -> https://s3.amazonaws.com/bugdb/jira/MB-9707/b1b5e88a/mb9707_repro_2.zip

[Destination Cluster]
10.3.4.175 (Master node) -> https://s3.amazonaws.com/bugdb/jira/MB-9707/186497be/mb9707_repro_2.zip
172.23.106.22 -> https://s3.amazonaws.com/bugdb/jira/MB-9707/f5dd0a9b/mb9707_repro_2.zip


Replication was created at below time

[user:info,2013-12-17T2:31:52.064,ns_1@10.3.4.176:<0.25673.7>:menelaus_web_xdc_replications:handle_create_replication:50]Replication from bucket "default" to bucket "default" on cluster "cluster1" created.

Please analyse the logs for this time.

Note: just to add, two of the servers have clocks that are 1 minute 30 seconds off from the others.
Comment by Junyi Xie (Inactive) [ 17/Dec/13 ]
Hi Sangharsh,

A few things

1) Inconsistent clocks cause some confusion when reading the logs. Please fix them if you can; otherwise please state clearly which node has the delay. In the last test, source 10.3.4.176 and 10.3.2.161 apparently have different clocks, but for 172.23.106.21 it is not clear which other node it is consistent with.
2) Do you know which source node has remaining mutations that were not replicated? You can see that from the UI, but it is not shown in the uploaded screenshots. From the logs, there are no errors on 10.3.4.176 and 172.23.106.21, but 10.3.2.161 has some db_not_found errors from when the replication was created. Usually that is because you create the XDCR right after the bucket is created on the destination side. To avoid that, please wait for 30 seconds after you create the buckets and before starting XDCR. These errors caused vb replicators to crash and restart, and it is known that the restart is not uniform (MB-9745).
3) Did you get a chance to run the test manually and reproduce?


To speed up the process, I would like to have a meeting with you and run the test, so we can monitor it together. Please feel free to schedule a meeting tomorrow (Wednesday, Dec 18th) at your convenience; I guess 10 AM EST is probably good for both of us. Thanks.
Comment by Sangharsh Agarwal [ 17/Dec/13 ]
Junyi,
The same problem is occurring in a few tests in the Jenkins job as well. Please see http://qa.sc.couchbase.com/job/centos_x64--01_01--uniXDCR_biXDCR-P0/18/consoleFull.

>2) Do you know which source node has remaining mutations not replicated?

It is on 10.3.2.161; 10.3.4.176 is swap-rebalanced with 172.23.106.21.

> Usually that is because you create the XDCR right after the bucket is created on the destination side. To avoid that, please wait for 30 seconds after you create the buckets and before starting XDCR. These errors caused vb replicators to crash and restart, and it is known that the restart is not uniform (MB-9745).

The problem occurs with the mutations (deletes and updates) while data is being replicated, before the swap rebalance is started.

I have sent you the invite for the discussion.

Comment by Junyi Xie (Inactive) [ 18/Dec/13 ]
Re-ran the test with Sangharsh using the toy build but did not reproduce. It does not look like a code bug but rather a test issue. Two things need to be fixed in the test: 1) after the remote bucket is created, wait for 30 seconds before creating the XDCR (db_not_found was seen in the test even before rebalance); 2) reduce the vb replicator restart interval from the default 30 seconds to 5 seconds to speed up data sync-up.

BTW, all fixes in toybuild have been merged.
Comment by Sangharsh Agarwal [ 18/Dec/13 ]
Junyi,
Currently all XDCR tests are running with "xdcrFailureRestartInterval" = 1 second. Is that OK?
Comment by Junyi Xie (Inactive) [ 19/Dec/13 ]
Hi Sangharsh,

It really depends on what test you are running; that is the reason we have this parameter. For example, for a test with small writes (like the PayPal use case) and no topology change on the destination side, it is OK to use a small restart interval.
But in other cases, like tests involving a long rebalance at the destination, it does not make sense to restart every second. So my suggestion is:

1) understand the test
2) manually run the test beforehand to figure out the reasonable restart interval
3) modify automated test accordingly
Comment by Junyi Xie (Inactive) [ 19/Dec/13 ]
Sangharsh,

1) please upgrade your build to the latest, and
2) send me the ini file you use; I will run the test myself.

Comment by Junyi Xie (Inactive) [ 19/Dec/13 ]
Hi Sangharsh,

I tried the test with my own ini file with 5 VM nodes. The test passes without any problem (see part of the log below). I used build 1013. Not sure what happened on your side.

Although the test passes, I found a potential problem in the test which may fail it in some cases. After you load a deletion of 300 items into the source cluster after rebalance, you wait only 30 seconds before merging the buckets. If everything runs perfectly, there is no problem, but if you hit any error the replicator will restart. In that case the 30-second wait may be too short for the replicators to restart and catch up on replicating all items.

2013-12-19 20:52:56 | INFO | MainProcess | MainThread | [xdcrbasetests.sleep] sleep for 30 secs. ...
2013-12-19 20:53:26 | INFO | MainProcess | MainThread | [xdcrbasetests.merge_buckets] merge buckets 10.3.2.43->10.3.3.101, bidirection:False





13-12-19 20:54:47 | INFO | MainProcess | MainThread | [xdcrbasetests.verify_xdcr_stats] and Verify xdcr replication stats at Source Cluster : 10.3.2.43
2013-12-19 20:54:50 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:54:52 | INFO | MainProcess | Cluster_Thread | [task.check] Saw ep_queue_size 0 == 0 expected on '10.3.2.43:8091',default bucket
2013-12-19 20:54:53 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:54:56 | INFO | MainProcess | Cluster_Thread | [task.check] Saw ep_queue_size 0 == 0 expected on '10.3.3.104:8091',default bucket
2013-12-19 20:54:59 | INFO | MainProcess | MainThread | [xdcrbasetests.verify_xdcr_stats] Verify xdcr replication stats at Destination Cluster : 10.3.3.101
2013-12-19 20:55:00 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:55:04 | INFO | MainProcess | Cluster_Thread | [task.check] Saw ep_queue_size 0 == 0 expected on '10.3.3.101:8091',default bucket
2013-12-19 20:55:04 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:55:08 | INFO | MainProcess | Cluster_Thread | [task.check] Saw ep_queue_size 0 == 0 expected on '10.3.3.103:8091',default bucket
2013-12-19 20:55:12 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:55:15 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:55:18 | INFO | MainProcess | Cluster_Thread | [task.check] Saw curr_items 700 == 700 expected on '10.3.2.43:8091''10.3.3.104:8091',default bucket
2013-12-19 20:55:18 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:55:22 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:55:25 | INFO | MainProcess | Cluster_Thread | [task.check] Saw vb_active_curr_items 700 == 700 expected on '10.3.2.43:8091''10.3.3.104:8091',default bucket
2013-12-19 20:55:29 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:55:32 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:55:35 | INFO | MainProcess | MainThread | [task.__init__] 1000 items will be verified on default bucket
2013-12-19 20:55:35 | INFO | MainProcess | load_gen_task | [task.has_next] 0 items were verified
2013-12-19 20:55:35 | INFO | MainProcess | load_gen_task | [data_helper.getMulti] Can not import concurrent module. Data for each server will be got sequentially
2013-12-19 20:56:19 | INFO | MainProcess | load_gen_task | [task.has_next] 1000 items were verified in 44.8155920506 sec.the average number of ops - 22.3136571342 per second
2013-12-19 20:56:21 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:56:25 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:56:27 | INFO | MainProcess | Cluster_Thread | [task.check] Saw curr_items 700 == 700 expected on '10.3.3.101:8091''10.3.3.103:8091',default bucket
2013-12-19 20:56:28 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:56:31 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:56:33 | INFO | MainProcess | Cluster_Thread | [task.check] Saw vb_active_curr_items 700 == 700 expected on '10.3.3.101:8091''10.3.3.103:8091',default bucket
2013-12-19 20:56:35 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:56:38 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:56:40 | INFO | MainProcess | MainThread | [task.__init__] 1000 items will be verified on default bucket
2013-12-19 20:56:41 | INFO | MainProcess | load_gen_task | [task.has_next] 0 items were verified
2013-12-19 20:56:41 | INFO | MainProcess | load_gen_task | [data_helper.getMulti] Can not import concurrent module. Data for each server will be got sequentially
2013-12-19 20:57:22 | INFO | MainProcess | load_gen_task | [task.has_next] 1000 items were verified in 42.144990921 sec.the average number of ops - 23.7276109742 per second
2013-12-19 20:57:26 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:57:29 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:57:34 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:57:37 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:57:39 | INFO | MainProcess | load_gen_task | [task.has_next] Verification done, 0 items have been verified (updated items: 0)
2013-12-19 20:57:42 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:57:46 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:57:50 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:57:53 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:57:56 | INFO | MainProcess | load_gen_task | [task.has_next] Verification done, 0 items have been verified (deleted items: 0)
2013-12-19 20:57:56 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] ============== XDCRbasetests stats for test #1 swap_rebalance_out_master ==============
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] Type of run: UNIDIRECTIONAL XDCR
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] STATS with source at 10.3.2.43 and destination at 10.3.3.101
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Bucket: default
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Median XDC replication ops for bucket 'default': 0.00301204819277 K ops per second
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Mean XDC replication ops for bucket 'default': 0.0118389172996 K ops per second
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] ============== = = = = = = = = END = = = = = = = = = = ==============
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] ============== rebalanceXDCR cleanup was started for test #1 swap_rebalance_out_master ==============
2013-12-19 20:57:58 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [u'default'] on 10.3.2.43
2013-12-19 20:57:58 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] remove bucket default ...
2013-12-19 20:58:03 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleted bucket : default from 10.3.2.43
2013-12-19 20:58:03 | INFO | MainProcess | MainThread | [bucket_helper.wait_for_bucket_deletion] waiting for bucket deletion to complete....
2013-12-19 20:58:03 | INFO | MainProcess | MainThread | [rest_client.bucket_exists] existing buckets : []
2013-12-19 20:58:05 | INFO | MainProcess | MainThread | [cluster_helper.cleanup_cluster] rebalancing all nodes in order to remove nodes
2013-12-19 20:58:05 | INFO | MainProcess | MainThread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=ns_1%4010.3.3.104&user=Administrator&knownNodes=ns_1%4010.3.3.104%2Cns_1%4010.3.2.43
2013-12-19 20:58:06 | INFO | MainProcess | MainThread | [rest_client.rebalance] rebalance operation started
2013-12-19 20:58:08 | INFO | MainProcess | MainThread | [rest_client.monitorRebalance] rebalance progress took 2.56104779243 seconds
2013-12-19 20:58:08 | INFO | MainProcess | MainThread | [rest_client.monitorRebalance] sleep for 2.56104779243 seconds after rebalance...
2013-12-19 20:58:11 | INFO | MainProcess | MainThread | [cluster_helper.cleanup_cluster] removed all the nodes from cluster associated with ip:10.3.2.43 port:8091 ssh_username:root ? [(u'ns_1@10.3.3.104', 8091)]
2013-12-19 20:58:12 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 10.3.2.43:8091
2013-12-19 20:58:12 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 10.3.2.43:8091 is running
2013-12-19 20:58:13 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [] on 10.3.3.104
2013-12-19 20:58:15 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 10.3.3.104:8091
2013-12-19 20:58:16 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 10.3.3.104:8091 is running
2013-12-19 20:58:17 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [u'default'] on 10.3.3.101
2013-12-19 20:58:17 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] remove bucket default ...
2013-12-19 20:58:21 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleted bucket : default from 10.3.3.101
2013-12-19 20:58:21 | INFO | MainProcess | MainThread | [bucket_helper.wait_for_bucket_deletion] waiting for bucket deletion to complete....
2013-12-19 20:58:21 | INFO | MainProcess | MainThread | [rest_client.bucket_exists] existing buckets : []
2013-12-19 20:58:23 | INFO | MainProcess | MainThread | [cluster_helper.cleanup_cluster] rebalancing all nodes in order to remove nodes
2013-12-19 20:58:24 | INFO | MainProcess | MainThread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=ns_1%4010.3.3.103&user=Administrator&knownNodes=ns_1%4010.3.3.101%2Cns_1%4010.3.3.103
2013-12-19 20:58:24 | INFO | MainProcess | MainThread | [rest_client.rebalance] rebalance operation started
2013-12-19 20:58:27 | INFO | MainProcess | MainThread | [rest_client.monitorRebalance] rebalance progress took 2.58956503868 seconds
2013-12-19 20:58:27 | INFO | MainProcess | MainThread | [rest_client.monitorRebalance] sleep for 2.58956503868 seconds after rebalance...
2013-12-19 20:58:30 | INFO | MainProcess | MainThread | [cluster_helper.cleanup_cluster] removed all the nodes from cluster associated with ip:10.3.3.101 port:8091 ssh_username:root ? [(u'ns_1@10.3.3.103', 8091)]
2013-12-19 20:58:31 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 10.3.3.101:8091
2013-12-19 20:58:31 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 10.3.3.101:8091 is running
2013-12-19 20:58:32 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [] on 10.3.3.103
2013-12-19 20:58:34 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 10.3.3.103:8091
2013-12-19 20:58:34 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 10.3.3.103:8091 is running
2013-12-19 20:58:34 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] ============== rebalanceXDCR cleanup was finished for test #1 swap_rebalance_out_master ==============
ok

----------------------------------------------------------------------
Ran 1 test in 1330.685s

OK
summary so far suite xdcr.rebalanceXDCR.Rebalance , pass 1 , fail 0
testrunner logs, diags and results are available under logs/testrunner-13-Dec-19_20-36-24
Run after suite setup for xdcr.rebalanceXDCR.Rebalance.swap_rebalance_out_master
Junyis-MacBook-Pro:testrunner junyi$
Comment by Junyi Xie (Inactive) [ 20/Dec/13 ]
Based on investigation, it is a test issue instead of a code bug. Not a blocker.
Comment by Anil Kumar [ 20/Dec/13 ]
Sangharsh - please confirm this is a test issue as Junyi mentioned; if not, reopen with details.
Comment by Sangharsh Agarwal [ 22/Dec/13 ]
Anil, currently almost 16-18 test cases have failed because of this issue (as per the latest Jenkins execution on 22nd December). I am reviewing each test as per Junyi's comment and will update this issue soon.
Comment by Sangharsh Agarwal [ 26/Dec/13 ]
Junyi,
   After adding your suggested waits after creating the buckets on the destination and after the delete/update ops, this bug still occurs on Jenkins and many tests are failing because of it. In the recent Jenkins execution almost 18 jobs failed because of this issue. Can you please check the code that updates the stat for outbound mutations (replication_changes_left)? It might also be a case of the stat value being updated incorrectly, because the number of items on both sides is in sync.
Comment by Sangharsh Agarwal [ 27/Dec/13 ]
[Automated test steps]

1. Create 4 node source cluster and 3 node destination cluster.
2. Setup bidirectional replication for default bucket in CAPI mode.
3. Load 10000 items on both Source side and destination side.
4. Wait for 60 seconds to ensure if replication is completed.
5. Failover one non-master node at destination side.
6. Add back the node.
7. Wait for 60 seconds.
8. Perform 30% update and delete at source and destination side.
9. Wait for 120 seconds
10. Verify results. -> Failed here since outbound mutations were non-zero.
Comment by Sangharsh Agarwal [ 27/Dec/13 ]
When I increased the timeout in step 3 to 180 seconds and in step 7 to 120 seconds, the test passed. But this kind of fix is temporary because the behaviour is not consistent; the timeout depends on various factors, e.g. data load, replication mode (CAPI/XMEM), number of updates/deletions, etc.

Additionally, we need to know why replication_changes_left is not zero after completion and stays stuck at some non-zero value forever.
Also, I specifically looked for the keyword "changes_left" in the Couchbase Server logs but couldn't find any clue that would help in understanding the cause of the issue. Please suggest any specific keyword I can use in the logs to analyze further and help find the root cause.

FYI, we didn't have this verification (checking that all mutations are replicated) in the test code before; we added it only a month ago. That is why we are facing this kind of issue for the first time.
Comment by Sangharsh Agarwal [ 29/Dec/13 ]
http://qa.sc.couchbase.com/job/ubuntu_x64--36_01--XDCR_upgrade-P1/30/consoleFull

4 upgrade tests also failed because of this issue, even though the number of items was not large.
Comment by Junyi Xie (Inactive) [ 30/Dec/13 ]
Saw this in a test with Sangharsh even without any rebalance. Several replicators crashed unexpectedly. The error code captured by XDCR is "NIL", which should be a bug and is likely related to the recent xdcr-over-ssl change from the ns_server team.

Two questions need answers from ns_server team

1) Why did the replicator crash with an http_request_failed error? Per Sangharsh, there was no topology change on either side and the write load is very small (less than 1k/sec bucket-wise), which should not cause any stress issue.
2) Why is the error code returned from the remote side "nil"? This does not provide any insight into what happened.
 

error_logger:error,2013-12-30T9:27:12.746,ns_1@10.3.2.109:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.10675.5> terminating
** Last message in was start_replication
** When Server state == [{data,
                          [{"State",
                            {rep_state,
                             {rep,
                              <<"1a57639231dd745992cbe1736c7f1c8c/default/default">>,
                              <<"default">>,
                              <<"/remoteClusters/1a57639231dd745992cbe1736c7f1c8c/buckets/default">>,
                              "xmem",
                              [{optimistic_replication_threshold,256},
                               {worker_batch_size,500},
                               {remote_proxy_port,11214},

                               {cert,
                                <<"-----BEGIN CERTIFICATE-----\r\nMIICmDCCAYKgAwIBAgIIE0STzcUZwvAwCwYJKoZIhvcNAQEFMAwxCjAIBgNVBAMT\r\nASowHhcNMTMwMTAxMDAwMDAwWhcNNDkxMjMxMjM1OTU5WjAMMQowCAYDVQQDEwEq\r\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA4zWT9StoHrRlaQHevX7v\r\ny4l/9RW2PJDpiSjriGOPK9Vpn5lQ5KBqlBbftLIZ+M2hclXQe4fvh1tS12hU5vLB\r\n9zAKsLlF/vyELa9e\
JHuykdMhBuu55VgJCm+m+WzrKSaEmZ837Dbawv7+Bpesyk0N\r\nMX96HrNY83KlzFVl/gwKsXK5TvuoHrfQ8g4odDZEDjnW1VlcAOaISNa8NwCpSrx0\r\n1eXqFnm9cax3FPCS8rZBd8KbFvWXBSFVH/Vpn+03godir1Rn3+nJteWV9S/3Kgap\r\n2TTtFAi4crsVsdcTbezEOI6l0TNL7yjq2yDzNvVKLugX9XA6W3wh4/Nbmu/sgHyo\r\ndwIDAQABowIwADALBgkqhkiG9w0BAQUDggEBACsdjy/32B/osqbgsNvbyjlGlOOY\r\nGZY4HoPgHFZciDqPo9XZ64zHyIAnZ/\
Oy/5rdcajhmFixgIuEj0pNhLRPKbRzeXQ3\r\nG1wtW7YeK1BGrUSmSgZi9BIfLUEPmYiSYmwSnXlwNNFpKoOhcuxgZ97E6RUqdqLq\r\nwF4P7dpw5CXWudpLH9TqEuk7fxzK6ANTC9kgXEqr8+GOqAzG4VAtpEug/EeOI0Wr\r\nB0q6xT7rUvnDnPIr3MPb+aNXU2mHKSpz6nntkaJ+VHyGhlMNgjyICPzrECvC2Pol\r\nKaDxA3I5knrwMQzAspRq4VEafXQYnnjCFMBzzXaQ/P61P7GFpg3InrOqlvs=\r\n-----END CERTIFICATE-----\r\n">>}]},
                             <0.10418.5>,<0.10414.5>,<<"default/156">>,
                             <<"*****@10.3.4.175:18092/default%2f156%3bf52074c035e5ad86641a058dda01caaf">>,
                             undefined,undefined,undefined,undefined,[],
                             {[{<<"session_id">>,
                                <<"d8a212c92cc7c43e6cb7d872a7a8c5bc">>},
                               {<<"source_last_seq">>,44},
                               {<<"start_time">>,
                                <<"Mon, 30 Dec 2013 16:31:47 GMT">>},
                               {<<"end_time">>,
                                <<"Mon, 30 Dec 2013 17:10:34 GMT">>},
                               {<<"docs_checked">>,44},
                               {<<"docs_written">>,44},
                               {<<"data_replicated">>,21108},
                               {<<"history">>,
                                [{[{<<"session_id">>,
                                    <<"d8a212c92cc7c43e6cb7d872a7a8c5bc">>},
                                   {<<"start_time">>,
                                    <<"Mon, 30 Dec 2013 16:31:47 GMT">>},
                                   {<<"end_time">>,
                                    <<"Mon, 30 Dec 2013 17:10:34 GMT">>},
                                   {<<"start_last_seq">>,0},
                                   {<<"end_last_seq">>,44},
                                   {<<"recorded_seq">>,44},
                                   {<<"docs_checked">>,44},
                                   {<<"docs_written">>,44},
                                   {<<"data_replicated">>,21108}]}]}]},
                             0,44,48,0,[],48,
                             {doc,
                              <<"_local/156-1a57639231dd745992cbe1736c7f1c8c/default/default">>,
                              {1,<<27,24,13,111>>},
                              {[]},
                              0,false,[]},
                             {doc,
                              <<"_local/156-1a57639231dd745992cbe1736c7f1c8c/default/default">>,
                              {1,<<27,24,13,111>>},
                              {[]},
                              0,false,[]},
                             "Mon, 30 Dec 2013 16:31:47 GMT",
                             <<"1388400453481967">>,<<"1388400381">>,nil,
                             {1388,424429,531152},
                             {1388,423435,765637},
                             [],<0.9171.9>,
                             <<"d8a212c92cc7c43e6cb7d872a7a8c5bc">>,48,
                             false}}]}]
** Reason for termination ==
** {http_request_failed,"HEAD",
                        "https://Administrator:*****@10.3.4.175:18092/default%2f156%3bf52074c035e5ad86641a058dda01caaf/",
                        {error,{code,nil}}}


Comment by Junyi Xie (Inactive) [ 31/Dec/13 ]
Alk, can you please take a quick look at comments at 30/Dec/13 12:55 PM?
Comment by Junyi Xie (Inactive) [ 31/Dec/13 ]
Please assign back to me after you put your thoughts. Thanks.
Comment by Aleksey Kondratenko [ 31/Dec/13 ]
1) I cannot comment on anything without clear pointer to logs

2) I believe that conflating SSL tests and non-SSL tests is a big mistake. Open a new bug if this is related to SSL.
Comment by Junyi Xie (Inactive) [ 31/Dec/13 ]
Sangharsh,

My suggestion:

1) Please provide the logs Alk asked for (if you did not keep the logs, you need to reproduce the test where you see SSL errors, but without any topology change).
2) As Alk suggested, please file a different bug for the SSL issue.


Thanks.



Comment by Sangharsh Agarwal [ 01/Jan/14 ]
Junyi,
   This issue was occurring even before the SSL feature was merged. I have already provided reproduction logs 3 times. This issue is failing many XDCR test cases (around 10-12) continuously in every execution. Please use the above-mentioned logs for analysis and let me know if anything is missing from them. For XDCR tests we keep the restart interval at 1 second for each test.
Comment by Maria McDuff (Inactive) [ 08/Jan/14 ]
Junyi,

In build 1028, 2 tests are failing due to these mutations not zeroing out:

1).xdcr.biXDCR.bidirectional.xdcr.biXDCR.bidirectional.load_with_async_ops_and_joint_sets,doc-ops:create,GROUP:P0;xmem,demand_encryption:1,items:10000,case_number:1,conf_file:py-xdcr-bidirectional.conf,num_nodes:6,cluster_name:6-win-xdcr,ctopology:chain,rdirection:bidirection,ini:/tmp/6-win-xdcr.ini,doc-ops-dest:create,replication_type:xmem,get-cbcollect-info:True,spec:py-xdcr-bidirectional
2). xdcr.biXDCR.bidirectional.xdcr.biXDCR.bidirectional.load_with_async_ops_and_joint_sets_with_warmup,doc-ops:create-update,GROUP:P0;xmem,demand_encryption:1,items:10000,upd:30,case_number:6,conf_file:py-xdcr-bidirectional.conf,num_nodes:6,cluster_name:6-win-xdcr,ctopology:chain,rdirection:bidirection,ini:/tmp/6-win-xdcr.ini,doc-ops-dest:create-update,replication_type:xmem,get-cbcollect-info:True,spec:py-xdcr-bidirectional

Comment by Junyi Xie (Inactive) [ 08/Jan/14 ]
Maria, please upload or point me to the logs of these two failed tests. Without logs, I cannot say anything.
Comment by Maria McDuff (Inactive) [ 08/Jan/14 ]
i'm collecting the logs... they are uploading now.
Comment by Maria McDuff (Inactive) [ 08/Jan/14 ]
junyi,

here are the logs: https://s3.amazonaws.com/bugdb/jira/MB-9707/mb9707.tgz
Comment by Junyi Xie (Inactive) [ 09/Jan/14 ]
Duplicate of MB-9745. Let us fix MB-9745 first.
Comment by Maria McDuff (Inactive) [ 10/Jan/14 ]
MB-9745.
Comment by Maria McDuff (Inactive) [ 10/Jan/14 ]
Should be re-tested when MB-9745 is fixed.
Comment by Sangharsh Agarwal [ 17/Jan/14 ]
4 tests are failed because of this issue:

http://qa.sc.couchbase.com/job/centos_x64--01_01--uniXDCR_biXDCR-P0/37/consoleFull

[Logs for tests]
./testrunner -i /tmp/ubuntu-64-2.0-biXDCR-sanity.ini items=10000,get-cbcollect-info=True -t xdcr.biXDCR.bidirectional.load_with_failover,replicas=1,items=10000,ctopology=chain,rdirection=bidirection,doc-ops=create-update-delete,doc-ops-dest=create-update,failover=destination,replication_type=xmem,GROUP=P0;xmem

[Test Steps]
1. FailureRestartInterval is 1 for this test.
2. Create SRC cluster with 3 nodes, and Destination cluster with 4 nodes.
3. Setup Bi-directional xmem non-encryption replication.
4. Load 10000 - 10000 items on both Source and Destination cluster.
5. Perform failover on destination side for non-master node.
6. Sleep for 30 seconds
7. Perform asynchronous updates and deletes (30%) on source side and updates on destination side.
8. Verification steps:
    i) Wait for curr_items and vb_active_curr_items on the source side to be 17000 and ep_queue_size = 0.
    ii) Wait for replication_changes_left == 0 on destination side. -> Failed here.
    



[Improved logging on the test]
[2014-01-17 00:40:33,434] - [xdcrbasetests:652] INFO - Waiting for Outbound mutation to be zero on cluster node: 10.1.3.96
[2014-01-17 00:40:33,699] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:40:33,701] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:40:43,850] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:40:43,851] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:40:54,143] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:40:54,144] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:04,344] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:04,346] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:14,625] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:14,628] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:24,889] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:24,890] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:35,228] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:35,229] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:45,570] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:45,572] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:55,796] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:55,803] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:06,026] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:06,028] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:16,202] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:16,203] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:26,499] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:26,501] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:36,695] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:36,696] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:46,869] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:46,870] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:57,104] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:57,105] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:43:07,478] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:43:07,479] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:43:17,743] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:43:17,749] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:43:27,977] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:43:27,978] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...


[Logs]

Source ->

10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-9707/c7048a1d/10.1.3.93-1172014-044-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-9707/5788f308/10.1.3.94-1172014-045-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-9707/550388dd/10.1.3.95-1172014-046-diag.zip

Destination ->

10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-9707/75f497c4/10.1.3.96-1172014-047-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-9707/5742d86b/10.1.3.97-1172014-048-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-9707/a280fb85/10.1.3.99-1172014-049-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-9707/1e3c0b00/10.1.2.12-1172014-049-diag.zip


Check the logs after [2014-01-17 00:31:00,028]. It contains the logs for this test case only.

[user:info,2014-01-17T0:31:36.530,ns_1@10.1.3.93:<0.28274.18>:menelaus_web_remote_clusters:do_handle_remote_clusters_post:96]Created remote cluster reference "cluster1" via 10.1.3.96:8091.
[user:info,2014-01-17T0:31:36.632,ns_1@10.1.3.93:<0.28905.18>:menelaus_web_xdc_replications:handle_create_replication:50]Replication from bucket "default" to bucket "default" on cluster "cluster1" created.
[error_logger:info,2014-01-17T0:31:36.638,ns_1@10.1.3.93:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
Comment by Junyi Xie (Inactive) [ 17/Jan/14 ]
I repeated the test using my own VMs (2:3 configuration) and the test passes. All items are verified correctly. See part of the test output below.

From Sangharsh's recent logs, all items have been replicated and are synced up on both sides. The non-zero "outbound XDCR mutations" value is likely a stats issue, which I suspect is caused by XDCR receiving an incorrect vb map containing "dead" vbuckets that no longer belong to the node. This happens during topology change, which is why the non-zero "outbound XDCR mutations" is only seen on the "Destination cluster" (which is also a source, since this is bidirectional replication).

I will continue investigation.


A couple of side comments

1. We still see a db_not_found error when XDCR starts up in the test, though XDCR is able to recover.
2. The verification stage seems to take very long (> 12 min to verify 20K items) on one cluster.




2014-01-17 12:44:38 | INFO | MainProcess | load_gen_task | [task.has_next] 10000 items were verified
2014-01-17 12:54:55 | INFO | MainProcess | load_gen_task | [task.has_next] 20000 items were verified
2014-01-17 12:54:55 | INFO | MainProcess | load_gen_task | [task.has_next] 20000 items were verified in 656.043297052 sec.the average number of ops - 30.4857927225 per second

2014-01-17 12:55:05 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2014-01-17 12:55:29 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2014-01-17 12:55:37 | INFO | MainProcess | Cluster_Thread | [task.check] Saw curr_items 17000 == 17000 expected on '10.3.3.101:8091''10.3.3.103:8091',default bucket
2014-01-17 12:55:40 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2014-01-17 12:56:06 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2014-01-17 12:56:25 | INFO | MainProcess | Cluster_Thread | [task.check] Saw vb_active_curr_items 17000 == 17000 expected on '10.3.3.101:8091''10.3.3.103:8091',default bucket
2014-01-17 12:56:47 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2014-01-17 12:57:12 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2014-01-17 12:57:30 | INFO | MainProcess | MainThread | [task.__init__] 20000 items will be verified on default bucket
2014-01-17 12:57:30 | INFO | MainProcess | load_gen_task | [task.has_next] 0 items were verified
2014-01-17 12:59:15 | INFO | MainProcess | load_gen_task | [task.has_next] 10000 items were verified
2014-01-17 13:07:57 | INFO | MainProcess | load_gen_task | [task.has_next] 20000 items were verified
2014-01-17 13:07:57 | INFO | MainProcess | load_gen_task | [task.has_next] 20000 items were verified in 627.347528934 sec.the average number of ops - 31.880256233 per second

2014-01-17 13:08:39 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== XDCRbasetests stats for test #1 load_with_failover ==============
2014-01-17 13:08:41 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] Type of run: BIDIRECTIONAL XDCR
2014-01-17 13:08:41 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] STATS with source at 10.3.2.47 and destination at 10.3.3.101
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Bucket: default
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Average local replica creation rate for bucket 'default': 8.00986759593 KB per second
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Median XDC replication ops for bucket 'default': 0.005 K ops per second
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Mean XDC replication ops for bucket 'default': 0.0107190453205 K ops per second
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Average XDCR data replication rate for bucket 'default': 7.94547104654 KB per second
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] STATS with source at 10.3.3.101 and destination at 10.3.2.47
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Bucket: default
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Average local replica creation rate for bucket 'default': 6.83460734451 KB per second
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Median XDC replication ops for bucket 'default': 0.003 K ops per second
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Mean XDC replication ops for bucket 'default': 0.010893736388 K ops per second
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Average XDCR data replication rate for bucket 'default': 6.81691981277 KB per second
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== = = = = = = = = END = = = = = = = = = = ==============
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== XDCRbasetests cleanup was started for test #1 load_with_failover ==============

2014-01-17 13:09:29 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== XDCRbasetests cleanup was finished for test #1 load_with_failover ==============
ok

----------------------------------------------------------------------
Ran 1 test in 3757.717s

OK
summary so far suite xdcr.biXDCR.bidirectional , pass 1 , fail 0
testrunner logs, diags and results are available under logs/testrunner-14-Jan-17_12-06-51
Run after suite setup for xdcr.biXDCR.bidirectional.load_with_failover

Comment by Junyi Xie (Inactive) [ 17/Jan/14 ]
Sangharsh,

This is a stats issue. Here is action plan from scrub meeting.

1. Junyi will create a toybuild with fix, and ask Sangharsh to rerun the test to verify the fix.

2. If debugging takes longer than expected, because the stat is buggy now, Sangharsh can temporarily remove the check of this stat in the verification code to allow the test to continue. The verification code in the test will still verify that all data on both sides are consistent. This is just to unblock the test; Sangharsh can add the stat check back after the stat is fixed.
Comment by Wayne Siu [ 17/Jan/14 ]
Junyi, please update the ticket when the toybuild is available. Thanks.
Comment by Sangharsh Agarwal [ 19/Jan/14 ]
Junyi,
   I agree with the point from the scrub meeting. Please provide the toy build so that I can verify the fix.

Also, is it possible to add a debug log statement in the server code which can print the value of remaining outbound mutations in ns_server or xdcr logs?
Comment by Junyi Xie (Inactive) [ 19/Jan/14 ]
Sangharsh,

The toybuild is here:

http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent58-2.5.0-toy-junyi-x86_64_2.5.0-MB9707A-toy.rpm

I reran the test using the build on my own VMs (2:3 configuration). The test passes (it took quite long to finish though).


Junyis-MacBook-Pro:testrunner junyi$ ./testrunner -i ~/memo/vm/xmem2.ini items=10000,get-cbcollect-info=True -t xdcr.biXDCR.bidirectional.load_with_failover,replicas=1,items=10000,ctopology=chain,rdirection=bidirection,doc-ops=create-update-delete,doc-ops-dest=create-update,failover=destination,replication_type=xmem,GROUP=P0;


2014-01-19 23:09:31 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== XDCRbasetests cleanup was finished for test #1 load_with_failover ==============
ok

----------------------------------------------------------------------
Ran 1 test in 4251.595s

OK
summary so far suite xdcr.biXDCR.bidirectional , pass 1 , fail 0
testrunner logs, diags and results are available under logs/testrunner-14-Jan-19_21-58-39
Run after suite setup for xdcr.biXDCR.bidirectional.load_with_failover
Junyis-MacBook-Pro:testrunner junyi$


Comment by Sangharsh Agarwal [ 19/Jan/14 ]
Junyi,
   I will run the whole test suite with this toy build and will share the result with you.
Comment by Sangharsh Agarwal [ 20/Jan/14 ]
Junyi,
I have run the whole test suite with this toy build on my VMs. This issue did not occur. Can you please merge this fix?

Please briefly describe the issue that you found and the fix provided in this toybuild.
Comment by Junyi Xie (Inactive) [ 20/Jan/14 ]

- This issue is highly likely to happen on Jenkins, but neither QE nor dev can reproduce it on their own VMs after trying multiple times. This makes diagnosing the root cause very hard.

- At this time it looks like the issue is a stats issue, not a data replication issue.

- The root cause is not 100% clear; it looks like some upstream information (the updated vb map during topology change) consumed by the XDCR stats collection is not fully correct. For example, the new vb map received by XDCR from ns_server is not fully updated and contains dead vbuckets. As a result, the XDCR stats code aggregates the dead vbuckets and the issue can happen.

- It is also possible this is caused by race conditions

- Junyi's fix in the toybuild adds a defensive check so that the stats code only aggregates active vbuckets.


Per discussion with Sangharsh this morning, action items:

Sangharsh:
1) will modify the test to prevent the stat check from crashing the test; instead, the failed stat check should be logged and the test should continue with other verification, e.g., verifying data items on both sides;
2) rerun a set of XDCR tests using the toybuild on his own VMs (per Sangharsh, Jenkins cannot run toybuilds at this time). Because the issue is quite hard for Sangharsh to reproduce on his own VMs, testing the toybuild once is probably not enough to verify that the fix works.


Junyi:
push the fix in toybuild to gerrit and start review.

Comment by Sangharsh Agarwal [ 20/Jan/14 ]
I have modified the test and created review http://review.couchbase.org/#/c/32662/
Comment by Andrei Baranouski [ 20/Jan/14 ]
I can't approve the changes:

1) This is done in a general method.
2) This is tantamount to removing this verification for all tests.
3) If replication_changes_left is buggy, we spend a lot of time on meaningless testing of it (how often is it reproduced?).

@Sangharsh, why can't Jenkins run the toybuild at this time? If that's the case, we should fix it.
As a workaround, you can install toy builds manually and run only the tests on Jenkins.

My opinion is that the test should fail. Another option is that we fail the test with a corresponding message after all other checks; this can be done like this:

__wait_for_mutation_to_replicate returns a boolean (false when a timeout occurs);
get the value in verify_xdcr_stats and, based on the results, determine the final status of the test.
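A minimal sketch of that flow (hypothetical helper names, not the real xdcrbasetests.py code): the stat wait reports a boolean instead of raising, and the pass/fail decision is taken only after the remaining verifications have run.

import time

def wait_for_mutations_to_replicate(get_changes_left, timeout=180, poll=10):
    """Poll an 'outbound mutations' getter; return False on timeout
    instead of raising, so the test can keep going."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_changes_left() == 0:
            return True
        time.sleep(poll)
    return False

def verify_xdcr_stats(get_changes_left, verify_docs):
    """Run the stat wait and the doc verification, then report all failures at once."""
    replicated = wait_for_mutations_to_replicate(get_changes_left)
    docs_ok = verify_docs()  # key/value/revid comparison still runs either way
    failures = []
    if not replicated:
        failures.append("replication_changes_left never reached 0")
    if not docs_ok:
        failures.append("document verification failed")
    return failures  # an empty list means the test passes

The test would then fail with the collected messages only after data verification has completed, instead of aborting on the stat timeout.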






Comment by Junyi Xie (Inactive) [ 20/Jan/14 ]
The commit in the toybuild is now pending review from ns_server team

http://review.couchbase.org/#/c/32663/
Comment by Junyi Xie (Inactive) [ 20/Jan/14 ]
Hi Andrei,

If the stat is not zero, the test should fail, but it should not crash in the middle when this stat check times out. The correct behavior is that the test should finish all verifications and, at the end, given all the verification results, determine whether it should pass or fail.

Comment by Junyi Xie (Inactive) [ 20/Jan/14 ]
Per the bug scrub meeting, we will convert this to a doc bug. Here is the description.

Maria,

Let me know if I missed anything.



"Users may see an incorrect "Outbound mutations" stat after a topology change on the source side. If all XDCR activity has settled down and the data has been replicated, the "Outbound mutations" stat should show 0, meaning there are no remaining mutations to be replicated. Due to a race condition, "Outbound mutations" may include stats from "dead vbuckets" that were active before the rebalance but were migrated to other nodes during the rebalance. If users hit this issue, "Outbound mutations" may show a non-zero value even after all data has been replicated. Users can verify the data on both sides by checking the number of items in the source and destination buckets.

Stopping and restarting XDCR should refresh all stats; if all data has been replicated, the incoming XDCR stats on the destination side will show no set or delete operations, though metadata operations will be seen."
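For illustration only (not part of the doc text above), a minimal sketch of the item-count check it refers to, assuming the standard bucket REST endpoint on port 8091 and hypothetical hosts/credentials; Python 2 is used to match the test tooling of the time.

import base64
import json
import urllib2  # Python 2

def bucket_item_count(host, bucket, user="Administrator", password="password"):
    """Read basicStats.itemCount for a bucket from the REST API."""
    req = urllib2.Request("http://%s:8091/pools/default/buckets/%s" % (host, bucket))
    req.add_header("Authorization", "Basic " + base64.b64encode("%s:%s" % (user, password)))
    return json.load(urllib2.urlopen(req))["basicStats"]["itemCount"]

# If the counts match after all traffic has stopped, a non-zero
# "Outbound mutations" reading is the stats issue described above.
src = bucket_item_count("10.1.3.93", "default")
dst = bucket_item_count("10.1.3.96", "default")
print "source=%d destination=%d match=%s" % (src, dst, src == dst)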



Comment by Maria McDuff (Inactive) [ 20/Jan/14 ]
Cloned doc bug: MB-9960
Comment by Andrei Baranouski [ 21/Jan/14 ]
Hi Junyi,

"If the stat is not zero, the test should fail but it should not crash in the middle when this stats checking time out. The correct behavior is that test should finish all verifications, and at the end of test given all verification results it determines if the test should pass or fail."

completely agree. this is what I meant.

Sangharsh, let's implement this approach for such cases. Let me know if I can do anything to help.
Comment by Sangharsh Agarwal [ 21/Jan/14 ]
Andrei,
  I have uploaded the updated changes. Please review.

http://review.couchbase.org/#/c/32662/
Comment by Sangharsh Agarwal [ 21/Jan/14 ]
I have started Jenkins jobs on the toy build now with changes in the test code:

http://qa.sc.couchbase.com/job/centos_x64--01_01--uniXDCR_biXDCR-P0/39/console

Comment by Sangharsh Agarwal [ 22/Jan/14 ]
Junyi,
   Result of execution http://qa.sc.couchbase.com/job/centos_x64--01_01--uniXDCR_biXDCR-P0/39/consoleFull with the toy build:

The results are unpredictable:

Failed tests: 17
Passed: 3

All the failed tests failed with a wrong number of items on the destination server, and the tests timed out.
Comment by Sangharsh Agarwal [ 22/Jan/14 ]
In the attached snapshot, the right-hand side shows that replication is configured, but on the left side there is no tab for "Outgoing replication" for the bucket.
Comment by Sangharsh Agarwal [ 22/Jan/14 ]
There are two issues I have observed:

1. The replication status was "Starting up" on 10.1.3.93 -> 10.1.3.96 for a long time and no replication was taking place. Please find the logs below.

[SRC]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-9707/332ea7a9/10.1.3.93-1212014-2144-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-9707/55d57001/10.1.3.94-1212014-2146-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-9707/c0f467b5/10.1.3.95-1212014-2145-diag.zip

[DEST]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-9707/c0844e93/10.1.3.96-1212014-2148-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-9707/02eca96e/10.1.3.97-1212014-2147-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-9707/e43d699f/10.1.3.99-1212014-2150-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-9707/eac5f09d/10.1.2.12-1212014-2151-diag.zip


One replication was created from 10.1.3.93 -> 10.1.3.96 at 21:32:30.432, but it did not start replicating:

[user:info,2014-01-21T21:32:30.432,ns_1@10.1.3.93:<0.2181.0>:menelaus_web_xdc_replications:handle_create_replication:50]Replication from bucket "default" to bucket "default" on cluster "cluster1" created.

[xdcr:error,2014-01-21T21:32:31.105,ns_1@10.1.3.93:<0.5811.0>:xdc_vbucket_rep:terminate:377]Shutting xdcr vb replicator ({init_state,
                              {rep,
                               <<"a85361a0f85b08e631930ad342bb1171/default/default">>,
                               <<"default">>,
                               <<"/remoteClusters/a85361a0f85b08e631930ad342bb1171/buckets/default">>,
                               "xmem",
                               [{max_concurrent_reps,32},
                                {checkpoint_interval,1800},
                                {doc_batch_size_kb,2048},
                                {failure_restart_interval,1},
                                {worker_batch_size,500},
                                {connection_timeout,180},
                                {worker_processes,4},
                                {http_connections,20},
                                {retries_per_request,2},
                                {optimistic_replication_threshold,256},
                                {xmem_worker,1},
                                {enable_pipeline_ops,true},
                                {local_conflict_resolution,false},
                                {socket_options,
                                 [{keepalive,true},{nodelay,false}]},
                                {supervisor_max_r,25},
                                {supervisor_max_t,5},
                                {trace_dump_invprob,1000}]},
                              62,"xmem",<0.5411.0>,<0.5412.0>,<0.5408.0>}) down without ever successfully initializing: shutdown

Also, ns_server.xdcr.log on 10.1.3.93 doesn't have any entries between 21:32:31.105 and 21:41:08, which means XDCR was not running at that time.

Then I deleted and re-created the replication from 10.1.3.93 -> 10.1.3.96 and replication started:

[user:info,2014-01-21T21:41:17.165,ns_1@10.1.3.93:<0.2852.2>:menelaus_web_xdc_replications:handle_create_replication:50]Replication from bucket "default" to bucket "default" on cluster "cluster1" created.


Please check whether it is caused by the changes in the toy build.
Comment by Junyi Xie (Inactive) [ 22/Jan/14 ]
That is because of a bunch of db_not_found errors around 21:32:30. The timing in the test does not work on Jenkins.

Both the toybuild and the regular build work well in standalone tests; you also tried several times. I do not understand why it always fails on Jenkins.



[error_logger:error,2014-01-21T21:32:30.708,ns_1@10.1.3.93:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: xdc_vbucket_rep:init/1
    pid: <0.5431.0>
    registered_name: []
    exception exit: {db_not_found,<<"http://Administrator:*****@10.1.2.12:8092/default%2f38%3b2de080ec0a0409811c3560b4779092f1/">>}
      in function gen_server:terminate/6
    ancestors: [<0.5413.0>,<0.5408.0>,xdc_replication_sup,ns_server_sup,
                  ns_server_cluster_sup,<0.59.0>]
    messages: []
    links: [<0.5413.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 4181
    stack_size: 24
    reductions: 16332
  neighbours:
Comment by Dipti Borkar [ 22/Jan/14 ]
Junyi,


The information you added above does not help a user or support verify that this is not a data problem but a stats problem.
Please add information here about how to verify that the # of documents match, or that there is no data left to replicate.
Comment by Junyi Xie (Inactive) [ 23/Jan/14 ]
Dipti,

I do not fully understand what you really need to verify the data. Users can always look at the count of items on both sides to check whether they match. To further determine whether any mutation was not replicated, users can

1) read data from both sides and match them, like the verification code in the test, or
2) as I said, restart XDCR; no sets will be seen if the data is already replicated, which is probably the easiest way.

At this time I am not aware of other solutions.
Comment by Aleksey Kondratenko [ 24/Jan/14 ]
Verified item count stats in the logs just in case. Item counts match. So it does look like a stats problem (unless the test is doing some mutations, in which case simply comparing item counts is not the right way to see if data is indeed replicated).

I believe QE should add a verification pass where they use tap/upr or even _all_docs to actually get all keys alive at the source and destination clusters. Then they can (and should) actually GET all those keys and compare values. Even better would be to also compare metadata (seqnos, cas, expirations, etc.).

Otherwise we'll keep having these useless conversations about whether we're really sure there is or isn't data loss.
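A rough sketch of such a verification pass, assuming the 2.x _all_docs endpoint on the CAPI port (8092), an unauthenticated default bucket, and hypothetical host addresses; with TAP/UPR the key-listing step would be replaced by a stream consumer.

import json
import urllib2  # Python 2, to match the testrunner code of the time

def all_doc_ids(host, bucket):
    """List every document id in a bucket via the (soon to be retired) _all_docs endpoint."""
    rows = json.load(urllib2.urlopen("http://%s:8092/%s/_all_docs" % (host, bucket)))["rows"]
    return set(row["id"] for row in rows)

src_keys = all_doc_ids("10.1.3.93", "default")
dst_keys = all_doc_ids("10.1.3.96", "default")
print "missing on destination:", sorted(src_keys - dst_keys)
print "extra on destination:  ", sorted(dst_keys - src_keys)
# A complete pass would then GET each common key from both sides and compare
# values, and ideally metadata (seqno, CAS, expiration), before declaring "no data loss".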
Comment by Dipti Borkar [ 24/Jan/14 ]
Completely agree, which is why I had asked Junyi to explain the best way to verify that no data loss has occurred: what ep_engine stats or other stats can QE look at to validate no data loss?

Alk, so you suggest _all_docs?

Someone from QE, please work with Alk and Junyi, on how to verify this.
Comment by Aleksey Kondratenko [ 24/Jan/14 ]
>> Alk, so you suggest _all_docs?

Unfortunately I cannot really suggest _all_docs. We're supposed to kill it in a few weeks, and the replacement will not support streaming all of a bucket's documents/keys.

_all_docs is probably easiest to consume today. But looking forward it looks like we'll have to use tap or upr for that. Or even couch_dbdump.
Comment by Sangharsh Agarwal [ 27/Jan/14 ]
Alk,
Current Verification process:

1. Verify Stats on Source and Destination Cluster:
        -> ep_queue_size == 0
        -> curr_items == Num of items on the cluster.
        -> vb_active_items == Num of items on the cluster.
        -> replication_changes_left == 0
2. Verify Data (Key, Value) on both Source and Destination Cluster.

Does _all_docs differ from the above checks?
Comment by Andrei Baranouski [ 27/Jan/14 ]
+ we also _verify_revIds in our tests
Comment by Aleksey Kondratenko [ 27/Jan/14 ]
Thanks, Andrei and Sangharsh.

But how come we're _debating_ whether there's data loss at all if your tests already do all the required verification? Perhaps you should, in all XDCR-related tickets, _clearly_ state whether the test detected actual data loss or not. For extra clarity.
Comment by Andrei Baranouski [ 27/Jan/14 ]
No data loss; the problem is in the outbound mutations stat.
Comment by Aleksey Kondratenko [ 27/Jan/14 ]
Great. "In Andrei I trust" :)
Comment by Cihan Biyikoglu [ 06/Feb/14 ]
based on the last comment - if there isn't data loss, should we still consider this a blocker?
thanks
-cihan
Comment by Maria McDuff (Inactive) [ 10/Mar/14 ]
Raising as a Blocker as it is failing most of the QE tests.
Comment by Aleksey Kondratenko [ 10/Mar/14 ]
Cannot agree with Blocker. QE tests are _supposed_ to check every doc. Therefore QE tests can handle bad stats.

If the issue is not due to bad stats, then it's a _different issue_.
Comment by Aleksey Kondratenko [ 10/Mar/14 ]
If it's blocking some tests, then it appears to be a problem with the tests, as I've mentioned above.

If that's new bug then it requires new ticket.

In any case it requires more coordination.
Comment by Aruna Piravi [ 12/Mar/14 ]
>Cannot agree with Blocker. QE tests are _supposed_ to check every doc. Therefore QE tests can handle bad stats.

QE tests do check every doc. However, _any_ verification starts only when we know replication is complete, and it is this stat, "replication_changes_left", that we heavily rely on to know whether replication has come to a stop. So it is not right to say QE tests can handle bad stats. When this stat doesn't become 0 after replication, our tests time out.

Even in testing pause and resume, we rely on this stat to check whether active_vbreps/incoming XDCR ops on the remote cluster go up after replication is resumed. If stats are buggy, our tests can fail for no good reason.

We are open to coordination, but it's not a new bug; tests have been failing for the last few months because of this stat, and hence this is a blocker from a QE perspective.
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
Here's a suggestion (I was assuming it's obvious): stop testing stats. Only test actual docs.

Do I understand correctly that this will unblock your month-long blocked testing?
Comment by Sangharsh Agarwal [ 12/Mar/14 ]
Here, we don't test this stat; rather, we check it to ensure there are no outbound mutations so that we can proceed with further validation of doc keys, values, revids, etc. Though in 3.0 we are planning to test some of the stats for the Pause and Resume feature (after pause and resume).
Comment by Aruna Piravi [ 12/Mar/14 ]
It would be meaningful to start testing docs only after replication is complete, right? Would it make sense otherwise? And the test has been blocked for more than 3-4 months now.
Comment by Andrei Baranouski [ 12/Mar/14 ]
My suggestion is to not fail the test for now (and I think that is what Alk meant) if mutations are still being replicated after a long time:
https://github.com/couchbase/testrunner/blob/master/pytests/xdcr/xdcrbasetests.py#L733

Another option is to implement the ability to identify this bug, i.e., detect when 'outbound mutations' is unchanged and non-zero (see the sketch after this comment).

According to the comments, I see it happening mostly in failover/warmup scenarios, and the Pause and Resume feature should work as expected in basic cases.
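A sketch of that second option, with a hypothetical stat getter; the idea is to distinguish "stuck and non-zero" (this MB) from "still draining":

import time

def outbound_mutations_plateaued(get_outbound_mutations, window=120, poll=10):
    """Return True if 'outbound mutations' sits at the same non-zero value for
    `window` seconds (the signature of this bug); return False once it reaches 0.
    The caller is expected to wrap this in the test's overall timeout."""
    last = get_outbound_mutations()
    unchanged_since = time.time()
    while True:
        if last == 0:
            return False
        if time.time() - unchanged_since >= window:
            return True
        time.sleep(poll)
        current = get_outbound_mutations()
        if current != last:
            last, unchanged_since = current, time.time()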
Comment by Andrei Baranouski [ 14/May/14 ]
Sangharsh, could you check the XDCR test logs for some jobs/runs to confirm whether we still see the issue with a timeoutError on Outbound mutations in 3.0?
Comment by Sangharsh Agarwal [ 14/May/14 ]
Tests are not stable yet on 3.0; I will update once they are stable.
Comment by Sangharsh Agarwal [ 28/May/14 ]
I am still seeing this issue on 3.0.
Comment by Andrei Baranouski [ 28/May/14 ]
I see; could you specify the build version where you still see it?

@Alk, I think we still ignore checking this stat (outbound replication mutations != 0)?
Comment by Aleksey Kondratenko [ 28/May/14 ]
Andrei, can you elaborate on your question?
Comment by Andrei Baranouski [ 29/May/14 ]
CBQE-2280 - ticket to create separate tests for stats verification

Sangharsh, could you provide complete information on the current problem (build, test, steps, logs, collect_info) and then assign it to Alk?

Comment by Sangharsh Agarwal [ 29/May/14 ]
Andrei,

>Sangharsh, could you provide complete information on the current problem: build, test, steps, logs, collect_info and then assign it on Alk

I think the original issue is still not fixed. Do you think XDCR over UPR will fix this issue?
Comment by Andrei Baranouski [ 29/May/14 ]
Sangharsh, do you expect that devs will study the logs on the old version?
Comment by Sangharsh Agarwal [ 29/May/14 ]
No, I really don't want that. If you look at the history of this bug, it has been reproduced 5-6 times, and every time logs were posted. If there is any improvement in the logs/product which can help in analyzing this issue, then I think it is advisable to upload new logs.

Anyways:

Build: 721 (Upgrade tests)

XDCR UPR is used after upgrade.


[Jenkins]
http://qa.hq.northscale.net/job/centos_x64--104_01--XDCR_upgrade-P1/5/consoleFull

[Test]
./testrunner -i centos_x64--104_01--XDCR_upgrade-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,upgrade_version=3.0.0-721-rel -t xdcr.upgradeXDCR.UpgradeTests.online_cluster_upgrade,initial_version=2.0.0-1976-rel,sdata=False,bucket_topology=default:1>2;standard_bucket0:1<2;bucket0:1><2,post-upgrade-actions=src-rebalancein;dest-rebalanceout;dest-create_index


[Number of tests with this issue] (Search for the string "Timeout occurs while waiting for mutations to be replicated" in the above link)
6

[Test Steps]
1. Set up 2.0 source and destination clusters with 2 nodes each
2. XDCR, capi mode:
     bucket0 <-> bucket0 (Load 1000 items on each side)
     default -> default (Load 1000 items on Source)
     standard_bucket0 <-- standard_bucket0 (Load 1000 items on destination)
3. Upgrade nodes to 3.0.0-721-rel.
4. Perform mutations on each nodes. (update and deletes)
5. Rebalance in and Rebalance out one node at Source and Destination nodes respectively.
6. Verify stats on both sides.
 

[Test Logs]
2014-05-23 11:09:04,639] - [task:1054] INFO - 3000 items were verified in 2.99595117569 sec.the average number of ops - 1001.3507945 per second
[2014-05-23 11:09:04,639] - [xdcrbasetests:1332] INFO - Waiting for Outbound mutation to be zero on cluster node: 10.3.3.126
[2014-05-23 11:09:04,762] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:04,863] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:04,864] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:14,972] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:15,067] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:15,068] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:25,181] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:25,281] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:25,282] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:35,399] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:35,500] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:35,501] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:45,614] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:45,720] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:45,721] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:55,835] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:55,934] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:55,935] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:06,047] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:06,146] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:06,147] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:16,263] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:16,363] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:16,364] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:26,476] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:26,577] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:26,578] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:36,692] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:36,795] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:36,796] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:46,916] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:47,023] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:47,024] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:57,134] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:57,235] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:57,236] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:07,314] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:07,414] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:07,416] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:17,523] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:17,628] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:17,639] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:27,752] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:27,853] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:27,854] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:37,969] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:38,075] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:38,076] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:48,188] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:48,289] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:48,290] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:58,403] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:58,506] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:58,507] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:12:08,518] - [xdcrbasetests:1351] ERROR - Timeout occurs while waiting for mutations to be replicated


[Logs]

[Source]
10.3.3.126 : https://s3.amazonaws.com/bugdb/jira/MB-9707/3dc2d4a2/10.3.3.126-5232014-2223-diag.zip
10.3.3.126 : https://s3.amazonaws.com/bugdb/jira/MB-9707/ce380435/10.3.3.126-diag.txt.gz
10.3.5.61 : https://s3.amazonaws.com/bugdb/jira/MB-9707/b122888f/10.3.5.61-diag.txt.gz
10.3.5.61 : https://s3.amazonaws.com/bugdb/jira/MB-9707/f2dd5236/10.3.5.61-5232014-2226-diag.zip


[Destination]
10.3.121.199 : https://s3.amazonaws.com/bugdb/jira/MB-9707/f4d0554b/10.3.121.199-5232014-2231-diag.zip
10.3.121.199 : https://s3.amazonaws.com/bugdb/jira/MB-9707/f989626e/10.3.121.199-diag.txt.gz
10.3.5.11 : https://s3.amazonaws.com/bugdb/jira/MB-9707/86246c34/10.3.5.11-diag.txt.gz
10.3.5.11 : https://s3.amazonaws.com/bugdb/jira/MB-9707/fe85afac/10.3.5.11-5232014-2229-diag.zip


[Node added on Source]
10.3.5.60 : https://s3.amazonaws.com/bugdb/jira/MB-9707/19ef5b23/10.3.5.60-diag.txt.gz
10.3.5.60 : https://s3.amazonaws.com/bugdb/jira/MB-9707/b708d7af/10.3.5.60-5232014-2233-diag.zip

Comment by Andrei Baranouski [ 29/May/14 ]
At the moment this is all that is required to move forward with this bug.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
MB-7432




[MB-11570] XDCR checkpointing: Increment num_failedckpts stat when checkpointing fails with 404 error Created: 26/Jun/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centOS

Triage: Untriaged
Is this a Regression?: No

 Description   
Scenario
------------
Do a failover at destination
The next immediate checkpoint on a failed over vbucket will fail with error 404
However, you will not notice any change in the "last 10 failed checkpoints per node" stat in the GUI. This is not the case with error code 400.


Can we increment this stat on 404 errors as well? Please let me know if you need logs.

 Comments   
Comment by Aleksey Kondratenko [ 26/Jun/14 ]
Yes I need logs.
Comment by Aruna Piravi [ 26/Jun/14 ]
Same set of logs as in MB-11571.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Aruna, Anil, Wayne .. July 17th
Comment by Aleksey Kondratenko [ 18/Jul/14 ]
This is actually because this code can only track the last 10 checkpoints per node. It's not something new, and I'm not sure I'll bother to fix this.
Comment by Aruna Piravi [ 25/Jul/14 ]
Yes, it tracks only the last 10 checkpoints per node, so I'm not going to push for a fix. It might help with some unit testing in the future. Also, I filed this MB after consulting you, so I will leave it to you.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Actually, I've already fixed it in my work tree as part of the stats work that I'll submit under MB-7432.




[MB-11797] Rebalance-out hangs during Rebalance + Views operation in DGM run Created: 23/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Meenakshi Goel Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-973-rel

Attachments: Text File logs.txt    
Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Link:
http://qa.sc.couchbase.com/job/ubuntu_x64--65_02--view_query_extended-P1/145/consoleFull

Test to Reproduce:
./testrunner -i /tmp/ubuntu12-view6node.ini get-delays=True,get-cbcollect-info=True -t view.createdeleteview.CreateDeleteViewTests.incremental_rebalance_out_with_ddoc_ops,ddoc_ops=create,test_with_view=True,num_ddocs=2,num_views_per_ddoc=3,items=200000,active_resident_threshold=10,dgm_run=True,eviction_policy=fullEviction

Steps to Reproduce:
1. Setup 5-node cluster
2. Create default bucket
3. Load 200000 items
4. Load bucket to achieve dgm 10%
5. Create Views
6. Start ddoc + Rebalance out operations in parallel

Please refer attached log file "logs.txt".

Uploading Logs:


 Comments   
Comment by Meenakshi Goel [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/8586d8eb/172.23.106.201-7222014-2350-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/ea5d5a3f/172.23.106.199-7222014-2354-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d06d7861/172.23.106.200-7222014-2355-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/65653f65/172.23.106.198-7222014-2353-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/dd05a054/172.23.106.197-7222014-2352-diag.zip
Comment by Sriram Melkote [ 23/Jul/14 ]
Nimish - to my eyes, it looks like views are not involved in this failure. Can you please take a look at the detailed log and assign to Alk if you agree? Thanks
Comment by Nimish Gupta [ 23/Jul/14 ]
From the logs:

[couchdb:info,2014-07-22T14:47:21.345,ns_1@172.23.106.199:<0.17993.2>:couch_log:info:39]Set view `default`, replica (prod) group `_design/dev_ddoc40`, signature `c018b62ae9eab43522a3d0c43ac48b3e`, terminating with reason: {upr_died,
                                                                                                                                       {bad_return_value,
                                                                                                                                        {stop,
                                                                                                                                         sasl_auth_failed}}}

One obvious problem is that we returned the wrong number of parameters for stop when SASL auth failed. I have fixed that, and it is under review (http://review.couchbase.org/#/c/39735/).

I don't know why SASL auth failed, or whether it may be normal for SASL auth to fail during rebalance. Meenakshi, could you please run the test again after this change is merged?
Comment by Nimish Gupta [ 23/Jul/14 ]
Trond has added code to log more information for SASL errors in memcached (http://review.couchbase.org/#/c/39738/). It will be helpful for debugging SASL errors.
Comment by Meenakshi Goel [ 24/Jul/14 ]
Issue is reproducible with latest build 3.0.0-1020-rel.
http://qa.sc.couchbase.com/job/ubuntu_x64--65_03--view_dgm_tests-P1/99/consoleFull
Uploading Logs shortly.
Comment by Meenakshi Goel [ 24/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/13f68e9c/172.23.106.186-7242014-1238-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/c0cf8496/172.23.106.187-7242014-1239-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/77b2fb50/172.23.106.188-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d0335545/172.23.106.189-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/7634b520/172.23.106.190-7242014-1241-diag.zip
Comment by Nimish Gupta [ 24/Jul/14 ]
From the ns_server logs, it looks to me like memcached has crashed.

[error_logger:error,2014-07-24T12:28:36.305,ns_1@172.23.106.186:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_memcached:init/1
    pid: <0.693.0>
    registered_name: []
    exception exit: {badmatch,{error,closed}}
      in function gen_server:init_it/6 (gen_server.erl, line 328)
    ancestors: ['single_bucket_sup-default',<0.675.0>]
    messages: []
    links: [<0.717.0>,<0.719.0>,<0.720.0>,<0.277.0>,<0.676.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 75113
    stack_size: 27
    reductions: 26397931
  neighbours:

Ep-engine/ns_server team please take a look.
Comment by Nimish Gupta [ 24/Jul/14 ]
From the logs:

** Reason for termination ==
** {unexpected_exit,
       {'EXIT',<0.31044.9>,
           {{{badmatch,{error,closed}},
             {gen_server,call,
                 ['ns_memcached-default',
                  {get_dcp_docs_estimate,321,
                      "replication:ns_1@172.23.106.187->ns_1@172.23.106.188:default"},
                  180000]}},
            {gen_server,call,
                [{'janitor_agent-default','ns_1@172.23.106.187'},
                 {if_rebalance,<0.15733.9>,
                     {wait_dcp_data_move,['ns_1@172.23.106.188'],321}},
                 infinity]}}}}
Comment by Sriram Melkote [ 25/Jul/14 ]
Alk, can you please take a look? Thanks!
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Quick hint for fellow coworkers: when you see a connection closed, usually the first thing to check is whether memcached has crashed. And in this case it indeed has (the diag's cluster-wide logs are the perfect place to find these issues):

2014-07-24 12:28:35.861 ns_log:0:info:message(ns_1@172.23.106.186) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:09:47.941525 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 650) stream created with start seqno 5794 and end seqno 18446744073709551615
Thu Jul 24 12:09:49.115570 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 749, cookie 0x606f800
Thu Jul 24 12:09:49.380310 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 648, cookie 0x6070d00
Thu Jul 24 12:09:49.450869 PDT 3: (default) UPR (Consumer) eq_uprq:replication:ns_1@172.23.106.189->ns_1@172.23.106.186:default - (vb 648) Attempting to add takeover stream with start seqno 5463, end seqno 18446744073709551615, vbucket uuid 35529072769610, snap start seqno 5463, and snap end seqno 5463
Thu Jul 24 12:09:49.495674 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 648) stream created with start seqno 5463 and end seqno 18446744073709551615
2014-07-24 12:28:36.302 ns_memcached:0:info:message(ns_1@172.23.106.186) - Control connection to memcached on 'ns_1@172.23.106.186' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_memcached:0:info:message(ns_1@172.23.106.187) - Control connection to memcached on 'ns_1@172.23.106.187' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_log:0:info:message(ns_1@172.23.106.187) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:28:35.860224 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1019) Stream closing, 0 items sent from disk, 0 items sent from memory, 5781 was last seqno sent
Thu Jul 24 12:28:35.860235 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1020) Stream closing, 0 items sent from disk, 0 items sent from memory, 5879 was last seqno sent
Thu Jul 24 12:28:35.860246 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1021) Stream closing, 0 items sent from disk, 0 items sent from memory, 5772 was last seqno sent
Thu Jul 24 12:28:35.860256 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1022) Stream closing, 0 items sent from disk, 0 items sent from memory, 5427 was last seqno sent
Thu Jul 24 12:28:35.860266 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1023) Stream closing, 0 items sent from disk, 0 items sent from memory, 5480 was last seqno sent

Status 137 is 128 (death by signal, set by the kernel) + 9, so signal 9 (SIGKILL). dmesg (captured in couchbase.log) does not show signs of OOM. This means: humans :) Not the first and sadly not the last time something like this happens. Rogue scripts, bad tests, etc.
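For reference, a minimal sketch of that decoding (the 137 comes from the log above; nothing else here is from the ticket):

import signal

# Reverse map of signal numbers to names, e.g. 9 -> 'SIGKILL'.
SIGNAL_NAMES = dict((int(getattr(signal, name)), name) for name in dir(signal)
                    if name.startswith("SIG") and not name.startswith("SIG_"))

def decode_port_exit_status(status):
    """ns_server reports 128 + N when a port server dies from signal N;
    anything below 128 is a normal exit code."""
    if status >= 128:
        signum = status - 128
        return "killed by signal %d (%s)" % (signum, SIGNAL_NAMES.get(signum, "unknown"))
    return "exited normally with code %d" % status

print decode_port_exit_status(137)  # -> killed by signal 9 (SIGKILL)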
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Also, we should stop the practice of reusing tickets for unrelated conditions. This doesn't look anywhere close to a rebalance hang, does it?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Not sure what to do about this one. Closing as incomplete will probably not hurt.




[MB-10376] XDCR Pause and Resume : Pausing during rebalance-in does not flush XDCR queue on all nodes Created: 05/Mar/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 10.3.4.186-352014-1846-diag.zip     Zip Archive 10.3.4.187-352014-1849-diag.zip     Zip Archive 10.3.4.188-352014-1851-diag.zip     PNG File Screen Shot 2014-03-05 at 6.37.50 PM.png     PNG File Screen Shot 2014-03-05 at 6.38.12 PM.png    
Triage: Triaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Build
-------
3.0.0- 400

Scenario
--------------
- 2 one-node clusters , create uni-xdcr (checkpoint interval = 60secs)
- Add in one node on source, rebalance
- Pause replication while rebalance-in is in progress
- Observe outbound xdcr stats for the bucket
- rebalance-in is stuck (MB-10385)

While active_vbreps immediately goes down to 0, the XDCR queue is not flushed even after 20 mins (which I think is a bug in itself).
By XDCR queue I mean docs_rep_queue and size_rep_queue.
In the screenshot you will observe that on source nodes-
docs_rep_queue on 10.3.4.186 is 199
docs_rep_queue on 10.3.4.187 is 0.

Note:
10.3.4.186 is the master, .187 is the new node.
active_vbreps drops to zero on both nodes anyway, causing replication to stop. I'm not sure whether the unflushed queue and the stuck rebalance are related; just sharing my observation.

Attached
--------------
Screenshot and cbcollect from .186, .187 (source) and .188 (target)


 Comments   
Comment by Aleksey Kondratenko [ 06/Mar/14 ]
Is there any reason to believe that xdcr affects this case at all ?
Comment by Aruna Piravi [ 06/Mar/14 ]
Well, when I started writing this bug report, it was basically to draw attention to the fact that the XDCR queue was not getting flushed on one of the two nodes. Later I noted that rebalance was stuck on the same node, which was a more serious issue. I was not sure if they were related.

To check that,
I resumed replication- rebalance was still stuck.
I deleted replication - rebalance was still stuck
I stopped and started rebalance again - stuck at 0%

If rebalancing in one node on a one-node cluster fails, it must either be a regression or something related to the XDCR operations I was performing in parallel (creating, deleting, pausing and resuming replications). I'm also not sure whether rebalance uses UPR yet. Maybe I should open two separate bugs for the XDCR queue not being flushed and for rebalance getting stuck?

Comment by Aleksey Kondratenko [ 06/Mar/14 ]
rebalance being stuck is likely duplicated somewhere.

Rest of bug description looks like stats bug.
Comment by Aleksey Kondratenko [ 06/Mar/14 ]
Rebalance is not using upr yet. But there are already some regressions AFAIK
Comment by Aruna Piravi [ 06/Mar/14 ]
I see a bunch of rebalance-stuck issues which ep-engine says are a result of the TAP/UPR refactoring, some related to memory leaks in TAP and to checkpoints waiting to be persisted (already closed). I'm not sure what is causing it here, so it would be good if ns_server or ep-engine takes a look at the logs. Filing a separate bug - MB-10385. Please feel free to close it as a duplicate if the logs give reason to believe so.

We can use this issue to track the xdcr unflushed queue or stats problem.
Comment by Anil Kumar [ 19/Jun/14 ]
Triage - June 19 2014 Alk, Parag, Anil
Comment by Aruna Piravi [ 25/Jul/14 ]
Finding this fixed. Closing this issue.




[MB-10680] XDCR Pause/Resume: Resume(during rebalance-in) causes replication status of existing replications of target cluster(which has failed over node) to go to "starting up" mode Created: 27/Mar/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Aruna Piravi
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 4GB RAM, 4 core VMs

Attachments: Zip Archive 172.23.106.209-3272014-186-diag.zip     Zip Archive 172.23.106.45-3272014-180-diag.zip     Zip Archive 172.23.106.46-3272014-182-diag.zip     Zip Archive 172.23.106.47-3272014-183-diag.zip     Zip Archive 172.23.106.48-3272014-185-diag.zip     PNG File Screen Shot 2014-03-27 at 5.28.24 PM.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
Scenario
--------------
1. Create two clusters with 2 nodes each and create bi-xdcr on 2 buckets. Load data, watch replication. Pause all replications at C1. C2 continues to replicate to C1.
2. Rebalance-in one node at cluster1 while failing over one node and rebalancing it out at cluster2. Resume all replications at C1.
3. Notice that on cluster2, all ongoing replications go from "replicating" to "starting up" mode and there's no outbound replication category for any of the cluster buckets.

Setup
--------
[Cluster1]
172.23.106.45
172.23.106.46 <--- 172.23.106.209 [rebalance-in]

[Cluster2]
172.23.106.47
172.23.106.48 ---> failover and rebalance-out

Reproducible?
---------------------
Yes, consistently, tried thrice.

Attached
--------------
cbcollect info and screenshot

Script
--------
./testrunner -i /tmp/bixdcr.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,items=30000,rdirection=bidirection,ctopology=chain,sasl_buckets=1,rebalance_in=source,rebalance_out=destination,failover=destination,pause=source



Will scale down to one replication and also try with xmem.

 Comments   
Comment by Aruna Piravi [ 28/Mar/14 ]
Not seen with XMEM and just 1 bi-xdcr.
Comment by Aruna Piravi [ 28/Mar/14 ]
However, it is seen with XMEM and 2 bi-XDCRs. Also, there's no XDCR activity between the clusters after the replication status changes to "starting up" on one cluster.

Mem usage was between 20-30% (unusual for 2 bi-xdcrs but justified due to lack of xdcr activity)

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11013 couchbas 20 0 864m 303m 6408 S 45.4 7.8 68:08.01 memcached
10960 couchbas 20 0 2392m 1.0g 39m S 13.3 27.4 178:20.21 beam.smp

Also not seen with CAPI and just 1 bi-XDCR. Two replications on 4GB of RAM could be a reason. Let me try on bigger VMs and get back.
Comment by Aruna Piravi [ 28/Mar/14 ]
Reproduced on VMs with 15GB RAM. Related to Pause and Resume. Resuming replications at a cluster (which is also rebalancing in a node) kills XDCR at the remote cluster (which is rebalancing out).
Comment by Aruna Piravi [ 23/Apr/14 ]
Any update on this bug?
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Aruna, Anil, Wayne .. July 17th
Comment by Aruna Piravi [ 25/Jul/14 ]
Finding this fixed in latest builds. Closing this MB. Thanks




[MB-11559] Memcached segfault right after initial cluster setup (master builds) Created: 26/Jun/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: bug-backlog
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Dave Rigby
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: couchbase-server-enterprise_centos6_x86_64_0.0.0-1564-rel.rpm

Attachments: Zip Archive 000-1564.zip     Text File gdb.log    
Issue Links:
Duplicate
is duplicated by MB-11562 memcached crash with segmentation fau... Resolved
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Comments   
Comment by Dave Rigby [ 28/Jun/14 ]
This is caused by some of the changes added (on 3.0.1 branch) by MB-11067. Fix incoming (prob Monday).
Comment by Dave Rigby [ 30/Jun/14 ]
http://review.couchbase.org/#/c/38968/

Note: depends on refactor of stats code: http://review.couchbase.org/#/c/38967




[MB-11811] [Tools] Change UPR to DCP for tools Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Bin Cui [ 24/Jul/14 ]
http://review.couchbase.org/#/c/39814/




[MB-11785] mcd aborted in bucket_engine_release_cookie: "es != ((void *)0)" Created: 22/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Tommie McAfee Assignee: Tommie McAfee
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 64 vb cluster_run -n1

Attachments: Zip Archive collectinfo-2014-07-22T192534-n_0@127.0.0.1.zip    
Triage: Untriaged
Is this a Regression?: Yes

 Description   
Observed while running pyupr unit tests against the latest from the rel-3.0.0 branch.

After about 20 tests, the crash occurred on test_failover_log_n_producers_n_vbuckets. This test passes standalone, so I think it's a matter of running all the tests in succession and then hitting this issue.

backtrace:

Thread 228 (Thread 0x7fed2e7fc700 (LWP 695)):
#0 0x00007fed8b608f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fed8b60c388 in __GI_abort () at abort.c:89
#2 0x00007fed8b601e36 in __assert_fail_base (fmt=0x7fed8b753718 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7fed8949f28c "es != ((void *)0)",
    file=file@entry=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=line@entry=3301,
    function=function@entry=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:92
#3 0x00007fed8b601ee2 in __GI___assert_fail (assertion=0x7fed8949f28c "es != ((void *)0)",
    file=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=3301,
    function=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:101
#4 0x00007fed8949d13d in bucket_engine_release_cookie (cookie=0x5b422e0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:3301
#5 0x00007fed8835343f in EventuallyPersistentEngine::releaseCookie (this=0x7fed4808f5d0, cookie=0x5b422e0)
    at /couchbase/ep-engine/src/ep_engine.cc:1883
#6 0x00007fed8838d730 in ConnHandler::releaseReference (this=0x7fed7c0544e0, force=false)
    at /couchbase/ep-engine/src/tapconnection.cc:306
#7 0x00007fed883a4de6 in UprConnMap::shutdownAllConnections (this=0x7fed4806e4e0)
    at /couchbase/ep-engine/src/tapconnmap.cc:1004
#8 0x00007fed88353e0a in EventuallyPersistentEngine::destroy (this=0x7fed4808f5d0, force=true)
    at /couchbase/ep-engine/src/ep_engine.cc:2034
#9 0x00007fed8834dc05 in EvpDestroy (handle=0x7fed4808f5d0, force=true) at /couchbase/ep-engine/src/ep_engine.cc:142
#10 0x00007fed89498a54 in engine_shutdown_thread (arg=0x7fed48080540)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1564
#11 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480a5b60) at /couchbase/platform/src/cb_pthreads.c:19
#12 0x00007fed8beba182 in start_thread (arg=0x7fed2e7fc700) at pthread_create.c:312
#13 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 226 (Thread 0x7fed71790700 (LWP 693)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093e80, mutex=0x7fed78093e48, ms=720)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78093e40, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78093e40, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78093e40, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78093e40, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480203e0) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71790700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 225 (Thread 0x7fed71f91700 (LWP 692)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093830, mutex=0x7fed780937f8, ms=86390052)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780937f0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780937f0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780937f0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780937f0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801d490) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71f91700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 224 (Thread 0x7fed72792700 (LWP 691)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3894)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801a670) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed72792700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 223 (Thread 0x7fed70f8f700 (LWP 690)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3893)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed48017850) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed70f8f700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 222 (Thread 0x7fed7078e700 (LWP 689)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1672)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b8e90) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed7078e700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 221 (Thread 0x7fed0effd700 (LWP 688)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1673)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b6890) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed0effd700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 210 (Thread 0x7fed0f7fe700 (LWP 661)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed740e8910)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed740667e0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 201 (Thread 0x7fed0ffff700 (LWP 644)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed74135070)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed74050ef0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 192 (Thread 0x7fed2cff9700 (LWP 627)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7c90)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c078340) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2cff9700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 183 (Thread 0x7fed2d7fa700 (LWP 610)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009e000)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5009dfe0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2d7fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 174 (Thread 0x7fed2dffb700 (LWP 593)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009dc30)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed50031010) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2dffb700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 165 (Thread 0x7fed2f7fe700 (LWP 576)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed481cef20)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480921c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 147 (Thread 0x7fed2effd700 (LWP 541)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed540015d0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54057b80) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2effd700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 138 (Thread 0x7fed6df89700 (LWP 523)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed78092aa0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78056ea0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6df89700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 120 (Thread 0x7fed2ffff700 (LWP 489)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7d10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c1b7ac0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 111 (Thread 0x7fed6cf87700 (LWP 472)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5008c030)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500adf50) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6cf87700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 102 (Thread 0x7fed6d788700 (LWP 455)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080450)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54091560) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6d788700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 93 (Thread 0x7fed6ff8d700 (LWP 438)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080ad0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54068db0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ff8d700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 57 (Thread 0x7fed6e78a700 (LWP 370)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50080230)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5008c360) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6e78a700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 48 (Thread 0x7fed6ef8b700 (LWP 352)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50000c10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500815b0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ef8b700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 39 (Thread 0x7fed6f78c700 (LWP 334)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed4807c290)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4806e4c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6f78c700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 13 (Thread 0x7fed817fa700 (LWP 292)):
#0 0x00007fed8b693d7d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8b6c5334 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2 0x00007fed88386dd2 in updateStatsThread (arg=0x7fed780343f0) at /couchbase/ep-engine/src/memory_tracker.cc:36
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78034450) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed817fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 10 (Thread 0x7fed8aec4700 (LWP 116)):
#0 0x00007fed8b6be6bd in read () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8b64d4e0 in _IO_new_file_underflow (fp=0x7fed8b992640 <_IO_2_1_stdin_>) at fileops.c:613
#2 0x00007fed8b64e46e in __GI__IO_default_uflow (fp=0x7fed8b992640 <_IO_2_1_stdin_>) at genops.c:435
#3 0x00007fed8b642184 in __GI__IO_getline_info (fp=0x7fed8b992640 <_IO_2_1_stdin_>, buf=0x7fed8aec3e40 "", n=79, delim=10,
    extract_delim=1, eof=0x0) at iogetline.c:69
#4 0x00007fed8b641106 in _IO_fgets (buf=0x7fed8aec3e40 "", n=0, fp=0x7fed8b992640 <_IO_2_1_stdin_>) at iofgets.c:56
#5 0x00007fed8aec5b24 in check_stdin_thread (arg=0x41c0ee <shutdown_server>)
    at /couchbase/memcached/extensions/daemon/stdin_check.c:38
#6 0x00007fed8cf43963 in platform_thread_wrap (arg=0x1a66250) at /couchbase/platform/src/cb_pthreads.c:19
#7 0x00007fed8beba182 in start_thread (arg=0x7fed8aec4700) at pthread_create.c:312
#8 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 9 (Thread 0x7fed89ea3700 (LWP 117)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed8a6c3280 <cond>, mutex=0x7fed8a6c3240 <mutex>, ms=19000)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8a4c0fea in logger_thead_main (arg=0x1a66fe0) at /couchbase/memcached/extensions/loggers/file_logger.c:372
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x1a67050) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed89ea3700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 (Thread 0x7fed89494700 (LWP 135)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9cb0) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd0f0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed89494700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7fed88c93700 (LWP 136)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9da0) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd240) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed88c93700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7fed83fff700 (LWP 137)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9e90) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd390) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed83fff700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7fed837fe700 (LWP 138)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9f80) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd4e0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed837fe700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fed82ffd700 (LWP 139)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca070) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd630) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed82ffd700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7fed827fc700 (LWP 140)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca160) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd780) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed827fc700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7fed81ffb700 (LWP 141)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca250) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd8d0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed81ffb700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7fed8d764780 (LWP 113)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041d24e in main (argc=3, argv=0x7fff77aaa838) at /couchbase/memcached/daemon/memcached.c:8797

 Comments   
Comment by Chiyoung Seo [ 23/Jul/14 ]
Abhinav,

The backtrace indicates that the abort crash was caused by closing all the UPR connections during shutdown, an area where we made some fixes recently.
Comment by Abhinav Dangeti [ 24/Jul/14 ]
Tommie, can you tell how to run these tests, so I could try reproducing on my system?
Comment by Tommie McAfee [ 24/Jul/14 ]
* Start a cluster run node, then:

git clone https://github.com/couchbaselabs/pyupr.git
cd pyupr
./pyupr -h 127.0.0.1:9000 -b dev


Note: all the tests may pass, but memcached can silently abort in the background.
Comment by Abhinav Dangeti [ 24/Jul/14 ]
1. Server side: if a UPR producer or consumer already exists for that cookie, the engine should return DISCONNECT: http://review.couchbase.org/#/c/39843
2. py-upr: in the test test_failover_log_n_producers_n_vbuckets you are essentially opening 1 connection and sending 1024 open-connection messages, so many tests will need changes.
Comment by Chiyoung Seo [ 24/Jul/14 ]
Tommie,

The server side fix was merged.

Can you please fix the issue in the test script and retest it?
Comment by Tommie McAfee [ 25/Jul/14 ]
Thanks, working now; the affected tests pass with this patch:

http://review.couchbase.org/#/c/39878/1




[MB-11778] upr replica is unable to detect death of upr producer (was: Some replica items not deleted) Created: 21/Jul/14  Updated: 24/Jul/14  Resolved: 22/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aruna Piravi
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centOS 6.x

Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://172.23.106.47:8091/index.html
http://172.23.106.45:8091/index.html

https://s3.amazonaws.com/bugdb/jira/MB-11573/logs.tar
Is this a Regression?: Unknown

 Description   
I'm seeing a bug similar to MB-11573 on build 991: 600 replica items haven't been deleted. However, curr_items and vb_active_curr_items are correct.


2014-07-21 18:18:44 | INFO | MainProcess | Cluster_Thread | [task.check] Saw curr_items 2800 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.47:11210 default
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.48:11210 default
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [task.check] Saw vb_active_curr_items 2800 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.47:11210 default
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.48:11210 default
2014-07-21 18:18:45 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 3400 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
2014-07-21 18:18:48 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 3400 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091', sasl_bucket_1 bucket
2014-07-21 18:18:49 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 3400 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091', standard_bucket_1 bucket

testcase:
./testrunner -i sanity.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,reboot=dest_node,items=2000,rdirection=bidirection,replication_type=xmem,standard_buckets=1,sasl_buckets=1,pause=source-destination,doc-ops=update-delete,doc-ops-dest=update-delete

What the test does:

3 nodes x 3 nodes, bi-directional XDCR on 3 buckets
1. Load 2k items on both clusters. Pause all xdcr(all items got replicated by this time)
2. Reboot one dest node (.48)
3. After warmup, resume replication on all buckets, on both clusters
4. 30% Update, 30% delete items on both sides. No expiration set.
5. Verify item count , value and rev-ids.


The cluster is available for debugging until tomorrow morning. Thanks.

 Comments   
Comment by Chiyoung Seo [ 21/Jul/14 ]
Mike,

Can you please look at this issue? The live cluster is available now.

Seems like the deletions are not replicated.
Comment by Mike Wiederhold [ 22/Jul/14 ]
The cluster looks fine right now, so the problem seems to have worked itself out. In the future, please run one of the scripts we have to figure out which vbuckets are mismatched in the cluster; this will greatly reduce the amount of time needed to look through the cbcollectinfo logs.
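For reference, a rough stand-in for such a script (hypothetical, not the actual QE script) is sketched below. It assumes cbstats is on the PATH and that the vbucket-details group exposes a per-vbucket num_items stat, and it only looks at the two nodes named in this ticket:

import subprocess
from collections import defaultdict

nodes = ["172.23.106.47", "172.23.106.48"]      # nodes from this ticket
per_vb = defaultdict(dict)                      # vb id -> {node: (state, num_items)}

for node in nodes:
    out = subprocess.check_output(
        ["cbstats", node + ":11210", "vbucket-details", "-b", "default"]).decode()
    state, count = {}, {}
    for line in out.splitlines():
        parts = [p.strip() for p in line.split(":")]
        if len(parts) == 2 and parts[1] in ("active", "replica"):
            state[parts[0]] = parts[1]          # e.g. "vb_512" -> "replica"
        elif len(parts) == 3 and parts[1] == "num_items":
            count[parts[0]] = int(parts[2])
    for vb, st in state.items():
        per_vb[vb][node] = (st, count.get(vb, 0))

# With one replica, the active and replica copies of a vbucket should agree
# on item count; print only the vbuckets where the two nodes disagree.
for vb in sorted(per_vb):
    counts = set(c for _, c in per_vb[vb].values())
    if len(counts) > 1:
        print(vb, per_vb[vb])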
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
It was different issue, removed my comment.
Comment by Mike Wiederhold [ 22/Jul/14 ]
Alk,

In the memcached logs it looks like there were missing items at the time this bug was reported. Then, about 2 hours later, I see ns_server create a bunch of replication streams, and all of the items that were "missing" are no longer actually missing. Can you take a look at this from the ns_server side and see why it took so long to create the replication streams?

Also, note that as of right now there is only a live cluster and no cbcollectinfo on the ticket.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Which cluster do I need to look at, Mike?
Comment by Mike Wiederhold [ 22/Jul/14 ]
http://172.23.106.47:8091/index.html (This is the one that had the problem. Node .47 in particular)
http://172.23.106.45:8091/index.html
Comment by Aruna Piravi [ 22/Jul/14 ]
Please note that cbcollectinfo has already been attached under the link-to-logs section (https://s3.amazonaws.com/bugdb/jira/MB-11573/logs.tar), along with the cluster IPs.
Comment by Aruna Piravi [ 22/Jul/14 ]
And cbcollectinfo was grabbed at the time replica items were incorrect.

Just curious: only the replica items were incorrect; active vbucket items on both clusters were correct. Does this still have to do with XDCR?
Comment by Aruna Piravi [ 22/Jul/14 ]
ok, I think Mike meant the intra-cluster replication streams.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
I'm not seeing the UPR replicators spot this shutdown at all. I've also just verified that when I kill -9 memcached, erlang's replicator correctly detects the connection closure and re-establishes connections.

Will now test with VMs and reboot.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Confirmed manually by doing a "hard reset" of a VM and observing that the other VM does not re-establish UPR connections after the reset VM comes back up.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
It is "it".
Comment by Mike Wiederhold [ 22/Jul/14 ]
http://review.couchbase.org/#/c/39683/
Comment by Aruna Piravi [ 24/Jul/14 ]
Verified on 1014. Closing this issue, thanks.




[MB-11806] rebalance should not be allowed when cbrecovery is stopped by REST API or has not completed Created: 23/Jul/14  Updated: 24/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ashvinder Singh Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: ns_server
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos, ubuntu

Triage: Untriaged
Is this a Regression?: Yes

 Description   
Found in build-3.0.0-973-release

Setup: Two clusters: src and dst with 3 nodes each. Please have 2 spare nodes
- Setup xdcr between src and dst cluster
- Ensure xdcr is setup and complete
- Hard Failover two nodes from dst cluster
- Verify nodes failover
- Add two spare nodes in dst cluster
- Initiate cbrecovery from src to dst
- stop cbrecovery using REST API
http://10.3.121.106:8091//pools/default/buckets/default/controller/stopRecovery?recovery_uuid=3ad71c7b3365593e0979da34306fb2a5

- initiate rebalance operation on dst cluster.

Observation: the rebalance operation starts.
Expectation: since the rebalance operation is disallowed from the UI while recovery is ongoing (or halted), it should not be allowed from the REST or CLI interfaces either.
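
For reference, the REST sequence in question boils down to something like the sketch below, assuming the standard /controller/stopRecovery and /controller/rebalance endpoints; the credentials, node names and second/third node IPs are placeholders, only the first node and the recovery UUID come from this report.

import requests

base = "http://10.3.121.106:8091"
auth = ("Administrator", "password")            # placeholder credentials

# Stop the in-progress cbrecovery on the destination cluster.
requests.post(
    base + "/pools/default/buckets/default/controller/stopRecovery",
    params={"recovery_uuid": "3ad71c7b3365593e0979da34306fb2a5"},
    auth=auth)

# Then ask for a rebalance; the node names below are illustrative.
resp = requests.post(
    base + "/controller/rebalance",
    data={"knownNodes": "ns_1@10.3.121.106,ns_1@10.3.121.107,ns_1@10.3.121.108",
          "ejectedNodes": ""},
    auth=auth)

# Observed: the rebalance is accepted; expected: it should be rejected while
# the recovery has not completed.
print(resp.status_code)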


 Comments   
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
First of all, you're doing it wrong here. The intended use of cbrecovery is to recover the _source_ by using data from the destination.
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
Stop recovery means stop recovery. We do allow rebalance in this case by design.
Comment by Andrei Baranouski [ 24/Jul/14 ]
Alk, I do not agree regarding "Expectation: since the rebalance operation is disallowed from the UI while recovery is ongoing (or halted), it should not be allowed from the REST or CLI interfaces either."
I think we shouldn't have the possibility to trigger it via REST if we can't do it from the UI.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
The steps don't match that "shouldn't". Feel free to file a proper bug for "UI doesn't allow but REST does allow" with all the proper details and evidence.




[MB-11813] Windows 64-bit buildbot failed to build new 64-bit builds and did not report an error Created: 24/Jul/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Chris Hillery
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
The Windows 64-bit buildbot failed to produce a new build:
http://builds.hq.northscale.net:8010/builders/server-30-win-x64-300/builds/411
No errors were reported.

 Comments   
Comment by Thuan Nguyen [ 24/Jul/14 ]
This 64-bit builder shows the build as successful, but no build was actually produced.
Comment by Chris Hillery [ 24/Jul/14 ]
The build isn't performed by buildbot; buildbot only spawns the Jenkins job:

http://factory.couchbase.com/job/cs_300_win6408/

And that job is still ongoing.




[MB-11724] 3.0 cluster is briefly down after removing last 2.0 nodes out of cluster during online upgrade Created: 14/Jul/14  Updated: 24/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Artem Stemkovski
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Attachments: Zip Archive 192.168.171.148-7142014-1411-diag.zip     Zip Archive 192.168.171.149-7142014-1412-diag.zip     Zip Archive 192.168.171.150-7142014-1413-diag.zip     Zip Archive 192.168.171.151-7142014-1414-diag.zip    
Triage: Untriaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Links to manifest file of build 3.0.0-960 http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-960-rel.deb.manifest.xml
Is this a Regression?: Unknown

 Description   
Install cb server 2.1.1-764 on 2 nodes (148 and 149)
Create cluster of 2 nodes
Create default bucket and load 5000 items to default bucket.
Install cb server 3.0.0-960 on 2 nodes (150 and 151)
Add node 150 and 151 into 2.1.1 cluster.
Rebalance. Passed
Remove 2 2.1.1 nodes (148 and 149) out of cluster
Rebalance. Passed.
Right after the rebalance finishes, the 3.0 cluster with 2 nodes (150 and 151) is briefly down for about 30 to 45 seconds.
Then the 3.0 cluster comes back up.

Output from test run
2014-07-14 14:05:22,473 - root - INFO - rebalance params : password=password&ejectedNodes=ns_1%40192.168.171.148%2Cns_1%40192.168.171.149&user=Administrator&knownNodes=ns_1%40192.168.171.149%2Cns_1%40192.168.171.151%2Cns_1%40192.168.171.148%2Cns_1%40192.168.171.150
2014-07-14 14:05:22,481 - root - INFO - rebalance operation started
2014-07-14 14:05:22,489 - root - INFO - rebalance percentage : 0 %
2014-07-14 14:05:32,499 - root - INFO - rebalance percentage : 100.0 %
2014-07-14 14:05:41,120 - root - ERROR - socket error while connecting to http://192.168.171.148:8091/pools error [Errno 61] Connection refused
2014-07-14 14:05:42,122 - root - ERROR - socket error while connecting to http://192.168.171.148:8091/pools error [Errno 61] Connection refused
2014-07-14 14:05:43,136 - root - INFO - rebalancing was completed with progress: 100.0% in 20.654556036 sec
2014-07-14 14:05:43,137 - root - INFO - sleep for 15 secs. ...
2014-07-14 14:05:58,148 - root - INFO - existing buckets : [u'default']
2014-07-14 14:05:58,148 - root - INFO - found bucket default
2014-07-14 14:05:59,158 - root - INFO - creating direct client 192.168.171.150:11210 default
2014-07-14 14:05:59,187 - root - INFO - Saw ep_queue_size 0 == 0 expected on '192.168.171.150:8091',default bucket
2014-07-14 14:05:59,190 - root - ERROR - http://192.168.171.151:8091/nodes/self error 401 reason: status: 401, content:
http://192.168.171.151:8091/nodes/self:
2014-07-14 14:05:59,193 - root - ERROR - http://192.168.171.151:8091/nodes/self error 401 reason: status: 401, content:
2014-07-14 14:05:59,196 - root - ERROR - http://192.168.171.151:8091/pools/default error 401 reason: status: 401, content:
2014-07-14 14:05:59,196 - root - INFO - sleep 10 seconds and retry
2014-07-14 14:06:09,200 - root - ERROR - http://192.168.171.151:8091/pools/default error 401 reason: status: 401, content:
2014-07-14 14:06:09,200 - root - INFO - sleep 10 seconds and retry
2014-07-14 14:06:19,205 - root - INFO - creating direct client 192.168.171.151:11210 default
2014-07-14 14:06:19,230 - root - INFO - Saw ep_queue_size 0 == 0 expected on '192.168.171.151:8091',default bucket
2014-07-14 14:06:19,277 - root - INFO - creating direct client 192.168.171.151:11210 default


 Comments   
Comment by Aleksey Kondratenko [ 14/Jul/14 ]
Looks auth-related. This is probably expected.
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
http://review.couchbase.org/39706
Comment by Thuan Nguyen [ 24/Jul/14 ]
Tested on build 3.0.0-1020. I could not reproduce this bug.




[MB-7534] Memcached Ascii client doesn't receive TMPFAIL error when deleting data while bucket flush is in progress Created: 16/Jan/13  Updated: 24/Jul/14  Due: 20/Jun/14  Resolved: 24/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.0.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Deepkaran Salooja Assignee: Iryna Mironava
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: build 2.0.1-127-rel

<manifest>
<remote name="couchbase" fetch="git://github.com/couchbase/"/>
<remote name="membase" fetch="git://github.com/membase/"/>
<remote name="apache" fetch="git://github.com/apache/"/>
<remote name="erlang" fetch="git://github.com/erlang/"/>
<default remote="couchbase" revision="master"/>
<project name="tlm" path="tlm" revision="12abea946eafd7411273d18a10ae1f84390db3d4">
<copyfile src="Makefile.top" dest="Makefile"/>
</project>
<project name="bucket_engine" path="bucket_engine" revision="70b3624abc697b7d18bf3d57f331b7674544e1e7"/>
<project name="ep-engine" path="ep-engine" revision="40544dd94cc758f94e86eb37f0af9135671fc56a"/>
<project name="libconflate" path="libconflate" revision="2cc8eff8e77d497d9f03a30fafaecb85280535d6"/>
<project name="libmemcached" path="libmemcached" revision="ca739a890349ac36dc79447e37da7caa9ae819f5" remote="membase"/>
<project name="libvbucket" path="libvbucket" revision="00d3763593c116e8e5d97aa0b646c42885727398"/>
<project name="membase-cli" path="membase-cli" revision="cb34a9ad94374d407a70d402bb59d07bfa41c873" remote="membase"/>
<project name="memcached" path="memcached" revision="7ea975a93a0231393502af4ca98976eee8a83386" remote="membase"/>
<project name="moxi" path="moxi" revision="52a5fa887bfff0bf719c4ee5f29634dd8707500e"/>
<project name="ns_server" path="ns_server" revision="7b9b2725b017a1ed54a1b7b4a15df68c96b6a3dd"/>
<project name="portsigar" path="portsigar" revision="1bc865e1622fb93a3fe0d1a4cdf18eb97ed9d600"/>
<project name="sigar" path="sigar" revision="63a3cd1b316d2d4aa6dd31ce8fc66101b983e0b0"/>
<project name="couchbase-examples" path="couchbase-examples" revision="cd9c8600589a1996c1ba6dbea9ac171b937d3379"/>
<project name="couchbase-python-client" path="couchbase-python-client" revision="006c1aa8b76f6bce11109af8a309133b57079c4c"/>
<project name="couchdb" path="couchdb" revision="84d25c7cb136c9f66adbb572c99cca81235ef13e"/>
<project name="couchdbx-app" path="couchdbx-app" revision="25ee900fddc9b05babae687eaad71cf102a367a3"/>
<project name="couchstore" path="couchstore" revision="b5937c4479bf05dcc67264efe19abaf52870a127"/>
<project name="geocouch" path="geocouch" revision="8997159c44282cfcd89ea9984dd8c0944a35b2b4"/>
<project name="mccouch" path="mccouch" revision="88701cc326bc3dde4ed072bb8441be83adcfb2a5"/>
<project name="testrunner" path="testrunner" revision="46696a6d0b8dc32af2b8df0a6bcef7c5aae992a8"/>
<project name="otp" path="otp" revision="b6dc1a844eab061d0a7153d46e7e68296f15a504" remote="erlang"/>
<project name="icu4c" path="icu4c" revision="26359393672c378f41f2103a8699c4357c894be7" remote="couchbase"/>
<project name="snappy" path="snappy" revision="5681dde156e9d07adbeeab79666c9a9d7a10ec95" remote="couchbase"/>
<project name="v8" path="v8" revision="447decb75060a106131ab4de934bcc374648e7f2" remote="couchbase"/>
<project name="gperftools" path="gperftools" revision="8f60ba949fb8576c530ef4be148bff97106ddc59" remote="couchbase"/>
<project name="pysqlite" path="pysqlite" revision="0ff6e32ea05037fddef1eb41a648f2a2141009ea" remote="couchbase"/>
</manifest>

Triage: Untriaged

 Description   
Similar to MB-6865 but with ascii client.

Reproducer is here http://review.couchbase.org/#/c/23976/2. Please run the test:
bucketflush_with_data_ops_mc_ascii,items=100000,data_op=delete

Below exception is returned:
  File "pytests/flush/bucketflush.py", line 150, in data_ops_with_mc_ascii
    client.delete(key)
  File "lib/mc_ascii_client.py", line 236, in delete
    raise MemcachedError(-1, response)
MemcachedError: Memcached error #-1: NOT_FOUND

Also, for insert/update the temporary-failure exception is a little different from the one returned by the moxi/binary client:

Insert
  File "pytests/flush/bucketflush.py", line 148, in data_ops_with_mc_ascii
    client.set(key, 0, 0, value)
  File "lib/mc_ascii_client.py", line 156, in set
    raise MemcachedError(-1, response)
MemcachedError: Memcached error #-1: SERVER_ERROR temporary failure

The exception code is different from the usual 134:

Memcached error #134 'Temporary failure': Temporary failure for vbucket :63 to mc 127.0.0.1:12001


 Comments   
Comment by Farshid Ghods (Inactive) [ 17/Jan/13 ]
per bug scrub - moving this to 2.0.2 since this only occurs when command is run against moxi
Comment by Maria McDuff (Inactive) [ 27/Mar/13 ]
per bug scrub: anil to work with ronnie to prioritize.
Comment by Maria McDuff (Inactive) [ 01/Apr/13 ]
Ronnie will be able to fix before he goes on vacation this friday, 4/5.
Comment by Anil Kumar [ 11/Apr/13 ]
Moving this to 2.1.
Comment by Maria McDuff (Inactive) [ 08/Oct/13 ]
shashank,

is this still happening in 2.2.0 build 821?
if not, pls close this bug.
thanks.
Comment by Shashank Gupta [ 21/Oct/13 ]
The same error is encountered when the delete and create tests are run with the 'use_ascii=True' parameter. See below:

test delete:
-t flush.bucketflush.BucketFlushTests.bucketflush_with_data_ops_moxi,items=100000,data_op=delete,use_ascii=True

O/P:
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "pytests/flush/bucketflush.py", line 108, in data_ops_with_moxi
    client.delete(key)
  File "lib/mc_ascii_client.py", line 236, in delete
    raise MemcachedError(-1, response)
MemcachedError: Memcached error #-1: NOT_FOUND



test create:
 -t flush.bucketflush.BucketFlushTests.bucketflush_with_data_ops_moxi,items=100000,data_op=create,use_ascii=True

O/P:
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "pytests/flush/bucketflush.py", line 106, in data_ops_with_moxi
    client.set(key, 0, 0, value)
  File "lib/mc_ascii_client.py", line 156, in set
    raise MemcachedError(-1, response)
MemcachedError: Memcached error #-1: SERVER_ERROR temporary failure


Build used : 2.2.0-821
Comment by Maria McDuff (Inactive) [ 20/May/14 ]
Iryna,

is this still happening in 3.0? pls re-assign to Steve Y if that is still the case.
Comment by Steve Yen [ 24/Jul/14 ]
It's probably not surprising that concurrent operations in the midst of a bucket flush might have race-y results.

And the memcached ascii protocol doesn't really have a TMPFAIL error response code, so receiving a NOT_FOUND in the midst of a bucket flush seems defensible.

For those reasons, marking this as won't fix.
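
For test code that still wants to exercise data ops during a flush, a tolerant wrapper along these lines is one option. This is only a sketch, assuming the mc_ascii_client calls shown in the tracebacks above (client.set(key, flags, exp, value), client.delete(key)); the retry limits are arbitrary.

import time

def tolerant_op(op, args=(), retries=20, delay=0.5):
    # Retry "SERVER_ERROR temporary failure" and treat NOT_FOUND as acceptable,
    # since during a flush the item may simply already be gone.
    for attempt in range(retries):
        try:
            return op(*args)
        except Exception as e:              # MemcachedError from mc_ascii_client
            msg = str(e)
            if "NOT_FOUND" in msg:
                return None
            if "temporary failure" in msg and attempt + 1 < retries:
                time.sleep(delay)
                continue
            raise

# e.g. inside the reproducer's data-op loop:
#   tolerant_op(client.set, (key, 0, 0, value))
#   tolerant_op(client.delete, (key,))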




[MB-9710] count of connections a little misleading Created: 10/Dec/13  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Perry Krug Assignee: Perry Krug
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged

 Description   
With the latest 2.5 (build 994) I'm looking at the number of connections as shown in the UI.

At the moment, it is showing me 21 connections per server. Yet when I look at the underlying statistics, I see some misleading numbers:
bucket_conns: 21
curr_connections: 22
curr_conns_on_port_11209: 18
curr_conns_on_port_11210: 2
daemon_connections: 4

And via netstat, 0 connections are shown on port 11210 and 34 on port 11209.

I'm opening this bug both for the ep-engine/memcached team to help explain which is most accurate, and then for the UI to pick that up.

 Comments   
Comment by Perry Krug [ 10/Dec/13 ]
Netstat for 11210 and 11209:
[root@ip-10-197-21-14 ~]# netstat -anp | grep 11210
tcp 0 0 0.0.0.0:11210 0.0.0.0:* LISTEN 3881/memcached
udp 0 0 0.0.0.0:11210 0.0.0.0:* 3881/memcached
[root@ip-10-197-21-14 ~]# netstat -anp | grep 11209
tcp 0 0 0.0.0.0:11209 0.0.0.0:* LISTEN 3881/memcached
tcp 0 0 127.0.0.1:51910 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:11209 10.197.21.14:42786 ESTABLISHED 3881/memcached
tcp 0 0 127.0.0.1:11209 127.0.0.1:37926 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:57303 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:11209 127.0.0.1:41525 ESTABLISHED 3881/memcached
tcp 0 0 127.0.0.1:11209 127.0.0.1:39789 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.197.21.14:44641 ESTABLISHED 3881/memcached
tcp 0 0 127.0.0.1:34026 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:34556 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:39789 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:54895 10.196.82.15:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:37926 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:53357 10.196.82.15:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:44641 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:41957 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:41525 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:42786 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:11209 10.197.21.14:48985 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:48985 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:11209 127.0.0.1:34026 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.197.21.14:41957 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.197.21.14:34556 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.196.82.15:59784 ESTABLISHED 3881/memcached
tcp 0 0 127.0.0.1:11209 127.0.0.1:51910 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.196.82.15:39580 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.197.21.14:57303 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:38091 10.196.76.2:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:11209 10.198.10.70:48994 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.198.10.70:56929 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:46674 10.198.10.70:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:53288 10.196.76.2:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:51081 10.198.10.70:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:11209 10.196.76.2:35293 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.196.76.2:51339 ESTABLISHED 3881/memcached
udp 0 0 0.0.0.0:11209 0.0.0.0:* 3881/memcached
[root@ip-10-197-21-14 ~]#

The UI is reporting 21 on this node. The main question here is being able to explain to customers the correlation between a) their client connections, b) the output of netstat, and c) the metrics reported in the UI.
Comment by Cihan Biyikoglu [ 06/Jun/14 ]
Downgrading to minor given it isn't a critical stat. Keeping on 3.0 for now, but this should only be considered later in the 3.0 cycle.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Trond Norbye [ 22/Jul/14 ]
bucket_conns is refcount - 1; refcount has the following description:

    /* count of connections + 1 for hashtable reference + number of
     * reserved connections for this bucket + number of temporary
     * references created by find_bucket & frieds.
     *
     * count of connections is count of engine_specific instances
     * having peh equal to this engine_handle. There's only one
     * exception which is connections for which on_disconnect callback
     * was called but which are kept alive by reserved > 0. For those
     * connections we drop refcount in on_disconnect but keep peh
     * field so that bucket_engine_release_cookie can decrement peh
     * refcount.
     *
     * Handle itself can be freed when this drops to zero. This can
     * only happen when bucket is deleted (but can happen later
     * because some connection can hold pointer longer) */

curr_connections is the total number of connection objects in use, and then you have a breakdown of connections per endpoint. daemon_connections is the number of connection objects used for "listening" tasks; we have 4 here, which I would guess is IPv4 and IPv6 for the two endpoints...

I'm not absolutely sure why there is a mismatch between the sum of these and curr_connections. They are counted differently, so there are a number of sane explanations here. We reduce the per-port count immediately when we initiate a disconnect for a connection, but the aggregated number is only decremented when the connection is completely closed (it may wait for an event in the engine it is connected to, etc.). Another reason they may differ is if the OS reports an error when closing the socket (!eintr && !eagain); in that case we'll have a zombie connection...

A better question is probably: what is the UI trying to show you ;-)
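
To make the comparison concrete, a rough reconciliation sketch is shown below. It assumes cbstats is available on the node and that the stat names match the ones pasted earlier in this ticket; the netstat side simply counts ESTABLISHED lines, so loopback connections from beam.smp show up twice (once per end), which is one reason the raw netstat count (34) is larger than curr_conns_on_port_11209 (18).

import subprocess

def memcached_stats(host):
    out = subprocess.check_output(["cbstats", host + ":11210", "all"]).decode()
    stats = {}
    for line in out.splitlines():
        key, _, val = line.partition(":")
        stats[key.strip()] = val.strip()
    return stats

def established_on_port(port):
    out = subprocess.check_output(["netstat", "-an"]).decode()
    needle = ":%d " % port
    return sum(1 for l in out.splitlines() if needle in l and "ESTABLISHED" in l)

s = memcached_stats("10.197.21.14")              # node from the netstat dump above
for name in ("curr_connections", "daemon_connections",
             "curr_conns_on_port_11209", "curr_conns_on_port_11210"):
    print(name, s.get(name))
print("netstat ESTABLISHED on 11209:", established_on_port(11209))
print("netstat ESTABLISHED on 11210:", established_on_port(11210))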




[MB-11795] Rebalance exited with reason {unexpected_exit, {'EXIT',<0.27836.0>,{bulk_set_vbucket_state_failed...} Created: 23/Jul/14  Updated: 24/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Meenakshi Goel Assignee: Mike Wiederhold
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1005-rel

Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Ref Link:
http://qa.sc.couchbase.com/job/centos_x64--29_01--create_view_all-P1/126/consoleFull

Test to Reproduce:
./testrunner -i myfile.ini get-cbcollect-info=True,get-logs=True, -t view.createdeleteview.CreateDeleteViewTests.rebalance_in_and_out_with_ddoc_ops,ddoc_ops=create,test_with_view=True,num_ddocs=3,num_views_per_ddoc=2,items=200000,sasl_buckets=1

Steps to Reproduce:
1. Setup a 4-node cluster
2. Create 1 default and 1 sasl bucket
3. Rebalance in 2 nodes
4. Start Rebalance

Logs:

[user:info,2014-07-23T2:25:43.220,ns_1@172.23.107.24:<0.1154.0>:ns_orchestrator:handle_info:483]Rebalance exited with reason {unexpected_exit,
                              {'EXIT',<0.27836.0>,
                               {bulk_set_vbucket_state_failed,
                                [{'ns_1@172.23.107.24',
                                  {'EXIT',
                                   {{{{{badmatch,
                                        [{<0.27848.0>,
                                          {done,exit,
                                           {normal,
                                            {gen_server,call,
                                             [<0.14598.0>,
                                              {setup_streams,
                                               [684,692,695,699,704,705,706,
                                                707,708,709,710,711,712,713,
                                                714,715,716,717,718,719,720,
                                                721,722,723,724,725,726,727,
                                                728,729,730,731,732,733,734,
                                                735,736,737,738,739,740,741,
                                                742,743,744,745,746,747,748,
                                                749,750,751,752,753,754,755,
                                                756,757,758,759,760,761,762,
                                                763,764,765,766,767,768,769,
                                                770,771,772,773,774,775,776,
                                                777,778,779,780,781,782,783,
                                                784,785,786,787,788,789,790,
                                                791,792,793,794,795,796,797,
                                                798,799,800,801,802,803,804,
                                                805,806,807,808,809,810,811,
                                                812,813,814,815,816,817,818,
                                                819,820,821,822,823,824,825,
                                                826,827,828,829,830,831,832,
                                                833,834,835,836,837,838,839,
                                                840,841,842,843,844,845,846,
                                                847,848,849,850,851,852,853]},
                                              infinity]}},
                                           [{gen_server,call,3,
                                             [{file,"gen_server.erl"},
                                              {line,188}]},
                                            {upr_replicator,
                                             '-spawn_and_wait/1-fun-0-',1,
                                             [{file,"src/upr_replicator.erl"},
                                              {line,195}]}]}}]},
                                       [{misc,
                                         sync_shutdown_many_i_am_trapping_exits,
                                         1,
                                         [{file,"src/misc.erl"},{line,1429}]},
                                        {upr_replicator,spawn_and_wait,1,
                                         [{file,"src/upr_replicator.erl"},
                                          {line,217}]},
                                        {upr_replicator,handle_call,3,
                                         [{file,"src/upr_replicator.erl"},
                                          {line,112}]},
                                        {gen_server,handle_msg,5,
                                         [{file,"gen_server.erl"},{line,585}]},
                                        {proc_lib,init_p_do_apply,3,
                                         [{file,"proc_lib.erl"},{line,239}]}]},
                                      {gen_server,call,
                                       ['upr_replicator-bucket0-ns_1@172.23.107.26',
                                        {setup_replication,
                                         [684,692,695,699,704,705,706,707,708,
                                          709,710,711,712,713,714,715,716,717,
                                          718,719,720,721,722,723,724,725,726,
                                          727,728,729,730,731,732,733,734,735,
                                          736,737,738,739,740,741,742,743,744,
                                          745,746,747,748,749,750,751,752,753,
                                          754,755,756,757,758,759,760,761,762,
                                          763,764,765,766,767,768,769,770,771,
                                          772,773,774,775,776,777,778,779,780,
                                          781,782,783,784,785,786,787,788,789,
                                          790,791,792,793,794,795,796,797,798,
                                          799,800,801,802,803,804,805,806,807,
                                          808,809,810,811,812,813,814,815,816,
                                          817,818,819,820,821,822,823,824,825,
                                          826,827,828,829,830,831,832,833,834,
                                          835,836,837,838,839,840,841,842,843,
                                          844,845,846,847,848,849,850,851,852,
                                          853]},
                                        infinity]}},
                                     {gen_server,call,
                                      ['replication_manager-bucket0',
                                       {change_vbucket_replication,684,
                                        'ns_1@172.23.107.26'},
                                       infinity]}},
                                    {gen_server,call,
                                     [{'janitor_agent-bucket0',
                                       'ns_1@172.23.107.24'},
                                      {if_rebalance,<0.1353.0>,
                                       {update_vbucket_state,684,replica,
                                        undefined,'ns_1@172.23.107.26'}},
                                      infinity]}}}}]}}}

Uploading Logs


 Comments   
Comment by Meenakshi Goel [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11795/f9ad56ee/172.23.107.24-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11795/07e24114/172.23.107.25-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11795/a9c9a36d/172.23.107.26-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11795/2517f70b/172.23.107.27-diag.zip
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
Seeing downstream (upr replicator to upr consumer) connection being closed.

Possibly due to this message

Wed Jul 23 02:25:43.080600 PDT 3: (bucket0) UPR (Consumer) eq_uprq:replication:ns_1@172.23.107.26->ns_1@172.23.107.24:bucket0 - (vb 684) Attempting to add stream with start seqno 0, end seqno 18446744073709551615, vbucket uuid 139895607874175, snap start seqno 0, and snap end seqno 0
Wed Jul 23 02:25:43.080642 PDT 3: (bucket0) UPR (Consumer) eq_uprq:replication:ns_1@172.23.107.26->ns_1@172.23.107.24:bucket0 - Disconnecting because noop message has no been received for 40 seconds
Wed Jul 23 02:25:43.082958 PDT 3: (bucket0) UPR (Producer) eq_uprq:replication:ns_1@172.23.107.24->ns_1@172.23.107.25:bucket0 - (vb 359) Stream closing, 0 items sent from disk, 0 items sent from memory, 0 was last seqno sent

This is on .24.

Appears related to yesterday's fix to detect disconnects on consumer side.
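
For context, here is a minimal sketch of the consumer-side noop watchdog that the quoted log message describes (the struct, field, and function names are illustrative assumptions, not ep-engine's real API; only the 40-second interval comes from the log):

/* Hypothetical consumer-side noop watchdog -- names are illustrative,
 * not ep-engine's actual code. */
#include <stdbool.h>
#include <time.h>

#define NOOP_TIMEOUT_SECS 40

struct upr_consumer {
    time_t last_noop_recv;   /* updated whenever a noop arrives */
    bool   noop_enabled;
};

void consumer_on_noop(struct upr_consumer *c) {
    c->last_noop_recv = time(NULL);
}

/* Called periodically; returns true if the connection should be dropped
 * because no noop has arrived within the timeout. */
bool consumer_noop_expired(const struct upr_consumer *c) {
    if (!c->noop_enabled) {
        return false;
    }
    return difftime(time(NULL), c->last_noop_recv) >= NOOP_TIMEOUT_SECS;
}

If a check like this fires while the producer side is healthy (for example because the timer starts counting before noops are actually being sent), the consumer tears down a working replication stream, which would be consistent with the rebalance failure above.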
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
CC-ed Chiyoung and optimistically passed this to Mike due to its apparent relation to a fix made (AFAIK) by Mike.
Comment by Mike Wiederhold [ 23/Jul/14 ]
Duplicate of MB-18003
Comment by Ketaki Gangal [ 24/Jul/14 ]
MB-11803* ?
Comment by Chiyoung Seo [ 24/Jul/14 ]
Ketaki,

Yes, it's MB-11803.




[MB-11746] Rebalance exited with reason {unexpected_exit,{define_view,true}...['capi_set_view_manager-bucket0'] Created: 16/Jul/14  Updated: 24/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Meenakshi Goel Assignee: Meenakshi Goel
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-966-rel

Attachments: Text File logs.txt    
Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Test to Reproduce:
view.createdeleteview.CreateDeleteViewTests.rebalance_in_and_out_with_ddoc_ops,ddoc_ops=create,test_with_view=True,num_ddocs=3,num_views_per_ddoc=2,items=2000000,sasl_buckets=1

Steps to Reproduce:
1. Setup 4-node cluster
2. Create 1 default bucket and 1 sasl bucket
3. Rebalance in 2 nodes
4. Load 2000000 documents
5. Rebalance in out nodes operation with ddoc ops in parallel

Please refer to the attached logs.txt for logs.

Uploading logs.

 Comments   
Comment by Meenakshi Goel [ 16/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11746/b2519932/10.3.5.90-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11746/15ab0f0c/10.3.5.91-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11746/f8d8f371/10.3.5.92-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11746/dd809f9b/10.3.5.93-diag.zip
Comment by Nimish Gupta [ 17/Jul/14 ]
From the logs, it seems that the define_group process is exiting. I tried reproducing this on my setup many times, but was not able to reproduce it.
Meenakshi, could you please give me your setup to debug this issue?
Comment by Meenakshi Goel [ 17/Jul/14 ]
The cluster shared above has the setup and is in the same state.
You can use the cluster if it is of any help.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - July 17
Comment by Nimish Gupta [ 22/Jul/14 ]
Review in progress. (http://review.couchbase.org/39663)
Comment by Meenakshi Goel [ 24/Jul/14 ]
Verified with 3.0.0-1018-rel, Issue is no longer reproducible.




[MB-11326] [memcached] Function call argument is an uninitialised value in upr_stream_req_executor Created: 05/Jun/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Dave Rigby Assignee: Trond Norbye
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File report-3a6911.html    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Bug reported by the clang static analyzer.

Description: Function call argument is an uninitialized value
File: /Users/dave/repos/couchbase/server/source/memcached/daemon/memcached.c upr_stream_req_executor()
Line: 4242

See attached report.

From speaking to Trond offline, he believes that it shouldn't be possible to enter upr_stream_req_executor() with c->aiostat == ENGINE_ROLLBACK (which is what triggers this error), in which case we should just add a suitable assert() to squash the warning.
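
A hedged sketch of the kind of guard being suggested, using a plain assert(); the types and the surrounding code are abbreviated stand-ins, not the actual memcached.c source:

/* Abbreviated, hypothetical sketch -- not the exact memcached.c code.
 * Asserting that the ROLLBACK state cannot be seen on entry documents
 * the invariant and tells the analyzer that the rollback-seqno path
 * (the one using the possibly-uninitialised value) is dead. */
#include <assert.h>

typedef enum { ENGINE_SUCCESS, ENGINE_ROLLBACK } ENGINE_ERROR_CODE;

struct conn { ENGINE_ERROR_CODE aiostat; };

void upr_stream_req_sketch(struct conn *c) {
    /* Per the discussion above, entering with aiostat == ENGINE_ROLLBACK
     * should be impossible, so make that explicit. */
    assert(c->aiostat != ENGINE_ROLLBACK);

    ENGINE_ERROR_CODE ret = c->aiostat;
    c->aiostat = ENGINE_SUCCESS;
    (void)ret; /* ...rest of the executor elided... */
}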

 Comments   
Comment by Dave Rigby [ 20/Jun/14 ]
http://review.couchbase.org/#/c/38560/
Comment by Wayne Siu [ 08/Jul/14 ]
Hi Trond,
The patchset is ready for review.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th




[MB-11768] movement of 27 empty replica vbuckets gets stuck in seqnoWaiting Created: 20/Jul/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Sriram Ganesan
Resolution: Cannot Reproduce Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-988

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Issue Links:
Duplicate
is duplicated by MB-11796 Rebalance after manual failover hangs... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-11768/172.23.96.11.zip
https://s3.amazonaws.com/bugdb/jira/MB-11768/172.23.96.12.zip
https://s3.amazonaws.com/bugdb/jira/MB-11768/172.23.96.13.zip
https://s3.amazonaws.com/bugdb/jira/MB-11768/172.23.96.14.zip
Is this a Regression?: Yes

 Description   
Rebalance of 10 empty buckets never finishes.

Rebalance of the first bucket ("bucket-10") completed successfully.

Movement of 27 vbuckets in bucket "bucket-9" started but clearly makes no progress.

451, 455, 459, 463, 467, 471, 475, 479, 483, 487, 491, 495, 499, 503, 507, 511, 811, 815, 819, 823, 827, 831, 835, 839, 843, 847, 851

According to the master events, the last step in all cases is seqnoWaitingStarted:

{u'node': u'ns_1@172.23.96.13', u'bucket': u'bucket-9', u'pid': u'<0.19962.2>', u'ts': 1405841984.068374, u'chainBefore': [u'172.23.96.13:11209', u'172.23.96.11:11209'], u'chainAfter': [u'172.23.96.14:11209', u'172.23.96.11:11209'], u'vbucket': 851, u'type': u'vbucketMoveStart'}
{u'bucket': u'bucket-9', u'state': u'replica', u'ts': 1405841984.069871, u'host': u'172.23.96.11:11209', u'vbucket': 851, u'type': u'vbucketStateChange'}
{u'bucket': u'bucket-9', u'state': u'replica', u'ts': 1405841984.070029, u'host': u'172.23.96.14:11209', u'vbucket': 851, u'type': u'vbucketStateChange'}
{u'node': u'172.23.96.14:11209', u'bucket': u'bucket-9', u'vbucket': 851, u'type': u'indexingInitiated', u'ts': 1405841984.073599}
{u'bucket': u'bucket-9', u'vbucket': 851, u'type': u'backfillPhaseEnded', u'ts': 1405841984.074577}
{u'node': u'172.23.96.14:11209', u'seqno': 0, u'bucket': u'bucket-9', u'ts': 1405841984.075081, u'vbucket': 851, u'type': u'seqnoWaitingStarted'}
{u'node': u'172.23.96.11:11209', u'seqno': 0, u'bucket': u'bucket-9', u'ts': 1405841984.075081, u'vbucket': 851, u'type': u'seqnoWaitingStarted'}

Not surprisingly, 1051 replica vbuckets are reported.
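
For context, a minimal sketch of what the seqno-waiting step amounts to conceptually (hypothetical helper; the real wait is event-driven inside ns_server/ep-engine rather than a polling loop like this): the mover cannot continue until the destination reports at least the target seqno for the vbucket, and a wait that never completes is exactly the stuck state reported here.

/* Illustrative only -- a hypothetical polling version of "wait until
 * the vbucket reaches the target seqno", not ns_server/ep-engine code. */
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical callback that asks the destination node for the
 * vbucket's current high seqno. */
typedef uint64_t (*high_seqno_fn)(uint16_t vbid);

bool wait_for_seqno(uint16_t vbid, uint64_t target, int timeout_secs,
                    high_seqno_fn get_high_seqno) {
    for (int waited = 0; waited < timeout_secs; waited++) {
        if (get_high_seqno(vbid) >= target) {
            return true;   /* move can proceed to the next step */
        }
        sleep(1);
    }
    return false;          /* never satisfied: the move appears stuck */
}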

 Comments   
Comment by Sriram Ganesan [ 21/Jul/14 ]
Pavel

I tried running the rebalance-in with cluster_run with 10 empty buckets and haven't been able to reproduce this in the latest repo. Can you please try with the latest build and update the ticket to see if this is still a problem?
Comment by Pavel Paulau [ 21/Jul/14 ]
Sriram,

The issue is very occasional; probably only one out of 50 tests fails.

Also build 3.0.0-988 is not that old.

If you cannot find anything from the logs then please close this as incomplete.
Comment by Pavel Paulau [ 23/Jul/14 ]
See also MB-11796. The same issue, it may have more details for you.

Will live cluster help?
Comment by Sriram Ganesan [ 23/Jul/14 ]
Sure. I shall take a look at the live cluster.
Comment by Pavel Paulau [ 23/Jul/14 ]
Live cluster with MB-11796 (which I hope is a duplicate): 172.23.96.11:8091 Administrator:password
Comment by Pavel Paulau [ 24/Jul/14 ]
I'm closing the ticket for now.

I will provide a live cluster if it happens again.




[MB-11796] Rebalance after manual failover hangs (delta recovery) Created: 23/Jul/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1005

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: Text File gdb11.log     Text File gdb12.log     Text File gdb13.log     Text File gdb14.log     Text File master_events.log    
Issue Links:
Duplicate
duplicates MB-11768 movement of 27 empty replica vbuckets... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.11.zip
https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.12.zip
https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.13.zip
https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.14.zip
Is this a Regression?: Yes

 Description   
1 of 4 nodes is being re-added after failover.
100M x 2KB items, 10K mixed ops/sec.

Steps:
1. Failover one of nodes.
2. Add it back.
3. Enable delta recovery.
4. Sleep 20 minutes.
5. Rebalance cluster.

Warmup is completed but rebalance hangs afterwards.

 Comments   
Comment by Sriram Ganesan [ 23/Jul/14 ]
I see the following log messages

Tue Jul 22 23:16:44.367356 PDT 3: (bucket-1) UPR (Consumer) eq_uprq:replication:ns_1@172.23.96.11->ns_1@172.23.96.12:bucket-1 - Disconnecting because noop message has no been received for 40 seconds
Tue Jul 22 23:16:44.367363 PDT 3: (bucket-1) UPR (Consumer) eq_uprq:replication:ns_1@172.23.96.14->ns_1@172.23.96.12:bucket-1 - Disconnecting because noop message has no been received for 40 seconds
Tue Jul 22 23:16:44.367376 PDT 3: (bucket-1) UPR (Consumer) eq_uprq:replication:ns_1@172.23.96.13->ns_1@172.23.96.12:bucket-1 - Disconnecting because noop message has no been received for 40 seconds

I also see messages like this

Wed Jul 23 02:30:49.306705 PDT 3: 155 Closing connection due to read error: Connection reset by peer
Wed Jul 23 02:30:49.310060 PDT 3: 144 Closing connection due to read error: Connection reset by peer
Wed Jul 23 02:30:49.310273 PDT 3: 152 Closing connection due to read error: Connection reset by peer

The first set of messages could point to a bug in UPR that is causing the disconnections, and the second set could simply be us trying to read from a disconnected socket. Interestingly, a fix was recently merged for bug MB-11803 (http://review.couchbase.org/#/c/39760/) in the UPR noop area. It might be a good idea to run this test with that fix to see if it addresses the problem.

I don't see any of the above error messages in the logs of MB-11768, so the seqnoWaitingStarted hang in this case could be different from the one in MB-11768, assuming that the fix for MB-11803 solves this problem.

Comment by Pavel Paulau [ 24/Jul/14 ]
Indeed, that fix helped.




[MB-11701] UI graceful option should be greyed out when there are no replicas. Created: 11/Jul/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Patrick Varley Assignee: Pavel Blagodov
Resolution: Fixed Votes: 1
Labels: failover
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2 buckets
beer-sample bucket 1 replica
XDCR bucket 0 replica

Attachments: PNG File Failover.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: No

 Description   
When a bucket has no replicas it cannot be gracefully failed over.

In the UI we hide the graceful failover button, which I believe is bad UI design; instead we should grey it out and explain that graceful failover is not available without the correct replica vBuckets.

 Comments   
Comment by Anil Kumar [ 18/Jul/14 ]
Pavel - Instead of 'hiding' it, let's grey out the Graceful Fail Over option.

( ) Graceful Fail Over (default) [Grey out]
(*) Hard Fail Over ...............

Attention – The graceful failover option is not available, either because the node is unreachable or because replica vbuckets cannot be activated gracefully.

Warning --
Comment by Pavel Blagodov [ 24/Jul/14 ]
http://review.couchbase.org/39604




[MB-11629] Memcached crashed during rebalance Created: 03/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Sangharsh Agarwal Assignee: Sangharsh Agarwal
Resolution: Fixed Votes: 0
Labels: releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-918
ubuntu 12.04, 64 bit

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
http://qa.hq.northscale.net/job/ubuntu_x64--01_02--rebalanceXDCR-P0/19/consoleFull


[Core Logs]
Basic crash dump analysis of /tmp//core.memcached.21508.

Please send the file to support@couchbase.com

--------------------------------------------------------------------------------
File information:
-rwxr-xr-x 1 couchbase couchbase 4958595 2014-07-01 19:00 /opt/couchbase/bin/memcached
6cd323a6609b29186b45436c840e7580 /opt/couchbase/bin/memcached
-rw------- 1 couchbase couchbase 332505088 2014-07-02 17:38 /tmp//core.memcached.21508
5b2434f7bc783b86c7d165249c783519 /tmp//core.memcached.21508
--------------------------------------------------------------------------------
Core file callstacks:
GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/couchbase/bin/memcached...done.
[New Thread 21511]
[New Thread 21512]
[New Thread 23321]
[New Thread 25659]
[New Thread 29947]
[New Thread 21508]
[New Thread 29948]
[New Thread 21509]
[New Thread 21510]
[New Thread 29934]
[New Thread 29950]
[New Thread 27817]
[New Thread 21514]
[New Thread 29949]
[New Thread 21515]
[New Thread 21513]

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /opt/couchbase/bin/../lib/memcached/libmcd_util.so.1.0.0...done.
Loaded symbols for /opt/couchbase/bin/../lib/memcached/libmcd_util.so.1.0.0
Reading symbols from /opt/couchbase/bin/../lib/libcbsasl.so.1.1.1...done.
Loaded symbols for /opt/couchbase/bin/../lib/libcbsasl.so.1.1.1
Reading symbols from /opt/couchbase/bin/../lib/libplatform.so.0.1.0...done.
Loaded symbols for /opt/couchbase/bin/../lib/libplatform.so.0.1.0
Reading symbols from /opt/couchbase/bin/../lib/libcJSON.so.1.0.0...done.
Loaded symbols for /opt/couchbase/bin/../lib/libcJSON.so.1.0.0
Reading symbols from /opt/couchbase/bin/../lib/libJSON_checker.so...done.
Loaded symbols for /opt/couchbase/bin/../lib/libJSON_checker.so
Reading symbols from /opt/couchbase/bin/../lib/libsnappy.so.1...done.
Loaded symbols for /opt/couchbase/bin/../lib/libsnappy.so.1
Reading symbols from /opt/couchbase/bin/../lib/libtcmalloc_minimal.so.4...done.
Loaded symbols for /opt/couchbase/bin/../lib/libtcmalloc_minimal.so.4
Reading symbols from /opt/couchbase/bin/../lib/libevent_core-2.0.so.5...done.
Loaded symbols for /opt/couchbase/bin/../lib/libevent_core-2.0.so.5
Reading symbols from /lib/libssl.so.0.9.8...(no debugging symbols found)...done.
Loaded symbols for /lib/libssl.so.0.9.8
Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypto.so.0.9.8
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libz.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /opt/couchbase/lib/memcached/stdin_term_handler.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/stdin_term_handler.so
Reading symbols from /opt/couchbase/lib/memcached/file_logger.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/file_logger.so
Reading symbols from /opt/couchbase/lib/memcached/bucket_engine.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/bucket_engine.so
Reading symbols from /opt/couchbase/lib/memcached/ep.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/ep.so
Reading symbols from /opt/couchbase/lib/libcouchstore.so...done.
Loaded symbols for /opt/couchbase/lib/libcouchstore.so
Reading symbols from /opt/couchbase/lib/libdirutils.so.0.1.0...done.
Loaded symbols for /opt/couchbase/lib/libdirutils.so.0.1.0
Reading symbols from /opt/couchbase/lib/libv8.so...done.
Loaded symbols for /opt/couchbase/lib/libv8.so
Reading symbols from /opt/couchbase/lib/libicui18n.so.44...done.
Loaded symbols for /opt/couchbase/lib/libicui18n.so.44
Reading symbols from /opt/couchbase/lib/libicuuc.so.44...done.
Loaded symbols for /opt/couchbase/lib/libicuuc.so.44
Reading symbols from /opt/couchbase/lib/libicudata.so.44...(no debugging symbols found)...done.
Loaded symbols for /opt/couchbase/lib/libicudata.so.44
Core was generated by `/opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcach'.
Program terminated with signal 6, Aborted.
#0 0x00007f4c6455ca75 in raise () from /lib/libc.so.6

Thread 16 (Thread 21513):
#0 0x00007f4c646102d3 in epoll_wait () from /lib/libc.so.6
#1 0x00007f4c65c875a6 in epoll_dispatch (base=0x6312780,
    tv=<value optimized out>) at epoll.c:404
#2 0x00007f4c65c72a04 in event_base_loop (base=0x6312780,
    flags=<value optimized out>) at event.c:1558
#3 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5e110)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#4 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#5 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 15 (Thread 21515):
#0 0x00007f4c646102d3 in epoll_wait () from /lib/libc.so.6
#1 0x00007f4c65c875a6 in epoll_dispatch (base=0x6312c80,
    tv=<value optimized out>) at epoll.c:404
#2 0x00007f4c65c72a04 in event_base_loop (base=0x6312c80,
    flags=<value optimized out>) at event.c:1558
#3 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5e130)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#4 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#5 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 14 (Thread 29949):
#0 0x00007f4c65471bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1 0x00007f4c666f21cb in cb_cond_timedwait (cond=0x64c8058, mutex=0x64c8020,
    ms=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:156
#2 0x00007f4c5fa2c5ff in SyncObject::wait (this=0x64c8018,
    tv=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/syncobject.h:74
#3 0x00007f4c5fa27cfc in ExecutorPool::trySleep (this=0x64c8000, t=...,
    now=...)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorpool.cc:190
#4 0x00007f4c5fa28066 in ExecutorPool::_nextTask (this=0x64c8000, t=...,
    tick=8 '\b')
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorpool.cc:155
#5 0x00007f4c5fa280bf in ExecutorPool::nextTask (this=0x64c8000, t=...,
    tick=98 'b')
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorpool.cc:165
#6 0x00007f4c5fa39362 in ExecutorThread::run (this=0xb338760)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorthread.cc:77
#7 0x00007f4c5fa397ad in launch_executor_thread (arg=0x64c805c)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorthread.cc:33
#8 0x00007f4c666f20df in platform_thread_wrap (arg=0x9045ef0)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#9 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#10 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 13 (Thread 21514):
#0 0x00007f4c646102d3 in epoll_wait () from /lib/libc.so.6
#1 0x00007f4c65c875a6 in epoll_dispatch (base=0x6312a00,
    tv=<value optimized out>) at epoll.c:404
#2 0x00007f4c65c72a04 in event_base_loop (base=0x6312a00,
    flags=<value optimized out>) at event.c:1558
#3 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5e120)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#4 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#5 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 12 (Thread 27817):
#0 0x00007f4c6547185c in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1 0x00007f4c62d00034 in engine_shutdown_thread (arg=0x6325180)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/engines/bucket_engine/bucket_engine.c:1610
#2 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5f700)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#3 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#4 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 11 (Thread 29950):
#0 0x00007f4c65471bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1 0x00007f4c666f21cb in cb_cond_timedwait (cond=0x64c8058, mutex=0x64c8020,
    ms=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:156
#2 0x00007f4c5fa2c5ff in SyncObject::wait (this=0x64c8018,
    tv=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/syncobject.h:74
#3 0x00007f4c5fa27cfc in ExecutorPool::trySleep (this=0x64c8000, t=...,
    now=...)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorpool.cc:190
#4 0x00007f4c5fa28066 in ExecutorPool::_nextTask (this=0x64c8000, t=...,
    tick=16 '\020')
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorpool.cc:155
#5 0x00007f4c5fa280bf in ExecutorPool::nextTask (this=0x64c8000, t=...,
    tick=97 'a')
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorpool.cc:165
#6 0x00007f4c5fa39362 in ExecutorThread::run (this=0xb338260)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorthread.cc:77
#7 0x00007f4c5fa397ad in launch_executor_thread (arg=0x64c805c)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorthread.cc:33
#8 0x00007f4c666f20df in platform_thread_wrap (arg=0x9045920)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#9 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#10 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 10 (Thread 29934):
#0 0x00007f4c6547185c in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1 0x00007f4c62d00034 in engine_shutdown_thread (arg=0xa0a3e20)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/engines/bucket_engine/bucket_engine.c:1610
#2 0x00007f4c666f20df in platform_thread_wrap (arg=0xafe2e90)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#3 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#4 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 9 (Thread 21510):
#0 0x00007f4c65471bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1 0x00007f4c666f21cb in cb_cond_timedwait (cond=0x7f4c6390e240,
    mutex=0x7f4c6390e200, ms=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:156
#2 0x00007f4c6370d1e8 in logger_thead_main (arg=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/extensions/loggers/file_logger.c:372
#3 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5e070)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#4 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#5 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 8 (Thread 21509):
#0 0x00007f4c64601a6d in read () from /lib/libc.so.6
#1 0x00007f4c6459c598 in _IO_file_underflow () from /lib/libc.so.6
#2 0x00007f4c6459e13e in _IO_default_uflow () from /lib/libc.so.6
#3 0x00007f4c6459268e in _IO_getline_info () from /lib/libc.so.6
#4 0x00007f4c64591579 in fgets () from /lib/libc.so.6
#5 0x00007f4c64110a91 in fgets (arg=<value optimized out>)
    at /usr/include/bits/stdio2.h:255
#6 check_stdin_thread (arg=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/extensions/daemon/stdin_check.c:38
#7 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5e060)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#8 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#9 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 7 (Thread 29948):
#0 0x00007f4c645af645 in mempcpy () from /lib/libc.so.6
#1 0x00007f4c6459ef8e in _IO_default_xsputn () from /lib/libc.so.6
#2 0x00007f4c6456ea10 in vfprintf () from /lib/libc.so.6
#3 0x00007f4c64626d30 in __vsnprintf_chk () from /lib/libc.so.6
#4 0x00007f4c64626c6a in __snprintf_chk () from /lib/libc.so.6
#5 0x00007f4c5f9e6bc5 in snprintf (this=0x6ec5698,
    add_stat=0x4099e0 <append_stats>, cookie=0x62a5b00)
    at /usr/include/bits/stdio2.h:66
#6 CheckpointManager::addStats (this=0x6ec5698,
    add_stat=0x4099e0 <append_stats>, cookie=0x62a5b00)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/checkpoint.cc:1560
#7 0x00007f4c5fa24de3 in StatCheckpointVisitor::addCheckpointStat(void const*, void (*)(char const*, unsigned short, char const*, unsigned int, void const*), EventuallyPersistentStore*, RCPtr<VBucket>&) ()
   from /opt/couchbase/lib/memcached/ep.so
#8 0x00007f4c5fa24eb8 in StatCheckpointVisitor::visitBucket(RCPtr<VBucket>&)
    () from /opt/couchbase/lib/memcached/ep.so
#9 0x00007f4c5f9f0ffc in EventuallyPersistentStore::visit (
    this=<value optimized out>, visitor=...)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/ep.cc:2784
#10 0x00007f4c5fa26f68 in StatCheckpointTask::run() ()
   from /opt/couchbase/lib/memcached/ep.so
#11 0x00007f4c5fa39401 in ExecutorThread::run (this=0xb338440)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorthread.cc:95
#12 0x00007f4c5fa397ad in launch_executor_thread (arg=0x7f4c5c476998)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorthread.cc:33
#13 0x00007f4c666f20df in platform_thread_wrap (arg=0x9045e20)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#14 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#15 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()

Thread 6 (Thread 21508):
#0 0x00007f4c646102d3 in epoll_wait () from /lib/libc.so.6
#1 0x00007f4c65c875a6 in epoll_dispatch (base=0x6312000,
    tv=<value optimized out>) at epoll.c:404
#2 0x00007f4c65c72a04 in event_base_loop (base=0x6312000,
    flags=<value optimized out>) at event.c:1558
#3 0x00000000004108d4 in main (argc=<value optimized out>,
    argv=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/daemon/memcached.c:8768

Thread 5 (Thread 29947):
#0 0x00007f4c65471bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1 0x00007f4c666f21cb in cb_cond_timedwait (cond=0x64c8058, mutex=0x64c8020,
    ms=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:156
#2 0x00007f4c5fa2c5ff in SyncObject::wait (this=0x64c8018,
    tv=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/syncobject.h:74
#3 0x00007f4c5fa27cfc in ExecutorPool::trySleep (this=0x64c8000, t=...,
    now=...)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorpool.cc:190
#4 0x00007f4c5fa28066 in ExecutorPool::_nextTask (this=0x64c8000, t=...,
    tick=0 '\000')
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorpool.cc:155
#5 0x00007f4c5fa280bf in ExecutorPool::nextTask (this=0x64c8000, t=...,
    tick=99 'c')
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorpool.cc:165
#6 0x00007f4c5fa39362 in ExecutorThread::run (this=0xb337d60)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorthread.cc:77
#7 0x00007f4c5fa397ad in launch_executor_thread (arg=0x64c805c)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/executorthread.cc:33
#8 0x00007f4c666f20df in platform_thread_wrap (arg=0x9045c10)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#9 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#10 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 4 (Thread 25659):
#0 0x00007f4c6547185c in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1 0x00007f4c62d00034 in engine_shutdown_thread (arg=0x63242a0)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/engines/bucket_engine/bucket_engine.c:1610
#2 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5ee40)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#3 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#4 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 3 (Thread 23321):
#0 0x00007f4c645d369d in nanosleep () from /lib/libc.so.6
#1 0x00007f4c64608df4 in usleep () from /lib/libc.so.6
#2 0x00007f4c5fa37905 in updateStatsThread (arg=<value optimized out>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/memory_tracker.cc:36
#3 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5e1f0)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#4 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#5 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 2 (Thread 21512):
#0 0x00007f4c646102d3 in epoll_wait () from /lib/libc.so.6
#1 0x00007f4c65c875a6 in epoll_dispatch (base=0x6312500,
    tv=<value optimized out>) at epoll.c:404
#2 0x00007f4c65c72a04 in event_base_loop (base=0x6312500,
    flags=<value optimized out>) at event.c:1558
#3 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5e100)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#4 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#5 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 1 (Thread 21511):
#0 0x00007f4c6455ca75 in raise () from /lib/libc.so.6
#1 0x00007f4c645605c0 in abort () from /lib/libc.so.6
#2 0x00007f4c64555941 in __assert_fail () from /lib/libc.so.6
#3 0x00000000004083ec in decrement_session_ctr ()
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/daemon/memcached.c:7731
#4 0x00007f4c5fa1f0be in EventuallyPersistentEngine::decrementSessionCtr (
    this=0x6360800)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/ep_engine.h:510
#5 0x00007f4c5fa1df0b in processUnknownCommand (h=0x6360800,
    cookie=0x617d800, request=0x6258000,
    response=0x40cb20 <binary_response_handler>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/ep_engine.cc:1193
#6 0x00007f4c5fa1ee0c in EvpUnknownCommand (handle=0x6360800,
    cookie=0x617d800, request=0x6258000,
    response=0x40cb20 <binary_response_handler>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/ep-engine/src/ep_engine.cc:1260
#7 0x00007f4c62d02bb3 in bucket_unknown_command (handle=0x7f4c62f09220,
    cookie=0x617d800, request=0x6258000,
    response=0x40cb20 <binary_response_handler>)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/engines/bucket_engine/bucket_engine.c:3215
#8 0x00000000004191ca in process_bin_unknown_packet (c=0x617d800)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/daemon/memcached.c:2663
#9 process_bin_packet (c=0x617d800)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/daemon/memcached.c:5392
#10 complete_nread (c=0x617d800)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/daemon/memcached.c:5796
#11 conn_nread (c=0x617d800)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/daemon/memcached.c:7003
#12 0x000000000040c73d in event_handler (fd=<value optimized out>,
    which=<value optimized out>, arg=0x617d800)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/memcached/daemon/memcached.c:7276
#13 0x00007f4c65c72afc in event_process_active_single_queue (base=0x6312280,
    flags=<value optimized out>) at event.c:1308
#14 event_process_active (base=0x6312280, flags=<value optimized out>)
    at event.c:1375
#15 event_base_loop (base=0x6312280, flags=<value optimized out>)
    at event.c:1572
#16 0x00007f4c666f20df in platform_thread_wrap (arg=0x1a5e0f0)
    at /home/buildbot/ubuntu-1004-x64-300-builder/build/build/platform/src/cb_pthreads.c:19
#17 0x00007f4c6546c9ca in start_thread () from /lib/libpthread.so.0
#18 0x00007f4c6460fcdd in clone () from /lib/libc.so.6
#19 0x0000000000000000 in ?? ()
--------------------------------------------------------------------------------
Module information:
/opt/couchbase/bin/../lib/memcached/libmcd_util.so.1.0.0:
-rw-r--r-- 1 couchbase couchbase 415163 2014-07-01 19:01 /opt/couchbase/bin/../lib/memcached/libmcd_util.so.1.0.0
b5f73dc290899e88fded946fde6a5244 /opt/couchbase/bin/../lib/memcached/libmcd_util.so.1.0.0
/opt/couchbase/bin/../lib/libcbsasl.so.1.1.1:
-rw-r--r-- 1 couchbase couchbase 412706 2014-07-01 19:01 /opt/couchbase/bin/../lib/libcbsasl.so.1.1.1
bf540b44de8c56f1bcab31c97edf825d /opt/couchbase/bin/../lib/libcbsasl.so.1.1.1
/opt/couchbase/bin/../lib/libplatform.so.0.1.0:
-rw-r--r-- 1 couchbase couchbase 556937 2014-07-01 19:01 /opt/couchbase/bin/../lib/libplatform.so.0.1.0
852ed58c53d2c9e69cef21d27e08370b /opt/couchbase/bin/../lib/libplatform.so.0.1.0
/opt/couchbase/bin/../lib/libcJSON.so.1.0.0:
-rw-r--r-- 1 couchbase couchbase 95332 2014-07-01 19:01 /opt/couchbase/bin/../lib/libcJSON.so.1.0.0
6e1bac6b7025efd048042f0da34841f5 /opt/couchbase/bin/../lib/libcJSON.so.1.0.0
/opt/couchbase/bin/../lib/libJSON_checker.so:
-rw-r--r-- 1 couchbase couchbase 35648 2014-07-01 19:01 /opt/couchbase/bin/../lib/libJSON_checker.so
1fd90948b5c1feb47b290e8d1714f4ac /opt/couchbase/bin/../lib/libJSON_checker.so
/opt/couchbase/bin/../lib/libsnappy.so.1:
lrwxrwxrwx 1 couchbase couchbase 18 2014-07-01 22:35 /opt/couchbase/bin/../lib/libsnappy.so.1 -> libsnappy.so.1.1.2
7c50c2a147ab8247b5a2b61a38604ccc /opt/couchbase/bin/../lib/libsnappy.so.1
/opt/couchbase/bin/../lib/libtcmalloc_minimal.so.4:
lrwxrwxrwx 1 couchbase couchbase 28 2014-07-01 22:35 /opt/couchbase/bin/../lib/libtcmalloc_minimal.so.4 -> libtcmalloc_minimal.so.4.2.1
a0669ae75ee5ae5352dd7c2704adf766 /opt/couchbase/bin/../lib/libtcmalloc_minimal.so.4
/opt/couchbase/bin/../lib/libevent_core-2.0.so.5:
lrwxrwxrwx 1 couchbase couchbase 26 2014-07-01 22:35 /opt/couchbase/bin/../lib/libevent_core-2.0.so.5 -> libevent_core-2.0.so.5.1.0
7c23d736254e9dc999fcad4782e986e5 /opt/couchbase/bin/../lib/libevent_core-2.0.so.5
/lib/libssl.so.0.9.8:
-rw-r--r-- 1 root root 333904 2012-04-24 08:29 /lib/libssl.so.0.9.8
d32b70676a28d4b191a859f5621d489d /lib/libssl.so.0.9.8
/lib/libcrypto.so.0.9.8:
-rw-r--r-- 1 root root 1622304 2012-04-24 08:29 /lib/libcrypto.so.0.9.8
34885593167472209511acfacef5a962 /lib/libcrypto.so.0.9.8
/lib/libpthread.so.0:
lrwxrwxrwx 1 root root 20 2012-05-24 11:03 /lib/libpthread.so.0 -> libpthread-2.11.1.so
f636a70138b6fbae3d067d560725a238 /lib/libpthread.so.0
/lib/libdl.so.2:
lrwxrwxrwx 1 root root 15 2012-05-24 11:03 /lib/libdl.so.2 -> libdl-2.11.1.so
b000e5ace83cb232171a8c973c443299 /lib/libdl.so.2
/lib/librt.so.1:
lrwxrwxrwx 1 root root 15 2012-05-24 11:03 /lib/librt.so.1 -> librt-2.11.1.so
28d8634d49bbe02e93af23a7adc632e8 /lib/librt.so.1
/usr/lib/libstdc++.so.6:
lrwxrwxrwx 1 root root 19 2012-05-24 11:03 /usr/lib/libstdc++.so.6 -> libstdc++.so.6.0.13
778aaa89a71cdc1f55b5986a9a9e3499 /usr/lib/libstdc++.so.6
/lib/libm.so.6:
lrwxrwxrwx 1 root root 14 2012-05-24 11:03 /lib/libm.so.6 -> libm-2.11.1.so
77385c4ce7ee521e4c38fb9cc13f534e /lib/libm.so.6
/lib/libgcc_s.so.1:
-rw-r--r-- 1 root root 92552 2012-03-08 20:17 /lib/libgcc_s.so.1
64bedbbb11f8452c9811c7d60c56b49a /lib/libgcc_s.so.1
/lib/libc.so.6:
lrwxrwxrwx 1 root root 14 2012-05-24 11:03 /lib/libc.so.6 -> libc-2.11.1.so
940dc8606a7975374e0be61e44c12fa8 /lib/libc.so.6
/lib/libz.so.1:
lrwxrwxrwx 1 root root 15 2012-05-24 10:41 /lib/libz.so.1 -> libz.so.1.2.3.3
6abd7af4f2752f371b0ecb7cc601c3ac /lib/libz.so.1
/lib64/ld-linux-x86-64.so.2:
lrwxrwxrwx 1 root root 12 2012-05-24 11:03 /lib64/ld-linux-x86-64.so.2 -> ld-2.11.1.so
c024251e1af3963fbb3fcef25d58410e /lib64/ld-linux-x86-64.so.2
/opt/couchbase/lib/memcached/stdin_term_handler.so:
-rw-r--r-- 1 couchbase couchbase 119503 2014-07-01 19:01 /opt/couchbase/lib/memcached/stdin_term_handler.so
2ed71f63866e3ab453c975ecdd722b3b /opt/couchbase/lib/memcached/stdin_term_handler.so
/opt/couchbase/lib/memcached/file_logger.so:
-rw-r--r-- 1 couchbase couchbase 143927 2014-07-01 19:01 /opt/couchbase/lib/memcached/file_logger.so
77dea259b81310f73ea9798dd8f60e5d /opt/couchbase/lib/memcached/file_logger.so
/opt/couchbase/lib/memcached/bucket_engine.so:
-rw-r--r-- 1 couchbase couchbase 421096 2014-07-01 19:01 /opt/couchbase/lib/memcached/bucket_engine.so
f7d424c7f5a853fd5f84b6cffcfb1b38 /opt/couchbase/lib/memcached/bucket_engine.so
/opt/couchbase/lib/memcached/ep.so:
-rw-r--r-- 1 couchbase couchbase 18673956 2014-07-01 19:01 /opt/couchbase/lib/memcached/ep.so
c30a26c281056fd381a09425633277fd /opt/couchbase/lib/memcached/ep.so
/opt/couchbase/lib/libcouchstore.so:
-rw-r--r-- 1 couchbase couchbase 3933395 2014-07-01 19:01 /opt/couchbase/lib/libcouchstore.so
66dd2e0d770234cddd60446cfd95f867 /opt/couchbase/lib/libcouchstore.so
/opt/couchbase/lib/libdirutils.so.0.1.0:
-rw-r--r-- 1 couchbase couchbase 181445 2014-07-01 19:01 /opt/couchbase/lib/libdirutils.so.0.1.0
de8c58cc7f1bc7a989e66becfefb2db2 /opt/couchbase/lib/libdirutils.so.0.1.0
/opt/couchbase/lib/libv8.so:
-rw-r--r-- 1 couchbase couchbase 116909487 2014-07-01 19:01 /opt/couchbase/lib/libv8.so
02be0eea832a546d2493cf530b4535ad /opt/couchbase/lib/libv8.so
/opt/couchbase/lib/libicui18n.so.44:
lrwxrwxrwx 1 couchbase couchbase 18 2014-07-01 22:35 /opt/couchbase/lib/libicui18n.so.44 -> libicui18n.so.44.0
c7b26e773e42456459912c214e35c571 /opt/couchbase/lib/libicui18n.so.44
/opt/couchbase/lib/libicuuc.so.44:
lrwxrwxrwx 1 couchbase couchbase 16 2014-07-01 22:35 /opt/couchbase/lib/libicuuc.so.44 -> libicuuc.so.44.0
5183a8c242bef1c814ad4eb30308f779 /opt/couchbase/lib/libicuuc.so.44
/opt/couchbase/lib/libicudata.so.44:
lrwxrwxrwx 1 couchbase couchbase 18 2014-07-01 22:35 /opt/couchbase/lib/libicudata.so.44 -> libicudata.so.44.0
93c3bc2bf90a238425388d76543168a0 /opt/couchbase/lib/libicudata.so.44

 Comments   
Comment by Sangharsh Agarwal [ 03/Jul/14 ]
I think MB-11599 is wrongly marked as a duplicate of MB-11572. Sorry if I missed something, but as per my observation the traces attached in MB-11572 were different from those in MB-11599.
Comment by Sangharsh Agarwal [ 03/Jul/14 ]
[Server Log]
10.3.3.142 : https://s3.amazonaws.com/bugdb/jira/MB-11629/349add7c/10.3.3.142-diag.txt.gz
10.3.3.142 : https://s3.amazonaws.com/bugdb/jira/MB-11629/5d2c12b5/10.3.3.142-722014-1814-diag.zip
10.3.3.143 : https://s3.amazonaws.com/bugdb/jira/MB-11629/b1f77fa5/10.3.3.143-722014-188-diag.zip
10.3.3.143 : https://s3.amazonaws.com/bugdb/jira/MB-11629/c89011f9/10.3.3.143-diag.txt.gz
10.3.3.144 : https://s3.amazonaws.com/bugdb/jira/MB-11629/8b1a9770/10.3.3.144-722014-1820-diag.zip
10.3.3.144 : https://s3.amazonaws.com/bugdb/jira/MB-11629/bee710d9/10.3.3.144-diag.txt.gz
10.3.3.145 : https://s3.amazonaws.com/bugdb/jira/MB-11629/3aa0c1d1/10.3.3.145-722014-1811-diag.zip
10.3.3.145 : https://s3.amazonaws.com/bugdb/jira/MB-11629/8877a22f/10.3.3.145-diag.txt.gz
10.3.3.146 : https://s3.amazonaws.com/bugdb/jira/MB-11629/e751ec21/10.3.3.146-diag.txt.gz
10.3.3.146 : https://s3.amazonaws.com/bugdb/jira/MB-11629/f7009b5f/10.3.3.146-722014-1817-diag.zip
10.3.3.147 : https://s3.amazonaws.com/bugdb/jira/MB-11629/862ceaf1/10.3.3.147-722014-1825-diag.zip
10.3.3.147 : https://s3.amazonaws.com/bugdb/jira/MB-11629/e029a84d/10.3.3.147-diag.txt.gz
10.3.3.148 : https://s3.amazonaws.com/bugdb/jira/MB-11629/8c91f598/10.3.3.148-722014-1822-diag.zip
10.3.3.148 : https://s3.amazonaws.com/bugdb/jira/MB-11629/d20eacee/10.3.3.148-diag.txt.gz
10.3.3.149 : https://s3.amazonaws.com/bugdb/jira/MB-11629/8d75f3f5/10.3.3.149-722014-1827-diag.zip
10.3.3.149 : https://s3.amazonaws.com/bugdb/jira/MB-11629/9bf52b14/10.3.3.149-diag.txt.gz

[Core Logs]
core : https://s3.amazonaws.com/bugdb/jira/MB-11629/ea944e30/core-10.3.3.146-0.log
Comment by Ketaki Gangal [ 03/Jul/14 ]
Seeing a (likely) similar crash on rebalance in the System Views test. I don't have cores with the current run. Will update once I do.
Comment by Abhinav Dangeti [ 07/Jul/14 ]
Sangharsh, can you try your scenario on CentOS machines and let me know if you're able to reproduce this issue?
If that's the case, I have some toy builds that contain additional logging and some alternate logic, and we can give those a try.
Comment by Sangharsh Agarwal [ 08/Jul/14 ]
This bug occurred on Ubuntu VMs. The test passed on CentOS.
Comment by Abhinav Dangeti [ 08/Jul/14 ]
Please re-test once we have a build with this change:
http://review.couchbase.org/#/c/39212/
Comment by Abhinav Dangeti [ 08/Jul/14 ]
Re-open if the issue still persists.
Comment by Sangharsh Agarwal [ 10/Jul/14 ]
The issue re-occurred on the build 3.0.0-941 run.

[Job]
http://qa.hq.northscale.net/job/centos_x64--107_01--rebalanceXDCR-P1/11/consoleFull

[Test]
./testrunner -i centos_x64--107_01--rebalanceXDCR-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True -t xdcr.rebalanceXDCR.Rebalance.swap_rebalance_out_master,items=100000,rdirection=bidirection,ctopology=chain,doc-ops=update-delete,doc-ops-dest=update-delete,rebalance=source-destination,GROUP=P1


[Test Error]
[2014-07-09 10:31:10,870] - [rest_client:1216] INFO - rebalance percentage : 97.8005865103 %
[2014-07-09 10:31:11,313] - [rest_client:1200] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
[2014-07-09 10:31:11,408] - [rest_client:2007] INFO - Latest logs from UI on 10.5.2.228:
[2014-07-09 10:31:11,409] - [rest_client:2008] ERROR - {u'node': u'ns_1@10.5.2.228', u'code': 0, u'text': u'Bucket "default" loaded on node \'ns_1@10.5.2.228\' in 0 seconds.', u'shortText': u'message', u'serverTime': u'2014-07-09T10:31:09.308Z', u'module': u'ns_memcached', u'tstamp': 1404927069308, u'type': u'info'}
[2014-07-09 10:31:11,409] - [rest_client:2008] ERROR - {u'node': u'ns_1@10.5.2.228', u'code': 2, u'text': u'Rebalance exited with reason {unexpected_exit,\n {\'EXIT\',<0.17115.99>,\n {bulk_set_vbucket_state_failed,\n [{\'ns_1@10.5.2.229\',\n {\'EXIT\',\n {{{{case_clause,\n {error,\n {{{badmatch,\n {error,\n {{badmatch,\n {memcached_error,key_enoent,\n <<"Engine not found">>}},\n [{mc_replication,connect,1,\n [{file,\n "src/mc_replication.erl"},\n {line,50}]},\n {upr_proxy,connect,4,\n [{file,"src/upr_proxy.erl"},\n {line,177}]},\n {upr_proxy,maybe_connect,1,\n [{file,"src/upr_proxy.erl"},\n {line,164}]},\n {upr_producer_conn,init,2,\n [{file,\n "src/upr_producer_conn.erl"},\n {line,30}]},\n {upr_proxy,init,1,\n [{file,"src/upr_proxy.erl"},\n {line,48}]},\n {gen_server,init_it,6,\n [{file,"gen_server.erl"},\n {line,304}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},\n {line,239}]}]}}},\n [{upr_replicator,init,1,\n [{file,"src/upr_replicator.erl"},\n {line,48}]},\n {gen_server,init_it,6,\n [{file,"gen_server.erl"},\n {line,304}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},\n {line,239}]}]},\n {child,undefined,\'ns_1@10.5.2.228\',\n {upr_replicator,start_link,\n [\'ns_1@10.5.2.228\',"default"]},\n temporary,60000,worker,\n [upr_replicator]}}}},\n [{upr_sup,start_replicator,2,\n [{file,"src/upr_sup.erl"},{line,78}]},\n {upr_sup,\n \'-set_desired_replications/2-lc$^2/1-2-\',\n 2,\n [{file,"src/upr_sup.erl"},{line,55}]},\n {upr_sup,set_desired_replications,2,\n [{file,"src/upr_sup.erl"},{line,55}]},\n {replication_manager,handle_call,3,\n [{file,"src/replication_manager.erl"},\n {line,130}]},\n {gen_server,handle_msg,5,\n [{file,"gen_server.erl"},{line,585}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},{line,239}]}]},\n {gen_server,call,\n [\'replication_manager-default\',\n {change_vbucket_replication,118,\n \'ns_1@10.5.2.228\'},\n infinity]}},\n {gen_server,call,\n [{\'janitor_agent-default\',\n \'ns_1@10.5.2.229\'},\n {if_rebalance,<0.27856.98>,\n {update_vbucket_state,119,replica,\n undefined,\'ns_1@10.5.2.234\'}},\n infinity]}}}}]}}}\n', u'shortText': u'message', u'serverTime': u'2014-07-09T10:31:06.026Z', u'module': u'ns_orchestrator', u'tstamp': 1404927066026, u'type': u'info'}
[2014-07-09 10:31:11,410] - [rest_client:2008] ERROR - {u'node': u'ns_1@10.5.2.228', u'code': 0, u'text': u'<0.17318.99> exited with {unexpected_exit,\n {\'EXIT\',<0.17115.99>,\n {bulk_set_vbucket_state_failed,\n [{\'ns_1@10.5.2.229\',\n {\'EXIT\',\n {{{{case_clause,\n {error,\n {{{badmatch,\n {error,\n {{badmatch,\n {memcached_error,key_enoent,\n <<"Engine not found">>}},\n [{mc_replication,connect,1,\n [{file,"src/mc_replication.erl"},\n {line,50}]},\n {upr_proxy,connect,4,\n [{file,"src/upr_proxy.erl"},\n {line,177}]},\n {upr_proxy,maybe_connect,1,\n [{file,"src/upr_proxy.erl"},\n {line,164}]},\n {upr_producer_conn,init,2,\n [{file,"src/upr_producer_conn.erl"},\n {line,30}]},\n {upr_proxy,init,1,\n [{file,"src/upr_proxy.erl"},\n {line,48}]},\n {gen_server,init_it,6,\n [{file,"gen_server.erl"},\n {line,304}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},\n {line,239}]}]}}},\n [{upr_replicator,init,1,\n [{file,"src/upr_replicator.erl"},\n {line,48}]},\n {gen_server,init_it,6,\n [{file,"gen_server.erl"},{line,304}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},{line,239}]}]},\n {child,undefined,\'ns_1@10.5.2.228\',\n {upr_replicator,start_link,\n [\'ns_1@10.5.2.228\',"default"]},\n temporary,60000,worker,\n [upr_replicator]}}}},\n [{upr_sup,start_replicator,2,\n [{file,"src/upr_sup.erl"},{line,78}]},\n {upr_sup,\n \'-set_desired_replications/2-lc$^2/1-2-\',\n 2,\n [{file,"src/upr_sup.erl"},{line,55}]},\n {upr_sup,set_desired_replications,2,\n [{file,"src/upr_sup.erl"},{line,55}]},\n {replication_manager,handle_call,3,\n [{file,"src/replication_manager.erl"},\n {line,130}]},\n {gen_server,handle_msg,5,\n [{file,"gen_server.erl"},{line,585}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},{line,239}]}]},\n {gen_server,call,\n [\'replication_manager-default\',\n {change_vbucket_replication,118,\n \'ns_1@10.5.2.228\'},\n infinity]}},\n {gen_server,call,\n [{\'janitor_agent-default\',\'ns_1@10.5.2.229\'},\n {if_rebalance,<0.27856.98>,\n {update_vbucket_state,119,replica,\n undefined,\'ns_1@10.5.2.234\'}},\n infinity]}}}}]}}}', u'shortText': u'message', u'serverTime': u'2014-07-09T10:31:06.005Z', u'module': u'ns_vbucket_mover', u'tstamp': 1404927066005, u'type': u'critical'}
[2014-07-09 10:31:11,411] - [rest_client:2008] ERROR - {u'node': u'ns_1@10.5.2.228', u'code': 0, u'text': u"Control connection to memcached on 'ns_1@10.5.2.228' disconnected: {badmatch,\n {error,\n einval}}", u'shortText': u'message', u'serverTime': u'2014-07-09T10:31:05.111Z', u'module': u'ns_memcached', u'tstamp': 1404927065111, u'type': u'info'}
[2014-07-09 10:31:11,411] - [rest_client:2008] ERROR - {u'node': u'ns_1@10.5.2.228', u'code': 0, u'text': u"Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 134. Restarting. Messages: Wed Jul 9 10:31:03.731560 PDT 3: (default) UPR (Producer) eq_uprq:xdcr:default-ffc3287d8a27b709905167cd9eff75c6 - (vb 94) Stream closing, 0 items sent from disk, 179 items sent from memory, 199 was last seqno sent\nWed Jul 9 10:31:04.610874 PDT 3: (default) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@10.5.2.228:default - (vb 62) stream created with start seqno 184 and end seqno 0\nWed Jul 9 10:31:04.611285 PDT 3: (default) UPR (Producer) eq_uprq:xdcr:default-ffc3287d8a27b709905167cd9eff75c6 - (vb 64) stream created with start seqno 1 and end seqno 175\nWed Jul 9 10:31:04.613383 PDT 3: (default) UPR (Producer) eq_uprq:xdcr:default-ffc3287d8a27b709905167cd9eff75c6 - (vb 64) Stream closing, 0 items sent from disk, 174 items sent from memory, 175 was last seqno sent\nmemcached: /buildbot/build_slave/centos-5-x64-300-builder/build/build/memcached/daemon/memcached.c:7731: decrement_session_ctr: Assertion `session_cas.ctr != 0' failed.", u'shortText': u'message', u'serverTime': u'2014-07-09T10:31:05.055Z', u'module': u'ns_log', u'tstamp': 1404927065055, u'type': u'info'}
[2014-07-09 10:31:11,412] - [rest_client:2008] ERROR - {u'node': u'ns_1@10.5.2.228', u'code': 0, u'text': u'Bucket "default" rebalance appears to be swap rebalance', u'shortText': u'message', u'serverTime': u'2014-07-09T10:27:43.285Z', u'module': u'ns_vbucket_mover', u'tstamp': 1404926863285, u'type': u'info'}
[2014-07-09 10:31:11,412] - [rest_client:2008] ERROR - {u'node': u'ns_1@10.5.2.234', u'code': 0, u'text': u'Bucket "default" loaded on node \'ns_1@10.5.2.234\' in 0 seconds.', u'shortText': u'message', u'serverTime': u'2014-07-09T10:27:42.851Z', u'module': u'ns_memcached', u'tstamp': 1404926862851, u'type': u'info'}
[2014-07-09 10:31:11,413] - [rest_client:2008] ERROR - {u'node': u'ns_1@10.5.2.228', u'code': 0, u'text': u'Started rebalancing bucket default', u'shortText': u'message', u'serverTime': u'2014-07-09T10:27:41.973Z', u'module': u'ns_rebalancer', u'tstamp': 1404926861973, u'type': u'info'}
[2014-07-09 10:31:11,413] - [rest_client:2008] ERROR - {u'node': u'ns_1@10.5.2.228', u'code': 4, u'text': u"Starting rebalance, KeepNodes = ['ns_1@10.5.2.230','ns_1@10.5.2.229',\n 'ns_1@10.5.2.234'], EjectNodes = ['ns_1@10.5.2.228'], Failed over and being ejected nodes = []; no delta recovery nodes\n", u'shortText': u'message', u'serverTime': u'2014-07-09T10:27:41.890Z', u'module': u'ns_orchestrator', u'tstamp': 1404926861890, u'type': u'info'}


[Core File]
core : https://s3.amazonaws.com/bugdb/jira/MB-11629/79759a4c/core-10.5.2.228-0.log


[Logs]
10.5.2.228 : https://s3.amazonaws.com/bugdb/jira/MB-11629/1814cebe/10.5.2.228-diag.txt.gz
10.5.2.228 : https://s3.amazonaws.com/bugdb/jira/MB-11629/2c964a9b/10.5.2.228-792014-1053-diag.zip
10.5.2.229 : https://s3.amazonaws.com/bugdb/jira/MB-11629/78d3f41e/10.5.2.229-792014-1057-diag.zip
10.5.2.229 : https://s3.amazonaws.com/bugdb/jira/MB-11629/afee0bc6/10.5.2.229-diag.txt.gz
10.5.2.230 : https://s3.amazonaws.com/bugdb/jira/MB-11629/ac873377/10.5.2.230-diag.txt.gz
10.5.2.230 : https://s3.amazonaws.com/bugdb/jira/MB-11629/f7b9a08e/10.5.2.230-792014-1055-diag.zip
10.5.2.234 : https://s3.amazonaws.com/bugdb/jira/MB-11629/28b59b04/10.5.2.234-diag.txt.gz
10.5.2.234 : https://s3.amazonaws.com/bugdb/jira/MB-11629/e9dba800/10.5.2.234-792014-114-diag.zip


10.5.2.231 : https://s3.amazonaws.com/bugdb/jira/MB-11629/96c9d310/10.5.2.231-792014-112-diag.zip
10.5.2.231 : https://s3.amazonaws.com/bugdb/jira/MB-11629/fc3d9d94/10.5.2.231-diag.txt.gz
10.5.2.232 : https://s3.amazonaws.com/bugdb/jira/MB-11629/3c75c843/10.5.2.232-792014-1059-diag.zip
10.5.2.232 : https://s3.amazonaws.com/bugdb/jira/MB-11629/94fe0c9a/10.5.2.232-diag.txt.gz
10.5.2.233 : https://s3.amazonaws.com/bugdb/jira/MB-11629/577ed73f/10.5.2.233-diag.txt.gz
10.5.2.233 : https://s3.amazonaws.com/bugdb/jira/MB-11629/de5f568f/10.5.2.233-792014-115-diag.zip
10.3.5.68 : https://s3.amazonaws.com/bugdb/jira/MB-11629/12754a72/10.3.5.68-792014-117-diag.zip
10.3.5.68 : https://s3.amazonaws.com/bugdb/jira/MB-11629/a5609191/10.3.5.68-diag.txt.gz
Comment by Abhinav Dangeti [ 10/Jul/14 ]
http://review.couchbase.org/#/c/39297/, http://review.couchbase.org/#/c/39296/

Re-test once we have a build with these changes merged.
Comment by Sangharsh Agarwal [ 11/Jul/14 ]
http://qa.hq.northscale.net/job/centos_x64--107_01--rebalanceXDCR-P1/15/ started job with build 953.
Comment by Abhinav Dangeti [ 11/Jul/14 ]
The changes haven't been merged yet.
Comment by Ketaki Gangal [ 11/Jul/14 ]
FYI: seeing this with an older build (943) on CentOS 6.4. During rebalance-out, the outgoing node has a memcached crash that looks similar (no core dump captured yet again).
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th

Trond will be back next week. We need his help.
Comment by Sangharsh Agarwal [ 18/Jul/14 ]
Two new crashes found, MB-11763 and MB-11764, though both occurred during rebalance. The backtrace for MB-11763 looks similar to this one. Please confirm.
Comment by Abhinav Dangeti [ 18/Jul/14 ]
MB-11763, MB-11764: both are duplicates of this bug.
Comment by Sangharsh Agarwal [ 20/Jul/14 ]
Abhinav,
     Sorry, I don't understand how MB-11764 is a duplicate of this issue; to me the backtraces of MB-11629 and MB-11764 look very different.
Comment by Sangharsh Agarwal [ 20/Jul/14 ]
Even the point of abort in MB-11763 differs from MB-11629. Let me know if I missed anything here.
Comment by Abhinav Dangeti [ 20/Jul/14 ]
The logs you attached in both MB-11763 and MB-11764 show the assertion failure caused by an underflow of the session counter.

Thread 1 (Thread 0x7f44700e5700 (LWP 570)):
#0 0x00007f4472ef38a5 in raise () from /lib64/libc.so.6
#1 0x00007f4472ef5085 in abort () from /lib64/libc.so.6
#2 0x00007f4472eeca1e in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f4472eecae0 in __assert_fail () from /lib64/libc.so.6
#4 0x0000000000407a1c in decrement_session_ctr () at /buildbot/build_slave/centos-5-x64-300-builder/build/build/memcached/daemon/memcached.c:7731
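
For reference, the abort() in the trace above (and the "exited with status 134" in the earlier log; 134 is 128 + SIGABRT) comes from the assert in decrement_session_ctr: the session counter is decremented when it is already zero, i.e. an unbalanced increment/decrement pair. A minimal sketch of that guard pattern, assuming nothing about the real memcached code beyond the assert text; the struct and function names below are illustrative only:

#include <cassert>
#include <cstdint>

// Hypothetical stand-in for memcached's session counter; illustration only.
struct session_cas_t {
    uint64_t value;
    uint64_t ctr;   // number of in-flight operations holding the session CAS
};

static session_cas_t session_cas = {0, 0};

void decrement_session_ctr_sketch() {
    // The guard that fires in the attached backtrace: if increments and
    // decrements are unbalanced, ctr is already 0 here and abort() is called
    // (the process exits with status 134, i.e. SIGABRT).
    assert(session_cas.ctr != 0);
    --session_cas.ctr;
}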
Comment by Sangharsh Agarwal [ 21/Jul/14 ]
Got it. Thanks.
Comment by Abhinav Dangeti [ 21/Jul/14 ]
Sangharsh, can you please run: http://qa.hq.northscale.net/job/centos_x64--107_01--rebalanceXDCR-P1
with this toy build: http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent58-3.0.0-toy-couchstore-x86_64_3.0.0-778-toy.rpm, as soon as you can.
This toy build just adds some extra logging (without the fix) that will help confirm something.
Note that you should still see the memcached crash; when you do, please get me the complete backtrace and the cbcollect_info from just the node where the crash occurs.
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
Abhinav, jobs have been started with the toy build:

http://qa.hq.northscale.net/job/centos_x64--107_01--rebalanceXDCR-P1/28/consoleFull

DownStream Job:

http://qa.hq.northscale.net/job/centos_x64--107_02--rebalanceXDCR_SSL-P1/22/consoleFull

I will share the results with you.
Comment by Abhinav Dangeti [ 22/Jul/14 ]
Thank you, Sangharsh.
Comment by Abhinav Dangeti [ 23/Jul/14 ]
Sangharsh/Aruna, can you schedule another job here:
http://qa.hq.northscale.net/job/centos_x64--107_01--rebalanceXDCR-P1/
with: http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent58-3.0.0-toy-couchstore-x86_64_3.0.0-784-toy.rpm
Comment by Abhinav Dangeti [ 23/Jul/14 ]
Merged a fix: http://review.couchbase.org/#/c/39768/




[MB-11803] {UPR}:: Rebalance-out failing due to bad replicators Created: 23/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 10.5.2.13
10.5.2.14
10.5.2.15
10.3.121.63
10.3.121.64
10.3.121.66
10.3.121.69

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Tested on builds 1011 and 1005; both Ubuntu and CentOS are seeing this issue.

1. Create a 7 node cluster
2. Create a default bucket
3. Add 100 K items
4. Rebalance-out 1 Node (10.3.121.69)
5. Do Ops for Gets

Step 4 and Step 5 act in parallel.

Rebalance exits with the following error:

Bad replicators after rebalance:
Missing = [{'ns_1@10.3.121.63','ns_1@10.3.121.64',0},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',1},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',2},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',3},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',56},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',4},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',5},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',6},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',57},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',58},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',19},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',20},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',21},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',22},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',59},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',23},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',24},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',25},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',60},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',26},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',29},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',30},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',31},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',61},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',38},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',39},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',40},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',62},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',41},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',42},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',43},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',63},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',44},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',47},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',48},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',49},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',64},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',65},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',74},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',75},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',76},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',66},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',77},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',78},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',79},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',67},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',80},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',81},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',82},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',83},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',68},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',92},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',93},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',94},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',69},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',95},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',96},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',97},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',71},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',110},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',111},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',112},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',72},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',113},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',114},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',115},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',73},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',116},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',117},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',118}]
Extras = []
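
For reference, each entry in the Missing list above is a {SourceNode, DestinationNode, VBucketId} triple: a replication stream that the post-rebalance check expected from the vbucket map but did not observe. A minimal sketch of that kind of check expressed as a set difference; this is illustrative only, not the ns_server implementation (which is Erlang), and all names below are hypothetical:

#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// {source node, destination node, vbucket id} -- one expected replication stream.
using Replicator = std::tuple<std::string, std::string, int>;

// Hypothetical helper: "missing" replicators are those the vbucket map says
// should exist but which no live replication connection covers.
std::vector<Replicator> missing_replicators(const std::set<Replicator>& expected,
                                            const std::set<Replicator>& observed) {
    std::vector<Replicator> missing;
    std::set_difference(expected.begin(), expected.end(),
                        observed.begin(), observed.end(),
                        std::back_inserter(missing));
    return missing;   // non-empty => "Bad replicators after rebalance"
}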

Test Case:: ./testrunner -i centos.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,dgm=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_get_random_key,nodes_out=1,items=100000,value_size=256,skip_cleanup=True,GROUP=OUT;BASIC;P0;FROM_2_0

Will attach logs asap

 Comments   
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
This is relatively easy to reproduce on cluster_run. I'm seeing UPR disconnects, which explain the bad_replicas.

Might be a duplicate of another UPR disconnects bug.
Comment by Parag Agarwal [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11803/logs.tar.gz
Comment by Mike Wiederhold [ 23/Jul/14 ]
http://review.couchbase.org/#/c/39760/
Comment by Parag Agarwal [ 23/Jul/14 ]
Does not reproduce in build 1014.




[MB-11775] Rebalance-stop is slow -- takes multiple attempts to stop rebalance Created: 21/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Ketaki Gangal Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-973-rel
Centos 6.4

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Setup:

1. Cluster 7 nodes, 2 buckets, 1 design doc X 2 views
2. Load 120M, 99M items on both the buckets, dgm state of 70% active resident.
3. Do a graceful failover on 1 node
4. Choose delta recovery, add back node and rebalance

5. I tried to stop the rebalance several times (about 10 attempts) -- most of them were unsuccessful.
Rebalance eventually failed with reason "Rebalance exited with reason stop" -- rebalance stop is not working as expected.

- Attaching logs



 Comments   
Comment by Ketaki Gangal [ 21/Jul/14 ]
Logs https://s3.amazonaws.com/bugdb/MB-11775/11775.tar
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
I'll need more diagnostics to figure out this case. I've added some in this commit (still pending review and merge): http://review.couchbase.org/39625

Please retest after this commit is merged so that I can see what makes rebalance stop slow.
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
The referenced commit is now merged, so please test as soon as you get the next build.
Comment by Ketaki Gangal [ 22/Jul/14 ]
Tested with build which contains the above commit - build 3.0.0-999-rel.

Seeing the same behaviour, where it takes several attempts to stop the rebalance.

Logs at https://s3.amazonaws.com/bugdb/MB-11775/11775-2.tar
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Uploaded probable fix:

http://review.couchbase.org/39694

With this fix (assuming I am right about the cause of the slowness) we'll be able to stop even if some node is stuck somewhere in janitor_agent, which could in turn be due to the view engine. That would mean the original slowness may still be visible elsewhere, possibly in a harder-to-debug way.

So in order to diagnose _that_, I need you to capture a diag or collectinfo from just one node _immediately_ after you send the stop and it is slow. If this is done correctly, I'll be able to see what is causing that slowness in the first place. Note that it needs to be done on a build prior to the rebalance-stop fix referred to above.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Merged, so rebalance stop should not be slow anymore. But see above for some additional investigation that we should do.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
reverted for now
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
Merged hopefully more correct fix: http://review.couchbase.org/39756




[MB-11225] investigate spikes in "rate" stats for UPR and TAP Created: 27/May/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Artem Stemkovski Assignee: Artem Stemkovski
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Happens after some data is replicated and then the server is restarted.

 Comments   
Comment by Anil Kumar [ 19/Jun/14 ]
Triage - June 19 2014 Alk, Parag, Anil
Comment by Artem Stemkovski [ 23/Jul/14 ]
Looks like the problem was fixed in memcached.




[MB-11802] [BUG BASH] Sample Bug Created: 23/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Don Pinto Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Sample test bug for bug bash - Ignore




[MB-11772] Provide the facility to release free memory back to the OS from running mcd process Created: 21/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1, 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Major
Reporter: Dave Rigby Assignee: Dave Rigby
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
On many occasions we have seen tcmalloc being very "greedy" with free memory and not releasing it back to the OS quickly. There have even been occasions where this has triggered the Linux OOM-killer because the memcached process had too much "free" tcmalloc memory still resident.

tcmalloc by design will /slowly/ return memory to the OS - via madvise(MADV_DONTNEED) - but this rate is very conservative, and it can currently only be changed by modifying an environment variable, which obviously cannot be done on a running process.

To help mitigate these problems in future, it would be very helpful to allow the user to request that free memory is released back to the OS.
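
For reference, gperftools' tcmalloc already exposes the underlying release operation through the MallocExtension API (the release rate mentioned above is governed by the TCMALLOC_RELEASE_RATE environment variable). A minimal standalone sketch of that API call, not the memcached integration in the linked review:

#include <gperftools/malloc_extension.h>

// Ask tcmalloc to return free pages in its page heap back to the OS
// (on Linux this is done via madvise(MADV_DONTNEED)). Per-thread caches and
// fragmentation mean the process RSS will not necessarily drop all the way
// down to the bucket's mem_used.
void release_free_memory_sketch() {
    MallocExtension::instance()->ReleaseFreeMemory();
}

MallocExtension also provides ReleaseToSystem(num_bytes) for releasing a bounded amount rather than everything at once.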


 Comments   
Comment by Dave Rigby [ 21/Jul/14 ]
http://review.couchbase.org/#/c/39608/




[MB-11793] Build breakage in upr-consumer.cc Created: 22/Jul/14  Updated: 23/Jul/14  Due: 23/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: .master
Fix Version/s: .master
Security Level: Public

Type: Task Priority: Test Blocker
Reporter: Chris Hillery Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: ep-engine
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: