[MB-12798] projector/router not configured to work by default in kv node Created: 27/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: all

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We do not have the projector turned on by default. Creating a bug to track this.

Alk's response.

"Optional" projector is temporary and unfortunate side effect of situation in build domain. Specifically branch-master.xml manifest (which is still our main manifest) does not install projector so ns_server cannot rely on projector always being available.

Having said that, I'll implement a better way of dealing with this situation soon. We can detect whether the projector is available and spawn it based on the existence of the executable file rather than on an environment variable. Expect the code for that in the next few hours.

And then hopefully the build side will get well reasonably soon and make everyone happy.
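For illustration, the detection approach described above amounts to checking for the executable and only then spawning it. A minimal sketch follows (the real ns_server logic is Erlang; the path and spawn details here are assumptions):

    // spawn_projector_sketch.cc - minimal illustration only; the actual
    // ns_server code is Erlang and the projector path is an assumption.
    #include <sys/types.h>
    #include <unistd.h>     // access(), fork(), execl()
    #include <cstdio>

    int main() {
        const char* projector = "/opt/couchbase/bin/projector";  // assumed install location

        // Spawn based on the existence of an executable file, not an env var.
        if (access(projector, X_OK) != 0) {
            std::printf("projector not installed; skipping spawn\n");
            return 0;
        }

        pid_t pid = fork();
        if (pid == 0) {
            // Child process: exec the projector (arguments omitted in this sketch).
            execl(projector, "projector", static_cast<char*>(nullptr));
            std::perror("execl failed");
            return 1;
        }
        std::printf("spawned projector, pid %d\n", static_cast<int>(pid));
        return 0;
    }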







[MB-12797] Improve logging in Couchstore to provide more context for various corruption cases Created: 27/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, storage-engine
Affects Version/s: techdebt-backlog
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We have observed Couchstore file corruption cases, such as header checksum errors and Snappy decompression failures, in some customer deployments. Couchstore currently lacks logging for these issues and needs to be improved to provide more context.
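As a sketch of the kind of context that would help here, the idea is simply to log which file, which offset, and what mismatched whenever a corruption is detected. The names below are illustrative assumptions, not the actual Couchstore API:

    // Illustrative only: hypothetical helper showing the extra context to log
    // on a header checksum mismatch (file name, offset, expected vs. actual CRC).
    #include <cstdint>
    #include <cstdio>

    enum VerifyResult { VERIFY_OK, VERIFY_CHECKSUM_FAIL };

    VerifyResult verify_header_checksum(const char* filename, uint64_t offset,
                                        uint32_t expected_crc, uint32_t actual_crc) {
        if (expected_crc != actual_crc) {
            std::fprintf(stderr,
                         "WARNING (couchstore): header checksum mismatch in %s at offset %llu: "
                         "expected 0x%08x, got 0x%08x\n",
                         filename,
                         static_cast<unsigned long long>(offset),
                         expected_crc, actual_crc);
            return VERIFY_CHECKSUM_FAIL;
        }
        return VERIFY_OK;
    }

    int main() {
        // Simulated mismatch so the example prints something when run.
        verify_header_checksum("0.couch.1", 4096, 0xdeadbeef, 0x0badf00d);
        return 0;
    }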




[MB-9874] [Windows] Couchstore drop and reopen of file handle fails Created: 09/Jan/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, storage-engine
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Critical
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: windows, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows

Is this a Regression?: Yes

 Description   
The unit test exercising couchstore_drop_file and couchstore_reopen_file fails with COUCHSTORE_READ_ERROR when it tries to reopen the file.

The commit http://review.couchbase.org/#/c/31767/ disabled the test to allow the rest of the unit tests to be executed.

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Anil Kumar [ 18/Nov/14 ]
[Chiyoung] - Moving this out of 3.0.2 since it's not a regression and it's a unit test which needs to be fixed.




[MB-10496] Investigate other possible memory allocators that provide better fragmentation management Created: 18/Mar/14  Updated: 27/Nov/14

Status: In Progress
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.0, 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Task Priority: Critical
Reporter: Chiyoung Seo Assignee: Dave Rigby
Resolution: Unresolved Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Issue Links:
Dependency
depends on MB-11756 segfault in ep.dylib`Checkpoint::queu... Closed
Duplicate
Sub-Tasks:
Key Summary Type Status Assignee
MB-12067 Explicit defragmentation of ep_engine... Technical task In Progress Dave Rigby  
MB-12575 Make jemalloc cbdeps build pass on fa... Technical task Open Chris Hillery  
MB-12604 Add factory Ubuntu 12.04 build for je... Technical task Open Chris Hillery  
MB-12605 Add factory CentOS 6 build for jemall... Technical task In Progress Dave Rigby  
MB-12608 Enable TCMalloc's aggressive decommit... Technical task Resolved Dave Rigby  

 Description   
As tcmalloc incurs significant memory fragmentation for particular load patterns (e.g., append/prepend operations), we need to investigate other options that have much less fragmentation overhead for those load patterns.

 Comments   
Comment by Matt Ingenthron [ 19/Mar/14 ]
I'm not an expert in this area any more, but I would say that my history with allocators is that there is often a tradeoff between performance aspects and space efficiency. My own personal opinion is that it may be better to not be tied to any one memory allocator, but rather have the right abstractions so we can use one or more.

I can certainly say that the initial tc_malloc integration was perhaps a bit hasty, driven by Sharon. The problem we were trying to solve at the time was actually a glibstdc++ bug on CentOS 5.2. It could have been fixed by upgrading to CentOS 5.3, but for a variety of reasons we were trying to find another workaround or solution. tc_malloc was integrated for that.

It was then that I introduced the lib/ directory and changed the compilation to set the RPATH. The reason I did this is I was trying to avoid our shipping tc_malloc, as at the time Ubuntu didn't include it since there were bugs. That gave me enough pause to think we may not want to be the first people to use tc_malloc in this particular way.

In particular, there's no reason to believe tc_malloc is best for Windows. It may also not be best for platforms like Mac OS and Solaris/SmartOS (in case we ever get there).
Comment by Matt Ingenthron [ 19/Mar/14 ]
By the way, those comments are just all history in case it's useful. Please discount or ignore it as appropriate. ;)
Comment by Chiyoung Seo [ 19/Mar/14 ]
Thanks Matt for the good comments. As you mentioned, we plan to support more than one memory allocator, so that users can choose the allocator based on their OS and workload patterns. I know that there are several open source projects; I think we can start by investigating them first, and then develop our own allocator if necessary.
Comment by Matt Ingenthron [ 19/Mar/14 ]
No worries. You'll need a benchmark or two to evaluate things. Even then, some people will probably prefer something space efficient versus time efficient, but we won't be able to support everything, etc. If it were me, I'd look to support the OS shipped advanced allocator and maybe one other, as long as they met my test criteria of course.
Comment by Dave Rigby [ 24/Mar/14 ]
Adding some possible candidates:

* jemalloc (https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919) - Not used personally, but I know some of the guys at FB who use it. Reportedly has good fragmentation properties.
Comment by Chiyoung Seo [ 06/May/14 ]
Trond will build a quick prototype that is based on a slabber on top of the slabber to see if that shows better fragmentation management. He will share his ideas and initial results later.
Comment by Steve Yen [ 28/May/14 ]
Hi Trond,
Any latest thoughts / news on this?
-- steve
Comment by Aleksey Kondratenko [ 09/Jul/14 ]
There's also some recently open-sourced work by the Aerospike folks to track jemalloc allocations, apparently similarly to how we're doing it with tcmalloc.
Comment by Dave Rigby [ 10/Jul/14 ]
@Alk: You got a link? I scanned back in aerospike's github fork (https://github.com/aerospike/jemalloc) for the last ~2 years but didn't see anything likely in there...
Comment by Aleksey Kondratenko [ 10/Jul/14 ]
It is a sibling project (mentioned in their server's readme): https://github.com/aerospike/asmalloc
Comment by Dave Rigby [ 21/Jul/14 ]
I've taken a look at asmalloc from Aerospike. Some notes for the record:

asmalloc isn't actually used for the "product-level" memory quota management - it's more along the lines of a debug tool which can be configured to report (via a callback interface) when allocations over certain sizes occur and/or when memory reaches certain levels.

The git repo documentation alludes to the fact that it can be LD_PRELOADed by default (in an inactive mode) and enabled on-demand, but the pre-built Vagrant VM I downloaded from their website didn't have it loaded, so I suspect it is more of a developer tool than a production feature.

In terms of implementation, asmalloc just defines its own malloc / free symbols and relies on LD_PRELOAD / dlsym with RTLD_NEXT to interpose its symbols in front of the real malloc library. I'd note however that this isn't directly supported on Windows (which isn't a problem for Aerospike as they only support Linux).


Their actual tracking of memory for "namespaces" (aka buckets) is done by simple manual counting - incrementing and decrementing atomic counters when documents are added / removed / resized.
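For reference, the LD_PRELOAD / dlsym(RTLD_NEXT) interposition technique mentioned above looks roughly like the following minimal sketch (illustrative only; this is not the asmalloc code, and a real tracker would also need per-pointer size bookkeeping):

    // Build: g++ -shared -fPIC malloc_shim.cc -o libmallocshim.so -ldl
    // Use:   LD_PRELOAD=./libmallocshim.so ./some_program
    #include <dlfcn.h>
    #include <atomic>
    #include <cstddef>

    static std::atomic<unsigned long long> g_bytes_allocated{0};  // crude running total

    extern "C" void* malloc(size_t size) {
        // Resolve the "real" malloc from the next library in the link order.
        static void* (*real_malloc)(size_t) =
            reinterpret_cast<void* (*)(size_t)>(dlsym(RTLD_NEXT, "malloc"));
        void* ptr = real_malloc(size);
        if (ptr != nullptr) {
            g_bytes_allocated.fetch_add(size, std::memory_order_relaxed);
        }
        return ptr;
    }

    extern "C" void free(void* ptr) {
        static void (*real_free)(void*) =
            reinterpret_cast<void (*)(void*)>(dlsym(RTLD_NEXT, "free"));
        // A real tracker would subtract the allocation size here, which needs
        // a per-pointer size map (omitted to keep the sketch small).
        real_free(ptr);
    }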
Comment by Dave Rigby [ 07/Aug/14 ]
Update on progress:

I've constructed a pathologically bad workload for the memory allocator (PathoGen) and run this on TCMalloc, JEMalloc and TCMalloc with aggressive decommit. Details at: https://docs.google.com/document/d/1sgE9LFfT5ZD4FbSZqCuUtzOoLu5BFi1Kf63R9kYuAyY/edit#

TL;DR:

* I can demonstrate TCMalloc having significantly higher RSS than the actual dataset, *and* holding onto this memory after the workload has decreased. JEMalloc doesn't have these problems.
* Interestingly, enabling TCMALLOC_AGGRESSIVE_DECOMMIT makes TCMalloc behave very close to Jemalloc (i.e. RSS tracks workload, *and* there is minimal general overhead).

Further investigation is needed to see the implications of "aggressive decommit", particularly any negative performance implications, or if there is a middle-ground setting.
Comment by Aleksey Kondratenko [ 07/Aug/14 ]
Interesting. But if I understand correctly this is actually not the worst possible workload. If you increase item sizes and overwrite all documents, then the smaller size classes are more or less completely freed (logically; internally at least tcmalloc will hold some in thread caches). I think you can make it worse by leaving a significant count (say 1-2%) per size class allocated (i.e. by not overwriting some docs). This is where I expect both jemalloc and tcmalloc to be worse than just plain glibc malloc (which is worth testing too; but be sure to check whether it has had any major revisions, which I think it has, so RHEL5 vs. RHEL6 vs. RHEL7 might give you different behavior).
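A rough sketch of the "pinned size classes" workload described above (parameters arbitrary; the interesting measurement is RSS after each pass):

    // Walk through power-of-two size classes, allocating a batch per class and
    // freeing all but ~2% of it. The survivors pin pages in every size class,
    // which is where fragmentation tends to show up.
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main() {
        std::vector<void*> pinned;              // ~2% survivors from every size class
        const size_t blocks_per_class = 10000;  // arbitrary

        for (size_t sz = 64; sz <= 64 * 1024; sz *= 2) {
            std::vector<void*> batch;
            batch.reserve(blocks_per_class);
            for (size_t i = 0; i < blocks_per_class; ++i) {
                batch.push_back(std::malloc(sz));
            }
            for (size_t i = 0; i < batch.size(); ++i) {
                if (i % 50 == 0) {
                    pinned.push_back(batch[i]);  // keep ~2% alive
                } else {
                    std::free(batch[i]);
                }
            }
            std::printf("size class %zu done, %zu blocks pinned so far\n", sz, pinned.size());
        }

        // RSS here, with only the pinned survivors alive, is the number to compare
        // across allocators.
        for (size_t i = 0; i < pinned.size(); ++i) {
            std::free(pinned[i]);
        }
        return 0;
    }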
Comment by Dave Rigby [ 07/Aug/14 ]
@Alk: Agreed - and in fact I was planning on adding a separate workload which does essentially that. I say a separate benchmark because there are a few arguably orthogonal "troublesome" aspects for allocators.

The PathoGen workload previously mentioned was trying to investigate the RSS overhead that occurs when allocators "hang on" to memory after the application has finished with it - and also to look at the overhead associated with this - my intent was to model what customers V and F have seen when the OOM-killer takes out memcached while large amounts of memory are in the various TCMalloc free lists.

For the size-class problem which you describe, I believe ultimately this /will/ require some form of active defragmentation inside memcached/ep_engine - regardless of the allocation policy at object-creation time, you cannot predict what objects will/won't still be in memory later on.

I hope to hack together a "pyramid scheme" workload like you describe tomorrow, and we can see how the various allocators behave.

Comment by Dave Rigby [ 07/Aug/14 ]
@Alk: Additionally, I have some other numbers with revAB_sim (heavy-append workload, building a reverse address book) here: https://docs.google.com/spreadsheets/d/1JhsmpvRXGS9hmks2sY-Pz8obCllpOHPoVBsQ6J4ZuDg/edit#gid=0

They are less conclusive, but do show that TCMalloc gains a speedup (compared to jemalloc), at the cost of RSS size, particularly when looking at jemalloc [narenas:1] which is probably the more sensible configuration given our alloc/free of documents from different threads.

Arguably the most interesting thing is the terrible performance of glibc (Ubuntu 14.04) in terms of RSS usage...
Comment by Dave Rigby [ 12/Aug/14 ]
Toy build with "aggressive decommit" enabled: http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-daver-x86_64_3.0.0-700-toy.rpm
Comment by Dave Rigby [ 15/Aug/14 ]
Note the aforementioned toy build suffered from the TCMalloc -DSMALL_BUT_SLOW bug (MB-11961), so the results are somewhat moot. Having said that, we can do exactly the same test without a modified build by simply setting the env var TCMALLOC_AGGRESSIVE_DECOMMIT=t.
Comment by Dave Rigby [ 15/Aug/14 ]
I've conducted a further set of tests using PathoGen; specifically, I've expanded it to address some of the feedback from people on this thread by adding a "frozen" mode where a small percentage of documents are frozen at a given size after each iteration. Results are here: https://docs.google.com/document/d/1sgE9LFfT5ZD4FbSZqCuUtzOoLu5BFi1Kf63R9kYuAyY/edit#heading=h.55etcgxxrj3a

Most interesting is probably the graph on page seven (direct link: https://docs.google.com/spreadsheets/d/1JhsmpvRXGS9hmks2sY-Pz8obCllpOHPoVBsQ6J4ZuDg/edit#gid=319463280)

I won't repeat the full analysis from the doc, but suffice to say that either TCMalloc with the "aggressive decommit" or jemalloc show significantly reduced memory overhead compared to the current default we use.
Comment by Aleksey Kondratenko [ 15/Aug/14 ]
See also http://www.couchbase.com/issues/browse/MB-11974
Comment by Dave Rigby [ 19/Sep/14 ]
memcached:

http://review.couchbase.org/41485 - jemalloc: Implement release_free_memory
http://review.couchbase.org/41486 - jemalloc: Report free {un,}mapped size
http://review.couchbase.org/41487 - Add 'enable_thread_cache' call to hooks API
http://review.couchbase.org/41488 - Add alloc_hooks API to mock server
http://review.couchbase.org/41489 - Add get_mapped_bytes() and release_free_memory() to testHarness
http://review.couchbase.org/41490 - jemalloc: Implement mem_used tracking using experimental hooks
http://review.couchbase.org/41491 - Add run_defragmenter_task to ENGINE API.

ep_engine:

http://review.couchbase.org/41494 - MB-10496 [1/6]: Initial version of HashTable defragmenter
http://review.couchbase.org/41495 - MB-10496 [2/6]: Implement run_defragmenter_task for ep_engine
http://review.couchbase.org/41496 - MB-10496 [3/6]: Unit test for degragmenter task
http://review.couchbase.org/41497 - MB-10496 [4/6]: Add epoch field to Blob; use as part of defragmenter policy
http://review.couchbase.org/41498 - MB-10496 [5/6]: pause/resume visitor support for epStore & HashTable
http://review.couchbase.org/41499 - MB-10496 [6/6]: Use pause/resume visitor for defragmenter task
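For context on the pause/resume visitor changes above, the general shape of the interface is sketched below (class and method names are invented for illustration; the actual ep-engine code is in the reviews listed):

    // Hypothetical sketch of a pause/resume visitor: the defragmenter walks the
    // hash table in small slices, reallocating values so the allocator can pack
    // them more densely, and yields between slices instead of running to completion.
    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Item {
        std::string key;
        std::string value;
    };

    class PauseResumeVisitor {
    public:
        virtual ~PauseResumeVisitor() {}
        // Return false to request a pause; the walk resumes later from the next item.
        virtual bool visit(Item& item) = 0;
    };

    class ToyHashTable {
    public:
        // Visit items from `start`; returns the index to resume from, or
        // items.size() once the full pass is complete.
        size_t pauseResumeVisit(PauseResumeVisitor& visitor, size_t start) {
            for (size_t i = start; i < items.size(); ++i) {
                if (!visitor.visit(items[i])) {
                    return i + 1;
                }
            }
            return items.size();
        }
        std::vector<Item> items;
    };

    class DefragVisitor : public PauseResumeVisitor {
    public:
        explicit DefragVisitor(size_t budget) : remaining(budget) {}
        bool visit(Item& item) override {
            // Copy-and-swap forces a fresh allocation for the value.
            std::string(item.value).swap(item.value);
            return --remaining > 0;   // pause when the per-slice budget is spent
        }
        size_t remaining;
    };

    int main() {
        ToyHashTable ht;
        for (int i = 0; i < 10; ++i) {
            ht.items.push_back(Item{std::to_string(i), std::string(128, 'x')});
        }
        size_t pos = 0;
        while (pos < ht.items.size()) {
            DefragVisitor v(4);                 // visit 4 items per slice
            pos = ht.pauseResumeVisit(v, pos);
            std::printf("paused/finished at index %zu\n", pos);
        }
        return 0;
    }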
Comment by Cihan Biyikoglu [ 19/Sep/14 ]
Discussed this with David H, David R and Chiyoung. Let's shoot for Sherlock.




[MB-7761] Add stats for all operations to memcached Created: 15/Feb/13  Updated: 27/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 1.8.1, 2.0, 2.5.1, 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: supportability
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Attachments: File stats-improvements.md    
Issue Links:
Duplicate
is duplicated by MB-11986 Stats for every operations. (prepend ... Resolved
Relates to
relates to MB-8793 Prepare spec on stats updates Resolved
Sub-Tasks:
Key Summary Type Status Assignee
MB-5011 gat (get and touch) operation not rep... Technical task In Progress Mike Wiederhold  
MB-6121 More operation stats please Technical task Resolved Mike Wiederhold  
MB-7711 UI: Getandlock doesn't show up in any... Technical task Closed Mike Wiederhold  
MB-7807 aggregate all kinds of ops in ops/sec... Technical task Open Mike Wiederhold  
MB-8183 getAndTouch (and touch) operations ar... Technical task Resolved Aleksey Kondratenko  
MB-10377 getl and cas not reported in the GUI ... Technical task Resolved Aleksey Kondratenko  
MB-11655 Stats: Getandlock doesn't show up in ... Technical task In Progress Mike Wiederhold  

 Description   
Stats have increasingly been an issue to deal with since they are half done in memcached and half done in ep-engine. Memcached should simply handle connections and not really care or track anything operation related. This stuff should happen in the engines and memcached should just ask for it when it needs the info.

 Comments   
Comment by Tug Grall (Inactive) [ 01/May/13 ]
Just to be sure they are linked. I'll let the engineering team choose how to deal with this JIRA.
Comment by Perry Krug [ 07/Jul/14 ]
Raising awareness of this broad supportability issue, which sometimes makes it hard for the field and customers to accurately understand their Couchbase traffic.
Comment by Mike Wiederhold [ 03/Sep/14 ]
Trond,

I've attached the design document for this issue. Last time we discussed this you mentioned that you would take on the task of implementing this in memcached. Once you're finished I will coordinate the rest of the ns_server/ep-engine changes.
Comment by Perry Krug [ 20/Sep/14 ]
Mike, just came across another situation I hope we can resolve with this.

It seems that in the current implementation, CAS operations (even on set) are not included in the cmd_set statistic. So when looking at the UI, even with a very high load of set+CAS, nothing is recorded in the graphs.

Given that with the binary protocol CAS can be supplied on any operation, can we augment the way we track statistics so that an operation is counted under its underlying operation whether or not it has a CAS? And then also have the break-out "CAS operations/hits/misses" statistics count all operations that have a CAS supplied?

Thanks
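A minimal sketch of the counting scheme suggested above (hypothetical counters, not the actual memcached/ep-engine stats): the operation is always counted under its base command, and operations that carry a CAS are additionally counted in the CAS break-out:

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    struct OpStats {
        std::atomic<uint64_t> cmd_set{0};
        std::atomic<uint64_t> cas_hits{0};
        std::atomic<uint64_t> cas_misses{0};
    };

    static OpStats stats;

    // cas == 0 means no CAS was supplied; `matched` is whether the stored CAS matched.
    void record_set(uint64_t cas, bool matched) {
        stats.cmd_set.fetch_add(1, std::memory_order_relaxed);   // base command, always
        if (cas != 0) {
            (matched ? stats.cas_hits : stats.cas_misses)
                .fetch_add(1, std::memory_order_relaxed);        // CAS break-out on top
        }
    }

    int main() {
        record_set(0, true);            // plain set -> cmd_set only
        record_set(0x1234, true);       // set with matching CAS -> cmd_set + cas_hits
        record_set(0x1234, false);      // set with stale CAS -> cmd_set + cas_misses
        std::printf("cmd_set=%llu cas_hits=%llu cas_misses=%llu\n",
                    static_cast<unsigned long long>(stats.cmd_set.load()),
                    static_cast<unsigned long long>(stats.cas_hits.load()),
                    static_cast<unsigned long long>(stats.cas_misses.load()));
        return 0;
    }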
Comment by Anil Kumar [ 22/Oct/14 ]
Mike / Trond - I'm guessing this is not yet implemented. Mike, you have put together a design doc for this issue. Can you please confirm?

Comment by Mike Wiederhold [ 22/Oct/14 ]
Yes, as I mentioned in my comment above the design doc is attached to this ticket.
Comment by Mike Wiederhold [ 22/Oct/14 ]
Also, this is partially finished. Please refer to the resolved tickets for the things that have been done already.
Comment by Perry Krug [ 22/Oct/14 ]
Should we not then leave this ticket as open? Can you enumerate what hasn't yet been implemented?
Comment by Trond Norbye [ 23/Oct/14 ]
Everything in memcached is implemented and may be viewed by running mctimings and mcstat. I believe that the only pending issue is to add logic to ns_server.




[MB-12520] Investigate kernel tunings for SSD/FusionIO performance Created: 30/Oct/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Task Priority: Major
Reporter: Perry Krug Assignee: Thomas Anderson
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Recently at a customer we managed to gain a significant disk write performance boost by setting these two kernel parameters (on RHEL):
kernel.sched_min_granularity_ns = 25000000
kernel.sched_migration_cost = 5000000

Keep in mind that you need to already be saturating the CPU but not the disk subsystem, so this will need to be run on enterprise SSDs and/or FusionIO while pushing >150k writes/sec to disk per node.

Coming from: https://access.redhat.com/sites/default/files/attachments/2012_perf_brief-low_latency_tuning_for_rhel6_0.pdf
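For reference, the two settings can be applied at runtime by writing the values under /proc/sys (the programmatic equivalent of sysctl -w). A small sketch, assuming RHEL 6 era paths (newer kernels rename the second tunable to sched_migration_cost_ns); the values also need to go into /etc/sysctl.conf to survive a reboot:

    // Requires root. Equivalent to:
    //   sysctl -w kernel.sched_min_granularity_ns=25000000
    //   sysctl -w kernel.sched_migration_cost=5000000
    #include <fstream>
    #include <iostream>
    #include <string>

    static bool write_tunable(const std::string& path, const std::string& value) {
        std::ofstream f(path);
        if (!f) {
            std::cerr << "cannot open " << path << " (not root, or tunable missing?)\n";
            return false;
        }
        f << value << "\n";
        return static_cast<bool>(f);
    }

    int main() {
        bool ok = true;
        ok = write_tunable("/proc/sys/kernel/sched_min_granularity_ns", "25000000") && ok;
        ok = write_tunable("/proc/sys/kernel/sched_migration_cost", "5000000") && ok;
        return ok ? 0 : 1;
    }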




[MB-7965] implement fast flushing of buckets (was: bucket-flush takes over 8 seconds to complete on an empty bucket) Created: 25/Mar/13  Updated: 27/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 2.0.1, 2.1.0, 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Critical
Reporter: Perry Krug Assignee: Dave Finlay
Resolution: Unresolved Votes: 0
Labels: PM-PRIORITIZED, devX, ns_server-story, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
Simply running bucket-flush from the CLI takes over 8 seconds to return when run against a bucket with no items in it.

 Comments   
Comment by Aleksey Kondratenko [ 25/Mar/13 ]
This is duplicated somewhere. Mike currently owns this issue.
Comment by Maria McDuff (Inactive) [ 22/Apr/13 ]
Hi Mike, do you have the original bug number? If you can locate it, can you update this bug so I can close this one? Thanks.
Comment by Mike Wiederhold [ 22/Apr/13 ]
I didn't know we had another one. Also moving to 2.1.
Comment by Dipti Borkar [ 07/May/13 ]
Can't find any duplicate bug for this issue.
Comment by Dipti Borkar [ 07/May/13 ]
Duplicate of MB-6232
Comment by Aleksey Kondratenko [ 07/May/13 ]
The last customer (and restricted) comment actually looks like some unrelated bug. Please proceed with the CBSE and feel free to assign it to me.
Comment by Matt Ingenthron [ 11/Jul/13 ]
I think you mean MB-6232, but this is not currently believed to be a duplicate of that issue. That issue is specifically about the ep-engine level and this issue is more about the overall bucket flush feature.

Note that tests indicate even with the storage as ramdisk, it still takes multiple seconds.
Comment by Aleksey Kondratenko [ 11/Jul/13 ]
Currently believed to be caused by "erlang bits". Requires investigation by our team.
Comment by Maria McDuff (Inactive) [ 08/Oct/13 ]
Alk,

any update on this bug?
Comment by Aleksey Kondratenko [ 08/Oct/13 ]
No updates. Simply requires some time for investigation. Given the 3.0 scope, it might or might not happen.
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Meenakshi,

Are we seeing this latency on 3.0?
Please update. If so, please assign to Alk, not to me. Thanks.
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - 06/04/2014 Alk, Wayne, Parag, Anil
Comment by Anil Kumar [ 04/Jun/14 ]
Meenakshi - Can you provide update?
Comment by Meenakshi Goel [ 05/Jun/14 ]
Below are the steps tried to reproduce the issue with the latest 3.0 build; please let me know if something else needs to be done or was missed:
OS: CentOS 6.4

#/opt/couchbase/bin/couchbase-cli bucket-create --bucket="default" --bucket-type=couchbase --bucket-port=11211 --bucket-ramsize=200 --bucket-replica=1 --enable-flush=1 --cluster=172.23.107.20:8091 -u Administrator -p password
SUCCESS: bucket-create
#cat /opt/couchbase/VERSION.txt
3.0.0-779-rel
# time /opt/couchbase/bin/couchbase-cli bucket-flush --bucket="default" --cluster=172.23.107.20:8091 -u Administrator -p password
Running this command will totally PURGE database data from disk.Do you really want to do it? (Yes/No)Yes
Database data will be purged from disk ...
SUCCESS: bucket-flush

real 0m9.299s
user 0m0.074s
sys 0m0.034s
Comment by Matt Ingenthron [ 23/Jun/14 ]
Note that flush has always been very long and asynchronous, which makes it impossible to reliably integrate into test cycles. We get significant user feedback on the impact on development.

Delete/create is not a workaround, as it too is long and asynchronous.

The best workaround known to date is to reduce the number of vbuckets to something like 4.

I'm not sure if the issue reported here is common, but that would be a good is/is-not item to check on this issue.
Comment by Aleksey Kondratenko [ 29/Sep/14 ]
Let's finally get this resolved.

My gut feeling is that the 8 seconds has nothing to do with ns_server's actions and is likely due to the multiple fsyncs done as part of flushing everything.

In order to investigate this I'll need a VM that shows that 9-second flush time. Because on my box:

*) with barriers on it takes far longer (and we know it's due to fsyncs) (and I believe there should be a ticket for that for ep-engine/storage)

*) without barriers it takes much less than 8 seconds.
Comment by Anil Kumar [ 03/Nov/14 ]
Wayne - Any update on this ?
Comment by Aleksey Kondratenko [ 14/Nov/14 ]
Folks, this should be simple. No need to bounce this around.

Just _give me vm_ that (reportedly) has this issue. That's why this bug was given to Wayne.
Comment by Thuan Nguyen [ 14/Nov/14 ]
I will try to reproduce it today and give Alk a VM if it hits this bug.
Comment by Thuan Nguyen [ 14/Nov/14 ]
I did reproduce the issue with barriers off on build 3.0.2-1528 on CentOS 6.5. Flushing an empty default bucket took 8 seconds.
Gave the VM to Alk to investigate it more.
Comment by Aleksey Kondratenko [ 14/Nov/14 ]
don't need vm anymore
Comment by Aleksey Kondratenko [ 14/Nov/14 ]
So here are my findings. On that Xen VM, setting nobarrier makes flush take about 8 seconds (and a lot more without it). But that's still mostly fsync-related, because running Couchbase under eatmydata (which interposes fsync and disables it) lowered this even further. On that Xen VM it was down to about 2.5 seconds. That's still not fast, however.

So I retried on my machine, and on my machine but inside a VM (KVM, however). On my box eatmydata lowers the flush time down to 0.7 seconds (otherwise it's two-and-something seconds without barriers, but I'm on xfs). Inside a CentOS 6 VM on my box it is somewhat slower: about 1.1-1.2 seconds.

Which appears to suggest that there might be context switches (or maybe plain syscalls) that are comparatively more expensive on VMs.

Perf appears to confirm it (from my box):

perf stat -a /opt/couchbase/bin/couchbase-cli bucket-flush -c 127.0.0.1:8091 --force -u Administrator -p asdasd --bucket=default
Database data will be purged from disk ...
SUCCESS: bucket-flush

 Performance counter stats for 'system wide':

       5702.241103 task-clock (msec) # 8.028 CPUs utilized [100.00%]
            32,157 context-switches # 0.006 M/sec [100.00%]
               561 cpu-migrations # 0.098 K/sec [100.00%]
             5,070 page-faults # 0.889 K/sec
     3,419,669,465 cycles # 0.600 GHz [100.00%]
     4,790,349,052 stalled-cycles-frontend # 140.08% frontend cycles idle [100.00%]
   <not supported> stalled-cycles-backend
     2,490,689,165 instructions # 0.73 insns per cycle
                                                  # 1.92 stalled cycles per insn [100.00%]
       541,952,083 branches # 95.042 M/sec [100.00%]
        14,838,698 branch-misses # 2.74% of all branches

       0.710297094 seconds time elapsed

You can see 32k context switches in a span of 0.7 seconds. That's on average one context switch per ~20k cycles (per core it's about 160k cycles between switches):

irb(main):001:0> 32157/0.7
=> 45938.571428571435
irb(main):002:0> 1E9/_
=> 21768.19976987903
irb(main):003:0>

So indeed a bit on the often side.

top under continuous flushing appears to confirm high kernel space cpu utilization:

PRC | sys 1.22s | user 1.64s | | | #proc 272 | | #trun 3 | #tslpi 771 | #tslpu 0 | | #zombie 1 | clones 12/s | | | #exit 10/s |
CPU | sys 113% | user 158% | | irq 4% | | | idle 524% | wait 1% | | | steal 0% | guest 0% | | avgf 1.80GHz | avgscal 69% |
cpu | sys 16% | user 28% | | irq 2% | | | idle 54% | cpu006 w 0% | | | steal 0% | guest 0% | | avgf 1.89GHz | avgscal 72% |
cpu | sys 16% | user 19% | | irq 1% | | | idle 65% | cpu000 w 0% | | | steal 0% | guest 0% | | avgf 1.87GHz | avgscal 71% |
cpu | sys 16% | user 20% | | irq 0% | | | idle 64% | cpu004 w 0% | | | steal 0% | guest 0% | | avgf 1.81GHz | avgscal 69% |
cpu | sys 10% | user 25% | | irq 1% | | | idle 64% | cpu001 w 0% | | | steal 0% | guest 0% | | avgf 1.73GHz | avgscal 66% |
cpu | sys 12% | user 19% | | irq 2% | | | idle 67% | cpu003 w 0% | | | steal 0% | guest 0% | | avgf 1.71GHz | avgscal 65% |
cpu | sys 18% | user 14% | | irq 0% | | | idle 68% | cpu007 w 0% | | | steal 0% | guest 0% | | avgf 1.86GHz | avgscal 71% |
cpu | sys 14% | user 14% | | irq 0% | | | idle 72% | cpu005 w 0% | | | steal 0% | guest 0% | | avgf 1.70GHz | avgscal 65% |
cpu | sys 12% | user 16% | | irq 0% | | | idle 72% | cpu002 w 0% | | | steal 0% | guest 0% | | avgf 1.73GHz | avgscal 66% |
CPL | avg1 2.35 | | avg5 0.97 | | avg15 1.02 | | | | csw 44562/s | | intr 2003/s | | | | numcpu 8 |
MEM | tot 7.7G | | free 1.8G | cache 3.1G | | dirty 23.1M | buff 0.4M | | slab 396.6M | | | | | | |
SWP | tot 15.1G | | free 15.1G | | | | | | | | | | vmcom 7.4G | | vmlim 18.9G |
DSK | sda | | busy 13% | read 0/s | | write 20/s | KiB/r 0 | | KiB/w 30 | MBr/s 0.00 | | MBw/s 0.59 | avq 8.00 | | avio 6.40 ms |
NET | transport | tcpi 7426/s | | tcpo 7426/s | udpi 203/s | udpo 0/s | tcpao 2/s | | tcppo 2/s | tcprs 0/s | tcpie 0/s | tcpor 0/s | | udpnp 0/s | udpip 0/s |
NET | network | | ipi 7472/s | ipo 7426/s | | ipfrw 0/s | deliv 7471/s | | | | | | icmpi 0/s | | icmpo 0/s |
NET | eth0 0% | pcki 46/s | | pcko 0/s | si 205 Kbps | | so 0 Kbps | coll 0/s | mlti 0/s | | erri 0/s | erro 0/s | | drpi 0/s | drpo 0/s |
NET | lo ---- | pcki 7426/s | | pcko 7426/s | si 5516 Kbps | | so 5516 Kbps | coll 0/s | mlti 0/s | | erri 0/s | erro 0/s | | drpi 0/s | drpo 0/s |

  PID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
25168 couchbas couchbas 21 0.98s 0.73s 0K 0K 0K 19488K -- - S 6 173% memcached
25088 couchbas couchbas 30 0.13s 0.57s 2572K 2748K 0K 1204K -- - S 3 71% beam.smp
25054 couchbas couchbas 29 0.03s 0.08s 0K 48K 0K 40K -- - S 6 11% beam.smp
12526 root root 4 0.00s 0.05s 0K 0K 0K 0K -- - S 6 5% virt-manager
26031 root - 0 0.00s 0.04s 0K 0K - - NE 0 E - 4% <python>
 1120 root root 1 0.01s 0.02s 20K 0K 0K 0K -- - S 3 3% Xorg

Comment by Aleksey Kondratenko [ 17/Nov/14 ]
Played with this a bit more. Use of the memcached flush command enables much faster flushes. An arguably nice but somewhat debatable thing is that flush doesn't wait for disk anymore (and due to the death of mccouch we don't have to). So flush itself is quick, even if memcached is very busy afterwards persisting it.

On the other hand, not waiting for disk makes flush slightly unreliable in the face of memcached crashes, which might be a big deal for non-test use cases of flush.

The older approach, which uses sync vbucket deletions, is actually supposed to be completely robust (modulo bugs of course, but I'm aware of none). Currently, ns_server marks the flush as done _in a persistent fashion_ after all vbuckets are deleted. And if memcached crashes before the vbuckets are deleted, ns_server will actually retry deletion of the vbuckets until it succeeds.

So with the new implementation, if vbucket deletions are not persisted before the flush is marked as done, ns_server might happily activate vbuckets which are supposed to be flushed.

I've uploaded a couple of commits that implement additional ways to flush local vbuckets here:

* http://review.couchbase.org/43331
* http://review.couchbase.org/43332

The first implements flush via the flush command, but it needs slight ep-engine modifications to work (i.e. to enable flush and to make it work when traffic is disabled). The second implements flush via quicker vbucket deletions, which works out of the box but is slower than the flush command due to having to create the vbuckets back.

Based on that I suggest:

* introduce a new privileged command to flush a bucket. It should require the _admin user and be enabled all the time. It should have a "sync" option which makes it return only after the flush is persisted.

* change ns_server to use that command.

* add a flush REST request option allowing the user to choose between sync and async flushing. We can also add an ns_server setting for the default flush behavior (i.e. sync or async).

Comment by Aleksey Kondratenko [ 17/Nov/14 ]
Passing to Dave given that my team cannot implement this faster flushing alone.
Comment by Aleksey Kondratenko [ 17/Nov/14 ]
CC-ed the TAG head as well. This is a good candidate improvement for Sherlock. But given the deadlines we have, we might want to move quite fast on that.




[MB-12796] ep-engine segfaults when there is an unsupported key Created: 27/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0.1
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Major
Reporter: uris200 Assignee: David Haikney
Resolution: Unresolved Votes: 0
Labels: I/O_Priority, server, warmup
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04.4 LTS

Attachments: Zip Archive all_4_logs_private_part1.1.zip     Zip Archive all_4_logs_private_part1.2.zip     Zip Archive all_4_logs_private_part1.zip     Zip Archive all_4_logs_private_part2.zip    
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
After changing one bucket's I/O priority from Low (default) to High, all nodes and all buckets were in a constant loop of starting up for 48 hours.

The issue was with having the workload_optimization key set; when the I/O priority was changed it caused the bucket to be restarted and to pick up the workload_optimization key.

The steps to reproduce:
Create a one node cluster
Set "workload_optimization" on the default bucket
wget -O- --user=administrator --password=password --post-data='ns_bucket:update_bucket_props("default", [{extra_config_string, "workload_optimization=write"}]).' http://localhost:8091/diag/eval
Change the I/O priority, which causes the bucket to restart and memcached to segfault.

Expectation:
ep-engine should ignore the unsupported key (throw a warning in the logs) and continue with warmup instead of segfaulting.
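A sketch of the expected behaviour (hypothetical parser, not the actual ep-engine configuration code): unknown keys in the extra config string get a warning and are skipped, rather than taking down warmup:

    #include <cstdio>
    #include <map>
    #include <set>
    #include <string>

    // Illustrative subset of recognised keys (the real engine has many more).
    static const std::set<std::string> supported_keys = {"ht_size", "ht_locks", "max_size"};

    static std::map<std::string, std::string> parse_config(const std::string& cfg) {
        std::map<std::string, std::string> accepted;
        size_t pos = 0;
        while (pos < cfg.size()) {
            size_t end = cfg.find(';', pos);
            if (end == std::string::npos) end = cfg.size();
            const std::string pair = cfg.substr(pos, end - pos);
            const size_t eq = pair.find('=');
            if (eq != std::string::npos) {
                const std::string key = pair.substr(0, eq);
                if (supported_keys.count(key)) {
                    accepted[key] = pair.substr(eq + 1);
                } else {
                    // Warn and carry on instead of aborting (or crashing) warmup.
                    std::fprintf(stderr, "Warning: ignoring unsupported config key '%s'\n",
                                 key.c_str());
                }
            }
            pos = end + 1;
        }
        return accepted;
    }

    int main() {
        std::map<std::string, std::string> cfg =
            parse_config("ht_size=13;workload_optimization=write;ht_locks=7");
        std::printf("accepted %zu keys\n", cfg.size());
        return 0;
    }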




[MB-12793] Intermittent lockup in ep-engine unit test test_flush_multiv_restart Created: 27/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: .master
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Major
Reporter: Dave Rigby Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04

Issue Links:
Relates to
relates to MB-12792 Intermittent lockup in ep-engine unit... Open
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
During the ep-engine commit-validation, I occasionally see a lockup in test_flush_multiv_restart.

There appears to be some kind of race between the Flusher thread (Thread 6 below) completing Flusher::completeFlush, and shutting down a vbucket.

    (gdb) thread apply all bt

    Thread 12 (Thread 0x2ad1c412d700 (LWP 23173)):
    #0 0x00002ad1bf693dbd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
    #1 0x00002ad1bf6c1dd4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
    #2 0x00002ad1c1a8b294 in updateStatsThread (arg=0x2ad1c0c32f00) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/memory_tracker.cc:36
    #3 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c24020) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #4 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c412d700) at pthread_create.c:308
    #5 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #6 0x0000000000000000 in ?? ()

    Thread 11 (Thread 0x2ad1c5537700 (LWP 23184)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002ad1bef6bf95 in cb_cond_timedwait (cond=0x2ad1c0c5f340, mutex=0x2ad1c0c5f308, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002ad1c1a7baa1 in SyncObject::wait (this=0x2ad1c0c5f300, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002ad1c1ab5fda in TaskQueue::_doSleep (this=0x2ad1c0c5f300, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002ad1c1ab60cb in TaskQueue::_fetchNextTask (this=0x2ad1c0c5f300, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002ad1c1ab647d in TaskQueue::fetchNextTask (this=0x2ad1c0c5f300, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002ad1c1a78947 in ExecutorPool::_nextTask (this=0x2ad1c0e05600, t=..., tick=157 '\235') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002ad1c1a789d5 in ExecutorPool::nextTask (this=0x2ad1c0e05600, t=..., tick=157 '\235') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002ad1c1a8e630 in ExecutorThread::run (this=0x2ad1c0c38460) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c38460) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c24940) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c5537700) at pthread_create.c:308
    #12 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 10 (Thread 0x2ad1c5336700 (LWP 23185)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002ad1bef6bf95 in cb_cond_timedwait (cond=0x2ad1c0c5f340, mutex=0x2ad1c0c5f308, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002ad1c1a7baa1 in SyncObject::wait (this=0x2ad1c0c5f300, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002ad1c1ab5fda in TaskQueue::_doSleep (this=0x2ad1c0c5f300, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002ad1c1ab60cb in TaskQueue::_fetchNextTask (this=0x2ad1c0c5f300, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002ad1c1ab647d in TaskQueue::fetchNextTask (this=0x2ad1c0c5f300, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002ad1c1a78947 in ExecutorPool::_nextTask (this=0x2ad1c0e05600, t=..., tick=160 '\240') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002ad1c1a789d5 in ExecutorPool::nextTask (this=0x2ad1c0e05600, t=..., tick=160 '\240') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002ad1c1a8e630 in ExecutorThread::run (this=0x2ad1c0c38540) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c38540) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c24950) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c5336700) at pthread_create.c:308
    #12 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 9 (Thread 0x2ad1c5135700 (LWP 23186)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002ad1bef6bf95 in cb_cond_timedwait (cond=0x2ad1c0c5f340, mutex=0x2ad1c0c5f308, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002ad1c1a7baa1 in SyncObject::wait (this=0x2ad1c0c5f300, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002ad1c1ab5fda in TaskQueue::_doSleep (this=0x2ad1c0c5f300, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002ad1c1ab60cb in TaskQueue::_fetchNextTask (this=0x2ad1c0c5f300, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002ad1c1ab647d in TaskQueue::fetchNextTask (this=0x2ad1c0c5f300, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002ad1c1a78947 in ExecutorPool::_nextTask (this=0x2ad1c0e05600, t=..., tick=157 '\235') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002ad1c1a789d5 in ExecutorPool::nextTask (this=0x2ad1c0e05600, t=..., tick=157 '\235') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002ad1c1a8e630 in ExecutorThread::run (this=0x2ad1c0c38620) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c38620) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c24960) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c5135700) at pthread_create.c:308
    #12 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 8 (Thread 0x2ad1c4f34700 (LWP 23187)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002ad1bef6bf95 in cb_cond_timedwait (cond=0x2ad1c0c5f340, mutex=0x2ad1c0c5f308, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002ad1c1a7baa1 in SyncObject::wait (this=0x2ad1c0c5f300, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002ad1c1ab5fda in TaskQueue::_doSleep (this=0x2ad1c0c5f300, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002ad1c1ab60cb in TaskQueue::_fetchNextTask (this=0x2ad1c0c5f300, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002ad1c1ab647d in TaskQueue::fetchNextTask (this=0x2ad1c0c5f300, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002ad1c1a78947 in ExecutorPool::_nextTask (this=0x2ad1c0e05600, t=..., tick=176 '\260') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002ad1c1a789d5 in ExecutorPool::nextTask (this=0x2ad1c0e05600, t=..., tick=176 '\260') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002ad1c1a8e630 in ExecutorThread::run (this=0x2ad1c0c38700) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c38700) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c24970) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c4f34700) at pthread_create.c:308
    #12 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 7 (Thread 0x2ad1c4d33700 (LWP 23188)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002ad1bef6bf95 in cb_cond_timedwait (cond=0x2ad1c0c5f7c0, mutex=0x2ad1c0c5f788, ms=773) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002ad1c1a7baa1 in SyncObject::wait (this=0x2ad1c0c5f780, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002ad1c1ab5ff3 in TaskQueue::_doSleep (this=0x2ad1c0c5f780, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:94
    #4 0x00002ad1c1ab60cb in TaskQueue::_fetchNextTask (this=0x2ad1c0c5f780, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002ad1c1ab647d in TaskQueue::fetchNextTask (this=0x2ad1c0c5f780, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002ad1c1a78947 in ExecutorPool::_nextTask (this=0x2ad1c0e05600, t=..., tick=97 'a') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002ad1c1a789d5 in ExecutorPool::nextTask (this=0x2ad1c0e05600, t=..., tick=97 'a') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002ad1c1a8e630 in ExecutorThread::run (this=0x2ad1c0c387e0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c387e0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c24980) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c4d33700) at pthread_create.c:308
    #12 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 6 (Thread 0x2ad1c4b32700 (LWP 23189)):
    #0 VBucketMap::isBucketCreation (this=0x2ad1c0c4c028, id=0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/vbucketmap.cc:130
    #1 0x00002ad1c1a390f6 in EventuallyPersistentStore::flushVBucket (this=0x2ad1c0c4c000, vbid=0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep.cc:2846
    #2 0x00002ad1c1a84d4a in Flusher::flushVB (this=0x2ad1c0c63780) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/flusher.cc:283
    #3 0x00002ad1c1a84828 in Flusher::completeFlush (this=0x2ad1c0c63780) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/flusher.cc:219
    #4 0x00002ad1c1a845eb in Flusher::step (this=0x2ad1c0c63780, task=0x2ad1c0c332d0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/flusher.cc:190
    #5 0x00002ad1c1ab45d9 in FlusherTask::run (this=0x2ad1c0c332d0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/tasks.cc:44
    #6 0x00002ad1c1a8e90d in ExecutorThread::run (this=0x2ad1c0c388c0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:110
    #7 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c388c0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #8 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c24990) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #9 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c4b32700) at pthread_create.c:308
    #10 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #11 0x0000000000000000 in ?? ()

    Thread 5 (Thread 0x2ad1c4931700 (LWP 23190)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002ad1bef6bf95 in cb_cond_timedwait (cond=0x2ad1c0c5f7c0, mutex=0x2ad1c0c5f788, ms=773) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002ad1c1a7baa1 in SyncObject::wait (this=0x2ad1c0c5f780, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002ad1c1ab5ff3 in TaskQueue::_doSleep (this=0x2ad1c0c5f780, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:94
    #4 0x00002ad1c1ab60cb in TaskQueue::_fetchNextTask (this=0x2ad1c0c5f780, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002ad1c1ab647d in TaskQueue::fetchNextTask (this=0x2ad1c0c5f780, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002ad1c1a78947 in ExecutorPool::_nextTask (this=0x2ad1c0e05600, t=..., tick=144 '\220') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002ad1c1a789d5 in ExecutorPool::nextTask (this=0x2ad1c0e05600, t=..., tick=144 '\220') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002ad1c1a8e630 in ExecutorThread::run (this=0x2ad1c0c389a0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c389a0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c249a0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c4931700) at pthread_create.c:308
    #12 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 4 (Thread 0x2ad1c4730700 (LWP 23191)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002ad1bef6bf95 in cb_cond_timedwait (cond=0x2ad1c0c5f7c0, mutex=0x2ad1c0c5f788, ms=773) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002ad1c1a7baa1 in SyncObject::wait (this=0x2ad1c0c5f780, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002ad1c1ab5ff3 in TaskQueue::_doSleep (this=0x2ad1c0c5f780, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:94
    #4 0x00002ad1c1ab60cb in TaskQueue::_fetchNextTask (this=0x2ad1c0c5f780, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002ad1c1ab647d in TaskQueue::fetchNextTask (this=0x2ad1c0c5f780, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002ad1c1a78947 in ExecutorPool::_nextTask (this=0x2ad1c0e05600, t=..., tick=195 '\303') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002ad1c1a789d5 in ExecutorPool::nextTask (this=0x2ad1c0e05600, t=..., tick=195 '\303') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002ad1c1a8e630 in ExecutorThread::run (this=0x2ad1c0c38a80) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c38a80) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c249b0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c4730700) at pthread_create.c:308
    #12 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 3 (Thread 0x2ad1c452f700 (LWP 23192)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002ad1bef6bf95 in cb_cond_timedwait (cond=0x2ad1c0c5f4c0, mutex=0x2ad1c0c5f488, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002ad1c1a7baa1 in SyncObject::wait (this=0x2ad1c0c5f480, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002ad1c1ab5fda in TaskQueue::_doSleep (this=0x2ad1c0c5f480, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002ad1c1ab60cb in TaskQueue::_fetchNextTask (this=0x2ad1c0c5f480, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002ad1c1ab647d in TaskQueue::fetchNextTask (this=0x2ad1c0c5f480, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002ad1c1a78947 in ExecutorPool::_nextTask (this=0x2ad1c0e05600, t=..., tick=155 '\233') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002ad1c1a789d5 in ExecutorPool::nextTask (this=0x2ad1c0e05600, t=..., tick=155 '\233') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002ad1c1a8e630 in ExecutorThread::run (this=0x2ad1c0c38b60) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c38b60) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c249c0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c452f700) at pthread_create.c:308
    #12 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 2 (Thread 0x2ad1c432e700 (LWP 23193)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002ad1bef6bf95 in cb_cond_timedwait (cond=0x2ad1c0c5f640, mutex=0x2ad1c0c5f608, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002ad1c1a7baa1 in SyncObject::wait (this=0x2ad1c0c5f600, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002ad1c1ab5fda in TaskQueue::_doSleep (this=0x2ad1c0c5f600, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002ad1c1ab60cb in TaskQueue::_fetchNextTask (this=0x2ad1c0c5f600, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002ad1c1ab647d in TaskQueue::fetchNextTask (this=0x2ad1c0c5f600, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002ad1c1a78947 in ExecutorPool::_nextTask (this=0x2ad1c0e05600, t=..., tick=167 '\247') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002ad1c1a789d5 in ExecutorPool::nextTask (this=0x2ad1c0e05600, t=..., tick=167 '\247') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002ad1c1a8e630 in ExecutorThread::run (this=0x2ad1c0c38c40) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002ad1c1a8e1cd in launch_executor_thread (arg=0x2ad1c0c38c40) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002ad1bef6bb4a in platform_thread_wrap (arg=0x2ad1c0c249d0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002ad1bf3bde9a in start_thread (arg=0x2ad1c432e700) at pthread_create.c:308
    #12 0x00002ad1bf6c831d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 1 (Thread 0x2ad1c05b4780 (LWP 23172)):
    #0 0x00002ad1bf693dbd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
    #1 0x00002ad1bf6c1dd4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
    #2 0x00002ad1c1a83cfd in Flusher::wait (this=0x2ad1c0c63780) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/flusher.cc:47
    #3 0x00002ad1c1a2f408 in EventuallyPersistentStore::stopFlusher (this=0x2ad1c0c4c000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep.cc:418
    #4 0x00002ad1c1a2eda4 in EventuallyPersistentStore::~EventuallyPersistentStore (this=0x2ad1c0c4c000, __in_chrg=<optimized out>) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep.cc:379
    #5 0x00002ad1c1a658ff in EventuallyPersistentEngine::~EventuallyPersistentEngine (this=0x2ad1c0c72000, __in_chrg=<optimized out>) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep_engine.cc:5736
    #6 0x00002ad1c1a50294 in EvpDestroy (handle=0x2ad1c0c72000, force=false) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep_engine.cc:143
    #7 0x0000000000401c6c in mock_destroy (handle=0x609480, force=false) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:63
    #8 0x0000000000403d8a in destroy_engine (force=false) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:989
    #9 0x0000000000403dd1 in reload_engine (h=0x7fff45122a18, h1=0x7fff45122a10, engine=0x7fff451232ad "ep.so", cfg=0x2ad1c0c58840 "flushall_enabled=true;", init=true, force=false)
        at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:997
    #10 0x00002ad1c103091e in test_flush_multiv_restart (h=0x609480, h1=0x609480) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/tests/ep_testsuite.cc:1405
    #11 0x0000000000403fbe in execute_test (test=..., engine=0x7fff451232ad "ep.so", default_cfg=0x7fff451232c9 "flushall_enabled=true;ht_size=13;ht_locks=7")
        at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:1055
    #12 0x0000000000404704 in main (argc=9, argv=0x7fff45122d28) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:1313

You can see that Thread 6 is just looping forever trying to flush the VBucket:

(gdb) next
    218 while(!canSnooze()) {
    (gdb)
    219 flushVB();
    (gdb)
    218 while(!canSnooze()) {
    (gdb)
    219 flushVB();
    (gdb)
    218 while(!canSnooze()) {
    (gdb)
    219 flushVB();

Stepping into this, we eventually end up with flushVBucket() returning RETRY_FLUSH_VBUCKET, as it appears the vbucket is still marked as being created:

(gdb) step
    EventuallyPersistentStore::flushVBucket (this=0x2ad1c0c4c000, vbid=0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep.cc:2836
    2836 KVShard *shard = vbMap.getShard(vbid);
    (gdb) next
    2837 if (diskFlushAll) {
    (gdb)
    2846 if (vbMap.isBucketCreation(vbid)) {
    (gdb)
    2847 return RETRY_FLUSH_VBUCKET;
    (gdb)
    2976 }
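
To spell out what the spin amounts to: flushVB() keeps calling flushVBucket(), which keeps returning RETRY_FLUSH_VBUCKET because the "bucket creation" flag for the vbucket never clears, so canSnooze() never becomes true. A minimal, self-contained C++ sketch of that pattern (a hypothetical simplification for illustration only - the names mirror the gdb session above, not the actual ep-engine source):

    #include <cstdio>

    enum flush_result_t { FLUSH_OK, RETRY_FLUSH_VBUCKET };

    // Assumption for the sketch: the creation flag is never cleared,
    // which is what the gdb session above suggests.
    static bool bucket_creation = true;

    static flush_result_t flushVBucket() {
        if (bucket_creation) {
            return RETRY_FLUSH_VBUCKET;   // mirrors ep.cc:2846-2847 above
        }
        return FLUSH_OK;                  // (the real code persists dirty items here)
    }

    int main() {
        int attempts = 0;
        // Stand-in for "while (!canSnooze()) flushVB();" seen in Thread 6.
        // Bounded here so the sketch terminates; in the bug it never does.
        while (flushVBucket() == RETRY_FLUSH_VBUCKET && attempts < 5) {
            std::printf("retry %d\n", ++attempts);
        }
        return 0;
    }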




 Comments   
Comment by Dave Rigby [ 27/Nov/14 ]
Note: Command-line for failing test (as of 1a9ae4ae0b33c83f91e74337a0e2b9d567a74875):

    build/memcached/engine_testapp -E ep.so -T ep_testsuite.so -e "flushall_enabled=true;ht_size=13;ht_locks=7" -C 138




[MB-12792] Intermittent lockup in ep-engine unit tests during unregisterBucket Created: 27/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: .master
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Major
Reporter: Dave Rigby Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Jenkins ep-engine CV build - Ubuntu 12.04 x64 - see http://factory.couchbase.com/job/ep-engine-gerrit-master/128/

Issue Links:
Relates to
relates to MB-12793 Intermittent lockup in ep-engine unti... Open
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
The Jenkins commit-validation build for ep-engine is still randomly timing out. Investigating it, I found that ep-engine appears to have deadlocked during shutdown - we have somehow managed to end up with an ExecutorThread which, after stop() has been called on it, still has a state of EXECUTOR_SLEEPING.

GDB session of the issue - note my comments [DJR] inline:

(gdb) bt
    #0 0x00002b59d6f89148 in pthread_join (threadid=47664944314112, thread_return=0x0) at pthread_join.c:89
    #1 0x00002b59d6b35c78 in cb_join_thread (id=47664944314112) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:60
    #2 0x00002b59d968e537 in ExecutorThread::stop (this=0x2b59d8838540, wait=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:63
    #3 0x00002b59d967a68b in ExecutorPool::_unregisterBucket (this=0x2b59d8a05600, engine=0x2b59d8872000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:574
    #4 0x00002b59d967a89e in ExecutorPool::unregisterBucket (this=0x2b59d8a05600, engine=0x2b59d8872000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:602
    #5 0x00002b59d962edbc in EventuallyPersistentStore::~EventuallyPersistentStore (this=0x2b59d884c000, __in_chrg=<optimized out>) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep.cc:380
    #6 0x00002b59d96658ff in EventuallyPersistentEngine::~EventuallyPersistentEngine (this=0x2b59d8872000, __in_chrg=<optimized out>) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep_engine.cc:5736
    #7 0x00002b59d9650294 in EvpDestroy (handle=0x2b59d8872000, force=false) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep_engine.cc:143
    #8 0x0000000000401c6c in mock_destroy (handle=0x609480, force=false) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:63
    #9 0x0000000000403d8a in destroy_engine (force=false) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:989
    #10 0x0000000000404015 in execute_test (test=..., engine=0x7fff265372af "ep.so", default_cfg=0x7fff265372cb "flushall_enabled=true;ht_size=13;ht_locks=7")
        at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:1061
    #11 0x0000000000404704 in main (argc=9, argv=0x7fff26535ef8) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:1313
    (gdb) f 2
    #2 0x00002b59d968e537 in ExecutorThread::stop (this=0x2b59d8838540, wait=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:63
    63 cb_join_thread(thread);
    (gdb) p this->state
    $1 = {value = EXECUTOR_SLEEPING}
    [DJR] ^^^^^^^^^^^^^^^^^
    [DJR] Note that the value is EXECUTOR_SLEEPING, even though we have just (atomically?) set it to SHUTDOWN at line 58:
    (gdb) list -
    53
    54 void ExecutorThread::stop(bool wait) {
    55 if (!wait && (state == EXECUTOR_SHUTDOWN || state == EXECUTOR_DEAD)) {
    56 return;
    57 }
    58 state = EXECUTOR_SHUTDOWN;
    59 if (!wait) {
    60 LOG(EXTENSION_LOG_INFO, "%s: Stopping", name.c_str());
    61 return;
    62 }
    (gdb) thread apply all bt

    Thread 6 (Thread 0x2b59dbd2d700 (LWP 12679)):
    #0 0x00002b59d725ddbd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
    #1 0x00002b59d728bdd4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
    #2 0x00002b59d968b294 in updateStatsThread (arg=0x2b59d8832f00) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/memory_tracker.cc:36
    #3 0x00002b59d6b35b4a in platform_thread_wrap (arg=0x2b59d8824020) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #4 0x00002b59d6f87e9a in start_thread (arg=0x2b59dbd2d700) at pthread_create.c:308
    #5 0x00002b59d729231d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #6 0x0000000000000000 in ?? ()

    Thread 5 (Thread 0x2b59dc12f700 (LWP 12681)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002b59d6b35f95 in cb_cond_timedwait (cond=0x2b59d885f4c0, mutex=0x2b59d885f488, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002b59d967baa1 in SyncObject::wait (this=0x2b59d885f480, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002b59d96b5fda in TaskQueue::_doSleep (this=0x2b59d885f480, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002b59d96b60cb in TaskQueue::_fetchNextTask (this=0x2b59d885f480, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002b59d96b647d in TaskQueue::fetchNextTask (this=0x2b59d885f480, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002b59d9678947 in ExecutorPool::_nextTask (this=0x2b59d8a05600, t=..., tick=121 'y') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002b59d96789d5 in ExecutorPool::nextTask (this=0x2b59d8a05600, t=..., tick=121 'y') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002b59d968e630 in ExecutorThread::run (this=0x2b59d8838540) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002b59d968e1cd in launch_executor_thread (arg=0x2b59d8838540) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002b59d6b35b4a in platform_thread_wrap (arg=0x2b59d8824a20) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002b59d6f87e9a in start_thread (arg=0x2b59dc12f700) at pthread_create.c:308
    #12 0x00002b59d729231d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 4 (Thread 0x2b59dc531700 (LWP 12683)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002b59d6b35f95 in cb_cond_timedwait (cond=0x2b59d885f4c0, mutex=0x2b59d885f488, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002b59d967baa1 in SyncObject::wait (this=0x2b59d885f480, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002b59d96b5fda in TaskQueue::_doSleep (this=0x2b59d885f480, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002b59d96b60cb in TaskQueue::_fetchNextTask (this=0x2b59d885f480, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002b59d96b647d in TaskQueue::fetchNextTask (this=0x2b59d885f480, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002b59d9678947 in ExecutorPool::_nextTask (this=0x2b59d8a05600, t=..., tick=121 'y') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002b59d96789d5 in ExecutorPool::nextTask (this=0x2b59d8a05600, t=..., tick=121 'y') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002b59d968e630 in ExecutorThread::run (this=0x2b59d8838700) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002b59d968e1cd in launch_executor_thread (arg=0x2b59d8838700) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002b59d6b35b4a in platform_thread_wrap (arg=0x2b59d8824a30) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002b59d6f87e9a in start_thread (arg=0x2b59dc531700) at pthread_create.c:308
    #12 0x00002b59d729231d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 3 (Thread 0x2b59dc933700 (LWP 12685)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002b59d6b35f95 in cb_cond_timedwait (cond=0x2b59d885f340, mutex=0x2b59d885f308, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002b59d967baa1 in SyncObject::wait (this=0x2b59d885f300, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002b59d96b5fda in TaskQueue::_doSleep (this=0x2b59d885f300, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002b59d96b60cb in TaskQueue::_fetchNextTask (this=0x2b59d885f300, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002b59d96b647d in TaskQueue::fetchNextTask (this=0x2b59d885f300, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002b59d9678947 in ExecutorPool::_nextTask (this=0x2b59d8a05600, t=..., tick=121 'y') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002b59d96789d5 in ExecutorPool::nextTask (this=0x2b59d8a05600, t=..., tick=121 'y') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002b59d968e630 in ExecutorThread::run (this=0x2b59d88388c0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002b59d968e1cd in launch_executor_thread (arg=0x2b59d88388c0) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002b59d6b35b4a in platform_thread_wrap (arg=0x2b59d8824a50) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002b59d6f87e9a in start_thread (arg=0x2b59dc933700) at pthread_create.c:308
    #12 0x00002b59d729231d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 2 (Thread 0x2b59dcf36700 (LWP 12688)):
    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
    #1 0x00002b59d6b35f95 in cb_cond_timedwait (cond=0x2b59d885f640, mutex=0x2b59d885f608, ms=2000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:156
    #2 0x00002b59d967baa1 in SyncObject::wait (this=0x2b59d885f600, tv=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/syncobject.h:74
    #3 0x00002b59d96b5fda in TaskQueue::_doSleep (this=0x2b59d885f600, t=...) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:92
    #4 0x00002b59d96b60cb in TaskQueue::_fetchNextTask (this=0x2b59d885f600, t=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:117
    #5 0x00002b59d96b647d in TaskQueue::fetchNextTask (this=0x2b59d885f600, thread=..., toSleep=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/taskqueue.cc:161
    #6 0x00002b59d9678947 in ExecutorPool::_nextTask (this=0x2b59d8a05600, t=..., tick=121 'y') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:214
    #7 0x00002b59d96789d5 in ExecutorPool::nextTask (this=0x2b59d8a05600, t=..., tick=121 'y') at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:229
    #8 0x00002b59d968e630 in ExecutorThread::run (this=0x2b59d8838b60) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:78
    #9 0x00002b59d968e1cd in launch_executor_thread (arg=0x2b59d8838b60) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:34
    #10 0x00002b59d6b35b4a in platform_thread_wrap (arg=0x2b59d8824a80) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:19
    #11 0x00002b59d6f87e9a in start_thread (arg=0x2b59dcf36700) at pthread_create.c:308
    #12 0x00002b59d729231d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #13 0x0000000000000000 in ?? ()

    Thread 1 (Thread 0x2b59d817e780 (LWP 12678)):
    #0 0x00002b59d6f89148 in pthread_join (threadid=47664944314112, thread_return=0x0) at pthread_join.c:89
    #1 0x00002b59d6b35c78 in cb_join_thread (id=47664944314112) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/platform/src/cb_pthreads.c:60
    #2 0x00002b59d968e537 in ExecutorThread::stop (this=0x2b59d8838540, wait=true) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorthread.cc:63
    #3 0x00002b59d967a68b in ExecutorPool::_unregisterBucket (this=0x2b59d8a05600, engine=0x2b59d8872000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:574
    #4 0x00002b59d967a89e in ExecutorPool::unregisterBucket (this=0x2b59d8a05600, engine=0x2b59d8872000) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/executorpool.cc:602
    #5 0x00002b59d962edbc in EventuallyPersistentStore::~EventuallyPersistentStore (this=0x2b59d884c000, __in_chrg=<optimized out>) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep.cc:380
    #6 0x00002b59d96658ff in EventuallyPersistentEngine::~EventuallyPersistentEngine (this=0x2b59d8872000, __in_chrg=<optimized out>) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep_engine.cc:5736
    #7 0x00002b59d9650294 in EvpDestroy (handle=0x2b59d8872000, force=false) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/ep-engine/src/ep_engine.cc:143
    #8 0x0000000000401c6c in mock_destroy (handle=0x609480, force=false) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:63
    #9 0x0000000000403d8a in destroy_engine (force=false) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:989
    #10 0x0000000000404015 in execute_test (test=..., engine=0x7fff265372af "ep.so", default_cfg=0x7fff265372cb "flushall_enabled=true;ht_size=13;ht_locks=7")
        at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:1061
    #11 0x0000000000404704 in main (argc=9, argv=0x7fff26535ef8) at /home/jenkins/jenkins/workspace/ep-engine-gerrit-master/memcached/programs/engine_testapp/engine_testapp.c:1313

    [DJR] Note that we *do* appear to perform an atomic compare-exchange when sleeping:
    (gdb) list TaskQueue::_doSleep
    72 numToWake -= sleepers;
    73 }
    74 }
    75 }
    76
    77 bool TaskQueue::_doSleep(ExecutorThread &t) {
    78 gettimeofday(&t.now, NULL);
    79 if (less_tv(t.now, t.waketime) && manager->trySleep(queueType)) {
    80 // Atomically switch from running to sleeping; iff we were previously
    81 // running.
    (gdb)
    82 executor_state_t expected_state = EXECUTOR_RUNNING;
    83 if (!t.state.compare_exchange_strong(expected_state,
    84 EXECUTOR_SLEEPING)) {
    85 return false;
    86 }
    ... <snip>


 Comments   
Comment by Dave Rigby [ 27/Nov/14 ]
Inspecting the ep-engine code we *appear* to be correct - transitioning from RUNNING -> SLEEPING and back in the thread itself uses compare_exchange_strong(), but clearly something is going wrong.

I note that Ubuntu 12.04 (the platform here) uses gcc 4.6, which, while it does support <atomic>[1], doesn't fully support C++11, and hence we configure the build to *not* use std::atomic - see [2]. As a consequence we use our own CouchbaseAtomic class. Any chance there's some bug in our implementation? How do people feel about using std::atomic whenever it's present - even if that means relying on compilers that fall slightly short of full C++11 compliance?

[1]: https://gcc.gnu.org/projects/cxx0x.html
[2]: http://src.couchbase.org/source/xref/trunk/ep-engine/src/config_static.h#120
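
For context, a minimal sketch of how the RUNNING -> SLEEPING / SHUTDOWN transitions discussed above could be expressed directly with std::atomic (the enum values are taken from the listings in this ticket; everything else is illustrative and not the actual ExecutorThread code):

    #include <atomic>
    #include <cassert>

    enum executor_state_t { EXECUTOR_RUNNING, EXECUTOR_SLEEPING,
                            EXECUTOR_SHUTDOWN, EXECUTOR_DEAD };

    struct ExecutorThreadSketch {
        std::atomic<executor_state_t> state{EXECUTOR_RUNNING};

        // Worker side: only transition to sleeping if still running
        // (same shape as the compare_exchange_strong in TaskQueue::_doSleep).
        bool trySleep() {
            executor_state_t expected = EXECUTOR_RUNNING;
            return state.compare_exchange_strong(expected, EXECUTOR_SLEEPING);
        }

        // Stopper side: the store and the CAS above operate on the same
        // std::atomic object, so they cannot silently step on each other
        // the way a hand-rolled, non-atomic load()/store() could.
        void requestShutdown() { state.store(EXECUTOR_SHUTDOWN); }
    };

    int main() {
        ExecutorThreadSketch t;
        t.requestShutdown();
        assert(!t.trySleep());                      // CAS fails: state != RUNNING
        assert(t.state.load() == EXECUTOR_SHUTDOWN);
        return 0;
    }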
Comment by Dave Rigby [ 27/Nov/14 ]
I've dug into this a bit further - I thought I'd run Clang's ThreadSanitizer on this test, with CouchbaseAtomic enabled. When I do, ThreadSanitizer reports CouchbaseAtomic.load() and .store() as being non-atomic(!).

It's possible that, given x86's strong memory ordering, CouchbaseAtomic *is* valid on x86, but assuming ThreadSanitizer is correct it isn't valid in the general case.

The more I dig into this the more I think we should just use std::atomic when present (GCC 4.4+, MSVC 2012+, Clang 3.1) - which is true for Ubuntu 10.04+, RHEL 6+ and OS X. The only notable omission is RHEL5.




[MB-12795] MacOSX build is unavailable after 3.0.2-1582-rel Created: 27/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0.2
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Meenakshi Goel Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
MacOSX build seems to be unavailable after 3.0.2-1582-rel

http://latestbuilds.hq.couchbase.com/index_3.0.2.html




[MB-12729] Replica index is not triggered even when partition sequence > replicaUpdateMinChanges set Created: 20/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0.2
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Major
Reporter: Meenakshi Goel Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.2-1542-rel

Triage: Triaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Ref Link:
http://qa.sc.couchbase.com/job/centos_x64-02-01-viewquery2-P0/153/consoleFull

Test To Reproduce:
python testrunner.py -i myfile.ini -t view.viewquerytests.ViewQueryTests.test_employee_dataset_min_changes_check -p max-dupe-result-count=10,get-cbcollect-info=True,num-tries=60,attempt-num=60,get-delays=True

Steps to Reproduce:
1. Create views with the updateMinChanges and replicaUpdateMinChanges options
2. Load less data than the configured changes threshold
3. Check that indexing is not started
4. Load more data than the changes threshold
5. Check that indexing is triggered

Uploading Logs


 Comments   
Comment by Meenakshi Goel [ 20/Nov/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-12729/8586d8eb/172.23.106.61-11192014-2327-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12729/d9aca249/172.23.106.63-11192014-2328-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12729/439baa5e/172.23.106.62-11192014-2329-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12729/d42ed0d3/172.23.106.64-11192014-2329-diag.zip
Comment by Nimish Gupta [ 21/Nov/14 ]
This issue is not reproducible on my machine. I tried on an AWS machine as well and got different errors on my dev setup and the AWS machine. Currently debugging on Meenakshi's setup using a toy build.
Comment by Nimish Gupta [ 24/Nov/14 ]
This issue is intermittently reproducible on Meenakshi's setup. The number-of-changes condition for triggering the updater is not met for the replica partitions.

After running the toy build with a debug message:
[couchdb:info,2014-11-24T4:31:12.688,ns_1@172.23.107.20:<0.21691.0>:couch_log:info:41]condition check is false 1974 2000

I don't yet know why the count is lower; I am looking into it.
Comment by Nimish Gupta [ 25/Nov/14 ]
After running the test with additional debug logs, it looks like the views did not receive enough mutations from ep-engine itself to trigger the update condition. The views are already updated up to the ep-engine vbucket stats. Due to slow replication, ep-engine may not have the mutations yet.

The test outputs the following message while running:
Partition sequence is: 2225
This is a confusing message: the partition sequence is the sum of the sequence numbers for the active vbuckets only, not for the replica partitions.
The replica partitions are still behind, so indexing is not triggered for them.

The ep-engine team should look into why ep-engine does not have enough mutations in the replica partitions. I think it is due to slow replication.



Comment by Sriram Melkote [ 25/Nov/14 ]
Chiyoung, can you please help? Based on Nimish's comments, it appears the mutations this test was expecting did not reach the replica. Given this test used to pass in the past, I would appreciate your help in understanding the root cause of this problem.
Comment by Chiyoung Seo [ 25/Nov/14 ]
Sriram,

Can you please look at the DCP stats on the producer side to see if there are any backfill tasks that are not completed yet or any items in the replication queue?
Comment by Sriram Ganesan [ 26/Nov/14 ]
The ep_dcp_items_remaining and ep_dcp_queue_backfillremaining stats are both 0 in the logs.
Comment by Sriram Ganesan [ 26/Nov/14 ]
The stats seem to suggest that there are no items remaining to be sent. Having said that, there is a possibility that a backfill task is continuously snoozing. The logs don't reflect that because it is logged only at INFO level for memcached.

Meenakshi

Would it be possible to reproduce this problem with INFO level logging in memcached? Please upload the logs once you are able to do so.

Thanks
Sriram
Comment by Meenakshi Goel [ 27/Nov/14 ]
Sriram, can you please suggest how to enable INFO level logging in memcached so that I can provide you the desired logs? Thanks.




[MB-11917] One node slow probably due to the Erlang scheduler Created: 09/Aug/14  Updated: 27/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Volker Mische Assignee: Harsha Havanur
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File crash_toy_701.rtf     PNG File leto_ssd_300-1105_561_build_init_indexleto_ssd_300-1105_561172.23.100.31beam.smp_cpu.png    
Issue Links:
Duplicate
duplicates MB-12200 Seg fault during indexing on view-toy... Resolved
duplicates MB-12579 View Index DGM 20% Regression (Initia... Resolved
duplicates MB-9822 One of nodes is too slow during indexing Closed
is duplicated by MB-12183 View Query Thruput regression compare... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
One node is slow, that's probably due to the "scheduler collapse" bug in the Erlang VM R16.

I will try to find a way to verify that it is really the scheduler and not some other problem. This is basically a duplicate of MB-9822, but that bug has a long history, hence I dared to create a new one.

 Comments   
Comment by Volker Mische [ 09/Aug/14 ]
I forgot to add that our issue sounds exactly like that one: http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
Comment by Sriram Melkote [ 11/Aug/14 ]
Upgrading to blocker as this is doubling initial index time in recent runs on showfast.
Comment by Volker Mische [ 12/Aug/14 ]
I verified that it's the "scheduler collapse". Have a look at the chart I've attached (it's from [1], [172.23.100.31] beam.smp_cpu). It starts with a utilization of around 400%. At around 120 I reduced the online schedulers to 1 (by running erlang:system_flag(schedulers_online, 1) via a remote shell). I then increased schedulers_online again at around 150, back to the original value of 24. You can see that it got back to normal.

[1]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1105_561_build_init_index
Comment by Volker Mische [ 12/Aug/14 ]
I would try to run on R16 and see how often it happens with COUCHBASE_NS_SERVER_VM_EXTRA_ARGS=["+swt", "low", "+sfwi", "100"] set (as suggested in MB-9822 [1]).

[1]: https://www.couchbase.com/issues/browse/MB-9822?focusedCommentId=89219&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-89219
Comment by Pavel Paulau [ 12/Aug/14 ]
We agreed to try:

+sfwi 100/500 and +sbwt long

Will run test 5 times with these options.
Comment by Pavel Paulau [ 13/Aug/14 ]
5 runs of tests/index_50M_dgm.test with -sfwi 100 -sbwt long:

http://ci.sc.couchbase.com/job/leto-dev/19/
http://ci.sc.couchbase.com/job/leto-dev/20/
http://ci.sc.couchbase.com/job/leto-dev/21/
http://ci.sc.couchbase.com/job/leto-dev/22/
http://ci.sc.couchbase.com/job/leto-dev/23/

3 normal runs, 2 with slowness.
Comment by Volker Mische [ 13/Aug/14 ]
I see only one slow run (22): http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1137_6a0_build_init_index

But still :-/
Comment by Pavel Paulau [ 13/Aug/14 ]
See (20), incremental indexing: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1137_ed9_build_incr_index
Comment by Volker Mische [ 13/Aug/14 ]
Oh, I was only looking at the initial building.
Comment by Volker Mische [ 13/Aug/14 ]
I got a hint in the #erlang IRC channel. I'll try to use the erlang:bump_reductions(2000) and see if that helps.
Comment by Volker Mische [ 13/Aug/14 ]
Let's see if bumping the reductions makes it work: http://review.couchbase.org/40591
Comment by Aleksey Kondratenko [ 13/Aug/14 ]
merged that commit.
Comment by Pavel Paulau [ 13/Aug/14 ]
Just tested build 3.0.0-1150: the rebalance test, but with an initial indexing phase.

2 nodes are super slow and utilize only a single core.
Comment by Volker Mische [ 18/Aug/14 ]
I can't reproduce it locally. I tend towards closing this issue as "won't fix". We should really not have long-running NIFs.

I also think that it won't happen much under real workloads. And even if it does, the workaround would be to reduce the number of online schedulers to 1 and then immediately increase it back to the original number.
Comment by Volker Mische [ 18/Aug/14 ]
Assigning to Siri to make the call on whether we close it or not.
Comment by Anil Kumar [ 18/Aug/14 ]
Triage - Not blocking 3.0 RC1
Comment by Raju Suravarjjala [ 19/Aug/14 ]
Triage: Siri will put additional information and this bug is being retargeted to 3.0.1
Comment by Sriram Melkote [ 19/Aug/14 ]
Folks, for too long we've had trouble that gets pinned to our NIFs. In 3.5, let's solve this with whatever is the correct Erlang approach to running heavy, high-performance code. A port, reporting reductions, moving to R17 with dirty schedulers, or some other option I missed - whatever is the best solution, let us implement it in 3.5 and be done.
Comment by Volker Mische [ 09/Sep/14 ]
I think we should close this issue and rather create a new one for whatever we come up with (e.g. the async mapreduce NIF).
Comment by Harsha Havanur [ 10/Sep/14 ]
Toy Build for this change at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_ubunt12-3.0.0-toy-hhs-x86_64_3.0.0-702-toy.deb

Review in progress at
http://review.couchbase.org/#/c/41221/4
Comment by Harsha Havanur [ 12/Sep/14 ]
Please find the updated toy build for this:
http://latestbuilds.hq.couchbase.com/couchbase-server-community_ubunt12-3.0.0-toy-hhs-x86_64_3.0.0-704-toy.deb
Comment by Sriram Melkote [ 12/Sep/14 ]
Another occurrence of this, MB-12183.

I'm making this a blocker.
Comment by Harsha Havanur [ 13/Sep/14 ]
Centos build at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-hhs-x86_64_3.0.0-700-toy.rpm
Comment by Ketaki Gangal [ 16/Sep/14 ]
Filed bug MB-12200 for this toy-build
Comment by Ketaki Gangal [ 17/Sep/14 ]
Attaching stack from toy-build 701
File

crash_toy_701.rtf

Access to machine is as mentioned previously on MB-12200.
Comment by Harsha Havanur [ 19/Sep/14 ]
We are facing 2 issues with the async NIF implementation.
1) Loss of signals leading to a deadlock between enqueue and dequeue on the queues.
I suspect the enif mutex and condition variables. I could reproduce the deadlock scenario on CentOS, which potentially points to both the producer and the consumer (enqueue and dequeue) in our case going to sleep because condition variable signals are not handled correctly.
To address this issue, I have replaced the enif mutex and condition variables with their C++ STL counterparts. This seems to fix the deadlock situation.

2) Memory getting freed by the terminator task while the context is still alive during mapDoc.
This is still work in progress; I will update once I have a solution.
Comment by Harsha Havanur [ 21/Sep/14 ]
The segmentation fault is probably due to termination of the Erlang process calling map_doc. This triggers the destructor, which cleans up the v8 context while the task is still in the queue. Will attempt a fix for this.
Comment by Harsha Havanur [ 22/Sep/14 ]
I have fixed both issues in this build:
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-hhs-x86_64_3.0.0-709-toy.rpm
I am running system tests as Ketaki suggested on VMs 10.6.2.164, 165, 168, 171, 172, 194, 195. Currently a rebalance is in progress.

For the deadlock, the resolution was to broadcast the condition signal to wake up all waiting threads instead of waking up only one of them.
For the segmentation fault, the resolution was to complete the map task for the context before it is cleaned up by the destructor when the Erlang process calling the map task terminates or crashes.

Please use this build for further functional and performance verification. Thanks,
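
The wake-up change described above is essentially the difference between notify_one() and notify_all() on a condition variable: per the earlier comment, both the enqueue and dequeue sides can end up sleeping, so waking only one thread may pick one that cannot make progress. A rough, self-contained C++ STL sketch of a queue drained with notify_all(), illustrative only under those assumptions and not the actual couch_view NIF code:

    #include <condition_variable>
    #include <cstdio>
    #include <deque>
    #include <mutex>
    #include <thread>
    #include <vector>

    static std::mutex mtx;
    static std::condition_variable cv;
    static std::deque<int> tasks;
    static bool done = false;

    static void worker(int id) {
        for (;;) {
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [] { return !tasks.empty() || done; });
            if (tasks.empty() && done) return;     // shut down once drained
            int t = tasks.front();
            tasks.pop_front();
            lk.unlock();
            std::printf("worker %d handled task %d\n", id, t);
        }
    }

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i) pool.emplace_back(worker, i);

        for (int t = 0; t < 8; ++t) {
            { std::lock_guard<std::mutex> lk(mtx); tasks.push_back(t); }
            cv.notify_all();   // broadcast: every waiter re-checks the predicate
        }
        { std::lock_guard<std::mutex> lk(mtx); done = true; }
        cv.notify_all();
        for (auto &th : pool) th.join();
        return 0;
    }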
Comment by Venu Uppalapati [ 01/Oct/14 ]
For performance runs, do the sfwi and sbwt options need to be set? Please provide some guidance on how to set these.
Comment by Volker Mische [ 02/Oct/14 ]
I would try a run without additional options.

In case you want to run with additional options see the comment above. You only need to set the COUCHBASE_NS_SERVER_VM_EXTRA_ARGS environment variable.
Comment by Venu Uppalapati [ 02/Oct/14 ]
2 runs of tests/index_50M_dgm.test with the above toy build. Both are slow.
http://ci.sc.couchbase.com/job/leto-dev/29/
http://ci.sc.couchbase.com/job/leto-dev/28/
Comment by Harsha Havanur [ 08/Oct/14 ]
If we can confirm the indexing slowness is not because of the Erlang scheduler collapse, can we merge these changes and investigate further?
Comment by Volker Mische [ 08/Oct/14 ]
But then we need to confirm that it's not a scheduler issue :)

One way I did it in the past:

1. Monitor the live system, see if one node has low CPU usage while the others perform normally
2. Open a remote Erlang shell to that node (couchbase-cli can do that with the undocumented `server-eshell` command):

    ./couchbase-cli server-eshell -c 127.0.0.1:8091 -u Administrator -p asdasd

3. Run the following Erlang (without the comments, of course):

    %% Get current number of online schedulers
    Schedulers = erlang:system_info(schedulers_online).

    %% Reduce number online to 1
    erlang:system_flag(schedulers_online, 1).

    %% Restore to original number of online schedulers
    erlang:system_flag(schedulers_online, Schedulers).

4. Monitor this node again. If it gets back to normal I'd say it's the scheduler collapse (or at least some scheduler issue).
Comment by Volker Mische [ 09/Oct/14 ]
Look what I've stumbled upon: https://github.com/huiqing/percept2

It's a profiling tool with a useful feature - scheduler activity: the number of active schedulers at any time during the profiling.

There's even a screenshot: http://refactoringtools.github.io/percept2/percept2_scheduler.png

I haven't looked at it closely or tried it, but it sounds promising.

Harsha, I think we should give this tool a try.
Comment by Sriram Melkote [ 29/Oct/14 ]
Status so far is that removing the map and reduce NIFs, which were long suspected of misusing Erlang threads to run heavy operations without reporting proper reductions, has not helped.

The plan of action going forward is:

(a) To reproduce this on EC2 so we are not delayed on availability of Leto
(b) To run the GDB script to detect more details of scheduler thread behavior
(c) To run with R14 locally
Comment by Volker Mische [ 29/Oct/14 ]
Siri, I'm not sure about (c). We've seen a similar issue on R14 in the past, but less frequently, so it would need more runs to reproduce it. I don't think it's a bug in the Erlang VM that was introduced in R16. Anyway, if it's easy to do let's do it, but let's spend more time moving forward rather than backwards :)
Comment by Harsha Havanur [ 05/Nov/14 ]
Making the couch_view_parser NIF async seems to address the Erlang scheduler collapse in this case.
Review in progress at
http://review.couchbase.org/#/c/42821/
Comment by Harsha Havanur [ 05/Nov/14 ]
A toy build for the same is at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-hhs-x86_64_3.0.0-726-toy.rpm
Request QE to run basic functional tests and Query throughput tests on this build.
Comment by Sriram Melkote [ 10/Nov/14 ]
ETA to merge this is 5pm IST on Nov 11
Comment by Harsha Havanur [ 12/Nov/14 ]
Change has been successfully cherry-picked as 3bf0b23892a11299ff5cc25e3d1ebf83e3beec9f
Comment by Volker Mische [ 13/Nov/14 ]
I'm re-opening this issue as the problem is still there even with the async couch_view_parser NIF. It can be seen in the low CPU utilisation of one node [1] (search for "[172.23.100.30] beam.smp_cpu") compared to the others.

[1]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_302-1518_130_build_incr_index
Comment by Sriram Melkote [ 25/Nov/14 ]
Moving to Sherlock per 3.0.2 release meeting today.
Comment by Volker Mische [ 27/Nov/14 ]
Is there a way to create a toy build with a patched Erlang?

I'd like to see a run with a toy build that was built with an Erlang from this branch [1]. This is Erlang R16B03-1 with some backported patches.

The idea comes from an email on the Erlang users mailing list [2].

[1]: https://github.com/rickard-green/otp/commits/rickard/R16B03-1/load_balance/OTP-11385
[2]: http://erlang.org/pipermail/erlang-questions/2014-November/081683.html
Comment by Volker Mische [ 27/Nov/14 ]
Assigning to Ceej to answer the question: Is it possible to have a toy build with a patched Erlang?




[MB-9174] Smart Client version information available from cluster Created: 25/Sep/13  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: clients, ns_server
Affects Version/s: 2.2.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Improvement Priority: Minor
Reporter: David Haikney Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: greenstack, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
As part of the support process we typically capture logs using cbcollect_info. For any support issue involving the clients we have to ask which version of the SDK is in use which adds a delay to the process. It would be useful if the SDK could supply this information to the cluster as some form of signature as part of its initial connection. Then we would need a method for extracting this information from the cluster as part of the cbcollect_info process.

 Comments   
Comment by Michael Nitschinger [ 25/Sep/13 ]
The clients could do that by supplying an x-header in the streaming connection request.

BUT we want to move away from that, so I'm not sure it's straightforward (since there can be no state pushed from the client aside from connecting to something).
Comment by Matt Ingenthron [ 27/Nov/13 ]
Trond: do you think we can add something to authentication so client auth can be logged, including version?
Comment by Trond Norbye [ 21/Dec/13 ]
It would be better to use an explicit HELLO command that is coming in 3.0.
Comment by Matt Ingenthron [ 11/Sep/14 ]
We'd like to support this. With the move to carrier publication and the open questions on HELLO, I'll pass this to Anil for the moment. I'd be glad to get into a discussion about how we'd do this.




[MB-12791] Parallel client for view engine Created: 27/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: sherlock
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Nimish Gupta Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown




[MB-12775] memcached.exe stays at 60-80% CPU, erl.exe stays at ~10% while no activity on database Created: 25/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: memcached, ns_server
Affects Version/s: 3.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: neatcode Assignee: Trond Norbye
Resolution: Unresolved Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 8.1 64bit

Attachments: File savetodb_v1_downloads_last_5_email_messages_to_database.js     File savetodb_v2_downloads_1000_message_headers_to_database.js    
Triage: Untriaged
Operating System: Windows 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: I uploaded the zip file containing the logs to filedropper: http://www.filedropper.com/collectinfo-2014-11-26t021229-ns1127001-copy
Is this a Regression?: Unknown

 Description   
The server remains at extremely high CPU usage indefinitely; no activity is being performed on the database -- it's just sitting.

I tried to attach a log but the following error appears in JIRA: "Cannot attach file collectinfo.zip: Unable to communicate with JIRA" (The file is under 50mb)

 Comments   
Comment by neatcode [ 25/Nov/14 ]
Looks like I can't edit the description. I wanted to add that I tried restarting the service, and the problem resurfaced. I tried restarting the computer, and the problem resurfaced as well. It's been going on for hours now.
Comment by neatcode [ 25/Nov/14 ]
Attempting to attach a screenshot in JIRA just redirected me to the page: "http://www.oracle.com/technetwork/java/index.html"
Comment by neatcode [ 25/Nov/14 ]
I uploaded the zip file containing the logs to filedropper: http://www.filedropper.com/collectinfo-2014-11-26t021229-ns1127001-copy
Comment by Dave Rigby [ 26/Nov/14 ]
Hi neatcode,

Thanks for the logs. Could you give us some more information on what events led to this issue? What SDK(s) were you using, with what kinds of operations? SSL for data enabled/disabled?
Comment by neatcode [ 26/Nov/14 ]
savetodb_v2 is the script I was most recently working with. It saves 1000 email message headers to the database at a time. savetodb_v1 was an earlier version of the script that saved full email messages along with their message headers instead of just the message headers.
Comment by neatcode [ 26/Nov/14 ]
I was using the Node SDK. I've just uploaded the most recent version of the node.js script I was using, as well as an earlier version. I think the most recent version (savetodb_v2..) is the one most likely to have triggered issues, since I didn't notice the CPU spinning until recently. I'm also using N1QL, as you'll see in the source. The code is a mess but it's not very long. It's just me playing around with a few libraries (pipedrive/inbox, andris9/mailparser, and couchnode).

To successfully run the code, if you should wish to do that, you will need to create or use a @yahoo.com email address, fill it with at least a few emails or up to around 2000 emails as in my case, and hardcode that email address and password into the script and run it. The script may work with a different email @domain, but I haven't used pipedrive/inbox with other email domains yet so your mileage may vary.
Comment by neatcode [ 26/Nov/14 ]
How do I determine if SSL for data is enabled or disabled? If I had to guess, I'd say it wasn't enabled since I don't remember ever doing anything that would have explicitly enabled SSL.
Comment by neatcode [ 27/Nov/14 ]
I was thinking... I'd be willing to give you guys all of the data related to Couchbase on my system, including the buckets. If you could give me an FTP server to upload to and tell me what files to send, I'll send them and perhaps you can replicate the problem on your end. One guess is that the data in one of my email message headers (which are stored in a "messages" bucket) wasn't processed correctly by your Node SDK or by Couchbase Server itself, given that email data can vary widely and have different charsets, etc. Of course it could be something else. Anyways, let me know if you'd like me to send more, what to send, and where to send it.

Because I'm working on developing some software, I'm going to remove this Couchbase installation and reinstall so I can get back to that. But I'll hold out for a bit longer to see if you get back to me with a request for files (I don't know what to back up, but I'll probably back up the Couchbase directory first if I *do* reinstall).
Comment by Dave Rigby [ 27/Nov/14 ]
neatcode: If you are still seeing this behaviour (60-80% CPU on memcached.exe while idle) could you try to collect a Minidump of memcached.exe? See http://support.microsoft.com/kb/931673 for details on how to do this.

Additionally, could you restart Couchbase (after taking the minidump) and see if that results in reduced levels of CPU for memcached & erl?




[MB-12673] [system-tests]items count mismatch uni-directional XDCR Created: 16/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 3.0.2
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.2-1520

Attachments: PNG File uni.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
Bidirectional replication for 4 buckets on source 172.23.105.156 and destination 172.23.105.160:

AbRegNums
MsgsCalls
RevAB
UserInfo

The data load ran for more than 2 days.
During this time a large number of steps with different scenarios were executed.
More detailed steps can be found here:
https://github.com/couchbaselabs/couchbase-qe-docs/blob/master/system-tests/viber/build_3.0.2-1520/report.txt

The problem is that I cannot say at what stage the data discrepancy occurred, because I only check that the data matches once the data load is stopped (the last step).

result
source:
AbRegNums 1607045
MsgsCalls 33301
RevAB 35716338
UserInfo 292190

destination:
AbRegNums 1607045
MsgsCalls 33300
RevAB 35716351
UserInfo 292190

diff <(curl http://172.23.105.156:8092/MsgsCalls/_design/docs/_view/docs?inclusive_end=true&stale=false&connection_timeout=60000&skip=0) <(curl http://172.23.105.160:8092/MsgsCalls/_design/docs/_view/docs?inclusive_end=true&stale=update_after&connection_timeout=60000&skip=0)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2664k 0 2664k 0 0 680k 0 --:--:-- 0:00:03 --:--:-- 680k
100 2664k 0 2664k 0 0 321k 0 --:--:-- 0:00:08 --:--:-- 588k
1c1
< {"total_rows":33301,"rows":[
---
> {"total_rows":33300,"rows":[
33244d33243
< {"id":"MSG_owmiixgxuqiptrwjjhzgorkfsvrxcgwsrlmrtxkp_myiunhwwynfqjobdtwffjwoic","key":null,"value":null},

So "MSG_owmiixgxuqiptrwjjhzgorkfsvrxcgwsrlmrtxkp_myiunhwwynfqjobdtwffjwoic" exists on the source but doesn't exist on the destination.

Just in case, I will leave the cluster alive for investigation for a few days.


 Comments   
Comment by Andrei Baranouski [ 16/Nov/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.156-11162014-214-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.157-11162014-235-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.158-11162014-224-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.160-11162014-37-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.206-11162014-33-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.207-11162014-310-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.22-11162014-254-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.159-11162014-245-diag.zip

Comment by Mike Wiederhold [ 17/Nov/14 ]
The expiry pager is running on the destination cluster. This needs to be turned off.

Mikes-MacBook-Pro:ep-engine mikewied$ management/cbstats 172.23.105.207:11210 -b MsgsCalls all | grep exp
 ep_exp_pager_stime: 3600
 ep_expired_access: 0
 ep_expired_pager: 19655
 ep_item_flush_expired: 0
 ep_num_expiry_pager_runs: 53
 vb_active_expired: 19547
 vb_pending_expired: 0
 vb_replica_expired: 108
Comment by Andrei Baranouski [ 18/Nov/14 ]
Sorry Mike, it's not clear to me.

I didn't run any expiry pagers in the tests, so why do I need to turn something off? I used the default settings for the clusters.
When you say "The expiry pager is running on the destination cluster", does that mean that once it completes the items should match? Because that does not happen.

[root@centos-64-x64 bin]# ./cbstats 172.23.105.207:11210 -b MsgsCalls all | grep exp
 ep_exp_pager_stime: 3600
 ep_expired_access: 0
 ep_expired_pager: 19655
 ep_item_flush_expired: 0
 ep_num_expiry_pager_runs: 105
 vb_active_expired: 19547
 vb_pending_expired: 0
 vb_replica_expired: 108
[root@centos-64-x64 bin]# ./cbstats 172.23.105.156:11210 -b MsgsCalls all | grep exp
 ep_exp_pager_stime: 3600
 ep_expired_access: 0
 ep_expired_pager: 8
 ep_item_flush_expired: 0
 ep_num_expiry_pager_runs: 135
 vb_active_expired: 8
 vb_pending_expired: 0
 vb_replica_expired: 0
Comment by Andrei Baranouski [ 18/Nov/14 ]
diff <(curl http://172.23.105.156:8092/MsgsCalls/_design/docs/_view/docs?inclusive_end=true&stale=false&connection_timeout=60000&skip=0) <(curl http://172.23.105.160:8092/MsgsCalls/_design/docs/_view/docs?inclusive_end=true&stale=update_after&connection_timeout=60000&skip=0)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2664k 0 2664k 0 0 680k 0 --:--:-- 0:00:03 --:--:-- 680k
100 2664k 0 2664k 0 0 321k 0 --:--:-- 0:00:08 --:--:-- 588k
1c1
< {"total_rows":33301,"rows":[
---
> {"total_rows":33300,"rows":[
33244d33243
< {"id":"MSG_owmiixgxuqiptrwjjhzgorkfsvrxcgwsrlmrtxkp_myiunhwwynfqjobdtwffjwoic","key":null,"value":null},

The key "MSG_owmiixgxuqiptrwjjhzgorkfsvrxcgwsrlmrtxkp_myiunhwwynfqjobdtwffjwoic" doesn't exist on the destination but exists on the source.
Comment by Mike Wiederhold [ 18/Nov/14 ]
Andrei,

The stat ep_expired_pager_runs shows that the expiry pager is running. You should not have this running on the destination cluster; otherwise items might be deleted. This will cause the rev sequence number to be increased and can result in items not being replicated to the destination. This is a known issue, so you need to re-run the test and make sure that the expiry pager is not running. You can turn off the expiry pager by running the command below on each node.

cbepctl host:port set flush_param exp_pager_stime 0
Comment by Andrei Baranouski [ 18/Nov/14 ]
Thanks Mike,

Could you point me to the ticket for the "known issue"?
So the command should be run on all nodes of the destination cluster only?

BTW, how do we proceed with testing bi-directional XDCR replication? I believe there may also be a problem there.
Comment by Mike Wiederhold [ 18/Nov/14 ]
Yes, for unidirectional you need to disable the expiry pager on the destination nodes. You can leave it on in the source cluster. Also, I don't know of a ticket that specifically relates to this issue, but I discussed it with support and it is known. If I can find something I'll post it here.

The problem is that if the destination cluster has any traffic (in this case expiry counts as traffic) then the rev sequence number will be increased. This can cause the destination node to win conflict resolution and as a result would mean an item from the source would not end up getting to the destination node. At some point this issue would work itself out, but only after the item expired on both sides.

For bi-directional replication this wouldn't be an issue because the destination will replicate back to the source. In the case of this ticket the destination rev id is 74 and the source is 73, so when the destination replicates this item back it will win the conflict resolution.
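
A rough sketch of the comparison being described, assuming for illustration that only the rev seqno decides the winner (the real conflict resolution also looks at other metadata such as CAS):

    #include <cstdint>
    #include <cstdio>

    // Illustrative only: the 74-vs-73 example above, reduced to a rev
    // seqno comparison. DocMeta and resolveConflict are hypothetical names.
    struct DocMeta {
        uint64_t rev_seqno;
    };

    static const DocMeta& resolveConflict(const DocMeta& local,
                                          const DocMeta& incoming) {
        // Higher rev seqno wins; the local copy wins ties in this sketch.
        return (incoming.rev_seqno > local.rev_seqno) ? incoming : local;
    }

    int main() {
        DocMeta source{73}, destination{74};
        // The destination replicates the item back to the source, so the
        // incoming (destination) revision wins on the source side.
        const DocMeta& winner = resolveConflict(source, destination);
        std::printf("winner rev = %llu\n",
                    (unsigned long long)winner.rev_seqno);   // prints 74
        return 0;
    }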
Comment by Andrei Baranouski [ 18/Nov/14 ]
Thanks for the update!
Comment by Andrei Baranouski [ 24/Nov/14 ]
Hi Mike,

 With the above scenario, is it expected that the destination cluster has a doc that the source doesn't have?

source: http://172.23.105.156:8091/index.html#sec=buckets
destination: http://172.23.105.160:8091/index.html#sec=buckets

The following items don't exist on the source:
RAB_222565502766
RAB_222740635920
RAB_222750550473

and one more question:
do we still support vbuckettool in 3.0.0? http://www.couchbase.com/issues/browse/MB-7253?focusedCommentId=98776&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-98776

I see this tool after installing 3.0.2, but it seems like it doesn't work:

curl http://172.23.105.160:8091/pools/default/buckets/RevAB | ./vbuckettool RAB_222565502766
vbuckettool mapfile key0 [key1 ... [keyN]]

  The vbuckettool expects a vBucketServerMap JSON mapfile, and
  will print the vBucketId and servers each key should live on.
  You may use '-' instead for the filename to specify stdin.

  Examples:
    ./vbuckettool file.json some_key another_key

    curl http://HOST:8091/pools/default/buckets/default | \
       ./vbuckettool - some_key another_key
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 12205 100 12205 0 0 434k 0 --:--:-- --:--:-- --:--:-- 458k
curl: (23) Failed writing body (0 != 12205)

Comment by Mike Wiederhold [ 24/Nov/14 ]
Apparently the last I was told we don't support vbuckettool. You should file a separate bug about this and assign it to the PM team.
Comment by Mike Wiederhold [ 24/Nov/14 ]
There appears to be an issue with persistence on the source node. I don't think this could be a DCP problem since there aren't any deletes or expirations in the cluster.
Comment by Mike Wiederhold [ 24/Nov/14 ]
First off, the missing keys are as follows:

Comparing active VBuckets across clusters

 Error found:
139 active 44219 172.23.105.157
139 active 44220 172.23.105.160

 Error found:
653 active 44440 172.23.105.158
653 active 44441 172.23.105.207

 Error found:
788 active 43780 172.23.105.158
788 active 43781 172.23.105.207

If you look at VBucket 653:

Mikes-MacBook-Pro:ep-engine mikewied$ management/cbstats 172.23.105.158:11210 -b RevAB vbucket-details | grep vb_653
 vb_653: active
 vb_653:db_data_size: 4408989
 vb_653:db_file_size: 4878427
 vb_653:high_seqno: 46365
 vb_653:ht_cache_size: 3891306
 vb_653:ht_item_memory: 3891306
 vb_653:ht_memory: 393720
 vb_653:num_ejects: 0
 vb_653:num_items: 44440
 vb_653:num_non_resident: 0
 vb_653:num_temp_items: 0
 vb_653:ops_create: 44440
 vb_653:ops_delete: 0
 vb_653:ops_reject: 0
 vb_653:ops_update: 589
 vb_653:pending_writes: 0
 vb_653:purge_seqno: 0
 vb_653:queue_age: 0
 vb_653:queue_drain: 45029
 vb_653:queue_fill: 45029
 vb_653:queue_memory: 0
 vb_653:queue_size: 0
 vb_653:uuid: 229288576785427
Mikes-MacBook-Pro:ep-engine mikewied$ management/cbstats 172.23.105.207:11210 -b RevAB vbucket-details | grep vb_653
 vb_653: active
 vb_653:db_data_size: 4417903
 vb_653:db_file_size: 5001307
 vb_653:high_seqno: 45467
 vb_653:ht_cache_size: 3917611
 vb_653:ht_item_memory: 3917611
 vb_653:ht_memory: 197032
 vb_653:num_ejects: 0
 vb_653:num_items: 44441
 vb_653:num_non_resident: 0
 vb_653:num_temp_items: 0
 vb_653:ops_create: 44441
 vb_653:ops_delete: 0
 vb_653:ops_reject: 0
 vb_653:ops_update: 1012
 vb_653:pending_writes: 0
 vb_653:purge_seqno: 0
 vb_653:queue_age: 0
 vb_653:queue_drain: 45453
 vb_653:queue_fill: 45453
 vb_653:queue_memory: 0
 vb_653:queue_size: 0
 vb_653:uuid: 228614974178837

We can see above that the number of creates differs between the clusters. What is strange is that the destination node has an extra item even though there are no deletes or expirations. On top of this, the couch files show that the item on the destination cluster never existed on the source.

Src:

[root@centos-64-x64 ~]# /opt/couchbase/bin/couch_dbdump /opt/couchbase/var/lib/couchbase/data/RevAB/653.couch.14 | grep RAB_222740635920 -B 1 -A 6
Dumping "/opt/couchbase/var/lib/couchbase/data/RevAB/653.couch.14":

Total docs: 44440

Dest:

[root@centos-64-x64 ~]# /opt/couchbase/bin/couch_dbdump /opt/couchbase/var/lib/couchbase/data/RevAB/653.couch.24 | grep RAB_222740635920 -B 1 -A 6
Dumping "/opt/couchbase/var/lib/couchbase/data/RevAB/653.couch.24":
Doc seq: 31333
     id: RAB_222740635920
     rev: 1
     content_meta: 131
     size (on disk): 23
     cas: 1416742771886067118, expiry: 0, flags: 0, datatype: 0
     size: 13
     data: (snappy) ,111005375249

Total docs: 44441
Comment by Abhinav Dangeti [ 25/Nov/14 ]
Andrei, is this a regression? Was this seen consistently in 3.0.2?
Was this issue seen in 3.0.1? There is no history here: https://github.com/couchbaselabs/couchbase-qe-docs/tree/master/system-tests/viber

I can't find any trace of the missing items on the source side in the logs, as there seem to be no deletes, no expirations, and the items aren't in Couchstore either. Are we certain that there was absolutely no load on the destination?

I read through the test spec: at the end of each phase, do you wait for replication to catch up, and do you grab stats by any chance? This is to check whether any rebalance caused this data loss.
If you can confirm that this is indeed a regression from 3.0.1, we can take a look at all the changes made for 3.0.2 that could cause this data loss.
Comment by Andrei Baranouski [ 25/Nov/14 ]
Hi Abhinav,

I've never seen this issue before, and I didn't run the tests against any 3.0.1 version: https://github.com/couchbaselabs/couchbase-qe-docs/commits/master/system-tests/viber

Are we certain that there was absolutely no load on the destination?
Yes, the destination didn't have any loaders.

do you wait for replication to catch up and do you grab stats by any chance, this is to check if any rebalance caused this data loss?
No, I do not wait for anything between phases; the loader runs continuously. Here I'm not able to verify that the docs/stats are identical on the two clusters.

I can't confirm that this is a regression from 3.0.1 because I didn't run against any 3.0.1 build.

By the way, before this I never disabled the expiry pager on the destination cluster, and I have never seen data lost on the source or destination (except for a known and already fixed bug).


Comment by Abhinav Dangeti [ 25/Nov/14 ]
Enabling/Disabling the expiry pager on the destination cluster shouldn't have any effect on the data on the source cluster in a unidirectional XDCR scenario.
Comment by Abhinav Dangeti [ 26/Nov/14 ]
Andrei, the hard failover operation seems to be the main suspect here (as there could be data loss if there were backed up items in the replication queue).

I will need you to run this job again, and please grab the logs after every rebalance operation or phase in your test.
I would also need you to wait for replication to catch up before you trigger the failover operation to make sure that there aren't items backed up in the replication queue (to confirm if this caused the data loss), and check for the item mismatches after each of the rebalance operations. Please let me know if this is possible.
Comment by Andrei Baranouski [ 26/Nov/14 ]
"I will need you to run this job again, and please grab the logs after every rebalance operation or phase in your test.|
okay, will get logs after each operation

"I would also need you to wait for replication to catch up before you trigger the failover operation to make sure that there aren't items backed up in the replication queue (to confirm if this caused the data loss), and check for the item mismatches after each of the rebalance operations"
there are a couple of questions here:
"wait for replication to catch up before you trigger the failover": so it means that there is no data load on the server before any hard/graceful failover. I guess in that case the scenario will be very simple, and such cases should already be covered by many functional tests.
"and check for the item mismatches after each of the rebalance operations": the same question applies here, when should I stop the loader? I think it makes sense only after the rebalance.

I'm going to split each step into parts:
1) load for n hours
2) stop the loader, check items
3) start the loader, wait a little, and start any rebalance or failover operation
4) wait until the rebalance/failover is completed
5) stop the loader, check items (do you need logs if all is well after the iteration?)

let me know if this works

Comment by Abhinav Dangeti [ 26/Nov/14 ]
This would work, Andrei, but you'll need to make sure there are no items in the replication queue when you do the hard failover.
Running a load during rebalance and graceful failover operations is fine.
Get logs at the end of each phase only if you do see any item mismatches.
Thanks for the help Andrei.




[MB-12789] Per-node disk paths not intuitive to users Created: 16/Jun/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: techdebt-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Perry Krug Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
As per some training feedback, it is not immediately intuitive to users that their non-default disk paths for indexes and data are not propagated to nodes as they are added to the cluster. This is somewhat by design so that the paths can be set per-node, but users are not familiar with this.

In order of ease, I would suggest a few possible improvements:
1 - Update the documentation simply to state the best practice here: you need to set the disk paths for a node before adding it to the cluster. This is minimal effort, but it doesn't necessarily solve the problem if users don't read every line of the docs.
2 - Add a message on the "add node" screen to say that any non-default disk paths will not be propagated to the node and that they must be set prior to it being added. This helps the most common case of adding a node through the UI, but does not help the CLI or REST API paths.
3 - Expand the "add node" UI interface to allow for applying a non-default disk path when adding the node. This also only covers the UI but not the CLI or REST API.
4 - Change the behavior of ns_server to propagate these disk paths as part of the global configuration. I'm less of a fan of this since it changes the existing behavior and may make it harder to have per-node settings here, which have been useful in the past.

For #2 and #3, I would argue that improving the UI interface should be the primary goal and that we can cover the difference in the CLI and REST API through documentation on "how to programmatically manage couchbase" (which we do not yet have: http://www.couchbase.com/issues/browse/MB-8105)

Assigning to Anil for triage and prioritization.

 Comments   
Comment by Amy Kurtzman [ 26/Nov/14 ]
This one got moved to DOC in the mass migration. Moving it back to MB because it is not primarily a doc issue; the documentation change was only a suggestion for solving the stated problem.




[MB-12788] JSON versions and encodings supported by Couchbase Server need to be defined Created: 16/Jun/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: techdebt-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Critical
Reporter: Matt Ingenthron Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
While JSON is a standard, there are multiple Unicode encodings, and the definition of how to interact with these encodings has changed over time. Also, our dependencies (mochiweb, the view engine's JSON) may not actually conform to these standards.

Couchbase Server needs to define and document what it supports with respect to JSON.

See:
http://tools.ietf.org/html/draft-ietf-json-rfc4627bis-10 and
http://tools.ietf.org/html/rfc4627


 Comments   
Comment by Cihan Biyikoglu [ 16/Jun/14 ]
making this a documentation item - we should make this public.
Comment by Chiyoung Seo [ 24/Jun/14 ]
Moving this to post-3.0 as datatype support is not included in 3.0.
Comment by Matt Ingenthron [ 11/Sep/14 ]
This isn't really datatype related, though it's not couchbase-bucket anymore either. The view engine and other parts of the server use JSON; what do they expect as input? It's also sort of documentation, but not strictly documentation, since the behavior should either be defined and validated, or determined based on what our dependencies actually do and then verified. In either case, there's probably research and the writing of unit tests involved, I think.
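For illustration only, here is the kind of encoding probe such a unit test might start from, written against Go's encoding/json purely as a stand-in parser; the server's actual parsers (mochiweb, the view engine's JSON decoder) are exactly what would need to be probed and documented instead:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // Three byte sequences that all "mean" a small JSON document.
    cases := map[string][]byte{
        "utf8":     []byte(`{"a":"é"}`),
        "utf8+BOM": append([]byte{0xEF, 0xBB, 0xBF}, []byte(`{"a":1}`)...),
        "utf16le":  {0x7B, 0x00, 0x22, 0x00, 0x61, 0x00, 0x22, 0x00, 0x3A, 0x00, 0x31, 0x00, 0x7D, 0x00}, // {"a":1}
    }
    for name, in := range cases {
        // encoding/json accepts only UTF-8 without a BOM; other parsers may differ,
        // which is precisely what needs to be defined for Couchbase Server.
        fmt.Printf("%-9s accepted=%v\n", name, json.Valid(in))
    }
}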
Comment by Chiyoung Seo [ 12/Sep/14 ]
Assigning to the PM team to figure out the appropriate steps to be taken.
Comment by Amy Kurtzman [ 26/Nov/14 ]
This issue got moved to the DOC project during the mass migration of tickets. I'm moving it back to MB because there isn't anything to document yet. After the work is done to define what Couchbase Server supports, create a new DOC ticket that links to this one, supply us with the information, and then we can document it.




[MB-10292] [windows] assertion failure in test_file_sort Created: 24/Feb/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, storage-engine
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Major
Reporter: Trond Norbye Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
assertion on line 263 fails: assert(ret == FILE_SORTER_SUCCESS);

ret == FILE_SORTER_ERROR_DELETE_FILE

 Comments   
Comment by Trond Norbye [ 27/Feb/14 ]
I've disabled the test for win32 with http://review.couchbase.org/#/c/33985/ to allow us to find other regressions.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Don Pinto [ 23/Sep/14 ]
Trond, Chiyoung, any update here?
Quick question: is this a case where the unit test needs to be updated, or is there a bug in the code?

Comment by Chiyoung Seo [ 23/Sep/14 ]
We haven't looked at this Windows issue yet, but will update it soon.
Comment by Sundar Sridharan [ 03/Nov/14 ]
Blocked by CBD-1444
Comment by Anil Kumar [ 18/Nov/14 ]
Sundar - Can you provide an update on this ticket?
Comment by Sundar Sridharan [ 18/Nov/14 ]
Not tested after CBD-1444 was resolved. Will try to triage soon. Either way, I think this should not be critical, as it only affects test code. Thanks.
Comment by Anil Kumar [ 21/Nov/14 ]
Okay. Once you triage can you please update the ticket with plan.
Comment by Sundar Sridharan [ 26/Nov/14 ]
This is a test-only issue that does not reproduce on my local Windows VM. I will work on it once I have a Windows machine. Thanks.
Comment by Sundar Sridharan [ 26/Nov/14 ]
Adding more triaging details...
The error appears in src/file_sort.cc:
if (!ctx->skip_writeback && remove(ctx->source_file) != 0) {
        ret = FILE_SORTER_ERROR_DELETE_FILE;
        goto failure;
}
The issue may be related to the semantics of the remove() call on Windows 2012 onwards, especially when it hits NO_SUCH_FILE.




[MB-12769] run millis_to_str(str_to_millis(some_date)) do not give right result Created: 25/Nov/14  Updated: 26/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: query
Affects Version/s: sherlock
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
query is:
 select join_yr, join_mo, join_day, millis_to_str(str_to_millis(tostr(join_yr) || '-0' || tostr(join_mo) || '-0' || tostr(join_day))) as date from default where join_mo < 10 and join_day < 10 ORDER BY date asc

but the result, for example, of millis_to_str(str_to_millis('2010-01-01')) is 2009-12-31

query result starts as:
[{'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}, {'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}, {'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}, {'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}, {'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}, {'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}, {'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}, {'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}, {'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}, {'date': u'2009-12-31', 'join_yr': 2010, 'join_mo': 1, 'join_day': 1}

 Comments   
Comment by Gerald Sangudi [ 25/Nov/14 ]
These functions default to the local time zone.
Comment by Iryna Mironava [ 25/Nov/14 ]
How does N1QL know the zone then?
If I do str_to_millis('2010-01-01') I get some <value> according to the local zone of the machine,
but if I convert this value back with millis_to_str(<value>) I get one day less. It is the same machine and the local zone has not changed. Could you please explain what I am doing wrong?
Comment by Gerald Sangudi [ 25/Nov/14 ]
can you post the queries and results?
Comment by Ketaki Gangal [ 26/Nov/14 ]
Hi Gerald,

The above looks like buggy behaviour. Could you explain why this timezone-dependent behaviour is not classified as a bug?
Comment by Gerald Sangudi [ 26/Nov/14 ]
Hi Ketaki,

Jan 1 at 1:30am in New York is Dec 31 at 10:30pm in Los Angeles. I need to see the exact query and results to determine if it's a bug.

Thanks.
Comment by Iryna Mironava [ 26/Nov/14 ]
query is select millis_to_str(str_to_millis('2010-01-01')) as date
result is
{u'status': u'success', u'metrics': {u'elapsedTime': u'1.685069ms', u'executionTime': u'1.54437ms', u'resultSize': 59, u'resultCount': 1}, u'signature': {u'date': u'string'}, u'results': [{u'date': u'2009-12-31T16:00:00-08:00'}], u'request_id': u'7b22fbdc-1f4b-4f13-8ef5-e9aab67f19cb'}

As I do both str_to_millis and millis_to_str back to back, I expect my date to be the same as the initial '2010-01-01', but it is 2009-12-31T16:00:00-08:00.

this was a jenkins run http://qa.sc.couchbase.com/job/centos_x64--52_00--n1ql_tests-P0/21/consoleFull
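For illustration, a minimal Go sketch (not the query engine's actual code) reproduces the same shift: a date string that carries no time zone is parsed as UTC midnight, and formatting that same instant in a UTC-8 zone yields the previous day at 16:00.

package main

import (
    "fmt"
    "time"
)

func main() {
    // Parse "2010-01-01" with no zone information; it is taken as midnight UTC.
    t, err := time.Parse("2006-01-02", "2010-01-01")
    if err != nil {
        panic(err)
    }
    millis := t.UnixNano() / int64(time.Millisecond) // conceptually str_to_millis

    // Format the same instant in a UTC-8 zone (US Pacific in winter),
    // conceptually millis_to_str on a machine in that zone.
    loc := time.FixedZone("PST", -8*60*60)
    fmt.Println(time.Unix(0, millis*int64(time.Millisecond)).In(loc).Format(time.RFC3339))
    // Prints: 2009-12-31T16:00:00-08:00
}

The question raised above, then, is which zone str_to_millis and millis_to_str should assume when the input string carries none.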




[MB-8693] [Doc] distribute couchbase-server through yum and ubuntu package repositories Created: 24/Jul/13  Updated: 26/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: build, documentation
Affects Version/s: 2.1.0, 2.2.0, 2.5.0
Fix Version/s: 3.0.2
Security Level: Public

Type: Task Priority: Critical
Reporter: Anil Kumar Assignee: marija jovanovic
Resolution: Unresolved Votes: 0
Labels: install, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
depends on MB-6972 distribute couchbase-server through y... Resolved
Relates to
Flagged:
Release Note

 Description   
this helps us in handling dependencies that are needed for couchbase server
sdk team has already implemented this for various sdk packages.

we might have to make some changes to our packaging metadata to work with this schema

 Comments   
Comment by Wayne Siu [ 10/Jul/14 ]
Steps are documented in MB-6972.
Please let us know if you have any questions.
Comment by Anil Kumar [ 11/Jul/14 ]
Ruth - Have we documented this for 3.0 features - we should.
Comment by Ruth Harris [ 17/Jul/14 ]
not yet.
Comment by Wayne Siu [ 30/Jul/14 ]
Any ETA on this?
Comment by Ruth Harris [ 05/Sep/14 ]
I added MB-6972 to the list of fixed bugs in the Release Notes

The following instructions in MB-6972 are specific to CentOS 5/6 CE & EE and to yum. Also, Phil's last update said he failed to install on CentOS 6 x86.

Basically:

CentOS 5/6 community/enterprise
Log on to the CentOS machine
wget http://packages.couchbase.com/releases/couchbase-server/keys/couchbase-server-public-key

gpg --import couchbase-server-public-key
sudo wget http://packages.couchbase.com/releases/couchbase-server/yum.repos.d/<5|6>/enterprise/couchbase-server.repo --output-document=/etc/yum.repos.d/couchbase-server.repo

vi /etc/yum.repos.d/couchbase-server.repo to verify information
yum install couchbase-server

Comment by Wayne Siu [ 25/Nov/14 ]
Will the documentation be ready for the 3.0.2 release?

QE also needs to proofread and check the instructions.
Comment by Ruth Harris [ 26/Nov/14 ]
Wayne,

Could you clarify what needs to be done for this Jira ticket? It appears that additional information needs to be included in the installation section; however, it's not clear what.
1. Does this yum/repository information apply to 3.0.0, 3.0.1, and 3.0.2?
2. Should it be in the regular installation section?
3. Or should it be a bug in the release notes for 3.0.2?
4. Has anyone tested the information (in previous comment)? Or can someone provide the information that is needed?

Note: Marija is the contact person for this Jira

Thanks, Ruth




[MB-12671] Support for CLI Tools: Node Services Created: 15/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Parag Agarwal Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: all


 Description   
With Sherlock, we will have node services as an option with add node and joinCluster. This should be reflected in the CLI tools as well.

Reference

https://github.com/couchbase/ns_server/blob/master/doc/api.txt


// adds node with given hostname to given server group with specified
// set of services
//
// services field is optional and defaults to kv,moxi
POST /pools/default/serverGroups/<group-uuid>/addNode
hostname=<hostname>&user=Administrator&password=asdasd&services=kv,n1ql,moxi
// same as serverGroups addNode endpoint, but for default server group
POST /controller/addNode
hostname=<hostname>&user=Administrator&password=asdasd&services=kv,n1ql,moxi
// joins _this_ node to cluster which member is given in hostname parameter
POST /node/controller/doJoinCluster
hostname=<hostname>&user=Administrator&password=asdasd&services=kv,n1ql,moxi
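For illustration, a minimal Go sketch of exercising the quoted addNode endpoint, which is roughly what the CLI support would need to wrap. Host names and credentials are placeholders, and a real cluster may additionally require admin credentials on the request itself:

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"
)

func main() {
    // Form fields taken from the api.txt excerpt above.
    form := url.Values{}
    form.Set("hostname", "newnode.example.com:8091")
    form.Set("user", "Administrator")
    form.Set("password", "asdasd")
    form.Set("services", "kv,n1ql,moxi")

    resp, err := http.Post(
        "http://existing-node.example.com:8091/controller/addNode",
        "application/x-www-form-urlencoded",
        strings.NewReader(form.Encode()),
    )
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status)
}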


 Comments   
Comment by Gerald Sangudi [ 15/Nov/14 ]
+Cihan.

Is it possible to use the more generic terms data, index, and query for these services? In particular, the term "n1ql" is a brand, not a feature. It can be changed by marketing.
Comment by Dave Finlay [ 26/Nov/14 ]
Bin is planning to get to this next week (week beginning 12/1).




[MB-12494] [Windows] memory fragmentation ratio (mem_used / total_heap_bytes) increased from 10% (2.5.1) to 14% (3.0.2) with append based workload Created: 28/Oct/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: performance, test-execution
Affects Version/s: 3.0.2
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Major
Reporter: Venu Uppalapati Assignee: Venu Uppalapati
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Test case: kv_fragmentation_revAB_100K.test
Fragmentation % went up from 10.5 (2.5.1) to 13.5 (3.0.1).

Number of items:100K
Bucket RAM quota:20480MB
resident ratio:100%
hardware: 12 vCPU, 32 GB,HDD


links:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arjuna_301-1437_cbc_load
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arjuna_251-1083_0a7_load

logs:
3.0.1:
http://ci.sc.couchbase.com/job/arjuna/5/artifact/172.23.100.44.zip
http://ci.sc.couchbase.com/job/arjuna/5/artifact/172.23.100.45.zip
http://ci.sc.couchbase.com/job/arjuna/5/artifact/172.23.100.55.zip
http://ci.sc.couchbase.com/job/arjuna/5/artifact/172.23.100.56.zip
2.5.1:
http://ci.sc.couchbase.com/job/arjuna/6/artifact/172.23.100.44.zip
http://ci.sc.couchbase.com/job/arjuna/6/artifact/172.23.100.45.zip
http://ci.sc.couchbase.com/job/arjuna/6/artifact/172.23.100.55.zip
http://ci.sc.couchbase.com/job/arjuna/6/artifact/172.23.100.56.zip

 Comments   
Comment by Dave Rigby [ 29/Oct/14 ]
Comparison link: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arjuna_251-1083_0a7_load&snapshot=arjuna_301-1437_cbc_load

Fragmentation looks to be the least of the issues - from those two runs 3.0.1 took maybe 25% longer to run, along with big disk differences.
Comment by Chiyoung Seo [ 29/Oct/14 ]
Venu,

Per our discussion, please run the test again with the 3.0.1 (or 3.0.2) build that includes Sundar's recent change in shared thread pool.
Comment by Venu Uppalapati [ 29/Oct/14 ]
Will test this as soon as a 3.0.2 build is available.
Comment by Venu Uppalapati [ 05/Nov/14 ]
Initial testing with build 3.0.2-1503 does not show any improvements. Further testing in progress.
Comment by Chiyoung Seo [ 05/Nov/14 ]
Venu,

Please also check the following two things:

1) Any tcmalloc config changes between 2.5.1 and 3.0.2 windows builds.

2) Reduce the number of memcached worker threads in 3.0.2 to be the same as 2.5.1 (4 memcached worker threads)
Comment by Venu Uppalapati [ 06/Nov/14 ]
Reducing the number of front-end memcached workers has improved the ops per sec:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arjuna_251-1083_57a_load&snapshot=arjuna_302-1503_48a_load

Thanks Dave for sharing the new way of tuning memcached params.
Comment by Venu Uppalapati [ 06/Nov/14 ]
Summary of memory fragmentation %, workload runtimes, and threads spawned:
Run  CB Version    Runtime (s)  Frag%  Front-end threads  Back-end threads
1    2.5.1 (1083)  ~28750       10.5   default (4)        default (4)
2    2.5.1 (1083)  ~29750       10.1   default (4)        default (4)
3    3.0.1 (1437)  ~38000       13.5   default-18         default-18
4    3.0.2 (1053)  ~35500       17.4   default-18         incremental (12 total, 1 writer-single bucket case)
5    3.0.2 (1054)  ~28000       14.2   tuned-4            incremental (12 total, 1 writer-single bucket case)
Comment by Venu Uppalapati [ 06/Nov/14 ]
From the above, it is seen that the workload runtime depends on the number of threads spawned by the server. However, the fragmentation % does not seem to be influenced by the thread count.
Latest 2.5.1 vs 3.0.2 with tuned front-end threads (4): http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arjuna_251-1083_57a_load&snapshot=arjuna_302-1503_48a_load
Comment by Venu Uppalapati [ 10/Nov/14 ]
Per discussion with Chiyoung on Friday, the increased fragmentation could be caused by bigger write queue sizes triggered by MB-12576. The current plan is to re-test this when the fix for MB-12576 is merged in.
Comment by Venu Uppalapati [ 14/Nov/14 ]
After two additional runs with a reduced number of front-end threads (4), I do not see an improvement in the frag%. Assigning to Anil to decide on moving the ticket out of this release.
Comment by Anil Kumar [ 14/Nov/14 ]
Venu, can you please clarify? There are two issues here -

#1. Regression of 28% in Memory fragmentation in 3.0.2 Windows
#2. Drop in throughput Ops/Sec with Append based workload

And for both of them we're saying we don't see any improvement with the changes in MB-12576, correct?
Comment by Chiyoung Seo [ 14/Nov/14 ]
From the graph, I don't see any drop in ops/sec anymore.
Comment by Venu Uppalapati [ 14/Nov/14 ]
Further information regarding the ops/sec drop: in the runs in which the number of memcached front-end threads was tuned to 4, no drop in ops/sec is observed between 3.0.2 and 2.5.1. With the default number of front-end threads in 3.0.2, a drop in ops/sec (~15%) is observed.
Comment by Venu Uppalapati [ 14/Nov/14 ]
I have written up a summary of the investigation done for this issue below,
https://docs.google.com/document/d/19l-iDEfM1EetLCoYCjkvNCloVAUfpmIX3PCBOpCx9DU/edit?usp=sharing
Comment by Anil Kumar [ 14/Nov/14 ]
Chiyoung - Please take a look at the summary report from Venu and let us know.
Comment by Chiyoung Seo [ 15/Nov/14 ]
Venu,

Can you upload the collect info logs?

"stats.log" files in the old collect info that you uploaded are all empty.
Comment by Venu Uppalapati [ 15/Nov/14 ]
Hi Chiyoung,

Although this test cluster has four nodes, this particular test utilizes only one node (172.23.100.55). Here is the cbcollect_info from the latest run, which has the populated stats.log:
http://ci.sc.couchbase.com/job/arjuna/82/artifact/172.23.100.55.zip

Stats.log from 2.5.1 run, http://ci.sc.couchbase.com/job/arjuna/62/artifact/172.23.100.55.zip
Comment by Chiyoung Seo [ 15/Nov/14 ]
Venu,

I need "stats.log" files from both 2.5.1 and 3.0.2 runs.
Comment by Chiyoung Seo [ 15/Nov/14 ]
I looked at stats.log and found that the fragmentation overhead seems very small compared with the overall mem_used:

bucket-1

 mem_used: 7112867784
 tcmalloc_current_thread_cache_bytes: 28655832
 tcmalloc_max_thread_cache_bytes: 33554432
 total_allocated_bytes: 7187339704
 total_fragmentation_bytes: 60663368
 total_free_mapped_bytes: 732962816
 total_free_unmapped_bytes: 323887104
 total_heap_bytes: 8304852992

As you can see, the fragmentation overhead is less than 1% of mem_used. Is this test a heavy append-only case?

Comment by Chiyoung Seo [ 15/Nov/14 ]
These are the memory stats from the 2.5.1 run:

 mem_used: 7118222472
 tcmalloc_current_thread_cache_bytes: 39059512
 tcmalloc_max_thread_cache_bytes: 33554432
 tcmalloc_unmapped_bytes: 0
 total_allocated_bytes: 7176237640
 total_fragmentation_bytes: 80195000
 total_free_bytes: 662740992
 total_heap_bytes: 7919173632


As seen, the fragmentation overhead in the 2.5.1 run is 80MB, while 3.0.2 has 60MB of fragmentation overhead. I don't think this reflects a regression in the fragmentation overhead, given that the overhead from both runs is very small (less than 1% of mem_used).

If we want to compare the fragmentation overhead in a heavy-append use case between 2.5.1 and 3.0.2, I suggest you run the Viber use case that we implemented.
Comment by Venu Uppalapati [ 15/Nov/14 ]
We calculate the fragmentation ratio using mem_used and total_heap_bytes as per the following formula,
https://github.com/couchbaselabs/perfrunner/blob/master/perfrunner/tests/kv.py#L367
which in the above case computes to a value of 14.4%.

We do not use the total_fragmentation_bytes statistic in this metric calculation.
Comment by Venu Uppalapati [ 15/Nov/14 ]
The difference between total_heap_bytes and mem_used is ~1136MB in 3.0.2 and ~763MB in 2.5.1.
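For reference, assuming the metric is (total_heap_bytes - mem_used) / total_heap_bytes, which reproduces the 14.4% quoted above, a small Go calculation over the stats.log numbers pasted earlier in this ticket:

package main

import "fmt"

// fragRatio is the assumed perfrunner metric: the share of the heap
// not accounted for by mem_used.
func fragRatio(memUsed, totalHeapBytes float64) float64 {
    return (totalHeapBytes - memUsed) / totalHeapBytes
}

func main() {
    // Numbers copied from the stats.log excerpts in the comments above.
    fmt.Printf("3.0.2: %.1f%%\n", 100*fragRatio(7112867784, 8304852992)) // ~14.4%
    fmt.Printf("2.5.1: %.1f%%\n", 100*fragRatio(7118222472, 7919173632)) // ~10.1%
}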
Comment by Chiyoung Seo [ 15/Nov/14 ]
I don't think this issue should be a blocker for 3.0.2. I actually want to close this ticket, but leave it opened at this time.

Basically, this test case doesn't reflect the heavy fragmentation and both runs in 2.5.1 and 3.0.2 didn't show any major regression in the fragmentation. The reason why the heap size stat in 3.0.2 is bigger than 2.5.1 by 14.5% is that tcmalloc in 3.0.2 returned some free memory (323MB) back to the OS:

total_free_unmapped_bytes: 323887104

This amount of difference in the heap size won't cause any significant issues. As I mentioned, the actual fragmentation overhead isn't significant in either run. I recommend running the Viber use case if we want to compare the fragmentation overhead in an append use case between 2.5.1 and 3.0.2.


Comment by Anil Kumar [ 21/Nov/14 ]
Marking this as a test enhancement. As Chiyoung mentioned, can we run with the Viber workload?




[MB-12778] [Single HDD] 24% regression in BgFetch latency and 65% regression in write queue size Created: 26/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0.2
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Venu Uppalapati Assignee: Venu Uppalapati
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
bgfetch latency comparison:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_251-1083_6f2_access&snapshot=zeus_302-1521_c55_access#zeus_251-1083_6f2_accesszeus_302-1521_c55_accesszeus_251-1083_6f2zeus_302-1521_c55bucket-1avg_bg_wait_time_histo

Disk write queue comparison:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_251-1083_9b3_access&snapshot=zeus_302-1521_bd8_access#zeus_251-1083_9b3_accesszeus_302-1521_bd8_accesszeus_251-1083_9b3zeus_302-1521_bd8bucket-1disk_write_queue

This issue is not seen in the RAID 10 HDD and SSD based tests.




[MB-12787] select * shows an error right after delete statement run Created: 26/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
cbq> delete from b0 where b0.default.VMs[0].RAM = 12 limit 1;
{
    "request_id": "11092959-2992-438f-800b-fb7f757b474b",
    "signature": null,
    "results": [
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "19.081ms",
        "executionTime": "18.566ms",
        "resultCount": 0,
        "resultSize": 0
    }
}

cbq> select * from b0;
{
    "request_id": "a578af07-e35c-4ecb-af3d-cdf85ab1f6fa",
    "signature": {
        "*": "*"
    },
    "results": [
    ]
    "errors": [
        {
            "code": 5000,
            "msg": "Error doing bulk get - cause: {1 errors, starting with MCResponse status=KEY_ENOENT, opcode=GET, opaque=0, msg: Not found}"
        }
    ],
    "status": "errors",
    "metrics": {
        "elapsedTime": "15.104ms",
        "executionTime": "14.758ms",
        "resultCount": 0,
        "resultSize": 0,
        "errorCount": 1
    }
}

cbq>





[MB-12786] how json non-doc documents can be updated? Created: 26/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
How can non-doc JSON documents be updated? The SET syntax is only for JSON docs.




[MB-12784] dml update: if document is not updated success status still is shown Created: 26/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
doc is like:
{
  "default": {
    "VMs": [
      {
        "RAM": 51,
        "memory": 12,
        "name": "vm_12",
        "os": "ubuntu"
      },
      {
        "RAM": 12,
        "memory": 12,
        "name": "vm_13",
        "os": "windows"
      }
    ],}}

cbq> update b0 use keys 'tjson' set default.VMs.RAM=44;
{
    "request_id": "f25e346d-caf1-4a83-8aa5-0c16fab2271a",
    "signature": null,
    "results": [
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "3.545ms",
        "executionTime": "3.034ms",
        "resultCount": 0,
        "resultSize": 0
    }
}

cbq>

The doc is not updated because VMs is an array, but a success status is displayed.




[MB-12782] if query has both args and $<identifier> error Error evaluating projection appears Created: 26/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-alpha
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Colm Mchugh
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: 4h
Time Spent: Not Specified
Original Estimate: 4h

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
http://172.27.33.17:8093/query?statement=select%20count%28$1%29,%20$2,%20$sel%20al%20from%20default%20&$sel=%22VMs%22&args=[%22name%22,%20%22email%22]


{
    "request_id": "1f000fc1-965d-4dbe-a376-c181b2fb709c",
    "signature": {
        "$1": "number",
        "$2": "json",
        "al": "json"
    },
    "results": [
    ]
    "errors": [
        {
            "code": 5000,
            "msg": "Error evaluating projection. - cause: No value for named parameter $sel."
        }
    ],
    "status": "errors",
    "metrics": {
        "elapsedTime": "1.684612s",
        "executionTime": "1.684494s",
        "resultCount": 0,
        "resultSize": 0,
        "errorCount": 1
    }
}





[MB-12783] need a dml doc update Created: 26/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Iryna Mironava Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
https://github.com/couchbaselabs/query/blob/460336460384bfe7ef5a34c9cacb625738f58c0e/docs/n1ql-dml.md

The KEYS clause is now USE KEYS; the page contains the old syntax.




[MB-12780] if arg for rest api is an array error that there is no $ARG appears Created: 26/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-alpha
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Colm Mchugh
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: 4h
Time Spent: Not Specified
Original Estimate: 4h

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
http://172.27.33.17:8093/query?statement=select%20count%28name%29%20cn%20from%20default%20d%20use%20keys%20$keys1&$keys1=[%27query-test0014744-0%27,%20%27query-test0014744-1%27]

{
    "request_id": "5859c427-0f70-406d-bca2-fe618299ed93",
    "signature": {
        "cn": "number"
    },
    "results": [
        {
            "cn": 0
        }
    ]
    "errors": [
        {
            "code": 5000,
            "msg": "Error evaluating KEYS. - cause: No value for named parameter $keys1."
        }
    ],
    "status": "errors",
    "metrics": {
        "elapsedTime": "3.602ms",
        "executionTime": "3.44ms",
        "resultCount": 1,
        "resultSize": 31,
        "errorCount": 1
    }
}





[MB-12779] error 'starting with MCResponse status=KEY_ENOENT' when query containing key using rest api with args Created: 26/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-alpha
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Colm Mchugh
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: 4h
Time Spent: Not Specified
Original Estimate: 4h

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
http://172.27.33.17:8093/query?statement=select%20count%28name%29%20cn%20from%20default%20d%20use%20keys%20[%27$keys1%27]&$keys1=%22query-test0014744-0%22

{
    "request_id": "39f794aa-26e8-4a56-8ea3-b1b547f0ffbd",
    "signature": {
        "cn": "number"
    },
    "results": [
        {
            "cn": 0
        }
    ]
    "errors": [
        {
            "code": 5000,
            "msg": "Error doing bulk get - cause: {1 errors, starting with MCResponse status=KEY_ENOENT, opcode=GET, opaque=0, msg: Not found}"
        }
    ],
    "status": "errors",
    "metrics": {
        "elapsedTime": "3.224ms",
        "executionTime": "2.803ms",
        "resultCount": 1,
        "resultSize": 31,
        "errorCount": 1
    }
}


The query without the arg runs OK:
cbq> select count(name) cn from default d use keys ['query-test0014744-0'];
{
    "request_id": "21c0e7b8-4cfa-461a-87f8-776c850f04a8",
    "signature": {
        "cn": "number"
    },
    "results": [
        {
            "cn": 1
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "2.809ms",
        "executionTime": "2.684ms",
        "resultCount": 1,
        "resultSize": 31
    }
}

cbq>





[MB-12777] Race condition in http_request/response.go Created: 26/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-alpha
Security Level: Public

Type: Bug Priority: Major
Reporter: Manik Taneja Assignee: Colm Mchugh
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: 8h
Time Spent: Not Specified
Original Estimate: 8h

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Run cbq-engine with the -race flag

1. go run -race main.go -datastore=http://localhost:9000
2. run a simple query select 1 + 1;

The Go race detector outputs the following:

==================
WARNING: DATA RACE
Write by goroutine 527:
  github.com/couchbaselabs/query/server.(*BaseRequest).Stop()
      /Users/manik/tuqtng/src/github.com/couchbaselabs/query/server/request.go:323 +0x131
  github.com/couchbaselabs/query/server/http.func·001()
      /Users/manik/tuqtng/src/github.com/couchbaselabs/query/server/http/http_request.go:174 +0x9b

Previous write by goroutine 533:
  github.com/couchbaselabs/query/server/http.(*httpRequest).writeResults()
      /Users/manik/tuqtng/src/github.com/couchbaselabs/query/server/http/http_response.go:127 +0x2b0
  github.com/couchbaselabs/query/server/http.(*httpRequest).Execute()
      /Users/manik/tuqtng/src/github.com/couchbaselabs/query/server/http/http_response.go:69 +0x142

Goroutine 527 (running) created at:
  github.com/couchbaselabs/query/server/http.newHttpRequest()
      /Users/manik/tuqtng/src/github.com/couchbaselabs/query/server/http/http_request.go:175 +0x12f6
  github.com/couchbaselabs/query/server/http.(*HttpEndpoint).ServeHTTP()
      /Users/manik/tuqtng/src/github.com/couchbaselabs/query/server/http/http_endpoint.go:51 +0x5c
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/go/src/pkg/net/http/server.go:1511 +0x21c
  net/http.serverHandler.ServeHTTP()
      /usr/local/go/src/pkg/net/http/server.go:1673 +0x1fc
  net/http.(*conn).serve()
      /usr/local/go/src/pkg/net/http/server.go:1174 +0xf9e

Goroutine 533 (finished) created at:
  github.com/couchbaselabs/query/server.(*Server).serviceRequest()
      /Users/manik/tuqtng/src/github.com/couchbaselabs/query/server/server.go:163 +0x618
  github.com/couchbaselabs/query/server.(*Server).doServe()
      /Users/manik/tuqtng/src/github.com/couchbaselabs/query/server/server.go:108 +0xb4
==================
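For illustration, a minimal Go sketch of the usual remedy for this kind of report (not the query engine's actual code or fix): serialize the writes that Stop() and writeResults() make to the shared request state behind a mutex.

package main

import (
    "fmt"
    "sync"
)

// request is a stand-in for the shared state that both BaseRequest.Stop()
// and httpRequest.writeResults() write in the race report above.
type request struct {
    mu    sync.Mutex
    state string
}

func (r *request) Stop() {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.state = "stopped"
}

func (r *request) writeResults() {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.state = "completed"
}

func main() {
    r := &request{}
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); r.Stop() }()         // analogue of the goroutine from newHttpRequest
    go func() { defer wg.Done(); r.writeResults() }() // analogue of the goroutine from serviceRequest
    wg.Wait()
    fmt.Println(r.state) // final value depends on ordering, but there is no data race under -race
}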




[MB-12696] Couchbase Version: 3.0.0 Enterprise Edition (build-1209) Cluster State ID: 03B-020-218 Node Going Down after evry second day Created: 18/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ashwini Ahire Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: Down, Node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ram - 60 GB , Core 8 Core on each node .
CLuster with 3 Nodes. Harddisk - ssd 1TB

Attachments: File XDCR_Seeting.odt    
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
A Couchbase 3.0 node goes down every weekend.
Version: 3.0.0 Enterprise Edition (build-1209)
Cluster State ID: 03B-020-217

Please see the logs below.
Please let me know how to avoid this failover.

Event Module Code Server Node Time
Remote cluster reference "Virginia_to_OregonS" updated. New name is "VirginiaM_to_OregonS". menelaus_web_remote_clusters000 ns_1ec2-####104.compute-1.amazonaws.com 12:46:38 - Mon Nov 17, 2014
Client-side error-report for user undefined on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com':
User-Agent:Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0
Got unhandled error:
Script error.
At:
http://ph.couchbase.net/v2?callback=jQuery162012552191850461902_1416204362614&launchID=8eba0b18a4e965daf1c3a0baecec994c-1416208180553-3638&version=3.0.0-1209-rel-enterprise&_=1416208180556:0:0
Backtrace:
<generated>
generateStacktrace@http://ec2-####108 -.compute-1.amazonaws.com:8091/js/bugsnag.js:411:7
bugsnag@http://ec2-####108 -.compute-1.amazonaws.com:8091/js/bugsnag.js:555:13

    menelaus_web102 ns_1@ec2-####108 -.compute-1.amazonaws.com 12:45:56 - Mon Nov 17, 2014
Replication from bucket "apro" to bucket "apro" on cluster "Virginia_to_OregonS" created. menelaus_web_xdc_replications000 ns_1@ec2-####108 -.compute-1.amazonaws.com 12:38:49 - Mon Nov 17, 2014
Replication from bucket "apro" to bucket "apro" on cluster "Virginia_to_OregonS" removed. xdc_rdoc_replication_srv000 ns_1@ec2-####108 -.compute-1.amazonaws.com 12:38:40 - Mon Nov 17, 2014
Rebalance completed successfully.
    ns_orchestrator001 ns_1@ec2-####107.compute-1.amazonaws.com 11:53:17 - Mon Nov 17, 2014
Bucket "ifa" rebalance does not seem to be swap rebalance ns_vbucket_mover000 ns_1@ec2-####107.compute-1.amazonaws.com 11:53:04 - Mon Nov 17, 2014
Started rebalancing bucket ifa ns_rebalancer000 ns_1@ec2-####107.compute-1.amazonaws.com 11:53:02 - Mon Nov 17, 2014
Could not automatically fail over node ('ns_1@ec2-####108 -.compute-1.amazonaws.com'). Rebalance is running. auto_failover001 ns_1@ec2-####107.compute-1.amazonaws.com 11:49:58 - Mon Nov 17, 2014
Bucket "apro" rebalance does not seem to be swap rebalance ns_vbucket_mover000 ns_1@ec2-####107.compute-1.amazonaws.com 11:48:02 - Mon Nov 17, 2014
Started rebalancing bucket apro ns_rebalancer000 ns_1@ec2-####107.compute-1.amazonaws.com 11:47:59 - Mon Nov 17, 2014
Bucket "apro" loaded on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com' in 366 seconds. ns_memcached000 ns_1@ec2-####108 -.compute-1.amazonaws.com 11:47:58 - Mon Nov 17, 2014
Bucket "ifa" loaded on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com' in 96 seconds. ns_memcached000 ns_1@ec2-####108 -.compute-1.amazonaws.com 11:43:29 - Mon Nov 17, 2014
Starting rebalance, KeepNodes = ['ns_1ec2-####104.compute-1.amazonaws.com',
'ns_1@ec2-####107.compute-1.amazonaws.com',
'ns_1@ec2-####108 -.compute-1.amazonaws.com'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@ec2-####108 -.compute-1.amazonaws.com'], Delta recovery buckets = all ns_orchestrator004 ns_1@ec2-####107.compute-1.amazonaws.com 11:41:52 - Mon Nov 17, 2014
Control connection to memcached on 'ns_1@ec2-####108 -.compute-1.amazonaws.com' disconnected: {badmatch,
{error,
closed}} ns_memcached000 ns_1@ec2-####108 -.compute-1.amazonaws.com 21:19:54 - Sun Nov 16, 2014
Node ('ns_1@ec2-####108 -.compute-1.amazonaws.com') was automatically failovered.
[stale,
{last_heard,{1416,152978,82869}},
{stale_slow_status,{1416,152863,60088}},
{now,{1416,152968,80503}},
{active_buckets,["apro","ifa"]},
{ready_buckets,["ifa"]},
{status_latency,5743},
{outgoing_replications_safeness_level,[{"apro",green},{"ifa",green}]},
{incoming_replications_conf_hashes,
[{"apro",
[{'ns_1ec2-####104.compute-1.amazonaws.com',126796989},
{'ns_1@ec2-####107.compute-1.amazonaws.com',41498822}]},
{"ifa",
[{'ns_1ec2-####104.compute-1.amazonaws.com',126796989},
{'ns_1@ec2-####107.compute-1.amazonaws.com',41498822}]}]},
{local_tasks,
[[{type,xdcr},
{id,<<"949dcce68db4b6d1add4c033ec4e32a9/apro/apro">>},
{errors,
[<<"2014-11-16 19:35:03 [Vb Rep] Error replicating vbucket 201. Please see logs for details.">>]},
{changes_left,220},
{docs_checked,51951817},
{docs_written,51951817},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,210},
{time_working,1040792.401734},
{time_committing,0.0},
{time_working_rate,0.9101340661254117},
{num_checkpoints,53490},
{num_failedckpts,1},
{wakeups_rate,11.007892659036528},
{worker_batches_rate,20.514709046386255},
{rate_replication,22.015785318073057},
{bandwidth_usage,880.6314127229223},
{rate_doc_checks,22.015785318073057},
{rate_doc_opt_repd,22.015785318073057},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1271.0828664152195},
{docs_latency_wt,20.514709046386255}],
[{type,xdcr},
{id,<<"fc72b1b0e571e9c57671d6621cac6058/apro/apro">>},
{errors,[]},
{changes_left,278},
{docs_checked,51217335},
{docs_written,51217335},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,269},
{time_working,1124595.930738},
{time_committing,0.0},
{time_working_rate,1.019751359238166},
{num_checkpoints,54571},
{num_failedckpts,3},
{wakeups_rate,6.50472893793788},
{worker_batches_rate,16.51200422707308},
{rate_replication,23.01673316501096},
{bandwidth_usage,936.6809670630547},
{rate_doc_checks,23.01673316501096},
{rate_doc_opt_repd,23.01673316501096},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1500.9621995190503},
{docs_latency_wt,16.51200422707308}],
[{type,xdcr},
{id,<<"16b1afb33dbcbde3d075e2ff634d9cc0/apro/apro">>},
{errors,
[<<"2014-11-16 19:21:55 [Vb Rep] Error replicating vbucket 258. Please see logs for details.">>,
<<"2014-11-16 19:22:41 [Vb Rep] Error replicating vbucket 219. Please see logs for details.">>,
<<"2014-11-16 19:23:04 [Vb Rep] Error replicating vbucket 315. Please see logs for details.">>,
<<"2014-11-16 20:06:40 [Vb Rep] Error replicating vbucket 643. Please see logs for details.">>,
<<"2014-11-16 20:38:20 [Vb Rep] Error replicating vbucket 651. Please see logs for details.">>]},
{changes_left,0},
{docs_checked,56060297},
{docs_written,56060297},
{active_vbreps,0},
{max_vbreps,4},
{waiting_vbreps,0},
{time_working,140073.119377},
{time_committing,0.0},
{time_working_rate,0.04649055712180432},
{num_checkpoints,103504},
{num_failedckpts,237},
{wakeups_rate,21.524796565643623},
{worker_batches_rate,22.52594989427821},
{rate_replication,22.52594989427821},
{bandwidth_usage,913.0518357147434},
{rate_doc_checks,22.52594989427821},
{rate_doc_opt_repd,22.52594989427821},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,13.732319632216313},
{docs_latency_wt,22.52594989427821}],
[{type,xdcr},
{id,<<"b734095ad63ea9832f9da1b1ef3449ac/apro/apro">>},
{errors,
[<<"2014-11-16 19:36:22 [Vb Rep] Error replicating vbucket 260. Please see logs for details.">>,
<<"2014-11-16 19:36:38 [Vb Rep] Error replicating vbucket 299. Please see logs for details.">>,
<<"2014-11-16 19:36:43 [Vb Rep] Error replicating vbucket 205. Please see logs for details.">>,
<<"2014-11-16 19:36:48 [Vb Rep] Error replicating vbucket 227. Please see logs for details.">>,
<<"2014-11-16 20:26:19 [Vb Rep] Error replicating vbucket 175. Please see logs for details.">>,
<<"2014-11-16 20:26:25 [Vb Rep] Error replicating vbucket 221. Please see logs for details.">>,
<<"2014-11-16 21:16:40 [Vb Rep] Error replicating vbucket 293. Please see logs for details.">>,
<<"2014-11-16 21:16:40 [Vb Rep] Error replicating vbucket 251. Please see logs for details.">>,
<<"2014-11-16 21:17:06 [Vb Rep] Error replicating vbucket 270. Please see logs for details.">>]},
{changes_left,270},
{docs_checked,50418639},
{docs_written,50418639},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,261},
{time_working,1860159.788732},
{time_committing,0.0},
{time_working_rate,1.008940755729142},
{num_checkpoints,103426},
{num_failedckpts,87},
{wakeups_rate,6.50782891818858},
{worker_batches_rate,16.01927118323343},
{rate_replication,23.027702325898055},
{bandwidth_usage,933.1225464233472},
{rate_doc_checks,23.027702325898055},
{rate_doc_opt_repd,23.027702325898055},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1367.9901922012182},
{docs_latency_wt,16.01927118323343}],
[{type,xdcr},
{id,<<"e213600feb7ec1dfa0537173ad7f2e02/apro/apro">>},
{errors,
[<<"2014-11-16 20:16:39 [Vb Rep] Error replicating vbucket 647. Please see logs for details.">>,
<<"2014-11-16 20:17:31 [Vb Rep] Error replicating vbucket 619. Please see logs for details.">>]},
{changes_left,854},
{docs_checked,33371659},
{docs_written,33371659},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,318},
{time_working,2421539.8537169998},
{time_committing,0.0},
{time_working_rate,1.7382361098734072},
{num_checkpoints,102421},
{num_failedckpts,85},
{wakeups_rate,3.0038659755104824},
{worker_batches_rate,7.009020609524459},
{rate_replication,30.539304084356573},
{bandwidth_usage,1261.6237097144026},
{rate_doc_checks,30.539304084356573},
{rate_doc_opt_repd,30.539304084356573},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1997.2249284829577},
{docs_latency_wt,7.009020609524459}]]},
{memory,
[{total,752400928},
{processes,375623512},
{processes_used,371957960},
{system,376777416},
{atom,594537},
{atom_used,591741},
{binary,94783616},
{code,15355960},
{ets,175831736}]},
{system_memory_data,
[{system_total_memory,64552329216},
{free_swap,0},
{total_swap,0},
{cached_memory,27011342336},
{buffered_memory,4885585920},
{free_memory,12694065152},
{total_memory,64552329216}]},
{node_storage_conf,
[{db_path,"/data/couchbase"},{index_path,"/data/couchbase"}]},
{statistics,
[{wall_clock,{552959103,4997}},
{context_switches,{8592101014,0}},
{garbage_collection,{2034857586,5985868018204,0}},
{io,{{input,270347194989},{output,799175854069}}},
{reductions,{833510054494,7038093}},
{run_queue,0},
{runtime,{553128340,5090}},
{run_queues,{0,0,0,0,0,0,0,0}}]},
{system_stats,
[{cpu_utilization_rate,2.5316455696202533},
{swap_total,0},
{swap_used,0},
{mem_total,64552329216},
{mem_free,44590993408}]},
{interesting_stats,
[{cmd_get,0.0},
{couch_docs_actual_disk_size,21729991305},
{couch_docs_data_size,11673379153},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,30268090},
{curr_items_tot,60625521},
{ep_bg_fetched,0.0},
{get_hits,0.0},
{mem_used,11032659776},
{ops,116.0},
{vb_replica_curr_items,30357431}]},
{per_bucket_interesting_stats,
[{"ifa",
[{cmd_get,0.0},
{couch_docs_actual_disk_size,611617800},
{couch_docs_data_size,349385716},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,1020349},
{curr_items_tot,2039753},
{ep_bg_fetched,0.0},
{get_hits,0.0},
{mem_used,307268040},
{ops,0.0},
{vb_replica_curr_items,1019404}]},
{"apro",
[{cmd_get,0.0},
{couch_docs_actual_disk_size,21118373505},
{couch_docs_data_size,11323993437},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,29247741},
{curr_items_tot,58585768},
{ep_bg_fetched,0.0},
{get_hits,0.0},
{mem_used,10725391736},
{ops,116.0},
{vb_replica_curr_items,29338027}]}]},
{processes_stats,
[{<<"proc/(main)beam.smp/cpu_utilization">>,0},
{<<"proc/(main)beam.smp/major_faults">>,0},
{<<"proc/(main)beam.smp/major_faults_raw">>,0},
{<<"proc/(main)beam.smp/mem_resident">>,943411200},
{<<"proc/(main)beam.smp/mem_share">>,6901760},
{<<"proc/(main)beam.smp/mem_size">>,2951794688},
{<<"proc/(main)beam.smp/minor_faults">>,0},
{<<"proc/(main)beam.smp/minor_faults_raw">>,456714435},
{<<"proc/(main)beam.smp/page_faults">>,0},
{<<"proc/(main)beam.smp/page_faults_raw">>,456714435},
{<<"proc/beam.smp/cpu_utilization">>,0},
{<<"proc/beam.smp/major_faults">>,0},
{<<"proc/beam.smp/major_faults_raw">>,0},
{<<"proc/beam.smp/mem_resident">>,108077056},
{<<"proc/beam.smp/mem_share">>,2973696},
{<<"proc/beam.smp/mem_size">>,1113272320},
{<<"proc/beam.smp/minor_faults">>,0},
{<<"proc/beam.smp/minor_faults_raw">>,6583},
{<<"proc/beam.smp/page_faults">>,0},
{<<"proc/beam.smp/page_faults_raw">>,6583},
{<<"proc/memcached/cpu_utilization">>,0},
{<<"proc/memcached/major_faults">>,0},
{<<"proc/memcached/major_faults_raw">>,0},
{<<"proc/memcached/mem_resident">>,17016668160},
{<<"proc/memcached/mem_share">>,6885376},
{<<"proc/memcached/mem_size">>,17812746240},
{<<"proc/memcached/minor_faults">>,0},
{<<"proc/memcached/minor_faults_raw">>,4385001},
{<<"proc/memcached/page_faults">>,0},
{<<"proc/memcached/page_faults_raw">>,4385001}]},
{cluster_compatibility_version,196608},
{version,
[{lhttpc,"1.3.0"},
{os_mon,"2.2.14"},
{public_key,"0.21"},
{asn1,"2.0.4"},
{couch,"2.1.1r-432-gc2af28d"},
{kernel,"2.16.4"},
{syntax_tools,"1.6.13"},
{xmerl,"1.3.6"},
{ale,"3.0.0-1209-rel-enterprise"},
{couch_set_view,"2.1.1r-432-gc2af28d"},
{compiler,"4.9.4"},
{inets,"5.9.8"},
{mapreduce,"1.0.0"},
{couch_index_merger,"2.1.1r-432-gc2af28d"},
{ns_server,"3.0.0-1209-rel-enterprise"},
{oauth,"7d85d3ef"},
{crypto,"3.2"},
{ssl,"5.3.3"},
{sasl,"2.3.4"},
{couch_view_parser,"1.0.0"},
{mochiweb,"2.4.2"},
{stdlib,"1.19.4"}]},
{supported_compat_version,[3,0]},
{advertised_version,[3,0,0]},
{system_arch,"x86_64-unknown-linux-gnu"},
{wall_clock,552959},
{memory_data,{64552329216,51966836736,{<13661.389.0>,147853368}}},
{disk_data,
[{"/",10309828,38},
{"/dev/shm",31519692,0},
{"/mnt",154817516,1},
{"/data",1056894132,3}]},
{meminfo,
<<"MemTotal: 63039384 kB\nMemFree: 12396548 kB\nBuffers: 4771080 kB\nCached: 26378264 kB\nSwapCached: 0 kB\nActive: 31481704 kB\nInactive: 17446048 kB\nActive(anon): 17750620 kB\nInactive(anon): 2732 kB\nActive(file): 13731084 kB\nInactive(file): 17443316 kB\nUnevictable: 0 kB\nMlocked: 0 kB\nSwapTotal: 0 kB\nSwapFree: 0 kB\nDirty: 13312 kB\nWriteback: 0 kB\nAnonPages: 17753376 kB\nMapped: 14516 kB\nShmem: 148 kB\nSlab: 1297976 kB\nSReclaimable: 1219296 kB\nSUnreclaim: 78680 kB\nKernelStack: 2464 kB\nPageTables: 39308 kB\nNFS_Unstable: 0 kB\nBounce: 0 kB\nWritebackTmp: 0 kB\nCommitLimit: 31519692 kB\nCommitted_AS: 19222984 kB\nVmallocTotal: 34359738367 kB\nVmallocUsed: 114220 kB\nVmallocChunk: 34359618888 kB\nHardwareCorrupted: 0 kB\nAnonHugePages: 17432576 kB\nHugePages_Total: 0\nHugePages_Free: 0\nHugePages_Rsvd: 0\nHugePages_Surp: 0\nHugepagesize: 2048 kB\nDirectMap4k: 6144 kB\nDirectMap2M: 63993856 kB\n">>}] auto_failover001 ns_1@ec2-####107.compute-1.amazonaws.com 21:19:53 - Sun Nov 16, 2014
Failed over 'ns_1@ec2-####108 -.compute-1.amazonaws.com': ok ns_rebalancer000 ns_1@ec2-####107.compute-1.amazonaws.com 21:19:53 - Sun Nov 16, 2014
Skipped vbucket activations and replication topology changes because not all remaining node were found to have healthy bucket "ifa": ['ns_1@ec2-####107.compute-1.amazonaws.com'] ns_rebalancer000 ns_1@ec2-####107.compute-1.amazonaws.com 21:19:53 - Sun Nov 16, 2014
Shutting down bucket "ifa" on 'ns_1@ec2-####108 -.compute-1.amazonaws.com' for deletion ns_memcached000 ns_1@ec2-####108 -.compute-1.amazonaws.com 21:19:49 - Sun Nov 16, 2014
Starting failing over 'ns_1@ec2-####108 -.compute-1.amazonaws.com' ns_rebalancer000 ns_1@ec2-####107.compute-1.amazonaws.com 21:19:48 - Sun Nov 16, 2014
Bucket "apro" loaded on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com' in 0 seconds. ns_memcached000 ns_1@ec2-####108 -.compute-1.amazonaws.com 21:19:44 - Sun Nov 16, 2014
Control connection to memcached on 'ns_1@ec2-####108 -.compute-1.amazonaws.com' disconnected: {{badmatch,
{error,
timeout}},
[{mc_client_binary,
cmd_vocal_recv,
5,
[{file,
"src/mc_client_binary.erl"},
{line,
151}]},
{mc_client_binary,
select_bucket,
2,
[{file,
"src/mc_client_binary.erl"},
{line,
346}]},
{ns_memcached,
ensure_bucket,
2,
[{file,
"src/ns_memcached.erl"},
{line,
1269}]},
{ns_memcached,
handle_info,
2,
[{file,
"src/ns_memcached.erl"},
{line,
744}]},
{gen_server,
handle_msg,
5,
[{file,
"gen_server.erl"},
{line,
604}]},
{ns_memcached,
init,
1,
[{file,
"src/ns_memcached.erl"},
{line,
171}]},
{gen_server,
init_it,
6,
[{file,
"gen_server.erl"},
{line,
304}]},
{proc_lib,
init_p_do_apply,
3,
[{file,
"proc_lib.erl"},
{line,
239}]}]} ns_memcached000 ns_1@ec2-####108 -.compute-1.amazonaws.com 21:19:44 - Sun Nov 16, 2014

 Comments   
Comment by Aleksey Kondratenko [ 18/Nov/14 ]
I cannot help without logs
Comment by Anil Kumar [ 18/Nov/14 ]
Ashwini - Thanks for reporting the issue. Please run the cbcollect_info (http://docs.couchbase.com/admin/admin/CLI/cbcollect_info_tool.html) to gather the logs and attach it to ticket so we can investigate the issue. Thanks
Comment by Anil Kumar [ 21/Nov/14 ]
Ashwini - We would need the complete log file to investigate. Can you please collect the logs and attach them to the ticket? If you're concerned about sensitive data, you can contact support@couchbase.com so we can provide you with a secure storage path to upload the file.
Comment by Abhishek Singh [ 24/Nov/14 ]
Ashwini - You could use the "auto-cbcollect_info" utility under the Logs tab in the Admin UI to upload logs. You should leave the "Upload to host:" option as the default "s3.amazonaws.com/cb-customers". More info is available here: http://www.couchbase.com/wiki/display/couchbase/Working+with+the+Couchbase+Technical+Support+Team
Comment by Ashwini Ahire [ 24/Nov/14 ]
logs uploaded at location :

https://s3.amazonaws.com/customers.couchbase.com/vserv/collectinfo-2014-11-21T065754-ns_1@ec2-54-163-248-107.compute-1.amazonaws.com.zip
Comment by Aleksey Kondratenko [ 24/Nov/14 ]
Inspected the recent autofailover of node .107. Conveniently, the same node was also the master node.

The autofailover appears to have been caused by timeouts speaking to memcached, so Erlang on node .107 indeed considered memcached to be "down".

It's not 100% clear why those timeouts occurred. One very likely possibility is that your box still has the transparent huge pages (THP) feature enabled (it's on by default on RHEL-family distros).

At least I'm not seeing anything more specific in the logs. We have seen, both in our testing and at customer sites, that enabled transparent huge pages are a frequent source of false-positive autofailovers.

I'm seeing quite high system load caused by XDCR, which might not be what you want. Consider looking at the anticipatory XDCR delay setting to lower the system load.

If my advice on disabling transparent huge pages ends up not being enough (which is unlikely), the next things to do are:

* get us fresh logs

* apply anticipatory xdcr delay
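For reference, a minimal Python sketch (an illustration only, not a Couchbase tool) that reads the standard Linux sysfs/procfs locations and reports the active THP setting and the THP counters from /proc/vmstat, warning when THP is not set to 'never'. Run it on each node.

# Minimal sketch (illustration only, not a Couchbase tool): report the active
# transparent huge pages (THP) setting and the THP counters from /proc/vmstat
# on a Linux node, and warn when THP is not set to 'never'.

def thp_setting(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    # The file looks like "[always] madvise never"; brackets mark the active value.
    with open(path) as f:
        text = f.read()
    return text[text.index("[") + 1:text.index("]")]

def thp_counters(path="/proc/vmstat"):
    counters = {}
    with open(path) as f:
        for line in f:
            name, _, value = line.partition(" ")
            if name.startswith("thp_") or name == "nr_anon_transparent_hugepages":
                counters[name] = int(value)
    return counters

if __name__ == "__main__":
    setting = thp_setting()
    print("THP setting:", setting)
    for name, value in sorted(thp_counters().items()):
        print(name, value)
    if setting != "never":
        print("WARNING: THP is enabled; it should be set to 'never' on Couchbase nodes.")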
Comment by Ashwini Ahire [ 25/Nov/14 ]
The node went down again today ... please see the fresh logs.

https://s3.amazonaws.com/cb-customers/Vserv/770/collectinfo-2014-11-25T053703-ns_1%40ec2-54-163-248-107.compute-1.amazonaws.com.zip

Comment by Ashwini Ahire [ 25/Nov/14 ]
Transparent huge pages are disabled on the system.

uname -a
Linux ip-****-196 2.6.32-431.11.2.el6.centos.plus.x86_64 #1 SMP Tue Mar 25 21:36:54 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

This cluster is the master cluster, with 5 other slave clusters connected to the same master.

We are using 60 GB of RAM on the master cluster, with 1 TB SSD and 8-core CPUs (3 nodes in the master cluster).


Comment by Ashwini Ahire [ 25/Nov/14 ]
Please find the current XDCR settings attached.
Comment by Aleksey Kondratenko [ 25/Nov/14 ]
My statement above on THP is based on the following evidence from your collectinfos:

1. the following /proc/vmstat counters:
nr_anon_transparent_hugepages 5787
thp_fault_alloc 5323739
thp_fault_fallback 811681
thp_collapse_alloc 92414
thp_collapse_alloc_failed 7172
thp_split 1601704

2. sysfs settings:

Transparent Huge Pages data
cat /sys/kernel/mm/transparent_hugepage/enabled
==============================================================================
[always] madvise never

3. A number of VMAs with actual huge pages. E.g. (tiny subset):
  18096:AnonHugePages: 0 kB
  18110:AnonHugePages: 8192 kB
  18124:AnonHugePages: 6144 kB
  18138:AnonHugePages: 0 kB

Will inspect second collectinfo right now.
Comment by Aleksey Kondratenko [ 25/Nov/14 ]
Seeing the same symptoms. Memcached requests start timing out. It doesn't look like memcached is dead; it's just unhealthy, possibly due to THP.

I'm also seeing an "ocean" of unexpected messages like this:

[ns_server:debug,2014-11-25T10:55:26.885,ns_1@ec2-54-163-248-107.compute-1.amazonaws.com:<0.29002.3029>:xdcr_dcp_streamer:do_start:189]Got stream end without snapshot marker
[ns_server:debug,2014-11-25T10:55:26.886,ns_1@ec2-54-163-248-107.compute-1.amazonaws.com:<0.5187.3034>:xdcr_dcp_streamer:do_start:189]Got stream end without snapshot marker

That might point our DCP folks to some other possible reason.
Comment by Aleksey Kondratenko [ 25/Nov/14 ]
Passing to Mike in case he wants to understand those numerous "got stream end without snapshot marker" situations.

In the first log I also saw a bunch of them shortly prior to the autofailover.
Comment by Aleksey Kondratenko [ 25/Nov/14 ]
But regardless of the true cause of these issues, it is _strongly_ advised to disable THP on all your Couchbase systems.
Comment by Ashwini Ahire [ 26/Nov/14 ]
Thanks Aleksey for your reply. THP is enabled on all our Couchbase servers.

I will disable it on the master servers first and do testing; if it's working fine then I will make the same change on the slave servers also.
I request you to check my XDCR settings; please suggest if they are creating any problem. (PFA document)

FYI - Monitoring is required

Thanks again !!

Comment by Ashwini Ahire [ 26/Nov/14 ]
We have disabled THP:

cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]




[MB-12761] The server doesn't support ASCII protocol unless moxi is running Created: 25/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: sherlock
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Trond Norbye Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Couchbase_Console__3_0_0-1156_.png    

 Description   
Running Couchbase Server as of 24 November, moxi isn't enabled by default, but if you try to modify a bucket you're shown the attached screenshot. The following things are wrong:

1) ASCII protocol is _only_ supported through moxi
2) The preferred port for our system is 11210, which is the data port and supports only the binary protocol (see the sketch after this list). Using 11211 will cause an extra network hop between moxi and memcached.
3) A dedicated port is only supported through moxi
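To illustrate point 2, here is a minimal sketch (not part of the product; the host and default data port are assumptions for a single-node install, and no authentication is performed) that sends a binary-protocol NOOP directly to port 11210 and checks the response. This works only because 11210 speaks the binary memcached protocol; an ASCII client would have to go through moxi on 11211.

# Minimal sketch: send a memcached binary-protocol NOOP to the data port and
# check the response. Host/port are assumptions for a default single-node
# install; no authentication is performed.
import socket
import struct

HEADER = ">BBHBBHIIQ"  # magic, opcode, key len, extras len, data type,
                       # vbucket (status in responses), total body len, opaque, CAS

def binary_noop(host="127.0.0.1", port=11210):
    request = struct.pack(HEADER, 0x80, 0x0A, 0, 0, 0, 0, 0, 0, 0)  # 0x0A = NOOP
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
        response = b""
        while len(response) < 24:
            chunk = sock.recv(24 - len(response))
            if not chunk:
                raise ConnectionError("connection closed before full response")
            response += chunk
    magic, opcode, _, _, _, status, _, _, _ = struct.unpack(HEADER, response)
    return magic == 0x81 and opcode == 0x0A and status == 0

if __name__ == "__main__":
    print("binary NOOP ok:", binary_noop())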




 Comments   
Comment by Matt Ingenthron [ 25/Nov/14 ]
I believe I filed a bug on this previously. In that bug, I mentioned the UI indicates moxi and SASL auth are mutually exclusive, but that's not our implementation IIRC. Perhaps going into Sherlock we should consider moxi like all of the other "services" at the UI/configuration level?
Comment by Trond Norbye [ 26/Nov/14 ]
Matt: In Sherlock, moxi is a service you can enable during the configuration phase, and it's off by default (YAY!!!!)




[MB-12430] Prepared statements Created: 23/Oct/14  Updated: 25/Nov/14  Due: 08/Dec/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-alpha, cbq-beta
Fix Version/s: cbq-alpha
Security Level: Public

Type: Epic Priority: Major
Reporter: Gerald Sangudi Assignee: Colm Mchugh
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: 32h
Time Spent: Not Specified
Original Estimate: 32h

Epic Name: Prepared statements
Epic Status: To Do

 Description   
Prepared statements enable queries to be executed repeatedly without incurring the cost of parsing and planning.

Design: http://goo.gl/T8l7nd
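As a rough sketch of the prepare-once / execute-many flow this epic describes: the endpoint, port, parameter names, placeholder syntax, and bucket name below are assumptions modeled on the N1QL query service REST API and may not match what the design doc specifies.

# Rough sketch of the prepare-once / execute-many flow this epic describes.
# The endpoint, port, parameter names, placeholder syntax, and bucket name
# are assumptions modeled on the N1QL query service REST API; the actual
# interface is defined in the design doc above.
import json
import urllib.parse
import urllib.request

QUERY_URL = "http://127.0.0.1:8093/query/service"  # assumed query endpoint

def run(params):
    data = urllib.parse.urlencode(params).encode()
    with urllib.request.urlopen(QUERY_URL, data=data, timeout=30) as resp:
        return json.load(resp)

# Parse and plan once...
run({"statement": "PREPARE by_type FROM SELECT name FROM `beer-sample` WHERE type = $1"})

# ...then execute repeatedly without re-parsing or re-planning.
for doc_type in ("beer", "brewery"):
    result = run({"prepared": "by_type", "args": json.dumps([doc_type])})
    print(doc_type, len(result.get("results", [])))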





[MB-10214] Mac version update check is incorrectly identifying newest version Created: 14/Feb/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0.1, 2.2.0, 2.1.1, 3.0.1, 3.0
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: David Haikney Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified
Environment: Mac OS X

Attachments: PNG File upgrade_check.png    
Issue Links:
Duplicate
is duplicated by MB-12345 Version 3.0.0-1209-rel prompts for up... Closed
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-12051 Update the Release_Server job on Jenk... Technical task Open Chris Hillery  
Is this a Regression?: Yes

 Description   
Running the 2.1.1 version of Couchbase on a Mac, "check for latest version" reports that the latest version is already running (e.g. see the attached screenshot).


 Comments   
Comment by Aleksey Kondratenko [ 14/Feb/14 ]
Definitely not a UI bug. It's using phone-home to find out about upgrades, and I have no idea who owns that now.
Comment by Steve Yen [ 12/Jun/14 ]
got an email from ravi to look into this
Comment by Steve Yen [ 12/Jun/14 ]
Not sure if this is the correct analysis, but I did a quick scan of what I think is the Mac installer, which I believe is...

  https://github.com/couchbase/couchdbx-app

It gets its version string by running a "git describe", in the Makefile here...

  https://github.com/couchbase/couchdbx-app/blob/master/Makefile#L1

Currently, a "git describe" on master branch returns...

  $ git describe
  2.1.1r-35-gf6646fa

...which is *kinda* close to the reported version string in the screenshot ("2.1.1-764-rel").

So, I'm thinking one fix needed would be a tagging (e.g., "git tag -a FOO -m FOO") of the couchdbx-app repository.

So, reassigning to Phil to do that appropriately.

Also, it looks like our Mac installer is using an open-source packaging / installer / runtime library called "Sparkle" (which might be a little under-maintained -- not sure).

  https://github.com/andymatuschak/Sparkle/wiki

The sparkle library seems to check for version updates by looking at the URL here...

  https://github.com/couchbase/couchdbx-app/blob/master/cb.plist.tmpl#L42

Which seems to either be...

  http://appcast.couchbase.com/membasex.xml

Or, perhaps...

  http://appcast.couchbase.com/couchbasex.xml

appcast.couchbase.com actually appears to be an S3 bucket in our production Couchbase AWS account. So those *.xml files need to be updated, as their current content references older versions. For example, http://appcast.couchbase.com/couchbase.xml currently looks like...

    <rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle" version="2.0">
    <channel>
    <title>Updates for Couchbase Server</title>
    <link>http://appcast.couchbase.com/couchbase.xml</link>
    <description>Recent changes to Couchbase Server.</description>
    <language>en</language>
    <item>
    <title>Version 1.8.0</title>
    <sparkle:releaseNotesLink>
    http://www.couchbase.org/wiki/display/membase/Couchbase+Server+1.8.0
    </sparkle:releaseNotesLink>
    <!-- date -u +"%a, %d %b %Y %H:%M:%S GMT" -->
    <pubDate>Fri, 06 Jan 2012 16:11:17 GMT</pubDate>
    <enclosure url="http://packages.couchbase.com/1.8.0/Couchbase-Server-Community.dmg" sparkle:version="1.8.0" sparkle:dsaSignature="MCwCFAK8uknVT3WOjPw/3LkQpLBadi2EAhQxivxe2yj6EU6hBlg9YK/5WfPa5Q==" length="33085691" type="application/octet-stream"/>
    </item>
    </channel>
    </rss>

Not updating the XML files, though, probably causes no harm; it just means our OS X users won't be pushed news about updates.
Comment by Phil Labee (Inactive) [ 12/Jun/14 ]
This has nothing to do with "git describe". There should be no place in the product where "git describe" is used to determine version info. See:

    http://hub.internal.couchbase.com/confluence/display/CR/Branching+and+Tagging

so there's definitely a bug in the Makefile.

The version update check seems to be out of date. The phone-home file is generated during:

    http://factory.hq.couchbase.com:8080/job/Product_Staging_Server/

but the process of uploading it is not automated.
Comment by Steve Yen [ 12/Jun/14 ]
Thanks for the links.

> This has nothing to do with "git describe".

My read of the Makefile makes me think, instead, that "git describe" is the default behavior unless it's overridden by the invoker of the make.

> There should be no place in the product that "git describe" should be used to determine version info. See:
> http://hub.internal.couchbase.com/confluence/display/CR/Branching+and+Tagging

It appears all this couchdbx-app / sparkle stuff predates that wiki page by a few years, so I guess it's inherited legacy.

Perhaps voltron / buildbot are not setting the PRODUCT_VERSION correctly before invoking the couchdbx-app make, which makes the Makefile default to 'git describe'?

    commit 85710d16b1c52497d9f12e424a22f3efaeed61e4
    Date: Mon Jun 4 14:38:58 2012 -0700

    Apply correct product version number
    
    Get version number from $PRODUCT_VERSION if it's set.
    (Buildbot and/or voltron will set this.)
    If not set, default to `git describe` as before.
    
> The version update check seems to be out of date.

Yes, that's right. The appcast files are out of date.

> The phone-home file is generated during:
> http://factory.hq.couchbase.com:8080/job/Product_Staging_Server/

I think appcast files for OSX / sparkle are a _different_ mechanism than the phone-home file, and an appcast XML file does not appear to be generated/updated by the Product_Staging_Server job.

But I'm not an expert or really qualified on the details here -- these are just my opinions from a quick code scan, not from actually doing/knowing.
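For what it's worth, the fallback quoted in the commit message above amounts to the following (a sketch of the idea only, not the actual Makefile logic): use $PRODUCT_VERSION when the build system sets it, otherwise fall back to `git describe`.

# Sketch of the versioning fallback quoted in the commit above: use
# $PRODUCT_VERSION when buildbot/voltron set it, otherwise fall back to
# `git describe`. This is the idea only, not the actual Makefile logic.
import os
import subprocess

def product_version():
    version = os.environ.get("PRODUCT_VERSION")
    if version:
        return version
    return subprocess.check_output(["git", "describe"], text=True).strip()

if __name__ == "__main__":
    print(product_version())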

Comment by Wayne Siu [ 01/Aug/14 ]
Per PM (Anil), we should get this fixed by 3.0 RC1.
Raising the priority to Critical.
Comment by Wayne Siu [ 07/Aug/14 ]
Phil,
Please provide update.
Comment by Anil Kumar [ 12/Aug/14 ]
Triage - Upgrading to 3.0 Blocker

Comment by Wayne Siu [ 20/Aug/14 ]
Looks like we may have a short term "fix" for this ticket which Ceej and I have tested.
@Ceej, can you put in the details here?
Comment by Chris Hillery [ 20/Aug/14 ]
The file is hosted in S3, and we proved tonight that overwriting that file (membasex.xml) with a version containing updated version information and download URLs works as expected. We updated it to point to 2.2 for now, since that is the latest version with a freely-available download URL.

We can update the Release_Server job on Jenkins to create an updated version of this XML file from a template, and upload it to S3.
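A sketch of what such a template-driven update could look like (illustration only; the template mirrors the shape of the appcast XML quoted earlier in this ticket, and the version, download URL, length, and output file name are placeholders the real Release_Server job would fill in before uploading to S3):

# Sketch of rendering an updated Sparkle appcast item from a template, in the
# shape of the appcast XML quoted earlier in this ticket. The version, download
# URL, length, and output file name are placeholders; the real Release_Server
# job would fill these in and upload the result to the S3 bucket.
import time

APPCAST_TEMPLATE = """<rss xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle" version="2.0">
<channel>
<title>Updates for Couchbase Server</title>
<description>Recent changes to Couchbase Server.</description>
<language>en</language>
<item>
<title>Version {version}</title>
<pubDate>{pub_date}</pubDate>
<enclosure url="{url}" sparkle:version="{version}" length="{length}" type="application/octet-stream"/>
</item>
</channel>
</rss>
"""

def render_appcast(version, url, length):
    pub_date = time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.gmtime())
    return APPCAST_TEMPLATE.format(version=version, url=url, length=length, pub_date=pub_date)

if __name__ == "__main__":
    with open("membasex.xml", "w") as f:  # placeholder output path
        f.write(render_appcast("3.0.2", "http://example.com/Couchbase-Server.zip", 123456789))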

Assigning back to Wayne for a quick question: Do we support Enterprise edition for MacOS? If we do, then this solution won't be sufficient without more effort, because the two editions will need different Sparkle configurations for updates. Also, Enterprise edition won't be able to directly download the newer release, unless we provide a "hidden" URL for that (the download link on the website goes to a form).
Comment by Chris Hillery [ 14/Oct/14 ]
We manually uploaded a new version of membasex.xml when 3.0.0 was released, but as MB-12345 shows, it doesn't work correctly (it still thinks there's a new download even if you're running the released 3.0.0).

I do not anticipate being able to put more time into this issue in the near future.




[MB-12149] [Windows] Cleanup unnecessary files that are part of the windows builder Created: 08/Sep/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0.1
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7
Build 3.0.1-1261

Attachments: PNG File Screen Shot 2014-09-09 at 2.22.28 PM.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Install windows build 3.0.1-1261
As part of the installation you will see the following directories:

1. cmake -- Does this need to be there?
2. erts-5.10.4 -- this is under the server directory and also under the lib directory, but some files are duplicated; please remove the duplicated files
3. licenses.tgz file -- This can be removed (I do not find this in Linux anymore)



 Comments   
Comment by Raju Suravarjjala [ 09/Sep/14 ]
I did a search on erts_mt and found 4 of them; it looks like there are duplicate files, 2 each for erts_MT.lib and erts_MTD.lib, in two different folders.
Comment by Sriram Melkote [ 09/Sep/14 ]
I can help with erts stuff (if removing one of them breaks anything, that is)
Comment by Chris Hillery [ 10/Nov/14 ]
I will delete the cmake directory and licenses.tgz file.

I don't think I can delete the erts files without knowing for sure which one is "right", so I will assign this to Siri once I'm done with (1) and (3).




Investigate other possible memory allocators that provide the better fragmentation management (MB-10496)

[MB-12604] Add factory Ubuntu 12.04 build for jemalloc cbdep. Created: 10/Nov/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Technical task Priority: Major
Reporter: Dave Rigby Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We need a build on factory.couchbase.com to build jemalloc for Ubuntu 12.04, and add it to the S3 bucket.

 Comments   
Comment by Dave Rigby [ 10/Nov/14 ]
Added a build to factory (http://factory.couchbase.com/view/build/view/third_party_deps/job/cb_dep_ubuntu_1204_x64/) based off the Windows build. This successfully compiles and packages up the library, but it needs updating to upload the result to S3, as the Windows one seemed to SFTP it to latestbuilds, which doesn't look like the correct location.
Comment by Dave Rigby [ 10/Nov/14 ]
Hi Ceej,

The aforementioned build needs updating to upload to an S3 bucket (packages.couchbase.com). I don't know what credentials Jenkins should use for this, could you take a look please?




[MB-12126] there is no manifest file on windows 3.0.1-1253 Created: 03/Sep/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0.1
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 r2 64-bit

Attachments: PNG File ss 2014-09-03 at 12.05.41 PM.png    
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Install Couchbase Server 3.0.1-1253 on Windows Server 2008 R2 64-bit. There is no manifest file in the directory c:\Program Files\Couchbase\Server\



 Comments   
Comment by Chris Hillery [ 03/Sep/14 ]
Also true for 3.0 RC2 build 1205.
Comment by Chris Hillery [ 03/Sep/14 ]
(Side note: While fixing this, log onto build slaves and delete stale "server-overlay/licenses.tgz" file so we stop shipping that)
Comment by Anil Kumar [ 17/Sep/14 ]
Ceej - Any update on this?
Comment by Chris Hillery [ 18/Sep/14 ]
No, not yet.




[MB-12758] Incorrect results from "ifnull" Created: 24/Nov/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4, 3.0, sherlock
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ketaki Gangal Assignee: Isha Kandaswamy
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: 4h
Time Spent: Not Specified
Original Estimate: 4h

Triage: Untriaged
Is this a Regression?: Yes

 Description   
Regression runs on 3.5.0-365-rel show errors due to a result mismatch on ifNull.

test to run : https://github.com/couchbase/testrunner/blob/master/conf/tuq/py-tuq-nulls.conf#L11

======================================================================
FAIL: test_ifnull (tuqquery.tuq_nulls.NULLTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/tuqquery/tuq_nulls.py", line 187, in test_ifnull
    self._verify_results(actual_result['results'], expected_result)
  File "pytests/tuqquery/tuq.py", line 2558, in _verify_results
    expected_result[:100],expected_result[-100:]))
AssertionError: Results are incorrect.

 Actual first and last 100: [{u'feature_name': u'0', u'point': 3}, {u'feature_name': u'1', u'point': 3}, {u'feature_name': u'10', u'point': 3}, {u'feature_name': u'100', u'point': 3}, {u'feature_name': u'101', u'point': 3}, {u'feature_name': u'102', u'point': 3}, {u'feature_name': u'103', u'point': 3}, {u'feature_name': u'104', u'point': 3}, {u'feature_name': u'105', u'point': 3}, {u'feature_name': u'106', u'point': 3}, {u'feature_name': u'107', u'point': 3}, {u'feature_name': u'108', u'point': 3}, {u'feature_name': u'109', u'point': 3}, {u'feature_name': u'11', u'point': 3}, {u'feature_name': u'110', u'point': 3}, {u'feature_name': u'111', u'point': 3}, {u'feature_name': u'112', u'point': 3}, {u'feature_name':

 Comments   
Comment by Iryna Mironava [ 25/Nov/14 ]
The issue is that we see results like:
{u\'feature_name\': u\'987\', u\'point\': None}, {u\'feature_name\': u\'988\', u\'point\': None}, {u\'feature_name\': u\'989\', u\'point\': None}, {u\'feature_name\': u\'99\', u\'point\': 3}, {u\'feature_name\': u\'990\', u\'point\': None}, {u\'feature_name\': u\'991\', u\'point\': None}, {u\'feature_name\': u\'992\', u\'point\': None}, {u\'feature_name\': u\'993\', u\'point\': None}, {u\'feature_name\': u\'994\', u\'point\': None}, {u\'feature_name\': u\'995\', u\'point\': None}, {u\'feature_name\': u\'996\', u\'point\': None}, {u\'feature_name\': u\'997\', u\'point\': None}, {u\'feature_name\': u\'998\', u\'point\': None}, {u\'feature_name\': u\'999\', u\'point\': None}
The documentation says:
IFNULL(expr1, expr2, ...) - first non-NULL value. Note that this function may return MISSING.
So I would expect to see results like {u\'feature_name\': u\'999\'}, i.e. with no 'point' key at all when the value is MISSING.
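As a sketch of those documented semantics (a model for discussion only, not the query engine's implementation), with a sentinel standing in for MISSING:

# Model of the documented IFNULL semantics for discussion only, not the query
# engine's implementation: return the first non-NULL argument; note the result
# may be MISSING.
MISSING = object()  # sentinel standing in for N1QL's MISSING

def ifnull(*exprs):
    for value in exprs:
        if value is not None:   # NULL is modeled as Python None here
            return value        # first non-NULL value; this may be MISSING
    return None                 # all arguments were NULL

print(ifnull(None, 3))                    # 3
print(ifnull(None, MISSING) is MISSING)   # True: a MISSING result means the field
                                          # is omitted from the output document,
                                          # not emitted as 'point': None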
Comment by Gerald Sangudi [ 25/Nov/14 ]
Isha,

Please work with Ketaki and Iryna to troubleshoot this.

Thanks.




[MB-7250] Mac OS X App should be signed by a valid developer key Created: 22/Nov/12  Updated: 25/Nov/14

Status: In Progress
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0-beta-2, 2.1.0, 2.2.0, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: J Chris Anderson Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Build_2.5.0-950.png     PNG File Screen Shot 2013-02-17 at 9.17.16 PM.png     PNG File Screen Shot 2013-04-04 at 3.57.41 PM.png     PNG File Screen Shot 2013-08-22 at 6.12.00 PM.png     PNG File ss_2013-04-03_at_1.06.39 PM.png    
Issue Links:
Dependency
depends on MB-9437 macosx installer package fails during... Closed
Duplicate
is duplicated by MB-12319 [OS X] Check for Updates upgrade does... Resolved
is duplicated by MB-12345 Version 3.0.0-1209-rel prompts for up... Closed
Relates to
relates to CBLT-104 Enable Mac developer signing on Mac b... Open
Is this a Regression?: No

 Description   
Currently launching the Mac OS X version tells you it's from an unidentified developer. You have to right click to launch the app. We can fix this.

 Comments   
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
Chris,

do you know what needs to change on the build machine to embed our developer key ?
Comment by J Chris Anderson [ 22/Nov/12 ]
I have no idea. I could start researching how to get a key from Apple but maybe after the weekend. :)
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
we can discuss this next week : ) . Thanks for reporting the issue Chris.
Comment by Steve Yen [ 26/Nov/12 ]
we'll want separate, related bugs (tasks) for other platforms, too (windows, linux)
Comment by Jens Alfke [ 30/Nov/12 ]
We need to get a developer ID from Apple; this will give us some kind of cert, and a local private key for signing.
Then we need to figure out how to get that key and cert onto the build machine, in the Keychain of the account that runs the buildbot.
Comment by Farshid Ghods (Inactive) [ 02/Jan/13 ]
The instructions to build are available here:
https://github.com/couchbase/couchdbx-app
We need to add codesign as a build step there.
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Phil,

Do you have any update on this ticket?
Comment by Phil Labee (Inactive) [ 22/Jan/13 ]
I have signing cert installed on 10.17.21.150 (MacBuild).

Change to Makefile: http://review.couchbase.org/#/c/24149/
Comment by Phil Labee (Inactive) [ 23/Jan/13 ]
need to change master.cfg and pass env.var. to package-mac
Comment by Phil Labee (Inactive) [ 29/Jan/13 ]
disregard previous. Have added signing to Xcode projects.

see http://review.couchbase.org/#/c/24273/
Comment by Phil Labee (Inactive) [ 31/Jan/13 ]
To test this go to System Preferences / Security & Privacy, and on the General tab set "Allow applications downloaded from" to "Mac App Store and Identified Developers". Set this before running Couchbase Server.app the first time. Once an app has been allowed to run this setting is no longer checked for that app, and there doesn't seem to be a way to reset that.

What is odd is that on my system, I allowed one unsigned build to run before restricting the app run setting, and then no other unsigned builds would be checked (and all would be allowed to run). Either there is a flaw in my testing methodology, or a serious weakness in this security setting: just because one app called Couchbase Server was allowed to run shouldn't confer this privilege to other apps with the same name. A common malware tactic is to modify a trusted app and distribute it as an update, and if the security setting keys off the app name it will do nothing to prevent that.

I'm approving this change without having satisfactorily tested it.
Comment by Jens Alfke [ 31/Jan/13 ]
Strictly speaking it's not the app name but its bundle ID, i.e. "com.couchbase.CouchbaseServer" or whatever we use.

> I allowed one unsigned build to run before restricting the app run setting, and then no other unsigned builds would be checked

By OK'ing an unsigned app you're basically agreeing to toss security out the window, at least for that app. This feature is really just a workaround for older apps. By OK'ing the app you're not really saying "yes, I trust this build of this app" so much as "yes, I agree to run this app even though I don't trust it".

> A common malware tactic is to modify a trusted app and distribute it as update

If it's a trusted app it's hopefully been signed, so the user wouldn't have had to waive signature checking for it.
Comment by Jens Alfke [ 31/Jan/13 ]
Further thought: It might be a good idea to change the bundle ID in the new signed version of the app, because users of 2.0 with strict security settings have presumably already bypassed security on the unsigned version.
Comment by Jin Lim [ 04/Feb/13 ]
Per bug scrubs, keep this a blocker since customers ran into this issue (and originally reported it).
Comment by Phil Labee (Inactive) [ 06/Feb/13 ]
Reverted the change so that builds can complete. The app is currently not being signed.
Comment by Farshid Ghods (Inactive) [ 11/Feb/13 ]
I suggest that for the 2.0.1 release we do this build manually.
Comment by Jin Lim [ 11/Feb/13 ]
As a one-off fix, add the signature manually and automate the required steps later in 2.0.2 or beyond.
Comment by Jin Lim [ 13/Feb/13 ]
Please move this bug to 2.0.2 after populating the required signature manually. I am lowering the severity to critical since it is no longer a blocking issue.
Comment by Farshid Ghods (Inactive) [ 15/Feb/13 ]
Phil to upload the binary to latestbuilds (2.0.1-101-rel.zip).
Comment by Phil Labee (Inactive) [ 15/Feb/13 ]
Please verify:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee (Inactive) [ 15/Feb/13 ]
uploaded:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip

I can rename it when uploading for release.
Comment by Farshid Ghods (Inactive) [ 17/Feb/13 ]
I still get the error that it is from an unidentified developer.

Comment by Phil Labee (Inactive) [ 18/Feb/13 ]
operator error.

I rebuilt the app, this time verifying that the codesign step occurred.

Uploaded the new file to the same location:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee (Inactive) [ 26/Feb/13 ]
still need to perform manual workaround
Comment by Phil Labee (Inactive) [ 04/Mar/13 ]
release candidate has been uploaded to:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip
Comment by Wayne Siu [ 03/Apr/13 ]
Phil, looks like version 172/185 is still getting the error. My Mac version is 10.8.2
Comment by Thuan Nguyen [ 03/Apr/13 ]
Installed Couchbase Server (build 2.0.1-172, community version) on my Mac OS X 10.7.4; I only see the warning message.
Comment by Wayne Siu [ 03/Apr/13 ]
Latest version (04.03.13) : http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.zip
Comment by Maria McDuff (Inactive) [ 03/Apr/13 ]
Works in 10.7 but not in 10.8.
If we can get the fix for 10.8 by tomorrow, end of day, QE is willing to test for release on Tuesday, April 9.
Comment by Phil Labee (Inactive) [ 04/Apr/13 ]
The mac builds are not being automatically signed, so build 185 is not signed. The original 172 is also not signed.

Did you try

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip

to see if that was signed correctly?

Comment by Wayne Siu [ 04/Apr/13 ]
Phil,
Yes, we did try the 172-signed version. It works on 10.7 but not 10.8. Can you take a look?
Comment by Phil Labee (Inactive) [ 04/Apr/13 ]
I rebuilt 2.0.1-185 and uploaded a signed app to:

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.SIGNED.zip

Test on a machine that has never had Couchbase Server installed, and has the security setting to only allow Appstore or signed apps.

If you get the "Couchbase Server.app was downloaded from the internet" warning and you can click OK and install it, then this bug is fixed. The quarantining of files downloaded by a browser is part of the operating system and is not controlled by signing.
Comment by Wayne Siu [ 04/Apr/13 ]
Tried the 185-signed version (see attached screen shot). Same error message.
Comment by Phil Labee (Inactive) [ 04/Apr/13 ]
This is not an error message related to this bug.

Comment by Maria McDuff (Inactive) [ 14/May/13 ]
Per bug triage, we need to have Mac OS X 10.8 working since it is a supported platform (published on the website).
Comment by Wayne Siu [ 29/May/13 ]
Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Anil Kumar [ 31/May/13 ]
We need to address the signing key for both Windows and Mac; deferring this to the next release.
Comment by Dipti Borkar [ 08/Aug/13 ]
Please let's make sure this is fixed in 2.2.
Comment by Phil Labee (Inactive) [ 16/Aug/13 ]
New keys will be created using new account.
Comment by Phil Labee (Inactive) [ 20/Aug/13 ]
iOS Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=iOS Distribution expires Aug 12, 2014

    ~buildbot/Desktop/appledeveloper.couchbase.com/certs/ios/ios_distribution_appledeveloper.couchbase.com.cer

Identifiers:
  App IDS:
    "Couchbase Server" id=com.couchbase.*

Provisioning Profiles:
  Distribution:
    "appledeveloper.couchbase.com" type=Distribution

  ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/ios/appledevelopercouchbasecom.mobileprovision
Comment by Phil Labee (Inactive) [ 20/Aug/13 ]
Mac Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)
    "Couchbase, Inc." type=Developer ID installer (Aug,16,2014)
    "Couchbase, Inc." type=Developer ID Application (Aug,16,2014)
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)

     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developerID_installer.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developererID_application.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution-2.cer

Identifiers:
  App IDs:
    "Couchbase Server" id=couchbase.com.* Prefix=N2Q372V7W2
    "Coucbase Server adhoc" id=couchbase.com.* Prefix=N2Q372V7W2
    .

Provisioning Profiles:
  Distribution:
    "appstore.couchbase.com" type=Distribution
    "Couchbase Server adhoc" type=Distribution

     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/appstorecouchbasecom.privisioningprofile
     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/Couchbase_Server_adhoc.privisioningprofile

Comment by Phil Labee (Inactive) [ 21/Aug/13 ]

As of build 2.2.0-806 the app is signed by a new provisioning profile
Comment by Phil Labee (Inactive) [ 22/Aug/13 ]
 Install version 2.2.0-806 on a macosx 10.8 machine that has never had Couchbase Server installed, which has the security setting to require applications to be signed with a developer ID.
Comment by Phil Labee (Inactive) [ 22/Aug/13 ]
please assign to tester
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
just tried this against newest build 809:
still getting restriction message. see attached.
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
restriction still exists.
Comment by Maria McDuff (Inactive) [ 28/Aug/13 ]
verified in rc1 (build 817). still not fixed. getting same msg:
“Couchbase Server” can’t be opened because it is from an unidentified developer.
Your security preferences allow installation of only apps from the Mac App Store and identified developers.

Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Phil Labee (Inactive) [ 03/Sep/13 ]
Need to create new certificates to replace these that were revoked:

Certificate: Mac Development
Team Name: Couchbase, Inc.

Certificate: Mac Installer Distribution
Team Name: Couchbase, Inc.

Certificate: iOS Development
Team Name: Couchbase, Inc.

Certificate: iOS Distribution
Team Name: Couchbase, Inc.
Comment by Maria McDuff (Inactive) [ 18/Sep/13 ]
candidate for 2.2.1 bug fix release.
Comment by Dipti Borkar [ 28/Oct/13 ]
Is this going to make it into 2.5? We seem to keep deferring it.
Comment by Phil Labee (Inactive) [ 29/Oct/13 ]
cannot test changes with installer that fails
Comment by Phil Labee (Inactive) [ 11/Nov/13 ]
Installed certs as buildbot and signed app with "(recommended) 3rd Party Mac Developer Application", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-001.zip

Signed with "(Oct 30) 3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-002.zip

These zip files were made on the command line, not as a result of the make command. They are 2.5G in size, so they obviously include more than the zip files produced by the make command.

Both versions of the app appear to be signed correctly!

Note: cannot run make command from ssh session. Must Remote Desktop in and use terminal shell natively.
Comment by Phil Labee (Inactive) [ 11/Nov/13 ]
Finally, some progress: if the zip file is made using the --symlinks argument, the app appears to be unsigned; if the symlinked files are included as actual copies, the app appears to be signed correctly.

The zip file with symlinks is 60M, while the zip file with copies of the files is 2.5G, more than 40X the size.
Comment by Phil Labee (Inactive) [ 25/Nov/13 ]
Fixed in 2.5.0-950
Comment by Dipti Borkar [ 25/Nov/13 ]
Maria, can QE please verify this?
Comment by Wayne Siu [ 28/Nov/13 ]
Tested with build 2.5.0-950. Still see the warning box (attached).
Comment by Wayne Siu [ 19/Dec/13 ]
Phil,
Can you give an update on this?
Comment by Ashvinder Singh [ 14/Jan/14 ]
I tested the code signature with the Apple utility "spctl -a -v /Applications/Couchbase\ Server.app/" and got the output:
>>> /Applications/Couchbase Server.app/: a sealed resource is missing or invalid

also tried running the command:
 
bash: codesign -dvvvv /Applications/Couchbase\ Server.app
>>>
Executable=/Applications/Couchbase Server.app/Contents/MacOS/Couchbase Server
Identifier=com.couchbase.couchbase-server
Format=bundle with Mach-O thin (x86_64)
CodeDirectory v=20100 size=639 flags=0x0(none) hashes=23+5 location=embedded
Hash type=sha1 size=20
CDHash=868e4659f4511facdf175b44a950b487fa790dc4
Signature size=4355
Authority=3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)
Authority=Apple Worldwide Developer Relations Certification Authority
Authority=Apple Root CA
Signed Time=Jan 8, 2014, 10:59:16 AM
Info.plist entries=31
Sealed Resources version=1 rules=4 files=5723
Internal requirements count=1 size=216

It looks like the code signature is present but became invalid as new files were added/modified in the project. I suggest the build team rebuild and add the code signature again.
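A small sketch that wraps the two checks above (spctl and codesign are the standard OS X tools already used in this ticket) so a build job could fail when the bundle signature is missing or invalidated; the app path is the default install location and would differ on a build machine.

# Sketch: wrap the two checks used above (spctl and codesign, standard OS X
# tools) so a build job could fail fast when the app bundle's signature is
# missing or has been invalidated. The app path is the default install
# location and would differ on a build machine.
import subprocess
import sys

APP = "/Applications/Couchbase Server.app"

def run(cmd):
    proc = subprocess.run(cmd, capture_output=True, text=True)
    print("$", " ".join(cmd))
    print(proc.stdout or proc.stderr)
    return proc.returncode

def main():
    gatekeeper = run(["spctl", "-a", "-v", APP])                   # Gatekeeper assessment
    signature = run(["codesign", "--verify", "--verbose=4", APP])  # signature validity
    if gatekeeper != 0 or signature != 0:
        sys.exit("app bundle is not validly signed")

if __name__ == "__main__":
    main()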
Comment by Phil Labee (Inactive) [ 17/Apr/14 ]
need VM to clone for developer experimentation
Comment by Anil Kumar [ 18/Jul/14 ]
Any update on this? We need this for 3.0.0 GA.

Please update the ticket.

Triage - July 18th
Comment by Wayne Siu [ 02/Aug/14 ]
Siri is helping to figure out what the next step is.
Comment by Anil Kumar [ 13/Aug/14 ]
Jens - Assigning as per Ravi's request.
Comment by Chris Hillery [ 13/Aug/14 ]
Jens requested assistance in setting up a MacOS development environment for building Couchbase. Phil (or maybe Siri?), can you help him with that?
Comment by Phil Labee (Inactive) [ 13/Aug/14 ]
The production macosx builder has been cloned:

    10.6.2.159 macosx-x64-server-builder-01-clone

if you want to use your own host, see:

    http://hub.internal.couchbase.com/confluence/display/CR/How+to+Setup+a+MacOSX+Server+Build+Node
Comment by Jens Alfke [ 15/Aug/14 ]
Here are the Apple docs on building apps signed with a Developer ID: https://developer.apple.com/library/mac/documentation/IDEs/Conceptual/AppDistributionGuide/DistributingApplicationsOutside/DistributingApplicationsOutside.html#//apple_ref/doc/uid/TP40012582-CH12-SW2

I've got everything configured, but the build process fails at the final step, after I press the Distribute button in the Organizer window. I get a very uninformative error alert, "Code signing operation failed / Check that the identity you selected is valid."

I've asked for help on the xcode-users mailing list. Blocked until I hear something back.
Comment by Anil Kumar [ 18/Aug/14 ]
Triage - Not blocking 3.0 RC1
Comment by Phil Labee (Inactive) [ 25/Aug/14 ]
from Apple Developer mail list:

Dear Developer,

With the release of OS X Mavericks 10.9.5, the way that OS X recognizes signed apps will change. Signatures created with OS X Mountain Lion 10.8.5 or earlier (v1 signatures) will be obsoleted and Gatekeeper will no longer recognize them. Users may receive a Gatekeeper warning and will need to exempt your app to continue using it. To ensure your apps will run without warning on updated versions of OS X, they must be signed on OS X Mavericks 10.9 or later (v2 signatures).

If you build code with an older version of OS X, use OS X Mavericks 10.9 or later to sign your app and create v2 signatures using the codesign tool. Structure your bundle according to the signature evaluation requirements for OS X Mavericks 10.9 or later. Considerations include:

 * Signed code should only be placed in directories where the system expects to find signed code.

 * Resources should not be located in directories where the system expects to find signed code.

 * The --resource-rules flag and ResourceRules.plist are not supported.

Make sure your current and upcoming releases work properly with Gatekeeper by testing on OS X Mavericks 10.9.5 and OS X Yosemite 10.10 Developer Preview 5 or later. Apps signed with v2 signatures will work on older versions of OS X.

For more details, read “Code Signing changes in OS X Mavericks” and “Changes in OS X 10.9.5 and Yosemite Developer Preview 5” in “OS X Code Signing In Depth”:

    http://c.apple.com/r?v=2&la=en&lc=us&a=EEjRsqZNfcheZauIAhlqmxVG35c6HJuf50mGu47LWEktoAjykEJp8UYqbgca3uWG&ct=AJ0T0e3y2W

Best regards,
Apple Developer Technical Support
Comment by Phil Labee (Inactive) [ 28/Aug/14 ]
change to buildbot-internal to unlock keychain before running make and lock after:

    http://review.couchbase.org/#/c/41028/

change to couchdbx-app to sign app, on dev branch "plabee/MB-7250":

    http://review.couchbase.org/#/c/41025/

change to manifest to use this dev branch for 3.0.1 builds:

    http://review.couchbase.org/#/c/41026/
Comment by Wayne Siu [ 29/Aug/14 ]
Moving it to 3.0.1.




[MB-12772] EP-engine support for the XDCR Built-in Time Synchronization mechanism Created: 25/Nov/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
This is for tracking the requirements on the ep-engine side to support the XDCR built-in time synchronization mechanism.

The design spec is at https://docs.google.com/document/d/1xBjx0SDeEnWToEv1RGLHypQAkN4HWirVqQ-JtaBtlrI/edit?usp=sharing

This is a sherlock stretch goal.




[MB-12738] Checkpoints are always purged if there are no cursors in them Created: 20/Nov/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Task Priority: Major
Reporter: Mike Wiederhold Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We are aggressively removing checkpoints from memory. In this case I loaded 100k items into Couchbase, waited, and observed that there was only one item in each checkpoint manager. We should keep checkpoints in memory if we have space.

Mikes-MacBook-Pro:ep-engine mikewied$ management/cbstats 10.5.2.34:12000 checkpoint | grep num_checkpoint_items | cut -c 44- | awk '{s+=$1} END {print s}'
1024
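The same aggregation as the shell pipeline above, as a Python sketch (it assumes it is run from the ep-engine source directory, like the command above, and sums num_checkpoint_items across all vbuckets):

# Same aggregation as the shell pipeline above: sum num_checkpoint_items
# across all vbuckets from `cbstats <host:port> checkpoint` output. Assumes
# it is run from the ep-engine source directory, like the command above.
import subprocess

def total_checkpoint_items(hostport="10.5.2.34:12000"):
    out = subprocess.check_output(["management/cbstats", hostport, "checkpoint"], text=True)
    total = 0
    for line in out.splitlines():
        key, _, value = line.rpartition(":")
        if key.endswith("num_checkpoint_items"):
            total += int(value)
    return total

if __name__ == "__main__":
    print(total_checkpoint_items())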




[MB-12692] EP engine side of changes needed for supporting XDCR LWW in Sherlock Created: 17/Nov/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
The design doc for LWW is at https://docs.google.com/document/d/1xBjx0SDeEnWToEv1RGLHypQAkN4HWirVqQ-JtaBtlrI/edit?usp=sharing

 Comments   
Comment by Chiyoung Seo [ 19/Nov/14 ]
Xiaomei,

Per our discussion, the ep-engine team needs more clarifications regarding how HLC should be generated in various edge cases.

Please update this ticket when the requirement doc is ready.
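For context only, here is a generic hybrid logical clock (HLC) sketch, i.e. the textbook update rules rather than the design in the linked doc (which is precisely where the edge cases mentioned above need to be spelled out). An HLC timestamp pairs a physical component with a logical counter so that last-write-wins comparisons stay monotonic even when wall clocks drift.

# Generic hybrid logical clock (HLC) sketch, for context only; the actual
# design (including the edge cases discussed above) is in the linked doc.
import time

class HybridLogicalClock:
    def __init__(self):
        self.physical = 0   # highest physical component seen, in ms
        self.logical = 0    # logical counter to break ties

    def now(self):
        # Timestamp a local write.
        wall = int(time.time() * 1000)
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1   # clock did not advance: bump the counter
        return (self.physical, self.logical)

    def observe(self, remote_physical, remote_logical):
        # Merge the timestamp carried by a remote write (e.g. received via XDCR).
        wall = int(time.time() * 1000)
        old_physical = self.physical
        self.physical = max(old_physical, remote_physical, wall)
        if self.physical == old_physical and self.physical == remote_physical:
            self.logical = max(self.logical, remote_logical) + 1
        elif self.physical == old_physical:
            self.logical += 1
        elif self.physical == remote_physical:
            self.logical = remote_logical + 1
        else:
            self.logical = 0
        return (self.physical, self.logical)

# Last-write-wins: the write with the larger (physical, logical) pair wins.
def lww_winner(ts_a, ts_b):
    return ts_a if ts_a >= ts_b else ts_b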
Comment by Xiaomei Zhang [ 25/Nov/14 ]
Chiyoung,

The design doc is updated. Please review.




[MB-12773] Go-XDCR: No transfer of data when a second replication on same source bucket is created Created: 25/Nov/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-11-25 at 2.55.55 PM.png    
Triage: Untriaged
Epic Link: XDCR next release
Is this a Regression?: No

 Description   
Consistently reproducible

Topology: Star

Steps to reproduce
--------------------------
1. Create 3 buckets default, dest, target
2. default --> dest (10000 keys replicated)
3. Now create default -> target
4. Everything seems fine in the log - pipeline creation, creation of UPR streams, creation of batches; even the DCP queue on the default bucket shows a drain rate of 10k/sec (screenshot attached) - but no data is transferred to bucket 'target'.

Log
-----
ReplicationManager14:55:15.843453 [INFO] Creating replication - sourceCluterUUID=127.0.0.1:9000, sourceBucket=default, targetClusterUUID=localhost:9000, targetBucket=target, filterName=, settings=map[source_nozzle_per_node:2 max_expected_replication_lag:1000 log_level:Info target_nozzle_per_node:2 failure_restart_interval:30 filter_expression: timeout_percentage_cap:80 checkpoint_interval:1800 optimistic_replication_threshold:256 active:true batch_size:2048 http_connection:20 replication_type:xmem batch_count:500], createReplSpec=true
ReplicationManager14:55:15.843471 [INFO] Creating replication spec - sourceCluterUUID=127.0.0.1:9000, sourceBucket=default, targetClusterUUID=localhost:9000, targetBucket=target, filterName=, settings=map[log_level:Info target_nozzle_per_node:2 failure_restart_interval:30 source_nozzle_per_node:2 max_expected_replication_lag:1000 timeout_percentage_cap:80 checkpoint_interval:1800 filter_expression: optimistic_replication_threshold:256 http_connection:20 replication_type:xmem batch_count:500 active:true batch_size:2048]
ReplicationManager14:55:15.848854 [INFO] Pipeline xdcr_127.0.0.1:9000_default_localhost:9000_target is created and started
AdminPort14:55:15.848863 [INFO] forwardReplicationRequest
PipelineManager14:55:15.848898 [INFO] Starting the pipeline xdcr_127.0.0.1:9000_default_localhost:9000_target
XDCRFactory14:55:15.849150 [INFO] kvHosts=[127.0.0.1]
cluster=127.0.0.1:9000
ServerList=[127.0.0.1:12000]
ServerVBMap=map[127.0.0.1:12000:[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]]
XDCRFactory14:55:15.862793 [INFO] found kv
DcpNozzle14:55:15.874883 [INFO] Constructed Dcp nozzle dcp_127.0.0.1:12000_0 with vblist [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
DcpNozzle14:55:15.887750 [INFO] Constructed Dcp nozzle dcp_127.0.0.1:12000_1 with vblist [32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]
XDCRFactory14:55:15.887764 [INFO] Constructed 2 source nozzles
cluster=localhost:9000
2014/11/25 14:55:15 Warning: Finalizing a bucket with active connections.
2014/11/25 14:55:15 Warning: Finalizing a bucket with active connections.
ServerList=[127.0.0.1:12000]
ServerVBMap=map[127.0.0.1:12000:[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]]
XDCRFactory14:55:15.917171 [INFO] Constructed 2 outgoing nozzles
XDCRRouter14:55:15.917189 [INFO] Router created with 2 downstream parts
XDCRFactory14:55:15.917193 [INFO] Constructed router
PipelineSupervisor14:55:15.917204 [INFO] Attaching pipeline supervior service
DcpNozzle14:55:15.917236 [INFO] listener &{{0xc208bb9ce0 0xc208bb9dc0 <nil> <nil> <nil> false 0xc208bb3830} 0xc208bc1c20 0xc208b3f2a0 400000000 400000000 <nil> 0x46d0fe0 0xc208bc1c80 0xc208bc1ce0 []} is registered on event 4 for Component dcp_127.0.0.1:12000_0
PipelineSupervisor14:55:15.917243 [INFO] Registering ErrorEncountered event on part dcp_127.0.0.1:12000_0
XmemNozzle14:55:15.917258 [INFO] listener &{{0xc208bb9ce0 0xc208bb9dc0 <nil> <nil> <nil> false 0xc208bb3830} 0xc208bc1c20 0xc208b3f2a0 400000000 400000000 <nil> 0x46d0fe0 0xc208bc1c80 0xc208bc1ce0 []} is registered on event 4 for Component xmem_127.0.0.1:12000_1
PipelineSupervisor14:55:15.917264 [INFO] Registering ErrorEncountered event on part xmem_127.0.0.1:12000_1
XmemNozzle14:55:15.917280 [INFO] listener &{{0xc208bb9ce0 0xc208bb9dc0 <nil> <nil> <nil> false 0xc208bb3830} 0xc208bc1c20 0xc208b3f2a0 400000000 400000000 <nil> 0x46d0fe0 0xc208bc1c80 0xc208bc1ce0 []} is registered on event 4 for Component xmem_127.0.0.1:12000_0
PipelineSupervisor14:55:15.917285 [INFO] Registering ErrorEncountered event on part xmem_127.0.0.1:12000_0
DcpNozzle14:55:15.917301 [INFO] listener &{{0xc208bb9ce0 0xc208bb9dc0 <nil> <nil> <nil> false 0xc208bb3830} 0xc208bc1c20 0xc208b3f2a0 400000000 400000000 <nil> 0x46d0fe0 0xc208bc1c80 0xc208bc1ce0 []} is registered on event 4 for Component dcp_127.0.0.1:12000_1
PipelineSupervisor14:55:15.917306 [INFO] Registering ErrorEncountered event on part dcp_127.0.0.1:12000_1
XDCRRouter14:55:15.917324 [INFO] listener &{{0xc208bb9ce0 0xc208bb9dc0 <nil> <nil> <nil> false 0xc208bb3830} 0xc208bc1c20 0xc208b3f2a0 400000000 400000000 <nil> 0x46d0fe0 0xc208bc1c80 0xc208bc1ce0 []} is registered on event 4 for Component XDCRRouter
PipelineSupervisor14:55:15.917330 [INFO] Registering ErrorEncountered event on connector XDCRRouter
XDCRFactory14:55:15.917334 [INFO] XDCR pipeline constructed
PipelineManager14:55:15.917337 [INFO] Pipeline is constructed, start it
XmemNozzle14:55:15.917376 [INFO] Xmem starting ....
XmemNozzle14:55:15.917437 [INFO] init a new batch
XmemNozzle14:55:15.919449 [INFO] ....Finish initializing....
XmemNozzle14:55:15.919467 [INFO] Xmem nozzle is started
XmemNozzle14:55:15.919474 [INFO] Xmem starting ....
XmemNozzle14:55:15.919521 [INFO] init a new batch
XmemNozzle14:55:15.922057 [INFO] ....Finish initializing....
XmemNozzle14:55:15.922082 [INFO] Xmem nozzle is started
DcpNozzle14:55:15.922284 [INFO] Dcp nozzle dcp_127.0.0.1:12000_0 starting ....
DcpNozzle14:55:15.922289 [INFO] Dcp nozzle starting ....
XmemNozzle14:55:15.922335 [INFO] xmem_127.0.0.1:12000_0 processData starts..........
XmemNozzle14:55:15.922361 [INFO] xmem_127.0.0.1:12000_1 processData starts..........
DcpNozzle14:55:15.923789 [INFO] ....Finished dcp nozzle initialization....
DcpNozzle14:55:15.923804 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=4
DcpNozzle14:55:15.923823 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=5
DcpNozzle14:55:15.923839 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=9
DcpNozzle14:55:15.923851 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=10
DcpNozzle14:55:15.923864 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=17
DcpNozzle14:55:15.923878 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=29
DcpNozzle14:55:15.923892 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=0
DcpNozzle14:55:15.923904 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=15
DcpNozzle14:55:15.923917 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=18
DcpNozzle14:55:15.923935 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=20
DcpNozzle14:55:15.923958 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=24
DcpNozzle14:55:15.923980 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=27
DcpNozzle14:55:15.923999 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=2
DcpNozzle14:55:15.924013 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=3
DcpNozzle14:55:15.924036 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=8
DcpNozzle14:55:15.924058 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=12
DcpNozzle14:55:15.924082 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=16
DcpNozzle14:55:15.924101 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=31
DcpNozzle14:55:15.924127 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=21
DcpNozzle14:55:15.924151 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=25
DcpNozzle14:55:15.924171 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=28
DcpNozzle14:55:15.924194 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=30
DcpNozzle14:55:15.924218 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=1
DcpNozzle14:55:15.924240 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=7
DcpNozzle14:55:15.924267 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=11
DcpNozzle14:55:15.924291 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=13
DcpNozzle14:55:15.924308 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=19
DcpNozzle14:55:15.924329 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=22
DcpNozzle14:55:15.924354 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=23
DcpNozzle14:55:15.924378 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=6
DcpNozzle14:55:15.924402 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=26
DcpNozzle14:55:15.924421 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=14
DcpNozzle14:55:15.924439 [INFO] Dcp nozzle is started
DcpNozzle14:55:15.924740 [INFO] Dcp nozzle dcp_127.0.0.1:12000_1 starting ....
DcpNozzle14:55:15.924749 [INFO] Dcp nozzle starting ....
14:55:15.924810 Default None UPR_STREAMREQ for vb 4 successful
14:55:15.924832 Default None UPR_STREAMREQ for vb 5 successful
14:55:15.924851 Default None UPR_STREAMREQ for vb 9 successful
14:55:15.924869 Default None UPR_STREAMREQ for vb 10 successful
14:55:15.924887 Default None UPR_STREAMREQ for vb 17 successful
14:55:15.924905 Default None UPR_STREAMREQ for vb 29 successful
14:55:15.924926 Default None UPR_STREAMREQ for vb 0 successful
14:55:15.924945 Default None UPR_STREAMREQ for vb 15 successful
14:55:15.924961 Default None UPR_STREAMREQ for vb 18 successful
14:55:15.924979 Default None UPR_STREAMREQ for vb 20 successful
14:55:15.924997 Default None UPR_STREAMREQ for vb 24 successful
DcpNozzle14:55:15.925012 [INFO] dcp_127.0.0.1:12000_0 processData starts..........
14:55:15.925058 Default None UPR_STREAMREQ for vb 27 successful
14:55:15.925077 Default None UPR_STREAMREQ for vb 2 successful
14:55:15.925157 Default None UPR_STREAMREQ for vb 3 successful
14:55:15.925300 Default None UPR_STREAMREQ for vb 8 successful
14:55:15.925377 Default None UPR_STREAMREQ for vb 12 successful
14:55:15.925430 Default None UPR_STREAMREQ for vb 16 successful
14:55:15.925503 Default None UPR_STREAMREQ for vb 31 successful
14:55:15.925573 Default None UPR_STREAMREQ for vb 21 successful
14:55:15.925645 Default None UPR_STREAMREQ for vb 25 successful
14:55:15.925746 Default None UPR_STREAMREQ for vb 28 successful
14:55:15.925805 Default None UPR_STREAMREQ for vb 30 successful
14:55:15.925908 Default None UPR_STREAMREQ for vb 1 successful
14:55:15.925944 Default None UPR_STREAMREQ for vb 7 successful
14:55:15.926028 Default None UPR_STREAMREQ for vb 11 successful
14:55:15.926082 Default None UPR_STREAMREQ for vb 13 successful
DcpNozzle14:55:15.926127 [INFO] ....Finished dcp nozzle initialization....
DcpNozzle14:55:15.926146 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=43
DcpNozzle14:55:15.926171 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=50
DcpNozzle14:55:15.926196 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=52
DcpNozzle14:55:15.926214 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=62
DcpNozzle14:55:15.926232 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=32
DcpNozzle14:55:15.926253 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=33
DcpNozzle14:55:15.926271 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=49
DcpNozzle14:55:15.926290 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=57
DcpNozzle14:55:15.926308 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=60
DcpNozzle14:55:15.926330 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=35
DcpNozzle14:55:15.926349 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=56
DcpNozzle14:55:15.926370 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=40
DcpNozzle14:55:15.926390 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=41
DcpNozzle14:55:15.926409 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=47
DcpNozzle14:55:15.926427 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=53
DcpNozzle14:55:15.926451 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=58
DcpNozzle14:55:15.926470 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=63
DcpNozzle14:55:15.926493 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=45
DcpNozzle14:55:15.926512 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=34
DcpNozzle14:55:15.926530 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=39
DcpNozzle14:55:15.926547 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=44
DcpNozzle14:55:15.926565 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=48
DcpNozzle14:55:15.926585 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=51
DcpNozzle14:55:15.926603 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=54
DcpNozzle14:55:15.926620 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=55
DcpNozzle14:55:15.926638 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=36
DcpNozzle14:55:15.926660 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=37
DcpNozzle14:55:15.926682 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=46
DcpNozzle14:55:15.926713 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=38
DcpNozzle14:55:15.926733 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=42
DcpNozzle14:55:15.926751 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=59
DcpNozzle14:55:15.926771 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=61
DcpNozzle14:55:15.926788 [INFO] Dcp nozzle is started
GenericPipeline14:55:15.926793 [INFO] All parts has been started
GenericPipeline14:55:15.926819 [INFO] -----------Pipeline xdcr_127.0.0.1:9000_default_localhost:9000_target is started----------
14:55:15.926857 Default None UPR_STREAMREQ for vb 43 successful
14:55:15.926879 Default None UPR_STREAMREQ for vb 50 successful
14:55:15.926897 Default None UPR_STREAMREQ for vb 52 successful
14:55:15.926916 Default None UPR_STREAMREQ for vb 62 successful
14:55:15.926934 Default None UPR_STREAMREQ for vb 32 successful
14:55:15.926953 Default None UPR_STREAMREQ for vb 33 successful
14:55:15.926972 Default None UPR_STREAMREQ for vb 49 successful
14:55:15.926990 Default None UPR_STREAMREQ for vb 57 successful
DcpNozzle14:55:15.927005 [INFO] dcp_127.0.0.1:12000_1 processData starts..........
14:55:15.927049 Default None UPR_STREAMREQ for vb 19 successful
14:55:15.927069 Default None UPR_STREAMREQ for vb 22 successful
14:55:15.927091 Default None UPR_STREAMREQ for vb 23 successful
14:55:15.927109 Default None UPR_STREAMREQ for vb 6 successful
14:55:15.927126 Default None UPR_STREAMREQ for vb 26 successful
14:55:15.927144 Default None UPR_STREAMREQ for vb 14 successful
14:55:15.927667 Default None UPR_STREAMREQ for vb 60 successful
14:55:15.927690 Default None UPR_STREAMREQ for vb 35 successful
14:55:15.927708 Default None UPR_STREAMREQ for vb 56 successful
14:55:15.927725 Default None UPR_STREAMREQ for vb 40 successful
14:55:15.927741 Default None UPR_STREAMREQ for vb 41 successful
14:55:15.927757 Default None UPR_STREAMREQ for vb 47 successful
14:55:15.927772 Default None UPR_STREAMREQ for vb 53 successful
14:55:15.927787 Default None UPR_STREAMREQ for vb 58 successful
14:55:15.927803 Default None UPR_STREAMREQ for vb 63 successful
14:55:15.927821 Default None UPR_STREAMREQ for vb 45 successful
14:55:15.927840 Default None UPR_STREAMREQ for vb 34 successful
14:55:15.927863 Default None UPR_STREAMREQ for vb 39 successful
14:55:15.928160 Default None UPR_STREAMREQ for vb 44 successful
14:55:15.928179 Default None UPR_STREAMREQ for vb 48 successful
14:55:15.928197 Default None UPR_STREAMREQ for vb 51 successful
14:55:15.928216 Default None UPR_STREAMREQ for vb 54 successful
14:55:15.928234 Default None UPR_STREAMREQ for vb 55 successful
14:55:15.928444 Default None UPR_STREAMREQ for vb 36 successful
14:55:15.928464 Default None UPR_STREAMREQ for vb 37 successful
14:55:15.928483 Default None UPR_STREAMREQ for vb 46 successful
14:55:15.928695 Default None UPR_STREAMREQ for vb 38 successful
14:55:15.928716 Default None UPR_STREAMREQ for vb 42 successful
14:55:15.928734 Default None UPR_STREAMREQ for vb 59 successful
14:55:15.928751 Default None UPR_STREAMREQ for vb 61 successful
XmemNozzle14:55:15.942524 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:15.942538 [INFO] init a new batch
XmemNozzle14:55:15.942545 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
XmemNozzle14:55:15.942578 [INFO] Send batch count=500
XmemNozzle14:55:15.961096 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:15.961111 [INFO] init a new batch
XmemNozzle14:55:15.961117 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
XmemNozzle14:55:15.964156 [INFO] Send batch count=500
XmemNozzle14:55:15.976531 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:55:15.976549 [INFO] init a new batch
XmemNozzle14:55:15.976556 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 1 batches ready
XmemNozzle14:55:15.976598 [INFO] Send batch count=500
XmemNozzle14:55:15.993209 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:15.993224 [INFO] init a new batch
XmemNozzle14:55:15.993228 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
XmemNozzle14:55:15.995378 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:55:15.995389 [INFO] init a new batch
XmemNozzle14:55:15.995393 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 1 batches ready
2014/11/25 14:55:16 Warning: Finalizing a bucket with active connections.
2014/11/25 14:55:16 Warning: Finalizing a bucket with active connections.
XmemNozzle14:55:16.007176 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:16.007190 [INFO] init a new batch
XmemNozzle14:55:16.007194 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
XmemNozzle14:55:16.010383 [INFO] Send batch count=500
XmemNozzle14:55:16.021284 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:55:16.021296 [INFO] init a new batch
XmemNozzle14:55:16.021300 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 1 batches ready
XmemNozzle14:55:16.031113 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:16.031128 [INFO] init a new batch
XmemNozzle14:55:16.031135 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
XmemNozzle14:55:16.033492 [INFO] Send batch count=500
XmemNozzle14:55:16.048207 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:55:16.048225 [INFO] init a new batch
XmemNozzle14:55:16.048232 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 1 batches ready
XmemNozzle14:55:16.058367 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:16.058379 [INFO] init a new batch
XmemNozzle14:55:16.058383 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 2 batches ready
XmemNozzle14:55:16.059315 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:55:16.059320 [INFO] init a new batch
XmemNozzle14:55:16.059323 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 2 batches ready
XmemNozzle14:55:16.075746 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:16.075763 [INFO] init a new batch
XmemNozzle14:55:16.075770 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 3 batches ready
XmemNozzle14:55:16.080462 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:16.080471 [INFO] init a new batch
XmemNozzle14:55:16.080484 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 4 batches ready
XmemNozzle14:55:16.083353 [INFO] Send batch count=500
XmemNozzle14:55:16.090886 [INFO] Send batch count=500
XmemNozzle14:55:16.097298 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:55:16.097306 [INFO] init a new batch
XmemNozzle14:55:16.097310 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 2 batches ready
XmemNozzle14:55:16.107902 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:16.107926 [INFO] init a new batch
XmemNozzle14:55:16.107931 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 4 batches ready
XmemNozzle14:55:16.112732 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:55:16.112741 [INFO] init a new batch
XmemNozzle14:55:16.112744 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 3 batches ready
XmemNozzle14:55:16.117204 [INFO] Send batch count=500
XmemNozzle14:55:16.122616 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:55:16.122624 [INFO] init a new batch
XmemNozzle14:55:16.122628 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 5 batches ready
XmemNozzle14:55:16.124225 [INFO] Send batch count=500
XmemNozzle14:55:16.131236 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:55:16.131243 [INFO] init a new batch
XmemNozzle14:55:16.131247 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 3 batches ready
XmemNozzle14:55:16.137993 [INFO] Send batch count=500
XmemNozzle14:55:16.144028 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:55:16.144037 [INFO] init a new batch
XmemNozzle14:55:16.144041 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 3 batches ready
XmemNozzle14:55:16.150021 [INFO] Send batch count=500
XmemNozzle14:55:16.163865 [INFO] Send batch count=500
XmemNozzle14:55:16.169007 [INFO] Send batch count=500
XmemNozzle14:55:16.176265 [INFO] Send batch count=500
XmemNozzle14:55:16.182591 [INFO] Send batch count=500
XmemNozzle14:55:16.191643 [INFO] Send batch count=500
XmemNozzle14:55:16.200172 [INFO] Send batch count=500
XmemNozzle14:55:16.207395 [INFO] Send batch count=500
XmemNozzle14:55:16.219480 [INFO] Send batch count=500
XmemNozzle14:55:16.823473 [INFO] xmem_127.0.0.1:12000_0 batch expired, moving it to ready queue
XmemNozzle14:55:16.823495 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=38) ready queue
XmemNozzle14:55:16.823500 [INFO] init a new batch
XmemNozzle14:55:16.823505 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
XmemNozzle14:55:16.824591 [INFO] Send batch count=38
XmemNozzle14:55:16.923039 [INFO] xmem_127.0.0.1:12000_1 batch expired, moving it to ready queue
XmemNozzle14:55:16.923062 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=462) ready queue
XmemNozzle14:55:16.923067 [INFO] init a new batch
XmemNozzle14:55:16.923072 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 1 batches ready
XmemNozzle14:55:16.923599 [INFO] Send batch count=462




[MB-12770] Go-XDCR: Replication stuck with "sendSingle: transmit error: write tcp IP:12000: broken pipe" (64 vbuckets) Created: 25/Nov/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-11-25 at 2.35.01 PM.png    
Triage: Untriaged
Epic Link: XDCR next release
Is this a Regression?: No

 Description   
Inconsistent in nature; could be related to MB-12632, per Xiaomei.

I have seen this many times before. In this run I did a 'make dataclean', so it was a fresh cluster. I had 3 buckets created in my cluster and tried to create a replication between the default and dest buckets.

AdminPort14:30:08.325226 [INFO] doCreateReplicationRequest called
ReplicationManager14:30:08.325283 [INFO] Creating replication - sourceCluterUUID=127.0.0.1:9000, sourceBucket=default, targetClusterUUID=localhost:9000, targetBucket=dest, filterName=, settings=map[batch_count:500 active:true checkpoint_interval:1800 optimistic_replication_threshold:256 http_connection:20 source_nozzle_per_node:2 target_nozzle_per_node:2 replication_type:xmem batch_size:2048 max_expected_replication_lag:1000 timeout_percentage_cap:80 failure_restart_interval:30 log_level:Info filter_expression:], createReplSpec=true
ReplicationManager14:30:08.325301 [INFO] Creating replication spec - sourceCluterUUID=127.0.0.1:9000, sourceBucket=default, targetClusterUUID=localhost:9000, targetBucket=dest, filterName=, settings=map[active:true checkpoint_interval:1800 batch_count:500 http_connection:20 source_nozzle_per_node:2 target_nozzle_per_node:2 replication_type:xmem batch_size:2048 optimistic_replication_threshold:256 max_expected_replication_lag:1000 timeout_percentage_cap:80 log_level:Info filter_expression: failure_restart_interval:30]
ReplicationManager14:30:08.330582 [INFO] Pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest is created and started
AdminPort14:30:08.330593 [INFO] forwardReplicationRequest
PipelineManager14:30:08.330626 [INFO] Starting the pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest
XDCRFactory14:30:08.330876 [INFO] kvHosts=[127.0.0.1]
cluster=127.0.0.1:9000
2014/11/25 14:30:08 Warning: Finalizing a bucket with active connections.
ServerList=[127.0.0.1:12000]
ServerVBMap=map[127.0.0.1:12000:[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]]
XDCRFactory14:30:08.344569 [INFO] found kv
DcpNozzle14:30:08.354537 [INFO] Constructed Dcp nozzle dcp_127.0.0.1:12000_0 with vblist [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
DcpNozzle14:30:08.365397 [INFO] Constructed Dcp nozzle dcp_127.0.0.1:12000_1 with vblist [32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]
XDCRFactory14:30:08.365410 [INFO] Constructed 2 source nozzles
cluster=localhost:9000
2014/11/25 14:30:08 Warning: Finalizing a bucket with active connections.
2014/11/25 14:30:08 Warning: Finalizing a bucket with active connections.
ServerList=[127.0.0.1:12000]
ServerVBMap=map[127.0.0.1:12000:[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]]
XDCRFactory14:30:08.388646 [INFO] Constructed 2 outgoing nozzles
XDCRRouter14:30:08.388660 [INFO] Router created with 2 downstream parts
XDCRFactory14:30:08.388663 [INFO] Constructed router
PipelineSupervisor14:30:08.388671 [INFO] Attaching pipeline supervior service
DcpNozzle14:30:08.388695 [INFO] listener &{{0xc2084bb420 0xc2084bb490 <nil> <nil> <nil> false 0xc2087bbfa0} 0xc208a035c0 0xc208b8ba80 400000000 400000000 <nil> 0x46d0fe0 0xc208a03620 0xc208a03680 []} is registered on event 4 for Component dcp_127.0.0.1:12000_0
PipelineSupervisor14:30:08.388699 [INFO] Registering ErrorEncountered event on part dcp_127.0.0.1:12000_0
XmemNozzle14:30:08.388709 [INFO] listener &{{0xc2084bb420 0xc2084bb490 <nil> <nil> <nil> false 0xc2087bbfa0} 0xc208a035c0 0xc208b8ba80 400000000 400000000 <nil> 0x46d0fe0 0xc208a03620 0xc208a03680 []} is registered on event 4 for Component xmem_127.0.0.1:12000_1
PipelineSupervisor14:30:08.388713 [INFO] Registering ErrorEncountered event on part xmem_127.0.0.1:12000_1
XmemNozzle14:30:08.388722 [INFO] listener &{{0xc2084bb420 0xc2084bb490 <nil> <nil> <nil> false 0xc2087bbfa0} 0xc208a035c0 0xc208b8ba80 400000000 400000000 <nil> 0x46d0fe0 0xc208a03620 0xc208a03680 []} is registered on event 4 for Component xmem_127.0.0.1:12000_0
PipelineSupervisor14:30:08.388726 [INFO] Registering ErrorEncountered event on part xmem_127.0.0.1:12000_0
DcpNozzle14:30:08.388734 [INFO] listener &{{0xc2084bb420 0xc2084bb490 <nil> <nil> <nil> false 0xc2087bbfa0} 0xc208a035c0 0xc208b8ba80 400000000 400000000 <nil> 0x46d0fe0 0xc208a03620 0xc208a03680 []} is registered on event 4 for Component dcp_127.0.0.1:12000_1
PipelineSupervisor14:30:08.388738 [INFO] Registering ErrorEncountered event on part dcp_127.0.0.1:12000_1
XDCRRouter14:30:08.388750 [INFO] listener &{{0xc2084bb420 0xc2084bb490 <nil> <nil> <nil> false 0xc2087bbfa0} 0xc208a035c0 0xc208b8ba80 400000000 400000000 <nil> 0x46d0fe0 0xc208a03620 0xc208a03680 []} is registered on event 4 for Component XDCRRouter
PipelineSupervisor14:30:08.388754 [INFO] Registering ErrorEncountered event on connector XDCRRouter
XDCRFactory14:30:08.388756 [INFO] XDCR pipeline constructed
PipelineManager14:30:08.388759 [INFO] Pipeline is constructed, start it
XmemNozzle14:30:08.388785 [INFO] Xmem starting ....
XmemNozzle14:30:08.388855 [INFO] init a new batch
XmemNozzle14:30:08.391947 [INFO] ....Finish initializing....
XmemNozzle14:30:08.391960 [INFO] Xmem nozzle is started
XmemNozzle14:30:08.391972 [INFO] Xmem starting ....
XmemNozzle14:30:08.392025 [INFO] init a new batch
XmemNozzle14:30:08.393793 [INFO] ....Finish initializing....
XmemNozzle14:30:08.393801 [INFO] Xmem nozzle is started
DcpNozzle14:30:08.394001 [INFO] Dcp nozzle dcp_127.0.0.1:12000_0 starting ....
DcpNozzle14:30:08.394008 [INFO] Dcp nozzle starting ....
XmemNozzle14:30:08.394049 [ERROR] xmem_127.0.0.1:12000_1 Quit receiveResponse. err=EOF
XmemNozzle14:30:08.394052 [INFO] xmem_127.0.0.1:12000_1 receiveResponse exits
XmemNozzle14:30:08.394057 [INFO] xmem_127.0.0.1:12000_1 processData starts..........
XmemNozzle14:30:08.394065 [ERROR] xmem_127.0.0.1:12000_0 Quit receiveResponse. err=EOF
XmemNozzle14:30:08.394068 [INFO] xmem_127.0.0.1:12000_0 receiveResponse exits
XmemNozzle14:30:08.394071 [INFO] xmem_127.0.0.1:12000_0 processData starts..........
2014/11/25 14:30:08 Warning: Finalizing a bucket with active connections.
2014/11/25 14:30:08 Warning: Finalizing a bucket with active connections.
DcpNozzle14:30:08.396507 [INFO] ....Finished dcp nozzle initialization....
DcpNozzle14:30:08.396530 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=10
DcpNozzle14:30:08.396550 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=12
DcpNozzle14:30:08.396566 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=20
DcpNozzle14:30:08.396579 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=30
DcpNozzle14:30:08.396591 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=0
DcpNozzle14:30:08.396603 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=3
DcpNozzle14:30:08.396615 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=7
DcpNozzle14:30:08.396630 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=19
DcpNozzle14:30:08.396645 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=27
DcpNozzle14:30:08.396661 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=4
DcpNozzle14:30:08.396687 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=11
DcpNozzle14:30:08.396712 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=17
DcpNozzle14:30:08.396733 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=16
DcpNozzle14:30:08.396751 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=21
DcpNozzle14:30:08.396773 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=22
DcpNozzle14:30:08.396794 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=1
DcpNozzle14:30:08.396822 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=5
DcpNozzle14:30:08.396841 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=8
DcpNozzle14:30:08.396861 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=18
DcpNozzle14:30:08.396886 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=23
DcpNozzle14:30:08.396915 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=28
DcpNozzle14:30:08.396934 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=13
DcpNozzle14:30:08.396951 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=29
DcpNozzle14:30:08.396965 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=31
DcpNozzle14:30:08.396977 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=15
DcpNozzle14:30:08.396989 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=26
DcpNozzle14:30:08.397003 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=9
DcpNozzle14:30:08.397020 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=24
DcpNozzle14:30:08.397042 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=25
DcpNozzle14:30:08.397062 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=2
DcpNozzle14:30:08.397087 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=6
DcpNozzle14:30:08.397117 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=14
DcpNozzle14:30:08.397136 [INFO] Dcp nozzle is started
DcpNozzle14:30:08.397406 [INFO] Dcp nozzle dcp_127.0.0.1:12000_1 starting ....
DcpNozzle14:30:08.397417 [INFO] Dcp nozzle starting ....
14:30:08.397475 Default None UPR_STREAMREQ for vb 10 successful
14:30:08.397501 Default None UPR_STREAMREQ for vb 12 successful
14:30:08.397523 Default None UPR_STREAMREQ for vb 20 successful
14:30:08.397540 Default None UPR_STREAMREQ for vb 30 successful
14:30:08.397556 Default None UPR_STREAMREQ for vb 0 successful
14:30:08.397576 Default None UPR_STREAMREQ for vb 3 successful
14:30:08.397599 Default None UPR_STREAMREQ for vb 7 successful
14:30:08.397616 Default None UPR_STREAMREQ for vb 19 successful
14:30:08.397637 Default None UPR_STREAMREQ for vb 27 successful
14:30:08.397659 Default None UPR_STREAMREQ for vb 4 successful
14:30:08.397676 Default None UPR_STREAMREQ for vb 11 successful
DcpNozzle14:30:08.397692 [INFO] dcp_127.0.0.1:12000_0 processData starts..........
14:30:08.397760 Default None UPR_STREAMREQ for vb 17 successful
14:30:08.397784 Default None UPR_STREAMREQ for vb 16 successful
14:30:08.397882 Default None UPR_STREAMREQ for vb 21 successful
14:30:08.397945 Default None UPR_STREAMREQ for vb 22 successful
14:30:08.398020 Default None UPR_STREAMREQ for vb 1 successful
14:30:08.398095 Default None UPR_STREAMREQ for vb 5 successful
14:30:08.398191 Default None UPR_STREAMREQ for vb 8 successful
14:30:08.398282 Default None UPR_STREAMREQ for vb 18 successful
14:30:08.398359 Default None UPR_STREAMREQ for vb 23 successful
14:30:08.398423 Default None UPR_STREAMREQ for vb 28 successful
14:30:08.398497 Default None UPR_STREAMREQ for vb 13 successful
14:30:08.398632 Default None UPR_STREAMREQ for vb 29 successful
14:30:08.398719 Default None UPR_STREAMREQ for vb 31 successful
14:30:08.398734 Default None UPR_STREAMREQ for vb 15 successful
14:30:08.398811 Default None UPR_STREAMREQ for vb 26 successful
14:30:08.398881 Default None UPR_STREAMREQ for vb 9 successful
DcpNozzle14:30:08.398919 [INFO] ....Finished dcp nozzle initialization....
DcpNozzle14:30:08.398929 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=32
DcpNozzle14:30:08.398959 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=33
DcpNozzle14:30:08.398981 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=37
DcpNozzle14:30:08.399004 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=52
DcpNozzle14:30:08.399025 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=58
DcpNozzle14:30:08.399049 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=59
DcpNozzle14:30:08.399070 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=63
DcpNozzle14:30:08.399093 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=40
DcpNozzle14:30:08.399120 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=46
DcpNozzle14:30:08.399142 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=50
DcpNozzle14:30:08.399162 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=51
DcpNozzle14:30:08.399177 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=55
DcpNozzle14:30:08.399200 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=56
DcpNozzle14:30:08.399222 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=34
DcpNozzle14:30:08.399244 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=36
DcpNozzle14:30:08.399273 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=45
DcpNozzle14:30:08.399298 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=53
DcpNozzle14:30:08.399319 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=38
DcpNozzle14:30:08.399338 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=54
DcpNozzle14:30:08.399359 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=57
DcpNozzle14:30:08.399378 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=62
DcpNozzle14:30:08.399397 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=35
DcpNozzle14:30:08.399416 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=39
DcpNozzle14:30:08.399436 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=44
DcpNozzle14:30:08.399455 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=48
DcpNozzle14:30:08.399474 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=49
DcpNozzle14:30:08.399497 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=60
DcpNozzle14:30:08.399518 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=61
DcpNozzle14:30:08.399543 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=42
DcpNozzle14:30:08.399563 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=43
DcpNozzle14:30:08.399584 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=41
DcpNozzle14:30:08.399606 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=47
DcpNozzle14:30:08.399626 [INFO] Dcp nozzle is started
GenericPipeline14:30:08.399632 [INFO] All parts has been started
GenericPipeline14:30:08.399658 [INFO] -----------Pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest is started----------
14:30:08.399702 Default None UPR_STREAMREQ for vb 32 successful
14:30:08.399728 Default None UPR_STREAMREQ for vb 33 successful
14:30:08.399750 Default None UPR_STREAMREQ for vb 37 successful
14:30:08.399769 Default None UPR_STREAMREQ for vb 52 successful
14:30:08.399789 Default None UPR_STREAMREQ for vb 58 successful
14:30:08.399807 Default None UPR_STREAMREQ for vb 59 successful
14:30:08.399823 Default None UPR_STREAMREQ for vb 63 successful
14:30:08.399844 Default None UPR_STREAMREQ for vb 40 successful
14:30:08.399861 Default None UPR_STREAMREQ for vb 46 successful
14:30:08.399878 Default None UPR_STREAMREQ for vb 50 successful
14:30:08.399896 Default None UPR_STREAMREQ for vb 51 successful
DcpNozzle14:30:08.399910 [INFO] dcp_127.0.0.1:12000_1 processData starts..........
14:30:08.399961 Default None UPR_STREAMREQ for vb 24 successful
14:30:08.399981 Default None UPR_STREAMREQ for vb 25 successful
14:30:08.400002 Default None UPR_STREAMREQ for vb 2 successful
14:30:08.400019 Default None UPR_STREAMREQ for vb 6 successful
14:30:08.400037 Default None UPR_STREAMREQ for vb 14 successful
14:30:08.400655 Default None UPR_STREAMREQ for vb 55 successful
14:30:08.400679 Default None UPR_STREAMREQ for vb 56 successful
14:30:08.400698 Default None UPR_STREAMREQ for vb 34 successful
14:30:08.400716 Default None UPR_STREAMREQ for vb 36 successful
14:30:08.400734 Default None UPR_STREAMREQ for vb 45 successful
14:30:08.400751 Default None UPR_STREAMREQ for vb 53 successful
14:30:08.400768 Default None UPR_STREAMREQ for vb 38 successful
14:30:08.400787 Default None UPR_STREAMREQ for vb 54 successful
14:30:08.400805 Default None UPR_STREAMREQ for vb 57 successful
14:30:08.400823 Default None UPR_STREAMREQ for vb 62 successful
14:30:08.400845 Default None UPR_STREAMREQ for vb 35 successful
14:30:08.400866 Default None UPR_STREAMREQ for vb 39 successful
14:30:08.401123 Default None UPR_STREAMREQ for vb 44 successful
14:30:08.401146 Default None UPR_STREAMREQ for vb 48 successful
14:30:08.401166 Default None UPR_STREAMREQ for vb 49 successful
14:30:08.401184 Default None UPR_STREAMREQ for vb 60 successful
14:30:08.401330 Default None UPR_STREAMREQ for vb 61 successful
14:30:08.401353 Default None UPR_STREAMREQ for vb 42 successful
14:30:08.401454 Default None UPR_STREAMREQ for vb 43 successful
14:30:08.401476 Default None UPR_STREAMREQ for vb 41 successful
14:30:08.401588 Default None UPR_STREAMREQ for vb 47 successful
XmemNozzle14:30:08.413573 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.413591 [INFO] init a new batch
XmemNozzle14:30:08.413597 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
XmemNozzle14:30:08.413624 [INFO] Send batch count=500
XmemNozzle14:30:08.413734 [ERROR] xmem_127.0.0.1:12000_0 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.413746 [INFO] xmem_127.0.0.1:12000_0 connection is broken, try to repair...
XmemNozzle14:30:08.413765 [INFO] xmem_127.0.0.1:12000_0 - The connection is repaired
XmemNozzle14:30:08.414584 [INFO] xmem_127.0.0.1:12000_0 - The unresponded items are resent
XmemNozzle14:30:08.414614 [ERROR] xmem_127.0.0.1:12000_0 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.414622 [INFO] xmem_127.0.0.1:12000_0 connection is broken, try to repair...
XmemNozzle14:30:08.414638 [INFO] xmem_127.0.0.1:12000_0 - The connection is repaired
XmemNozzle14:30:08.415461 [INFO] xmem_127.0.0.1:12000_0 - The unresponded items are resent
XmemNozzle14:30:08.415490 [ERROR] xmem_127.0.0.1:12000_0 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.415498 [INFO] xmem_127.0.0.1:12000_0 connection is broken, try to repair...
XmemNozzle14:30:08.415513 [INFO] xmem_127.0.0.1:12000_0 - The connection is repaired
XmemNozzle14:30:08.416339 [INFO] xmem_127.0.0.1:12000_0 - The unresponded items are resent
XmemNozzle14:30:08.416378 [ERROR] xmem_127.0.0.1:12000_0 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.416387 [INFO] xmem_127.0.0.1:12000_0 connection is broken, try to repair...
XmemNozzle14:30:08.416404 [INFO] xmem_127.0.0.1:12000_0 - The connection is repaired
XmemNozzle14:30:08.417222 [INFO] xmem_127.0.0.1:12000_0 - The unresponded items are resent
XmemNozzle14:30:08.417257 [ERROR] xmem_127.0.0.1:12000_0 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.417265 [INFO] xmem_127.0.0.1:12000_0 connection is broken, try to repair...
XmemNozzle14:30:08.417282 [INFO] xmem_127.0.0.1:12000_0 - The connection is repaired
XmemNozzle14:30:08.418129 [INFO] xmem_127.0.0.1:12000_0 - The unresponded items are resent
XmemNozzle14:30:08.418161 [ERROR] xmem_127.0.0.1:12000_0 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.418169 [INFO] xmem_127.0.0.1:12000_0 connection is broken, try to repair...
XmemNozzle14:30:08.418184 [INFO] xmem_127.0.0.1:12000_0 - The connection is repaired
XmemNozzle14:30:08.419002 [INFO] xmem_127.0.0.1:12000_0 - The unresponded items are resent
XmemNozzle14:30:08.419034 [ERROR] xmem_127.0.0.1:12000_0 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.419042 [INFO] xmem_127.0.0.1:12000_0 connection is broken, try to repair...
XmemNozzle14:30:08.419056 [INFO] xmem_127.0.0.1:12000_0 - The connection is repaired
XmemNozzle14:30:08.419887 [INFO] xmem_127.0.0.1:12000_0 - The unresponded items are resent
XmemNozzle14:30:08.419915 [ERROR] xmem_127.0.0.1:12000_0 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.419923 [INFO] xmem_127.0.0.1:12000_0 connection is broken, try to repair...
XmemNozzle14:30:08.419939 [INFO] xmem_127.0.0.1:12000_0 - The connection is repaired
XmemNozzle14:30:08.419995 [ERROR] xmem_127.0.0.1:12000_0 sendSingle: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.429414 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.429427 [INFO] init a new batch
XmemNozzle14:30:08.429431 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
XmemNozzle14:30:08.435810 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:30:08.435833 [INFO] init a new batch
XmemNozzle14:30:08.435843 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 1 batches ready
XmemNozzle14:30:08.435850 [INFO] Send batch count=500
XmemNozzle14:30:08.435967 [ERROR] xmem_127.0.0.1:12000_1 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.435975 [INFO] xmem_127.0.0.1:12000_1 connection is broken, try to repair...
XmemNozzle14:30:08.436000 [INFO] xmem_127.0.0.1:12000_1 - The connection is repaired
XmemNozzle14:30:08.436540 [INFO] xmem_127.0.0.1:12000_1 - The unresponded items are resent
XmemNozzle14:30:08.436563 [ERROR] xmem_127.0.0.1:12000_1 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.436569 [INFO] xmem_127.0.0.1:12000_1 connection is broken, try to repair...
XmemNozzle14:30:08.436580 [INFO] xmem_127.0.0.1:12000_1 - The connection is repaired
XmemNozzle14:30:08.437498 [INFO] xmem_127.0.0.1:12000_1 - The unresponded items are resent
XmemNozzle14:30:08.437527 [ERROR] xmem_127.0.0.1:12000_1 batchSend: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:08.437535 [INFO] xmem_127.0.0.1:12000_1 connection is broken, try to repair...
XmemNozzle14:30:08.437550 [INFO] xmem_127.0.0.1:12000_1 - The connection is repaired
XmemNozzle14:30:08.437615 [ERROR] xmem_127.0.0.1:12000_1 sendSingle: transmit error: write tcp 127.0.0.1:12000: connection reset by peer
XmemNozzle14:30:08.445744 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.445757 [INFO] init a new batch
XmemNozzle14:30:08.445761 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 2 batches ready
XmemNozzle14:30:08.451096 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:30:08.451110 [INFO] init a new batch
XmemNozzle14:30:08.451116 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 1 batches ready
XmemNozzle14:30:08.462003 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.462018 [INFO] init a new batch
XmemNozzle14:30:08.462026 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 3 batches ready
XmemNozzle14:30:08.465912 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:30:08.465921 [INFO] init a new batch
XmemNozzle14:30:08.465927 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 2 batches ready
XmemNozzle14:30:08.473803 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.473819 [INFO] init a new batch
XmemNozzle14:30:08.473825 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 4 batches ready
XmemNozzle14:30:08.476426 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:30:08.476434 [INFO] init a new batch
XmemNozzle14:30:08.476438 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 3 batches ready
XmemNozzle14:30:08.492647 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.492666 [INFO] init a new batch
XmemNozzle14:30:08.492677 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 5 batches ready
XmemNozzle14:30:08.492857 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:30:08.492866 [INFO] init a new batch
XmemNozzle14:30:08.492872 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 4 batches ready
XmemNozzle14:30:08.509287 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.509303 [INFO] init a new batch
XmemNozzle14:30:08.509310 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 6 batches ready
XmemNozzle14:30:08.509490 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:30:08.509498 [INFO] init a new batch
XmemNozzle14:30:08.509504 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 5 batches ready
XmemNozzle14:30:08.522248 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.522268 [INFO] init a new batch
XmemNozzle14:30:08.522275 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 7 batches ready
XmemNozzle14:30:08.522464 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:30:08.522473 [INFO] init a new batch
XmemNozzle14:30:08.522479 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 6 batches ready
XmemNozzle14:30:08.537511 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.537536 [INFO] init a new batch
XmemNozzle14:30:08.537542 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 8 batches ready
XmemNozzle14:30:08.537740 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:30:08.537749 [INFO] init a new batch
XmemNozzle14:30:08.537755 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 7 batches ready
XmemNozzle14:30:08.549348 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:30:08.549363 [INFO] init a new batch
XmemNozzle14:30:08.549370 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 9 batches ready
XmemNozzle14:30:08.549465 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=500) ready queue
XmemNozzle14:30:08.549469 [INFO] init a new batch
XmemNozzle14:30:08.549472 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 8 batches ready
XmemNozzle14:30:08.894274 [ERROR] xmem_127.0.0.1:12000_1 sendSingle: transmit error: write tcp 127.0.0.1:12000: broken pipe
XmemNozzle14:30:09.294250 [INFO] xmem_127.0.0.1:12000_0 batch expired, moving it to ready queue
XmemNozzle14:30:09.294279 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=38) ready queue
XmemNozzle14:30:09.294287 [INFO] init a new batch
XmemNozzle14:30:09.294293 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 10 batches ready
XmemNozzle14:30:21.294472 [ERROR] xmem_127.0.0.1:12000_0 sendSingle: transmit error: write tcp 127.0.0.1:12000: broken pipe

****Hangs******************

Attaching a screenshot of the dcp drain queue for the source 'default' bucket. The default bucket had 10000 keys.

 Comments   
Comment by Aruna Piravi [ 25/Nov/14 ]
Workaround: restart the xdcr rest server, then drop and recreate the replication; replication then proceeds.
Comment by Xiaomei Zhang [ 25/Nov/14 ]
Aruna,

Is this the same as MB-12771?

Thanks,
-Xiaomei
Comment by Aruna Piravi [ 25/Nov/14 ]
No Xiaomei, I've never seen the "sendSingle: transmit error: write tcp IP:12000: broken pipe" error message while recreating a replication on recreated target buckets.

In my opinion, this bug is due to a transmit error, but the pipeline does get created, as you can see above. With MB-12771, however, even the pipeline doesn't get created (as you can see in its log).
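
A minimal Go sketch of the repair-and-resend pattern visible in the log above, with a bounded number of repair attempts. It is not taken from the goxdcr source; every name here (sendBatch, sendWithRepair, the dial callback, maxRepairs) is a hypothetical stand-in. The point it illustrates is that once the cap is hit the error is returned to the caller, which can then raise ErrorEncountered to the pipeline supervisor instead of the nozzle stalling.

package xmemsketch

import (
    "fmt"
    "net"
    "time"
)

// sendBatch writes one serialized batch to the target connection; it stands in
// for the real xmem batch serialization, which is not shown here.
func sendBatch(conn net.Conn, batch []byte) error {
    _, err := conn.Write(batch)
    return err
}

// sendWithRepair sends a batch, repairing a broken connection at most
// maxRepairs times before giving up. dial is the caller-supplied reconnect
// function. Returning an error (rather than retrying indefinitely) lets the
// caller escalate to the pipeline supervisor.
func sendWithRepair(conn net.Conn, dial func() (net.Conn, error), batch []byte, maxRepairs int) (net.Conn, error) {
    for attempt := 0; ; attempt++ {
        err := sendBatch(conn, batch)
        if err == nil {
            return conn, nil
        }
        if attempt >= maxRepairs {
            return conn, fmt.Errorf("send failed after %d repair attempts: %v", attempt, err)
        }
        // Connection looks broken: close, back off briefly, redial, resend.
        conn.Close()
        time.Sleep(time.Duration(attempt+1) * 100 * time.Millisecond)
        newConn, dialErr := dial()
        if dialErr != nil {
            return conn, fmt.Errorf("repair failed: %v (after send error: %v)", dialErr, err)
        }
        conn = newConn
    }
}

In the run above, xmem_127.0.0.1:12000_0 repairs and resends the same batch repeatedly and then the pipeline goes quiet after the final sendSingle error, which is exactly the failure mode a cap-plus-escalation like this is meant to avoid.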




[MB-12771] Go-XDCR: Pipeline broken when target bucket is recreated (replication also recreated) Created: 25/Nov/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-11-25 at 2.49.55 PM.png    
Triage: Untriaged
Epic Link: XDCR next release
Is this a Regression?: No

 Description   
Consistently reproducible.

Steps to reproduce
-------------------------
1. Replicate default -> dest (10000 keys transferred)
2. Delete (not flush) the dest bucket
3. Recreate the bucket with the same name
4. Recreate the replication (a scripted version of steps 2-3 is sketched after the log below)

ReplicationManager14:44:59.977539 [INFO] Creating replication - sourceCluterUUID=127.0.0.1:9000, sourceBucket=default, targetClusterUUID=localhost:9000, targetBucket=dest, filterName=, settings=map[batch_size:2048 checkpoint_interval:1800 source_nozzle_per_node:2 http_connection:20 target_nozzle_per_node:2 log_level:Info failure_restart_interval:30 replication_type:xmem active:true max_expected_replication_lag:1000 filter_expression: optimistic_replication_threshold:256 batch_count:500 timeout_percentage_cap:80], createReplSpec=true
ReplicationManager14:44:59.977565 [INFO] Creating replication spec - sourceCluterUUID=127.0.0.1:9000, sourceBucket=default, targetClusterUUID=localhost:9000, targetBucket=dest, filterName=, settings=map[source_nozzle_per_node:2 batch_size:2048 checkpoint_interval:1800 log_level:Info failure_restart_interval:30 http_connection:20 target_nozzle_per_node:2 max_expected_replication_lag:1000 replication_type:xmem active:true batch_count:500 timeout_percentage_cap:80 filter_expression: optimistic_replication_threshold:256]
ReplicationManager14:44:59.982938 [INFO] Pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest is created and started
AdminPort14:44:59.982948 [INFO] forwardReplicationRequest
PipelineManager14:44:59.982986 [INFO] Starting the pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest
XDCRFactory14:44:59.983228 [INFO] kvHosts=[127.0.0.1]
cluster=127.0.0.1:9000
2014/11/25 14:44:59 Warning: Finalizing a bucket with active connections.
2014/11/25 14:44:59 Warning: Finalizing a bucket with active connections.
2014/11/25 14:44:59 Warning: Finalizing a bucket with active connections.
ServerList=[127.0.0.1:12000]
ServerVBMap=map[127.0.0.1:12000:[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]]
XDCRFactory14:45:00.000807 [INFO] found kv
DcpNozzle14:45:00.014456 [INFO] Constructed Dcp nozzle dcp_127.0.0.1:12000_0 with vblist [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
DcpNozzle14:45:00.026886 [INFO] Constructed Dcp nozzle dcp_127.0.0.1:12000_1 with vblist [32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]
XDCRFactory14:45:00.026898 [INFO] Constructed 2 source nozzles
cluster=localhost:9000
2014/11/25 14:45:00 Warning: Finalizing a bucket with active connections.
2014/11/25 14:45:00 Warning: Finalizing a bucket with active connections.
ServerList=[127.0.0.1:12000]
ServerVBMap=map[127.0.0.1:12000:[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]]
XDCRFactory14:45:00.049272 [INFO] Constructed 2 outgoing nozzles
XDCRRouter14:45:00.049285 [INFO] Router created with 2 downstream parts
XDCRFactory14:45:00.049289 [INFO] Constructed router
PipelineSupervisor14:45:00.049296 [INFO] Attaching pipeline supervior service
DcpNozzle14:45:00.049321 [INFO] listener &{{0xc2083a1f10 0xc2083a1f80 <nil> <nil> <nil> false 0xc208115660} 0xc2088cdd40 0xc20867d720 400000000 400000000 <nil> 0x46d0fe0 0xc2088cdda0 0xc2088cde00 []} is registered on event 4 for Component dcp_127.0.0.1:12000_0
PipelineSupervisor14:45:00.049325 [INFO] Registering ErrorEncountered event on part dcp_127.0.0.1:12000_0
XmemNozzle14:45:00.049335 [INFO] listener &{{0xc2083a1f10 0xc2083a1f80 <nil> <nil> <nil> false 0xc208115660} 0xc2088cdd40 0xc20867d720 400000000 400000000 <nil> 0x46d0fe0 0xc2088cdda0 0xc2088cde00 []} is registered on event 4 for Component xmem_127.0.0.1:12000_0
PipelineSupervisor14:45:00.049338 [INFO] Registering ErrorEncountered event on part xmem_127.0.0.1:12000_0
XmemNozzle14:45:00.049348 [INFO] listener &{{0xc2083a1f10 0xc2083a1f80 <nil> <nil> <nil> false 0xc208115660} 0xc2088cdd40 0xc20867d720 400000000 400000000 <nil> 0x46d0fe0 0xc2088cdda0 0xc2088cde00 []} is registered on event 4 for Component xmem_127.0.0.1:12000_1
PipelineSupervisor14:45:00.049351 [INFO] Registering ErrorEncountered event on part xmem_127.0.0.1:12000_1
DcpNozzle14:45:00.049360 [INFO] listener &{{0xc2083a1f10 0xc2083a1f80 <nil> <nil> <nil> false 0xc208115660} 0xc2088cdd40 0xc20867d720 400000000 400000000 <nil> 0x46d0fe0 0xc2088cdda0 0xc2088cde00 []} is registered on event 4 for Component dcp_127.0.0.1:12000_1
PipelineSupervisor14:45:00.049363 [INFO] Registering ErrorEncountered event on part dcp_127.0.0.1:12000_1
XDCRRouter14:45:00.049375 [INFO] listener &{{0xc2083a1f10 0xc2083a1f80 <nil> <nil> <nil> false 0xc208115660} 0xc2088cdd40 0xc20867d720 400000000 400000000 <nil> 0x46d0fe0 0xc2088cdda0 0xc2088cde00 []} is registered on event 4 for Component XDCRRouter
PipelineSupervisor14:45:00.049379 [INFO] Registering ErrorEncountered event on connector XDCRRouter
XDCRFactory14:45:00.049381 [INFO] XDCR pipeline constructed
PipelineManager14:45:00.049384 [INFO] Pipeline is constructed, start it
XmemNozzle14:45:00.049411 [INFO] Xmem starting ....
XmemNozzle14:45:00.049962 [INFO] init a new batch
XmemNozzle14:45:00.051929 [INFO] ....Finish initializing....
XmemNozzle14:45:00.051936 [INFO] Xmem nozzle is started
XmemNozzle14:45:00.051944 [INFO] Xmem starting ....
XmemNozzle14:45:00.051993 [INFO] init a new batch
XmemNozzle14:45:00.055547 [INFO] ....Finish initializing....
XmemNozzle14:45:00.055556 [INFO] Xmem nozzle is started
DcpNozzle14:45:00.055754 [INFO] Dcp nozzle dcp_127.0.0.1:12000_0 starting ....
DcpNozzle14:45:00.055760 [INFO] Dcp nozzle starting ....
XmemNozzle14:45:00.055810 [INFO] xmem_127.0.0.1:12000_0 processData starts..........
XmemNozzle14:45:00.055820 [INFO] xmem_127.0.0.1:12000_1 processData starts..........
2014/11/25 14:45:00 Warning: Finalizing a bucket with active connections.
2014/11/25 14:45:00 Warning: Finalizing a bucket with active connections.
DcpNozzle14:45:00.058104 [INFO] ....Finished dcp nozzle initialization....
DcpNozzle14:45:00.058122 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=13
DcpNozzle14:45:00.058150 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=17
DcpNozzle14:45:00.058171 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=24
DcpNozzle14:45:00.058192 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=0
DcpNozzle14:45:00.058211 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=7
DcpNozzle14:45:00.058229 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=30
DcpNozzle14:45:00.058251 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=20
DcpNozzle14:45:00.058270 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=25
DcpNozzle14:45:00.058292 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=21
DcpNozzle14:45:00.058321 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=27
DcpNozzle14:45:00.058346 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=4
DcpNozzle14:45:00.058365 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=14
DcpNozzle14:45:00.058387 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=15
DcpNozzle14:45:00.058405 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=18
DcpNozzle14:45:00.058427 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=29
DcpNozzle14:45:00.058450 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=1
DcpNozzle14:45:00.058472 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=12
DcpNozzle14:45:00.058490 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=9
DcpNozzle14:45:00.058505 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=22
DcpNozzle14:45:00.058522 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=26
DcpNozzle14:45:00.058544 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=6
DcpNozzle14:45:00.058564 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=8
DcpNozzle14:45:00.058582 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=28
DcpNozzle14:45:00.058602 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=31
DcpNozzle14:45:00.058621 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=23
DcpNozzle14:45:00.058640 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=3
DcpNozzle14:45:00.058655 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=19
DcpNozzle14:45:00.058674 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=10
DcpNozzle14:45:00.058692 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=11
DcpNozzle14:45:00.058710 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=16
DcpNozzle14:45:00.058726 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=2
DcpNozzle14:45:00.058744 [INFO] dcp_127.0.0.1:12000_0 starting vb stream for vb=5
DcpNozzle14:45:00.058761 [INFO] Dcp nozzle is started
DcpNozzle14:45:00.059057 [INFO] Dcp nozzle dcp_127.0.0.1:12000_1 starting ....
DcpNozzle14:45:00.059064 [INFO] Dcp nozzle starting ....
14:45:00.059117 Default None UPR_STREAMREQ for vb 13 successful
14:45:00.059136 Default None UPR_STREAMREQ for vb 17 successful
14:45:00.059153 Default None UPR_STREAMREQ for vb 24 successful
14:45:00.059169 Default None UPR_STREAMREQ for vb 0 successful
14:45:00.059186 Default None UPR_STREAMREQ for vb 7 successful
14:45:00.059203 Default None UPR_STREAMREQ for vb 30 successful
14:45:00.059219 Default None UPR_STREAMREQ for vb 20 successful
14:45:00.059236 Default None UPR_STREAMREQ for vb 25 successful
14:45:00.059251 Default None UPR_STREAMREQ for vb 21 successful
DcpNozzle14:45:00.059282 [INFO] dcp_127.0.0.1:12000_0 processData starts..........
14:45:00.059325 Default None UPR_STREAMREQ for vb 27 successful
14:45:00.059342 Default None UPR_STREAMREQ for vb 4 successful
14:45:00.059429 Default None UPR_STREAMREQ for vb 14 successful
14:45:00.059514 Default None UPR_STREAMREQ for vb 15 successful
14:45:00.059588 Default None UPR_STREAMREQ for vb 18 successful
14:45:00.059667 Default None UPR_STREAMREQ for vb 29 successful
14:45:00.059764 Default None UPR_STREAMREQ for vb 1 successful
14:45:00.059844 Default None UPR_STREAMREQ for vb 12 successful
14:45:00.059907 Default None UPR_STREAMREQ for vb 9 successful
14:45:00.059977 Default None UPR_STREAMREQ for vb 22 successful
14:45:00.060059 Default None UPR_STREAMREQ for vb 26 successful
14:45:00.060131 Default None UPR_STREAMREQ for vb 6 successful
14:45:00.060257 Default None UPR_STREAMREQ for vb 8 successful
14:45:00.060316 Default None UPR_STREAMREQ for vb 28 successful
14:45:00.060349 Default None UPR_STREAMREQ for vb 31 successful
14:45:00.060435 Default None UPR_STREAMREQ for vb 23 successful
14:45:00.060501 Default None UPR_STREAMREQ for vb 3 successful
DcpNozzle14:45:00.060528 [INFO] ....Finished dcp nozzle initialization....
DcpNozzle14:45:00.060547 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=38
DcpNozzle14:45:00.060571 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=53
DcpNozzle14:45:00.060599 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=60
DcpNozzle14:45:00.060621 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=62
DcpNozzle14:45:00.060641 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=37
DcpNozzle14:45:00.060659 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=39
DcpNozzle14:45:00.060680 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=41
DcpNozzle14:45:00.060695 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=46
DcpNozzle14:45:00.060713 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=58
DcpNozzle14:45:00.060736 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=35
DcpNozzle14:45:00.060755 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=45
DcpNozzle14:45:00.060773 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=54
DcpNozzle14:45:00.060793 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=42
DcpNozzle14:45:00.060811 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=43
DcpNozzle14:45:00.060838 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=49
DcpNozzle14:45:00.060862 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=33
DcpNozzle14:45:00.060884 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=36
DcpNozzle14:45:00.060904 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=40
DcpNozzle14:45:00.060923 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=47
DcpNozzle14:45:00.060942 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=59
DcpNozzle14:45:00.060963 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=61
DcpNozzle14:45:00.060981 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=63
DcpNozzle14:45:00.061000 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=32
DcpNozzle14:45:00.061021 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=44
DcpNozzle14:45:00.061039 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=50
DcpNozzle14:45:00.061062 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=52
DcpNozzle14:45:00.061081 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=34
DcpNozzle14:45:00.061101 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=48
DcpNozzle14:45:00.061125 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=51
DcpNozzle14:45:00.061144 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=55
DcpNozzle14:45:00.061162 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=56
DcpNozzle14:45:00.061181 [INFO] dcp_127.0.0.1:12000_1 starting vb stream for vb=57
DcpNozzle14:45:00.061207 [INFO] Dcp nozzle is started
GenericPipeline14:45:00.061213 [INFO] All parts has been started
GenericPipeline14:45:00.061305 [INFO] -----------Pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest is started----------
14:45:00.061350 Default None UPR_STREAMREQ for vb 38 successful
14:45:00.061371 Default None UPR_STREAMREQ for vb 53 successful
14:45:00.061390 Default None UPR_STREAMREQ for vb 60 successful
14:45:00.061409 Default None UPR_STREAMREQ for vb 62 successful
14:45:00.061427 Default None UPR_STREAMREQ for vb 37 successful
14:45:00.061444 Default None UPR_STREAMREQ for vb 39 successful
14:45:00.061462 Default None UPR_STREAMREQ for vb 41 successful
14:45:00.061483 Default None UPR_STREAMREQ for vb 46 successful
14:45:00.061500 Default None UPR_STREAMREQ for vb 58 successful
14:45:00.061517 Default None UPR_STREAMREQ for vb 35 successful
14:45:00.061536 Default None UPR_STREAMREQ for vb 45 successful
DcpNozzle14:45:00.061551 [INFO] dcp_127.0.0.1:12000_1 processData starts..........
14:45:00.061597 Default None UPR_STREAMREQ for vb 54 successful
14:45:00.061631 Default None UPR_STREAMREQ for vb 19 successful
14:45:00.061654 Default None UPR_STREAMREQ for vb 10 successful
14:45:00.061671 Default None UPR_STREAMREQ for vb 11 successful
14:45:00.061691 Default None UPR_STREAMREQ for vb 16 successful
14:45:00.061710 Default None UPR_STREAMREQ for vb 2 successful
14:45:00.061730 Default None UPR_STREAMREQ for vb 5 successful
14:45:00.062630 Default None UPR_STREAMREQ for vb 42 successful
14:45:00.062662 Default None UPR_STREAMREQ for vb 43 successful
14:45:00.062681 Default None UPR_STREAMREQ for vb 49 successful
14:45:00.062700 Default None UPR_STREAMREQ for vb 33 successful
14:45:00.062720 Default None UPR_STREAMREQ for vb 36 successful
14:45:00.062738 Default None UPR_STREAMREQ for vb 40 successful
14:45:00.062762 Default None UPR_STREAMREQ for vb 47 successful
14:45:00.062784 Default None UPR_STREAMREQ for vb 59 successful
14:45:00.062802 Default None UPR_STREAMREQ for vb 61 successful
14:45:00.062820 Default None UPR_STREAMREQ for vb 63 successful
14:45:00.062841 Default None UPR_STREAMREQ for vb 32 successful
14:45:00.062871 Default None UPR_STREAMREQ for vb 44 successful
14:45:00.062896 Default None UPR_STREAMREQ for vb 50 successful
14:45:00.062922 Default None UPR_STREAMREQ for vb 52 successful
14:45:00.062941 Default None UPR_STREAMREQ for vb 34 successful
14:45:00.062960 Default None UPR_STREAMREQ for vb 48 successful
14:45:00.062991 Default None UPR_STREAMREQ for vb 51 successful
14:45:00.063016 Default None UPR_STREAMREQ for vb 55 successful
14:45:00.063040 Default None UPR_STREAMREQ for vb 56 successful
14:45:00.063064 Default None UPR_STREAMREQ for vb 57 successful
XmemNozzle14:45:00.078227 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=500) ready queue
XmemNozzle14:45:00.078242 [INFO] init a new batch
XmemNozzle14:45:00.078248 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
XmemNozzle14:45:00.078263 [INFO] Send batch count=500
XmemNozzle14:45:00.086123 [ERROR] Raise error condition MCResponse status=0x08, opcode=0xa2, opaque=0, msg: Unknown error code
ReplicationManager14:45:00.086227 [INFO] Pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest reported failure. The following parts are broken: map[xmem_127.0.0.1:12000_0:MCResponse status=0x08, opcode=0xa2, opaque=0, msg: Unknown error code]
ReplicationManager14:45:00.086236 [INFO] Pausing replication xdcr_127.0.0.1:9000_default_localhost:9000_dest
PipelineManager14:45:00.086241 [INFO] Try to stop the pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest
GenericPipeline14:45:00.086246 [INFO] stoppping pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest
GenericPipeline14:45:00.086253 [INFO] Trying to stop part dcp_127.0.0.1:12000_0
DcpNozzle14:45:00.086268 [INFO] Stopping DcpNozzle dcp_127.0.0.1:12000_0
2014/11/25 14:45:00 been asked to close ...
DcpNozzle14:45:00.086297 [INFO] server is stopped per request sent at 2014-11-25 14:45:00.086273143 -0800 PST
DcpNozzle14:45:00.086310 [INFO] dcp_127.0.0.1:12000_0 processData exits
DcpNozzle14:45:00.086317 [INFO] DcpNozzle dcp_127.0.0.1:12000_0 is stopped
GenericPipeline14:45:00.086321 [INFO] part dcp_127.0.0.1:12000_0 is stopped
GenericPipeline14:45:00.086327 [INFO] Trying to stop part xmem_127.0.0.1:12000_1
GenericPipeline14:45:00.086335 [INFO] Trying to stop part xmem_127.0.0.1:12000_0
GenericPipeline14:45:00.086342 [INFO] Trying to stop part dcp_127.0.0.1:12000_1
DcpNozzle14:45:00.086352 [INFO] Stopping DcpNozzle dcp_127.0.0.1:12000_1
2014/11/25 14:45:00 been asked to close ...
DcpNozzle14:45:00.086369 [INFO] server is stopped per request sent at 2014-11-25 14:45:00.086356242 -0800 PST
DcpNozzle14:45:00.086376 [INFO] dcp_127.0.0.1:12000_1 processData exits
DcpNozzle14:45:00.086381 [INFO] DcpNozzle dcp_127.0.0.1:12000_1 is stopped
GenericPipeline14:45:00.086385 [INFO] part dcp_127.0.0.1:12000_1 is stopped
GenericPipeline14:45:00.086390 [INFO] Trying to stop part xmem_127.0.0.1:12000_0
XmemNozzle14:45:00.086397 [INFO] Stop XmemNozzle xmem_127.0.0.1:12000_0
XmemNozzle14:45:00.086404 [INFO] xmem_127.0.0.1:12000_0 move the batch (count=1) ready queue
XmemNozzle14:45:00.086408 [INFO] init a new batch
XmemNozzle14:45:00.086414 [INFO] xmem_127.0.0.1:12000_0 End moving batch, 1 batches ready
14:45:00.124197 Default None Exiting send command go routine ...
14:45:00.144912 Default None Exiting send command go routine ...
XmemNozzle14:45:00.894020 [INFO] xmem_127.0.0.1:12000_1 batch expired, moving it to ready queue
XmemNozzle14:45:00.894036 [INFO] xmem_127.0.0.1:12000_1 move the batch (count=56) ready queue
XmemNozzle14:45:00.894041 [INFO] init a new batch
XmemNozzle14:45:00.894045 [INFO] xmem_127.0.0.1:12000_1 End moving batch, 1 batches ready
XmemNozzle14:45:00.894570 [INFO] Send batch count=56
XmemNozzle14:45:00.972952 [ERROR] Raise error condition MCResponse status=0x08, opcode=0xa2, opaque=0, msg: Unknown error code
ReplicationManager14:45:00.973020 [INFO] Pipeline xdcr_127.0.0.1:9000_default_localhost:9000_dest reported failure. The following parts are broken: map[xmem_127.0.0.1:12000_1:MCResponse status=0x08, opcode=0xa2, opaque=0, msg: Unknown error code]

Screenshot of dcp queues (see last spike) attached.




[MB-11989] XDCR next release Created: 18/Aug/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: feature-backlog
Fix Version/s: None
Security Level: Public

Type: Epic Priority: Major
Reporter: Xiaomei Zhang Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Epic Name: XDCR next release
Epic Status: To Do




[MB-12708] Go-XDCR: Http server crashes on starting (gometa 'Repository' structure has changed) Created: 18/Nov/14  Updated: 25/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: sherlock
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Yu Sui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Epic Link: XDCR next release
Is this a Regression?: No

 Description   
Pulled the latest code (forestdb, gometa, goforestdb, goxdcr).

Please refer to commit https://github.com/couchbase/gometa/commit/5df9fc0d5b602151a707c9332aaa3ded3eb9f05a in gometa, which changed the "Repository" struct to accommodate forestdb API changes.

As a result, the HTTP server crashes at line 103 of metadata_service.go, in ActiveReplicationSpecs().

Arunas-MacBook-Pro:bin apiravi$ ./xdcr localhost:9000
starting gometa service. this will take a couple seconds
started gometa service.
MetadataService17:09:38.052419 [INFO] Metdata service started with host=127.0.0.1:5003
PipelineManager17:09:38.052559 [INFO] Pipeline Manager is constucted
ReplicationManager17:09:38.052564 [INFO] Replication manager init - starting existing replications
fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x118 pc=0x7fff8a9cee3d]

runtime stack:
runtime: unexpected return pc for runtime.sigpanic called from 0x7fff8a9cee3d
runtime.throw(0x46c5a76)
/usr/local/go/src/pkg/runtime/panic.c:520 +0x69
runtime: unexpected return pc for runtime.sigpanic called from 0x7fff8a9cee3d
runtime.sigpanic()
/usr/local/go/src/pkg/runtime/os_darwin.c:439 +0x3d

goroutine 16 [syscall]:
runtime.cgocall(0x4001740, 0x49de8c8)
/usr/local/go/src/pkg/runtime/cgocall.c:143 +0xe5 fp=0x49de8b0 sp=0x49de868
github.com/couchbaselabs/goforestdb._Cfunc_fdb_iterator_init(0x4f00010, 0xc20803c128, 0xc208001160, 0x6, 0xc208001168, 0x6, 0x4b00000, 0xc20801290c)
github.com/couchbaselabs/goforestdb/_obj/_cgo_defun.c:215 +0x36 fp=0x49de8c8 sp=0x49de8b0
github.com/couchbaselabs/goforestdb.(*Database).IteratorInit(0xc20803c120, 0xc208001160, 0x6, 0x8, 0xc208001168, 0x6, 0x8, 0x4001600, 0xc208001150, 0x0, ...)
/Users/apiravi/sherlock/godeps/src/github.com/couchbaselabs/goforestdb/iterator.go:99 +0x117 fp=0x49de950 sp=0x49de8c8
github.com/couchbase/gometa/repository.(*Repository).NewIterator(0xc208001150, 0x447abf0, 0x6, 0x447ac10, 0x6, 0xc20803f9c0, 0x0, 0x0)
/Users/apiravi/sherlock/godeps/src/github.com/couchbase/gometa/repository/repo.go:156 +0x28c fp=0x49dea80 sp=0x49de950
github.com/couchbase/goxdcr/services.(*MetadataSvc).ActiveReplicationSpecs(0xc20800f3c0, 0x450e390, 0x0, 0x0)
/Users/apiravi/sherlock/goproj/src/github.com/couchbase/goxdcr/services/metadata_service.go:103 +0x9f fp=0x49deb48 sp=0x49dea80
github.com/couchbase/goxdcr/replication_manager.(*replicationManager).startReplications(0x46e6000)
/Users/apiravi/sherlock/goproj/src/github.com/couchbase/goxdcr/replication_manager/replication_manager.go:295 +0x77 fp=0x49dec08 sp=0x49deb48
github.com/couchbase/goxdcr/replication_manager.(*replicationManager).init(0x46e6000, 0x4b12dd8, 0xc20800f3c0, 0x4b12e20, 0x46e4db0, 0x4b12e70, 0x46e4db0, 0x4b12ec0, 0x46e4db0)
/Users/apiravi/sherlock/goproj/src/github.com/couchbase/goxdcr/replication_manager/replication_manager.go:64 +0x183 fp=0x49dec98 sp=0x49dec08
github.com/couchbase/goxdcr/replication_manager.func·001()
/Users/apiravi/sherlock/goproj/src/github.com/couchbase/goxdcr/replication_manager/replication_manager.go:48 +0x70 fp=0x49dece8 sp=0x49dec98
sync.(*Once).Do(0x46e6040, 0x49ded18)
/usr/local/go/src/pkg/sync/once.go:40 +0x9f fp=0x49ded00 sp=0x49dece8
github.com/couchbase/goxdcr/replication_manager.Initialize(0x4b12dd8, 0xc20800f3c0, 0x4b12e20, 0x46e4db0, 0x4b12e70, 0x46e4db0, 0x4b12ec0, 0x46e4db0)
/Users/apiravi/sherlock/goproj/src/github.com/couchbase/goxdcr/replication_manager/replication_manager.go:49 +0x6d fp=0x49ded48 sp=0x49ded00
main.main()
/Users/apiravi/sherlock/goproj/src/github.com/couchbase/goxdcr/main/main.go:76 +0x645 fp=0x49def50 sp=0x49ded48
runtime.main()
/usr/local/go/src/pkg/runtime/proc.c:247 +0x11a fp=0x49defa8 sp=0x49def50
runtime.goexit()
/usr/local/go/src/pkg/runtime/proc.c:1445 fp=0x49defb0 sp=0x49defa8
created by _rt0_go
/usr/local/go/src/pkg/runtime/asm_amd64.s:97 +0x120


 Comments   
Comment by Aruna Piravi [ 18/Nov/14 ]
Working around this by commenting out the contents of ActiveReplicationSpecs().
Comment by Xiaomei Zhang [ 25/Nov/14 ]
Can't reproduce when starting goxdcr from <sherlock repo dir>/install/bin.
Comment by Aruna Piravi [ 25/Nov/14 ]
It looks like Yu has commented out the code and pushed that change to the repo, which is why we cannot reproduce with the latest code - https://github.com/couchbase/goxdcr/blob/ac4f3c4fe12caa75986e38e73764ce080a393f98/services/metadata_service.go#L101-118.

It's possible Yu has a solution for this, so leaving this open for tracking.




[MB-12766] subqueries fail with error Error doing bulk get Created: 25/Nov/14  Updated: 25/Nov/14  Due: 01/Dec/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4, sherlock
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: 8h
Time Spent: Not Specified
Original Estimate: 8h

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Query:
select name, join_day from default where join_day = (select AVG(join_day) as average from default d use keys ['query-test-Sales-0', 'query-test-Sales-1', 'query-test-Sales-2', 'query-test-Sales-3', 'query-test-Sales-4', 'query-test-Sales-5'])[0].average

{
    "request_id": "d1ed4e07-3841-41d0-8465-5187f799a40b",
    "signature": {
        "join_day": "json",
        "name": "json"
    },
    "results": [
    ],
    "errors": [
        {
            "caller": "couchbase:492",
            "cause": "{3 errors, starting with bulkget exceeded MaxBulkRetries for vbucket 755}",
            "code": 5000,
            "key": "Internal Error",
            "message": "Error doing bulk get"
        },
        {
            "caller": "couchbase:492",
            "cause": "{1 errors, starting with bulkget exceeded MaxBulkRetries for vbucket 1018}",
            "code": 5000,
            "key": "Internal Error",
            "message": "Error doing bulk get"
        },
        {
            "caller": "couchbase:492",
            "cause": "{1 errors, starting with bulkget exceeded MaxBulkRetries for vbucket 253}",
            "code": 5000,
            "key": "Internal Error",
            "message": "Error doing bulk get"
        }
    ],
    "status": "errors",
    "metrics": {
        "elapsedTime": "3m1.147069007s",
        "executionTime": "3m1.14694197s",
        "resultCount": 0,
        "resultSize": 0,
        "errorCount": 3
    }
}





[MB-12765] meta() shows incorrect cas value Created: 25/Nov/14  Updated: 25/Nov/14  Due: 28/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: 4h
Time Spent: Not Specified
Original Estimate: 4h

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
The item has key 'query-test67f75c3-0'.
If I get the cas via the SDK, it is 28686505813123738:

client.get(key.encode('utf-8'))
tuple: (4042322160, 28686505813123738, '{"tasks_points": {"task1": 1, "task2": 1}, "name": "employee-28", "mutated": 0, "skills": ["skill2010", "skill2011"], "join_day": 28, "join_mo": 10, "email": "28-mail@couchbase.com", "test_rate": 10.1, "join_yr": 2011, "_id": "query-test67f75c3-0", "VMs": [{"RAM": 10, "os": "ubuntu", "name": "vm_10", "memory": 10}, {"RAM": 10, "os": "windows", "name": "vm_11", "memory": 10}], "job_title": "Engineer"}')

but if I get the cas via N1QL, it is 28686505813123736:
cbq> select META(default) from default use keys ['query-test67f75c3-0']
   > ;
{
    "request_id": "e15c93f7-ad42-4ad7-9d99-d443b9b19427",
    "signature": {
        "$1": "object"
    },
    "results": [
        {
            "$1": {
                "cas": 2.8686505813123736e+16,
                "flags": 4.04232216e+09,
                "id": "query-test67f75c3-0",
                "type": "json"
            }
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "57.992ms",
        "executionTime": "57.866ms",
        "resultCount": 1,
        "resultSize": 209
    }
}
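The two values differ exactly as expected for a 64-bit integer carried as a JSON number (a 64-bit float with a 53-bit mantissa): 28686505813123738 is above 2^53, and the nearest representable double is 28686505813123736, exactly the value N1QL prints. Whether this is what actually happens inside the query engine is not confirmed in this ticket; a minimal C illustration of the rounding:

    #include <inttypes.h>
    #include <stdio.h>

    int main(void) {
        /* CAS is a full 64-bit integer, but a JSON number is a double
         * with a 53-bit mantissa, so values above 2^53 get rounded. */
        uint64_t cas = UINT64_C(28686505813123738); /* value seen via the SDK */
        double as_json = (double)cas;               /* what a JSON pipeline carries */

        printf("original   : %" PRIu64 "\n", cas);
        printf("via double : %.0f\n", as_json);     /* prints 28686505813123736 */
        return 0;
    }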





[MB-11623] test for performance regressions with JSON detection Created: 02/Jul/14  Updated: 25/Nov/14

Status: In Progress
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0.2
Security Level: Public

Type: Task Priority: Critical
Reporter: Matt Ingenthron Assignee: Thomas Anderson
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: 0h
Time Spent: 120h
Original Estimate: Not Specified

Attachments: File JSONDoctPerfTest140728.rtf     File JSONPerfTestV3.uos    
Issue Links:
Relates to
relates to MB-11675 20-30% performance degradation on app... Closed

 Description   
Related to one of the changes in 3.0, we need to test what has been implemented to see if a performance regression or unexpected resource utilization has been introduced.

In 2.x, all JSON detection was handled at the time of persistence. Since persistence was done in batch and in background, with the then current document, it would limit the resource utilization of any JSON detection.

Starting in 3.x, with the datatype/HELLO changes introduced (and currently disabled), the JSON detection has moved to both memcached and ep-engine, depending on the type of mutation.

Just to paint the reason this is a concern, here's a possible scenario.

Imagine a cluster node that is happily accepting 100,000 sets/s for a given small JSON document, which accounts for about 20 Mbit of network traffic (small enough not to notice). That node has a fast SSD at about 8k IOPS. Since detection happened only at persistence time, and persistence is bounded by the disk and deduplicates to the then-current document, that means we'd only be doing JSON detection some 5,000 times per second with Couchbase Server 2.x.

With the changes already integrated, that JSON detection may be tried over 100k times/s. That's a 20x increase. The detection needs to occur somewhere other than on the persistence path, as the contract between DCP and view engine is such that the JSON detection needs to occur before DCP transfer.

This request is to test/assess if there is a performance change and/or any unexpected resource utilization when having fast mutating JSON documents.

I'll leave it to the team to decide what the right test is, but here's what I might suggest.

With a view defined create a test that has a small to moderate load at steady state and one fast-changing item. Test it with a set of sizes and different complexity. For instance, permutations that might be something like this:
non-JSON of 1k, 8k, 32k, 128k
simple JSON of 1k, 8k, 32k, 128k
complex JSON of 1k, 8k, 32k, 128k
metrics to gather:
throughput, CPU utilization by process, RSS by process, memory allocation requests by process (or minor faults or something)

Hopefully we won't see anything to be concerned with, but it is possible.

There are options to move JSON detection to somewhere later in processing (i.e., before DCP transfer) or other optimization thoughts if there is an issue.

 Comments   
Comment by Cihan Biyikoglu [ 07/Jul/14 ]
This is no longer needed for 3.0, is that right? Ready to postpone to 3.0.1?
Comment by Pavel Paulau [ 07/Jul/14 ]
HELLO-based negotiation was disabled but detection still happens in ep-engine.
We need to understand impact before 3.0 release. Sooner than later.
Comment by Matt Ingenthron [ 23/Jul/14 ]
I'm curious Thomas, when you say "increase in bytes appended", do you mean for the same workload the RSS is larger in the 'increase' case? Great to see you making progress.
Comment by Wayne Siu [ 24/Jul/14 ]
Pasted comment from Thomas:
Subject: Re: Couchbase Issues: (MB-11623) test for performance regressions with JSON detection
Yes, ~20% increase from 2.5.1 to 3.0 for the same load generator, as reported by the CB server for the same input load. I'm verifying and 'isolating'. Will also be looking at if/how this contributes to the replication load increase (20% on a 20% increase …).
The issues seem related. Same increase for 1K, 8K, 16K and 32K with some variance.
—thomas
Comment by Thomas Anderson [ 29/Jul/14 ]
Initial results using the JSON document load test.
Comment by Matt Ingenthron [ 29/Jul/14 ]
Tom: saw your notes in the work log, out of curiosity, what was deferred to 3.0.1? Also, from the comment above, 20% increase in what?
Comment by Anil Kumar [ 13/Aug/14 ]
Thomas - as discussed, please update the ticket with the % of regression caused by JSON detection now being in memcached. I will open a separate ticket to document it.
Comment by Thomas Anderson [ 19/Aug/14 ]
A comparison of non-JSON to JSON in 2.5.1 and 3.0.0.1105 showed statistically similar performance, i.e., the minimal overhead of handling a JSON document over a similar KV document stayed consistent from 2.5.1 to 3.0.0 pre-RC1. See the attached file JSONPerfTestV3.uos; to be re-run with the official RC1 candidate. The feature to load complex JSON documents is now modified to 4 levels of JSON complexity (for each document size in bytes): {simpleJSON:: 1 element-attribute value pair; smallJSON:: 10 elements - no array, no nesting; mediumJSON:: 100 elements - arrays & nesting; largeJSON:: 10000 elements, mix of element types}.

Note: the original seed of this issue was a detected performance issue with JSON documents, ~20-30%. The code/architectural change which caused this was deferred to 3.0.1. Additional modifications to the server to address simple append-mode performance degradation further lessened the question of whether the document type was the cause of the degradation. The tests did, however, show a positive change in compaction, i.e., 3.x compacts documents ~5-7% better than 2.5.1.

 
Comment by Thomas Anderson [ 19/Aug/14 ]
Re-run with build 1105: regression comparing the same document size and same document load for non-JSON vs. simple-JSON.
2.5.1:: a 1024-byte document, 10 loaders, 1.25M documents for non-JSON vs. JSON showed a < 4% performance degradation; 3.0:: shows a < 3% degradation. Many other factors seem to dominate.
Comment by Matt Ingenthron [ 19/Aug/14 ]
Just for the comments here, the original seed wasn't an observed performance regression but rather an architectural concern that there could be a space/CPU/throughput cost for the new JSON detection. That's why I opened it.
Comment by Anil Kumar [ 25/Nov/14 ]
If we haven't seen any regression, can we resolve this ticket now?




[MB-8564] cbhealthchecker should produce a timestamped, zipped file by default Created: 03/Jul/13  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.1.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Perry Krug Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Now that we are shipping cbhealthchecker with the product and asking customers to run it more frequently, it would be great if it produced a timestamped and zipped output file (similar to cbcollect_info) so that customers can upload it much more easily.
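For illustration only (the file name pattern below is an assumption, not something specified in this ticket or the tool): producing a timestamped archive name in the spirit of cbcollect_info is a one-liner with strftime():

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        /* Build a timestamped archive name; the "healthcheck_report_" prefix
         * and .zip suffix are illustrative assumptions. */
        char name[64];
        time_t now = time(NULL);
        strftime(name, sizeof(name), "healthcheck_report_%Y%m%d-%H%M%S.zip",
                 localtime(&now));
        printf("%s\n", name);
        return 0;
    }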

 Comments   
Comment by Maria McDuff (Inactive) [ 08/Oct/13 ]
Bin,

any update on this issue?




[MB-8774] healthchecker should warn or fail with Transparent Huge Pages enabled Created: 08/Aug/13  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.1.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Minor
Reporter: James Mauss Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We have seen Transparent Huge Pages being enabled cause problems and even crashes on nodes.

Is there a way to warn or fail on install to make sure this is not enabled?

 Comments   
Comment by Dipti Borkar [ 08/Aug/13 ]
Other NoSQL databases have the same issue. I'm not sure we should build this into the installer; it's way too specific. We should certainly document it as a prerequisite and in the best practices section.

James, can you please open a separate doc bug for this.
Comment by Perry Krug [ 09/Aug/13 ]
How about building it into the healthchecker and/or log analyser?
Comment by Dipti Borkar [ 09/Aug/13 ]
healthchecker is a good option.
Comment by Anil Kumar [ 27/Aug/13 ]
We need to add an ALERT to the healthchecker report if the tool finds that 'Transparent Huge Pages' is enabled on RHEL6 servers.

ALERT: Disable 'Transparent HugePages' on RHEL6 Kernels

-------------

You can check the current setting for Transparent HugePages; the bracketed value is the active mode ("[always]" here):

    # cat /sys/kernel/mm/transparent_hugepage/enabled
    [always] madvise never
    #


Comment by Bin Cui [ 27/Aug/13 ]
Healthchecker is an agentless monitoring/management tool. Unless ns_server provides such a monitoring capability, healthchecker won't be able to run any script remotely.
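Wherever the check ends up living (healthchecker, installer, or ns_server), the test itself is a single file read. A minimal sketch in C of the check described in the comments above (the sysfs path, the "[always]" marker, and the ALERT wording are taken from this ticket; the function name and C choice are illustrative only):

    #include <stdio.h>
    #include <string.h>

    /* Returns non-zero when Transparent Huge Pages is set to [always]. */
    static int thp_always_enabled(void) {
        char buf[128] = "";
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
        if (f == NULL) {
            return 0; /* no THP support on this kernel, nothing to warn about */
        }
        if (fgets(buf, sizeof(buf), f) == NULL) {
            buf[0] = '\0';
        }
        fclose(f);
        return strstr(buf, "[always]") != NULL;
    }

    int main(void) {
        if (thp_always_enabled()) {
            fprintf(stderr, "ALERT: Disable 'Transparent HugePages' on RHEL6 Kernels\n");
        }
        return 0;
    }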




CBHealthChecker - Fix fetching number of CPU processors (MB-8686)

[MB-8817] REST API support to report number of CPU cores for a specified node Created: 13/Aug/13  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0
Fix Version/s: feature-backlog, sherlock
Security Level: Public

Type: Technical task Priority: Major
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
One approach is to publish an API to run cbcollect_info and retrieve the result remotely; a scope argument is expected to limit the task to a certain group of tasks.

Another approach is to add the number of CPU cores to the current REST call.

 Comments   
Comment by Aleksey Kondratenko [ 13/Aug/13 ]
I need a really good reason for that. We're not going to add random APIs for random needs.

Also, most likely some escript (not a REST API) is going to be easier to do, given that Erlang does have this information.
Comment by Aleksey Kondratenko [ 16/Aug/13 ]
See my comment above
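For reference, the underlying number is available locally without any new API; the open question in this ticket is only how (or whether) ns_server should expose it remotely. A minimal sketch using the common _SC_NPROCESSORS_ONLN extension (not part of the ticket, just an illustration of how cheap the local query is):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* sysconf() reports the number of processors currently online;
         * it returns -1 if the value is indeterminate. */
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        if (ncpu < 1) {
            ncpu = 1;
        }
        printf("online processors: %ld\n", ncpu);
        return 0;
    }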




[MB-11186] Revisit some failed test cases for couchbase-cli, which are commented out now Created: 22/May/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Task Priority: Major
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Steve Yen [ 29/Jul/14 ]
Reviewed these with Bin: he says the majority of them work when run individually, but when run all together there seem to be interactions that make them fail.

As of 2014/07/29, the commented-out test cases in pump_dcp_test.py include:

    def __test_rejected_auth(self):
    def __test_close_after_auth(self):
    def __test_full_diff(self):
    def __test_full_diff_diff_acc(self):
    def __test_2_mutation_chopped_header(self):
    def __test_delete_ack(self):
    def __test_noop(self):
    def __test_tap_cmd_opaque(self):
    def __test_flush_all(self):
    def __test_restore_1M_blob(self):
    def __test_restore_30M_blob(self):
    def __test_restore_batch_max_bytes(self):
    def __test_immediate_not_my_vbucket_during_restore(self):
    def __test_later_not_my_vbucket_during_restore(self):
    def __test_immediate_not_my_vbucket_during_restore_1T(self):
    def __test_immediate_not_my_vbucket_during_restore_5T(self):
    def __test_immediate_not_my_vbucket_during_restore_5B(self):
    def __test_rejected_auth(self):




[MB-12163] Memcached Closing connection due to read error: Unknown error Created: 10/Sep/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: memcached
Affects Version/s: 2.5.0, 3.0, 3.0.2
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Dave Rigby
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: [info] OS Name : Microsoft Windows Server 2008 R2 Enterprise
[info] OS Version : 6.1.7601 Service Pack 1 Build 7601
[info] CB Version : 2.5.0-1059-rel-enterprise

Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
The error message "Closing connection due to read error: Unknown error" doesn't explain what the problem is. Unfortunately, on Windows we aren't translating the error code properly: we need to call FormatMessage(), not strerror().

Code at:
http://src.couchbase.org/source/xref/2.5.0/memcached/daemon/memcached.c#5360
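The description already names the fix; for illustration, a sketch of that substitution (not the actual memcached change - the helper name and buffer size are arbitrary):

    #include <winsock2.h>
    #include <windows.h>
    #include <stdio.h>

    /* strerror() only understands CRT errno values, so Winsock codes such as
     * WSAECONNRESET come back as "Unknown error". FormatMessage() covers the
     * full Win32/Winsock error space. */
    static void log_read_error(void) {
        char msg[256];
        DWORD err = WSAGetLastError();
        DWORD len = FormatMessageA(FORMAT_MESSAGE_FROM_SYSTEM |
                                   FORMAT_MESSAGE_IGNORE_INSERTS,
                                   NULL, err, 0, msg, sizeof(msg), NULL);
        if (len == 0) {
            snprintf(msg, sizeof(msg), "error %lu", (unsigned long)err);
        }
        fprintf(stderr, "Closing connection due to read error: %s\n", msg);
    }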




[MB-12265] Enhance cbbackup test suite on performance and footprint measurement Created: 25/Sep/14  Updated: 25/Nov/14

Status: In Progress
Project: Couchbase Server
Component/s: performance
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Bin Cui Assignee: Thomas Anderson
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
See CBSE-1407.

We are missing test coverage for the cbbackup tool in the following areas:
1. In a DGM scenario, measure the backup throughput.
2. In a DGM scenario, measure the memory consumption of the cbbackup process.

 Comments   
Comment by Thomas Anderson [ 25/Nov/14 ]
Testing the latest release, 3.0.2-1560: 100M x 2K documents, 4-node cluster, 64 GB RAM per node (considered a baseline evaluation to determine which performance tests to conduct on each release build).
* non-DGM, default settings: creates 12 concurrent streams consuming 15-18% of cluster CPU, with ~10% impact on a 10K ops workload - see MB-
* DGM of 20%: cbbackup performance is impacted by concurrency.
* DGM of 5%: magnified contention for system resources.

A comparison with transferring the bucket folder via ftp shows cbbackup is almost 3x slower, and it creates a backup footprint almost 2x larger than the files themselves. See MB-




[MB-12763] large body length value causes moxi to restart Created: 25/Nov/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 3.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Steve Yen
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Memcached has a bug CVE-2011-4971,

"Multiple integer signedness errors in the (1) process_bin_sasl_auth, (2) process_bin_complete_sasl_auth, (3) process_bin_update, and (4) process_bin_append_prepend functions in Memcached 1.4.5 and earlier allow remote attackers to cause a denial of service (crash) via a large body length value in a packet."

I tried this out on CB Server 3.0.1; it causes Moxi to restart but didn't seem to interrupt operations or require the cache to warm up. Is Moxi restarting a concern?

Steps to reproduce:

echo -en '\x80\x12\x00\x01\x08\x00\x00\x00\xff\xff\xff\xe8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xff\xff\x01\x00\x00\x00\x00\x00\x00\x00\x00\x000\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' | nc localhost 11211

Result:

Port server moxi on node 'babysitter_of_ns_1@127.0.0.1' exited with status 139. Restarting. Messages: 2014-11-25 13:50:47: (/home/buildbot/buildbot_slave/debian-7-x64-301-builder/build/build/moxi/src/cproxy_config.c.327) env: MOXI_SASL_PLAIN_USR (1)
2014-11-25 13:50:47: (/home/buildbot/buildbot_slave/debian-7-x64-301-builder/build/build/moxi/src/cproxy_config.c.336) env: MOXI_SASL_PLAIN_PWD (32)
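For illustration (this is not the actual moxi/memcached patch, and MAX_BODY_LEN is an assumed cap): the class of fix for CVE-2011-4971 is to validate the unsigned 32-bit body length from the binary-protocol header before it feeds any allocation or signed arithmetic. In the reproducer above the field is 0xffffffe8, i.e. a huge unsigned length, or -24 if treated as signed.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_BODY_LEN (20 * 1024 * 1024) /* assumed cap, e.g. max item size */

    /* Reject implausible body lengths up front instead of letting them reach
     * allocation or signed length arithmetic. */
    static bool body_length_ok(uint32_t bodylen, uint8_t extlen, uint16_t keylen) {
        if (bodylen > MAX_BODY_LEN) {
            return false; /* respond "Too large" and close the connection cleanly */
        }
        /* the value portion must not underflow: bodylen >= extras + key */
        return bodylen >= (uint32_t)extlen + keylen;
    }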

 Comments   
Comment by Ian McCloy [ 25/Nov/14 ]
https://code.google.com/p/memcached/issues/detail?id=192 - the original memcached bug and a simple patch/fix.
Comment by Ian McCloy [ 25/Nov/14 ]
The same command on the memcached port gives back a sensible error message:

echo -en '\x80\x12\x00\x01\x08\x00\x00\x00\xff\xff\xff\xe8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xff\xff\x01\x00\x00\x00\x00\x00\x00\x00\x00\x000\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' | nc localhost 11210

� Too large




[MB-9700] Error during rebalance: A view spec can not consist of merges exclusively Created: 09/Dec/13  Updated: 24/Nov/14  Due: 10/Dec/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: 2.5.0
Fix Version/s: cbq-alpha
Security Level: Public

Type: Bug Priority: Major
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
1-node cluster, 2 buckets
tuq is started on this node
started a rebalance from 1 to 2 nodes

http://172.27.33.18:8093/query?q=SELECT+name%2C+VMs+FROM+standard_bucket0+AS+employee+WHERE+ANY+vm.RAM+%3E+5+AND+vm.os+%3D+%22ubuntu%22+OVER+vm+IN+employee.VMs+end+ORDER+BY+name%2C+VMs%5B0%5D.RAM

ERROR: Unable to access view - cause: Error executing view req at http://127.0.0.1:8092/default/_all_docs?limit=1001&startkey=%22query-test2666fa7-5%22&startkey_docid=query-test2666fa7-5: 500 Internal Server Error - {"error":"error","reason":"A view spec can not consist of merges exclusively."}



 Comments   
Comment by Cihan Biyikoglu [ 19/Jun/14 ]
Assigning to DP4 for now. Please triage to a specific release or the bug backlog.
Comment by Sriram Melkote [ 20/Jun/14 ]
Iryna, can you please give more details on how to reproduce this? Which data set is it running against?
Comment by Iryna Mironava [ 02/Jul/14 ]
Can be reproduced by tuqquery.tuq_cluster_ops.QueriesOpsTests.test_incr_rebalance_in,GROUP=REBALANCE;P1,nodes_in=3 (a testrunner test)

1 default bucket and 1 standard bucket
20000 items (can be loaded by testrunner script:
./scripts/doc_loader.py -i cluster.ini -p bucket_name=default,doc_per_day=5)
tuq is started on the same VM where Couchbase is installed
add a node, start the rebalance
it fails right after the rebalance is started



CBQError: host 10.1.3.176: ERROR:{u'code': 5000, u'message': u'Unable to access view', u'caller': u'view_util:82', u'cause': u'error executing view req at http://127.0.0.1:8092/default/_all_docs?limit=1001&startkey=%22query-test16233f7-4%22&startkey_docid=query-test16233f7-4: 500 Internal Server Error - {"error":"error","reason":"A view spec can not consist of merges exclusively."}\n', u'key': u'Internal Error'}
Comment by Ketaki Gangal [ 24/Nov/14 ]
Hi Iryna,

Is this bug still valid?





[MB-9871] Create subdirectory in tarball Created: 09/Jan/14  Updated: 24/Nov/14  Due: 01/Dec/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: 2.0
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Minor
Reporter: Sergey Avseyev Assignee: Manik Taneja
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Is this a Regression?: Yes

 Description   
The n1ql package isn't "tar xzf"-friendly: it extracts itself into the current directory, unlike all the source packages I have installed so far.

Is it possible to put everything into a directory named after the tarball and then archive that directory?

(It would be nice if this project had its own issue tracker, because it is mandatory to specify a version in the MB tracker, which doesn't make sense here.)




[MB-8707] N1QL Preview White Paper Created: 25/Jul/13  Updated: 24/Nov/14  Due: 15/Dec/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: 2.2.0
Fix Version/s: cbq-alpha
Security Level: Public

Type: Task Priority: Major
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-9102] concurrent queries return error "bucket default not found" Created: 10/Sep/13  Updated: 24/Nov/14  Due: 10/Dec/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-alpha
Security Level: Public

Type: Task Priority: Major
Reporter: Deepkaran Salooja Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

I am running the query below with 30 threads in parallel, all doing POSTs to port 8093:
SELECT personal_details.display_name, profile_details.user_creation_time, profile_details.last_login_time FROM default WHERE profile_details.user_id = "Maire_35413466"

The following error is returned:
"error":
        {
            "caller": "view_index:155",
            "code": 5000,
            "key": "Internal Error",
            "message": "Bucket default not found."
        }

tuqtng is running against Couchbase with 100k docs in the default bucket.

Reducing the threads to 20 makes it work fine.

After creating the index below, I am able to run the same query with 200 threads in parallel:
CREATE INDEX user_id_idx ON default(profile_details.user_id)



 Comments   
Comment by Marty Schoch [ 24/Sep/13 ]
Based on the error message returned and the location of the caller, we know that our view query was unsuccessful. Given that this was a stress test, the most likely explanation is that the HTTP request timed out.

What behavior do we want when the following scenarios occur?

1. View request times out
2. View request gets an unsuccessful response (anything other than 200)
3. View request gets a successful response (200), but results are partial (the view engine sometimes returns partial results in case of error)

If we specify the behavior for these 3 conditions, we can code that up and have Deep retry.
Comment by Gerald Sangudi [ 17/Oct/13 ]
More specific error messages would be sufficient:

- "Scan request timed out."
- "Scan request was unsuccessful."
- "Scan request had partial results."




[MB-9186] query-engine does not always detect deleted indexes Created: 27/Sep/13  Updated: 24/Nov/14  Due: 08/Dec/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-alpha
Security Level: Public

Type: Bug Priority: Major
Reporter: Marty Schoch Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Is this a Regression?: Yes

 Description   
1. Start Couchbase Server and Query Engine (with no beer-sample bucket)
2. Load beer-sample bucket
3. Run query:

SELECT COUNT(*) FROM beer-sample WHERE abv > 7

Query works fine

4. Create index on abv field

CREATE INDEX abvidx ON beer-sample(abv)

5. Run query again

Query still works fine

6. Delete the view backing the index in the UI

7. Run query again

{
    "error":
        {
            "caller": "view_index:152",
            "code": 5000,
            "key": "Internal Error",
            "message": "Bucket beer-sample not found."
        }
}

Why? Because the planner still planned on using the abvidx index. Scanning this index returned a 404, which we translate into "bucket not found" (the same code path is how we detect deleted buckets).

The same thing happens when you delete/recreate a bucket but fail to recreate an index that existed before.

The good news is that this problem is somehow self-healing: once you encounter it, the next time around we revert to using all docs (I think we force a pool refresh on the failure).




[MB-9243] MIN uses index and scans optimum # of rows Created: 08/Oct/13  Updated: 24/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP2
Fix Version/s: cbq-alpha
Security Level: Public

Type: Improvement Priority: Major
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Gerald Sangudi [ 21/Oct/13 ]
Please mark as resolved if satisfactory.
Comment by Marty Schoch [ 21/Oct/13 ]
There's already another issue for this.




[MB-9253] graduate from couchbaselabs Created: 08/Oct/13  Updated: 24/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP2
Fix Version/s: cbq-alpha
Security Level: Public

Type: Improvement Priority: Major
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-9261] memcached bucket Created: 08/Oct/13  Updated: 24/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP2
Fix Version/s: cbq-alpha
Security Level: Public

Type: Improvement Priority: Major
Reporter: Gerald Sangudi Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
How can we detect that it's a memcached bucket?




[MB-9266] detect view index added Created: 08/Oct/13  Updated: 24/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP2
Fix Version/s: cbq-alpha
Security Level: Public

Type: Improvement Priority: Major
Reporter: Gerald Sangudi Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Issues around knowing when a view index is ready for use.




[MB-9286] sample apps Created: 08/Oct/13  Updated: 24/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP2
Fix Version/s: cbq-alpha
Security Level: Public

Type: Improvement Priority: Major
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-9288] online tutorial looks better on a smaller screen Created: 08/Oct/13  Updated: 24/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP2
Fix Version/s: cbq-alpha
Security Level: Public

Type: Improvement Priority: Major
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-9299] tuqtng refactoring / package restructuring Created: 08/Oct/13  Updated: 24/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP2
Fix Version/s: cbq-alpha
Security Level: Public

Type: Improvement Priority: Major
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
What is tuq? What is N1QL? The catalog API is an example of this direction. tuq might handle multiple languages, more than just N1QL (e.g., jq, JSONiq), perhaps with different language-specific planners: language as a plugin to tuq. New interfaces are perhaps needed, e.g. a base Expression (with Evaluate()). Ideally a pre-conference nice-to-have. Thoughts at this point; post-Aug-30.




[MB-9291] demos Created: 08/Oct/13  Updated: 24/Nov/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP2
Fix Version/s: cbq-alpha
Security Level: Public

Type: Improvement Priority: Major
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-9401] error 'Bucket X does not exist' appears when trying to query sasl bucket Created: 22/Oct/13  Updated: 24/Nov/14  Due: 05/Dec/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: