[MB-1816] windows setup.exe for just moxi Created: 13/Aug/10  Updated: 17/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: installer
Affects Version/s: 1.6.0 beta4
Fix Version/s: 1.6.0 beta4

Type: Bug
Reporter: Steve Yen Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Operating System: All
Platform: All


 Description   
We need a "moxi-setup.exe" for Windows, so we need an additional setup.exe output from wallace. This should package up and install moxi.exe, its required DLLs, and potentially any related technical docs, as a separate setup.exe.

The "northscale server setup.exe", however, also keeps its own 'embedded' moxi.exe.

Related, see MB-1815, where a 'standalone moxi' will need to become an NT service. When MB-1815 is done, the moxi-setup.exe will also need to register/unregister that NT service.

-----Original Message-----
From: Steve Yen
Sent: Thursday, August 12, 2010 10:54 PM
To: Sharon Barr; Trond Norbye
Subject: RE: Didn't we port the stand alone moxi to windows?

Trond did port moxi to win32, but there's no separate InstallShield project for a standalone moxi setup.exe, and no work on that has been started. It should be pretty straightforward for Dmitry to extend wallace to provide that, but we should check his current task plate.

Currently, on Windows moxi consists of an executable (moxi.exe) plus two required DLLs (pthreadGC2.dll and libcurl-4.dll). Those DLLs could probably be statically linked into moxi.exe to produce a single *.exe, too, if we want that.

To 'productize' moxi to be more standalone-friendly for Windows, we might need to make it into an NT service, but that can probably come later.

Steve

 Comments   
Comment by sharon.barr@northscale.com [ 16/Aug/10 ]
We don't need standalone moxi on Windows; we have the .NET client for this.
Comment by Patrick Varley [ 17/Nov/14 ]
Reopening this ticket as there has been a user on IRC asking about a standalone moxi package for Windows.
Comment by kireevco [ 17/Nov/14 ]
I'm using moxi on Windows servers as a drop-in replacement for memcached. I don't need a full Couchbase install or the .NET clients, because the clients talk the memcached protocol.
I created a Chocolatey package (https://www.myget.org/feed/kireevco-chocolatey/package/moxi) to deliver moxi binaries to clients seamlessly, but right now I have to host them myself: https://kireevco.github.io/download/moxi_2.5.0.zip.

The Windows community would really appreciate it if Couchbase could host some sort of packaged Windows moxi binaries on couchbase.org.




[MB-11846] Compiling breakdancer test case exceeds available memory Created: 29/Jul/14  Updated: 30/Oct/14  Due: 30/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Chris Hillery Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
1. With memcached change 4bb252a2a7d9a369c80f8db71b3b5dc1c9f47eb9, cc1 on ubuntu-1204 quickly uses up 100% of the available memory (4GB RAM, 512MB swap) and crashes with an internal error.

2. Without Trond's change, cc1 compiles fine and never takes up more than 12% memory, running on the same hardware.

 Comments   
Comment by Chris Hillery [ 29/Jul/14 ]
Ok, weird fact - on further investigation, it appears that this is NOT happening on the production build server, which is an identically-configured VM. It only appears to be happening on the commit validation server ci03. I'm going to temporarily disable that machine so the next make-simple-github-tap test runs on a different ci server and see if it is unique to ci03. If it is I will lower the priority of the bug. I'd still appreciate some help in understanding what's going on either way.
Comment by Trond Norbye [ 30/Jul/14 ]
Please verify that the two builders have the same patch level so that we're comparing apples with apples.

It does bring up another interesting topic: should our builders just use the compiler provided with the installation, or should we have a reference compiler for building our code? It seems like a bad idea to have to support a ton of compiler revisions (including the fact that they support different levels of C++11 that we have to work around).
Comment by Chris Hillery [ 31/Jul/14 ]
This is now occurring on other CI build servers in other tests - http://www.couchbase.com/issues/browse/CBD-1423

I am bumping this back to Test Blocker and I will revert the change as a work-around for now.
Comment by Chris Hillery [ 31/Jul/14 ]
Partial revert committed to memcached master: http://review.couchbase.org/#/c/40152/ and 3.0: http://review.couchbase.org/#/c/40153/
Comment by Trond Norbye [ 01/Aug/14 ]
That review in memcached should NEVER have been pushed through. Its subject line is too long
Comment by Chris Hillery [ 01/Aug/14 ]
If there's a documented standard out there for commit messages, my apologies; it was never revealed to me.
Comment by Trond Norbye [ 01/Aug/14 ]
When it doesn't fit within a terminal window there is a problem; it is much better to use multiple lines.

In addition, I'm not happy with the fix. Instead of deleting the line, it should have checked for an environment variable so that people could explicitly disable it. This is why we have review cycles.
Comment by Chris Hillery [ 01/Aug/14 ]
I don't think I want to get into style arguments. If there's a standard I'll use it. In the meantime I'll try to keep things to 72-character lines.

As to the content of the change, it was not intended to be a "fix"; it was a simple revert of a change that was provably breaking other jobs. I returned the code to its previous state, nothing more or less. And especially given the time crunch of the beta (which is supposed to be built tomorrow), waiting for a code review on a reversion is not in the cards.
Comment by Trond Norbye [ 01/Aug/14 ]
The normal way of doing a revert is to use git revert (which as an extra bonus makes the commit message contain that).
Comment by Trond Norbye [ 01/Aug/14 ]
http://review.couchbase.org/#/c/40165/
Comment by Chris Hillery [ 01/Aug/14 ]
1. Your fix is not correct, because simply adding -D to cmake won't cause any preprocessor defines to be created. You need to have some CONFIGURE_FILE() or similar to create a config.h using #cmakedefine. As it is there is no way to compile with your change.

2. The default behaviour should not be the one that is known to cause problems. Until and unless there is an actual fix for the problem (whether or not that is in the code), the default should be to keep the optimization, with an option to let individuals bypass that if they desire and accept the risks.

3. Characterizing the problem as "misconfigured VMs" is, at best, premature.

I will revert this change again on the 3.0 branch shortly, unless you have a better suggestion (I'm definitely all ears for a better suggestion!).
Comment by Trond Norbye [ 01/Aug/14 ]
If you look at the change, it passes the -D over into CMAKE_C_FLAGS, causing it to be set in the compiler flags and passed on to the compilation cycle.

As for the misconfiguration, it is either insufficient resources on the VM or a "broken" compiler version installed there.
Comment by Trond Norbye [ 01/Aug/14 ]
Can I get login credentials to the server where it fails and to an identical VM where it succeeds?
Comment by Chris Hillery [ 01/Aug/14 ]
[CMAKE_C_FLAGS] Fair enough, I did misread that. That's not really a sufficient workaround, though. Doing that may overwrite other CFLAGS set by other parts of the build process.

I still maintain that the default behaviour should be the known-working version. However, for the moment I have temporarily locked the rel-3.0.0.xml manifest to the revision before my revert (ie, to 5cc2f8d928f0eef8bddbcb2fcb796bc5e9768bb8), so I won't revert anything else until that has been tested.

The only VM I know of at the moment where we haven't seen build failures is the production build slave. I can't give you access to that tonight as we're in crunch mode to produce a beta build. Let's plan to hook up next week and do some exploration.
Comment by Volker Mische [ 01/Aug/14 ]
There are commit message guidelines. The bottom of

http://www.couchbase.com/wiki/display/couchbase/Contributing+Changes

links to:

http://en.wikibooks.org/wiki/Git/Introduction#Good_commit_messages
Comment by Trond Norbye [ 01/Aug/14 ]
I've not done anything on the 3.0.0 branch; the fix going forward is for 3.0.1 and trunk. Hopefully the 3.0 branch will die relatively soon since we've got a lot of good stuff in the 3.0.1 branch.

The "workaround" is not intended as a permanent solution; it's just until the VMs are fixed. I've not been able to reproduce this issue on my CentOS, Ubuntu, Fedora or SmartOS builders. They're running in the following VMs:

[root@00-26-b9-85-bd-92 ~]# vmadm list
UUID TYPE RAM STATE ALIAS
04bf8284-9c23-4870-9510-0224e7478f08 KVM 2048 running centos-6
7bcd48a8-dcc2-43a6-a1d8-99fbf89679d9 KVM 2048 running ubuntu
c99931d7-eaa3-47b4-b7f0-cb5c4b3f5400 KVM 2048 running fedora
921a3571-e1f6-49f3-accb-354b4fa125ea OS 4096 running compilesrv
Comment by Trond Norbye [ 01/Aug/14 ]
I need access to two identical configured builders where one may reproduce the error and one where it succeeds.
Comment by Volker Mische [ 01/Aug/14 ]
I would also add that I think it is about bad VMs. On commit validation we have 6 VMs; it only ever failed with this error on ubuntu-1204-64-ci-01 and never on the others (ubuntu-1204-64-ci-02 through 06).
Comment by Chris Hillery [ 01/Aug/14 ]
That's not correct. The problem originally occurred on ci-03.
Comment by Volker Mische [ 01/Aug/14 ]
Then I need to correct myself: my comment only holds true for the couchdb-gerrit-300 job.
Comment by Trond Norbye [ 01/Aug/14 ]
Can I get login creds to one that it fails on, while I'm waiting for access to one that it works on?
Comment by Volker Mische [ 01/Aug/14 ]
I don't know about creds (I think my normal user login works). The machine details are here: http://factory.couchbase.com/computer/ubuntu-1204-64-ci-01/
Comment by Chris Hillery [ 01/Aug/14 ]
Volker - it was initially detected in the make-simple-github-tap job, so it's not unique to couchdb-gerrit-300 either. Both jobs pretty much just checkout the code and build it, though; they're pretty similar.
Comment by Trond Norbye [ 01/Aug/14 ]
Adding swap space to the builder makes the compilation pass. I've been trying to figure out how to get gcc to print more information about each step (the memory usage reported by -ftime-report didn't at all match the process usage ;-))
Comment by Anil Kumar [ 12/Aug/14 ]
Adding the component as "build". Let me know if that's not correct.




[MB-10156] "XDCR - Cluster Compare" support tool Created: 07/Feb/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Cihan Biyikoglu Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: 2.5.1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
For the recent issues we have seen, we need a tool that can compare metadata (specifically revids) for a given replication definition in XDCR. To scale to large data sizes, being able to do this per vbucket or per doc range would be great, but we can do without these. For clarity, here is a high-level description.

Ideal case:
xdcr_compare cluster1_connectioninfo cluster1_bucketname cluster2connectioninfo cluster2_bucketname [vbucketid] [keyrange]
It should return a line per docid for each key where the cluster1 metadata and cluster2 metadata differ:
docID - cluster1_metadata cluster2_metadata

Simplification: the tool is expected to return false positives on a moving system, but we will tackle that by rerunning the tool multiple times.
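
To make the expected output concrete, here is a minimal sketch, in Go, of the comparison and output format described above. How the per-cluster metadata is actually fetched (memcached GET_META, file dumps, etc.) is out of scope, and the map contents shown are hypothetical.

// xdcr_compare sketch: print one line per doc ID whose metadata differs between
// the two clusters, in the "docID - cluster1_metadata cluster2_metadata" format.
package main

import "fmt"

func compare(cluster1, cluster2 map[string]string) {
	for docID, meta1 := range cluster1 {
		meta2, ok := cluster2[docID]
		if !ok {
			meta2 = "<missing>"
		}
		if meta1 != meta2 {
			fmt.Printf("%s - %s %s\n", docID, meta1, meta2)
		}
	}
	// Docs present only on cluster2 also count as a difference.
	for docID, meta2 := range cluster2 {
		if _, ok := cluster1[docID]; !ok {
			fmt.Printf("%s - <missing> %s\n", docID, meta2)
		}
	}
}

func main() {
	c1 := map[string]string{"doc1": "rev=3", "doc2": "rev=5"} // hypothetical metadata
	c2 := map[string]string{"doc1": "rev=3", "doc2": "rev=6"} // hypothetical metadata
	compare(c1, c2) // prints: doc2 - rev=5 rev=6
}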

 Comments   
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Aaron, do you have a timeline for this?
thanks
-cihan
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan,

For test automation/verification, can you list out the stats/metadata that we should be testing specifically?
we want to create/implement the tests accordingly.


Also -- is this tool de-coupled from the server package? or is this part of rpm/deb/.exe/osx build package?

Thanks,
Maria
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
This depends on the requirements; a tool that requires the manual collection of all data from all nodes in both clusters onto one machine (like we've done recently) could be done pretty quickly, but I imagine that may be difficult, or entirely infeasible, for some users.

Better would be to be able to operate remotely on clusters and only look at metadata. Unfortunately there is no *currently exposed* interface to only extract metadata from the system without also retrieving values. I may be able to work around this, but the workaround is unlikely to be simple.

Also for some users, even the amount of *metadata* may be prohibitively large to transfer all to one place, this also can be avoided, but again, adds difficulty.

Q: Can the tool be JVM-based?
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
I think it would be more feasible for this to ship separately from the server package.
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan, Aaron,

If it's de-coupled, what older versions of Couchbase would this tool support? as far back as 1.8.x? pls confirm as this would expand our backward compatibility testing for this tool.
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
Well, 1.8.x didn't have XDCR or the rev field; it can't be compatible with anything older than 2.0 since it operates mostly to check things added since 2.0.

I don't know how far back it needs to go, but it *definitely* needs to be able to run against 2.2.
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Agree with Aaron, let's keep this lightweight. Can we depend on Aaron for testing if this will initially be just a support tool? For 3.0, we may graduate the tool to the server-shipped category.
thanks
Comment by Sangharsh Agarwal [ 27/Feb/14 ]
Cihan, Is the Spec finalized for this tool in version 2.5.1?
Comment by Cihan Biyikoglu [ 27/Feb/14 ]
Sangharsh, for 2.5.1, we wanted to make this a "Aaron tested" tool. I believe Aaron already has the tool. Aaron?
Comment by Aaron Miller (Inactive) [ 27/Feb/14 ]
Working on it; wanted to get my actually-in-the-package 2.5.1 stuff into review first.

What I do already have is a diff tool for *files*, but it is highly inconvenient to use; this should be a tool that doesn't require collecting all data files into one place in order to use it, and instead can work against a running cluster.
Comment by Maria McDuff (Inactive) [ 05/Mar/14 ]
Aaron,

Is the tool merged into the build yet? Can you update, please?
Comment by Cihan Biyikoglu [ 06/Mar/14 ]
2.5.1 shiproom note: Phil raised a build concern on getting this packaged with 2.5.1. The initial bar we set was not to ship this as part of the server - it was intended to be a downloadable support tool. Aaron/Cihan will re-eval and get back to shiproom.
Comment by Cihan Biyikoglu [ 15/Jun/14 ]
Aaron no longer here. assigning to Xiaomei for consideration.




[MB-9632] diag / master events captured in log file Created: 22/Nov/13  Updated: 27/Aug/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Blocker
Reporter: Steve Yen Assignee: Ravi Mayuram
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
The information available in the diag / master events REST stream should be captured in a log (ALE?) file and hence available to cbcollect-info's and later analysis tools.
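
Until such a log exists, the stream has to be captured externally for the duration of a test. Below is a minimal sketch, in Go, of that kind of continuous capture; the endpoint path, port, and credentials are assumptions, not confirmed values.

// Continuously capture the master events REST stream to a local file.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	out, err := os.Create("master_events.log")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Endpoint path and credentials are assumptions.
	req, err := http.NewRequest("GET", "http://127.0.0.1:8091/diag/masterEvents", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.SetBasicAuth("Administrator", "password")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The response is a long-lived stream; copy it to disk until the test ends.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}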

 Comments   
Comment by Aleksey Kondratenko [ 22/Nov/13 ]
It is already available in collectinfo
Comment by Dustin Sallings (Inactive) [ 26/Nov/13 ]
If it's only available in collectinfo, then it's not available at all. We lose most of the useful information if we don't run an http client to capture it continually throughout the entire course of a test.
Comment by Aleksey Kondratenko [ 26/Nov/13 ]
Feel free to submit a patch with exact behavior you need
Comment by Cihan Biyikoglu [ 27/Aug/14 ]
is this still relevant?




[MB-8838] Security Improvement - Connectors to implement security improvements Created: 14/Aug/13  Updated: 19/May/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Security Improvement - Connectors to implement security improvements

Spec ToDo.




[MB-4030] enable traffic for for ready nodes even if not all nodes are up/healthy/ready (aka partial janitor) (was: After two nodes crashed, curr_items remained 0 after warmup for extended period of time) Created: 06/Jul/11  Updated: 20/May/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.1, 2.0, 2.0.1, 2.2.0, 2.1.1, 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We had two nodes crash at a customer, possibly related to a disk space issue, though I don't think so.

After they crashed, the nodes warmed up relatively quickly, but immediately "discarded" their items. I say that because I see that they warmed up ~10m items, but the current item counts were both 0.

I tried shutting down the service and had to kill memcached manually (kill -9). Restarting it went through the same process of warming up and then nothing.

While I was looking around, I left it sit for a little while and magically all of the items came back. I seem to recall this bug previously where a node wouldn't be told to be active until all the nodes in the cluster were active...and it got into trouble when not all of the nodes restarted.

Diags for all nodes will be attached

 Comments   
Comment by Perry Krug [ 06/Jul/11 ]
Full set of logs at \\corp-fs1\export_support_cases\bug_4030
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
It _is_ an ns_server issue, caused by the janitor needing all nodes to be up for vbucket activation. We planned a fix for 1.8.1 (now 1.8.2).
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
The fix would land as part of the fast warmup integration.
Comment by Perry Krug [ 18/Jul/12 ]
Peter, can we get a second look at this one? We've seen this before, and the problem is that the janitor did not run until all nodes had joined the cluster and warmed up. I'm not sure we've fixed that already...
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
Latest 2.0 will mark nodes as green and enable memcached traffic when all of them are up. So the easy part is done.

Partial janitor (i.e. enabling traffic for some nodes when others are still down/warming up) is something that is unlikely to be done soon.
Comment by Perry Krug [ 18/Jul/12 ]
Thanks Alk...what's the difference in behavior (in this area) between 1.x and 2.0? It "sounds" like they're the same, no?

And this bug should still remain open until we fix the primary issue which is the partial janitor...correct?
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
1.8.1 will show a node as green when ep-engine thinks it's warmed up. But, confusingly, it will not really be ready: all vbuckets will be in state dead and curr_items will be 0.

2.0 fixes this confusion. A node is marked green when it's actually warmed up from the user's perspective, i.e. the right vbucket states are set and it will serve client traffic.

2.0 is still very conservative about only making vbucket state changes when all nodes are up and warmed up. That's the non-partial janitor. Whether it's a bug or a "lack of feature" is debatable, but I think the main concern, that users are confused by the green-ness of nodes, is resolved.
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
Closing as fixed. We'll get to the partial janitor some day in the future; it's a feature we lack today, not a bug we have, IMHO.
Comment by Perry Krug [ 12/Nov/12 ]
Reopening this for the need for partial janitor. Recent customer had multiple nodes need to be hard-booted and none returned to service until all were warmed up
Comment by Steve Yen [ 12/Nov/12 ]
bug-scrub: moving out of 2.0, as this looks like a feature req.
Comment by Farshid Ghods (Inactive) [ 13/Nov/12 ]
In system testing we have noticed many times that if multiple nodes crash, the node status of those that have already warmed up appears as yellow until all nodes are warmed up.

The user won't be able to tell from the console which node has successfully warmed up, and if one node is actually not recovering, or not warming up in a reasonable time, they have to figure that out some other way (cbstats ...).

Another issue is that the user won't be able to perform a failover for one node even though N-1 nodes have already warmed up.

I am not sure if fixing this bug will impact cluster-restore functionality, but it is important to fix it or to suggest a workaround to the user (by workaround I mean a documented, tested and supported set of commands).
Comment by Mike Wiederhold [ 17/Mar/13 ]
Comments say this is an ns_server issue so I am removing couchbase-bucket from affected components. Please re-add if there is a couchbase-bucket task for this issue.
Comment by Aleksey Kondratenko [ 23/Feb/14 ]
Not going to happen for 3.0.




[MB-12500] Indexing with more than one KV node does not work with toy-indexing Created: 29/Oct/14  Updated: 30/Oct/14

Status: Open
Project: Couchbase Server
Component/s: secondary-index
Affects Version/s: sherlock
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Prathibha Bisarahalli Assignee: Pratap Chakravarthy
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
In reference to bug https://www.couchbase.com/issues/browse/CBIDXT-246, currently this works only for one KV node, i.e. one node running KV, the indexer and the query engine.

If Couchbase is set up on two nodes that are joined into a cluster, secondary index creation fails with "Internal error". This is because the projector is started on each node with the KV address 127.0.0.1, but when the nodes are joined to a cluster, their host names are the internal IPs of the AWS nodes. The projector is unable to get a mutation stream from the KV nodes.

This currently blocks testing with a multi-node setup. However, a single-node setup works fine if the node name is configured as 127.0.0.1 as part of Couchbase setup.
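
A minimal sketch, in Go, of the kind of check implied here: before requesting a mutation stream, verify that the KV address the projector was started with actually appears in the vbmap server list from ns_server. The function and variable names are illustrative assumptions, not the actual projector code.

package main

import "fmt"

// kvAddrInVBMap reports whether the projector's KV address matches one of the
// server addresses listed in the vbmap obtained from ns_server.
func kvAddrInVBMap(kvAddr string, vbmapServers []string) bool {
	for _, s := range vbmapServers {
		if s == kvAddr {
			return true
		}
	}
	return false
}

func main() {
	vbmapServers := []string{"10.0.1.12:11210", "10.0.1.13:11210"} // hypothetical AWS-internal addresses
	kvAddr := "127.0.0.1:11210"                                    // address the projector was started with
	if !kvAddrInVBMap(kvAddr, vbmapServers) {
		fmt.Printf("KV address %q is not in the vbmap %v; mutation streams will fail\n", kvAddr, vbmapServers)
	}
}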


 Comments   
Comment by Pratap Chakravarthy [ 30/Oct/14 ]
The hostname supplied to projector should match the vbmap obtained from ns_server.
Comment by Prathibha Bisarahalli [ 30/Oct/14 ]
Right. Currently the host name supplied is 127.0.0.1, but ns_server has the host names as the IP addresses of the nodes after they are joined to a cluster.
Comment by Pratap Chakravarthy [ 30/Oct/14 ]
One way is to assign the correct hostname while adding nodes to the cluster; this is allowed via the UI. But I am not sure whether we are going to advise our users (customers) to do that.




[MB-12096] collect_server_info.py does not work on a dev tree on windows.. Created: 29/Aug/14  Updated: 30/Oct/14

Status: Open
Project: Couchbase Server
Component/s: test-execution
Affects Version/s: techdebt-backlog
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Trond Norbye Assignee: Tommie McAfee
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
cbcollectinfo.py tries to use ssh to collect the files even if the target machine is the same machine the test is running on, and that doesn't seem to work on my Windows development box. Since all of the files should be local, it might as well use a normal copy.




[MB-4593] Windows Installer hangs on "Computing Space Requirements" Created: 27/Dec/11  Updated: 07/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: installer
Affects Version/s: 2.0-developer-preview-3, 2.0-developer-preview-4
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Bin Cui Assignee: Sriram Melkote
Resolution: Unresolved Votes: 3
Labels: windows, windows-3.0-beta, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7 Ultimate 64. Sony Vaio, i3 with 4GB RAM and 200 GB of 500 GB free. Also on a Sony Vaio, Windows 7 Ultimate 64, i7, 6 GB RAM and a 750GB drive with about 600 GB free.

Attachments: PNG File couchbase-installer.png     PNG File image001.png     PNG File ss 2014-08-28 at 4.16.09 PM.png    
Triage: Triaged

 Description   
When installing the Community Server 2.0 DP3 on Windows, the installer hangs on the "Computing space requirements" screen. There is no additional feedback from the installer. After 90-120 minutes or so, it does move forward and complete. The same issue was reported on Google Groups a few months back - http://groups.google.com/group/couchbase/browse_thread/thread/37dbba592a9c150b/f5e6d80880f7afc8?lnk=gst&q=msi.

Executable: couchbase-server-community_x86_64_2.0.0-dev-preview-3.setup.exe

WORKAROUND IN 3.0 - Create a registry key HKLM\SOFTWARE\Couchbase, name=SkipVcRuntime, type=DWORD, value=1 to skip the VC redistributable installation which is causing this issue. If the VC redistributable is necessary, it must be installed manually when the registry key is set to skip the automatic install.


 Comments   
Comment by Filip Stas [ 23/Feb/12 ]
Is there any solution for this? I'm experiencing the same problem. Running the unpacked MSI does not seem to work because the InstallShield setup has been configured to require installation through the exe.

Comment by Farshid Ghods (Inactive) [ 22/Mar/12 ]
from Bin:

Looks like it is related to the InstallShield engine. Maybe InstallShield tries to access the system registry and it is locked by another process. The suggestion is to shut down other running programs and try again if such a problem pops up.
Comment by Farshid Ghods (Inactive) [ 22/Mar/12 ]
We were unable to reproduce this on Windows 2008 64-bit.

The bug mentions this happened on Windows 7 64-bit, which is not a supported platform, but that should not make any difference.
Comment by Farshid Ghods (Inactive) [ 23/Mar/12 ]
From Bin:

Windows 7 is my dev environment, and I have no problem installing and testing it. From your description, I cannot tell whether it fails during installation, or whether installation finishes but Couchbase Server cannot start.

If it is due to an InstallShield failure, you can generate a log file for debugging with:
setup.exe /debuglog"C:\PathToLog\setupexe.log"

If Couchbase Server fails to start, the most likely reason is a missing or incompatible Microsoft runtime library. You can manually run service_start.bat under the bin directory and check what is going on, and you can run cbbrowse_log.bat to generate a log file for further debugging.
Comment by John Zablocki (Inactive) [ 23/Mar/12 ]
This is an installation only problem. There's not much more to it other than the installer hangs on the screen (see attachment).

However, after a failed install, I did get it to work by:

a) deleting C:\Program Files\Couchbase\*

b) deleting all registry keys with Couchbase Server left over from the failed install

c) rebooting

Next time I see this problem, I'll run it again with the /debuglog

I think the problem might be that a previous install of DP3 or DP4 (nightly build) failed and left some bits in place somewhere.
Comment by Steve Yen [ 05/Apr/12 ]
from Perry...
Comment by Thuan Nguyen [ 05/Apr/12 ]
I cannot reproduce this bug. I tested on Windows 7 Professional 64-bit and Windows Server 2008 64-bit.
Here are steps:
- Install couchbase server 2.0.0r-388 (dp3)
- Open web browser and go to initial setup in web console.
- Uninstall couchbase server 2.0.0r-388
- Install couchbase server 2.0.0dp4r-722
- Open web browser and go to initial setup in web console.
Install and uninstall of Couchbase Server went smoothly without any problem.
Comment by Bin Cui [ 25/Apr/12 ]
Maybe we need to get the installer verbose log file to get some clues.

setup.exe /verbose"c:\temp\logfile.txt"
Comment by John Zablocki (Inactive) [ 06/Jul/12 ]
Not sure if this is useful or not, but without fail, every time I encounter this problem, simply shutting down apps (usually Chrome for some reason) causes the hanging to stop. Right after closing Chrome, the C++ redistributable dialog pops open and installation completes.
Comment by Matt Ingenthron [ 10/Jul/12 ]
Workarounds/troubleshooting for this issue:


On installshield's website, there are similar problems reported for installshield. There are several possible reasons behind it:

1. The installation of the Microsoft C++ redistributable is blocked by some other running program, sometimes Chrome.
2. There are some remote network drives that are mapped to local system. Installshield may not have enough network privileges to access them.
3. Couchbase server was installed on the machine before and it was not totally uninstalled and/or removed. Installshield tried to recover from those old images.

To determine where to go next, run setup with debugging mode enabled:
setup.exe /debuglog"C:\temp\setupexe.log"

The contents of the log will tell you where it's getting stuck.
Comment by Bin Cui [ 30/Jul/12 ]
Matt's explanation should be included in the documentation and on the Q&A website. I reproduced the hanging problem during installation when the Chrome browser is running.
Comment by Farshid Ghods (Inactive) [ 30/Jul/12 ]
So does that mean the installer should wait until Chrome and other browsers are terminated before proceeding?

I see this as a very common use case with many installers: they ask the user to stop those applications, and if the user does not follow the instructions, the setup process does not continue until these conditions are met.
Comment by Dipti Borkar [ 31/Jul/12 ]
Is there no way to fix this? At the least we need to provide an error or guidance that Chrome needs to be quit before continuing. Is Chrome the only one we have seen causing this problem?
Comment by Steve Yen [ 13/Sep/12 ]
http://review.couchbase.org/#/c/20552/
Comment by Steve Yen [ 13/Sep/12 ]
See CBD-593
Comment by Øyvind Størkersen [ 17/Dec/12 ]
Same bug when installing 2.0.0 (build-1976) on Windows 7. Stopping Chrome did not help, but killing the process "Logitech ScrollApp" (KhalScroll.exe) did.
Comment by Joseph Lam [ 13/Sep/13 ]
It's happening to me when installing 2.1.1 on Windows 7. What is this step for, and is it really necessary? I see that it happens after the files have been copied to the installation folder. Not entirely sure what it's computing space requirements for.
Comment by MikeOliverAZ [ 16/Nov/13 ]
Same problem on 2.2.0 x86_64. I have tried everything, closing down Chrome and Torch from Task Manager to ensure no other apps are competing. Tried removing registry entries, but there are so many; my time, please. As noted above, this doesn't seem to prevent writing the files under Program Files, so what is it doing? So I cannot install; it now complains it cannot upgrade and to run the installer again.

BS... giving up and going to MongoDB... it installs, no sweat.

Comment by Sriram Melkote [ 18/Nov/13 ]
Reopening. Testing on VMs is a problem because they are all clones. We miss many problems like these.
Comment by Sriram Melkote [ 18/Nov/13 ]
Please don't close this bug until we have clear understanding of:

(a) What is the Runtime Library that we're trying to install that conflicts with all these other apps
(b) Why we need it
(c) A prioritized task to someone to remove that dependency on 3.0 release requirements

Until we have these, please do not close the bug.

We should not do any fixes on the lines of checking for known apps that conflict etc, as that is treating the symptom and not fixing the cause.
Comment by Bin Cui [ 18/Nov/13 ]
We install the Windows runtime library because the Erlang runtime libraries depend on it. Not just any runtime library, but the one that comes with the Erlang distribution package. Without it, or with an incompatible version, erl.exe won't run.

Instead of checking for any particular applications, the current solution is:
Run an Erlang test script. If it runs correctly, no runtime library is installed. Otherwise, the installer has to install the runtime library.

Please see CBD-593.

Comment by Sriram Melkote [ 18/Nov/13 ]
My suggestion is that we not attempt to install the MSVCRT ourselves.

Let us check whether the library we need is present prior to starting the install (via appropriate registry keys).

If it is absent, let us direct the user to download and install it and exit.
Comment by Bin Cui [ 18/Nov/13 ]
The approach is not totally right. Even if the MSVCRT exists, we still need to install it. The key here is the exact same MSVCRT package that comes with the Erlang distribution. We had problems before where, with the same version but a different build of the MSVCRT installed, Erlang wouldn't run.

One possible solution is to ask the user to download the MSVCRT library from our website and make it a prerequisite for installing Couchbase Server.
Comment by Sriram Melkote [ 18/Nov/13 ]
OK. It looks like MS distributes some versions of VC runtime with the OS itself. I doubt that Erlang needs anything newer.

So let us rebuild Erlang and have it link to the OS supplied version of MSVCRT (i.e., msvcr70.dll) in Couchbase 3.0 onwards

In the meanwhile, let us point the user to the vcredist we ship in Couchbase 2.x versions and ask them to install it from there.
Comment by Steve Yen [ 23/Dec/13 ]
Saw this in the email inboxes...

From: Tal V
Date: December 22, 2013 at 1:19:36 AM PST
Subject: Installing Couchbase on Windows 7

Hi CouchBase support,
I would like to get your assistance with an issue I'm having. I have a Windows 7 machine on which I tried to install Couchbase; the installation is stuck on the "Computing space requirements" step.
I tried several things without success:

1. I tried to download a new installation package.
2. I deleted all records of the software from the Registry.
3. I deleted the folder that was created under C:\Program Files\Couchbase.
4. I restarted the computer.
5. Opened only the installation package.
6. Re-installed it again.
And again it was stuck on the same step.
What is the solution for it?

Thank you very much,


--
Tal V
Comment by Steve Yen [ 23/Dec/13 ]
Hi Bin,
Not knowing much about installshield here, but one idea - are there ways of forcibly, perhaps optionally, skipping the computing space requirements step? Some environment variable flag, perhaps?
Thanks,
Steve

Comment by Bin Cui [ 23/Dec/13 ]
This "Computing space requirements" is quite misleading. It happens at the post install step while GUI still shows that message. Within the step, we run the erlang test script and fails and the installer runs "vcredist.exe" for microsoft runtime library which gets stuck.

For the time being, the most reliable way is not to run this vcredist.exe from installer. Instead, we should provide a link in our download web site.

1. During installation, if we fails to run the erlang test script, we can pop up a warning dialog and ask customers to download and run it after installation.
 
Comment by Bin Cui [ 23/Dec/13 ]
To work around the problem, we can instruct the customer to download vcredist.exe and run it manually before setting up Couchbase Server. If the runtime environment is set up correctly, the installer will bypass that step.
Comment by Bin Cui [ 30/Dec/13 ]
Use a Windows registry key to install or skip the vcredist.exe step:

On 32bit windows, Installer will check HKEY_LOCAL_MACHINE\SOFTWARE\Couchbase\SkipVcRuntime
On 64bit windows, Installer will check HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Couchbase\SkipVcRuntime,
where SkipVcRuntime is a DWORD (32-bit) value.

When SkipVcRuntime is set to 1, installer will skip the step to install vcredist.exe. Otherwise, installer will follow the same logic as before.
vcredist_x86.exe can be found in the root directory of couchbase server. It can be run as:
c:\<couchbase_root>\vcredist_x86.exe

http://review.couchbase.org/#/c/31501/
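
For scripted installs, the same key can be set programmatically. A minimal sketch in Go using golang.org/x/sys/windows/registry; writing under HKLM requires an elevated prompt, and which path applies depends on OS bitness as described above.

package main

import (
	"log"

	"golang.org/x/sys/windows/registry"
)

func main() {
	// 64-bit Windows path per the comment above; use `SOFTWARE\Couchbase` on 32-bit Windows.
	key, _, err := registry.CreateKey(
		registry.LOCAL_MACHINE,
		`SOFTWARE\Wow6432Node\Couchbase`,
		registry.SET_VALUE,
	)
	if err != nil {
		log.Fatal(err)
	}
	defer key.Close()

	// DWORD value 1 tells the installer to skip running vcredist.exe.
	if err := key.SetDWordValue("SkipVcRuntime", 1); err != nil {
		log.Fatal(err)
	}
}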
Comment by Bin Cui [ 02/Jan/14 ]
Check into branch 2.5 http://review.couchbase.org/#/c/31558/
Comment by Iryna Mironava [ 22/Jan/14 ]
Tested with Win 7 and Win Server 2008.
I am unable to reproduce this issue (build 2.0.0-1976; dp3 is no longer available).
Installed/uninstalled Couchbase several times.
Comment by Sriram Melkote [ 22/Jan/14 ]
Unfortunately, for this problem, if it did not reproduce, we can't say it is fixed. We have to find a machine where it reproduces and then verify a fix.

Anyway, no change made actually addresses the underlying problem (the registry key just gives a way to workaround it when it happens), so reopening the bug and targeting for 3.0
Comment by Sriram Melkote [ 23/Jan/14 ]
Bin - I just noticed that the Erlang installer itself (when downloaded from their website) installs the VC redistributable in non-silent mode. The Microsoft runtime installer dialog pops up, indicates it will install the VC redistributable, and then completes. Why do we run it in silent mode (and hence assume liability for it running properly)? Why do we not run the MSI in interactive mode like the ESL Erlang installer itself does?
Comment by Wayne Siu [ 05/Feb/14 ]
If we could get the information on the exact software version, it could be helpful.
From registry, Computer\HKLM\Software\Microsoft\WindowsNT\CurrentVersion
Comment by Wayne Siu [ 12/Feb/14 ]
Bin, looks like the erl.ini was locked when this issue happened.
Comment by Pavel Paulau [ 19/Feb/14 ]
Just happened to me in 2.2.0-837.
Comment by Anil Kumar [ 18/Mar/14 ]
Triaged by Don and Anil as per Windows Developer plan.
Comment by Bin Cui [ 08/Apr/14 ]
http://review.couchbase.org/#/c/35463/
Comment by Chris Hillery [ 13/May/14 ]
I'm new here, but it seems to me that vcredist_x64.exe does exactly the same thing as the corresponding MS-provided merge module for MSVC2013. If that's true, we should be able to just include that merge module in our project, and not need to fork out to install things. In fact, as of a few weeks ago, the 3.0 server installers are doing just that.

http://msdn.microsoft.com/en-us/library/dn501987.aspx

Is my understanding incomplete in some way?
Comment by Chris Hillery [ 14/May/14 ]
I can confirm that the most recent installers do install msvcr120.dll and msvcp120.dll in apparently the correct places, and the server can start with them. I *believe* this means that we no longer need to fork out vcredist_x64.exe, or have any of the InstallShield tricks to detect whether it is needed and/or skip installing it, etc. I'm leaving this bug open to both verify that the current merge module-based solution works, and to track removal of the unwanted code.
Comment by Sriram Melkote [ 16/May/14 ]
I've also verified that 3.0 build installed VCRT (msvcp100) is sufficient for Erlang R16.
Comment by Bin Cui [ 15/Sep/14 ]
Recently I happened to reproduce this problem on my own laptop. Using setup.exe /verbose"c:\temp\verbose.log", I generated a log file with more verbose debugging information. At the end of the file, it looks something like:

MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: OVERVIEW.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_admin\overview\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: BUCKETS.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_admin\buckets\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: MN_DIALOGS.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_dialogs\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: ABOUT.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_dialogs\about\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: ALLUSERSPROFILE , Object: Q:\
MSI (c) (C4:C0) [10:51:36:274]: PROPERTY CHANGE: Adding INSTALLLEVEL property. Its value is '1'.

It means that the installer tried to populate some property values for the all-users profile after it copied all data to the install location, even though it still shows this notorious "Computing space requirements" message.

For every installation, the installer uses the user temp directory to populate installer-related data. After I deleted or renamed the temp data under
c:\Users\<logonuser>\AppData\Temp and rebooted the machine, the problem was solved, at least for my laptop.

Conclusion:

1. After the installer has copied the files, it needs to set all-users profiles. This action is synchronous; it waits for and checks the exit code, and it will certainly hang if this action never returns.

2. This is an issue related to the setup environment, i.e. caused by other running applications, etc.

Suggestion:

1. Stop any other browsers and applications when you install Couchbase.
2. Kill the installation process and uninstall the failed setup.
3. Delete/rename the temp location under c:\Users\<logonuser>\AppData\Temp
4. Reboot and try again.

Comment by Bin Cui [ 17/Sep/14 ]
It turns out this is really about the installation environment, not about a particular installation step.

I suggest documenting the workaround.
Comment by Don Pinto [ 17/Sep/14 ]
Bin, some installers kill conflicting processes before installation starts so that it can complete. Why can't we do this?

(Maybe using something like this - http://stackoverflow.com/questions/251218/how-to-stop-a-running-process-during-an-msi-based-un-install)

Thanks,
Don
Comment by Don Pinto [ 23/Sep/14 ]
Triaged by PM and QE -
Present a dialog that says: Do you want to stop <dependent process> or Continue?

If the user hits Stop, the installer should kill the dependent process.
If the user hits Continue, the installer should retry, and if the dependent process is still open, continue showing the dialog.

Thanks,
Comment by Bin Cui [ 15/Oct/14 ]
As discussed, please update the documentation with the workaround, i.e. clean up the user temp directory and reinstall.
Comment by Ruth Harris [ 22/Oct/14 ]
Does this apply to all Couchbase versions since 2.0?
So, this MB should be added to the release notes, yes?

Does this apply to the 3.0.1 release, or will there be a fix?

Thanks, Ruth
Comment by Bin Cui [ 22/Oct/14 ]
It applies to all releases since 2.0. It is really not a fix, but a suggestion for how customers can deal with this issue when it happens, because it is related to the customer's setup environment.
Comment by Ruth Harris [ 22/Oct/14 ]
Add for all releases starting at 2.0:

If the Windows installer hangs on the Computing Space Requirements screen, there is an issue with your setup or installation environment, for example, other running applications.
Workaround:
1. Stop any other running browsers and applications when you start installing Couchbase.
2. Kill the installation process and uninstall the failed setup.
3. Delete or rename the temp location under C:\Users\[logonuser]\AppData\Temp
4. Reboot and try again.
Comment by Ruth Harris [ 22/Oct/14 ]
Fixed for 2.0, 2.1, 2.2, 2.5, 3.0
Comment by Ruth Harris [ 22/Oct/14 ]
Pushed to master. Should be published in the evening.
Comment by Sriram Melkote [ 03/Nov/14 ]
Bin, I'm reopening this bug because we've not really fixed it. The workaround we documented is a good step forward, but other software vendors don't encounter this problem as often as we do, so surely there's something we can do about it?




[MB-11589] Sliding endseqno during initial index build or upr reading from disk snapshot results in longer stale=false query latency and index startup time Created: 28/Jun/14  Updated: 10/Nov/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Sarath Lakshman Assignee: Meenakshi Goel
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks MB-11920 DCP based rebalance with views doesn'... Closed
Relates to
relates to MB-11919 3-5x increase in index size during re... Open
relates to MB-12179 Allow incremental pausable backfills Closed
relates to MB-12125 rebalance swap regression of 39.3% c... Open
relates to MB-12081 Remove counting mutations introduced ... Resolved
relates to MB-11918 Latency of stale=update_after queries... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We have to fix this, depending on the development cycles we have left for 3.0.

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - July 17

Currently investigating we will decide depending on the scope of changes needed.
Comment by Anil Kumar [ 30/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Raising this issue to "Critical" this needs to be fixed by RC.
Comment by Sriram Melkote [ 31/Jul/14 ]
The issue is that we'll have to change the view DCP client to stream all 1024 vbuckets in parallel, or we'll need an enhancement in ep-engine to stop streaming at the point requested. Neither is a simple change; the reason it's in 3.0 is that Dipti had requested we try to optimize query performance. I'll leave it at Major as I don't want to commit to fixing this in RC; also, the product works with reasonable performance without this fix, so it's not a must-have for RC.
Comment by Sriram Melkote [ 31/Jul/14 ]
Mike noted that even streaming all vbuckets in parallel (which was perhaps possible to do in 3.0) won't directly solve the issue, as the backfills are scheduled one at a time. ep-engine could hold onto smaller snapshots, but that's not something we can consider in 3.0, so the net effect is that we'll have to revisit this in 3.0.1 to design a proper solution.
Comment by Sriram Melkote [ 12/Aug/14 ]
Bringing back to 3.0 as this is the root cause of MB-11920 and MB-11918
Comment by Anil Kumar [ 13/Aug/14 ]
Deferring this to 3.0.1 since making this out of scope for 3.0.
Comment by Sarath Lakshman [ 05/Sep/14 ]
We need to file an ep-engine dependency ticket to implement parallel streaming support without causing a sliding endseqno during on-disk snapshot backfill.
Comment by Sriram Melkote [ 29/Oct/14 ]
Our changes to parallelize streams yielded improvement to rebalance in and rebalance out scenarios. However, they did not address swap rebalance changes.

Next step is to integrate Mike's changes. To this effect, Nimish provided a build with ep-engine changes to Venu. We are waiting for Venu to let us know what speedup these changes yield, so we can determine where the fixes should go (3.0.2 or master).

Reassigning to Venu as action item is on performance team now.
Comment by Venu Uppalapati [ 03/Nov/14 ]
Toy build testing: rebalance got stuck in two consecutive runs and does not complete. This issue is not seen on builds from the master branch; it is seen only with the toy build, and I suspect it is related to the two specific changes in this build enabling parallel streaming (client and server side). It seems like this toy build has not been functionally tested. Investigation of why rebalance gets stuck is necessary. Logs can be found below. A live cluster is available at the node IPs listed below.

http://ci.sc.couchbase.com/job/leto-dev/72/artifact/172.23.100.29.zip
http://ci.sc.couchbase.com/job/leto-dev/72/artifact/172.23.100.30.zip
http://ci.sc.couchbase.com/job/leto-dev/72/artifact/172.23.100.31.zip
http://ci.sc.couchbase.com/job/leto-dev/72/artifact/172.23.100.32.zip
Comment by Sriram Melkote [ 03/Nov/14 ]
Mike, we've run this test with the parallel streaming changes from the view engine successfully, so I request your help to do the initial investigation.
Comment by Mike Wiederhold [ 03/Nov/14 ]
Siri,

I'm assigning this back to the view engine team because of how old the build is that you tested (about a month). The rebalance appeared to be stuck waiting for the indexes to build. I recommend making sure that we have all of the correct code, and that it is also the latest code in the toy build, before testing again.
Comment by Sriram Melkote [ 04/Nov/14 ]
I think the CentOS build being tested is a few days old. However, based on yesterday's conversation, we'll first follow up to define the possible impact of the wait for backfill followed by the wait for indexing before moving to the next vBucket, and then proceed on this. If the latter yields a bigger performance improvement, we can focus there.
Comment by Sriram Melkote [ 07/Nov/14 ]
I've requested Meenakshi to do a functional test on ep-engine changes. The view changes have been validated.
Comment by Sriram Melkote [ 10/Nov/14 ]
This is too large a change for 3.0.2 and deferring it to Sherlock.




[MB-12185] update to "couchbase" from "membase" in gerrit mirroring and manifests Created: 14/Sep/14  Updated: 11/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.0, 2.5.1, 3.0-Beta
Fix Version/s: 3.0.2
Security Level: Public

Type: Task Priority: Blocker
Reporter: Matt Ingenthron Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-8297 Some key projects are still hosted at... Resolved

 Description   
One of the key components of Couchbase is still only at github.com/membase and not at github.com/couchbase. I think it's okay to mirror to both locations (not that there's an advantage), but for sure it should be at couchbase and the manifest for Couchbase Server releases should be pointing to Couchbase.

I believe the steps here are as follows:
- Set up a github.com/couchbase/memcached project (I've done that)
- Update gerrit's commit hook to update that repository
- Change the manifests to start using that repository

Assigning this to build as a component, as gerrit is handled by the build team. Then I'm guessing it'll need to be handed over to Trond or another developer to do the manifest change once gerrit is up to date.

Since memcached is slow changing now, perhaps the third item can be done earlier.

 Comments   
Comment by Chris Hillery [ 15/Sep/14 ]
Actually manifests are owned by build team too so I will do both parts.

However, the manifest for the hopefully-final release candidate already exists, and I'm a teensy bit wary about changing it after the fact. The manifest change may need to wait for 3.0.1.
Comment by Matt Ingenthron [ 15/Sep/14 ]
I'll leave it to you to work out how to fix it, but I'd just point out that manifest files are mutable.
Comment by Chris Hillery [ 15/Sep/14 ]
The manifest we build from is mutable. The historical manifests recording what we have already built really shouldn't be.
Comment by Matt Ingenthron [ 15/Sep/14 ]
True, but they are. :) That was half me calling back to our discussion about tagging and mutability of things in the Mountain View office. I'm sure you remember that late night conversation.

If you can help here Ceej, that'd be great. I'm just trying to make sure we have the cleanest project possible out there on the web. One wart less will bring me to 999,999 or so. :)
Comment by Trond Norbye [ 15/Sep/14 ]
Just a FYI, we've been ramping up the changes to memcached, so it's no longer a slow moving component ;-)
Comment by Matt Ingenthron [ 15/Sep/14 ]
Slow moving w.r.t. 3.0.0 though, right? That means the current github.com/couchbase/memcached probably has the commit planned to be released, so it's low risk to update github.com/couchbase/manifest with the couchbase repo instead of membase.

That's all I meant. :)
Comment by Trond Norbye [ 15/Sep/14 ]
_all_ components should be slow moving with respect to 3.0.0 ;)
Comment by Chris Hillery [ 16/Sep/14 ]
Matt, it appears that couchbase/memcached is a *fork* of membase/memcached, which is probably undesirable. We can actively rename the membase/memcached project to couchbase/memcached, and github will automatically forward requests from the old name to the new so it is seamless. It also means that we don't have to worry about migrating any commits, etc.

Does anything refer to couchbase/memcached already? Could we delete that one outright and then rename membase/memcached instead?
Comment by Matt Ingenthron [ 16/Sep/14 ]
Ah, that would be my fault. I propose deleting the couchbase/memcached and then transferring ownership from membase/memcached to couchbase/memcached. I think that's what you meant by "actively rename", right? Sounds like a great plan.

I think that's all in your hands Ceej, but I'd be glad to help if needed.

I still think in the interest of reducing warts, it'd be good to fix the manifest.
Comment by Chris Hillery [ 16/Sep/14 ]
I will do that (rename the repo), just please confirm explicitly that temporarily deleting couchbase/memcached won't cause the world to end. :)
Comment by Matt Ingenthron [ 16/Sep/14 ]
It won't since it didn't exist until this last Sunday when I created this ticket. If something world-ending happens as a result, I'll call it a bug to have depended on it. ;)
Comment by Chris Hillery [ 18/Sep/14 ]
I deleted couchbase/memcached and then transferred ownership of membase/memcached to couchbase. The original membase/memcached repository had a number of collaborators, most of which I think were historical. For now, couchbase/memcached only has "Owners" and "Robots" listed as collaborators, which is generally the desired configuration.

http://review.couchbase.org/#/c/41470/ proposes changes to the active manifests. I see no problem with committing that.

As for the historical manifests, there are two:

1. Sooner or later we will add a "released/3.0.0.xml" manifest to the couchbase/manifest repository, representing the exact SHAs which were built. I think it's probably OK to retroactively change the remote on that manifest since the two repositories are aliases for each other. This will affect any 3.0.0 hotfixes which are built, etc.

2. However, all of the already-built 3.0 packages (.deb / .rpm / .zip files) have embedded in them the manifest which was used to build them. Those, unfortunately, cannot be changed at this time. Doing so would require re-packaging the deliverables which have already undergone QE validation. While it is technically possible to do so, it would be a great deal of manual work, and IMHO a non-trivial and unnecessary risk. The only safe solution would be to trigger a new build, but in that case I would argue we would need to re-validate the deliverables, which I'm sure is a non-starter for PM. I'm afraid this particular sub-wart will need to wait for 3.0.1 to be fully addressed.
Comment by Matt Ingenthron [ 18/Sep/14 ]
Excellent, thanks Ceej. I think this is a great improvement, especially if 3.0.0's release manifest no longer references membase.

I'll leave it to the build team to manage, but I might suggest that gerrit and various other things pointing to membase should slowly change as well, in case someone decides someday to cancel the membase organization subscription to github.
Comment by Chris Hillery [ 11/Nov/14 ]
Matt - I realized I never closed on this issue. I have updated my merge proposal (URL above) to include rel-3.0.0, rel-3.0.1, and rel-3.0.2, as well as released/3.0.0 and released/3.0.1. I've added you as a reviewer if you wouldn't mind voting.

I will also make the same change in the sherlock.xml manifest (separately, since I'm making frequent changes in there anyway).




[MB-12615] Sherlock build fails on Windows (as it expects to find gcc) Created: 10/Nov/14  Updated: 12/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build, forestdb, secondary-index
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Task Priority: Blocker
Reporter: Dave Finlay Assignee: Sriram Melkote
Resolution: Unresolved Votes: 0
Labels: build
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency

 Description   
Filing for tracking purposes.

Sherlock build on Windows is currently failing with the following stack:

I don't know who's responsible for this, but the build fails on Windows when building the indexer with:

Scanning dependencies of target indexer
can't load package: package github.com/couchbase/indexing/secondary/indexer/main: found packages indexer (dump_windows.go) and main (main.go)

With this fixed, we run into this:
cgo.exe -objdir goforestdb\_obj\ -- -I forestdb/include -I goforestdb\_obj commit.go config.go doc.go error.go forestdb.go info.go iterator.go kv.go
exec: "gcc": executable file not found in %PATH%


 Comments   
Comment by Sriram Melkote [ 11/Nov/14 ]
I fixed this specific issue, but it uncovers a much larger problem:

cgo.exe -objdir goforestdb\_obj\ -- -I forestdb/include -I goforestdb\_obj commit.go config.go doc.go error.go forestdb.go info.go iterator.go kv.go
exec: "gcc": executable file not found in %PATH%

The underlying issue is that cgo works with GCC, and we don't use GCC (MinGW) on Windows; we use the native Windows toolchain, Visual Studio. We need to find a solution to this. There are 3 possible approaches:

(1) Pratap suggested running ForestDB as a separate process. We'd use RPC to invoke methods on it. Performance impact is to be measured.
(2) Couchbase could enhance CGO to generate bindings compatible with Visual Studio.
(3) We can load forestdb as a DLL using syscall.LoadLibrary

I'm going to try #3 first as it's the easiest of the three. It will take some time, so please expect goforestdb to be broken on Windows for a little while.
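
For illustration only, here is a minimal C sketch of the Windows run-time loading mechanism that approach (3) relies on; the real change would make the same LoadLibrary/GetProcAddress calls from Go via syscall.LoadLibrary and syscall.GetProcAddress instead of linking through cgo+gcc. The DLL name and the fdb_init prototype below are assumptions, not the actual goforestdb code:

#include <windows.h>
#include <stdio.h>

/* Hypothetical prototype; the real declaration lives in forestdb.h. */
typedef int (*fdb_init_fn)(void *config);

int main(void)
{
    /* Resolve ForestDB at run time, so no gcc/cgo tool chain is needed. */
    HMODULE dll = LoadLibraryA("forestdb.dll");
    if (dll == NULL) {
        fprintf(stderr, "forestdb.dll not found (error %lu)\n", GetLastError());
        return 1;
    }
    fdb_init_fn fdb_init = (fdb_init_fn)GetProcAddress(dll, "fdb_init");
    if (fdb_init == NULL) {
        fprintf(stderr, "fdb_init not exported (error %lu)\n", GetLastError());
        FreeLibrary(dll);
        return 1;
    }
    int rc = fdb_init(NULL);   /* call through the dynamically resolved pointer */
    printf("fdb_init returned %d\n", rc);
    FreeLibrary(dll);
    return 0;
}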




[MB-12375] Query latency on Windows Created: 17/Oct/14  Updated: 11/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0.1
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Venu Uppalapati Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates MB-12466 Query Latency, 80th percentile, for 1... Resolved
is duplicated by MB-12374 115% regression in 80th percentile qu... Resolved
Gantt: start-finish
is triggering MB-12628 Investigate visualization discrepanci... Open
is triggering MB-12607 View Query Performance Open
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Test description:
80th percentile query latency (ms), 1 bucket x 20M x 2KB, non-DGM, 4 x 1 views, 500 mutations/sec/node, 400 queries/sec

Observation:
80th percentile latency increased from 13ms to 54ms
NOTE - other runs are showing 21ms response, not 54 ms

links:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_301-1330_2c1_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_251-1083_eba_access

logs:
http://ci.sc.couchbase.com/job/zeus-64/1221/artifact/172.23.96.25.zip
http://ci.sc.couchbase.com/job/zeus-64/1221/artifact/172.23.96.26.zip
http://ci.sc.couchbase.com/job/zeus-64/1221/artifact/172.23.96.27.zip
http://ci.sc.couchbase.com/job/zeus-64/1221/artifact/172.23.96.28.zip
http://ci.sc.couchbase.com/job/zeus-64/1221/artifact/web_log_172.23.96.25.json


 Comments   
Comment by Volker Mische [ 20/Oct/14 ]
I've added the label "Windows" as it really makes a difference; the performance looks right on Linux.
Comment by Volker Mische [ 20/Oct/14 ]
Removed the "windows" label again as I just found out that there's also a way to set the "Operating System".
Comment by Volker Mische [ 28/Oct/14 ]
I looked into the logs of the 3.0 run on Windows and Linux. I couldn't find anything suspicious. In case anyone wants to have a look, here are the links to all the stuff:

### ShowFast test
80th percentile query latency (ms), 1 bucket x 20M x 2KB, non-DGM, 4 x 1 views, 500 mutations/sec/node, 400 queries/sec

### Leto run on build 1330
http://showfast.sc.couchbase.com/#/runs/query_lat_20M_leto_ssd/3.0.1-1330
http://ci.sc.couchbase.com/job/leto/621/console

### Xeus run on build 1330
http://showfast.sc.couchbase.com/#/runs/query_lat_20M_zeus_ssd/3.0.1-1330
http://ci.sc.couchbase.com/job/zeus-64/1221/

### Comparison between Zeus and Leto run
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_301-1330_2c1_access&snapshot=leto_ssd_301-1330_488_access
Comment by Sriram Melkote [ 29/Oct/14 ]
Plan of action:

(a) Nimish to look at 2.5.1 and 3.0.1 runs more closely to characterize the source of slowdown
(b) Ceej to create a R16 2.5.1 build so we can eliminate the Erlang version change variable
(c) Nimish to use timestamps to see if we can narrow down the source of slowdown
Comment by Volker Mische [ 29/Oct/14 ]
The plan of action according to yesterday's meeting should be:

(a) Rerun the test with the 2.5.1 build that uses Erlang R16
(b) Nimish to get it reproduced locally
(c) Nimish to reduce the problem to the smallest possible case (e.g. using a single node and no load)
Comment by Aleksey Kondratenko [ 29/Oct/14 ]
Can somebody post query latency comparison of 3.0.1 vs 2.5.1 on similar hardware but running GNU/Linux ?

Also maybe I'm looking at wrong things but looking at graphs like this: http://i.imgur.com/1RIqfmq.png I see 80%-ile to be far bigger than 300%.

Such a massive difference is likely to be visible without fancy testrunner tests. But that's just a guess.

Also would be great to know exactly which queries are being sent.
Comment by Volker Mische [ 29/Oct/14 ]
Alk, here's the 2.5.1 run (12ms) [1].
And here the 3.0.0-1330 run (14ms) [2].

You can find them with going to ShowFast [3] and then click on "View Query" and "All". Then search for "80th percentile query latency (ms), 1 bucket x 20M x 2KB, non-DGM, 4 x 1 views, 500 mutations/sec/node, 400 queries/sec". You can then also switch between Linux and Windows builds.

[1]: http://showfast.sc.couchbase.com/#/runs/query_lat_20M_leto_ssd/2.5.1-1083
[2]: http://showfast.sc.couchbase.com/#/runs/query_lat_20M_leto_ssd/3.0.1-1330
[3]: http://showfast.sc.couchbase.com/#/timeline
Comment by Aleksey Kondratenko [ 29/Oct/14 ]
But leto_ssd vs. zeus are different sets of hardware. Are we sure we can say that SSD versus HDD doesn't make a difference in this case?
Comment by Venu Uppalapati [ 29/Oct/14 ]
There are separate cluster config specifications for zeus for KV, Views and XDCR. All view-related tests on zeus run on SSD.
Comment by Aleksey Kondratenko [ 29/Oct/14 ]
Is that _exact_ same hardware as leto_ssd for the purpose of query_lat_20M tests ?
Comment by Aleksey Kondratenko [ 29/Oct/14 ]
Also why then there's zeus_ssd.spec that's different than zeus.spec in perfrunner repo ?
Comment by Venu Uppalapati [ 29/Oct/14 ]
Yes, they are identical
https://github.com/couchbaselabs/perfrunner/blob/master/clusters/zeus_ssd.spec
https://github.com/couchbaselabs/perfrunner/blob/master/clusters/leto_ssd.spec
The hardware dedicated to Windows is limited, so this is a way of executing different tests (KV with HDD, Views with SSD, XDCR with HDD) using the same HW cluster.
Comment by Aleksey Kondratenko [ 29/Oct/14 ]
The .spec files only tell me that the CPUs are the same. They don't tell me whether the rest of the hardware is indeed the same and configured the same way.
Comment by Aleksey Kondratenko [ 29/Oct/14 ]
Looking at the original cbmonitor reports I see that 3.x is eating more CPU, which might be an indication of the CPU being saturated, while in the 2.5 run it might be less saturated. That might cause a huge difference in latency even if the perf difference is not as great.

In order to prove/disprove that theory I propose to run same comparison with same configuration except that with half load (both in kv ops and in view ops).
Comment by Volker Mische [ 31/Oct/14 ]
Nimish mentioned in chat that he saw the throughput increase in 3.0.1 (click on Windows and search for "Query throughput (qps), 1 bucket x 20M x 2KB, non-DGM, 4 x 1 views, 500 mutations/sec/node" [1]). It has >800 requests per second. If you now compare it to the test mentioned in this bug, you can see that the latency is way better [2] (it takes a while to load). Look at the second "[bucket-1] latency_query" chart. Orange is the throughput test, green the one mentioned in this bug.

[1]: http://showfast.sc.couchbase.com/#/timeline
[2]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_301-1330_2c1_access&snapshot=zeus_301-1437_05c_access
Comment by Nimish Gupta [ 31/Oct/14 ]
From the showfast graph, it looks like with new 3.0.1 build (3.0.1-1437) we are seeing better qps. Venu, could you please run this test with the newer 3.0.1 build (currently results are with 3.0.1-1330 in showfast) ?
Comment by Raju Suravarjjala [ 31/Oct/14 ]
Venu, as per our triage meeting today, can you please rerun the test with 3.0.2 build once it is available and report the results?
Comment by Volker Mische [ 31/Oct/14 ]
Venu, a rerun with build 3.0.1-1437 would be better.
Comment by Venu Uppalapati [ 31/Oct/14 ]
The result of this test from the run with 3.0.1-1437 is 23ms (posted to showfast). For comparison, the latency in previous runs ranges from 13ms (2.5.1-1083) to 54ms (3.0.1-1330). Will keep the ticket assigned to me and re-run the test with the 3.0.2 Windows build when available.
Comment by Volker Mische [ 01/Nov/14 ]
I'd also like to see a re-run as Alk suggested in an earlier comment [1], with half the mutations and half the view queries per second.

[1]: https://www.couchbase.com/issues/browse/MB-12375?focusedCommentId=104069&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-104069
Comment by Volker Mische [ 01/Nov/14 ]
I forgot to mention that between builds 1330 and 1437 there weren't *any* functional changes in the view engine (although the latency is now half as high). As Alk and I mentioned in one bug blocker meeting, the issue can be anywhere and we don't have a clue where.
Comment by Venu Uppalapati [ 05/Nov/14 ]
The result of the full run with 3.0.2-1503 is '21ms'. Test with half the mutations and half the queries per sec in progress.
Comment by Sriram Melkote [ 05/Nov/14 ]
Removing the percentage number, as the latency seems to vary from build to build.

That is, initially it was 54ms, now it's 23ms etc.
Comment by Volker Mische [ 05/Nov/14 ]
Venu, thanks for the update. So reducing the load doesn't have an impact on the latency (the run with the normal number of queries and mutations with the 3.0.2-1503 build even has only "21ms").

So my current conclusion is that on Windows the latency is higher for some reason and we don't have any clue which part of Couchbase it is due to.
Comment by Aleksey Kondratenko [ 05/Nov/14 ]
>> So reducing the load doesn't have an impact on the latency

May I have evidence of that? So far I've not seen any reduced-load results, let alone 2.5.1 vs. 3.0.2. Perhaps I'm missing something that was posted but I'm not seeing it.

Comment by Venu Uppalapati [ 05/Nov/14 ]
I had removed that run from displaying on showfast as it is not a standard test but the report is available here,
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_302-1503_866_access
An apples-to-apples run with 2.5.1 should be available by tomorrow morning.
Comment by Volker Mische [ 05/Nov/14 ]
Alk, in this comment [1] Venu mentions that it resulted in "23ms".

[1]: https://www.couchbase.com/issues/browse/MB-12375?focusedCommentId=104686&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-104686
Comment by Aleksey Kondratenko [ 05/Nov/14 ]
>> Alk, in this comment [1] Venu mentions that it resulted in "23ms".

That was full run from what I understand.
Comment by Venu Uppalapati [ 05/Nov/14 ]
That was a typo. The result in build 3.0.2-1503 with the full load of mutations and query throughput was 21ms (not 23ms). With half the mutations and half the query throughput the result is still the same, i.e. 21ms 80th percentile latency.
Comment by Aleksey Kondratenko [ 05/Nov/14 ]
Ok. Thanks.

BTW, may I have links to full reports rather than mere numbers? Reducing all the complexity of performance results to a single number may be quite misleading, so I'm quite interested in looking at the fuller picture.

P.S. I've asked above what kind of queries it is (w.r.t. staleness, limit, etc). Would be nice to get some answer.
Comment by Venu Uppalapati [ 05/Nov/14 ]
I have posted the report for the run with the reduced workload in my previous comment. Posting all relevant reports here:

For 2.5.1-1083 run with full workload,
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_251-1083_71d_access
For 3.0.2-1503 run with full regular workload,
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_302-1503_4fd_access
For 3.0.2-1503 run with reduced workload,
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_302-1503_866_access

The query params are generated here,
https://github.com/couchbaselabs/spring/blob/master/spring/querygen.py#L7
The generated query params are passed to the Python Couchbase client for querying:
https://github.com/couchbaselabs/spring/blob/master/spring/querygen.py#L84
https://github.com/couchbase/couchbase-python-client/blob/master/couchbase/views/params.py#L218

I do not have fully constructed queries available. If this is needed I would have to start a new run with additional logging enabled to capture the queries being sent out by the Python Couchbase client. This test sends queries with the stale setting true (the default).
Comment by Volker Mische [ 06/Nov/14 ]
Siri, this issue is Windows-related *only* (hence adding the Operating System field back). On Linux we see a jump from 12ms to 14ms between 2.5.1 and 3.x, which I think is reasonable given the architectural changes we made for 3.x.
Comment by Volker Mische [ 06/Nov/14 ]
Venu, a slight correction: the queries are sent with `stale=update_after`, which is the default.
Comment by Volker Mische [ 06/Nov/14 ]
I can't make sense of the numbers from the reduced-load run and the normal one. If you look at the query latency graph [1] (green is the normal run, orange the reduced-load run) you see that the query latency of the orange run for the 80th percentile is way below the green one. Yet both runs report "21ms" as the result (green [2], orange [3]).

[1]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_302-1503_4fd_access&snapshot=zeus_302-1503_866_access#zeus_302-1503_4fd_accesszeus_302-1503_866_accesszeus_302-1503_4fdzeus_302-1503_866bucket-1latency_query_lt90
[2]: http://ci.sc.couchbase.com/job/zeus-64/1246/console
[3]: http://ci.sc.couchbase.com/job/zeus-64/1248/console
Comment by Venu Uppalapati [ 06/Nov/14 ]
Querying the TS dbs and processing the data using numpy yields 80th percentile latencies of 21.350240707397468ms and 21.117305755615234ms for [2] and [3] above, respectively.
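
For reference on how a single number like this is derived from the raw latency samples, here is a minimal nearest-rank sketch in C; the sample values are made up, and note that numpy.percentile (which the framework uses) defaults to linear interpolation between ranks, so it can report slightly different values than nearest-rank on the same data:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: sort the sample and take the ceil(p/100 * n)-th
 * smallest value. */
static double percentile_nearest_rank(double *v, size_t n, double p)
{
    qsort(v, n, sizeof *v, cmp_double);
    size_t rank = (size_t)ceil(p / 100.0 * (double)n);
    if (rank == 0) rank = 1;
    return v[rank - 1];
}

int main(void)
{
    /* Made-up per-query latencies in ms, purely to show the computation. */
    double lat[] = {9.8, 11.2, 12.0, 13.5, 14.1, 17.9, 19.3, 21.4, 25.0, 54.2};
    size_t n = sizeof lat / sizeof lat[0];
    printf("80th percentile = %.1f ms\n", percentile_nearest_rank(lat, n, 80.0));
    return 0;
}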
Comment by Sriram Melkote [ 07/Nov/14 ]
Request to defer query performance issues out of 3.0.2
Comment by Aleksey Kondratenko [ 07/Nov/14 ]
>> querying the TS dbs and processing data using numpy yields 80th percentile latencies of 21.350240707397468ms and 21.117305755615234ms for [2] and [3] above respectively.

Interesting. Then either:

* I completely forgot what percentile is (which I find quite unlikely)

* or above mentioned percentile computation code is buggy or used incorrectly

* or graphs of percentiles are plotted (or computed) incorrectly

Because as Volker said (and see one of my earlier questions too) things don't add up. I.e. see http://i.imgur.com/AMKoda0.png
Comment by Venu Uppalapati [ 07/Nov/14 ]
At this point, I suspect an issue in the graph plotting and not the computation since the visualization framework itself uses numpy package for calculating percentiles. This will be investigated further.
Comment by Sriram Melkote [ 10/Nov/14 ]
Folks - View Query performance will have to be addressed as a project. I'm going to track this on MB-12607.

I'm closing this for the manageability aspect of JIRA issues.
Comment by Volker Mische [ 10/Nov/14 ]
Siri, that doesn't make sense to me. Why do you create a new bug and close one that contains valuable information? Especially since the new bug mixes up several separate issues.

This bug is about *Windows only* and for example MB-12419 is not about Windows at all but a general one.

I agree that we can group together all Windows-related performance regressions under this one, but we should not group unrelated issues together.
Comment by Volker Mische [ 11/Nov/14 ]
I'm re-opening this issue as we haven't really solved the problem or found any clue yet.

I assigned it to Venu as the first step is to find out why the graphs and the reported percentile number don't match.
Comment by Venu Uppalapati [ 11/Nov/14 ]
In my opinion, perf visualization framework bug should not be tracked through this ticket. I have created MB-12628 to investigate the visualization bug.
Comment by Volker Mische [ 11/Nov/14 ]
The follow-up bug for the graph/numbers issue is tracked in MB-12628.
Comment by Volker Mische [ 11/Nov/14 ]
Nimish, is it OK for you if I assign it to you? It is now about finding a way to reproduce this issue reliably locally, so that other devs can have a look.




[MB-12545] "Wire Protocol Between Query and Index" needs to be updated in 2i documentation Created: 04/Nov/14  Updated: 12/Nov/14

Status: Open
Project: Couchbase Server
Component/s: secondary-index
Affects Version/s: sherlock
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Prathibha Bisarahalli Assignee: Sriram Melkote
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
The 2i documentation for "Wire Protocol Between Query and Index" talks about HTTP based protocol but that is changed to use protobuf over TCP. The documentation at https://docs.google.com/document/d/1j9D4ryOi1d5CNY5EkoRuU_fc5Q3i_QwIs3zU9uObbJY/edit?usp=sharing needs to be updated.




[MB-12607] View Query Performance Created: 10/Nov/14  Updated: 17/Nov/14  Due: 31/Jan/15

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Task Priority: Blocker
Reporter: Sriram Melkote Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by MB-12266 Batch size for View Query is not tune... Resolved
is duplicated by MB-12419 stale=false view performance up to 10... Resolved
Gantt: start-finish
is triggered by MB-12375 Query latency on Windows Reopened
is triggered by MB-11840 3.0 (Beta): Views periodically take 2... Closed

 Description   
This is a meta bug to track all reported issues relating to view query, both in terms of latency and throughput. A mini project to significantly improve view query performance.

 Comments   
Comment by Volker Mische [ 11/Nov/14 ]
Issue MB-12375 isn't really a duplicate, but lead to this bug. MB-12375 is now focused to find out what's going on on Windows.
Comment by Volker Mische [ 11/Nov/14 ]
I should learn to read, this was never a duplicate. I think they are related though.




[MB-10059] Replica vbucket simply ignores rev_seq values of new items from the active vbucket. Created: 29/Jan/14  Updated: 19/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.2.0, 2.5.0
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: 2.5.1, musicservice
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Relates to
Triage: Triaged

 Description   
While debugging the customer issue, we found that the replica vbucket simply ignores rev_seq values of new items from the active vbucket, and instead generates new rev_seq values for those new items. This can cause XDCR to be in an inconsistent state, especially when a rebalance or failover happens.


 Comments   
Comment by Cihan Biyikoglu [ 07/Feb/14 ]
Repro:
1. 1 node
2. add 10K items
3. update 10K items 5 times
4. add node
5. rebalance
6. confirm revids on new node are 1. - expected revid6.

happens in every case where replica has to be built in the node. failover+rebalance, increasing replica count etc should do it.
Comment by Cihan Biyikoglu [ 10/Feb/14 ]
sending this over to Aruna for a reliable repro in the test suite.
Comment by Chiyoung Seo [ 13/Feb/14 ]
Pushed the fixes into gerrit for review:

http://review.couchbase.org/#/c/33501
http://review.couchbase.org/#/c/33502
http://review.couchbase.org/#/c/33503
Comment by Cihan Biyikoglu [ 13/Feb/14 ]
Hi Aruna, do we have this reproduced now?
thanks
Comment by Chiyoung Seo [ 13/Feb/14 ]
The fixes were merged into ep-engine 2.5.0 branch
Comment by Aruna Piravi [ 17/Feb/14 ]
Hi Cihan,

I wasn't able to get to this last week as I was working on CBSE-960 with Chiyoung. Parag will be working on this.

Thanks.
Comment by Maria McDuff (Inactive) [ 17/Feb/14 ]
Parag will be assigned to test this for 2.5.1 - he'll start on this tomorrow or the day after.
Comment by Parag Agarwal [ 20/Feb/14 ]

For Build 2.5.1-1062

Repro:
1. 1 node
2. add 10K items
3. update 10K items 5 times
4. add node
5. rebalance
6. confirm revids on new node are 1. - expected revid6.

This scenario passed

I also tried failover and rebalance and it passed, with the rev id remaining the same as before.
Comment by Parag Agarwal [ 25/Feb/14 ]
This was done on Centos 64
1. Install 2.5.0-1059 on nodes 172.23.105.45
2. Install 2.5.0-1059 on nodes 172.23.105.48
3. Create Bucket default with replica=1, and size=1000MB
4. Upgrade with MB 10059 by changing the ep.so
5. Add 100K items, and update thrice
6. Rebalance-out, Rebalance-in
7. Verify that active and replica are the same (i.e. rev id is 4 and not 1 in replica)

On step 7, we see replica is not as expected

example

For Key :: 22748

[As seen on active:: '4-00003c611237e2a90000000000000000', As seen on Replica:: '1-00003c611237e2a90000000000000000']


For verification, we created a view and query active vs replica

here are the steps

1. Create a view for the data bucket we want to analyze

function (doc, meta) {
    emit(meta.id, meta.rev, doc.key);
}

2. Run Queries for Active and Replica

For Active Query

http://172.23.105.45:8092/default/_design/dev_doc1/_view/test?stale=false&connection_timeout=60000&limit=10000&skip=0

For Replica Query

http://172.23.105.45:8092/default/_design/dev_doc1/_view/test?stale=false&connection_timeout=60000&limit=10000&_type=replica&skip=0

3. Use Active result to find out if the expected number of replicas exist along with key, rev id, and value

Comment by Parag Agarwal [ 25/Feb/14 ]
Will attach logs as well

https://s3.amazonaws.com/bugdb/jira/MB-100059/172.23.105.45-2252014-1317-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-100059/172.23.105.48-2252014-1319-diag.zip
Comment by Parag Agarwal [ 26/Feb/14 ]
I was able to repro the bug with a 2.5.0 build + upgrade with the ep.so provided by the upgrade package for MB-10059. The issue is reproduced on 172.23.105.45. Please take a look.
Comment by Chiyoung Seo [ 28/Feb/14 ]
There was an issue in building the hot fix. Parag verified the new hot fix.
Comment by Parag Agarwal [ 03/Mar/14 ]
I have verified the new fix given for MB 10059 functionally and ran the system test for kv xdcr as well. The fix looks good.
Comment by Parag Agarwal [ 03/Mar/14 ]
Tests pass now. Marking the bug as fixed. Beyond build 2.5.1-1062, this issue is fixed. Will be automating test cases around it as well.
Comment by Jeff Dillon [ 19/Nov/14 ]
Please see previous comment, thx




[MB-9415] auto-failover in seconds - (reduced from minimum 30 seconds) Created: 21/May/12  Updated: 22/Oct/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0, 1.8.1, 2.0, 2.0.1, 2.2.0, 3.0.1, 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Dipti Borkar Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 2
Labels: customer, ns_server-story
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-9416 Make auto-failover near immediate whe... Technical task Open Aleksey Kondratenko  

 Description   
including no false positives

http://www.pivotaltracker.com/story/show/25006101

 Comments   
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
At the very least it requires getting our timeout-ful cases under control. So at least splitting couchdb into separate VM is a requirement for this. But not necessarily enough.
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
Still seeing misunderstanding on this one.

So we have a _different_ problem: even manual failover (let alone automatic) cannot succeed quickly if the master node fails. It can easily take up to 2 minutes because of our use of the erlang "global" facility, which requires us to detect that the node is dead, and erlang is tuned to detect that within 2 minutes.

Now _this_ request is about lowering autofailover detection to 10 seconds. We can blindly make it happen today, but it will not be usable because of all sorts of timeouts happening in the cluster management layer. We have a significant proportion of CBSEs _today_ about false-positive autofailovers even with the 30-second threshold. Clearly lowering it to 10 will only make it worse; therefore my point above. We have to get those timeouts under control so that heartbeats (or whatever else we use to detect a node being unresponsive) are sent/received in a timely manner.

I would like to note however that especially in some (arguably oversubscribed) virtualized environments we saw delays as high as low tens of seconds from virtualization _alone_. Given the relatively high cost of failover in our software, I'd like to point out that people could too easily abuse this feature.

The high cost of failover referred to above is this:

* You almost certainly and irrecoverably lose some recent mutations. _At least_ recent mutations, i.e. if replication is really working well. On a node that's on the edge of autofailover you can imagine replication not being "diamond-hard quick". That's cost 1.

* In order to return the node back to the cluster (say the node crashed and needed some time to recover, whatever that might mean) you need a rebalance. That type of rebalance is relatively quick by design, i.e. it only moves data back to this node and nothing else, but it's still a rebalance. With UPR we can possibly make it better, because its failover log is capable of rewinding just the conflicting mutations.

What I'm trying to say in "our approach appears to have a relatively high price for failover" is that it appears to be an inherent issue for a strongly consistent system. In many cases it might actually be better to wait up to a few minutes for the node to recover and restore its availability than to fail it over and pay the price of restoring cluster capacity (by rebalancing this node, or its replacement, back in). If somebody wants stronger availability, then other approaches that can "reconcile" changes from both the failed-over node and its replacement node look like a fundamentally better choice _for those requirements_.




[MB-12465] Reverse iteration issues Created: 27/Oct/14  Updated: 20/Nov/14

Status: Open
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: bug-backlog
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Chiyoung Seo Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
The following issues were identified by Jens while he was trying to implement the reverse view iteration in Couchbase Lite:

I'm adding support for reverse view iteration in Couchbase Lite (i.e. the descending=true option in the view REST API), now that ForestDB has implemented fdb_iterator_prev. But it seems like this function isn't enough. In general, after creating the iterator I need to call fdb_iterator_seek so that it'll start at the end_key. But there are two problems:

1. I can't just seek to the end_key, because this seeks to before that key, so the first call to fdb_iterator_prev wouldn't return that key but the one before it. The workaround seems to be to modify the key I seek to, basically generating the "next key" in collation order by incrementing the last byte, with overflows carrying over to previous bytes (sketched below). Ugly but doable.

2. If the client didn't specify an end_key at all, it means they want to start from the highest key in the database. Now I'm stuck — I don't know what that key is. The only workaround I can think of is to generate a key filled with FF bytes, of length FDB_MAX_KEYLEN. Yuck! Even worse, if there happens to be such a key in the database, I'll still miss it due to the behavior of fdb_iterator_seek.
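
For reference, a minimal C sketch of the "next key in collation order" workaround from point 1, assuming keys collate as unsigned byte strings (the helper name is hypothetical, not part of ForestDB):

#include <stddef.h>
#include <stdint.h>

/* Increment the last byte of `key`, carrying overflows into earlier bytes.
 * The result sorts after the original key and after any longer key that has
 * the original as a prefix. Returns -1 if every byte was 0xFF, i.e. no key
 * of this length can sort after the input. */
static int next_key_in_order(uint8_t *key, size_t keylen)
{
    size_t i = keylen;
    while (i-- > 0) {
        if (key[i] != 0xFF) {
            key[i]++;
            return 0;
        }
        key[i] = 0x00;   /* overflow: carry into the previous byte */
    }
    return -1;
}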


To address the above issues, we will extend the iterator APIs in the following way:

fdb_iterator_init(fdb_handle *handle,
                          fdb_iterator **iterator,
                          const void *min_key,
                          size_t min_keylen,
                          const void *max_key,
                          size_t max_keylen,
                          fdb_iterator_opt_t opt); --> We just rename start_key (keylen) and end_key (keylen) to min_key (keylen) and max_key (keylen), respectively.


fdb_iterator_sequence_init(fdb_handle *handle,
                                           fdb_iterator **iterator,
                                           const fdb_seqnum_t min_seq,
                                           const fdb_seqnum_t max_seq,
                                           fdb_iterator_opt_t opt); --> We just rename start_seq and end_seq to min_seq and max_seq, respectively.



fdb_iterator_seek(fdb_iterator *iterator, const void *seek_key, const size_t seek_keylen); —> The iterator positions at a seek key or at a next sorted key if a seek key doesn’t exist.

fdb_iterator_seek_to_min(fdb_iterator *iterator); —> The iterator positions at the min key.

fdb_iterator_seek_to_max(fdb_iterator *iterator); —> The iterator positions at the max key.

fdb_iterator_next(fdb_iterator *iterator); —> The iterator moves to the next key. But, this function doesn’t return any item to the caller.

fdb_iterator_prev(fdb_iterator *iterator); —> The iterator moves to the previous key. But, this function doesn’t return any item to the caller.

fdb_iterator_get(fdb_iterator *iterator, fdb_doc **doc); —> Return an item (key, metadata, value) that is currently pointed to by the iterator.

fdb_iterator_get_metaonly(fdb_iterator *iterator, fdb_doc **doc); —> Return an item's metadata (key, metadata) that is currently pointed to by the iterator.

For example,

do {
    fdb_doc *doc = NULL;
    fdb_iterator_get(iterator, &doc);
    if (doc) {
        printf("key: %s, metadata: %s, value: %s\n", doc->key, doc->meta, doc->value);
        fdb_doc_free(doc);
    }
} while (fdb_iterator_next(iterator) != FDB_RESULT_ITERATOR_FAIL);
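
A matching descending-iteration sketch with the proposed API, covering the reverse-scan case Jens describes above (error handling omitted; `handle` is assumed to be an open fdb_handle, and FDB_ITR_NONE / fdb_iterator_close() are assumed to stay as in the current public header):

fdb_iterator *iterator;
fdb_doc *doc;

/* No min/max keys: cover the whole key range, then walk it backwards. */
fdb_iterator_init(handle, &iterator, NULL, 0, NULL, 0, FDB_ITR_NONE);
fdb_iterator_seek_to_max(iterator);    /* start at the highest existing key */

do {
    doc = NULL;
    fdb_iterator_get(iterator, &doc);
    if (doc) {
        printf("key: %s, metadata: %s, value: %s\n", doc->key, doc->meta, doc->value);
        fdb_doc_free(doc);
    }
} while (fdb_iterator_prev(iterator) != FDB_RESULT_ITERATOR_FAIL);

fdb_iterator_close(iterator);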


 Comments   
Comment by Sundar Sridharan [ 13/Nov/14 ]
New semantics uploaded for review at http://review.couchbase.org/#/c/42859/ thanks




[MB-12676] Views timeout after failover Created: 16/Nov/14  Updated: 24/Nov/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, view-engine
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Sriram Melkote
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.5.0-256, centos 6x

1:10.3.5.115
2:10.3.5.116
3:10.3.5.117
4:10.3.5.118
5:10.6.2.185
6:10.6.2.186
7:10.5.3.5


Issue Links:
Duplicate
duplicates MB-12697 QueryViewException: Error occured que... Resolved
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-12676/10.3.5.115-11162014-1512-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12676/10.3.5.116-11162014-1514-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12676/10.3.5.117-11162014-1515-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12676/10.3.5.118-11162014-1516-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12676/10.5.3.5-11162014-1520-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12676/10.6.2.185-11162014-1517-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12676/10.6.2.186-11162014-1519-diag.zip
Is this a Regression?: Yes

 Description   
Test Case::

./testrunner -i centos_x64--01_01--failover_upr.ini -t failover.failovertests.FailoverTests.test_failover_stop_server,replicas=1,graceful=False,num_failed_nodes=1,numViews=5,withViewsOps=True,createIndexesDuringFailover=True,items=100000,active_resident_threshold=70,dgm_run=True,failoverMaster=True,skip_cleanup=True,GROUP=P0

1. Create 7 node cluster
2. Create default bucket with 100K items
3. Create 5 views
4. Stop 1 node and hard failover
5. Run queries to create indexes in parallel to step 4

Step 5 fails with timeout. Expected results not returned.

2014-11-16 14:56:35 | INFO | MainProcess | Cluster_Thread | [task.check] Server: 10.3.5.116, Design Doc: dev_ddoc1, View: default_view2, (100000 rows) expected, (83994 rows) returned
ERROR
[('/usr/lib/python2.7/threading.py', 524, '__bootstrap', 'self.__bootstrap_inner()'), ('/usr/lib/python2.7/threading.py', 551, '__bootstrap_inner', 'self.run()'), ([('/usr/lib/python2.7/threading.py', 524, '__bootstrap', 'self.__bootstrap_inner()'), ('/usr/lib/python2.7/threading.py', 551, '__bootstrap_inner', 'self.run()'), ('./testrunner.py', 262, 'run', '**self._Thread__kwargs)'), ('/usr/lib/python2.7/unittest/runner.py', 151, 'run', 'test(result)'), ('/usr/lib/python2.7/unittest/case.py', 391, '__call__', 'return self.run(*args, **kwds)'), ('/usr/lib/python2.7/unittest/case.py', 327, 'run', 'testMethod()'), ('pytests/failover/failovertests.py', 25, 'test_failover_stop_server', "self.common_test_body('stop_server')"), ('pytests/failover/failovertests.py', 100, 'common_test_body', 'self.run_failover_operations_with_ops(self.chosen, failover_reason)'), ('pytests/failover/failovertests.py', 408, 'run_failover_operations_with_ops', 'self.query_and_monitor_view_tasks(nodes)'), ('pytests/failover/failovertests.py', 538, 'query_and_monitor_view_tasks', 'self.verify_query_task()'), ('pytests/failover/failovertests.py', 562, 'verify_query_task', 'self.perform_verify_queries(num_views, prefix, ddoc_name, query, bucket=bucket, wait_time=timeout, expected_rows=expected_rows)'), ('pytests/basetestcase.py', 778, 'perform_verify_queries', 'task.result(wait_time)'), ('lib/tasks/future.py', 160, 'result', 'return self.__get_result()'), ('lib/tasks/future.py', 111, '__get_result', 'print traceback.extract_stack()')]
Error occured querying view default_view2: {u'reason': u'lexical error: invalid char in json text.\n', u'from': u'http://10.6.2.185:8092/_view_merge/?stale=false'}

Seen timeouts with graceful failover as well. Following tests seen failing

test_failover_firewall,replicas=1,graceful=False,num_failed_nodes=1,items=100000,active_resident_threshold=70,dgm_run=True,doc_ops=update,withMutationOps=true,withQueries=True,numViews=5,withViewsOps=True,GROUP=P0
test_failover_normal,replicas=1,graceful=False,num_failed_nodes=1,items=100000,active_resident_threshold=70,dgm_run=True,withQueries=True,numViews=5,withViewsOps=True,GROUP=P0
test_failover_stop_server,replicas=1,graceful=False,num_failed_nodes=1,numViews=5,withViewsOps=True,createIndexesDuringFailover=True,items=100000,active_resident_threshold=70,dgm_run=True,failoverMaster=True,GROUP=P0
test_failover_stop_server,replicas=1,graceful=False,num_failed_nodes=1,numViews=5,withViewsOps=True,createIndexesDuringFailover=True,items=100000,active_resident_threshold=70,dgm_run=True,failoverMaster=True,GROUP=P0
test_failover_stop_server,replicas=1,graceful=False,num_failed_nodes=1,items=100000,active_resident_threshold=70,dgm_run=True,withQueries=True,numViews=5,withViewsOps=True,max_verify=10000,GROUP=P0
test_failover_then_add_back,replicas=1,num_failed_nodes=1,items=100000,numViews=5,withViewsOps=True,createIndexesDuringFailover=True,sasl_buckets=1,upr_check=False,recoveryType=full,graceful=True,GROUP=P0;GRACEFUL
test_failover_then_add_back,replicas=1,num_failed_nodes=1,items=100000,numViews=5,withViewsOps=True,createIndexesDuringFailover=True,sasl_buckets=1,upr_check=False,recoveryType=delta,graceful=True,GROUP=P0;GRACEFUL
test_failover_then_add_back,replicas=1,num_failed_nodes=1,items=100000,numViews=5,compact=True,withViewsOps=True,createIndexesDuringFailover=True,sasl_buckets=1,upr_check=False,recoveryType=delta,graceful=True,GROUP=P1;GRACEFUL

 Comments   
Comment by Aleksey Kondratenko [ 17/Nov/14 ]
I don't see anything suspicious in logs except this error:

[couchdb:error,2014-11-16T13:15:21.399,couchdb_ns_1@127.0.0.1:<0.21006.0>:couch_log:error:44]Set view `default`, main (prod) group `_design/dev_ddoc1`, DCP process <0.21014.0> died with unexpected reason:
{{case_clause,{{error,vbucket_stream_not_found},{bufsocket,#Port<0.17642>,<<>>}}},
 [{couch_dcp_client,init,1,
   [{file,"/home/buildbot/jenkins/workspace/sherlock-testing/couchdb/src/couch_dcp/src/couch_dcp_client.erl"},{line,305}]},
  {couch_dcp_client,restart_worker,1,
   [{file,"/home/buildbot/jenkins/workspace/sherlock-testing/couchdb/src/couch_dcp/src/couch_dcp_client.erl"},{line,1433}]},
  {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}


Whether it's the cause of the timeouts or not I cannot say, but I'm passing this to the view engine because it's likely the place where something gets stuck.
Comment by Volker Mische [ 24/Nov/14 ]
This looks like a duplicate of MB-12697. Once that one is closed we should rerun that one again.




Mac version update check is incorrectly identifying newest version (MB-10214)

[MB-12051] Update the Release_Server job on Jenkins to include updating the file (membasex.xml) and the download URL Created: 22/Aug/14  Updated: 30/Oct/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0.1, 2.2.0, 2.1.1, 2.5.0
Fix Version/s: 3.0.2
Security Level: Public

Type: Technical task Priority: Blocker
Reporter: Wayne Siu Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We can update the Release_Server job on Jenkins to create an updated version of this XML file from a template, and upload it to S3.




[MB-7250] Mac OS X App should be signed by a valid developer key Created: 22/Nov/12  Updated: 25/Nov/14

Status: In Progress
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0-beta-2, 2.1.0, 2.2.0, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: J Chris Anderson Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Build_2.5.0-950.png     PNG File Screen Shot 2013-02-17 at 9.17.16 PM.png     PNG File Screen Shot 2013-04-04 at 3.57.41 PM.png     PNG File Screen Shot 2013-08-22 at 6.12.00 PM.png     PNG File ss_2013-04-03_at_1.06.39 PM.png    
Issue Links:
Dependency
depends on MB-9437 macosx installer package fails during... Closed
Duplicate
is duplicated by MB-12319 [OS X] Check for Updates upgrade does... Resolved
is duplicated by MB-12345 Version 3.0.0-1209-rel prompts for up... Closed
Relates to
relates to CBLT-104 Enable Mac developer signing on Mac b... Open
Is this a Regression?: No

 Description   
Currently launching the Mac OS X version tells you it's from an unidentified developer. You have to right click to launch the app. We can fix this.

 Comments   
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
Chris,

do you know what needs to change on the build machine to embed our developer key ?
Comment by J Chris Anderson [ 22/Nov/12 ]
I have no idea. I could start researching how to get a key from Apple but maybe after the weekend. :)
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
we can discuss this next week : ) . Thanks for reporting the issue Chris.
Comment by Steve Yen [ 26/Nov/12 ]
we'll want separate, related bugs (tasks) for other platforms, too (windows, linux)
Comment by Jens Alfke [ 30/Nov/12 ]
We need to get a developer ID from Apple; this will give us some kind of cert, and a local private key for signing.
Then we need to figure out how to get that key and cert onto the build machine, in the Keychain of the account that runs the buildbot.
Comment by Farshid Ghods (Inactive) [ 02/Jan/13 ]
the instructions to build is available here :
https://github.com/couchbase/couchdbx-app
we need to add codesign as a build step there
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Phil,

do you have any update on this ticket. ?
Comment by Phil Labee (Inactive) [ 22/Jan/13 ]
I have signing cert installed on 10.17.21.150 (MacBuild).

Change to Makefile: http://review.couchbase.org/#/c/24149/
Comment by Phil Labee (Inactive) [ 23/Jan/13 ]
need to change master.cfg and pass env.var. to package-mac
Comment by Phil Labee (Inactive) [ 29/Jan/13 ]
disregard previous. Have added signing to Xcode projects.

see http://review.couchbase.org/#/c/24273/
Comment by Phil Labee (Inactive) [ 31/Jan/13 ]
To test this go to System Preferences / Security & Privacy, and on the General tab set "Allow applications downloaded from" to "Mac App Store and Identified Developers". Set this before running Couchbase Server.app the first time. Once an app has been allowed to run this setting is no longer checked for that app, and there doesn't seem to be a way to reset that.

What is odd is that on my system, I allowed one unsigned build to run before restricting the app run setting, and then no other unsigned builds would be checked (and would all be allowed to run). Either there is a flaw in my testing methodology, or a serious weakness in this security setting: just because one app called Couchbase Server was allowed to run shouldn't confer this privilege to other apps with the same name. A common malware tactic is to modify a trusted app and distribute it as an update, and if the security setting keys off the app name it will do nothing to prevent that.

I'm approving this change without having satisfactorily tested it.
Comment by Jens Alfke [ 31/Jan/13 ]
Strictly speaking it's not the app name but its bundle ID, i.e. "com.couchbase.CouchbaseServer" or whatever we use.

> I allowed one unsigned build to run before restricting the app run setting, and then no other unsigned builds would be checked

By OK'ing an unsigned app you're basically agreeing to toss security out the window, at least for that app. This feature is really just a workaround for older apps. By OK'ing the app you're not really saying "yes, I trust this build of this app" so much as "yes, I agree to run this app even though I don't trust it".

> A common malware tactic is to modify a trusted app and distribute it as update

If it's a trusted app it's hopefully been signed, so the user wouldn't have had to waive signature checking for it.
Comment by Jens Alfke [ 31/Jan/13 ]
Further thought: It might be a good idea to change the bundle ID in the new signed version of the app, because users of 2.0 with strict security settings have presumably already bypassed security on the unsigned version.
Comment by Jin Lim [ 04/Feb/13 ]
Per bug scrubs, keep this a blocker since customers ran into this issues (and originally reported it).
Comment by Phil Labee (Inactive) [ 06/Feb/13 ]
revert the change so that builds can complete. App is currently not being signed.
Comment by Farshid Ghods (Inactive) [ 11/Feb/13 ]
i suggest for 2.0.1 release we do this build manually.
Comment by Jin Lim [ 11/Feb/13 ]
As one-off fix, add the signature manually and automate the required steps later in 2.0.2 or beyond.
Comment by Jin Lim [ 13/Feb/13 ]
Please move this bug to 2.0.2 after populating the required signature manually. I am lowering the severity to critical since it is no longer a blocking issue.
Comment by Farshid Ghods (Inactive) [ 15/Feb/13 ]
Phil to upload the binary to latestbuilds , ( 2.0.1-101-rel.zip )
Comment by Phil Labee (Inactive) [ 15/Feb/13 ]
Please verify:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee (Inactive) [ 15/Feb/13 ]
uploaded:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip

I can rename it when uploading for release.
Comment by Farshid Ghods (Inactive) [ 17/Feb/13 ]
I still do get the error that it is from an unidentified developer.

Comment by Phil Labee (Inactive) [ 18/Feb/13 ]
operator error.

I rebuilt the app, this time verifying that the codesign step occurred.

Uploaded new file to same location:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee (Inactive) [ 26/Feb/13 ]
still need to perform manual workaround
Comment by Phil Labee (Inactive) [ 04/Mar/13 ]
release candidate has been uploaded to:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip
Comment by Wayne Siu [ 03/Apr/13 ]
Phil, looks like version 172/185 is still getting the error. My Mac version is 10.8.2
Comment by Thuan Nguyen [ 03/Apr/13 ]
Install couchbase server (build 2.0.1-172 community version) in my mac osx 10.7.4 , I only see the warning message
Comment by Wayne Siu [ 03/Apr/13 ]
Latest version (04.03.13) : http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.zip
Comment by Maria McDuff (Inactive) [ 03/Apr/13 ]
works in 10.7 but not in 10.8.
if we can get the fix for 10.8 by tomorrow, end of day, QE is willing to test for release on tuesday, april 9.
Comment by Phil Labee (Inactive) [ 04/Apr/13 ]
The mac builds are not being automatically signed, so build 185 is not signed. The original 172 is also not signed.

Did you try

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip

to see if that was signed correctly?

Comment by Wayne Siu [ 04/Apr/13 ]
Phil,
Yes, we did try the 172-signed version. It works on 10.7 but not 10.8. Can you take a look?
Comment by Phil Labee (Inactive) [ 04/Apr/13 ]
I rebuilt 2.0.1-185 and uploaded a signed app to:

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.SIGNED.zip

Test on a machine that has never had Couchbase Server installed, and has the security setting to only allow Appstore or signed apps.

If you get the "Couchbase Server.app was downloaded from the internet" warning and you can click OK and install it, then this bug is fixed. The quarantining of files downloaded by a browser is part of the operating system and is not controlled by signing.
Comment by Wayne Siu [ 04/Apr/13 ]
Tried the 185-signed version (see attached screen shot). Same error message.
Comment by Phil Labee (Inactive) [ 04/Apr/13 ]
This is not an error message related to this bug.

Comment by Maria McDuff (Inactive) [ 14/May/13 ]
per bug triage, we need to have mac 10.8 osx working since it is a supported platform (published in the website).
Comment by Wayne Siu [ 29/May/13 ]
Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Anil Kumar [ 31/May/13 ]
We need to address the signing key for both Windows and Mac; deferring this to the next release.
Comment by Dipti Borkar [ 08/Aug/13 ]
Please let's make sure this is fixed in 2.2.
Comment by Phil Labee (Inactive) [ 16/Aug/13 ]
New keys will be created using new account.
Comment by Phil Labee (Inactive) [ 20/Aug/13 ]
iOS Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=iOS Distribution expires Aug 12, 2014

    ~buildbot/Desktop/appledeveloper.couchbase.com/certs/ios/ios_distribution_appledeveloper.couchbase.com.cer

Identifiers:
  App IDS:
    "Couchbase Server" id=com.couchbase.*

Provisioning Profiles:
  Distribution:
    "appledeveloper.couchbase.com" type=Distribution

  ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/ios/appledevelopercouchbasecom.mobileprovision
Comment by Phil Labee (Inactive) [ 20/Aug/13 ]
Mac Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)
    "Couchbase, Inc." type=Developer ID installer (Aug,16,2014)
    "Couchbase, Inc." type=Developer ID Application (Aug,16,2014)
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)

     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developerID_installer.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developererID_application.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution-2.cer

Identifiers:
  App IDs:
    "Couchbase Server" id=couchbase.com.* Prefix=N2Q372V7W2
    "Coucbase Server adhoc" id=couchbase.com.* Prefix=N2Q372V7W2
    .

Provisioning Profiles:
  Distribution:
    "appstore.couchbase.com" type=Distribution
    "Couchbase Server adhoc" type=Distribution

     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/appstorecouchbasecom.privisioningprofile
     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/Couchbase_Server_adhoc.privisioningprofile

Comment by Phil Labee (Inactive) [ 21/Aug/13 ]

As of build 2.2.0-806 the app is signed by a new provisioning profile
Comment by Phil Labee (Inactive) [ 22/Aug/13 ]
 Install version 2.2.0-806 on a macosx 10.8 machine that has never had Couchbase Server installed, which has the security setting to require applications to be signed with a developer ID.
Comment by Phil Labee (Inactive) [ 22/Aug/13 ]
please assign to tester
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
just tried this against newest build 809:
still getting restriction message. see attached.
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
restriction still exists.
Comment by Maria McDuff (Inactive) [ 28/Aug/13 ]
verified in rc1 (build 817). still not fixed. getting same msg:
“Couchbase Server” can’t be opened because it is from an unidentified developer.
Your security preferences allow installation of only apps from the Mac App Store and identified developers.

Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Phil Labee (Inactive) [ 03/Sep/13 ]
Need to create new certificates to replace these that were revoked:

Certificate: Mac Development
Team Name: Couchbase, Inc.

Certificate: Mac Installer Distribution
Team Name: Couchbase, Inc.

Certificate: iOS Development
Team Name: Couchbase, Inc.

Certificate: iOS Distribution
Team Name: Couchbase, Inc.
Comment by Maria McDuff (Inactive) [ 18/Sep/13 ]
candidate for 2.2.1 bug fix release.
Comment by Dipti Borkar [ 28/Oct/13 ]
Is this going to make it into 2.5? We seem to keep deferring it.
Comment by Phil Labee (Inactive) [ 29/Oct/13 ]
cannot test changes with installer that fails
Comment by Phil Labee (Inactive) [ 11/Nov/13 ]
Installed certs as buildbot and signed app with "(recommended) 3rd Party Mac Developer Application", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-001.zip

Signed with "(Oct 30) 3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-002.zip

These zip files were made on the command line, not as a result of the make command. They are 2.5G in size, so they obviously include more than the zip files produced by the make command.

Both versions of the app appear to be signed correctly!

Note: cannot run make command from ssh session. Must Remote Desktop in and use terminal shell natively.
Comment by Phil Labee (Inactive) [ 11/Nov/13 ]
Finally, some progress: If the zip file is made using the --symlinks argument it appears to be un-signed. If the symlinked files are included, the app appears to be signed correctly.

The zip file with symlinks is 60M, while the zip file with copies of the files is 2.5G, more than 40X the size.
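
If the symlinks themselves are the problem, one option worth trying (a sketch, not verified here) is Apple's ditto tool, which preserves symlinks and resource metadata when archiving a bundle:

    # -c -k: create a PKZip archive; --keepParent keeps the .app wrapper directory
    ditto -c -k --sequesterRsrc --keepParent "Couchbase Server.app" couchbase-server.zip

That might give a symlink-sized archive without breaking the signature, but it would still need the same codesign/spctl verification afterwards.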
Comment by Phil Labee (Inactive) [ 25/Nov/13 ]
Fixed in 2.5.0-950
Comment by Dipti Borkar [ 25/Nov/13 ]
Maria, can QE please verify this?
Comment by Wayne Siu [ 28/Nov/13 ]
Tested with build 2.5.0-950. Still see the warning box (attached).
Comment by Wayne Siu [ 19/Dec/13 ]
Phil,
Can you give an update on this?
Comment by Ashvinder Singh [ 14/Jan/14 ]
I tested the code signature with the Apple utility "spctl -a -v /Applications/Couchbase\ Server.app/" and got the output:
>>> /Applications/Couchbase Server.app/: a sealed resource is missing or invalid

also tried running the command:
 
bash: codesign -dvvvv /Applications/Couchbase\ Server.app
>>>
Executable=/Applications/Couchbase Server.app/Contents/MacOS/Couchbase Server
Identifier=com.couchbase.couchbase-server
Format=bundle with Mach-O thin (x86_64)
CodeDirectory v=20100 size=639 flags=0x0(none) hashes=23+5 location=embedded
Hash type=sha1 size=20
CDHash=868e4659f4511facdf175b44a950b487fa790dc4
Signature size=4355
Authority=3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)
Authority=Apple Worldwide Developer Relations Certification Authority
Authority=Apple Root CA
Signed Time=Jan 8, 2014, 10:59:16 AM
Info.plist entries=31
Sealed Resources version=1 rules=4 files=5723
Internal requirements count=1 size=216

It looks like the code signature is present but became invalid as new files were added/modified in the project. I suggest the build team rebuild and add the code signature again.
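
For reference, a re-sign plus re-verify cycle might look like the sketch below (the identity string is the one reported by codesign above; whether that identity or the Developer ID one is the right choice is a separate question):

    codesign --force --deep --sign "3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)" "/Applications/Couchbase Server.app"
    codesign --verify --deep --verbose "/Applications/Couchbase Server.app"
    spctl -a -v "/Applications/Couchbase Server.app"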
Comment by Phil Labee (Inactive) [ 17/Apr/14 ]
need VM to clone for developer experimentation
Comment by Anil Kumar [ 18/Jul/14 ]
Any update on this? We need this for 3.0.0 GA.

Please update the ticket.

Triage - July 18th
Comment by Wayne Siu [ 02/Aug/14 ]
Siri is helping to figure out what the next step is.
Comment by Anil Kumar [ 13/Aug/14 ]
Jens - Assigning as per Ravi's request.
Comment by Chris Hillery [ 13/Aug/14 ]
Jens requested assistance in setting up a MacOS development environment for building Couchbase. Phil (or maybe Siri?), can you help him with that?
Comment by Phil Labee (Inactive) [ 13/Aug/14 ]
The production macosx builder has been cloned:

    10.6.2.159 macosx-x64-server-builder-01-clone

if you want to use your own host, see:

    http://hub.internal.couchbase.com/confluence/display/CR/How+to+Setup+a+MacOSX+Server+Build+Node
Comment by Jens Alfke [ 15/Aug/14 ]
Here are the Apple docs on building apps signed with a Developer ID: https://developer.apple.com/library/mac/documentation/IDEs/Conceptual/AppDistributionGuide/DistributingApplicationsOutside/DistributingApplicationsOutside.html#//apple_ref/doc/uid/TP40012582-CH12-SW2

I've got everything configured, but the build process fails at the final step, after I press the Distribute button in the Organizer window. I get a very uninformative error alert, "Code signing operation failed / Check that the identity you selected is valid."

I've asked for help on the xcode-users mailing list. Blocked until I hear something back.
Comment by Anil Kumar [ 18/Aug/14 ]
Triage - Not blocking 3.0 RC1
Comment by Phil Labee (Inactive) [ 25/Aug/14 ]
from Apple Developer mail list:

Dear Developer,

With the release of OS X Mavericks 10.9.5, the way that OS X recognizes signed apps will change. Signatures created with OS X Mountain Lion 10.8.5 or earlier (v1 signatures) will be obsoleted and Gatekeeper will no longer recognize them. Users may receive a Gatekeeper warning and will need to exempt your app to continue using it. To ensure your apps will run without warning on updated versions of OS X, they must be signed on OS X Mavericks 10.9 or later (v2 signatures).

If you build code with an older version of OS X, use OS X Mavericks 10.9 or later to sign your app and create v2 signatures using the codesign tool. Structure your bundle according to the signature evaluation requirements for OS X Mavericks 10.9 or later. Considerations include:

 * Signed code should only be placed in directories where the system expects to find signed code.

 * Resources should not be located in directories where the system expects to find signed code.

 * The --resource-rules flag and ResourceRules.plist are not supported.

Make sure your current and upcoming releases work properly with Gatekeeper by testing on OS X Mavericks 10.9.5 and OS X Yosemite 10.10 Developer Preview 5 or later. Apps signed with v2 signatures will work on older versions of OS X.

For more details, read “Code Signing changes in OS X Mavericks” and “Changes in 
OS X 10.9.5 and Yosemite Developer Preview 5” in OS X Code Signing In Depth":

    http://c.apple.com/r?v=2&la=en&lc=us&a=EEjRsqZNfcheZauIAhlqmxVG35c6HJuf50mGu47LWEktoAjykEJp8UYqbgca3uWG&ct=AJ0T0e3y2W

Best regards,
Apple Developer Technical Support
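
A quick way to check for the strict (v2) validation described above, assuming a 10.9.5 or 10.10 machine, would be something like:

    codesign --verify --deep --strict --verbose=2 "/Applications/Couchbase Server.app"
    spctl --assess --type execute --verbose "/Applications/Couchbase Server.app"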
Comment by Phil Labee (Inactive) [ 28/Aug/14 ]
change to buildbot-internal to unlock keychain before running make and lock after:

    http://review.couchbase.org/#/c/41028/

change to couchdbx-app to sign app, on dev branch "plabee/MB-7250":

    http://review.couchbase.org/#/c/41025/

change to manifest to use this dev branch for 3.0.1 builds:

    http://review.couchbase.org/#/c/41026/
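
The keychain handling around the make would presumably look something like this sketch (keychain path and password are placeholders):

    security unlock-keychain -p "$KEYCHAIN_PASSWORD" ~/Library/Keychains/login.keychain
    # ... run the make that builds and signs the app ...
    security lock-keychain ~/Library/Keychains/login.keychain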
Comment by Wayne Siu [ 29/Aug/14 ]
Moving it to 3.0.1.




[MB-12126] there is no manifest file on windows 3.0.1-1253 Created: 03/Sep/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0.1
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 r2 64-bit

Attachments: PNG File ss 2014-09-03 at 12.05.41 PM.png    
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.1-1253 on windows server 2008 r2 64-bit. There is no manifest file in the directory c:\Program Files\Couchbase\Server\



 Comments   
Comment by Chris Hillery [ 03/Sep/14 ]
Also true for 3.0 RC2 build 1205.
Comment by Chris Hillery [ 03/Sep/14 ]
(Side note: While fixing this, log onto build slaves and delete stale "server-overlay/licenses.tgz" file so we stop shipping that)
Comment by Anil Kumar [ 17/Sep/14 ]
Ceej - Any update on this?
Comment by Chris Hillery [ 18/Sep/14 ]
No, not yet.




[MB-10214] Mac version update check is incorrectly identifying newest version Created: 14/Feb/14  Updated: 25/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0.1, 2.2.0, 2.1.1, 3.0.1, 3.0
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: David Haikney Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified
Environment: Mac OS X

Attachments: PNG File upgrade_check.png    
Issue Links:
Duplicate
is duplicated by MB-12345 Version 3.0.0-1209-rel prompts for up... Closed
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-12051 Update the Release_Server job on Jenk... Technical task Open Chris Hillery  
Is this a Regression?: Yes

 Description   
Running the 2.1.1 version of Couchbase on a Mac, "check for latest version" reports that the latest version is already running (e.g. see attached screenshot).


 Comments   
Comment by Aleksey Kondratenko [ 14/Feb/14 ]
Definitely not a UI bug. It's using phone home to find out about upgrades. And I have no idea who owns that now.
Comment by Steve Yen [ 12/Jun/14 ]
got an email from ravi to look into this
Comment by Steve Yen [ 12/Jun/14 ]
Not sure if this is correct analysis, but I did a quick scan of what I think is the mac installer, which I think is...

  https://github.com/couchbase/couchdbx-app

It gets its version string by running a "git describe", in the Makefile here...

  https://github.com/couchbase/couchdbx-app/blob/master/Makefile#L1

Currently, a "git describe" on master branch returns...

  $ git describe
  2.1.1r-35-gf6646fa

...which is *kinda* close to the reported version string in the screenshot ("2.1.1-764-rel").

So, I'm thinking one fix needed would be a tagging (e.g., "git tag -a FOO -m FOO") of the couchdbx-app repository.

So, reassigning to Phil to do that appropriately.

Also, it looks like our mac installer is using an open-source packaging / installer / runtime library called "sparkle" (which might be a little under-maintained -- not sure).

  https://github.com/andymatuschak/Sparkle/wiki

The sparkle library seems to check for version updates by looking at the URL here...

  https://github.com/couchbase/couchdbx-app/blob/master/cb.plist.tmpl#L42

Which seems to either be...

  http://appcast.couchbase.com/membasex.xml

Or, perhaps...

  http://appcast.couchbase.com/couchbasex.xml

The appcast.couchbase.com appears to be actually an S3 bucket, off of our production couchbase AWS account. So those *.xml files need to be updated, as they currently have content that has older versions. For example, http://appcast.couchbase.com/couchbase.xml looks currently like...

    <rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle" version="2.0">
    <channel>
    <title>Updates for Couchbase Server</title>
    <link>http://appcast.couchbase.com/couchbase.xml</link>
    <description>Recent changes to Couchbase Server.</description>
    <language>en</language>
    <item>
    <title>Version 1.8.0</title>
    <sparkle:releaseNotesLink>
    http://www.couchbase.org/wiki/display/membase/Couchbase+Server+1.8.0
    </sparkle:releaseNotesLink>
    <!-- date -u +"%a, %d %b %Y %H:%M:%S GMT" -->
    <pubDate>Fri, 06 Jan 2012 16:11:17 GMT</pubDate>
    <enclosure url="http://packages.couchbase.com/1.8.0/Couchbase-Server-Community.dmg" sparkle:version="1.8.0" sparkle:dsaSignature="MCwCFAK8uknVT3WOjPw/3LkQpLBadi2EAhQxivxe2yj6EU6hBlg9YK/5WfPa5Q==" length="33085691" type="application/octet-stream"/>
    </item>
    </channel>
    </rss>

Not updating the xml files, though, probably causes no harm. Just that our osx users won't be pushed news on updates.
Comment by Phil Labee (Inactive) [ 12/Jun/14 ]
This has nothing to do with "git describe". There should be no place in the product where "git describe" is used to determine version info. See:

    http://hub.internal.couchbase.com/confluence/display/CR/Branching+and+Tagging

so there's definitely a bug in the Makefile.

The version update check seems to be out of date. The phone-home file is generated during:

    http://factory.hq.couchbase.com:8080/job/Product_Staging_Server/

but the process of uploading it is not automated.
Comment by Steve Yen [ 12/Jun/14 ]
Thanks for the links.

> This has nothing to do with "git describe".

My read of the Makefile makes me think, instead, that "git describe" is the default behavior unless it's overridden by the invoker of the make.

> There should be no place in the product that "git describe" should be used to determine version info. See:
> http://hub.internal.couchbase.com/confluence/display/CR/Branching+and+Tagging

It appears all this couchdbx-app / sparkle stuff predates that wiki page by a few years, so I guess it's inherited legacy.

Perhaps voltron / buildbot are not setting the PRODUCT_VERSION correctly before invoking the couchdbx-app make, which makes the Makefile default to 'git describe'?

    commit 85710d16b1c52497d9f12e424a22f3efaeed61e4
    Date: Mon Jun 4 14:38:58 2012 -0700

    Apply correct product version number
    
    Get version number from $PRODUCT_VERSION if it's set.
    (Buildbot and/or voltron will set this.)
    If not set, default to `git describe` as before.
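
So, assuming that commit is still in place, the expected invocation would be something like the sketch below (the version string is hypothetical); without PRODUCT_VERSION in the environment the Makefile falls back to `git describe`:

    PRODUCT_VERSION=2.5.0-1234-rel make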
    
> The version update check seems to be out of date.

Yes, that's right. The appcast files are out of date.

> The phone-home file is generated during:
> http://factory.hq.couchbase.com:8080/job/Product_Staging_Server/

I think appcast files for OSX / sparkle are a _different_ mechanism than the phone-home file, and an appcast XML file does not appear to be generated/updated by the Product_Staging_Server job.

But, I'm not an expert or really qualified on the details here -- this is just my opinion from a quick code scan, not from actually doing/knowing.

Comment by Wayne Siu [ 01/Aug/14 ]
Per PM (Anil), we should get this fixed by 3.0 RC1.
Raising the priority to Critical.
Comment by Wayne Siu [ 07/Aug/14 ]
Phil,
Please provide update.
Comment by Anil Kumar [ 12/Aug/14 ]
Triage - Upgrading to 3.0 Blocker

Comment by Wayne Siu [ 20/Aug/14 ]
Looks like we may have a short term "fix" for this ticket which Ceej and I have tested.
@Ceej, can you put in the details here?
Comment by Chris Hillery [ 20/Aug/14 ]
The file is hosted in S3, and we proved tonight that overwriting that file (membasex.xml) with a version containing updated version information and download URLs works as expected. We updated it to point to 2.2 for now, since that is the latest version with a freely-available download URL.

We can update the Release_Server job on Jenkins to create an updated version of this XML file from a template, and upload it to S3.
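
The upload step from the Jenkins job could be as simple as the sketch below (this assumes the S3 bucket is literally named appcast.couchbase.com and that the job has credentials for the production AWS account):

    aws s3 cp membasex.xml s3://appcast.couchbase.com/membasex.xml --acl public-read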

Assigning back to Wayne for a quick question: Do we support Enterprise edition for MacOS? If we do, then this solution won't be sufficient without more effort, because the two editions will need different Sparkle configurations for updates. Also, Enterprise edition won't be able to directly download the newer release, unless we provide a "hidden" URL for that (the download link on the website goes to a form).
Comment by Chris Hillery [ 14/Oct/14 ]
We manually uploaded a new version of membasex.xml when 3.0.0 was released, but as MB-12345 shows, it doesn't work correctly (it still thinks there's a new download even if you're running the released 3.0.0).

I do not anticipate being able to put more time into this issue in the near future.




[MB-12671] Support for CLI Tools: Node Services Created: 15/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Parag Agarwal Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: all


 Description   
With Sherlock, we will have node services as an option with add node and joinCluster. This should be reflected in the CLI tools as well.

Reference

https://github.com/couchbase/ns_server/blob/master/doc/api.txt


// adds node with given hostname to given server group with specified
// set of services
//
// services field is optional and defaults to kv,moxi
POST /pools/default/serverGroups/<group-uuid>/addNode
hostname=<hostname>&user=Administrator&password=asdasd&services=kv,n1ql,moxi
// same as serverGroups addNode endpoint, but for default server group
POST /controller/addNode
hostname=<hostname>&user=Administrator&password=asdasd&services=kv,n1ql,moxi
// joins _this_ node to cluster which member is given in hostname parameter
POST /node/controller/doJoinCluster
hostname=<hostname>&user=Administrator&password=asdasd&services=kv,n1ql,moxi
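
For example, the equivalent curl call that the CLI tools would need to wrap (host addresses and service list are illustrative) looks like:

    curl -u Administrator:asdasd -X POST http://192.168.0.1:8091/controller/addNode \
         -d 'hostname=192.168.0.2&user=Administrator&password=asdasd&services=kv,n1ql,moxi'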


 Comments   
Comment by Gerald Sangudi [ 15/Nov/14 ]
+Cihan.

Is it possible to use the more generic terms data, index, and query for these services? In particular, the term "n1ql" is a brand, not a feature. It can be changed by marketing.
Comment by Dave Finlay [ 26/Nov/14 ]
Bin planning to get to this next week (week beginning 12/1)




[MB-12673] [system-tests]items count mismatch uni-directional XDCR Created: 16/Nov/14  Updated: 26/Nov/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 3.0.2
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.2-1520

Attachments: PNG File uni.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
Uni-directional replication for 4 buckets from source 172.23.105.156 to destination 172.23.105.160:

AbRegNums
MsgsCalls
RevAB
UserInfo

Data load ran for more than 2 days.
During that time a large number of steps with different scenarios were performed.
More detailed steps can be found here:
https://github.com/couchbaselabs/couchbase-qe-docs/blob/master/system-tests/viber/build_3.0.2-1520/report.txt

The problem is that I cannot say at what stage the data discrepancy occurred, because I only check that the data match when I stop the data load (the last step).

result
source:
AbRegNums 1607045
MsgsCalls 33301
RevAB 35716338
UserInfo 292190

destination:
AbRegNums 1607045
MsgsCalls 33300
RevAB 35716351
UserInfo 292190

diff <(curl http://172.23.105.156:8092/MsgsCalls/_design/docs/_view/docs?inclusive_end=true&stale=false&connection_timeout=60000&skip=0) <(curl http://172.23.105.160:8092/MsgsCalls/_design/docs/_view/docs?inclusive_end=true&stale=update_after&connection_timeout=60000&skip=0)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2664k 0 2664k 0 0 680k 0 --:--:-- 0:00:03 --:--:-- 680k
100 2664k 0 2664k 0 0 321k 0 --:--:-- 0:00:08 --:--:-- 588k
1c1
< {"total_rows":33301,"rows":[
---
> {"total_rows":33300,"rows":[
33244d33243
< {"id":"MSG_owmiixgxuqiptrwjjhzgorkfsvrxcgwsrlmrtxkp_myiunhwwynfqjobdtwffjwoic","key":null,"value":null},

so, "MSG_owmiixgxuqiptrwjjhzgorkfsvrxcgwsrlmrtxkp_myiunhwwynfqjobdtwffjwoic" exists on src, doesn't on dest

just in case, leave the cluster alive for investigation for a few days


 Comments   
Comment by Andrei Baranouski [ 16/Nov/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.156-11162014-214-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.157-11162014-235-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.158-11162014-224-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.160-11162014-37-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.206-11162014-33-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.207-11162014-310-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.22-11162014-254-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12673/fc3ae2d4/172.23.105.159-11162014-245-diag.zip

Comment by Mike Wiederhold [ 17/Nov/14 ]
The expiry pager is running on the destination cluster. This needs to be turned off.

Mikes-MacBook-Pro:ep-engine mikewied$ management/cbstats 172.23.105.207:11210 -b MsgsCalls all | grep exp
 ep_exp_pager_stime: 3600
 ep_expired_access: 0
 ep_expired_pager: 19655
 ep_item_flush_expired: 0
 ep_num_expiry_pager_runs: 53
 vb_active_expired: 19547
 vb_pending_expired: 0
 vb_replica_expired: 108
Comment by Andrei Baranouski [ 18/Nov/14 ]
Sorry Mike, it's not clear to me.

I didn't run any expiry pagers in the tests. Why do I need to turn something off? I used the default settings for the clusters.
When you say "The expiry pager is running on the destination cluster", does it mean that it should complete and then the items should match? That does not occur.

[root@centos-64-x64 bin]# ./cbstats 172.23.105.207:11210 -b MsgsCalls all | grep exp
 ep_exp_pager_stime: 3600
 ep_expired_access: 0
 ep_expired_pager: 19655
 ep_item_flush_expired: 0
 ep_num_expiry_pager_runs: 105
 vb_active_expired: 19547
 vb_pending_expired: 0
 vb_replica_expired: 108
[root@centos-64-x64 bin]# ./cbstats 172.23.105.156:11210 -b MsgsCalls all | grep exp
 ep_exp_pager_stime: 3600
 ep_expired_access: 0
 ep_expired_pager: 8
 ep_item_flush_expired: 0
 ep_num_expiry_pager_runs: 135
 vb_active_expired: 8
 vb_pending_expired: 0
 vb_replica_expired: 0
Comment by Andrei Baranouski [ 18/Nov/14 ]
diff <(curl http://172.23.105.156:8092/MsgsCalls/_design/docs/_view/docs?inclusive_end=true&stale=false&connection_timeout=60000&skip=0) <(curl http://172.23.105.160:8092/MsgsCalls/_design/docs/_view/docs?inclusive_end=true&stale=update_after&connection_timeout=60000&skip=0)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2664k 0 2664k 0 0 680k 0 --:--:-- 0:00:03 --:--:-- 680k
100 2664k 0 2664k 0 0 321k 0 --:--:-- 0:00:08 --:--:-- 588k
1c1
< {"total_rows":33301,"rows":[
---
> {"total_rows":33300,"rows":[
33244d33243
< {"id":"MSG_owmiixgxuqiptrwjjhzgorkfsvrxcgwsrlmrtxkp_myiunhwwynfqjobdtwffjwoic","key":null,"value":null},

key "MSG_owmiixgxuqiptrwjjhzgorkfsvrxcgwsrlmrtxkp_myiunhwwynfqjobdtwffjwoic" doesn't exist on dest, exists on source
Comment by Mike Wiederhold [ 18/Nov/14 ]
Andrei,

The stat ep_num_expiry_pager_runs shows that the expiry pager is running. You should not have this running on the destination cluster, otherwise items might be deleted. This will cause the rev sequence number to be increased and can result in items not being replicated to the destination. This is a known issue, so you need to re-run the test and make sure that the expiry pager is not running. You can turn off the expiry pager by running the command below on each node.

cbepctl host:port set flush_param exp_pager_stime 0
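
i.e. something along these lines on each destination node, repeated per bucket (a sketch; the -b flag assumes cbepctl takes the bucket name that way, and only the two destination data nodes shown above are listed):

    for node in 172.23.105.160 172.23.105.207; do    # plus any other destination nodes
        /opt/couchbase/bin/cbepctl ${node}:11210 -b MsgsCalls set flush_param exp_pager_stime 0
    done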
Comment by Andrei Baranouski [ 18/Nov/14 ]
Thanks Mike,

Could you point me to the ticket for the "known issue"?
So the command should be run only on the nodes of the destination cluster?

BTW, how do we proceed to test bi-directional XDCR replication? I believe there may also be a problem there.
Comment by Mike Wiederhold [ 18/Nov/14 ]
Yes, for unidirectional you need to disable the expiry pager on the destination nodes. You can leave it on in the source cluster. Also, I don't know of a ticket that specifically relates to this issue, but I discussed it with support and it is known. If I can find something I'll post it here.

The problem is that if the destination cluster has any traffic (in this case expiry counts as traffic) then the rev sequence number will be increased. This can cause the destination node to win conflict resolution and as a result would mean an item from the source would not end up getting to the destination node. At some point this issue would work itself out, but only after the item expired on both sides.

For bi-directional this wouldn't be an issue because the destination will replicate back the source. In the case of this ticket the destination rev id is 74 and the source is 73. So when the destination replicates this item back it will win the conflict resolution.
Comment by Andrei Baranouski [ 18/Nov/14 ]
Thanks for the update!
Comment by Andrei Baranouski [ 24/Nov/14 ]
Hi Mike,

With the above scenario, is it expected that the destination cluster has a doc that the source doesn't have?

source: http://172.23.105.156:8091/index.html#sec=buckets
destination: http://172.23.105.160:8091/index.html#sec=buckets

the following items don't exist on src
RAB_222565502766
RAB_222740635920
RAB_222750550473

and one more question:
do we still support vbuckettool in 3.0.0? http://www.couchbase.com/issues/browse/MB-7253?focusedCommentId=98776&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-98776

I see this tool after installing 3.0.2, but it seems like it doesn't work:

curl http://172.23.105.160:8091/pools/default/buckets/RevAB | ./vbuckettool RAB_222565502766
vbuckettool mapfile key0 [key1 ... [keyN]]

  The vbuckettool expects a vBucketServerMap JSON mapfile, and
  will print the vBucketId and servers each key should live on.
  You may use '-' instead for the filename to specify stdin.

  Examples:
    ./vbuckettool file.json some_key another_key

    curl http://HOST:8091/pools/default/buckets/default | \
       ./vbuckettool - some_key another_key
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 12205 100 12205 0 0 434k 0 --:--:-- --:--:-- --:--:-- 458k
curl: (23) Failed writing body (0 != 12205)
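
For what it's worth, the usage text above expects an explicit '-' for the mapfile argument when reading from stdin; the invocation omitted it, which would explain curl's "(23) Failed writing body". A corrected sketch:

    curl http://172.23.105.160:8091/pools/default/buckets/RevAB | ./vbuckettool - RAB_222565502766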

Comment by Mike Wiederhold [ 24/Nov/14 ]
Apparently the last I was told we don't support vbuckettool. You should file a separate bug about this and assign it to the PM team.
Comment by Mike Wiederhold [ 24/Nov/14 ]
There appears to be an issue with persistence on the source node. I don't think this could be a DCP problem since there aren't any deletes or expirations in the cluster.
Comment by Mike Wiederhold [ 24/Nov/14 ]
First off, the missing keys are as follows:

Comparing active VBuckets across clusters

 Error found:
139 active 44219 172.23.105.157
139 active 44220 172.23.105.160

 Error found:
653 active 44440 172.23.105.158
653 active 44441 172.23.105.207

 Error found:
788 active 43780 172.23.105.158
788 active 43781 172.23.105.207

If you look at VBucket 653:

Mikes-MacBook-Pro:ep-engine mikewied$ management/cbstats 172.23.105.158:11210 -b RevAB vbucket-details | grep vb_653
 vb_653: active
 vb_653:db_data_size: 4408989
 vb_653:db_file_size: 4878427
 vb_653:high_seqno: 46365
 vb_653:ht_cache_size: 3891306
 vb_653:ht_item_memory: 3891306
 vb_653:ht_memory: 393720
 vb_653:num_ejects: 0
 vb_653:num_items: 44440
 vb_653:num_non_resident: 0
 vb_653:num_temp_items: 0
 vb_653:ops_create: 44440
 vb_653:ops_delete: 0
 vb_653:ops_reject: 0
 vb_653:ops_update: 589
 vb_653:pending_writes: 0
 vb_653:purge_seqno: 0
 vb_653:queue_age: 0
 vb_653:queue_drain: 45029
 vb_653:queue_fill: 45029
 vb_653:queue_memory: 0
 vb_653:queue_size: 0
 vb_653:uuid: 229288576785427
Mikes-MacBook-Pro:ep-engine mikewied$ management/cbstats 172.23.105.207:11210 -b RevAB vbucket-details | grep vb_653
 vb_653: active
 vb_653:db_data_size: 4417903
 vb_653:db_file_size: 5001307
 vb_653:high_seqno: 45467
 vb_653:ht_cache_size: 3917611
 vb_653:ht_item_memory: 3917611
 vb_653:ht_memory: 197032
 vb_653:num_ejects: 0
 vb_653:num_items: 44441
 vb_653:num_non_resident: 0
 vb_653:num_temp_items: 0
 vb_653:ops_create: 44441
 vb_653:ops_delete: 0
 vb_653:ops_reject: 0
 vb_653:ops_update: 1012
 vb_653:pending_writes: 0
 vb_653:purge_seqno: 0
 vb_653:queue_age: 0
 vb_653:queue_drain: 45453
 vb_653:queue_fill: 45453
 vb_653:queue_memory: 0
 vb_653:queue_size: 0
 vb_653:uuid: 228614974178837

We can see above that the number of creates is different between the clusters, but what is strange is that the destination node has an extra item and there are no deletes or expirations. On top of this the couch files show that the item on the destination cluster never existed on the source.

Src:

[root@centos-64-x64 ~]# /opt/couchbase/bin/couch_dbdump /opt/couchbase/var/lib/couchbase/data/RevAB/653.couch.14 | grep RAB_222740635920 -B 1 -A 6
Dumping "/opt/couchbase/var/lib/couchbase/data/RevAB/653.couch.14":

Total docs: 44440

Dest:

[root@centos-64-x64 ~]# /opt/couchbase/bin/couch_dbdump /opt/couchbase/var/lib/couchbase/data/RevAB/653.couch.24 | grep RAB_222740635920 -B 1 -A 6
Dumping "/opt/couchbase/var/lib/couchbase/data/RevAB/653.couch.24":
Doc seq: 31333
     id: RAB_222740635920
     rev: 1
     content_meta: 131
     size (on disk): 23
     cas: 1416742771886067118, expiry: 0, flags: 0, datatype: 0
     size: 13
     data: (snappy) ,111005375249

Total docs: 44441
Comment by Abhinav Dangeti [ 25/Nov/14 ]
Andrei, is this a regression? Was this seen consistently in 3.0.2?
Was this issue seen in 3.0.1, there is no history here: https://github.com/couchbaselabs/couchbase-qe-docs/tree/master/system-tests/viber ?

I can't find any trace of the missing items on the source side in the logs, as there seem to be no deletes, no expirations, and the items aren't in couchstore either. Are we certain that there was absolutely no load on the destination?

I read through the test spec. At the end of each phase, do you wait for replication to catch up, and do you grab stats by any chance? This is to check whether any rebalance caused this data loss.
If you can confirm that this is indeed a regression from 3.0.1, we can take a look at all the changes that were made for 3.0.2 that could cause this data loss.
Comment by Andrei Baranouski [ 25/Nov/14 ]
Hi Abhinav,

I've never seen this issue before, and I didn't run the tests against any 3.0.1 version: https://github.com/couchbaselabs/couchbase-qe-docs/commits/master/system-tests/viber

"Are we certain that there was absolutely no load on the destination?"
Yes, the destination didn't have any loaders.

"Do you wait for replication to catch up, and do you grab stats by any chance, to check if any rebalance caused this data loss?"
No, I do not wait between phases; the loader runs continuously, so at that point I'm not able to verify that the docs/stats are identical across the clusters.

I can't confirm whether this is a regression from 3.0.1 because I didn't run against any 3.0.1 build.

BTW, before this I never disabled the expiry pager on the destination cluster and never saw data loss on the source or destination (except a known and already fixed bug).


Comment by Abhinav Dangeti [ 25/Nov/14 ]
Enabling/Disabling the expiry pager on the destination cluster shouldn't have any effect on the data on the source cluster in a unidirectional XDCR scenario.
Comment by Abhinav Dangeti [ 26/Nov/14 ]
Andrei, the hard failover operation seems to be the main suspect here (as there could be data loss if there were backed up items in the replication queue).

I will need you to run this job again, and please grab the logs after every rebalance operation or phase in your test.
I would also need you to wait for replication to catch up before you trigger the failover operation to make sure that there aren't items backed up in the replication queue (to confirm if this caused the data loss), and check for the item mismatches after each of the rebalance operations. Please let me know if this is possible.
Comment by Andrei Baranouski [ 26/Nov/14 ]
"I will need you to run this job again, and please grab the logs after every rebalance operation or phase in your test.|
okay, will get logs after each operation

"I would also need you to wait for replication to catch up before you trigger the failover operation to make sure that there aren't items backed up in the replication queue (to confirm if this caused the data loss), and check for the item mismatches after each of the rebalance operations"
there are a couple questions here:
"wait for replication to catch up before you trigger the failove" so, it means that there is no any dataload on server before any hard/gracefull failover. I guess in this case the scenario wil be very simple and such cases should be covered in many func tests
" and check for the item mismatches after each of the rebalance operations" the same, when I should stop loader? I think it make sence only after rebalance?

I'm going to split each step on parts:
1) load n hours
2) stop loader, check items
3) start loader, wait a little and start any rebalance or failover operations
4) wait rebalance/failover completed
5) stop loader, check items (do you need logs if all is well after the iteration?)

let me know if this works

Comment by Abhinav Dangeti [ 26/Nov/14 ]
This would work Andrei, but you'll need to make sure there are no items in the replication queue when you do hard failover.
Running a load during rebalance and graceful failover operations is fine.
Get logs at the end of each phase only if you do see any item mismatches.
Thanks for the help Andrei.




[MB-11917] One node slow probably due to the Erlang scheduler Created: 09/Aug/14  Updated: 27/Nov/14

Status: Reopened
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Volker Mische Assignee: Harsha Havanur
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File crash_toy_701.rtf     PNG File leto_ssd_300-1105_561_build_init_indexleto_ssd_300-1105_561172.23.100.31beam.smp_cpu.png    
Issue Links:
Duplicate
duplicates MB-12200 Seg fault during indexing on view-toy... Resolved
duplicates MB-12579 View Index DGM 20% Regression (Initia... Resolved
duplicates MB-9822 One of nodes is too slow during indexing Closed
is duplicated by MB-12183 View Query Thruput regression compare... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
One node is slow, that's probably due to the "scheduler collapse" bug in the Erlang VM R16.

I will try to find a way to verify that it is really the scheduler and not some other problem. This is basically a duplicate of MB-9822, but that bug has a long history, hence I dare to create a new one.

 Comments   
Comment by Volker Mische [ 09/Aug/14 ]
I forgot to add that our issue sounds exactly like that one: http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
Comment by Sriram Melkote [ 11/Aug/14 ]
Upgrading to blocker as this is doubling initial index time in recent runs on showfast.
Comment by Volker Mische [ 12/Aug/14 ]
I verified that it's the "scheduler collapse". Have a look at the chart I've attached (it's from [1], "[172.23.100.31] beam.smp_cpu"). It starts with a utilization of around 400%; at around 120 I reduced the online schedulers to 1 (by running erlang:system_flag(schedulers_online, 1) via a remote shell). I then increased schedulers_online again at around 150 to the original value of 24. You can see that it got back to normal.

[1]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1105_561_build_init_index
Comment by Volker Mische [ 12/Aug/14 ]
I would try to run on R16 and see how often it happens with COUCHBASE_NS_SERVER_VM_EXTRA_ARGS=["+swt", "low", "+sfwi", "100"] set (as suggested in MB-9822 [1]).

[1]: https://www.couchbase.com/issues/browse/MB-9822?focusedCommentId=89219&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-89219
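
For reference, setting it could look like the sketch below, assuming the variable is exported in whatever environment starts ns_server (e.g. before restarting the service):

    export COUCHBASE_NS_SERVER_VM_EXTRA_ARGS='["+swt", "low", "+sfwi", "100"]'
    /etc/init.d/couchbase-server restart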
Comment by Pavel Paulau [ 12/Aug/14 ]
We agreed to try:

+sfwi 100/500 and +sbwt long

Will run test 5 times with these options.
Comment by Pavel Paulau [ 13/Aug/14 ]
5 runs of tests/index_50M_dgm.test with -sfwi 100 -sbwt long:

http://ci.sc.couchbase.com/job/leto-dev/19/
http://ci.sc.couchbase.com/job/leto-dev/20/
http://ci.sc.couchbase.com/job/leto-dev/21/
http://ci.sc.couchbase.com/job/leto-dev/22/
http://ci.sc.couchbase.com/job/leto-dev/23/

3 normal runs, 2 with slowness.
Comment by Volker Mische [ 13/Aug/14 ]
I see only one slow run (22): http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1137_6a0_build_init_index

But still :-/
Comment by Pavel Paulau [ 13/Aug/14 ]
See (20), incremental indexing: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1137_ed9_build_incr_index
Comment by Volker Mische [ 13/Aug/14 ]
Oh, I was only looking at the initial building.
Comment by Volker Mische [ 13/Aug/14 ]
I got a hint in the #erlang IRC channel. I'll try to use the erlang:bump_reductions(2000) and see if that helps.
Comment by Volker Mische [ 13/Aug/14 ]
Let's see if bumping the reductions makes it work: http://review.couchbase.org/40591
Comment by Aleksey Kondratenko [ 13/Aug/14 ]
merged that commit.
Comment by Pavel Paulau [ 13/Aug/14 ]
Just tested build 3.0.0-1150, rebalance test but with initial indexing phase.

2 nodes are super slow and utilize only single core.
Comment by Volker Mische [ 18/Aug/14 ]
I can't reproduce it locally. I tend towards closing this issue as "won't fix". We should really not have long-running NIFs.

I also think that it won't happen much under real workloads. And even if it does, the workaround would be to reduce the number of online schedulers to 1 and then immediately increase it back to the original number.
Comment by Volker Mische [ 18/Aug/14 ]
Assigning to Siri to make the call on whether we close it or not.
Comment by Anil Kumar [ 18/Aug/14 ]
Triage - Not blocking 3.0 RC1
Comment by Raju Suravarjjala [ 19/Aug/14 ]
Triage: Siri will put additional information and this bug is being retargeted to 3.0.1
Comment by Sriram Melkote [ 19/Aug/14 ]
Folks, for too long we've had trouble that gets pinned to our NIFs. In 3.5, let's solve it with whatever is the correct Erlang approach to running heavy, high-performance code. A port, reporting reductions, moving to R17 with dirty schedulers, or some other option I missed - whatever is the best solution, let us implement it in 3.5 and be done.
Comment by Volker Mische [ 09/Sep/14 ]
I think we should close this issue and rather create a new one for whatever we come up with (e.g. the async mapreduce NIF).
Comment by Harsha Havanur [ 10/Sep/14 ]
Toy Build for this change at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_ubunt12-3.0.0-toy-hhs-x86_64_3.0.0-702-toy.deb

Review in progress at
http://review.couchbase.org/#/c/41221/4
Comment by Harsha Havanur [ 12/Sep/14 ]
Please find the updated toy build for this:
http://latestbuilds.hq.couchbase.com/couchbase-server-community_ubunt12-3.0.0-toy-hhs-x86_64_3.0.0-704-toy.deb
Comment by Sriram Melkote [ 12/Sep/14 ]
Another occurrence of this, MB-12183.

I'm making this a blocker.
Comment by Harsha Havanur [ 13/Sep/14 ]
Centos build at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-hhs-x86_64_3.0.0-700-toy.rpm
Comment by Ketaki Gangal [ 16/Sep/14 ]
Filed bug MB-12200 for this toy-build
Comment by Ketaki Gangal [ 17/Sep/14 ]
Attaching stack from toy-build 701
File

crash_toy_701.rtf

Access to machine is as mentioned previously on MB-12200.
Comment by Harsha Havanur [ 19/Sep/14 ]
We are facing 2 issues with async nif implementation.
1) Loss of signals leading to deadlock in enqueue and dequeue in queues
I am suspecting enif mutex and condition variables. I could reproduce deadlock scenario on Centos which potentially point to both producer and consumer (enqueue and dequeue) in our case going to sleep due to not handling condition variable signals correctly.
To address this issue, I have replaced enif mutex and condition variables with that of C++ stl counterparts. This seem to fix the dead lock situation.

2) Memory getting freed by terminator task when the context is alive during mapDoc.
This is still work in progress and will update once I have a solution for this.
Comment by Harsha Havanur [ 21/Sep/14 ]
The segmentation fault is probably due to termination of the Erlang process calling map_doc. This triggers the destructor, which cleans up the v8 context while the task is still in the queue. Will attempt a fix for this.
Comment by Harsha Havanur [ 22/Sep/14 ]
I have fixed both issues in this build
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-hhs-x86_64_3.0.0-709-toy.rpm
I am running system tests as Ketaki suggested on VMs 10.6.2.164, 165, 168, 171, 172, 194, 195. Currently a rebalance is in progress.

For the deadlock situation, the resolution was to broadcast the condition signal to wake up all waiting threads instead of waking up only one of them.
For the segmentation fault, the resolution was to complete the map task for the context before it is cleaned up by the destructor when the Erlang process calling the map task terminates or crashes.

Please use this build for further functional and performance verification. Thanks,
Comment by Venu Uppalapati [ 01/Oct/14 ]
For performance runs, do the sfwi and sbwt options need to be set? Please provide some guidance on how to set these.
Comment by Volker Mische [ 02/Oct/14 ]
I would try a run without additional options.

In case you want to run with additional options see the comment above. You only need to set the COUCHBASE_NS_SERVER_VM_EXTRA_ARGS environment variable.
Comment by Venu Uppalapati [ 02/Oct/14 ]
2 runs of tests/index_50M_dgm.test with the above toy build. both are slow.
http://ci.sc.couchbase.com/job/leto-dev/29/
http://ci.sc.couchbase.com/job/leto-dev/28/
Comment by Harsha Havanur [ 08/Oct/14 ]
If we can confirm that the indexing slowness is not because of the Erlang scheduler collapse, can we merge these changes and investigate further?
Comment by Volker Mische [ 08/Oct/14 ]
But then we need to confirm that it's not a scheduler issue :)

One way I did it in the past:

1. Monitor the live system, see if one node has low CPU usage while the others perform normal
2. Open a remote Erlang shell to that node (couchbase-cli can do that with the undocumented `server-eshell` command):

    ./couchbase-cli server-eshell -c 127.0.0.1:8091 -u Administrator -p asdasd

3. Run the following Erlang (without the comments, of course):

    %% Get current number of online schedulers
    Schedulers = erlang:system_info(schedulers_online).

    %% Reduce number online to 1
    erlang:system_flag(schedulers_online, 1).

    %% Restore to original number of online schedulers
    erlang:system_flag(schedulers_online, Schedulers).

4. Monitor this node again. If it gets back to normal I'd say it's the scheduler collapse (or at least some scheduler issue).
Comment by Volker Mische [ 09/Oct/14 ]
Look what I've stumbled upon: https://github.com/huiqing/percept2

It's a profiling tool with a useful feature, scheduler activity: the number of active schedulers at any time during the profiling.

There's even a screenshot: http://refactoringtools.github.io/percept2/percept2_scheduler.png

I haven't looked at it closely or tried it, but it sounds promising.

Harsha, I think we should give this tool a try.
Comment by Sriram Melkote [ 29/Oct/14 ]
Status so far is that removing the map and reduce NIFs, which were long suspected of misusing Erlang threads to run heavy operations without reporting proper reductions, has not helped.

The plan of action going forward is:

(a) To reproduce this on EC2 so we are not delayed on availability of Leto
(b) To run the GDB script to detect more details of scheduler thread behavior
(c) To run with R14 locally
Comment by Volker Mische [ 29/Oct/14 ]
Siri, I'm not sure on (c). We've seen a similar issue in the past on R14, but less frequently, so it would need more runs to reproduce it. I don't think it's a bug in the Erlang VM that was introduced in R16. Anyway, if it's easy to do let's do it, but let's spend more time on moving forward, rather than backwards :)
Comment by Harsha Havanur [ 05/Nov/14 ]
Making the couch_view_parser NIF async seems to address the Erlang scheduler collapse in this case.
Review in progress at
http://review.couchbase.org/#/c/42821/
Comment by Harsha Havanur [ 05/Nov/14 ]
A toy build for the same is at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-hhs-x86_64_3.0.0-726-toy.rpm
Request QE to run basic functional tests and Query throughput tests on this build.
Comment by Sriram Melkote [ 10/Nov/14 ]
ETA to merge this is 5pm IST on Nov 11
Comment by Harsha Havanur [ 12/Nov/14 ]
Change has been successfully cherry-picked as 3bf0b23892a11299ff5cc25e3d1ebf83e3beec9f
Comment by Volker Mische [ 13/Nov/14 ]
I'm re-opening this issue as the problem is still there even with the async couch_view_parser NIF. It can be seen on the low CPU utilisation on one node [1] (search for "[172.23.100.30] beam.smp_cpu") compared to the others.

[1]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_302-1518_130_build_incr_index
Comment by Sriram Melkote [ 25/Nov/14 ]
Moving to Sherlock per 3.0.2 release meeting today.
Comment by Volker Mische [ 27/Nov/14 ]
Is there a way to create a toy build with a patched Erlang?

I'd like to see a run with a toy build that was built with an Erlang from this branch [1]. This is Erlang R16B03-1 with some backported patches.

The idea comes from an email on the Erlang users mailing list [2].

[1]: https://github.com/rickard-green/otp/commits/rickard/R16B03-1/load_balance/OTP-11385
[2]: http://erlang.org/pipermail/erlang-questions/2014-November/081683.html
Comment by Volker Mische [ 27/Nov/14 ]
Assigning to Ceej to answer the question: Is it possible to have a toy build with a patched Erlang?




[MB-12795] MacOSX build is unavailable after 3.0.2-1582-rel Created: 27/Nov/14  Updated: 27/Nov/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0.2
Fix Version/s: 3.0.2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Meenakshi Goel Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
MacOSX build seems to be unavailable after 3.0.2-1582-rel

http://latestbuilds.hq.couchbase.com/index_3.0.2.html




[MB-9419] Support JSON spec as a JSON database Created: 25/Oct/13  Updated: 24/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Michael Nitschinger Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
As a JSON database, we should support the JSON spec fully.

CouchDB only supports toplevel objects and not arrays (See MB-9208 for an example).
I think we need to fix this.

Also, while we are at it, supporting multi-root documents like:
{"one":1}{"two":2}

would be very cool since we could do atomic appends/prepends with it.




[MB-12356] unusual high number of gets per sec (20k) opening "Documents" view Created: 15/Oct/14  Updated: 15/Oct/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Matt Ingenthron Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Reported by a user here: https://forums.couchbase.com/t/what-is-causing-high-number-of-gets-per-sec-20k-opening-documents-view-3-0-1/1746

Every time I click on "Documents" to view documents in a Bucket, the page hangs for a little bit while it loads. If I switch over to the General Bucket Analysis graph, I see some 20k gets/sec. I am currently able to produce this behavior consistently. There are no Views for this bucket.

 Comments   
Comment by rdev5 [ 15/Oct/14 ]
Please see https://github.com/rdev5/cachewarmer/tree/master for instructions on how I was able to reproduce this on a new 3.0.0/1 cluster.
Comment by Matt Ingenthron [ 15/Oct/14 ]
Thanks so much for the sample rdev5! That's hugely helpful in finding/addressing the issue.
Comment by rdev5 [ 15/Oct/14 ]
Sure thing. Hope you guys are able to track it down. Let me know if you need anything else.
Comment by rdev5 [ 15/Oct/14 ]
Just an added observation - It appears that flushing the bucket purges it of this behavior. Running the test project again brings it back (even after items have been expired).
Comment by Aleksey Kondratenko [ 15/Oct/14 ]
We have tracked it down.

It's interesting combination of our fundamental lack of support of range ops and the way how expirations work.

Specifically:

* every time you open Documents section of UI it makes "get me 10 docs with lexicographically smallest keys".

* our in-memory layer is unable to handle it directly. So couchbase actually has to go to specific .couch files. This is where problem number one exists: it's unable to see docs that are created but are not persisted yet.

* getting 10 smallest keys from specific per-vbucket .couch file is efficient. But getting list of 10 smallest keys across all 1024 vbuckets is far less efficient. It's quick if your by-id btree of db files fit in page cache, but once you get to larger data sizes expect it to be very slow

* after we've got 10 smallest keys from underlying .couch files we fetch their bodies from caching layer. This is where second issue kicks in.

* It happens because actual expirations are lazy in couchbase (same as upstream memcached, in fact). When a document expires it does not get deleted immediately. Instead, when you GET some key, the caching layer checks whether the document is expired and, if it is, deletes it and returns a "missing" response. There is, however, a periodic task that scans all keys and deletes expired documents.

* so this causes problem for views (both for DCP-based views of 3.0.0 and .couch-file based views of 2.x). And it causes similar "problem" for documents ui.

* so when the underlying Documents UI REST API implementation gets the list of keys and does GETs, it will in your case discover that those docs are actually gone already. It will try a larger and larger window until it finds 10 "live" docs.

* and that's what is causing slowness for you


In general, the Documents UI works for development environments (albeit with the issues mentioned above). But don't expect the Documents UI to work ok for serious production deployments. It's just not ready.

Also, as far as I know there are no plans to fix it. There are some internal discussions about a primary index related to the N1QL work. But I'm not aware of any plans at all w.r.t. the expirations problem. So I have no idea when these 2 issues that you're dealing with will be fixed.


Comment by rdev5 [ 15/Oct/14 ]
Two quick questions:

So basically avoid using the Documents UI in production altogether since it could possibly impact server performance for clients in production?

Are older versions like 2.2 and 2.5 also affected by this?
Comment by Aleksey Kondratenko [ 15/Oct/14 ]
>> So basically avoid using the Documents UI in production altogether since it could possibly impact server performance for clients in production?

Yes.

>> Are older versions like 2.2 and 2.5 also affected by this?

Yes. Even more than 3.0, because 3.0's implementation is more optimized.




[MB-12234] Secondary Indexing (2i) for Sherlock Created: 23/Sep/14  Updated: 23/Sep/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Sriram Melkote
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
over-arching tracking item for the GA of query facilities for sherlock
http://hub.internal.couchbase.com/confluence/display/PM/Query+Requirements+-+Sherlock




[MB-12233] N1QL for Sherlock - SELECT and DDL Created: 23/Sep/14  Updated: 23/Sep/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: sherlock
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
over-arching tracking item for the GA of query facilities for sherlock
http://hub.internal.couchbase.com/confluence/display/PM/Query+Requirements+-+Sherlock




[MB-12173] SSL certificate should allow importing certs besides server generated certs Created: 12/Sep/14  Updated: 15/Oct/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt: finish-start
has to be done before DOC-124 document SDK usage of CA and self-sig... Open
Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Matt Ingenthron [ 12/Sep/14 ]
Existing SDKs should be compatible with this, but importing the CA certs will need to be documented.




[MB-11557] Compaction: Report a useful error message when temporary files cannot be created Created: 26/Jun/14  Updated: 26/Jun/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 2.5.0, 2.5.1, 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dave Rigby Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6u5

Attachments: File 0.couch.1    
Issue Links:
Relates to
relates to MB-11403 KV+XDCR System test : Race between co... Closed
relates to MB-11560 Compaction: Don't write to /tmp Resolved
Flagged:
Release Note

 Description   
The couch_compact program appears to create a number of temporary files (using tmpfile()) during compaction (at a guess, for compacting the b-tree index). If these files cannot be created, compaction fails with a generic error message:

[couchdb:info,2014-06-26T1:06:41.785,ns_1@localhost:<0.29571.6153>:couch_log:info:39]Native compactor output: Couchstore error: no such file

It would be *extremely* useful if the message was more meaningful and actually pointed out the reason for the failure.

Steps to reproduce:

Run couch_compact with the attached couchdb file (note: just contains test data) with /tmp set to be unwritable to the couchbase user.

$ chmod o-w /tmp
$ sudo -u couchbase -s
$ /opt/couchbase/bin/couch_compact 0.couch.1 foo.out
Couchstore error: no such file
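
(After reproducing, the usual world-writable sticky-bit permissions on /tmp can be restored, e.g.:)

    chmod 1777 /tmp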



 Comments   
Comment by Dave Rigby [ 26/Jun/14 ]
Location where the tmp file creation fails: http://src.couchbase.org/source/xref/2.5.0/couchstore/src/tree_writer.c#42

    /* With no unsortedFilePath, tmpfile() creates the scratch file in the
       system temp directory (typically /tmp); if that directory is not
       writable the call fails and only the generic "no such file" error
       is reported. */
    writer->file = unsortedFilePath ? fopen(unsortedFilePath, "r+b") : tmpfile();
    if (!writer->file) {
        TreeWriterFree(writer);
        error_pass(COUCHSTORE_ERROR_NO_SUCH_FILE);
    }

Comment by Dave Rigby [ 26/Jun/14 ]
If this isn't fixed by 3.0, I think we should at least release-note it as a reason for compaction to fail.
Comment by Sundar Sridharan [ 26/Jun/14 ]
MB-11403 was fixed to address this temporary files issue specifically. thanks
Comment by Perry Krug [ 26/Jun/14 ]
Could you also add to the log message which file specifically is failing rather than just "no such file"?




[MB-11401] implement ns_server-side support for memcached ctl token extension Created: 11/Jun/14  Updated: 15/Oct/14

Status: In Progress
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Artem Stemkovski
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
There's a chance that memcached requests that are lost (for example from a dead OS or Erlang process, or from a timeout) will still be executed by memcached, causing arbitrary issues.

There's agreement with ep-engine and memcached folks and working implementation of new memcached extension described here: https://docs.google.com/document/d/116_7evfelOL5CMTl1GwtlbK2DtIH6Dp1tBSxV9zxUOM/edit

We should use it.

 Comments   
Comment by Aleksey Kondratenko [ 15/Oct/14 ]
moved out of 3.0.1 into sherlock. We won't be able to complete this in time for 3.0.1 most likely. I expect us to have enough time to do it for Sherlock.




[MB-11346] Audit logs for User/App actions Created: 06/Jun/14  Updated: 24/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Don Pinto
Resolution: Unresolved Votes: 0
Labels: security, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Couchbase Server should be able to produce an audit log for all user/app actions, such as login/logout events, mutations, and other bucket and security changes.






[MB-11314] Enhanced Authentication model for Couchbase Server for Administrators, Users and Applications Created: 04/Jun/14  Updated: 03/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Don Pinto
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Couchbase Server will add support for authentication using various techniques, for example Kerberos, LDAP, etc.







[MB-11282] Separate stats for internal memory allocation (application vs. data) Created: 02/Jun/14  Updated: 02/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Story Priority: Critical
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
AFAIK we currently track memory allocation for data and application together.

But sometimes the application (memcached / ep-engine) overhead is large and cannot be ignored.




[MB-11229] For pre/append operations DCP should only send the append chunk and not the whole document. Created: 28/May/14  Updated: 26/Aug/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Patrick Varley Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: AWS and other cloud environments that charge for network usage.

Issue Links:
Dependency
blocks MB-11591 Optimism pre/append operations throug... Open
Duplicate

 Description   
I believe in TAP we sync the whole document even when it was an append/prepend operation from the client.

This can cause high network usage (and, as a result, cost) where append is used heavily.

If the architecture allows, we should send just the append chunk, the metadata, and maybe the previous metadata so the system can validate that it is up to date before applying the new append.

This should reduce network bandwidth and reduce the need to do a bg_fetch for append operations.

 Comments   
Comment by Mike Wiederhold [ 04/Aug/14 ]
We don't internally store whether an update is an append or not. As a result, this is not something that is currently possible to do from a DCP standpoint.




[MB-11154] Document proper way to detect a flush success from the SDK Created: 19/May/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Critical
Reporter: Michael Nitschinger Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Hi folks,

while implementing the 2.0 SDK for Java, I needed flush() again and thought let's do it right this time. Here is how the old SDK did it, more or less in a hacky way:

- Do the HTTP flush command against the bucket
- Then poll for ep_degraded_mode

Now I talked to Trond and he said polling those stats is just guessing, since the only authoritative source for this can be ns_server. I guess the only reason to poll is that flush can take a long time and the HTTP request can time out before it completes?

We need to come up with:

1) A documented way to do this reliably for the 2.* series so we can provide good support for it.
2) If this is not good enough and has some edge cases or whatever, something better for 3.*

I'm starting with Alk here since I guess ns_server has the coordination of all that.
Cheers,
Michael

 Comments   
Comment by Trond Norbye [ 19/May/14 ]
Polling for such a status change in ep-engine will never be "safe". It may enter and exit the degraded mode between any poll requests. You would have to have a stat with some sort of uuid in ep-engine in order to implement this. Given that this is a "cluster-wide" operation, the only component that knows the overall status of this operation is ns_server.

I don't think it is a good idea to spread the "internal logic" from ep-engine to the clients (since that may make it hard to change the implementation logic inside the engine).
Comment by Matt Ingenthron [ 19/May/14 ]
A couple of high-level points (much of this has been discussed before in email):
- Since this is a cluster, the thing managing the cluster is the place to ask for the 'flush', that is ns-server as Trond mentions
- With REST, any long running operations are supposed to return an HTTP 201 with a location to check status on that operation. This is something we really need for many things beyond flush(). For instance, bucket create... how should a client (doesn't matter if it's an SDK) know when that operation is done?
- Connected clients (those who did not request the flush) should have very simple interaction with the cluster (to Trond's other point). If it's a flush, during the duration of the flushing activity there should be TMPFAIL replies and we should make the flush as low latency as possible. I know it can't be as fast as memcached, but I also know it can be pretty fast.

Mike: I assume there must be some other reason you're asking about this now? Related to UPR work?
Comment by Michael Nitschinger [ 19/May/14 ]
Matt,

I just asked because I wanted to implement flush in the new JVM core so that I can support my own unit tests properly. I then dug into the dusty corners of the old SDK and wondered if there is a better way than how we do it right now. And also to bring it up so we get better semantics moving forward.
Comment by Aleksey Kondratenko [ 19/May/14 ]
Unfortunately there is no clean and bullet-proof way of doing it. Here's what I could come up with which should be usable for tests:

* upload some "marker doc". Say empty doc with key __flush_marker

* send flush via REST API.

* if it returned 200 then you're done

* if it returned 201, poll for __flush_marker until you get a miss (note: not a temp error, and not a hit, but a miss)

* if it returned anything else assume that request failed and restart by sending another flush request
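
(Illustrative only.) A rough sketch of the steps above, assuming a hypothetical memcached-protocol client object exposing set()/get() and hypothetical KeyNotFound/TemporaryFailure exceptions; the REST endpoint shown is the standard per-bucket doFlush controller:

    # Sketch of the marker-doc flush detection described above -- not SDK code.
    import time
    import requests  # third-party HTTP library, used here for brevity

    class KeyNotFound(Exception): pass        # hypothetical client exceptions
    class TemporaryFailure(Exception): pass

    MARKER = "__flush_marker"

    def flush_and_wait(host, bucket, user, password, client):
        client.set(MARKER, b"")                        # 1. upload the marker doc
        url = "http://%s:8091/pools/default/buckets/%s/controller/doFlush" % (host, bucket)
        r = requests.post(url, auth=(user, password))  # 2. send flush via REST API
        if r.status_code == 200:
            return                                     # flush finished synchronously
        if r.status_code != 201:
            raise Exception("flush failed (%d), retry" % r.status_code)
        while True:                                    # 3. poll until a real miss
            try:
                client.get(MARKER)                     # hit: flush not finished yet
            except KeyNotFound:
                return                                 # miss (not tmpfail) => flushed
            except TemporaryFailure:
                pass                                   # vbucket still flushing
            time.sleep(0.1)
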
Comment by Brett Lawson [ 18/Jun/14 ]
@Alk: Will this method of detecting a flush degrade on larger clusters, where many nodes may not be done flushing even if the doc has been flushed from a particular node?
Comment by Aleksey Kondratenko [ 18/Jun/14 ]
No. Flush is done in 2pc fashion. If you stop seeing marker doc in some vbucket, then you know that other vbuckets are already rejecting ops or already done flushing.
Comment by Aleksey Kondratenko [ 18/Jun/14 ]
Let me clarify. Once you start seeing the _lack of presence_ of the marker doc, then as pointed out above, flush is guaranteed to be done. And done means that you may see tmperrors for some time after that. But you will not see any docs from before the flush.




[MB-11099] Couchbase cluster provisioning, deployment, and management on Open Stack KVM and Trove Created: 12/May/14  Updated: 08/Jul/14

Status: Open
Project: Couchbase Server
Component/s: cloud
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Support Couchbase cluster provisioning, deployment, and management on Open Stack KVM and Trove.




[MB-11101] supported go SDK for couchbase server Created: 12/May/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Matt Ingenthron
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
go client




[MB-11007] Request for Get Multi Meta Call for bulk meta data reads Created: 30/Apr/14  Updated: 30/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Parag Agarwal Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: All


 Description   
Currently we support a per-key call for getMetaData. As a result, our verification requires a per-key fetch during the verification phase. This request is to support a bulk get-metadata call which can return metadata per vbucket for all keys, or in batches. This would enhance our ability to verify per-document metadata over time or after operations like rebalance, as it will be faster. If there is a better alternative, please recommend one.

Current Behavior

https://github.com/couchbase/ep-engine/blob/master/src/ep.cc

ENGINE_ERROR_CODE EventuallyPersistentStore::getMetaData(
                                                        const std::string &key,
                                                        uint16_t vbucket,
                                                        const void *cookie,
                                                        ItemMetaData &metadata,
                                                        uint32_t &deleted,
                                                        bool trackReferenced)
{
    (void) cookie;
    RCPtr<VBucket> vb = getVBucket(vbucket);
    if (!vb || vb->getState() == vbucket_state_dead ||
        vb->getState() == vbucket_state_replica) {
        ++stats.numNotMyVBuckets;
        return ENGINE_NOT_MY_VBUCKET;
    }

    int bucket_num(0);
    deleted = 0;
    LockHolder lh = vb->ht.getLockedBucket(key, &bucket_num);
    StoredValue *v = vb->ht.unlocked_find(key, bucket_num, true,
                                          trackReferenced);

    if (v) {
        stats.numOpsGetMeta++;

        if (v->isTempInitialItem()) { // Need bg meta fetch.
            bgFetch(key, vbucket, -1, cookie, true);
            return ENGINE_EWOULDBLOCK;
        } else if (v->isTempNonExistentItem()) {
            metadata.cas = v->getCas();
            return ENGINE_KEY_ENOENT;
        } else {
            if (v->isTempDeletedItem() || v->isDeleted() ||
                v->isExpired(ep_real_time())) {
                deleted |= GET_META_ITEM_DELETED_FLAG;
            }
            metadata.cas = v->getCas();
            metadata.flags = v->getFlags();
            metadata.exptime = v->getExptime();
            metadata.revSeqno = v->getRevSeqno();
            return ENGINE_SUCCESS;
        }
    } else {
        // The key wasn't found. However, this may be because it was previously
        // deleted or evicted with the full eviction strategy.
        // So, add a temporary item corresponding to the key to the hash table
        // and schedule a background fetch for its metadata from the persistent
        // store. The item's state will be updated after the fetch completes.
        return addTempItemForBgFetch(lh, bucket_num, key, vb, cookie, true);
    }
}



 Comments   
Comment by Venu Uppalapati [ 30/Apr/14 ]
The server supports the quiet CMD_GETQ_META call, which can be used on the client side to build a multi-getMeta call similar to the multiGet implementation.
Comment by Parag Agarwal [ 30/Apr/14 ]
Please point to a working example for this call
Comment by Venu Uppalapati [ 30/Apr/14 ]
Parag, you can find some relevant information on queuing requests using the quiet call at https://code.google.com/p/memcached/wiki/BinaryProtocolRevamped#Get,_Get_Quietly,_Get_Key,_Get_Key_Quietly
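
(For illustration only.) A rough sketch of the pipelining pattern described above, framing raw binary-protocol requests by hand. The opcode values are assumptions (check ep-engine's protocol_binary.h), and a real client must also hash each key to its vbucket and handle authentication:

    # Sketch: batch metadata reads by pipelining quiet GETQ_META requests and
    # terminating the batch with a NOOP. Not a supported client API.
    import struct

    CMD_GETQ_META = 0xa1   # assumed opcode for the quiet GET_META variant
    CMD_NOOP      = 0x0a   # flushes any pending quiet responses

    def build_multi_getmeta(keys_with_vbuckets):
        req = b""
        for opaque, (key, vbucket) in enumerate(keys_with_vbuckets):
            key = key.encode("utf-8")
            # 24-byte request header: magic, opcode, key length, extras length,
            # datatype, vbucket id, total body length, opaque, cas
            req += struct.pack(">BBHBBHIIQ", 0x80, CMD_GETQ_META, len(key),
                               0, 0, vbucket, len(key), opaque, 0) + key
        req += struct.pack(">BBHBBHIIQ", 0x80, CMD_NOOP, 0, 0, 0, 0, 0, 0xFFFFFFFF, 0)
        return req

Usage would be to send the returned bytes on an authenticated memcached connection and read responses until the NOOP's opaque (0xFFFFFFFF here) comes back; quiet misses produce no response.
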
Comment by Chiyoung Seo [ 30/Apr/14 ]
Changing the fix version to the feature backlog given that the 3.0 feature-complete date has already passed and this is requested for the QE testing framework.




[MB-10791] Support async Get / GetMeta / Add / Delete APIs Created: 07/Apr/14  Updated: 07/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Protocol extensions will be required to support heavy DGM in a more scalable way and use connection sockets much more efficiently.

As a starting point and short-term solution, we need to provide async types of operations for Get / GetMeta / Add / Delete APIs. Note that those extensions will mitigate the performance regression from the full ejection cache management.

More details can be found on

https://docs.google.com/document/d/1tJYCW_sPbjqQD-X7LLAb88LNZwYmODGn-awxYrPHQeg/edit#

As a long-term solution, we need to revisit the overall protocol specs and extend them to address the above issues completely.






[MB-10788] Enhance the OBSERVE durability support by leveraging VBucket UUID and seq number Created: 07/Apr/14  Updated: 23/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Chiyoung Seo Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
The current OBSERVE command is based on the CAS returned from the server, which consequently can't provide correct tracking, especially for failover scenarios. To enhance it, we will investigate leveraging the VBucket UUID and sequence number to provide better tracking for various failover or soft / hard node shutdown scenarios.

 Comments   
Comment by Cihan Biyikoglu [ 08/Apr/14 ]
There are a number of reasons why this comes up regularly with customers.
- They see replication as a better way to achieve durability than local-node disk persistence.
- This does allow replica reads without compromising consistency.
Comment by Chiyoung Seo [ 02/Sep/14 ]
We discussed a high-level design that is still based on a polling approach like OBSERVE, but provides better characteristics in terms of performance (e.g., latency) and tracking replication. Sriram will write up the design doc and share it later.




[MB-10714] add/rebalance nodes immediately after install can interfere with loading sample data Created: 01/Apr/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1, 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Marty Schoch Assignee: Aliaksey Artamonau
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3 nodes - CentOS release 6.4 (Final)
each 4 core - 4GB

Issue Links:
Relates to
relates to MB-11441 Loading the example buckets is extrem... Resolved
relates to MB-11820 beer-sample loading is stuck in crash... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
1. Clean install of couchbase-server 2.5.1 on 3 nodes
2. Complete set up wizard on first node, select beer-sample database, 100MB for default bucket, everything else default values
3. As soon as you're taken to the regular console page, click to server tab
4. Add second server
5. Add third server
6. Press rebalance
7. Observe that rebalance has started
8. Click over to data buckets tab, observe that less than 7303 documents are in the beer-sample bucket (presumably exact value depends on timing of previous operations)

First time this happened I had around 2300 documents. When reproducing the behavior I had only 869 documents.

 Comments   
Comment by Matt Ingenthron [ 01/Apr/14 ]
Probably because it is using the non-rebalance-aware Python stub of a client.
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - 06/04/2014 Alk, Wayne, Parag, Anil
Comment by Trond Norbye [ 17/Jun/14 ]
See also http://www.couchbase.com/issues/browse/MB-11441
Comment by Perry Krug [ 30/Jul/14 ]
I think this might be similar to MB-11820.




[MB-10716] SSD, HDD and Cloud Storage IO throughput optimizations: ForestDB Created: 01/Apr/14  Updated: 22/Oct/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by MB-11098 Ability to set block size written to ... Resolved

 Description   
forestdb work




[MB-11208] stats.org should be installed Created: 27/May/14  Updated: 27/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: techdebt-backlog
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
stats.org contains a description of the stats we're sending from ep-engine. It could be useful for people.

 Comments   
Comment by Matt Ingenthron [ 27/May/14 ]
If it's "useful" shouldn't this be part of official documentation? I've often thought it should be. There's probably a duplicate here somewhere.

I also think the stats need stability labels applied as people may rely on stats when building their own integration/monitoring tools. COMMITTED, UNCOMMITTED, VOLATILE, etc. would be useful for the stats.

Relatedly, someone should document deprecation of TAP stats for 3.0.




[MB-11188] RemoteMachineShellConnection.extract_remote_info doesn't work on OSX Mavericks Created: 22/May/14  Updated: 15/Aug/14

Status: Open
Project: Couchbase Server
Component/s: test-execution
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Artem Stemkovski Assignee: Parag Agarwal
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
2 problems:

1:
executing sw_vers on ssh returns:
/Users/artem/.bashrc: line 2: brew: command not found

2:
workbook:ns_server artem$ hostname -d
hostname: illegal option -- d
usage: hostname [-fs] [name-of-host]




[MB-11171] mem_used stat exceeds the bucket memory quota in extremely heavy DGM and highly overloaded cluster Created: 20/May/14  Updated: 21/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.1.0, 2.2.0, 2.1.1, 2.5.0, 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This issue was reported from one of the customers. Their cluster was extremely heavy DGM (resident ratio near zero in both active and replica vbuckets) and was highly overloaded when this memory bloating issue happened.

From the logs, we saw that the number of memcached connections spiked from 300 to 3K during the period with the memory issue. However, we have not yet been able to correlate the increased number of connections to the memory bloating issue, but we plan to keep investigating by running similar workload tests.





[MB-10842] cbdocloader can't handle UTF-16 input files created on Windows Created: 11/Apr/14  Updated: 23/Sep/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.1, 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Tom Green Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: cbdocloader, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 32/64bit

Issue Links:
Dependency

 Description   
When using cbdocloader on Windows, and passing in JSON input files created on Windows in UTF-16 format, cbdocloader fails with "No JSON object could be decoded", as seen below.

Input files inside the zip are:
$ file venue_2.json
venue_2.json: Little-endian UTF-16 Unicode text, with very long lines, with no line terminators

When running cbdocloader:
cbdocloader.exe -n localhost:8091 -u Administrator -p [password] -b timbre H:\json\venues.zip

output

[2014-04-10 15:02:09,473] - [rest_client] [9308] - INFO - existing buckets : [u'beer-sample', u'gamesim-sample', u'timbre']
[2014-04-10 15:02:09,473] - [rest_client] [9308] - INFO - found bucket timbre
No JSON object could be decoded
No JSON object could be decoded
No JSON object could be decoded
done


If the input file is unzipped, passed through dos2unix, and zipped back up again, cbdocloader will process it successfully.

After dos2unix, the input files are:
$ file venue_1.json
venue_1.json: ASCII text, with very long lines, with no line terminators

When running cbdocloader with reformatted input, it works successfully:

./cbdocloader -n localhost:8091 -u Administrator -p [password] -b timbre venues_dos2unix.zip
[2014-04-10 10:24:25,393] - [rest_client] [140735166104336] - INFO - existing buckets : [u'default', u'timbre']
[2014-04-10 10:24:25,393] - [rest_client] [140735166104336] - INFO - found bucket timbre
done
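
(Illustrative sketch, not the actual cbdocloader code.) The likely root cause is that the JSON parser is handed raw UTF-16 bytes; normalising the encoding before parsing avoids the failure. The filename is the sample from the description, and load_json_bytes is a hypothetical helper:

    # Sketch: decode BOM-prefixed UTF-16/UTF-8 input before handing it to json.
    import codecs
    import json

    def load_json_bytes(raw):
        if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
            text = raw.decode("utf-16")      # the utf-16 codec consumes the BOM
        elif raw.startswith(codecs.BOM_UTF8):
            text = raw.decode("utf-8-sig")
        else:
            text = raw.decode("utf-8")       # BOM-less files assumed UTF-8
        return json.loads(text)

    with open("venue_2.json", "rb") as f:
        doc = load_json_bytes(f.read())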


 Comments   
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - June 04 2014 Bin, Ashivinder, Venu, Tony, Anil




[MB-10831] Add new stats to Admin UI 'memory fragmentation outside mem_used' Created: 10/Apr/14  Updated: 04/Aug/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, storage-engine
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Steve Yen Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Is this a Regression?: No

 Comments   
Comment by Perry Krug [ 10/Apr/14 ]
I think it might be worthwhile to have an absolute number in terms of amount of memory above-and-beyond mem_used as opposed to a percentage...it may let users be a bit more intelligent about when they need to take action.
Comment by Anil Kumar [ 07/Jul/14 ]
Artem – As discussed, it will be useful to have this stat as an absolute value rather than a percentage (%).

Text for stat – “Fragmented data measured outside of mem_used” in GB
Comment by Artem Stemkovski [ 22/Jul/14 ]
What's the best way to get memory fragmentation out of ep_engine?
I see the stat called total_fragmentation_bytes. Is this the correct stat?

Chiyoung Seo:
From the ep-engine stats,
total_free_bytes (free and mapped pages in the allocator) + total_fragmentation_bytes can be used for this.
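
(For reference, a trivial sketch of that calculation, assuming stats is a dict of the ep-engine memory stats for a bucket:)

    # Sketch: memory held by the allocator outside mem_used, per the comment above.
    fragmentation_bytes = (int(stats["total_free_bytes"])
                           + int(stats["total_fragmentation_bytes"]))
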
Comment by Artem Stemkovski [ 22/Jul/14 ]
Apparently there's no way to get tcmalloc stats out of ep_engine if there are no buckets.
We still need to display system stats even if there are no buckets configured.

So we need a way to query tcmalloc stats globally without logging in as a particular bucket.
Rerouting the ticket to ep_engine
Comment by Mike Wiederhold [ 30/Jul/14 ]
Trond,

Can you take a look at this? The ns_server team wants a way to get tcmalloc memory stats without having to authenticate to any particular bucket.
Comment by Trond Norbye [ 31/Jul/14 ]
Can I get a better description on what we want to show here? I don't think we should tie it directly to tcmalloc internal statistics, because we _are_ doing experiments with using other memory allocators than tcmalloc.
Comment by Trond Norbye [ 31/Jul/14 ]
What exactly do you want?
Comment by Steve Yen [ 31/Jul/14 ]
> What exactly do you want?

Here's my take...

This is already becoming too late for 3.0, so this'll likely become 3.0.1. Also, I think...

1) On ns-server side...

If the stats aren't available in some situations (such as because there isn't a bucket), then please don't display the related graphs in the web admin UI until there is a bucket. This comes from the theory that some visibility sometimes is better than none all the time.

2) On the memcached side...

Please add tcmalloc/memory allocator stats (as many as you can) to the relevant "stats <relevant-subkey>" response values, if not already there, ideally if the client is SASL authenticated using some special "system" user (I think we have something like that?). And document it in some spec.

If these data are available only at the bucket level, I'd also recommend any total'ing and SUM'ing be done "on the outside", by the client.

The "as many as you can" advice is so we don't have to do more future ping pong of "please please expose stat or counter XYZ". The design policy should be "expose 'em all, because we're going to wish we had them at 3am some future day" (and while I'm suggesting policy, please also stop renaming these things as they get passed around through the system so folks can grep code with a single, stable identifier).

Anyways, that's my 0.00002 NOK.

Thanks.

Now that I forked this thing into two subtasks across 2 teams, I arbitrarily chose Artem as the next receiver of this MB.
Comment by Aleksey Kondratenko [ 04/Aug/14 ]
It doesn't look like it's going to be a simple change in ns_server land. So moving out of 3.0




[MB-10834] update the license.txt for enterprise edition for 2.5.1 Created: 10/Apr/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Microsoft Word 2014-04-07 EE Free Clickthru Breif License.docx    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
document attached.

 Comments   
Comment by Phil Labee (Inactive) [ 10/Apr/14 ]
2.5.1 has already been shipped, so this file can't be included.

Is this for 3.0.0 release?
Comment by Phil Labee (Inactive) [ 10/Apr/14 ]
voltron commit: 8044c51ad7c5bc046f32095921f712234e74740b

uses the contents of the attached file to update LICENSE-enterprise.txt on the master branch.




[MB-10835] update the license.txt for community edition for 2.5.1 Created: 10/Apr/14  Updated: 23/Jun/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: doc attached

Attachments: Microsoft Word 2011-12-07 CE License.docx    
Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Phil Labee (Inactive) [ 10/Apr/14 ]
2.5.1 has already been shipped, so this file can't be included.

Is this for 3.0.0 release?
Comment by Phil Labee (Inactive) [ 10/Apr/14 ]
voltron commit: 4d82ae743364074fd8783c2ff864bbf63c70a040

uses the contents of the attached file to update LICENSE-community.txt on the master branch.
Comment by Cihan Biyikoglu [ 10/Apr/14 ]
This is for 2.5.1 - Anil will validate, but we may need to fix that. We'll expand tests to prevent the issue in the future.
Comment by Phil Labee (Inactive) [ 10/Apr/14 ]
fix it for where? For what releases is the attached file relevant?
Comment by Cihan Biyikoglu [ 15/Apr/14 ]
this is for 2.5.1
Comment by Phil Labee (Inactive) [ 15/Apr/14 ]
2.5.1 has already shipped
Comment by Anil Kumar [ 23/Jun/14 ]
2.5.1 has been released for EE but not for CE. The license.txt file in CE 2.5.1 - build 1073 needs to be verified; if it's old, we will replace it.




[MB-10823] Log failed/successful login with source IP to detect brute force attacks Created: 10/Apr/14  Updated: 26/Aug/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Don Pinto
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Cihan Biyikoglu [ 18/Jun/14 ]
http://www.couchbase.com/issues/browse/MB-11463 for covering ports 11209 or 11211.




[MB-10821] optimize storage of larger binary object in couchbase Created: 10/Apr/14  Updated: 10/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10524] need working and official instructions or make target to correctly clean working repository from all build products Created: 20/Mar/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
SUBJ.

So that people can work.

The following was tried and did not work:

# repo forall -c 'git clean -dfx'
# rm cmake/CMakeCache.txt

As part of investigating that, we found that it automagically, without being asked to, tried to build against libcurl in /opt/couchbase. Removing /opt/couchbase moved things forward but did not resolve the problem.


 Comments   
Comment by Trond Norbye [ 20/Mar/14 ]
This is what I've been running for the last n months:

gmake clean-xfd
repo forall -c 'git clean -dfx'

Ideally we should focus on completing the transition, keeping the transition period as short as possible.
Comment by Aleksey Kondratenko [ 20/Mar/14 ]
Here's what I've added to .repo/Makefile.extra


superclean:
	rm -rf install
	repo forall -c 'git clean -dfx'
	rm -fr cmake/CMakeCache.txt cmake/CMakeFiles dependencies/
	cd cmake/ && (ruby -e 'puts Dir["*"].select {|n| File.file?(n)}' | xargs rm)
	cp -f tlm/CMakeLists.txt cmake/
	cp -f tlm/Makefile.top ./Makefile

it appears to work. And after superclean I'm seeing no extra stuff being left.
Comment by Aleksey Kondratenko [ 20/Mar/14 ]
It appears that the trouble with the instructions above was caused by repo's inability to replace cmake/CMakeLists.txt with a fresh copy. The old top-level makefile had a special rule to refresh the top makefile. In fact it still has that, but not for CMakeLists.





[MB-10517] XDCR: enable last write wins conflict resolution Created: 20/Mar/14  Updated: 23/Sep/14

Status: Open
Project: Couchbase Server
Component/s: clients, cross-datacenter-replication
Affects Version/s: 2.2.0, 2.5.0, 2.5.1, 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Today XDCR resolves conflicts based on revID. We need a way for client applications to predictably define which value should win under conflict. LWW (last write wins) brings a number of benefits; details are here:
http://hub.internal.couchbase.com/confluence/display/PM/XDCR+-+Last+Write+Wins+%28LWW%29+Conflict+Resolution


 Comments   
Comment by Cihan Biyikoglu [ 16/Jul/14 ]
Adding the clients label since there is a good possibility this will involve client-side requirements as well.




[MB-10511] Feature request for supporting rolling downgrades Created: 19/Mar/14  Updated: 11/Apr/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Abhishek Singh Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to

 Description   
Some customers are interested in Couchbase supporting rolling downgrades. Currently we can't add 2.2 nodes inside a cluster that has all nodes on 2.5.




[MB-10469] Support Couchbase Server on SuSE linux platform Created: 14/Mar/14  Updated: 09/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build, installer
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: SuSE linux platform

Issue Links:
Dependency
Duplicate

 Description   
Add support for SuSE Linux platform




[MB-10379] index is not used for simple query Created: 06/Mar/14  Updated: 28/May/14  Due: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: 2.5.0
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Iryna Mironava
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 64-bit

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
I created an index for the name field of bucket b0 and then a my_skill index for b0:
cbq> select * from :system.indexes
{
    "resultset": [
        {
            "bucket_id": "b0",
            "id": "#alldocs",
            "index_key": [
                "META().id"
            ],
            "index_type": "view",
            "name": "#alldocs",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
        {
            "bucket_id": "b0",
            "id": "my_name",
            "index_key": [
                "name"
            ],
            "index_type": "view",
            "name": "my_name",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
       {
            "bucket_id": "b0",
            "id": "my_skill",
            "index_key": [
                "skills"
            ],
            "index_type": "view",
            "name": "my_skill",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
        {
            "bucket_id": "b1",
            "id": "#alldocs",
            "index_key": [
                "META().id"
            ],
            "index_type": "view",
            "name": "#alldocs",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
        {
            "bucket_id": "default",
            "id": "#alldocs",
            "index_key": [
                "META().id"
            ],
            "index_type": "view",
            "name": "#alldocs",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "4"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "1.185438ms"
        }
    ]
}

I see my view in the UI and I can query it,
but explain says I am still using #alldocs:

cbq> explain select name from b0
{
    "resultset": [
        {
            "input": {
                "as": "b0",
                "bucket": "b0",
                "ids": null,
                "input": {
                    "as": "",
                    "bucket": "b0",
                    "cover": false,
                    "index": "#alldocs",
                    "pool": "default",
                    "ranges": null,
                    "type": "scan"
                },
                "pool": "default",
                "projection": null,
                "type": "fetch"
            },
            "result": [
                {
                    "as": "name",
                    "expr": {
                        "left": {
                            "path": "b0",
                            "type": "property"
                        },
                        "right": {
                            "path": "name",
                            "type": "property"
                        },
                        "type": "dot_member"
                    },
                    "star": false
                }
            ],
            "type": "projector"
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "1"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "1.236104ms"
        }
    ]
}
I see the same result for skills.


 Comments   
Comment by Sriram Melkote [ 07/Mar/14 ]
I think the current implementation considers secondary indexes only for filtering operations. When you do SELECT <anything> FROM <bucket>, it is a full bucket scan, and that is implemented by #alldocs and by #primary index only.

So the current behavior looks to be correct. Try running "CREATE PRIMARY INDEX USING VIEW" and please see if the query will then switch from #alldocs to #primary. Please also try adding a filter, like WHERE name > 'Mary' and see if the my_name index gets used for the filtering.

As a side note, what you're running is a covered query, where all the data necessary is held in a secondary index completely. However, this is not implemented. A secondary index is only used as an access path, and not as a source of data.
Comment by Gerald Sangudi [ 11/Mar/14 ]
This particular query will always use #primary or #alldocs. Even for documents without a "name" field, we return a result object that is missing the "name" field.

@Iryna, please test WHERE name IS NOT MISSING to see if it uses the index. If not, we'll fix that for DP4. Thanks.




[MB-10003] [Port-configurability] Non-root instances and multiple sudo instances in a box cannot be 'offline' upgraded Created: 24/Jan/14  Updated: 27/Mar/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Unix/Linux


 Description   
Scenario
------------
As of today, we do not support offline 'upgrade' per se for packages installed by non-root/sudo users. Upgrades are usually handled by package managers. Since these are absent for non-root users and rpm cannot handle more than a single package upgrade (if there are many instances running), offline upgrades are not supported (confirmed with Bin).

ALL non-root installations will be affected by this limitation. Although a single instance running on a box under sudo user can be offline upgraded, it cannot be extended to more than one such instance.

This is important

Workaround
-----------------
- Online upgrade (swap with nodes running latest build, take old nodes down and do clean install)
- Backup data and restore after fresh install (cbbackup and cbrestore)

Note : At this point, these are mere suggestions and both these workarounds haven't been tested yet.




[MB-9635] Audit logs for Admin actions Created: 22/Nov/13  Updated: 24/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Don Pinto
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Duplicate

 Description   
Couchbase Server should be able to produce an audit log for all admin actions, such as login/logout events, significant events (rebalance, failover, etc.), and so on.



 Comments   
Comment by Matt Ingenthron [ 13/Mar/14 ]
Note there isn't exactly a "login/logout" event. This is mostly by design. A feature like this could be added, but there may be better ways to achieve the underlying requirement. One suggestion would be to log initial activities instead of every activity and have a 'cache' for having seen that user agent within a particular window. That would probably meet most auditing requirements and is, I think, relatively straightforward to implement.
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
We have access.log implemented now, but it's not exactly the same as a full-blown audit log. In particular, we do log in access.log that a certain POST was handled, but we do not log any parameters of that action. So it doesn't count as a fully-featured audit log, I think.
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
Our access.log and ep-engine's access.log do not conflict, since they are necessarily in different directories.
Comment by Perry Krug [ 06/Jun/14 ]
They may not conflict in terms of unique names in the same directory, but to our customers it may be a little bit too close to remember which access.log does what...
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
Ok. Any specific proposals ?
Comment by Perry Krug [ 06/Jun/14 ]
Yes, as mentioned above, login.log would be one proposal but I'm not tied to it.
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
access.log has very little to do with logins. It's a full-blown equivalent of Apache's access.log.
Comment by Perry Krug [ 06/Jun/14 ]
Oh sorry, I misread this specific section.

How about audit.log? I know it's not fully "audit" but I'm just trying to avoid the name clash in our customer's minds...
Comment by Anil Kumar [ 09/Jun/14 ]
Agreed we should rename this file to audit.log to avoid any confusion. Updating the MB-10020 to make that change.
Comment by Larry Liu [ 10/Jun/14 ]
Hi, Anil

Does this feature satisfy PCI compliance?

Larry
Comment by Cihan Biyikoglu [ 11/Jun/14 ]
Hi Larry, PCI is a comprehensive set of requirements that go beyond database features. This does help with some parts of PCI, but compliance with PCI involves many additional controls, and most can be handled at the operational level or at the app level.
thanks




[MB-9446] there's chance of starting janitor while not having latest version of config (was: On reboot entire cluster , see many conflicting bucket config changes frequently.) Created: 30/Oct/13  Updated: 04/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.0, 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ketaki Gangal Assignee: Aliaksey Artamonau
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: build 0.0.0-7040toy

Triage: Triaged
Is this a Regression?: Yes

 Description   

Load items on a cluster, build toy-000-704.
Reboot the cluster.

Post reboot, see a lot of messages about conflicting bucket config in the web logs.

Cluster logs here: https://s3.amazonaws.com/bugdb/bug_9445/9435.tar

Sample

{fastForwardMap,undefined}]}]}]}, choosing the former, which looks newer.
ns_config003 ns_1@soursop-s11207.sc.couchbase.com 18:59:30 - Wed Oct 30, 2013
Conflicting configuration changes to field buckets:
{[{'ns_1@172.23.105.45',{5088,63550403967}},
{'ns_1@soursop-s11203.sc.couchbase.com',{1,63550403967}},
{'ns_1@soursop-s11204.sc.couchbase.com',{1764,63550403283}}],
[{'_vclock',[{'ns_1@172.23.105.45',{5088,63550403967}},
{'ns_1@soursop-s11203.sc.couchbase.com',{1,63550403967}},
{'ns_1@soursop-s11204.sc.couchbase.com',{1764,63550403283}}]},
{configs,[{"saslbucket",
[{uuid,<<"b51edfdad356db7e301d9b32c6ef47a3">>},
{num_replicas,1},
{replica_index,false},
{ram_quota,3355443200},
{auth_type,sasl},
{sasl_password,"password"},
{autocompaction,false},
{purge_interval,undefined},
{flush_enabled,false},
{num_threads,3},
{type,membase},
{num_vbuckets,1024},
{servers,['ns_1@soursop-s11203.sc.couchbase.com',
'ns_1@soursop-s11204.sc.couchbase.com',
'ns_1@soursop-s11205.sc.couchbase.com',
'ns_1@soursop-s11207.sc.couchbase.com']},
{map,[['ns_1@soursop-s11207.sc.couchbase.com',
'ns_1@soursop-s11205.sc.couchbase.com'],
['ns_1@soursop-s11207.sc.couchbase.com',
'ns_1@soursop-s11203.sc.couchbase.com'],
['ns_1@soursop-s11207.sc.couchbase.com',
'ns_1@soursop-s11204.sc.couchbase.com'],

 Comments   
Comment by Aleksey Kondratenko [ 30/Oct/13 ]
Very weird. But if it is indeed an issue, there's likely exactly the same issue on 2.5.0. And if that's the case, it looks pretty scary.
Comment by Aliaksey Artamonau [ 01/Nov/13 ]
I set affect version to 2.5 because I really know that it affects 2.5. And actually many preceding releases.
Comment by Maria McDuff (Inactive) [ 31/Jan/14 ]
Alk,

is this already merged in 2.5? pls confirm and mark as resolved if that's the case, assign back to QE.
Thanks.
Comment by Aliaksey Artamonau [ 31/Jan/14 ]
No, it's not fixed in 2.5.
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - 06/04/2014 Alk, Wayne, Parag, Anil




[MB-9321] Get us off erlang's global facility and re-elect failed master quickly and safely Created: 10/Oct/13  Updated: 23/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: bug-backlog, sherlock
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Duplicate
is duplicated by MB-9691 rebalance repeated failed when add no... Closed
Relates to
relates to MB-9691 rebalance repeated failed when add no... Closed
Triage: Triaged
Is this a Regression?: No

 Description   
We have a number of bugs due to the Erlang global facility or the related issue of not being able to spawn a new master quickly. E.g.:

* MB-7282 (erlang's global naming facility apparently drops globally registered service with actual service still alive (was: impossible to change settings/autoFailover after rebalance))

* MB-7168 [Doc'd 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)

* MB-8682 start rebalance request is hunging sometimes (looks like another global facility issue)

* MB-5622 Crash of master node may lead to autofailover in 2 minutes instead of configured shorter autofailover period or similarly slow manual failover

By getting us off global, we will fix all this issues.


 Comments   
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
This also includes making sure autofailover takes into account the time it takes for master election in case of a master crash.

Current thinking is that every node will run the autofailover service, but it will run only if it's on the master node. And we can have special code that speeds up master re-election if we detect that the master node is down.
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
Note that currently mb_master is the thing that first suffers when timeout-ful situation starts.

So we should look at making mb_master more robust if necessary
Comment by Aleksey Kondratenko [ 17/Oct/13 ]
I'm _really_ curious who makes decisions to move this into 2.5.0. Why. And why they think we have bandwidth to handle it.
Comment by Aleksey Kondratenko [ 09/Dec/13 ]
Workaround diag/eval snippet:

rpc:call(mb_master:master_node(), erlang, apply ,[fun () -> erlang:exit(erlang:whereis(mb_master), kill) end, []]).

Detection snippet:

F = (fun (Name) -> {Oks, NotOks} = rpc:multicall(ns_node_disco:nodes_actual(), global, whereis_name, [Name], 60000), case {lists:usort(Oks), NotOks} of {_, [_|_]} -> {failed_rpc, NotOks}; {[_], _} -> ok; {L, _} -> {different_answers, L} end end), [(catch {N, ok} = {N, F(N)}) || N <- [ns_orchestrator, ns_tick, auto_failover]].

Detection snipped should return:

 [{ns_orchestrator,ok},{ns_tick,ok},{auto_failover,ok}]

If not, there's decent chance that we're hitting this issue.
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
As part of that we'll likely have to revamp autofailover. And John Liang suggested a nice idea: a single ejected node could disable memcached traffic on itself to signal to smart clients that something notable occurred.

On Fri, Dec 13, 2013 at 11:48 AM, John Liang <john.liang@couchbase.com> wrote:
>> I.e. consider client that only updates vbucket map if it receives "not my vbucket". And consider 3 node cluster where 1 node is partitioned off other nodes but is accessible from client. Lets name this node C. And imagine that remaining two nodes did failover that node. It can be seen that client will happily continue using old vbucket map and reading/writing to/from node C, because it'll never get a single "not my vbucket" reply.

Thanks Alk. In this case, is there a reason not to change the vbucket state on the singly-partitioned node on auto-failover? There would still be a window for "data loss", but this window should be much smaller.

Yes we can do it. Good idea.
Comment by Perry Krug [ 13/Dec/13 ]
But if that node (C) is partitioned off...how will we be able to tell it to set those vbucket states? IMO, wouldn't it be better for the clients to implement a true quorum approach to the map when they detect that something isn't right? Or am I being too naive and missing something?
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
It's entirely possible that I misunderstood original text, but I understand it the following:

* when autofailover is enabled, every node observes whether it's alone. If a node finds itself alone and the usual autofailover threshold passes, that node can be somewhat sure that it was automatically failed over by the rest of the cluster

* when that happens, the node can either turn all vbuckets into replicas or disable traffic (similarly to what we're doing during flush).

There is of course a chance that all other nodes have truly failed and that the single node is all that's left. But it can be argued that in this case the amount of data loss is big enough anyway. And one node that artificially disables traffic doesn't change things much.

Regarding "quorum on clients". I've seen one proposal for that. And I don't think it's good idea. I.e. being in majority and being right are almost completely independent things. We can do far better than that. Particularly with CCCP we have rev field that gives sufficient ordering between bucket configurations.
Comment by Perry Krug [ 13/Dec/13 ]
My concern is that our recent observations of false-positive autofailovers may lead lots of individual nodes to decide that they have been isolated and disable their traffic...whether they've been automatically failed over or not.

As you know, one of the very nice safety nets of our autofailover is that it will not activate if it sees more than one node down at once which means that we can never do the wrong thing. If we allow one node to disable its traffic when it can't intelligently reason about the state of the rest of the cluster, IMO we go away from this safety net...no?
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
No. Because a node can only do that when it's sure that the other side of the cluster is not accessible. And it can recover its memcached traffic ASAP after it detects that the rest of the cluster is back.
Comment by Perry Krug [ 13/Dec/13 ]
But it can't ever be sure that the other side of the cluster is actually not accessible...clients may still be able to reach it right?

I'm thinking about some extreme corner cases...but what about the situation where two nodes of a >2-node cluster are completely isolated via some weird networking situation and yet are still reachable to the clients. Both of them would decide that they were isolated from the whole cluster, both of them would disable all their vbuckets and yet neither would be auto-failed over because the rest of the cluster would see two nodes down and not trigger the autofailover. I realize it's rare...but I bet there are less convoluted scenarios that would lead the software to do something undesirable.

I think this is a good discussion...but not directly relevant to the purpose of this bug which I believe is still an important fix that needs to be made. Do you want to take this discussion offline from this bug?
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
There are definitely ways this can backfire. But the tradeoffs are quite clear. You "buy" the ability to detect autofailovers (and only autofailovers in my words above, though this can potentially be extended to other cases), at the expense of a small chance of a node false-positively disabling its traffic, briefly and without data loss.

Thinking about this more, I now see that it's a less good idea than I thought, i.e. covering autofailover but not manual failover is not as interesting. But we can return to this discussion when the mb_master work is actually in progress.

Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Lowered to critical. It's not blocking anyone
Comment by Cihan Biyikoglu [ 23/Sep/14 ]
Hi Alk, should we move this to sherlock if you expect we'll be done with this by that point?
Comment by Aleksey Kondratenko [ 23/Sep/14 ]
Cihan, yes we hope to get at least parts of it "in" sherlock. But given big amount of work we would like avoid any promises.

I'm trying to represent this as "dual" fix version of bug-backlog and sherlock trying to say "either sherlock if it's ready or version after it".
Comment by Cihan Biyikoglu [ 23/Sep/14 ]
I think at this early stage we have that understanding that things may fall off if the train needs to leave the station.




[MB-9171] Error code : ehostunreach - Provide good error message during node addition Created: 24/Sep/13  Updated: 13/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.1.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 64bits / Couchbase 2.1 EE

Issue Links:
Duplicate
is duplicated by MB-11690 UI: Add Server - Unfriendly error mes... Resolved

 Description   
I am trying to add a node to a cluster and this node is down.

The error message is:
Attention - Failed to reach erlang port mapper at node "vm2". Error: ehostunreach

It would be great to have a user-friendly error explaining that this node is unreachable instead (or in addition to the code: ehostunreach).

 Comments   
Comment by Anil Kumar [ 20/Jun/14 ]
Can we fix the error message to below -


"Failed to add node "<node-name>" since its unreachable at the port"
Comment by Aleksey Kondratenko [ 11/Jul/14 ]
No we cannot
Comment by Aleksey Kondratenko [ 11/Jul/14 ]
More specifically we've agreed that:

* it was there since 1.6.x times

* we will not dumb-down error message. MS-style error reporting is not going to happen in couchbase ui on my watch

* we will special-case this particular situation, because of its frequency, to provide a tip to the user that the IP address might be incorrect or a firewall might be misconfigured or something like that. But all that is in addition to the error message we have today, which is perfect.
Comment by David Haikney [ 13/Jul/14 ]
The "erlang port mapper" is not a construct that an end-user typically has any visibility or control over - the error message is exposing an internal of the product. Of course it's useful for our own debugging purposes but not so much for end-users. Providing an explanation at a higher-level to the end-user along of the lines of "failed to contact the Couchbase service on that node" and some guidance as to what the common remediation would be ("please check the Couchbase service is running and the Couchbase network ports are accessible on that node" would be far more user-friendly.




[MB-9004] Frontend ops/sec drops by 5% - 15% during rebalance Created: 29/Aug/13  Updated: 13/Mar/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.2.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged

 Description   
This ticket is created to further address the frontend ops/sec drop during rebalance. Previously, we saw more than a 40-50% drop in frontend ops/sec during rebalance. Please refer to MB-7972 for more details.

We recently made a fix on the ep-engine side to address this issue for the 2.2.0 release, but still observed a 5% - 15% drop. We plan to address this issue further in the next major release.

 Comments   
Comment by Chiyoung Seo [ 01/Nov/13 ]
Move this to 3.0 release or later as it requires some thread scheduling changes.
Comment by Maria McDuff (Inactive) [ 10/Feb/14 ]
Chiyoung,

are there any fixes related to this issue that went into 3.0?




[MB-8915] Tombstone purger need to find a better home for lifetime of deletion Created: 21/Aug/13  Updated: 15/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication, storage-engine
Affects Version/s: 2.2.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Junyi Xie (Inactive) Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-8916 migration tool for offline upgrade Technical task Open Anil Kumar  
Triage: Untriaged

 Description   
=== copied and pasted from my email to a group of people; it should explain clearly why we need this ticket ===

Thanks for your comments. Probably it is easier to read in email than code review.

Let me explain a bit to see if we can be on the same page. First of all, the current resolution algorithm (comparing all fields) is still right; yes, there is a small chance we would touch fields after CAS, but for correctness we should have them there.

The cause of MB-8825 is that the tombstone purger uses the expiration time field to hold the purger-specific "lifetime of deletion". This is just a temporary solution because, IMHO, the expiration time of a key is not the right place for the "lifetime of deletion" (this is purely storage-specific metadata and, IMHO, should not be in ep-engine), but unfortunately today we cannot find a better place to put such info unless we change the storage format, which has too much overhead at this time. In the future, I think we need to figure out the best place for the "lifetime of deletion" and move it out of the key expiration time field.

In practice, today this temporary solution in the tombstone purger is OK in most cases because you rarely have a CAS collision for two deletions on the same key. But MB-8825 just hit that small dark area: when the destination tries to replicate a deletion from the source back to the source in bi-dir XDCR, both copies share the same (SeqNo, CAS) but have different expiration time fields (which hold not the exp time of the key, but the lifetime of deletion created by the tombstone purger); the exp time at the destination is sometimes bigger than that at the source, causing incorrect resolution results at the source. The problem exists for both CAPI and XMEM.

For backward compatibility:
1) If both sides are 2.2, we use the new resolution algorithm for deletions and we are safe.
2) If both sides are pre-2.2, since they do not have the tombstone purger, the current algorithm (comparing all fields) should be safe.
3) For a bi-dir XDCR between a pre-2.2 and a 2.2 cluster on CAPI, a deletion born at 2.2 replicating to pre-2.2 should be safe because there is no tombstone purger at pre-2.2. For deletions born at pre-2.2, we may see them bounced back from 2.2, but there should be no data loss since you just re-delete something already deleted.

This fix may not be perfect, but it is still much better than the issues in MB-8825. I hope in the near future we can find the right place for the "lifetime of deletion" in the tombstone purger.


Thanks,

Junyi
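
For illustration, a small Python sketch of the resolution described in the email above; the field names and values are hypothetical, not ep-engine's actual structures, and it only shows how reusing the expiration field for the purger's "lifetime of deletion" can flip the outcome when (SeqNo, CAS) tie:

    # Illustrative sketch only. Resolution compares seqno, then CAS, then the
    # remaining metadata fields.

    from collections import namedtuple

    Meta = namedtuple("Meta", ["seqno", "cas", "exptime", "flags"])

    def incoming_wins(incoming, existing):
        """Return True if the incoming mutation should replace the existing one."""
        return (incoming.seqno, incoming.cas, incoming.exptime, incoming.flags) > \
               (existing.seqno, existing.cas, existing.exptime, existing.flags)

    # Two copies of the *same* deletion: identical (seqno, CAS), but the tombstone
    # purger stamped a later "lifetime of deletion" into exptime at the destination,
    # so the copy bounced back over bi-dir XDCR incorrectly wins at the source.
    source_copy = Meta(seqno=10, cas=0x1234, exptime=1_376_000_000, flags=0)
    dest_copy   = Meta(seqno=10, cas=0x1234, exptime=1_376_100_000, flags=0)
    print(incoming_wins(dest_copy, source_copy))   # True: incorrect resolution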

 Comments   
Comment by Junyi Xie (Inactive) [ 21/Aug/13 ]
Anil and Dipti,

Please determine the priority of this task, and comment if I miss anything. Thanks.


Comment by Anil Kumar [ 21/Aug/13 ]
Upgrade - we need a migration tool (which we talked about) for the offline-upgrade case to move the data. Created a subtask for that.
Comment by Aaron Miller (Inactive) [ 17/Oct/13 ]
Considering that fixing this has lots of implications w.r.t. upgrade and all components that touch the file format, and that not fixing it is not causing any problems, I believe that this is not appropriate for 2.5.0
Comment by Junyi Xie (Inactive) [ 22/Oct/13 ]
I agree with Aaron that this may not be a small task and may have lots of implications to different components.

Anil, please reconsider if this is appropriate for 2.5. Thanks.
Comment by Anil Kumar [ 22/Oct/13 ]
Moved it to 3.0.
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
As "temporary head of xdcr for 3.0" I don't need this fixed in 3.0

And my guess is that after 3.0, when "the plan" for XDCR is ready, we'll just close it as won't-fix, but let's wait and see.
Comment by Cihan Biyikoglu [ 15/Jun/14 ]
Aaron is no longer here. Assigning to Chiyoung for consideration.




[MB-8832] Allow for some back-end setting to override hard limit on server quota being 80% of RAM capacity Created: 14/Aug/13  Updated: 28/May/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.1.0, 2.2.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Perry Krug Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Relates to
relates to DOC-27 Server Quota: Inconsistency between d... Open
Triage: Untriaged
Is this a Regression?: Yes

 Description   
At the moment, there is no way to override the 80% of RAM limit for the server quota. At very large node sizes, this can end up leaving a lot of RAM unused.

 Comments   
Comment by Aleksey Kondratenko [ 14/Aug/13 ]
Passing this to Dipti.

We've seen memory fragmentation easily reach 50% of memory usage. So even with 80% you can get into swap and badness.

I'd recommend _against_ this until we solve the fragmentation issues we have today.

Also keep in mind that today you _can_ raise this above all limits with a simple /diag/eval snippet.
Comment by Perry Krug [ 14/Aug/13 ]
We have seen this, I agree, but it's been fairly uncommon in production environments and is something that can be monitored and resolved when it does occur. On larger-RAM systems, I think we would be better served for most use cases by allowing more RAM to be used.

For example, 80% of 60GB is 48GB...leaving 12GB unused. Even worse for 256GB (leaving 50+GB unused)
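
A quick sketch of the arithmetic behind the examples above (Python, illustrative only), showing how much RAM is left outside the server quota at the default 80% cap for a few node sizes:

    # Headroom left outside the server quota at the default 80% cap.
    QUOTA_CAP = 0.80

    for total_gb in (60, 128, 256):
        quota_gb = total_gb * QUOTA_CAP
        print(f"{total_gb} GB node: quota {quota_gb:.0f} GB, "
              f"unused {total_gb - quota_gb:.0f} GB")
    # 60 GB node:  quota 48 GB,  unused 12 GB
    # 256 GB node: quota 205 GB, unused 51 GB  (the "50+GB" above)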
Comment by Aleksey Kondratenko [ 14/Aug/13 ]
And on a 256-gig machine fragmentation can be as big as 128 gigs (!). IMHO this is not about absolute numbers but about percentages. Anyway, Dipti will tell us what to do, but your numbers above are just saying how bad our _expected_ fragmentation is.
Comment by Perry Krug [ 14/Aug/13 ]
But that's where I disagree...I think it _is_ about absolute numbers. If we leave fragmentation out of it (since it's something we will fix eventually, something that is specific to certain workloads, and something that can be worked around via rebalancing), the point of this overhead was specifically to leave space available for the operating system and any other processes running outside of Couchbase. I'm sure you'd agree that Linux doesn't need anywhere near 50GB of RAM to run properly :) Even if we could decrease that by half it would provide huge savings in terms of hardware and costs to our users.

Is fragmentation the only concern of yours? If we were able to analyze a running production cluster to quantify the RAM fragmentation that exists and determine that it is within certain bounds...would it be okay to raise the quota above 80%?
Comment by Aleksey Kondratenko [ 14/Aug/13 ]
My point was that fragmentation is also a percentage, not an absolute. So with larger RAM, waste from fragmentation looks scarier.

Now that you're asking if that's my only concern I see that there's more.

Without sufficient space for the page cache, disk performance will suffer. How much we need to be at least on par with SQLite I cannot say. Nobody can, apparently. Things depend on whether you're going to do bgfetches or not.

Because if you do care about quick bgfetches (or, say, views and XDCR), then you may want to set the lowest possible quota and give as much RAM as possible to the page cache, hoping that at least all metadata is in the page cache.

If you do not care about residency of metadata, that means you don't care about btree leaves being page-cache-resident. But in order to remain IO-efficient you do need to keep the non-leaf nodes in the page cache. The issue is that with our append-only design nobody knows how well it works in practice, or exactly how much page cache you need to give to keep the few (perhaps hundreds of) megs of metadata-of-metadata page-cache resident. And quite possibly the "correct" recommendation is something like "you need XX percent of your data size for page cache to keep the disk subsystem efficient".
Comment by Perry Krug [ 14/Aug/13 ]
Okay, that does make a very good point.

But it also highlights the need for a flexible configuration on our end depending on the use case and the customer's needs. I.e., certain customers want to enforce that they are 100% resident, and to me that would mean giving Couchbase more than the default quota (while still keeping the potential for fragmentation in mind).
Comment by Patrick Varley [ 11/Feb/14 ]
MB-10180 is strongly related to this issue.
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Anil, pls see my comment on MB-10180.




[MB-8022] Fsync optimizations (remove double fsyncs) Created: 05/Feb/13  Updated: 01/Apr/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 2.0, 2.0.1, 2.1.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dipti Borkar Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: PM-PRIORITIZED
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Aaron Miller (Inactive) [ 28/Mar/13 ]
There is a toy build that Ronnie is testing to see the potential performance impacts of this (toy-aaron #1022).
Comment by Maria McDuff (Inactive) [ 10/Apr/13 ]
Jin will update the use-case scenario that QE will run.
Comment by Jin Lim [ 11/Apr/13 ]
This feature is to optimize disk writes from ep-engine/couchstore.

Any existing test that measures the disk drain rate should determine any tangible improvement from the feature.
Baseline:
* Heavy DGM
* Write heavy (read: 20%, write: 80%)
* Write I/O should be a mix of set/delete/update
* Measure disk drain rate and cbstats's kvtimings (writeTime, commit, save_documents)
Comment by Aaron Miller (Inactive) [ 11/Apr/13 ]
The most complicated part of this change is the addition of a corruption check that must be run the first time a file is opened after the server comes up, since we're buying these perf gains by playing a bit more fast and loose with the disk.

To check that this is behaving correctly we'll want to make sure that corrupting the most-recent transaction in a storage file rolls that transaction back.

This could be accomplished by updating an item that will land in a known vbucket, shutting down the server, and flipping some bits around the end of the file. The update should be rolled back when the server comes back up, and nothing should freak out :)

A position guaranteed to affect an item body from the most recent transaction is 4095 bytes behind the last position in the file that is a multiple of 4096, or: floor(file_length / 4096) * 4096 - 4095
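As a sketch of that test (Python, illustrative only; the data-file path is hypothetical and the server must be shut down first), compute the offset from the formula above and flip the bits of one byte there:

    # Sketch of the corruption test described above. Run only while the server
    # is down; assumes the file is at least one 4096-byte block long.

    import os

    def corrupt_last_transaction(path):
        length = os.path.getsize(path)
        offset = (length // 4096) * 4096 - 4095   # formula from the comment above
        with open(path, "r+b") as f:
            f.seek(offset)
            byte = f.read(1)
            f.seek(offset)
            f.write(bytes([byte[0] ^ 0xFF]))       # flip every bit in that byte
        return offset

    # Hypothetical path; adjust to the vbucket file holding the updated item.
    # corrupt_last_transaction("/opt/couchbase/var/lib/couchbase/data/default/0.couch.1")
    # After restart, the last update should be rolled back and nothing should freak out.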
Comment by Maria McDuff (Inactive) [ 16/Apr/13 ]
Abhinav,
will you be able to craft a test that involves this update to an item and manipulating the bits at EOF? This seems tricky. Let's discuss with Jin/Aaron.
Comment by Dipti Borkar [ 19/Apr/13 ]
I don't think this is user visible and so doesn't make sense to include in the release notes.
Comment by Maria McDuff (Inactive) [ 19/Apr/13 ]
aaron, pls assign back to QE (Abhinav) once you've merged the fix.
Comment by kzeller [ 22/Apr/13 ]
Updated 4/22 - No docs needed
Comment by Maria McDuff (Inactive) [ 22/Apr/13 ]
Aaron, can you also include the code changes for review here as soon as you have checked-in the fix?
thanks.
Comment by Maria McDuff (Inactive) [ 23/Apr/13 ]
deferred.
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Hi Aaron, are you working on this for 3.0? If yes, could you push this to fixversion=3.0?
Comment by Cihan Biyikoglu [ 01/Apr/14 ]
Chiyoung, pls close if this isn't relevant anymore, given this is a year old.




[MB-8686] CBHealthChecker - Fix fetching number of CPU processors Created: 23/Jul/13  Updated: 05/Jun/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.1.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Anil Kumar Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: customer
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-8817 REST API support to report number of ... Technical task Open Bin Cui  
Triage: Untriaged

 Description   
Issue reported by customer - cbhealthchecker report showing incorrect information for 'Minimum CPU core number required'.


 Comments   
Comment by Bin Cui [ 07/Aug/13 ]
It will depend on ns_server providing the number of CPU processors in the collected stats. Suggest pushing to the next release.
Comment by Maria McDuff (Inactive) [ 01/Nov/13 ]
per Bin:
Suggest pushing the following two bugs to the next release:
1. MB-8686: it depends on ns_server providing the capability to retrieve the number of CPU cores
2. MB-8502: caused by async communication between the main installer thread and the API to get status. The change would be dramatic for the installer.
 
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Bin,

Raising to Critical.
If this is still dependent on ns_server, pls assign to Alk.
This needs to be fixed for 3.0.
Comment by Anil Kumar [ 05/Jun/14 ]
We need this information to be provided from ns_server. Created ticket MB-11334.

Triage - June 05 2014: Bin, Anil, Tony, Ashvinder
Comment by Aleksey Kondratenko [ 05/Jun/14 ]
Ehm. I don't think it's a good idea to treat ns_server as a "provider of random system-level stats". I believe you'll need to find another way of getting it.
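For illustration only: since cbhealthchecker is a Python tool, one possible "other way" of getting the core count without going through ns_server would be to read it on the node itself; this is a sketch of that idea, not a statement of how the tool currently gathers stats:

    # Illustrative sketch: read the local core count instead of relying on
    # ns_server-collected stats.

    import multiprocessing

    def local_cpu_count(default=1):
        try:
            return multiprocessing.cpu_count()
        except NotImplementedError:
            return default

    print(local_cpu_count())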




[MB-8512] Size of the "value/doc" impacts a lot indexing time even when value/doc is not used Created: 25/Jun/13  Updated: 01/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.0.1
Fix Version/s: