[MB-11846] Compiling breakdancer test case exceeds available memory Created: 29/Jul/14  Updated: 25/Aug/14  Due: 30/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Chris Hillery Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
1. With memcached change 4bb252a2a7d9a369c80f8db71b3b5dc1c9f47eb9, cc1 on ubuntu-1204 quickly uses up 100% of the available memory (4GB RAM, 512MB swap) and crashes with an internal error.

2. Without Trond's change, cc1 compiles fine and never takes up more than 12% memory, running on the same hardware.

 Comments   
Comment by Chris Hillery [ 29/Jul/14 ]
Ok, weird fact - on further investigation, it appears that this is NOT happening on the production build server, which is an identically-configured VM. It only appears to be happening on the commit validation server ci03. I'm going to temporarily disable that machine so the next make-simple-github-tap test runs on a different ci server and see if it is unique to ci03. If it is I will lower the priority of the bug. I'd still appreciate some help in understanding what's going on either way.
Comment by Trond Norbye [ 30/Jul/14 ]
Please verify that the two builders have the same patch level so that we're comparing apples with apples.

It does bring up another interesting topic: should our builders just use the compiler provided with the installation, or should we have a reference compiler we use to build our code? It seems like a bad idea to have to support a ton of different compiler revisions (including the fact that they support different levels of C++11 that we have to work around).
Comment by Chris Hillery [ 31/Jul/14 ]
This is now occurring on other CI build servers in other tests - http://www.couchbase.com/issues/browse/CBD-1423

I am bumping this back to Test Blocker and I will revert the change as a work-around for now.
Comment by Chris Hillery [ 31/Jul/14 ]
Partial revert committed to memcached master: http://review.couchbase.org/#/c/40152/ and 3.0: http://review.couchbase.org/#/c/40153/
Comment by Trond Norbye [ 01/Aug/14 ]
That review in memcached should NEVER have been pushed through. Its subject line is too long
Comment by Chris Hillery [ 01/Aug/14 ]
If there's a documented standard out there for commit messages, my apologies; it was never revealed to me.
Comment by Trond Norbye [ 01/Aug/14 ]
When it doesn't fit within a terminal window there is a problem; it is much better to use multiple lines.

In addition, I'm not happy with the fix. Instead of deleting the line, it should have checked for an environment variable so that people could explicitly disable it. This is why we have review cycles.
Comment by Chris Hillery [ 01/Aug/14 ]
I don't think I want to get into style arguments. If there's a standard I'll use it. In the meantime I'll try to keep things to 72-character lines.

As to the content of the change, it was not intended to be a "fix"; it was a simple revert of a change that was provably breaking other jobs. I returned the code to its previous state, nothing more or less. And especially given the time crunch of the beta (which is supposed to be built tomorrow), waiting for a code review on a reversion is not in the cards.
Comment by Trond Norbye [ 01/Aug/14 ]
The normal way of doing a revert is to use git revert (which, as an extra bonus, records the revert in the commit message).
Comment by Trond Norbye [ 01/Aug/14 ]
http://review.couchbase.org/#/c/40165/
Comment by Chris Hillery [ 01/Aug/14 ]
1. Your fix is not correct, because simply adding -D to cmake won't cause any preprocessor defines to be created. You need to have some CONFIGURE_FILE() or similar to create a config.h using #cmakedefine. As it is there is no way to compile with your change.

2. The default behaviour should not be the one that is known to cause problems. Until and unless there is an actual fix for the problem (whether or not that is in the code), the default should be to keep the optimization, with an option to let individuals bypass that if they desire and accept the risks.

3. Characterizing the problem as "misconfigured VMs" is, at best, premature.

I will revert this change again on the 3.0 branch shortly, unless you have a better suggestion (I'm definitely all ears for a better suggestion!).
Comment by Trond Norbye [ 01/Aug/14 ]
If you look at the change, the -D is passed over into CMAKE_C_FLAGS, so it ends up in the compiler flags and is passed on to the compilation.

As for misconfiguration, it is either insufficient resources on the VM or a "broken" compiler version installed there.
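To make the distinction being discussed concrete, here is a minimal sketch; the flag name SOME_DEFINE is hypothetical and not the actual define from the change:

    # Passing -D straight to cmake only creates a CMake cache variable; the C
    # preprocessor never sees it:
    cmake -DSOME_DEFINE=1 ..

    # Pushing it through CMAKE_C_FLAGS does reach the compiler command line, but
    # it may clobber other C flags supplied the same way by the rest of the build:
    cmake -DCMAKE_C_FLAGS="-DSOME_DEFINE=1" ..

    # Checking an environment variable at build/test time, as suggested earlier,
    # avoids touching the compiler flags entirely:
    SOME_DEFINE=1 make test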
Comment by Trond Norbye [ 01/Aug/14 ]
Can I get login credentials to the server where it fails, and an identical VM where it succeeds?
Comment by Chris Hillery [ 01/Aug/14 ]
[CMAKE_C_FLAGS] Fair enough, I did misread that. That's not really a sufficient workaround, though. Doing that may overwrite other CFLAGS set by other parts of the build process.

I still maintain that the default behaviour should be the known-working version. However, for the moment I have temporarily locked the rel-3.0.0.xml manifest to the revision before my revert (ie, to 5cc2f8d928f0eef8bddbcb2fcb796bc5e9768bb8), so I won't revert anything else until that has been tested.

The only VM I know of at the moment where we haven't seen build failures is the production build slave. I can't give you access to that tonight as we're in crunch mode to produce a beta build. Let's plan to hook up next week and do some exploration.
Comment by Volker Mische [ 01/Aug/14 ]
There are commit message guidelines. At the bottom of

http://www.couchbase.com/wiki/display/couchbase/Contributing+Changes

links to:

http://en.wikibooks.org/wiki/Git/Introduction#Good_commit_messages
Comment by Trond Norbye [ 01/Aug/14 ]
I've not done anything on the 3.0.0 branch, the fix going forward is for 3.0.1 and trunk. Hopefully the 3.0 branch will die relatively soon since we've got a lot of good stuff in the 3.0.1 branch.

The "workaround" is not intended as a permanent solution; it's just until the VMs are fixed. I've not been able to reproduce this issue on my CentOS, Ubuntu, Fedora or SmartOS builders. They're running in the following VMs:

[root@00-26-b9-85-bd-92 ~]# vmadm list
UUID TYPE RAM STATE ALIAS
04bf8284-9c23-4870-9510-0224e7478f08 KVM 2048 running centos-6
7bcd48a8-dcc2-43a6-a1d8-99fbf89679d9 KVM 2048 running ubuntu
c99931d7-eaa3-47b4-b7f0-cb5c4b3f5400 KVM 2048 running fedora
921a3571-e1f6-49f3-accb-354b4fa125ea OS 4096 running compilesrv
Comment by Trond Norbye [ 01/Aug/14 ]
I need access to two identical configured builders where one may reproduce the error and one where it succeeds.
Comment by Volker Mische [ 01/Aug/14 ]
I would also add that I think it is about bad VMs. On commit validation we have 6 VMs; it always failed with this error only on ubuntu-1204-64-ci-01 and never on the others (ubuntu-1204-64-ci-02 - 06).
Comment by Chris Hillery [ 01/Aug/14 ]
That's not correct. The problem originally occurred on ci-03.
Comment by Volker Mische [ 01/Aug/14 ]
Then I need to correct myself: my comment only holds true for the couchdb-gerrit-300 job.
Comment by Trond Norbye [ 01/Aug/14 ]
Can I get login creds to one that it fails on, while I'm waiting for access to one that it works on?
Comment by Volker Mische [ 01/Aug/14 ]
I don't know about creds (I think my normal user login works). The machine details are here: http://factory.couchbase.com/computer/ubuntu-1204-64-ci-01/
Comment by Chris Hillery [ 01/Aug/14 ]
Volker - it was initially detected in the make-simple-github-tap job, so it's not unique to couchdb-gerrit-300 either. Both jobs pretty much just check out the code and build it, though; they're pretty similar.
Comment by Trond Norbye [ 01/Aug/14 ]
Adding swap space to the builder makes the compilation pass. I've been trying to figure out how to get gcc to print more information about each step (the -ftime-report memory usage didn't match the process usage at all ;-))
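As a sketch of another way to measure this, the peak memory of the whole compile can be captured with GNU time rather than gcc's own reports; the file name generated_testsuite.c is illustrative only:

    # Peak RSS of the compile (GNU time, not the shell builtin); includes cc1:
    /usr/bin/time -v gcc -c generated_testsuite.c 2>&1 | grep "Maximum resident set size"

    # gcc can also print per-pass memory statistics with -fmem-report,
    # in addition to -ftime-report:
    gcc -c -fmem-report generated_testsuite.c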
Comment by Anil Kumar [ 12/Aug/14 ]
Adding the component as "build". Let me know if that's not correct.




[MB-10156] "XDCR - Cluster Compare" support tool Created: 07/Feb/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Cihan Biyikoglu Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: 2.5.1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
For the recent issues we have seen, we need a tool that can compare metadata (specifically revids) for a given replication definition in XDCR. To scale to large data sizes, being able to do this per vbucket or per doc range would be great, but we can do without these. For clarity, here is a high-level description.

Ideal case:
xdcr_compare cluster1_connectioninfo cluster1_bucketname cluster2connectioninfo cluster2_bucketname [vbucketid] [keyrange]
should return a line per docid for each row where cluster1 metadata and cluster2 metadata for the given key differ.
docID - cluster1_metadata cluster2_metadata

Simplification: the tool is expected to return false positives in a moving system, but we will tackle that by rerunning the tool multiple times.

 Comments   
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Aaron, do you have a timeline for this?
thanks
-cihan
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan,

For test automation/verification, can you list out the stats/metadata that we should be testing specifically?
we want to create/implement the tests accordingly.


Also -- is this tool de-coupled from the server package? or is this part of rpm/deb/.exe/osx build package?

Thanks,
Maria
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
This depends on the requirements; a tool that requires the manual collection of all data from all nodes in both clusters onto one machine (like we've done recently) could be done pretty quickly, but I imagine that may be difficult or entirely infeasible for some users.

Better would be to be able to operate remotely on clusters and only look at metadata. Unfortunately there is no *currently exposed* interface to only extract metadata from the system without also retrieving values. I may be able to work around this, but the workaround is unlikely to be simple.

Also, for some users even the amount of *metadata* may be prohibitively large to transfer all to one place; this too can be avoided, but again, adds difficulty.

Q: Can the tool be JVM-based?
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
I think it would be more feasible for this to ship separately from the server package.
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan, Aaron,

If it's de-coupled, what older versions of Couchbase would this tool support? As far back as 1.8.x? Please confirm, as this would expand our backward-compatibility testing for this tool.
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
Well, 1.8.x didn't have XDCR or the rev field; It can't be compatible with anything older than 2.0 since it operates mostly to check things added since 2.0.

I don't know how far back it needs to go but it *definitely* needs to be able to run against 2.2
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Agree with Aaron, let's keep this lightweight. Can we depend on Aaron for testing if this will initially be just a support tool? For 3.0, we may graduate the tool to the server-shipped category.
thanks
Comment by Sangharsh Agarwal [ 27/Feb/14 ]
Cihan, Is the Spec finalized for this tool in version 2.5.1?
Comment by Cihan Biyikoglu [ 27/Feb/14 ]
Sangharsh, for 2.5.1, we wanted to make this an "Aaron tested" tool. I believe Aaron already has the tool. Aaron?
Comment by Aaron Miller (Inactive) [ 27/Feb/14 ]
Working on it; wanted to get my actually-in-the-package 2.5.1 stuff into review first.

What I do already have is a diff tool for *files*, but it is highly inconvenient to use; this should be a tool that doesn't require collecting all data files into one place in order to use it, and instead can work against a running cluster.
Comment by Maria McDuff (Inactive) [ 05/Mar/14 ]
Aaron,

Is the tool merged into the build yet? Can you update, please?
Comment by Cihan Biyikoglu [ 06/Mar/14 ]
2.5.1 shiproom note: Phil raised a build concern on getting this packaged with 2.5.1. The initial bar we set was not to ship this as part of the server - it was intended to be a downloadable support tool. Aaron/Cihan will re-eval and get back to shiproom.
Comment by Cihan Biyikoglu [ 15/Jun/14 ]
Aaron is no longer here. Assigning to Xiaomei for consideration.




[MB-10719] Missing autoCompactionSettings during create bucket through REST API Created: 01/Apr/14  Updated: 19/Jun/14

Status: Reopened
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: michayu Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File bucket-from-API-attempt1.txt     Text File bucket-from-API-attempt2.txt     Text File bucket-from-API-attempt3.txt     PNG File bucket-from-UI.png     Text File bucket-from-UI.txt    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Unless I'm not using the API correctly, there seem to be some holes in the Couchbase API – particularly with autoCompaction.

The autoCompaction parameter can be set via the UI (as long as the bucketType is couchbase).

See the following attachments:
1) bucket-from-UI.png
2) bucket-from-UI.txt

And compare with creating the bucket (with autoCompaction) through the REST API:
1) bucket-from-API-attempt1.txt
    - Reference: http://docs.couchbase.com/couchbase-manual-2.5/cb-rest-api/#creating-and-editing-buckets
2) bucket-from-API-attempt2.txt
    - Reference: http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-admin-rest-auto-compaction
3) bucket-from-API-attempt3.txt
    - Setting autoCompaction globally
    - Reference: http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-admin-rest-auto-compaction

In all cases, autoCompactionSettings is still false.


 Comments   
Comment by Anil Kumar [ 19/Jun/14 ]
Triage - June 19 2014 Alk, parag, Anil
Comment by Aleksey Kondratenko [ 19/Jun/14 ]
It works, just apparently not properly documented:

# curl -u Administrator:asdasd -d name=other -d bucketType=couchbase -d ramQuotaMB=100 -d authType=sasl -d replicaNumber=1 -d replicaIndex=0 -d parallelDBAndViewCompaction=true -d purgeInterval=1 -d 'viewFragmentationThreshold[percentage]'=30 -d autoCompactionDefined=1 http://lh:9000/pools/default/buckets

And a general hint: you can watch what the browser POSTs when it creates a bucket (or does anything else) to figure out a working (but not necessarily publicly supported) way of doing things.
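Presumably the result can then be verified by reading the bucket details back and checking the autoCompactionSettings field mentioned in the description (same host, credentials and bucket name as in the example above; the exact response shape is assumed, so treat this as a sketch):

    curl -s -u Administrator:asdasd http://lh:9000/pools/default/buckets/other | \
        python -mjson.tool | grep -A1 autoCompactionSettings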
Comment by Anil Kumar [ 19/Jun/14 ]
Ruth - the documentation references above need to be fixed with the correct REST API.




[MB-9358] while running concurrent queries(3-5 queries) getting 'Bucket X not found.' error from time to time Created: 16/Oct/13  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 64 bit

Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
one thread gives correct result:
[root@localhost tuqtng]# curl 'http://10.3.121.120:8093/query?q=SELECT+META%28%29.cas+as+cas+FROM+bucket2'
{
    "resultset": [
        {
            "cas": 4.956322522514292e+15
        },
        {
            "cas": 4.956322525999292e+15
        },
        {
            "cas": 4.956322554862292e+15
        },
        {
            "cas": 4.956322832498292e+15
        },
        {
            "cas": 4.956322835757292e+15
        },
        {
            "cas": 4.956322838836292e+15
...

    ],
    "info": [
        {
            "caller": "http_response:152",
            "code": 100,
            "key": "total_rows",
            "message": "0"
        },
        {
            "caller": "http_response:154",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "405.41885ms"
        }
    ]
}

but in another I see
{
    "error":
        {
            "caller": "view_index:195",
            "code": 5000,
            "key": "Internal Error",
            "message": "Bucket bucket2 not found."
        }
}

cbcollect will be attached

 Comments   
Comment by Marty Schoch [ 16/Oct/13 ]
This is a duplicate, though I can't yet find the original.

We believe that under higher load the view queries time out, which we report as bucket not found (it may not be possible to distinguish the two).
Comment by Iryna Mironava [ 16/Oct/13 ]
https://s3.amazonaws.com/bugdb/jira/MB-9358/447a45ae/10.3.121.120-10162013-858-diag.zip
Comment by Ketaki Gangal [ 17/Oct/13 ]
Seeing these errors and frequent tuq-server crashes on concurrent queries during typical server operations like
- w/ Failovers
- w/ Backups
- w/ Indexing.

Similar server ops for single queries however seem to run okay.

Note: this is a very small number of concurrent queries (3-5); users may typically have a higher level of concurrency if used at an application level.




[MB-9145] Add option to download the manual in pdf format (as before) Created: 17/Sep/13  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 2.0, 2.1.0, 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
On the documentation site there is no option to download the manual in pdf format as before. We need to add this option back.

 Comments   
Comment by Maria McDuff (Inactive) [ 18/Sep/13 ]
Needed for the 2.2.1 bug-fix release.




[MB-8838] Security Improvement - Connectors to implement security improvements Created: 14/Aug/13  Updated: 19/May/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Security Improvement - Connectors to implement security improvements

Spec ToDo.




[MB-9415] auto-failover in seconds - (reduced from minimum 30 seconds) Created: 21/May/12  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0, 1.8.1, 2.0, 2.0.1, 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Dipti Borkar Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 2
Labels: customer, ns_server-story
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-9416 Make auto-failover near immediate whe... Technical task Open Aleksey Kondratenko  

 Description   
including no false positives

http://www.pivotaltracker.com/story/show/25006101

 Comments   
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
At the very least it requires getting our timeout-prone cases under control, so splitting couchdb into a separate VM is at least a requirement for this, but not necessarily enough.
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
Still seeing misunderstanding on this one.

So we have a _different_ problem: even manual failover (let alone automatic) cannot succeed quickly if the master node fails. It can easily take up to 2 minutes because of our use of the erlang "global" facility, which requires us to detect that the node is dead, and erlang is tuned to detect that within 2 minutes.

Now _this_ problem is about lowering autofailover detection to 10 seconds. We can blindly make it happen today, but it will not be usable because of all sorts of timeouts happening in the cluster-management layer. We have a significant proportion of CBSEs _today_ about false-positive autofailovers even with the 30-second threshold; clearly lowering it to 10 will only make it worse. Therefore my point above: we have to get those timeouts under control so that heartbeats (or whatever else we use to detect a node being unresponsive) are sent and received in a timely fashion.

I would like to note, however, that especially in some (arguably oversubscribed) virtualized environments we saw delays as high as low tens of seconds from virtualization _alone_. Given the relatively high cost of failover in our software, I'd like to point out that people could too easily abuse that feature.

The high cost of failover referred to above is this:

* You almost certainly and irrecoverably lose some recent mutations. _At least_ recent mutations, i.e. if replication is really working well. On a node that's on the edge of autofailover you can imagine replication not being "diamond-hard quick". That's cost 1.

* In order to return the node to the cluster (say the node crashed and needed some time to recover, whatever that might mean) you need a rebalance. That type of rebalance is relatively quick by design, i.e. it only moves data back to this node and nothing else, but it's still a rebalance. With UPR we can possibly make it better, because its failover log is capable of rewinding just the conflicting mutations.

What I'm trying to say with "our approach appears to have a relatively high price for failover" is that this appears to be an inherent issue for a strongly consistent system. In many cases it might actually be better to wait up to a few minutes for the node to recover and restore its availability than to fail it over and pay the price of restoring cluster capacity (by rebalancing this node, or its replacement, back in, whichever it is). If somebody wants stronger availability, then other approaches that can "reconcile" changes from both the failed-over node and its replacement node look like a fundamentally better choice _for those requirements_.




[MB-4030] enable traffic for for ready nodes even if not all nodes are up/healthy/ready (aka partial janitor) (was: After two nodes crashed, curr_items remained 0 after warmup for extended period of time) Created: 06/Jul/11  Updated: 20/May/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.1, 2.0, 2.0.1, 2.2.0, 2.1.1, 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
we had two nodes crash at a customer, possibly related to a disk space issue, but I don't think so.

After they crashed, the nodes warmed up relatively quickly, but immediately "discarded" their items. I say that because I see that they warmed up ~10m items, but the current item counts were both 0.

I tried shutting down the service and had to kill memcached manually (kill -9). Restarting it went through the same process of warming up and then nothing.

While I was looking around, I left it sit for a little while and magically all of the items came back. I seem to recall this bug previously where a node wouldn't be told to be active until all the nodes in the cluster were active...and it got into trouble when not all of the nodes restarted.

Diags for all nodes will be attached

 Comments   
Comment by Perry Krug [ 06/Jul/11 ]
Full set of logs at \\corp-fs1\export_support_cases\bug_4030
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
It _is_ an ns_server issue, caused by the janitor needing all nodes to be up for vbucket activation. We planned a fix for 1.8.1 (now 1.8.2).
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
Fix would land as part of fast warmup integration
Comment by Perry Krug [ 18/Jul/12 ]
Peter, can we get a second look at this one? We've seen this before, and the problem is that the janitor did not run until all nodes had joined the cluster and warmed up. I'm not sure we've fixed that already...
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
Latest 2.0 will mark nodes as green and enable memcached traffic when all of them are up. So easy part is done.

Partial janitor (i.e. enabling traffic for some nodes when others are still down/warming up) is something that will unlikely be done soon
Comment by Perry Krug [ 18/Jul/12 ]
Thanks Alk...what's the difference in behavior (in this area) between 1.x and 2.0? It "sounds" like they're the same, no?

And this bug should still remain open until we fix the primary issue which is the partial janitor...correct?
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
1.8.1 will show a node as green when ep-engine thinks it's warmed up, but confusingly it will not really be ready: all vbuckets will be in state dead and curr_items will be 0.

2.0 fixes this confusion. A node is marked green when it's actually warmed up from the user's perspective, i.e. the right vbucket states are set and it will serve client traffic.

2.0 is still very conservative, only making vbucket state changes when all nodes are up and warmed up. That's the "impartial" janitor. Whether it's a bug or a "lack of feature" is debatable, but I think the main concern, that users are confused by the green-ness of nodes, is resolved.
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
Closing as fixed. We'll get to the partial janitor some day in the future; it is a feature we lack today, not a bug we have, IMHO.
Comment by Perry Krug [ 12/Nov/12 ]
Reopening this for the need for partial janitor. Recent customer had multiple nodes need to be hard-booted and none returned to service until all were warmed up
Comment by Steve Yen [ 12/Nov/12 ]
bug-scrub: moving out of 2.0, as this looks like a feature req.
Comment by Farshid Ghods (Inactive) [ 13/Nov/12 ]
In system testing we have noticed many times that if multiple nodes crash, then until all nodes are warmed up the node status for those that have already warmed up appears as yellow.


The user won't be able to tell from the console which node has successfully warmed up, and if one node is actually not recovering or not warming up in a reasonable time, they have to figure it out some other way (cbstats ...).

Another issue with this is that the user won't be able to perform a failover of one node even though N-1 nodes have already warmed up.

I am not sure if fixing this bug will impact cluster-restore functionality, but it is important to fix it or suggest a workaround to the user (by workaround I mean a documented, tested and supported set of commands).
Comment by Mike Wiederhold [ 17/Mar/13 ]
Comments say this is an ns_server issue so I am removing couchbase-bucket from affected components. Please re-add if there is a couchbase-bucket task for this issue.
Comment by Aleksey Kondratenko [ 23/Feb/14 ]
Not going to happen for 3.0.




[MB-10838] cbq-engine must work without all_docs Created: 11/Apr/14  Updated: 29/Jun/14  Due: 07/Jul/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: tried builds 3.0.0-555 and 3.0.0-554

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
WORKAROUND: Run "CREATE PRIMARY INDEX ON <bucket>" once per bucket, when using 3.0 server

SYMPTOM: tuq returns Bucket default not found.', u'caller': u'view_index:200 for all queries

single node cluster, 2 buckets(default and standard)
run simple query
q=FROM+default+SELECT+name%2C+email+ORDER+BY+name%2Cemail+ASC

got {u'code': 5000, u'message': u'Bucket default not found.', u'caller': u'view_index:200', u'key': u'Internal Error'}
tuq displays
[root@grape-001 tuqtng]# ./tuqtng -couchbase http://localhost:8091
22:36:07.549322 Info line disabled false
22:36:07.554713 tuqtng started...
22:36:07.554856 version: 0.0.0
22:36:07.554942 site: http://localhost:8091
22:47:06.915183 ERROR: Unable to access view - cause: error executing view req at http://127.0.0.1:8092/default/_all_docs?limit=1001: 500 Internal Server Error - {"error":"noproc","reason":"{gen_server,call,[undefined,bytes,infinity]}"}
 -- couchbase.(*viewIndex).ScanRange() at view_index.go:186
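For reference, a sketch of applying the workaround over HTTP, assuming the query engine's /query endpoint on port 8093 as seen in MB-9358 (the two bucket names are the ones mentioned above; statements shown URL-encoded):

    curl 'http://localhost:8093/query?q=CREATE+PRIMARY+INDEX+ON+default'
    curl 'http://localhost:8093/query?q=CREATE+PRIMARY+INDEX+ON+standard'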


 Comments   
Comment by Sriram Melkote [ 11/Apr/14 ]
Iryna, can you please add cbcollectinfo or at least the couchdb logs?

Also, all CBQ DP4 testing must be done against 2.5.x server, please confirm it is the case in this bug.
Comment by Iryna Mironava [ 22/Apr/14 ]
cbcollect
https://s3.amazonaws.com/bugdb/jira/MB-10838/9c1cf39c/172.27.33.17-4222014-111-diag.zip

bug is valid only for 3.0. 2.5.x versions are working fine
Comment by Sriram Melkote [ 22/Apr/14 ]
Gerald, we need to update query code to not use _all_docs for 3.0

Iryna, workaround is to run "CREATE PRIMARY INDEX ON <bucket>" first before running any queries when using 3.0 server
Comment by Sriram Melkote [ 22/Apr/14 ]
Reducing severity with workaround. Please ping me if that doesn't work
Comment by Iryna Mironava [ 22/Apr/14 ]
works with workaround
Comment by Gerald Sangudi [ 22/Apr/14 ]
Manik,

Please modify the tuqtng / DP3 Couchbase catalog to return an error telling the user to CREATE PRIMARY INDEX. This should only happen with 3.0 server. For 2.5.1 or below, #all_docs should still work.

Thanks.




[MB-11736] add client SSL to 3.0 beta documentation Created: 15/Jul/14  Updated: 15/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Matt Ingenthron Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
This is mostly a curation exercise. Add to the server 3.0 beta docs the configuration information for each of the following clients:
- Java
- .NET
- PHP
- Node.js
- C/C++

No other SDKs support SSL at the moment.

This is either in work-in-progress documentation or in the blogs from the various DPs. Please check in with the component owner if you can't find what you need.




[MB-10180] Server Quota: Inconsistency between documentation and CB behaviour Created: 11/Feb/14  Updated: 21/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Dave Rigby Assignee: Ruth Harris
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File MB-10180_max_quota.png    
Issue Links:
Relates to
relates to MB-2762 Default node quota is still too high Resolved
relates to MB-8832 Allow for some back-end setting to ov... Open
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
In the documentation for the product (and general sizing advice) we tell people to allocate no more than 80% of their memory for the Server Quota, to leave headroom for the views, disk write queues and general OS usage.

However on larger[1] nodes we don't appear to enforce this, and instead allow people to allocate up to 1GB less than the total RAM.

This is inconsistent, as we document and tell people one thing and let them do another.

This appears to be something inherited from MB-2762, the intent of which appeared to be to only allow relaxing this when joining a cluster; however, this doesn't appear to be how it works - I can successfully change the existing cluster quota from the CLI to a "large" value:

    $ /opt/couchbase/bin/couchbase-cli cluster-edit -c localhost:8091 -u Administrator -p dynam1te --cluster-ramsize=127872
    ERROR: unable to init localhost (400) Bad Request
    [u'The RAM Quota value is too large. Quota must be between 256 MB and 127871 MB (memory size minus 1024 MB).']

While I can see some logic to relax the 80% constraint on big machines, with the advent of 2.X features 1024MB seems far too small an amount of headroom.

Suggestions to resolve:

A) Revert to a straightforward 80% max, with a --force option or similar to allow specific customers to go higher if they know what they are doing
B) Leave current behaviour, but document it.
C) Increase minimum headroom to something more reasonable for 2.X, *and* document the behaviour.

([1] On a machine with 128,895MB of RAM I get the "total-1024" behaviour, on a 1GB VM I get 80%. I didn't check in the code what the cutoff for 80% / total-1024 is).


 Comments   
Comment by Dave Rigby [ 11/Feb/14 ]
Screenshot of initial cluster config: maximum quota is total_RAM-1024
Comment by Aleksey Kondratenko [ 11/Feb/14 ]
Do not agree with that logic.

There's IMHO quite a bit of difference between default settings, the recommended settings limit and the allowed settings limit. The latter can be wider for folks who really know what they're doing.
Comment by Aleksey Kondratenko [ 11/Feb/14 ]
Passed to Anil, because that's not my decision to change limits
Comment by Dave Rigby [ 11/Feb/14 ]
@Aleksey: I'm happy to resolve as something other than my (A,B,C), but the problem here is that many people haven't even been aware of this "extended" limit in the system - and moreover on a large system we actually advertise it in the GUI when specifying the allowed limit (see attached screenshot).

Furthermore, I *suspect* that this was originally only intended for upgrades for 1.6.X (see http://review.membase.org/#/c/4051/), but somehow is now being permitted for new clusters.

Ultimately I don't mind what our actual max quota value is, but the app behaviour should be consistent with the documentation (and the sizing advice we give people).
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
raising to product blocker.
this inconsistency has to be resolved - PM to re-align.
Comment by Anil Kumar [ 28/May/14 ]
Going with option B - Leave current behaviour, but document it.
Comment by Ruth Harris [ 17/Jul/14 ]
I only see the 80% number coming up as an example of setting the high water mark (85% suggested). The Server Quota section doesn't mention anything. The working set management & ejection section(s) and item pager sub-section also mention the high water mark.

Can you be more specific about where this information is? Anyway, the best solution is to add a "note" in the applicable section(s).

--ruth

Comment by Dave Rigby [ 21/Jul/14 ]
@Ruth: So the current product behaviour is that the ServerQuota limit depends on the maximum memory available:

* For machines with <= X MB of memory, the maximum server quota is 80% of total physical memory
* For machines with > X MB of memory, the maximum Server Quota is Total Physical Memory - 1024.

The value of 'X' is fixed in the code, but it wasn't obvious what it actually is (it's derived from a few different things). I suggest you ask Alk, who should be able to provide the value.
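Spelled out as a calculation, the behaviour described above looks roughly like this (a sketch only; the cutoff value used in the example call is a placeholder, not the real X):

    # max_quota_mb TOTAL_MB CUTOFF_MB -- maximum Server Quota per the rules above.
    max_quota_mb() {
        local total=$1 cutoff=$2
        if [ "$total" -le "$cutoff" ]; then
            echo $(( total * 80 / 100 ))   # 80% rule for smaller machines
        else
            echo $(( total - 1024 ))       # total minus 1024 MB for larger machines
        fi
    }

    # The 128,895 MB machine from the description is above the (placeholder) cutoff,
    # so this prints 127871 (128895 - 1024), matching the couchbase-cli error message.
    max_quota_mb 128895 4096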




[MB-11060] Build and test 3.0 for 32-bit Windows Created: 06/May/14  Updated: 13/Aug/14  Due: 09/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build, ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Blocker
Reporter: Chris Hillery Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7/8 32-bit

Issue Links:
Dependency
Duplicate

 Description   
For the "Developer Edition" of Couchbase Server 3.0 on Windows 32-bit, we need to first ensure that we can build 32-bit-compatible binaries. It is not possible to build 3.0 on a 32-bit machine due to the MSVC 2013 requirement. Hence we need to configure MSVC as well as Erlang on a 64-bit machine to produce 32-bit compatible binaries.

 Comments   
Comment by Chris Hillery [ 06/May/14 ]
This is assigned to Trond who is already experimenting with this. He should:

 * test being able to start the server on a 32-bit Windows 7/8 VM

 * make whatever changes are necessary to the CMake configuration or other build scripts to produce this build on a 64-bit VM

 * thoroughly document the requirements for the build team to reproduce this build

Then he can assign this bug to Chris to carry out configuring our build jobs accordingly.
Comment by Trond Norbye [ 16/Jun/14 ]
Can you give me a 32-bit Windows installation I can test on? My MSDN license has expired and I don't have Windows media available (and the internal wiki page just has a limited set of licenses and no download links).

Then assign it back to me and I'll try it
Comment by Chris Hillery [ 16/Jun/14 ]
I think you can use 172.23.106.184 - it's a 32-bit Windows 2008 VM that we can't use for 3.0 builds anyway.
Comment by Trond Norbye [ 24/Jun/14 ]
I copied the full result of a build where I set target_platform=x86 on my 64 bit windows server (the "install" directory) over to a 32 bit windows machine and was able to start memcached and it worked as expected.

Our installers do other magic, like installing the service etc., that is needed in order to start the full server. Once we have such an installer I can do further testing.
Comment by Chris Hillery [ 24/Jun/14 ]
Bin - could you take a look at this (figuring out how to make InstallShield on a 64-bit machine create a 32-bit compatible installer)? I won't likely be able to get to it for at least a month, and I think you're the only person here who still has access to an InstallShield 2010 designer anyway.




[MB-12043] cbq crash after trying to delete a key Created: 21/Aug/14  Updated: 21/Aug/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
cbq> delete from my_bucket KEYS ['query-testa7480c4-0'];
PANIC: Expected plan.Operator instead of <nil>..




[MB-9632] diag / master events captured in log file Created: 22/Nov/13  Updated: 27/Aug/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Blocker
Reporter: Steve Yen Assignee: Ravi Mayuram
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
The information available in the diag / master events REST stream should be captured in a log (ALE?) file and hence available to cbcollect-info's and later analysis tools.

 Comments   
Comment by Aleksey Kondratenko [ 22/Nov/13 ]
It is already available in collectinfo
Comment by Dustin Sallings (Inactive) [ 26/Nov/13 ]
If it's only available in collectinfo, then it's not available at all. We lose most of the useful information if we don't run an http client to capture it continually throughout the entire course of a test.
Comment by Aleksey Kondratenko [ 26/Nov/13 ]
Feel free to submit a patch with exact behavior you need
Comment by Cihan Biyikoglu [ 27/Aug/14 ]
is this still relevant?




[MB-12091] [Windows].compact files not cleaned up after compaction Created: 28/Aug/14  Updated: 28/Aug/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Venu Uppalapati Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive cbinfo.zip    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Steps to reproduce:
1) Load 1M items on default so that compaction runs several times by the end of the loading.
2) It will be seen that all the temporary .compact files are kept even after compaction is done.
3) For example:
*.couch.1.compact.btree-tmp-1
*.couch.1.compact.btree-tmp-2
*.couch.1.compact.btree-tmp-3
*.couch.1.compact.btree-tmp-4
when the current rev number of the vbucket file is 5
4) This is Windows-specific. Pretty soon disk space is claimed by all these temp files.

 Comments   
Comment by Chiyoung Seo [ 28/Aug/14 ]
Sriram,

Sundar is busy working on the RC2-related issue. Can you please take a look at this Windows compaction issue?
Comment by Sriram Ganesan [ 28/Aug/14 ]
If there are any logs available from the test, please do upload them.
Comment by Venu Uppalapati [ 28/Aug/14 ]
Sundar mentioned that he used the unlink function, which is deprecated in the Windows version of the compiler (http://msdn.microsoft.com/en-us/library/ms235350.aspx). Will upload the logs shortly.
Comment by Venu Uppalapati [ 28/Aug/14 ]
cbcollectinfo attached




[MB-10214] Mac version update check is incorrectly identifying newest version Created: 14/Feb/14  Updated: 28/Aug/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0.1, 2.2.0, 2.1.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: David Haikney Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified
Environment: Mac OS X

Attachments: PNG File upgrade_check.png    
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-12051 Update the Release_Server job on Jenk... Technical task Open Chris Hillery  
Is this a Regression?: Yes

 Description   
Running 2.1.1 version of couchbase on a Mac, "check for latest version" reports the latest version is already running (e.g. see attached screenshot)


 Comments   
Comment by Aleksey Kondratenko [ 14/Feb/14 ]
Definitely not ui bug. It's using phone home to find out about upgrades. And I have no idea who owns that now.
Comment by Steve Yen [ 12/Jun/14 ]
got an email from ravi to look into this
Comment by Steve Yen [ 12/Jun/14 ]
Not sure if this is correct analysis, but I did a quick scan of what I think is the mac installer, which I think is...

  https://github.com/couchbase/couchdbx-app

It gets its version string by running a "git describe", in the Makefile here...

  https://github.com/couchbase/couchdbx-app/blob/master/Makefile#L1

Currently, a "git describe" on master branch returns...

  $ git describe
  2.1.1r-35-gf6646fa

...which is *kinda* close to the reported version string in the screenshot ("2.1.1-764-rel").

So, I'm thinking one fix needed would be a tagging (e.g., "git tag -a FOO -m FOO") of the couchdbx-app repository.

So, reassigning to Phil to do that appropriately.

Also, it looks like our mac installer is using an open-source packaging / installer / runtime library called "sparkle" (which might be a little under-maintained -- not sure).

  https://github.com/andymatuschak/Sparkle/wiki

The sparkle library seems to check for version updates by looking at the URL here...

  https://github.com/couchbase/couchdbx-app/blob/master/cb.plist.tmpl#L42

Which seems to either be...

  http://appcast.couchbase.com/membasex.xml

Or, perhaps...

  http://appcast.couchbase.com/couchbasex.xml

appcast.couchbase.com appears to actually be an S3 bucket in our production couchbase AWS account. So those *.xml files need to be updated, as they currently have content for older versions. For example, http://appcast.couchbase.com/couchbase.xml currently looks like...

    <rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle" version="2.0">
    <channel>
    <title>Updates for Couchbase Server</title>
    <link>http://appcast.couchbase.com/couchbase.xml</link>
    <description>Recent changes to Couchbase Server.</description>
    <language>en</language>
    <item>
    <title>Version 1.8.0</title>
    <sparkle:releaseNotesLink>
    http://www.couchbase.org/wiki/display/membase/Couchbase+Server+1.8.0
    </sparkle:releaseNotesLink>
    <!-- date -u +"%a, %d %b %Y %H:%M:%S GMT" -->
    <pubDate>Fri, 06 Jan 2012 16:11:17 GMT</pubDate>
    <enclosure url="http://packages.couchbase.com/1.8.0/Couchbase-Server-Community.dmg" sparkle:version="1.8.0" sparkle:dsaSignature="MCwCFAK8uknVT3WOjPw/3LkQpLBadi2EAhQxivxe2yj6EU6hBlg9YK/5WfPa5Q==" length="33085691" type="application/octet-stream"/>
    </item>
    </channel>
    </rss>

Not updating the xml files, though, probably causes no harm. Just that our osx users won't be pushed news on updates.
Comment by Phil Labee [ 12/Jun/14 ]
This has nothing to do with "git describe". There should be no place in the product where "git describe" is used to determine version info. See:

    http://hub.internal.couchbase.com/confluence/display/CR/Branching+and+Tagging

so there's definitely a bug in the Makefile.

The version update check seems to be out of date. The phone-home file is generated during:

    http://factory.hq.couchbase.com:8080/job/Product_Staging_Server/

but the process of uploading it is not automated.
Comment by Steve Yen [ 12/Jun/14 ]
Thanks for the links.

> This has nothing to do with "git describe".

My read of the Makefile makes me think, instead, that "git describe" is the default behavior unless it's overridden by the invoker of the make.

> There should be no place in the product that "git describe" should be used to determine version info. See:
> http://hub.internal.couchbase.com/confluence/display/CR/Branching+and+Tagging

It appears all this couchdbx-app / sparkle stuff predates that wiki page by a few years, so I guess it's inherited legacy.

Perhaps voltron / buildbot are not setting the PRODUCT_VERSION correctly before invoking the couchdbx-app make, which makes the Makefile default to 'git describe'?

    commit 85710d16b1c52497d9f12e424a22f3efaeed61e4
    Date: Mon Jun 4 14:38:58 2012 -0700

    Apply correct product version number
    
    Get version number from $PRODUCT_VERSION if it's set.
    (Buildbot and/or voltron will set this.)
    If not set, default to `git describe` as before.
    
> The version update check seems to be out of date.

Yes, that's right. The appcast files are out of date.

> The phone-home file is generated during:
> http://factory.hq.couchbase.com:8080/job/Product_Staging_Server/

I think appcast files for OSX / sparkle are a _different_ mechanism than the phone-home file, and an appcast XML file does not appear to be generated/updated by the Product_Staging_Server job.

But, I'm not an expert or really qualified on the details here -- this is just my opinions from a quick code scan, not from actually doing/knowing.
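Based on the Makefile behaviour quoted above, either of the following should produce the expected version string instead of the stale `git describe` output; this is only a sketch, and the version value shown is purely illustrative:

    # Option 1: have the invoker (buildbot/voltron) supply the version explicitly:
    PRODUCT_VERSION=2.1.1-764-rel make

    # Option 2: add an annotated tag so that a bare `git describe` resolves to it:
    git tag -a 2.1.1-764-rel -m "2.1.1-764-rel"
    git describe    # now reports 2.1.1-764-rel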

Comment by Wayne Siu [ 01/Aug/14 ]
Per PM (Anil), we should get this fixed by 3.0 RC1.
Raising the priority to Critical.
Comment by Wayne Siu [ 07/Aug/14 ]
Phil,
Please provide update.
Comment by Anil Kumar [ 12/Aug/14 ]
Triage - Upgrading to 3.0 Blocker

Comment by Wayne Siu [ 20/Aug/14 ]
Looks like we may have a short term "fix" for this ticket which Ceej and I have tested.
@Ceej, can you put in the details here?
Comment by Chris Hillery [ 20/Aug/14 ]
The file is hosted in S3, and we proved tonight that overwriting that file (membasex.xml) with a version containing updated version information and download URLs works as expected. We updated it to point to 2.2 for now, since that is the latest version with a freely-available download URL.

We can update the Release_Server job on Jenkins to create an updated version of this XML file from a template, and upload it to S3.

Assigning back to Wayne for a quick question: Do we support Enterprise edition for MacOS? If we do, then this solution won't be sufficient without more effort, because the two editions will need different Sparkle configurations for updates. Also, Enterprise edition won't be able to directly download the newer release, unless we provide a "hidden" URL for that (the download link on the website goes to a form).




Mac version update check is incorrectly identifying newest version (MB-10214)

[MB-12051] Update the Release_Server job on Jenkins to include updating the file (membasex.xml) and the download URL Created: 22/Aug/14  Updated: 22/Aug/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0.1, 2.2.0, 2.1.1, 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Technical task Priority: Blocker
Reporter: Wayne Siu Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We can update the Release_Server job on Jenkins to create an updated version of this XML file from a template, and upload it to S3.




[MB-11623] test for performance regressions with JSON detection Created: 02/Jul/14  Updated: 19/Aug/14

Status: In Progress
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Matt Ingenthron Assignee: Thomas Anderson
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: 0h
Time Spent: 120h
Original Estimate: Not Specified

Attachments: File JSONDoctPerfTest140728.rtf     File JSONPerfTestV3.uos    
Issue Links:
Relates to
relates to MB-11675 20-30% performance degradation on app... Closed

 Description   
Related to one of the changes in 3.0, we need to test what has been implemented to see if a performance regression or unexpected resource utilization has been introduced.

In 2.x, all JSON detection was handled at the time of persistence. Since persistence was done in batch and in background, with the then current document, it would limit the resource utilization of any JSON detection.

Starting in 3.x, with the datatype/HELLO changes introduced (and currently disabled), the JSON detection has moved to both memcached and ep-engine, depending on the type of mutation.

Just to paint the reason this is a concern, here's a possible scenario.

Imagine a cluster node that is happily accepting 100,000 sets/s for a given small JSON document, and it accounts for about 20mbit of the network (small enough to not notice). That node has a fast SSD at about 8k IOPS. That means that we'd only be doing JSON detection some 5000 times per second with Couchbase Server 2.x

With the changes already integrated, that JSON detection may be tried over 100k times/s. That's a 20x increase. The detection needs to occur somewhere other than on the persistence path, as the contract between DCP and view engine is such that the JSON detection needs to occur before DCP transfer.

This request is to test/assess if there is a performance change and/or any unexpected resource utilization when having fast mutating JSON documents.

I'll leave it to the team to decide what the right test is, but here's what I might suggest.

With a view defined create a test that has a small to moderate load at steady state and one fast-changing item. Test it with a set of sizes and different complexity. For instance, permutations that might be something like this:
non-JSON of 1k, 8k, 32k, 128k
simple JSON of 1k, 8k, 32k, 128k
complex JSON of 1k, 8k, 32k, 128k
metrics to gather:
throughput, CPU utilization by process, RSS by process, memory allocation requests by process (or minor faults or something)

Hopefully we won't see anything to be concerned with, but it is possible.

There are options to move JSON detection to somewhere later in processing (i.e., before DCP transfer) or other optimization thoughts if there is an issue.
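As a rough way to capture the per-process metrics listed above during a run (process name, fields and sampling interval are illustrative, not prescribed):

    # Sample CPU% and RSS of memcached once a second for the duration of the test.
    while sleep 1; do
        ps -o pcpu=,rss= -p "$(pgrep -d, memcached)" | \
            awk -v ts="$(date +%s)" '{print ts, "cpu%=" $1, "rss_kb=" $2}'
    done >> memcached_usage.log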

 Comments   
Comment by Cihan Biyikoglu [ 07/Jul/14 ]
this is no longer needed for 3.0 is that right? ready to postpone to 3.0.1?
Comment by Pavel Paulau [ 07/Jul/14 ]
HELLO-based negotiation was disabled but detection still happens in ep-engine.
We need to understand impact before 3.0 release. Sooner than later.
Comment by Matt Ingenthron [ 23/Jul/14 ]
I'm curious Thomas, when you say "increase in bytes appended", do you mean for the same workload the RSS is larger in the 'increase' case? Great to see you making progress.
Comment by Wayne Siu [ 24/Jul/14 ]
Pasted comment from Thomas:
Subject: Re: Couchbase Issues: (MB-11623) test for performance regressions with JSON detection
Yes, ~20% increase from 2.5.1 to 3.0 for the same load generator, as reported by the CB server for the same input load. I'm verifying and 'isolating'. Will also be looking at if/how this contributes to the replication load increase (20% on 20% increase …)
The issues seem related. Same increase for 1K, 8K, 16K and 32K, with some variance.
—thomas
Comment by Thomas Anderson [ 29/Jul/14 ]
initial results using JSON document load test.
Comment by Matt Ingenthron [ 29/Jul/14 ]
Tom: saw your notes in the work log, out of curiosity, what was deferred to 3.0.1? Also, from the comment above, 20% increase in what?
Comment by Anil Kumar [ 13/Aug/14 ]
Thomas - as discussed, please update the ticket with the % of regression caused by JSON detection now being in memcached. I will open a separate ticket to document it.
Comment by Thomas Anderson [ 19/Aug/14 ]
A comparison of non-JSON to JSON in 2.5.1 and 3.0.0.1105 showed statistically similar performance, i.e., the minimal overhead of handling a JSON document over a similar KV document stayed consistent from 2.5.1 to 3.0.0 pre-RC1. See the attached file JSONPerfTestV3.uos; to be re-run with the official RC1 candidate. The feature to load complex JSON documents is now modified to 4 levels of JSON complexity (for each document size in bytes): {simpleJSON:: 1 element-attribute value pair; smallJSON:: 10 elements - no array, no nesting; mediumJSON:: 100 elements - arrays & nesting; largeJSON:: 10000 elements, mix of element types}.

Note: the original seed of this issue was a detected performance issue with JSON documents, ~20-30%. The code/architectural change which caused this was deferred to 3.0.1. Additional modifications to the server to address simple append-mode performance degradation further lessened the question of whether the document type was the cause of the degradation. The tests did, however, show a positive change in compaction, i.e., 3.x compacts documents ~5-7% better than 2.5.1.

 
Comment by Thomas Anderson [ 19/Aug/14 ]
Re-run with build 1105; regression comparing the same document size and same document load for non-JSON vs simple-JSON.
2.5.1:: a 1024-byte document, 10 loaders, 1.25M documents for non-JSON vs JSON showed a < 4% performance degradation; 3.0:: shows a < 3% degradation. Many other factors seem to dominate.
Comment by Matt Ingenthron [ 19/Aug/14 ]
Just for the comments here, the original seed wasn't an observed performance regression but rather an architectural concern that there could be a space/CPU/throughput cost for the new JSON detection. That's why I opened it.




[MB-10440] something isn't right with tcmalloc in build 1074 on at least rhel6 causing memcached to crash Created: 11/Mar/14  Updated: 04/Aug/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aleksey Kondratenko Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Relates to
relates to MB-10371 tcmalloc must be compiled with -DTCMA... Resolved
relates to MB-10439 Upgrade:: 2.5.0-1059 to 2.5.1-1074 =>... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
SUBJ.

Just installing latest 2.5.1 build on rhel6 and creating bucket caused segmentation fault (see also MB-10439).

When replacing tcmalloc with a copy I've built it works.

Cannot be 100% sure it's tcmalloc, but the crash looks too easily reproducible to be something else.


 Comments   
Comment by Wayne Siu [ 12/Mar/14 ]
Phil,
Can you review if this change has been (copied from MB-10371) applied properly?

voltron (2.5.1) commit: 73125ad66996d34e94f0f1e5892391a633c34d3f

    http://review.couchbase.org/#/c/34344/

passes "CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW" to each gprertools configure command
Comment by Andrei Baranouski [ 12/Mar/14 ]
see the same issue on centos 64
Comment by Phil Labee [ 12/Mar/14 ]
need more info:

1. What package did you install?

2. How did you build the tcmalloc which fixes the problem?
 
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
build 1740. Rhel6 package.

You can see for yourself. It's easily reproducible, as Andrei also confirmed.

I've got 2.1 tar.gz from googlecode. And then did ./configure --prefix=/opt/couchbase --enable-minimal CPPFLAGS='-DTCMALLOC_SMALL_BUT_SLOW' and then make and make install. After that it works. Have no idea why.

Do you know the exact CFLAGS and CXXFLAGS that are used to build our tcmalloc? Those variables are likely set in voltron (or even from outside of voltron) and might affect optimization and therefore expose some bugs.

Comment by Aleksey Kondratenko [ 12/Mar/14 ]
And 64 bit.
Comment by Phil Labee [ 12/Mar/14 ]
We build out of:

    https://github.com/couchbase/gperftools

and for 2.5.1 use commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

compile using:

(cd /home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools \
&& ./autogen.sh \
        && ./configure --prefix=/opt/couchbase CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW --enable-minimal \
        && make \
        && make install-exec-am install-data-am)
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
That part I know. What I don't know is what cflags are being used.
Comment by Phil Labee [ 13/Mar/14 ]
from the 2.5.1 centos-6-x86 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x86-251-builder/builds/18/steps/couchbase-server%20make%20enterprise%20/logs/stdio

make[1]: Entering directory `/home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools'

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Phil Labee [ 13/Mar/14 ]
from a 2.5.1 centos-6-x64 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x64-251-builder/builds/16/steps/couchbase-server%20make%20enterprise%20/logs/stdio

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
OK. I'll try to exclude -O3 as a possible reason for the failure later today (in which case it might be an upstream bug). In the meantime I suggest you try lowering the optimization to -O2, unless you have other ideas of course.
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Building tcmalloc with the exact same cflags (-O3) doesn't cause any crashes. At this time my guess is either a compiler bug or cosmic radiation hitting just this specific build.

Can we simply force a rebuild?
Comment by Phil Labee [ 13/Mar/14 ]
test with newer build 2.5.1-1075:

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_2.5.1-1075-rel.rpm

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_64_2.5.1-1075-rel.rpm
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Didn't help, unfortunately. Is that still with -O3?
Comment by Phil Labee [ 14/Mar/14 ]
Still using -O3. There are extensive comments in the voltron Makefile warning against changing to -O2.
Comment by Phil Labee [ 14/Mar/14 ]
Did you try to build gperftools out of our repo?
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
The following is not true:

Got myself CentOS 6.4. And with its gcc and -O3 I'm finally able to reproduce the issue.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
So I've got myself CentOS 6.4 and the _exact same compiler version_. And when I build tcmalloc myself with all the right flags and replace the tcmalloc from the package, it works. Without replacing it, it crashes.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Phil, please clean ccache, reboot the builder host (to clean the page cache) and _then_ do another rebuild. Looking at the build logs it looks like ccache is being used, so my suspicion about RAM corruption is not fully excluded yet. And I don't have many other ideas.
Comment by Phil Labee [ 14/Mar/14 ]
cleared ccache and restarted centos-6-x86-builder, centos-6-x64-builder

started build 2.5.1-1076
Comment by Pavel Paulau [ 14/Mar/14 ]
2.5.1-1076 seems to be working; it warns about "SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER" as well.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Maybe I'm doing something wrong, but it fails in the exact same way on my VM.
Comment by Pavel Paulau [ 14/Mar/14 ]
Sorry, it crashed eventually.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Confirmed again. Everything is exactly the same as before. Build 1076 on CentOS 6.4 amd64 crashes very easily, both Enterprise edition and Community. And it doesn't crash if I replace tcmalloc with the one I've built from the exact same source, exact same flags and exact same compiler version.

Build 1071 doesn't crash. All of this 100% consistently.
Comment by Phil Labee [ 17/Mar/14 ]
Possibly a difference in the build environment.

The reference environment is described in the voltron README.md file.

for centos-6 X64 (6.4 final) we use the defaults for these tools:


gcc-4.4.7-3.el6 ( 4.4.7-4 available)
gcc-c++-4.4.7-3 ( 4.4.7-4 available)
kernel-devel-2.6.32-358 ( 2.6.32-431.5.1 available)
openssl-devel-1.0.0-27.el6_4.2 ( 1.0.1e-16.el6_5.4 available)
rpm-build-4.8.0-32 ( 4.8.0-37 available)

these tools do not have an update:

scons-2.0.1-1
libtool-2.2.6-15.5

For all centos these specific versions are installed:

gcc, g++ 4.4, currently 4.4.7-3, 4.4.7-4 available
autoconf 2.65, currently 2.63-5 (no update available)
automake 1.11.1
libtool 2.4.2
Comment by Phil Labee [ 17/Mar/14 ]
downloaded gperftools-2.1.tar.gz from

    http://gperftools.googlecode.com/files/gperftools-2.1.tar.gz

and expanded into directory: gperftools-2.1

cloned https://github.com/couchbase/gperftools.git at commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

into directory gperftools, and compared:

=> diff -r gperftools-2.1 gperftools
Only in gperftools: .git
Only in gperftools: autogen.sh
Only in gperftools/doc: pprof.see_also
Only in gperftools/src/windows: TODO
Only in gperftools/src/windows: google

Only in gperftools-2.1: Makefile.in
Only in gperftools-2.1: aclocal.m4
Only in gperftools-2.1: compile
Only in gperftools-2.1: config.guess
Only in gperftools-2.1: config.sub
Only in gperftools-2.1: configure
Only in gperftools-2.1: depcomp
Only in gperftools-2.1: install-sh
Only in gperftools-2.1: libtool
Only in gperftools-2.1: ltmain.sh
Only in gperftools-2.1/m4: libtool.m4
Only in gperftools-2.1/m4: ltoptions.m4
Only in gperftools-2.1/m4: ltsugar.m4
Only in gperftools-2.1/m4: ltversion.m4
Only in gperftools-2.1/m4: lt~obsolete.m4
Only in gperftools-2.1: missing
Only in gperftools-2.1/src: config.h.in
Only in gperftools-2.1: test-driver
Comment by Phil Labee [ 17/Mar/14 ]
Since the build files in your source are different from those in the production build, we can't really say we're using the same source.

Please build from our repo and re-try your test.
Comment by Aleksey Kondratenko [ 17/Mar/14 ]
The difference is in the autotools products. I _cannot_ build using the same autotools that are present on the build machine unless I'm given access to that box.
Comment by Aleksey Kondratenko [ 17/Mar/14 ]
The _source_ is exactly the same.
Comment by Phil Labee [ 17/Mar/14 ]
I've given the versions of autotools to use, so you can bring your build environment in line with the production builds.

As a shortcut, I've submitted a request for a clone of the builder VM that you can experiment with.

See CBIT-1053
Comment by Wayne Siu [ 17/Mar/14 ]
The cloned builder is available. Info in CBIT-1053.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built tcmalloc from the exact copy in the builder directory.

Installed the package from inside the builder directory (build 1077). Verified that the problem exists. Stopped the service. Replaced tcmalloc. Observed that everything is fine.

Something in the environment is causing this. Maybe unusual ldflags or something else. But _not_ the source.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built the full rpm package under the buildbot user, with the exact same make invocation as I see in the buildbot logs. And the resultant package works. Weird indeed.
Comment by Phil Labee [ 18/Mar/14 ]
Some differences between the test build and the production build:


1) In gperftools, production calls "make install-exec-am install-data-am" while the test calls "make install", which executes the extra step "all-am"

2) In ep-engine, production uses "make install" while the test uses "make"

3) The test was built as user "root" while production builds as user "buildbot", so PATH and other env. vars may be different.

In general it's hard to tell what steps were performed for the test build, as no output logfiles have been captured.
Comment by Wayne Siu [ 21/Mar/14 ]
Updated from Phil:
comment:
________________________________________

2.5.1-1082 was done without the tcmalloc flag: CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW

    http://review.couchbase.org/#/c/34755/


2.5.1-1083 was done with build step timeout increased from 60 minutes to 90

2.5.1-1084 was done with the tcmalloc flag restored:

    http://review.couchbase.org/#/c/34792/
Comment by Andrei Baranouski [ 23/Mar/14 ]
 2.5.1-1082 MB-10545 Vbucket map is not ready after 60 seconds
Comment by Meenakshi Goel [ 24/Mar/14 ]
A memcached crash with a segmentation fault is observed with build 2.5.1-1084-rel on Ubuntu 12.04 during the Auto Compaction tests.

Jenkins Link:
http://qa.sc.couchbase.com/view/2.5.1%20centos/job/centos_x64--00_02--compaction_tests-P0/56/consoleFull

root@jackfruit-s12206:/tmp# gdb /opt/couchbase/bin/memcached core.memcached.8276
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /opt/couchbase/bin/memcached...done.
[New LWP 8301]
[New LWP 8302]
[New LWP 8599]
[New LWP 8303]
[New LWP 8604]
[New LWP 8299]
[New LWP 8601]
[New LWP 8600]
[New LWP 8602]
[New LWP 8287]
[New LWP 8285]
[New LWP 8300]
[New LWP 8276]
[New LWP 8516]
[New LWP 8603]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
Program terminated with signal 11, Segmentation fault.
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
298 src/central_freelist.cc: No such file or directory.
(gdb) t a a bt

Thread 15 (Thread 0x7f3568039700 (LWP 8603)):
#0 0x00007f356f01b9fa in __lll_unlock_wake () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f018104 in _L_unlock_644 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f018063 in pthread_mutex_unlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c663d6 in Mutex::release (this=0x5f68250) at src/mutex.cc:94
#4 0x00007f3569c9691f in unlock (this=<optimized out>) at src/locks.hh:58
#5 ~LockHolder (this=<optimized out>, __in_chrg=<optimized out>) at src/locks.hh:41
#6 fireStateChange (to=<optimized out>, from=<optimized out>, this=<optimized out>) at src/warmup.cc:707
#7 transition (force=<optimized out>, to=<optimized out>, this=<optimized out>) at src/warmup.cc:685
#8 Warmup::initialize (this=<optimized out>) at src/warmup.cc:413
#9 0x00007f3569c97f75 in Warmup::step (this=0x5f68258, d=..., t=...) at src/warmup.cc:651
#10 0x00007f3569c2644a in Dispatcher::run (this=0x5e7f180) at src/dispatcher.cc:184
#11 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5f68258) at src/dispatcher.cc:28
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 14 (Thread 0x7f356a705700 (LWP 8516)):
#0 0x00007f356ed0d83d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ed3b774 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f3569c65445 in updateStatsThread (arg=<optimized out>) at src/memory_tracker.cc:31
#3 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 13 (Thread 0x7f35703e8740 (LWP 8276)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e000, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e000, flags=<optimized out>) at event.c:1558
#3 0x000000000040c9e6 in main (argc=<optimized out>, argv=<optimized out>) at daemon/memcached.c:7996

Thread 12 (Thread 0x7f356c709700 (LWP 8300)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e280, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e280, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16814f8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 11 (Thread 0x7f356e534700 (LWP 8285)):
#0 0x00007f356ed348bd in read () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ecc8ff8 in _IO_file_underflow () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f356ecca03e in _IO_default_uflow () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f356ecbe18a in _IO_getline_info () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f356ecbd06b in fgets () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f356e535b19 in fgets (__stream=<optimized out>, __n=<optimized out>, __s=<optimized out>) at /usr/include/bits/stdio2.h:255
#6 check_stdin_thread (arg=<optimized out>) at extensions/daemon/stdin_check.c:37
#7 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 10 (Thread 0x7f356d918700 (LWP 8287)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
---Type <return> to continue, or q <return> to quit---

#1 0x00007f356db32176 in logger_thead_main (arg=<optimized out>) at extensions/loggers/file_logger.c:368
#2 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()

Thread 9 (Thread 0x7f3567037700 (LWP 8602)):
#0 SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:32
#1 0x00007f3569c6351c in lock (this=<optimized out>) at src/atomic.hh:282
#2 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#3 gimme (this=<optimized out>) at src/atomic.hh:396
#4 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#5 KVShard::getBucket (this=0x7a6e7c0, id=256) at src/kvshard.cc:58
#6 0x00007f3569c9231d in VBucketMap::getBucket (this=0x614a448, id=256) at src/vbucketmap.cc:40
#7 0x00007f3569c314ef in EventuallyPersistentStore::getVBucket (this=<optimized out>, vbid=256, wanted_state=<optimized out>) at src/ep.cc:475
#8 0x00007f3569c315f6 in EventuallyPersistentStore::firePendingVBucketOps (this=0x614a400) at src/ep.cc:488
#9 0x00007f3569c41bb1 in EventuallyPersistentEngine::notifyPendingConnections (this=0x5eb8a00) at src/ep_engine.cc:3474
#10 0x00007f3569c41d63 in EvpNotifyPendingConns (arg=0x5eb8a00) at src/ep_engine.cc:1182
#11 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f3565834700 (LWP 8600)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7e1c0) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7e204) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 7 (Thread 0x7f3566035700 (LWP 8601)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7fa40) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7fa84) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f356cf0a700 (LWP 8299)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e500, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e500, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x1681400) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f3567838700 (LWP 8604)):
#0 0x00007f356f01b89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f017065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f016eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c6635a in Mutex::acquire (this=0x5e7f890) at src/mutex.cc:79
#4 0x00007f3569c261f8 in lock (this=<optimized out>) at src/locks.hh:48
#5 LockHolder (m=..., this=<optimized out>) at src/locks.hh:26
---Type <return> to continue, or q <return> to quit---
#6 Dispatcher::run (this=0x5e7f880) at src/dispatcher.cc:138
#7 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5e7f898) at src/dispatcher.cc:28
#8 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f356af06700 (LWP 8303)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e780, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e780, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16817e0) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f3565033700 (LWP 8599)):
#0 0x00007f356ed18267 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f3569c13997 in SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:35
#2 0x00007f3569c63e57 in lock (this=<optimized out>) at src/atomic.hh:282
#3 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#4 gimme (this=<optimized out>) at src/atomic.hh:396
#5 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#6 KVShard::getVBucketsSortedByState (this=0x7a6e7c0) at src/kvshard.cc:75
#7 0x00007f3569c5d494 in Flusher::getNextVb (this=0x168d040) at src/flusher.cc:232
#8 0x00007f3569c5da0d in doFlush (this=<optimized out>) at src/flusher.cc:211
#9 Flusher::step (this=0x5ff7010, tid=21) at src/flusher.cc:152
#10 0x00007f3569c69034 in ExecutorThread::run (this=0x5e7e8c0) at src/scheduler.cc:159
#11 0x00007f3569c6963d in launch_executor_thread (arg=0x5ff7010) at src/scheduler.cc:36
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f356b707700 (LWP 8302)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8ea00, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8ea00, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16816e8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f356bf08700 (LWP 8301)):
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
#1 0x00007f356f23ef19 in tcmalloc::CentralFreeList::FetchFromSpansSafe (this=0x7f356f45d780) at src/central_freelist.cc:283
#2 0x00007f356f23efb7 in tcmalloc::CentralFreeList::RemoveRange (this=0x7f356f45d780, start=0x7f356bf07268, end=0x7f356bf07260, N=4) at src/central_freelist.cc:263
#3 0x00007f356f2430b5 in tcmalloc::ThreadCache::FetchFromCentralCache (this=0xf5d298, cl=9, byte_size=128) at src/thread_cache.cc:160
#4 0x00007f356f239fa3 in Allocate (this=<optimized out>, cl=<optimized out>, size=<optimized out>) at src/thread_cache.h:364
#5 do_malloc_small (size=128, heap=<optimized out>) at src/tcmalloc.cc:1088
#6 do_malloc_no_errno (size=<optimized out>) at src/tcmalloc.cc:1095
#7 (anonymous namespace)::cpp_alloc (size=128, nothrow=<optimized out>) at src/tcmalloc.cc:1423
#8 0x00007f356f249538 in tc_new (size=139867476842368) at src/tcmalloc.cc:1601
#9 0x00007f3569c2523e in Dispatcher::schedule (this=0x5e7f880,
    callback=<error reading variable: DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>, outtid=0x6127930, priority=...,
    sleeptime=<optimized out>, isDaemon=true, mustComplete=false) at src/dispatcher.cc:243
#10 0x00007f3569c84c1a in TapConnNotifier::start (this=0x6127920) at src/tapconnmap.cc:66
---Type <return> to continue, or q <return> to quit---
#11 0x00007f3569c42362 in EventuallyPersistentEngine::initialize (this=0x5eb8a00, config=<optimized out>) at src/ep_engine.cc:1415
#12 0x00007f3569c42616 in EvpInitialize (handle=0x5eb8a00,
    config_str=0x7f356bf07993 "ht_size=3079;ht_locks=5;tap_noop_interval=20;max_txn_size=10000;max_size=1491075072;tap_keepalive=300;dbname=/opt/couchbase/var/lib/couchbase/data/default;allow_data_loss_during_shutdown=true;backend="...) at src/ep_engine.cc:126
#13 0x00007f356cf0f86a in create_bucket_UNLOCKED (e=<optimized out>, bucket_name=0x7f356bf07b80 "default", path=0x7f356bf07970 "/opt/couchbase/lib/memcached/ep.so", config=<optimized out>,
    e_out=<optimized out>, msg=0x7f356bf07560 "", msglen=1024) at bucket_engine.c:711
#14 0x00007f356cf0faac in handle_create_bucket (handle=<optimized out>, cookie=0x5e4bc80, request=<optimized out>, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2168
#15 0x00007f356cf10229 in bucket_unknown_command (handle=0x7f356d1171c0, cookie=0x5e4bc80, request=0x5e44000, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2478
#16 0x0000000000412c35 in process_bin_unknown_packet (c=<optimized out>) at daemon/memcached.c:2911
#17 process_bin_packet (c=<optimized out>) at daemon/memcached.c:3238
#18 complete_nread_binary (c=<optimized out>) at daemon/memcached.c:3805
#19 complete_nread (c=<optimized out>) at daemon/memcached.c:3887
#20 conn_nread (c=0x5e4bc80) at daemon/memcached.c:5744
#21 0x0000000000406e45 in event_handler (fd=<optimized out>, which=<optimized out>, arg=0x5e4bc80) at daemon/memcached.c:6012
#22 0x00007f356fd9948c in event_process_active_single_queue (activeq=<optimized out>, base=<optimized out>) at event.c:1308
#23 event_process_active (base=<optimized out>) at event.c:1375
#24 event_base_loop (base=0x5e8ec80, flags=<optimized out>) at event.c:1572
#25 0x0000000000415584 in worker_libevent (arg=0x16815f0) at daemon/thread.c:301
#26 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#27 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#28 0x0000000000000000 in ?? ()
(gdb)
Comment by Aleksey Kondratenko [ 25/Mar/14 ]
Yesterday I took that consistently failing Ubuntu build and played with it on my box.

It is exactly the same situation. Replacing libtcmalloc.so makes it work.

So I've spent the afternoon running what's in our actual package under a debugger.

I found several pieces of evidence that some object files linked into the libtcmalloc.so that we ship were built with -DTCMALLOC_SMALL_BUT_SLOW and some _were_ not.

That explains the weird crashes.

I'm unable to explain how it's possible that our builders produced such .so files. Yet.

Gut feeling is that it might be:

* something caused by ccache

* perhaps not full cleanup between builds

In order to verify that, I'm asking for the following (a rough sketch follows this list):

* do a build with ccache completely disabled but with the define

* do git clean -xfd inside the gperftools checkout before doing the build
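
A minimal sketch of that verification (the paths, the CCACHE_DISABLE approach and the strings check are my assumptions, not the builder's actual steps):

    cd <builder>/build/gperftools        # the gperftools checkout the builder uses
    git clean -xfd                       # drop all generated and cached files
    CCACHE_DISABLE=1 ./autogen.sh
    CCACHE_DISABLE=1 ./configure --prefix=/opt/couchbase --enable-minimal \
        CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW
    CCACHE_DISABLE=1 make
    # The "SMALL MEMORY MODEL" warning text is only compiled in under the define,
    # so grepping the resulting library gives a rough hint whether the main
    # translation unit actually saw -DTCMALLOC_SMALL_BUT_SLOW:
    strings .libs/libtcmalloc_minimal.so* | grep -i "SMALL MEMORY MODEL"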

Comment by Phil Labee [ 29/Jul/14 ]
The failure was detected by

    http://qa.sc.couchbase.com/job/centos_x64--00_02--compaction_tests-P0/

Can I run this test on a 3.0.0 build to see if this bug still exists?
Comment by Meenakshi Goel [ 30/Jul/14 ]
Started a run with latest 3.0.0 build 1057.
http://qa.hq.northscale.net/job/centos_x64--44_01--auto_compaction_tests-P0/37/console

However, I haven't seen such crashes with compaction tests during 3.0.0 testing.
Comment by Meenakshi Goel [ 30/Jul/14 ]
Tests passed with 3.0.0-1057-rel.
Comment by Wayne Siu [ 31/Jul/14 ]
Pavel also helped verify that this is not an issue in 3.0 (3.0.0-1067).
Comment by Wayne Siu [ 31/Jul/14 ]
Reopening for 2.5.x.
Comment by Aleksey Kondratenko [ 01/Aug/14 ]
We still have _exactly_ the same problem as in 2.5.1. Enabling -DSMALL_BUT_SLOW causes mis-compilation. And this is _not_ an upstream bug.
Comment by Aleksey Kondratenko [ 01/Aug/14 ]
Looking at build log here: http://builds.hq.northscale.net:8010/builders/centos-6-x64-300-builder/builds/1111/steps/couchbase-server%20make%20enterprise%20/logs/stdio

I see that just a few files were rebuilt with the new define. And the previous build did not have CPPFLAGS set to -DSMALL_BUT_SLOW.

So at least in this case I'm adamant that the builder did not rebuild tcmalloc when it should have.
Comment by Phil Labee [ 01/Aug/14 ]
Check to see if ccache is causing the failure to rebuild components under changed configure settings.
Comment by Aleksey Kondratenko [ 01/Aug/14 ]
Build logs indicate that it is unrelated to ccache, i.e. lots of files are not getting built at all.
Comment by Chris Hillery [ 01/Aug/14 ]
The quick-and-dirty solution would be to delete the buildslave directories for the next build. I would do that now, but there is a build ongoing; possibly Phil has already taken care of it. If this build (1091) doesn't work, then we'll clean the world and try again.
Comment by Chris Hillery [ 01/Aug/14 ]
build 1091 is wrapping up, and visually it doesn't appear that gperftools got recompiled. I am waiting for each builder to finish and deleting the buildslave directories. When that is done I'll start a new build.
Comment by Chris Hillery [ 01/Aug/14 ]
build 1092 should start shortly.
Comment by Chris Hillery [ 01/Aug/14 ]
build 1092 is wrapping up (bits are already on latestbuilds I believe). Please test.




[MB-9917] DOC - memcached should dynamically adjust the number of worker threads Created: 14/Jan/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Trond Norbye Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
4 threads is probably not ideal for a 24 core system ;)

 Comments   
Comment by Anil Kumar [ 25/Mar/14 ]
Trond - Can you explain whether this is a new feature in 3.0 or a fix to the documentation for older releases?
Comment by Ruth Harris [ 17/Jul/14 ]
Trond, Could you provide more information here and then reassign to me? --ruth
Comment by Trond Norbye [ 24/Jul/14 ]
New in 3.0 is that memcached no longer defaults to 4 threads for the frontend, but uses 75% of the number of cores reported by the system (with a minimum of 4).

There are 3 ways to tune this (see the sketch after this list):

* Export MEMCACHED_NUM_CPUS=<number of threads you want> before starting Couchbase Server

* Use the -t <number> command line argument (this will go away in the future)

* Specify it in the configuration file read during startup (but when started from the full server this file is regenerated every time, so you'll lose the modifications)
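
A minimal sketch of the first two options (the thread count and service command are examples only, not recommendations):

    # option 1: environment variable, set before Couchbase Server starts
    export MEMCACHED_NUM_CPUS=8
    /etc/init.d/couchbase-server restart

    # option 2: command line flag when running memcached directly
    # (the -t flag will go away in the future)
    /opt/couchbase/bin/memcached -t 8 <other arguments>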




[MB-12052] add stale=false semantic changes to release notes Created: 22/Aug/14  Updated: 28/Aug/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Matt Ingenthron Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Need to release note the stale=false semantic changes.
This doesn't seem to be in current release notes, though MB-11589 seems loosely related.

Please use/adapt from the following text:
Starting with the 3.0 release, the "stale" view query argument "false" has been enhanced so it will consider all document changes which have been received at the time the query has been received. This means that use of the `durability requirements` or `observe` feature to block for persistence in application code before issuing the `false` stale query is no longer needed. It is recommended that you remove all such application level checks after completing the upgrade to the 3.0 release.
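
For illustration (the bucket and design document names below are made up), an application can now simply issue a view query such as:

    curl 'http://localhost:8092/default/_design/beer/_view/brewery_beers?stale=false'

and it will see every mutation the server had received at the time the query arrived, without a prior observe/durability-requirements check.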

- - -

Ruth: assigning this to you to work out the right way to work the text into the release notes. This probably goes with a change in a different MB.




[MB-12090] add stale=false semantic changes to dev guide Created: 28/Aug/14  Updated: 28/Aug/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Matt Ingenthron Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
Need to change the dev guide to explain the semantics change with the stale parameter.

 Comments   
Comment by Matt Ingenthron [ 28/Aug/14 ]
I could not find the 3.0 dev guide to write something up, so I've generated a diff based on the 2.5 dev guide. Note that much of that dev guide refers to the 3.0 admin guide section on views; I could not find that in the "dita" directory, so I could not contribute a change to the XML directly. I think this, together with what I put in MB-12052, should help.


diff --git a/content/couchbase-devguide-2.5/finding-data-with-views.markdown b/content/couchbase-devguide-2.5/finding-data-with-views.markdown
index 77735b9..811dff0 100644
--- a/content/couchbase-devguide-2.5/finding-data-with-views.markdown
+++ b/content/couchbase-devguide-2.5/finding-data-with-views.markdown
@@ -1,6 +1,6 @@
 # Finding Data with Views
 
-In Couchbase 2.1.0 you can index and query JSON documents using *views*. Views
+In Couchbase you can index and query JSON documents using *views*. Views
 are functions written in JavaScript that can serve several purposes in your
 application. You can use them to:
 
@@ -323,16 +323,25 @@ Forinformation about the sort order of indexes, see the
 [Couchbase Server Manual](http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/).
 
 The real-time nature of Couchbase Server means that an index can become outdated
-fairly quickly when new entries and updates occur. Couchbase Server generates
-the index when it is queried, but in the meantime more data can be added to the
-server and this information will not yet be part of the index. To resolve this,
-Couchbase SDKs and the REST API provide a `stale` parameter you use when you
-query a view. With this parameter you can indicate you will accept the most
-current index as it is, you want to trigger a refresh of the index and retrieve
-these results, or you want to retrieve the existing index as is but also trigger
-a refresh of the index. For instance, to query a view with the stale parameter
-using the Ruby SDK:
+fairly quickly when new entries and updates occur. Couchbase Server updates
+the index at the time the query is received if you supply the argument
+`false` to the `stale` parameter.
+
+<div class="notebox">
+<p>Note</p>
+<p>Starting with the 3.0 release, the "stale" view query argument
+"false" has been enhanced so it will consider all document changes
+which have been received at the time the query has been received. This
+means that use of the `durability requirements` or `observe` feature
+to block for persistence in application code before issuing the
+`false` stale query is no longer needed. It is recommended that you
+remove all such application level checks after completing the upgrade
+to the 3.0 release.
+</p>
+</div>
 
+For instance, to query a view with the stale parameter
+using the Ruby SDK:
 
 ```
 doc.recent_posts(:body => {:stale => :ok})
@@ -905,13 +914,14 @@ for(ViewRow row : result) {
 }
 ```
 
-Before we create a Couchbase client instance and connect to the server, we set a
-system property 'viewmode' to 'development' to put the view into production
-mode. Then we query our view and limit the number of documents returned to 20
-items. Finally when we query our view we set the `stale` parameter to FALSE to
-indicate we want to reindex and include any new or updated beers in Couchbase.
-For more information about the `stale` parameter and index updates, see Index
-Updates and the Stale Parameter in the
+Before we create a Couchbase client instance and connect to the
+server, we set a system property 'viewmode' to 'development' to put
+the view into production mode. Then we query our view and limit the
+number of documents returned to 20 items. Finally when we query our
+view we set the `stale` parameter to FALSE to indicate we want to
+consider any recent changes to documents. For more information about
+the `stale` parameter and index updates, see Index Updates and the
+Stale Parameter in the
 [Couchbase Server Manual](http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#couchbase-views-writing-stale).
 
 The last part of this code sample is a loop we use to iterate through each item




[MB-4593] Windows Installer hangs on "Computing Space Requirements" Created: 27/Dec/11  Updated: 29/Aug/14

Status: Reopened
Project: Couchbase Server
Component/s: installer
Affects Version/s: 2.0-developer-preview-3, 2.0-developer-preview-4
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Unresolved Votes: 3
Labels: windows, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7 Ultimate 64. Sony Vaio, i3 with 4GB RAM and 200 GB of 500 GB free. Also on a Sony Vaio, Windows 7 Ultimate 64, i7, 6 GB RAM and a 750GB drive with about 600 GB free.

Attachments: PNG File couchbase-installer.png     PNG File image001.png     PNG File ss 2014-08-28 at 4.16.09 PM.png    
Triage: Triaged

 Description   
When installing the Community Server 2.0 DP3 on Windows, the installer hangs on the "Computing space requirements" screen. There is no additional feedback from the installer. After 90-120 minutes or so, it does move forward and complete. The same issue was reported on Google Groups a few months back - http://groups.google.com/group/couchbase/browse_thread/thread/37dbba592a9c150b/f5e6d80880f7afc8?lnk=gst&q=msi.

Executable: couchbase-server-community_x86_64_2.0.0-dev-preview-3.setup.exe

WORKAROUND IN 3.0 - Create a registry key HKLM\SOFTWARE\Couchbase, name=SkipVcRuntime, type=DWORD, value=1 to skip the VC redistributable installation which is causing this issue. If the VC redistributable is necessary, it must be installed manually when the registry key is set to skip its automatic install.


 Comments   
Comment by Filip Stas [ 23/Feb/12 ]
Is there any solution for this? I'm experiencing the same problem. Running the unpacked msi does not seem to work because the InstallShield setup has been configured to require installation through the exe.

Comment by Farshid Ghods (Inactive) [ 22/Mar/12 ]
from Bin:

Looks like it is related to the InstallShield engine. Maybe InstallShield tries to access the system registry and it is locked by another process. The suggestion is to shut down other running programs and try again if such a problem pops up.
Comment by Farshid Ghods (Inactive) [ 22/Mar/12 ]
We were unable to reproduce this on Windows 2008 64-bit.

The bug mentions this happened on Windows 7 64-bit, which is not a supported platform, but that should not make any difference.
Comment by Farshid Ghods (Inactive) [ 23/Mar/12 ]
From Bin:

Windows 7 is my dev environment, and I have no problem installing and testing it. From your description, I cannot tell whether it failed during the installation, or whether installation finishes but Couchbase Server cannot start.
 
If it is due to an InstallShield failure, you can generate the log file for debugging as:
setup.exe /debuglog"C:\PathToLog\setupexe.log"
 
If Couchbase Server fails to start, the most likely reason is a missing or incompatible Microsoft runtime library. You can manually run service_start.bat under the bin directory and check what is going on. And you can run cbbrowse_log.bat to generate a log file for further debugging.
Comment by John Zablocki (Inactive) [ 23/Mar/12 ]
This is an installation only problem. There's not much more to it other than the installer hangs on the screen (see attachment).

However, after a failed install, I did get it to work by:

a) deleting C:\Program Files\Couchbase\*

b) deleting all registry keys with Couchbase Server left over from the failed install

c) rebooting

Next time I see this problem, I'll run it again with the /debuglog

I think the problem might be that a previous install of DP3 or DP4 (nightly build) failed and left some bits in place somewhere.
Comment by Steve Yen [ 05/Apr/12 ]
from Perry...
Comment by Thuan Nguyen [ 05/Apr/12 ]
I cannot repro this bug. I tested on Windows 7 Professional 64 bit and Windows Server 2008 64 bit.
Here are steps:
- Install couchbase server 2.0.0r-388 (dp3)
- Open web browser and go to initial setup in web console.
- Uninstall couchbase server 2.0.0r-388
- Install couchbase server 2.0.0dp4r-722
- Open web browser and go to initial setup in web console.
Install and uninstall of Couchbase Server went smoothly without any problem.
Comment by Bin Cui [ 25/Apr/12 ]
Maybe we need to get the installer verbose log file to get some clues.

setup.exe /verbose"c:\temp\logfile.txt"
Comment by John Zablocki (Inactive) [ 06/Jul/12 ]
Not sure if this is useful or not, but without fail, every time I encounter this problem, simply shutting down apps (usually Chrome for some reason) causes the hanging to stop. Right after closing Chrome, the C++ redistributable dialog pops open and installation completes.
Comment by Matt Ingenthron [ 10/Jul/12 ]
Workarounds/troubleshooting for this issue:


On installshield's website, there are similar problems reported for installshield. There are several possible reasons behind it:

1. The installation of the Microsoft C++ redistributable is blocked by some other running program, sometimes Chrome.
2. There are some remote network drives that are mapped to local system. Installshield may not have enough network privileges to access them.
3. Couchbase server was installed on the machine before and it was not totally uninstalled and/or removed. Installshield tried to recover from those old images.

To determine where to go next, run setup with debugging mode enabled:
setup.exe /debuglog"C:\temp\setupexe.log"

The contents of the log will tell you where it's getting stuck.
Comment by Bin Cui [ 30/Jul/12 ]
Matt's explanation should be included in the documentation and on the Q&A website. I reproduced the hanging problem during installation when the Chrome browser is running.
Comment by Farshid Ghods (Inactive) [ 30/Jul/12 ]
So does that mean the installer should wait until Chrome and other browsers are terminated before proceeding?

I see this as a very common use case with many installers: they ask the user to stop those applications, and if the user does not follow the instructions the setup process does not continue until these conditions are met.
Comment by Dipti Borkar [ 31/Jul/12 ]
Is there no way to fix this? At the least we need to provide an error or guidance that Chrome needs to be quit before continuing. Is Chrome the only one we have seen causing this problem?
Comment by Steve Yen [ 13/Sep/12 ]
http://review.couchbase.org/#/c/20552/
Comment by Steve Yen [ 13/Sep/12 ]
See CBD-593
Comment by Øyvind Størkersen [ 17/Dec/12 ]
Same bug when installing 2.0.0 (build-1976) on Windows 7. Stopping Chrome did not help, but killing the process "Logitech ScrollApp" (KhalScroll.exe) did.
Comment by Joseph Lam [ 13/Sep/13 ]
It's happening to me when installing 2.1.1 on Windows 7. What is this step for, and is it really necessary? I see that it happens after the files have been copied to the installation folder. Not entirely sure what it's computing space requirements for.
Comment by MikeOliverAZ [ 16/Nov/13 ]
Same problem on 2.2.0 x86_64. I have tried everything, closing down Chrome and Torch from Task Manager to ensure no other apps are competing. Tried removing registry entries, but there are so many; my time, please. As noted above this doesn't seem to be preventing writing the files under Program Files, so what's it doing? So I cannot install; it now complains it cannot upgrade, and I run the installer again.

BS....giving up and going to MongoDB....it installs no sweat.

Comment by Sriram Melkote [ 18/Nov/13 ]
Reopening. Testing on VMs is a problem because they are all clones. We miss many problems like these.
Comment by Sriram Melkote [ 18/Nov/13 ]
Please don't close this bug until we have clear understanding of:

(a) What is the Runtime Library that we're trying to install that conflicts with all these other apps
(b) Why we need it
(c) A prioritized task to someone to remove that dependency on 3.0 release requirements

Until we have these, please do not close the bug.

We should not do any fixes on the lines of checking for known apps that conflict etc, as that is treating the symptom and not fixing the cause.
Comment by Bin Cui [ 18/Nov/13 ]
We install the Windows runtime library because the Erlang runtime libraries depend on it. Not just any runtime library, but the one that comes with the Erlang distribution package. Without it, or with an incompatible version, erl.exe won't run.

Instead of checking for any particular applications, the current solution is:
Run an Erlang test script. If it runs correctly, no runtime library is installed. Otherwise, the installer has to install the runtime library.
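
A minimal form of such a check (an assumption of what the test script does, not the actual script shipped by the installer) would be something like:

    erl.exe -noshell -eval "erlang:display(ok)" -s init stop

If the required runtime library is missing or incompatible, erl.exe fails to start at all, which is what the installer keys off.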

Please see CBD-593.

Comment by Sriram Melkote [ 18/Nov/13 ]
My suggestion is that we not attempt to install MSVCRT ourselves.

Let us check whether the library we need is present prior to starting the install (via the appropriate registry keys).

If it is absent, let us direct the user to download and install it, and exit.
Comment by Bin Cui [ 18/Nov/13 ]
The approach is not totally right. Even if msvcrt exists, we still need to install it. Here the key is the exact same msvcrt package that comes with the Erlang distribution. We had problems before where, with the same version but a different build of msvcrt installed, Erlang wouldn't run.

One possible solution is to ask the user to download the msvcrt library from our website and make it a prerequisite for installing Couchbase Server.
Comment by Sriram Melkote [ 18/Nov/13 ]
OK. It looks like MS distributes some versions of the VC runtime with the OS itself. I doubt that Erlang needs anything newer.

So let us rebuild Erlang and have it link to the OS-supplied version of MSVCRT (i.e., msvcr70.dll) in Couchbase 3.0 onwards.

In the meantime, let us point the user to the vcredist we ship in Couchbase 2.x versions and ask them to install it from there.
Comment by Steve Yen [ 23/Dec/13 ]
Saw this in the email inboxes...

From: Tal V
Date: December 22, 2013 at 1:19:36 AM PST
Subject: Installing Couchbase on Windows 7

Hi CouchBase support,
I would like to get your assistance with an issue I’m having. I have a Windows 7 machine on which I tried to install Couchbase; the installation is stuck on the “Computing space requirements” step.
I tried several things without success:

1. I tried to download a new installation package.

2. I deleted all records of the software from the Registry.

3. I deleted the folder that was created under C:\Program Files\Couchbase.

4. I restarted the computer.

5. I opened only the installation package.

6. I re-installed it again.
And again it was stuck on the same step.
What is the solution for it?

Thank you very much,


--
Tal V
Comment by Steve Yen [ 23/Dec/13 ]
Hi Bin,
Not knowing much about installshield here, but one idea - are there ways of forcibly, perhaps optionally, skipping the computing space requirements step? Some environment variable flag, perhaps?
Thanks,
Steve

Comment by Bin Cui [ 23/Dec/13 ]
This "Computing space requirements" is quite misleading. It happens at the post install step while GUI still shows that message. Within the step, we run the erlang test script and fails and the installer runs "vcredist.exe" for microsoft runtime library which gets stuck.

For the time being, the most reliable way is not to run this vcredist.exe from installer. Instead, we should provide a link in our download web site.

1. During installation, if we fails to run the erlang test script, we can pop up a warning dialog and ask customers to download and run it after installation.
 
Comment by Bin Cui [ 23/Dec/13 ]
To work around the problem, we can instruct the customer to download vcredist.exe and run it manually before setting up Couchbase Server. If the running environment is set up correctly, the installer will bypass that step.
Comment by Bin Cui [ 30/Dec/13 ]
Use windows registry key to install/skip the vcredist.exe step:

On 32bit windows, Installer will check HKEY_LOCAL_MACHINE\SOFTWARE\Couchbase\SkipVcRuntime
On 64bit windows, Installer will check HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Couchbase\SkipVcRuntime,
where SkipVcRuntime is a DWORD (32-bit) value.

When SkipVcRuntime is set to 1, the installer will skip the step that installs vcredist.exe. Otherwise, the installer will follow the same logic as before.
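For example, the key can be created from an elevated command prompt (a sketch following the paths above):

    rem 32-bit Windows
    reg add "HKLM\SOFTWARE\Couchbase" /v SkipVcRuntime /t REG_DWORD /d 1

    rem 64-bit Windows
    reg add "HKLM\SOFTWARE\Wow6432Node\Couchbase" /v SkipVcRuntime /t REG_DWORD /d 1
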
vcredist_x86.exe can be found in the root directory of couchbase server. It can be run as:
c:\<couchbase_root>\vcredist_x86.exe

http://review.couchbase.org/#/c/31501/
Comment by Bin Cui [ 02/Jan/14 ]
Checked into branch 2.5: http://review.couchbase.org/#/c/31558/
Comment by Iryna Mironava [ 22/Jan/14 ]
Tested with Win 7 and Win Server 2008.
I am unable to reproduce this issue (build 2.0.0-1976; dp3 is no longer available).
Installed/uninstalled Couchbase several times.
Comment by Sriram Melkote [ 22/Jan/14 ]
Unfortunately, for this problem, if it did not reproduce, we can't say it is fixed. We have to find a machine where it reproduces and then verify a fix.

Anyway, no change made actually addresses the underlying problem (the registry key just gives a way to work around it when it happens), so reopening the bug and targeting it for 3.0.
Comment by Sriram Melkote [ 23/Jan/14 ]
Bin - I just noticed that the Erlang installer itself (when downloaded from their website) installs the VC redistributable in non-silent mode. The Microsoft runtime installer dialog pops up, indicates it will install the VC redistributable, and then completes. Why do we run it in silent mode (and hence assume liability for it running properly)? Why do we not run the MSI in interactive mode like the ESL Erlang installer itself does?
Comment by Wayne Siu [ 05/Feb/14 ]
If we could get the information on the exact software version, it could be helpful.
From registry, Computer\HKLM\Software\Microsoft\WindowsNT\CurrentVersion
Comment by Wayne Siu [ 12/Feb/14 ]
Bin, looks like the erl.ini was locked when this issue happened.
Comment by Pavel Paulau [ 19/Feb/14 ]
Just happened to me in 2.2.0-837.
Comment by Anil Kumar [ 18/Mar/14 ]
Triaged by Don and Anil as per Windows Developer plan.
Comment by Bin Cui [ 08/Apr/14 ]
http://review.couchbase.org/#/c/35463/
Comment by Chris Hillery [ 13/May/14 ]
I'm new here, but it seems to me that vcredist_x64.exe does exactly the same thing as the corresponding MS-provided merge module for MSVC2013. If that's true, we should be able to just include that merge module in our project, and not need to fork out to install things. In fact, as of a few weeks ago, the 3.0 server installers are doing just that.

http://msdn.microsoft.com/en-us/library/dn501987.aspx

Is my understanding incomplete in some way?
Comment by Chris Hillery [ 14/May/14 ]
I can confirm that the most recent installers do install msvcr120.dll and msvcp120.dll in apparently the correct places, and the server can start with them. I *believe* this means that we no longer need to fork out vcredist_x64.exe, or have any of the InstallShield tricks to detect whether it is needed and/or skip installing it, etc. I'm leaving this bug open to both verify that the current merge module-based solution works, and to track removal of the unwanted code.
Comment by Sriram Melkote [ 16/May/14 ]
I've also verified that the VCRT installed by the 3.0 build (msvcp100) is sufficient for Erlang R16.




[MB-12063] KV+XDCR System test : Between expiration and purging, getMeta() retrieves revID as 1 for deleted key from Source, same deleted key from Destination returns 2. Created: 25/Aug/14  Updated: 29/Aug/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: rc2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.x , build 3.0.0-1174-rel

Issue Links:
Dependency
depends on MB-12100 Rebalance exited with reason bad_repl... Open
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
--> Before expiration and after purging, all metadata match between source and destination clusters.
--> However, after expiration, there's some code that causes a deleted key at the source(!!!) to have a seqno of 1. The seqno at the destination is, however, 2, as expected.
--> The data below is for uni-directional XDCR (C1 -> C2).
--> Not a recent regression; I have seen it once before in 3.0.0-9xx. Catching this bug totally depends on when I run the validation script after the system test is completed. Expiration is usually set to 1 day and the tombstone purge interval is 3 days on both source and destination. Once tombstones are purged, I don't see this mismatch. So I don't have a live cluster.

{'C1_location:': u'172.23.105.44', 'vb': 90, 'C2_node': u'172.23.105.54', 'C1_key_count': 19919, 'C2_key_count': 19919, 'missing_keys': 0}
RevID or CAS mismatch -
  172.23.105.44(C1): key:65ABEE18-153_100061 metadata:{'deleted': 1, 'seqno': 1, 'cas': 1902841111553483, 'flags': 0, 'expiration': 1408646731}
  172.23.105.54(C2): key:65ABEE18-153_100061 metadata:{'deleted': 1, 'seqno': 2, 'cas': 1902841111553484, 'flags': 0, 'expiration': 1408646731}
 RevID or CAS mismatch -
  172.23.105.44(C1): key:65ABEE18-153_100683 metadata:{'deleted': 1, 'seqno': 1, 'cas': 1902841111336520, 'flags': 0, 'expiration': 1408646731}
  172.23.105.54(C2): key:65ABEE18-153_100683 metadata:{'deleted': 1, 'seqno': 2, 'cas': 1902841111336521, 'flags': 0, 'expiration': 1408646731}
RevID or CAS mismatch -
  172.23.105.44(C1): key:65ABEE18-153_100713 metadata:{'deleted': 1, 'seqno': 1, 'cas': 1902841111837669, 'flags': 0, 'expiration': 1408646731}
  172.23.105.54(C2): key:65ABEE18-153_100713 metadata:{'deleted': 1, 'seqno': 2, 'cas': 1902841111837670, 'flags': 0, 'expiration': 1408646731}
 RevID or CAS mismatch -
  172.23.105.44(C1): key:65ABEE18-153_103240 metadata:{'deleted': 1, 'seqno': 1, 'cas': 1902843752129235, 'flags': 0, 'expiration': 1408646733}
  172.23.105.54(C2): key:65ABEE18-153_103240 metadata:{'deleted': 1, 'seqno': 2, 'cas': 1902843752129236, 'flags': 0, 'expiration': 1408646733}
 RevID or CAS mismatch -
  172.23.105.44(C1): key:65ABEE18-153_105170 metadata:{'deleted': 1, 'seqno': 1, 'cas': 1902847773405994, 'flags': 0, 'expiration': 1408646737}
  172.23.105.54(C2): key:65ABEE18-153_105170 metadata:{'deleted': 1, 'seqno': 2, 'cas': 1902847773405995, 'flags': 0, 'expiration': 1408646737}

Please let me know what/if you need in particular to diagnose this issue. Thanks!
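
For reference, a minimal sketch (Python; not taken from the actual validation script) of the comparison being described, operating on metadata dicts shaped like the output above. The key and values below are copied from the first mismatch and are illustrative only:

def compare_meta(key, src_meta, dst_meta):
    """Flag a mismatch when revID (seqno) or CAS differ between the clusters."""
    mismatched = [f for f in ("seqno", "cas") if src_meta[f] != dst_meta[f]]
    if mismatched:
        print("RevID or CAS mismatch -")
        print("  C1: key:%s metadata:%s" % (key, src_meta))
        print("  C2: key:%s metadata:%s" % (key, dst_meta))
    return not mismatched

src = {'deleted': 1, 'seqno': 1, 'cas': 1902841111553483, 'flags': 0, 'expiration': 1408646731}
dst = {'deleted': 1, 'seqno': 2, 'cas': 1902841111553484, 'flags': 0, 'expiration': 1408646731}
compare_meta('65ABEE18-153_100061', src, dst)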




 Comments   
Comment by Aruna Piravi [ 25/Aug/14 ]
Worth mentioning that if the same key gets recreated at the source with a seqno <= the seqno of the same key at the destination, the create may not get propagated.
Comment by Aruna Piravi [ 25/Aug/14 ]
We have many xdcr functional tests with expiration, after which we compare revIDs, even of deleted items. We did not hit this particular bug. Venu did some unit tests and could not catch it either.

I'd like to perform the same system test on 2.5.1 to determine if this is a regression. Again, the system test itself runs for 12-15 hrs, we keep loading throughout the test, items keep expiring over the next 24 hrs, the expiry pager runs 3 days from the start of the test, and the validation script runs for half a day, so it's all about timing.

Will get you the data files and start the system test.

Comment by Aruna Piravi [ 25/Aug/14 ]
http://172.23.105.44:8091/index.html
http://172.23.105.54:8091/index.html
Comment by Aruna Piravi [ 25/Aug/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-12063/stdbucket.rtf --> keys that have this mismatch for bucket 'standardbucket'.

For a quick look -
https://s3.amazonaws.com/bugdb/jira/MB-12063/44_couch.tar (source)
https://s3.amazonaws.com/bugdb/jira/MB-12063/54_couch.tar (dest)
Comment by Aruna Piravi [ 25/Aug/14 ]
Also seeing some keys where revID is greater at source than dest. Pls look at vbuckets 15 and 28.

RevID or CAS mismatch -
  172.23.105.44(C1): key:6B67A321-142_4666321 metadata:{'deleted': 1, 'seqno': 4, 'cas': 13175771830561215, 'flags': 0, 'expiration': 1408649286}
  172.23.105.54(C2): key:6B67A321-142_4666321 metadata:{'deleted': 1, 'seqno': 3, 'cas': 13175771830561214, 'flags': 0, 'expiration': 1408649286}
RevID or CAS mismatch -
  172.23.105.44(C1): key:6B67A321-142_4666453 metadata:{'deleted': 1, 'seqno': 4, 'cas': 13175771778347790, 'flags': 0, 'expiration': 1408649286}
  172.23.105.54(C2): key:6B67A321-142_4666453 metadata:{'deleted': 1, 'seqno': 3, 'cas': 13175771778347789, 'flags': 0, 'expiration': 1408649286}
 RevID or CAS mismatch -
  172.23.105.44(C1): key:6B67A321-142_4674099 metadata:{'deleted': 1, 'seqno': 4, 'cas': 13175775983769867, 'flags': 0, 'expiration': 1408649290}
  172.23.105.54(C2): key:6B67A321-142_4674099 metadata:{'deleted': 1, 'seqno': 3, 'cas': 13175775983769866, 'flags': 0, 'expiration': 1408649290}
 RevID or CAS mismatch -
  172.23.105.44(C1): key:6B67A321-142_4674109 metadata:{'deleted': 1, 'seqno': 4, 'cas': 13175775977754745, 'flags': 0, 'expiration': 1408649290}
  172.23.105.54(C2): key:6B67A321-142_4674109 metadata:{'deleted': 1, 'seqno': 3, 'cas': 13175775977754744, 'flags': 0, 'expiration': 1408649290}
RevID or CAS mismatch -
  172.23.105.44(C1): key:6B67A321-142_4677328 metadata:{'deleted': 1, 'seqno': 4, 'cas': 13175778211679600, 'flags': 0, 'expiration': 1408649293}
  172.23.105.54(C2): key:6B67A321-142_4677328 metadata:{'deleted': 1, 'seqno': 3, 'cas': 13175778211679599, 'flags': 0, 'expiration': 1408649293}
Comment by Aruna Piravi [ 25/Aug/14 ]
cbcollect :

https://s3.amazonaws.com/bugdb/jira/MB-12063/source.tar
https://s3.amazonaws.com/bugdb/jira/MB-12063/dest.tar

Note: .44 is down now. It was up until this morning; .44 ran out of disk space. The above script ran 2 days ago.

I can start system test on 2.5.1 tomorrow morning if required. Pls let me know, thanks.
Comment by Wayne Siu [ 26/Aug/14 ]
Reviewed with Cihan. Potentially an RC2 candidate, if the fix is contained and is ready by this week.
Need Dev's risk assessment.
Comment by Mike Wiederhold [ 26/Aug/14 ]
The logs have rolled over so I can't see exactly what happened. I think the reason the cluster got into this state is that the expiry pager was run on one side of the cluster and not the other. This can easily explain why the destination would have a different sequence number and the source wouldn't. This happens because expiring a key will increment the rev sequence number, and unfortunately I don't think there is anything we can do about this.

I am assuming that the same issue is happening when the source cluster's seqno is incremented but the destination seqno is not. In this case we should see the delete replicated to the other side, and I cannot determine whether or not this happened because the logs have rolled over. I'm also not sure if you just checked those keys while keys were being propagated to the destination node. All of the traffic could have stopped, but the expiry pager might have kicked in during the verification phase.

I'm seeing a log message that I need to investigate further so I will leave this assigned to me.
Comment by Aruna Piravi [ 26/Aug/14 ]
 > I think the reason that the cluster got into this state is because the expiry pager was run on one side of the cluster and not the other.

That's possible, but what can explain seeing 'deleted': 1 with 'seqno': 1? My understanding is that any key with deleted flag = 1 needs to have a revID of at least 2. The revID for any doc can be 1 only at the time of creation. If expired or deleted, the revID should be incremented, right?
Comment by Aruna Piravi [ 27/Aug/14 ]
Also, this is C1 -> C2 (uni-xdcr). So seeing something like

172.23.105.44(C1): key:65ABEE18-153_100061 metadata:{'deleted': 1, 'seqno': 1, 'cas': 1902841111553483, 'flags': 0, 'expiration': 1408646731}
172.23.105.54(C2): key:65ABEE18-153_100061 metadata:{'deleted': 1, 'seqno': 2, 'cas': 1902841111553484, 'flags': 0, 'expiration': 1408646731}

would mean the expiry pager ran at C1 ('deleted': 1) but did not increment the revID, which is clearly a bug.
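
To make that expectation concrete, a tiny sketch of the invariant described above (a hypothetical helper, not part of any test framework):

def revid_invariant_holds(meta):
    # Per the reasoning above: a key carrying deleted = 1 has been mutated at
    # least once after creation, so its revID (seqno) should be >= 2.
    # deleted = 1 with seqno = 1 is the suspect case reported in this ticket.
    if meta.get('deleted') == 1:
        return meta['seqno'] >= 2
    return True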
Comment by Cihan Biyikoglu [ 27/Aug/14 ]
is there an ETA on the resolution? if this won't resolve in the next day, we need to push this one out.
thanks
Comment by Abhinav Dangeti [ 27/Aug/14 ]
Mike's already submitted the fixes:
http://review.couchbase.org/#/c/40996/
http://review.couchbase.org/#/c/40997/
Comment by Mike Wiederhold [ 28/Aug/14 ]
The above changes do not resolve this issue. When looking through the logs I saw two separate issues which needed to be fixed.
Comment by Cihan Biyikoglu [ 28/Aug/14 ]
Need a fix by EOD today so we can consider this for RC2. Otherwise we'll need to delay the RC.
Comment by Sundar Sridharan [ 28/Aug/14 ]
Aruna, if possible, could you please list the operations that the test is doing since we need clues to reproduce and verify. thanks
Comment by Aruna Piravi [ 28/Aug/14 ]
Hi Sundar, as we discussed, here are the steps

Clusters
-----------
C1 : http://172.23.105.44:8091/
C2 : http://172.23.105.54:8091/

Steps
--------
1. Setup uni-xdcr on "standardbucket1" , load till active_resident_ratio = ~70 on standardbucket1
2. Access phase with 50% gets, 50%deletes for 3 hrs
3. Rebalance-out 1 node at cluster1
4. Rebalance-in 1 node at cluster1
5. Failover and remove node at cluster1
6. Failover and add-back node at cluster1
7. Rebalance-out 1 node at cluster2
8. Rebalance-in 1 node at cluster2
9. Failover and remove node at cluster2
10. Failover and add-back node at cluster2
11. Soft restart all nodes in cluster1 one by one

Run the verification script after 15 hrs (after all items are expired).
Comment by Sundar Sridharan [ 28/Aug/14 ]
Aruna the toy build is couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0-710-toy.rpm
Comment by Aruna Piravi [ 28/Aug/14 ]
System test started on the above toy build.

You can check
C1 : http://172.23.105.44:8091/
C2 : http://172.23.105.54:8091/
if you would like to watch the test.

It should be finished by tomorrow afternoon. If there are crashes, it would stop sooner. I will check back late night and tomorrow morning.
Comment by Aruna Piravi [ 28/Aug/14 ]
Sundar, there seems to be a problem with the toy build. The buckets got created but did not get successfully loaded on all nodes in the cluster. Pls take a look at either of the clusters. All nodes in both clusters are in pending state.
Comment by Raju Suravarjjala [ 29/Aug/14 ]
Thanks Sundar, we need to make a call on this quickly
Comment by Sundar Sridharan [ 29/Aug/14 ]
Raju, I had a discussion with Aruna and Chiyoung and will try to get around to this issue as soon as possible.
Aruna, the new toy build is ready at..
couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0-711-toy.rpm
thanks
Comment by Chiyoung Seo [ 29/Aug/14 ]
Aruna,

I looked at the getMeta implementation and confirmed that we return the "deleted" flag to the caller if an item is expired. Therefore, this mismatch can happen if the expiry pager was executed only in the source or destination cluster, but not both. As we discussed, please run the expiry pager by force in both clusters before running the verification step.
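
For the record, one way to force the pager is to drop its run interval on every node of both clusters before verification; a sketch assuming the documented "cbepctl ... set flush_param exp_pager_stime <seconds>" interface (option names and paths should be checked against the installed build):

import subprocess

CBEPCTL = "/opt/couchbase/bin/cbepctl"
NODES = ["172.23.105.44:11210", "172.23.105.54:11210"]  # source and destination
BUCKET = "standardbucket"

for node in NODES:
    # Run the expiry pager almost immediately so expired items are processed
    # on both sides before the validation script compares metadata.
    subprocess.check_call([CBEPCTL, node, "-b", BUCKET,
                           "set", "flush_param", "exp_pager_stime", "10"])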
Comment by Aruna Piravi [ 29/Aug/14 ]
Started a small-scale system test with the latest build, which I expect to complete in 2 hrs. Expiration is set to 10 mins. Will run the expiry pager on both sides before verification.

If we do get the "deleted" flag for "expired but undeleted" items, it may not be a bug. Will confirm.
Comment by Wayne Siu [ 29/Aug/14 ]
Reviewed by Cihan/Chiyoung, raised the priority to BLOCKER.
Comment by Aruna Piravi [ 29/Aug/14 ]
Unable to do further verification on RC2, blocked by MB-12100 which causes mismatch in keys at the end of the test. Will have to wait till MB-12100 is fixed. Thanks.




[MB-6972] distribute couchbase-server through yum and ubuntu package repositories Created: 19/Oct/12  Updated: 29/Aug/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Farshid Ghods (Inactive) Assignee: Phil Labee
Resolution: Unresolved Votes: 3
Labels: devX
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks MB-8693 [Doc when ready] distribute couchbase... Reopened
blocks MB-7821 yum install couchbase-server from cou... Resolved
Duplicate
duplicates MB-2299 Create signed RPM's Resolved
is duplicated by MB-9409 repository for deb packages (debian&u... Resolved
Flagged:
Release Note

 Description   
This helps us in handling dependencies that are needed for Couchbase Server.
The SDK team has already implemented this for various SDK packages.

We might have to make some changes to our packaging metadata to work with this schema.

 Comments   
Comment by Steve Yen [ 26/Nov/12 ]
to 2.0.2 per bug-scrub

first step is to do the repositories?
Comment by Steve Yen [ 26/Nov/12 ]
back to 2.0.1, per bug-scrub
Comment by Farshid Ghods (Inactive) [ 19/Dec/12 ]
Phil,
please sync up with Farshid and get instructions that Sergey and Pavel sent
Comment by Farshid Ghods (Inactive) [ 28/Jan/13 ]
we should resolve this task once 2.0.1 is released .
Comment by Dipti Borkar [ 29/Jan/13 ]
Have we figured out the upgrade process moving forward? For example, from 2.0.1 to 2.0.2, or 2.0.1 to 2.1?
Comment by Jin Lim [ 04/Feb/13 ]
Please ensure that we also confirm/validate the upgrade process moving from 2.0.1 to 2.0.2. Thanks.
Comment by Phil Labee [ 06/Feb/13 ]
Now have DEB repo working, but another issue has come up: We need to distribute the public key so that users can install the key before running apt-get.

wiki page has been updated.
Comment by kzeller [ 14/Feb/13 ]
Added to 2.0.1 RN as:

Fix:

We now provide Couchbase Server through yum and Debian package repositories.
Comment by Matt Ingenthron [ 09/Apr/13 ]
What are the public URLs for these repositories? This was mentioned in the release notes here:
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-server-rn_2-0-0l.html
Comment by Matt Ingenthron [ 09/Apr/13 ]
Reopening, since this isn't documented that I can find. Apologies if I'm just missing it.
Comment by Dipti Borkar [ 23/Apr/13 ]
Anil, can you work with Phil to see what are the next steps here?
Comment by Anil Kumar [ 24/Apr/13 ]
Yes I'll be having discussion with Phil and will update here with details.
Comment by Tim Ray [ 28/Apr/13 ]
Could we either remove the note about yum/deb repos in the release notes or get those repo locations / sample files / keys added to public pages? The only links that seem like they might contain the info point to internal pages I don't have access to.
Comment by Anil Kumar [ 14/May/13 ]
Thanks Tim, we have removed it from the release notes. We will add instructions about the yum/deb repo locations/files/keys to the documentation once it's available. Thanks!
Comment by kzeller [ 14/May/13 ]
Removing duplicate ticket:

http://www.couchbase.com/issues/browse/MB-7860
Comment by h0nIg [ 24/Oct/13 ]
any update? maybe i created a duplicate issue: http://www.couchbase.com/issues/browse/MB-9409 but it seems that the repositories are outdated on http://hub.internal.couchbase.com/confluence/display/CR/How+to+Use+a+Linux+Repo+--+debian
Comment by Sriram Melkote [ 22/Apr/14 ]
I tried to install on Debian today. It failed badly. One .deb package didn't match the libc version of stable. The other didn't match the openssl version. Changing libc or openssl is simply not an option for someone using Debian stable because it messes with the base OS too deeply. So as of 4/23/14, we don't have support for Debian.
Comment by Sriram Melkote [ 22/Apr/14 ]
Anil, we have accumulated a lot of input in this bug. I don't think this will realistically go anywhere for 3.0 unless we define specific goals and a considered expansion of the platform support matrix. Can you please define the 3.0 goal more precisely?
Comment by Matt Ingenthron [ 22/Apr/14 ]
+1 on Siri's comments. Conversations I had with both Ubuntu (who recommend their PPAs) and Red Hat experts (who recommend setting up a repo or getting into EPEL or the like) indicated that's the best way to ensure coverage of all OSes. Binary packages built on one OS and deployed on another are risky and run into dependency issues.
Comment by Anil Kumar [ 28/Apr/14 ]
This ticket is specifically for distributing DEB and RPM packages through YUM and APT repositories. We have another ticket, MB-10960, for supporting the Debian platform.
Comment by Anil Kumar [ 23/Jun/14 ]
Assigning ticket to Tony for verification.
Comment by Phil Labee [ 21/Jul/14 ]
Need to do before closing:

[ ] capture keys and process used for build that is currently posted (3.0.0-628), update tools and keys of record in build repo and wiki page
[ ] distribute 2.5.1 and 3.0.0-beta1 builds using same process, testing update capability
[ ] test update from 2.0.0 to 2.5.1 to 3.0.0
Comment by Phil Labee [ 21/Jul/14 ]
re-opening to assign to sprint to prepare the distribution repos for testing
Comment by Wayne Siu [ 30/Jul/14 ]
Phil,
has build 3.0.0-973 been updated in the repos for beta testing?
Comment by Wayne Siu [ 29/Aug/14 ]
Phil,
Please refresh it with build 3.0.0-1205. Thanks.




[MB-12104] Carrier Config missing after R/W Concurrency change Created: 01/Sep/14  Updated: 01/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Michael Nitschinger Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks JCBC-537 Setting Reader/Writer Worker value on... Open
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Hi,

while investigating JCBC-537 I found what I think is a pretty severe issue when changing the R/W concurrency.

When it is changed in the UI, the GET_CFG command from CCCP returns success, but with an empty response. Once the server(s) are restarted, the config is back.

This manifests in JCBC-537 when tested with 2.5.1 (and the threads are changed), but it persists even with 3.0 when the setting is just set to high. The same issue shows up on both single-node and multi-node setups. I tried restarting the single node (3.0) when set to high, and after it came back up the command worked.

I guess some processes need to be restarted for the change to take effect, and they are not picking up a binary config afterwards?

I set it to Blocker because this is already harming production boxes; feel free to lower it. As far as I can see, the workaround is to restart the cluster after the setting is made.
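
For anyone reproducing this, a minimal sketch of issuing the CCCP config request directly against memcached and checking whether the body comes back empty. The GET_CLUSTER_CONFIG opcode value (0xb5), port 11210 and an unauthenticated default bucket are assumptions based on the standard binary protocol, not details taken from this ticket:

import socket
import struct

HOST, PORT = "127.0.0.1", 11210
GET_CLUSTER_CONFIG = 0xb5  # assumed opcode for the CCCP "get config" request

def recv_all(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise IOError("connection closed")
        buf += chunk
    return buf

# 24-byte request header: magic, opcode, keylen, extlen, datatype, vbucket,
# bodylen, opaque, cas -- this command carries no key, extras or value.
req = struct.pack(">BBHBBHIIQ", 0x80, GET_CLUSTER_CONFIG, 0, 0, 0, 0, 0, 0, 0)

sock = socket.create_connection((HOST, PORT))
sock.sendall(req)
hdr = recv_all(sock, 24)
magic, opcode, keylen, extlen, dt, status, bodylen, opaque, cas = \
    struct.unpack(">BBHBBHIIQ", hdr)
body = recv_all(sock, bodylen) if bodylen else b""

print("status=0x%02x body_length=%d" % (status, bodylen))
print("empty config returned!" if not body else body[:80])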




[MB-12100] Rebalance exited with reason bad_replicas Created: 29/Aug/14  Updated: 02/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: rebalance-failed
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.x, 3.0.0-1174-rel

Issue Links:
Dependency
blocks MB-12063 KV+XDCR System test : Between expirat... Open
Triage: Untriaged
Is this a Regression?: No

 Description   
--> Only 2 nodes were present in the cluster; .47 went down and could not be auto-failovered because only one other node was present in the cluster.
--> .47 was brought up a few secs later.
--> The subsequent rebalance failed with the reason below.

ns_orchestrator002 ns_1@172.23.105.44 13:17:54 - Fri Aug 29, 2014
Bad replicators after rebalance:
Missing = [{'ns_1@172.23.105.44','ns_1@172.23.105.47',33},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',66},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',90},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',97},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',222},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',314},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',416},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',420},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',424},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',428},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',432},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',436},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',440},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',444},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',448},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',452},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',456},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',460},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',464},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',468},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',472},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',476},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',479},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',480},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',483},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',484},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',487},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',488},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',491},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',492},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',495},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',496},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',499},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',500},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',503},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',504},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',507},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',508},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',511},
{'ns_1@172.23.105.44','ns_1@172.23.105.47',512}]
Extras = []

Attaching cbcollect

 Comments   
Comment by Aruna Piravi [ 29/Aug/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-12100/172.23.105.47-8292014-1347-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12100/172.23.105.44-8292014-1341-diag.zip
Comment by Aleksey Kondratenko [ 29/Aug/14 ]
I suspect that a bit more info about what you did prior to all that might end up being helpful.
Comment by Aruna Piravi [ 29/Aug/14 ]
Ok, I did the following

1. Created buckets and loaded data up to active_resident_ratio =~ 50, set up xdcr to another cluster.
2. rebalanced-out .45
3. rebalanced-in .45
4. failed over .45 and rebalanced it out
5. Stopped the server on .47 for >30 secs (auto_failover enabled) but failover understandably did not happen given we had just 2 nodes. Tried to rebalance again after .47 came up, which is when it failed.
Comment by Aleksey Kondratenko [ 29/Aug/14 ]
Seeing lots of issues like this:

[ns_server:debug,2014-08-29T13:18:00.651,ns_1@172.23.105.47:<0.28627.3>:dcp_proxy:handle_packet:120]Proxy packet: RESPONSE: 0x53 (dcp_stream_req) vbucket = 0 opaque = 0x2C020000 status = 0x22 (erange)
81 53 00 00
00 00 00 22
00 00 00 0D
2C 02 00 00
00 00 00 00
00 00 00 00
4F 75 74 73
69 64 65 20
72 61 6E 67
65

I've bumped the bug to blocker and put it into 3.0.0 because, from what I see (erange errors continuing well after rebalance), it's possible that dcp replication is significantly malfunctioning, which looks severe enough to me.
Comment by Aleksey Kondratenko [ 29/Aug/14 ]
CC-ed some stakeholders.

Aruna, you should strongly consider verifying replication up-to-dateness as part of tests that may affect it, e.g. via a replica read command.
Comment by Wayne Siu [ 29/Aug/14 ]
Reviewed with PM/Cihan. This is potentially a release blocker.
Comment by Mike Wiederhold [ 29/Aug/14 ]
Should be fixed in build 1201. Please re-test.

https://github.com/membase/ep-engine/commit/7580656bc172ba02cd37e50a36440fffa94937f3

https://github.com/membase/ep-engine/commit/669dfe80e80444ee0208c8c33bcda736a4577ee0

https://github.com/membase/ep-engine/commit/a20ff04d9006bf5a526e7cf50b15f3a0b92b55d4
Comment by Aruna Piravi [ 29/Aug/14 ]
All,

I was wrong when I said I found this bug in 3.0.0-1174-rel. I actually found it on a toy build built this morning by Sundar with the latest 3.0 code for MB-12063 verification. However:

1. Sundar did not add any new code except for a couple of asserts that would cause memcached to crash. We did not find any evidence of the asserts being hit, nor were there any memcached crashes.
2. The toy build also contains these commits from Mike. So this toy build, except for Sundar's asserts, is what we would have as RC2.

Sundar's toybuild manifest - http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0-711-toy.rpm.manifest.xml
Comment by Cihan Biyikoglu [ 29/Aug/14 ]
Aruna, we should retest on the regular build before we assign this to Mike; we cannot trust toy builds at this point. If you can retest on the build that was just kicked off and repro the issue, we will reset RC2 and take the fix.
thanks
Comment by Aruna Piravi [ 29/Aug/14 ]
Reproduced again on RC2 - 3.0.0-1205-rel. Rebalance failed with same reason. Cluster - http://172.23.105.44:8091/index.html#sec=log

Assigning to Mike.
Comment by Aruna Piravi [ 29/Aug/14 ]
Logs if reqd - https://www.couchbase.com/issues/browse/MB-12102?focusedCommentId=99639&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-99639
Comment by Aruna Piravi [ 29/Aug/14 ]
Indeed seeing bad replicas. Active and replica key counts do not match for almost all vbuckets.

ran cbvdiff - some vb results:

VBucket 857: active count 2710 != 2714 replica count

VBucket 858: active count 2738 != 2739 replica count

VBucket 860: active count 2733 != 2736 replica count

VBucket 862: active count 2691 != 2695 replica count

VBucket 864: active count 2712 != 2715 replica count

VBucket 865: active count 2684 != 2687 replica count

VBucket 868: active count 2684 != 2688 replica count

VBucket 870: active count 2705 != 2708 replica count

VBucket 871: active count 2657 != 2656 replica count

VBucket 872: active count 2731 != 2733 replica count

VBucket 873: active count 2714 != 2715 replica count

VBucket 874: active count 2756 != 2759 replica count

VBucket 877: active count 2812 != 2814 replica count

VBucket 878: active count 2742 != 2744 replica count

VBucket 880: active count 2737 != 2738 replica count

VBucket 882: active count 2687 != 2691 replica count

VBucket 883: active count 2699 != 2700 replica count

VBucket 884: active count 2652 != 2657 replica count

VBucket 885: active count 2668 != 2669 replica count

VBucket 886: active count 2718 != 2722 replica count

VBucket 888: active count 2775 != 2776 replica count

VBucket 889: active count 2778 != 2780 replica count

VBucket 890: active count 2719 != 2724 replica count

VBucket 892: active count 2727 != 2728 replica count

VBucket 893: active count 2754 != 2755 replica count

VBucket 894: active count 2778 != 2779 replica count

VBucket 895: active count 2804 != 2805 replica count

VBucket 896: active count 2731 != 2732 replica count

VBucket 897: active count 2693 != 2694 replica count

VBucket 901: active count 2736 != 2737 replica count

VBucket 902: active count 2716 != 2720 replica count

VBucket 903: active count 2693 != 2691 replica count

VBucket 904: active count 2724 != 2726 replica count

VBucket 905: active count 2707 != 2710 replica count

VBucket 906: active count 2753 != 2756 replica count

VBucket 908: active count 2788 != 2791 replica count

VBucket 909: active count 2729 != 2732 replica count

VBucket 910: active count 2733 != 2734 replica count

VBucket 913: active count 2692 != 2696 replica count

VBucket 914: active count 2699 != 2701 replica count

VBucket 916: active count 2690 != 2695 replica count

VBucket 917: active count 2736 != 2738 replica count

VBucket 918: active count 2753 != 2754 replica count

VBucket 919: active count 2714 != 2715 replica count

VBucket 920: active count 2779 != 2781 replica count

VBucket 921: active count 2765 != 2769 replica count

VBucket 922: active count 2696 != 2698 replica count

VBucket 925: active count 2738 != 2740 replica count

VBucket 926: active count 2783 != 2787 replica count

VBucket 928: active count 2768 != 2771 replica count

VBucket 929: active count 2775 != 2778 replica count

VBucket 930: active count 2724 != 2725 replica count

VBucket 933: active count 2720 != 2721 replica count

VBucket 934: active count 2793 != 2797 replica count

VBucket 937: active count 2742 != 2744 replica count

VBucket 938: active count 2717 != 2719 replica count

VBucket 940: active count 2688 != 2690 replica count

VBucket 941: active count 2652 != 2653 replica count

VBucket 945: active count 2736 != 2737 replica count

VBucket 946: active count 2791 != 2792 replica count

VBucket 948: active count 2807 != 2810 replica count

VBucket 949: active count 2778 != 2781 replica count

VBucket 956: active count 2738 != 2739 replica count

VBucket 957: active count 2684 != 2686 replica count

VBucket 958: active count 2669 != 2674 replica count

VBucket 960: active count 2682 != 2683 replica count

VBucket 965: active count 2673 != 2674 replica count

VBucket 966: active count 2681 != 2685 replica count

VBucket 968: active count 2710 != 2711 replica count

VBucket 969: active count 2748 != 2749 replica count

VBucket 970: active count 2815 != 2818 replica count

VBucket 972: active count 2795 != 2798 replica count

VBucket 973: active count 2763 != 2766 replica count

VBucket 974: active count 2753 != 2754 replica count

VBucket 976: active count 2727 != 2728 replica count

VBucket 977: active count 2740 != 2743 replica count

VBucket 978: active count 2710 != 2713 replica count

VBucket 980: active count 2714 != 2717 replica count

VBucket 984: active count 2787 != 2790 replica count

VBucket 985: active count 2772 != 2774 replica count

VBucket 986: active count 2739 != 2740 replica count

VBucket 987: active count 2701 != 2700 replica count

VBucket 988: active count 2737 != 2736 replica count

VBucket 989: active count 2740 != 2741 replica count

VBucket 990: active count 2768 != 2770 replica count

VBucket 992: active count 2783 != 2784 replica count

VBucket 993: active count 2743 != 2745 replica count

VBucket 994: active count 2689 != 2691 replica count

VBucket 997: active count 2725 != 2727 replica count

VBucket 998: active count 2809 != 2813 replica count

VBucket 1001: active count 2726 != 2728 replica count

VBucket 1002: active count 2676 != 2679 replica count

VBucket 1004: active count 2687 != 2692 replica count

VBucket 1005: active count 2727 != 2728 replica count

VBucket 1008: active count 2729 != 2731 replica count

VBucket 1010: active count 2761 != 2764 replica count

VBucket 1011: active count 2791 != 2790 replica count

VBucket 1012: active count 2758 != 2761 replica count

VBucket 1013: active count 2771 != 2774 replica count

VBucket 1014: active count 2699 != 2700 replica count

VBucket 1016: active count 2711 != 2713 replica count

VBucket 1017: active count 2695 != 2694 replica count

VBucket 1018: active count 2739 != 2740 replica count

VBucket 1020: active count 2719 != 2720 replica count

VBucket 1021: active count 2758 != 2759 replica count

VBucket 1022: active count 2710 != 2713 replica count

Active item count = 2795282
Comment by Chiyoung Seo [ 29/Aug/14 ]
Mike,

Please take a look at this issue. Seems like a regression from the changes that were recently merged.
Comment by Mike Wiederhold [ 01/Sep/14 ]
I need the data files in order to debug this.
Comment by Aruna Piravi [ 02/Sep/14 ]
Data files are usually not attached for system tests; they are huge. The cluster is available for investigation, though, as I've mentioned above - http://172.23.105.44:8091/index.html. If you need any specific vbucket files, do let me know.




[MB-12087] Create "third-party license information" web page Created: 28/Aug/14  Updated: 02/Sep/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Sriram Melkote Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: rc2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We need to have proper license information for all modules, and third party components.

After discussion between Anil and Ruth, it was decided to place this information on the website. There is a LICENSE.txt file shipped with Couchbase that contains the following:


-----
Couchbase Server release 3.0.0

You can find the Third Party License for all the third party products included
with Couchbase Server at:

  http://www.couchbase.com/3rdpartylicenses-couchbaseserver-3.0.0
-----


I'm co-opting this bug to track gathering all of those license texts and creating the above webpage. I'm marking it as a "Blocker", but to be clear, it is a *release* blocker, not something that needs to be done before a release candidate build.

 Comments   
Comment by Sriram Melkote [ 28/Aug/14 ]
This is a placeholder bug to track approval to make the change for 3.0 and to capture all reviews that get merged.
Comment by Volker Mische [ 28/Aug/14 ]
Our sample datasets don't have a proper license: http://review.couchbase.org/40983
Comment by Cihan Biyikoglu [ 28/Aug/14 ]
Great catch - approved for RC2
Comment by Cihan Biyikoglu [ 28/Aug/14 ]
Sending over to Wayne to ensure this will make it through the build into the final set of bits. Pls add this to both community and enterprise editions.
Comment by Chris Hillery [ 29/Aug/14 ]
This isn't resolved since we're still shipping the data but not the license file.
Comment by Chris Hillery [ 29/Aug/14 ]
I will update the rel-3.0.0.xml manifest so that at least this change is included in the build. However there is nothing anywhere which causes this LICENSE file to be shipped. From Cihan's comment it sounds like it needs to be shipped; the question is "how"? Where should it be put?
Comment by Chris Hillery [ 29/Aug/14 ]
Can this information be included in the same place as the third-party software licenses? (that is, on the website, at http://www.couchbase.com/3rdpartylicenses-couchbaseserver-3.0.0 )
Comment by Chris Hillery [ 29/Aug/14 ]
http://review.couchbase.org/#/c/41095/ to add Volker's change to 3.0.0 manifest; as mentioned this does not resolve the issue by itself.
Comment by Cihan Biyikoglu [ 29/Aug/14 ]
yes we should include this in the 3rd party license URL.
thanks
Comment by Cihan Biyikoglu [ 29/Aug/14 ]
Sending over to Anil. Need the text to be done and available on the page here:
 http://www.couchbase.com/agreements/3rdpartylicenses-couchbaseserver-3.0.0

the redirect in third_party.txt will be done through the following ticket http://www.couchbase.com/issues/browse/CBIT-1584




[MB-11048] Range queries result in thousands of GET operations/sec Created: 05/May/14  Updated: 18/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
The benchmark for range queries demonstrated very high latency. At the same time I noticed an extremely high rate of GET operations.

Even a single query such as "SELECT name.f.f.f AS _name FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000 LIMIT 20" led to hundreds of memcached reads.

Explain:

https://gist.github.com/pavel-paulau/5e90939d6ab28034e3ed

Engine output:

https://gist.github.com/pavel-paulau/b222716934dfa3cb598e

I don't like to use JIRA as a forum, but why does this happen? Do you fetch the entire range before returning the limited output?

 Comments   
Comment by Gerald Sangudi [ 05/May/14 ]
Pavel,

Yes, the scan and fetch are performed before we do any LIMIT. This will be fixed in DP4, but it may not be easily fixable in DP3.

Can you please post the results of the following query:

SELECT COUNT(*) FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000

Thanks.
Comment by Pavel Paulau [ 05/May/14 ]
cbq> SELECT COUNT(*) FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000
{
    "resultset": [
        {
            "$1": 2134
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "1"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "547.545767ms"
        }
    ]
}
Comment by Pavel Paulau [ 05/May/14 ]
Also it looks like we are leaking memory in this scenario.

Resident memory of cbq-engine grows very fast (several megabytes per second) and never goes down...




[MB-11007] Request for Get Multi Meta Call for bulk meta data reads Created: 30/Apr/14  Updated: 30/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Parag Agarwal Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: All


 Description   
Currently we support a per-key call for getMetaData. As a result, our verification requires a per-key fetch during the verification phase. This request is to support a bulk metadata call that can fetch metadata per vbucket for all keys, or in batches. This would enhance our ability to verify per-document metadata over time or after operations like rebalance, as it will be faster. If there is a better alternative, please recommend one.

Current Behavior

https://github.com/couchbase/ep-engine/blob/master/src/ep.cc

ENGINE_ERROR_CODE EventuallyPersistentStore::getMetaData(
                                                        const std::string &key,
                                                        uint16_t vbucket,
                                                        const void *cookie,
                                                        ItemMetaData &metadata,
                                                        uint32_t &deleted,
                                                        bool trackReferenced)
{
    (void) cookie;
    RCPtr<VBucket> vb = getVBucket(vbucket);
    if (!vb || vb->getState() == vbucket_state_dead ||
        vb->getState() == vbucket_state_replica) {
        ++stats.numNotMyVBuckets;
        return ENGINE_NOT_MY_VBUCKET;
    }

    int bucket_num(0);
    deleted = 0;
    LockHolder lh = vb->ht.getLockedBucket(key, &bucket_num);
    StoredValue *v = vb->ht.unlocked_find(key, bucket_num, true,
                                          trackReferenced);

    if (v) {
        stats.numOpsGetMeta++;

        if (v->isTempInitialItem()) { // Need bg meta fetch.
            bgFetch(key, vbucket, -1, cookie, true);
            return ENGINE_EWOULDBLOCK;
        } else if (v->isTempNonExistentItem()) {
            metadata.cas = v->getCas();
            return ENGINE_KEY_ENOENT;
        } else {
            if (v->isTempDeletedItem() || v->isDeleted() ||
                v->isExpired(ep_real_time())) {
                deleted |= GET_META_ITEM_DELETED_FLAG;
            }
            metadata.cas = v->getCas();
            metadata.flags = v->getFlags();
            metadata.exptime = v->getExptime();
            metadata.revSeqno = v->getRevSeqno();
            return ENGINE_SUCCESS;
        }
    } else {
        // The key wasn't found. However, this may be because it was previously
        // deleted or evicted with the full eviction strategy.
        // So, add a temporary item corresponding to the key to the hash table
        // and schedule a background fetch for its metadata from the persistent
        // store. The item's state will be updated after the fetch completes.
        return addTempItemForBgFetch(lh, bucket_num, key, vb, cookie, true);
    }
}



 Comments   
Comment by Venu Uppalapati [ 30/Apr/14 ]
The server has support for the quiet CMD_GETQ_META call, which can be used on the client side to create a multi-getMeta call similar to the multiGet implementation.
Comment by Parag Agarwal [ 30/Apr/14 ]
Please point to a working example of this call.
Comment by Venu Uppalapati [ 30/Apr/14 ]
Parag, you can find some relevant information on queuing requests using the quiet call at https://code.google.com/p/memcached/wiki/BinaryProtocolRevamped#Get,_Get_Quietly,_Get_Key,_Get_Key_Quietly
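
A rough sketch of that approach: pipeline quiet getMeta requests for a batch of keys and flush with a NOOP, as the linked page describes for quiet gets. The CMD_GETQ_META opcode (assumed 0xa1), the response layout, and the vbucket_of() helper are assumptions for illustration, not verified values:

import socket
import struct

GETQ_META = 0xa1  # assumed opcode for the quiet getMeta extension
NOOP = 0x0a

def recv_all(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise IOError("connection closed")
        buf += chunk
    return buf

def request(opcode, key=b"", vbucket=0, opaque=0):
    # 24-byte binary protocol request header followed by the key.
    return struct.pack(">BBHBBHIIQ", 0x80, opcode, len(key), 0, 0,
                       vbucket, len(key), opaque, 0) + key

def multi_get_meta(host, keys, vbucket_of):
    sock = socket.create_connection((host, 11210))
    # Pipeline one quiet request per key; the opaque field maps replies to keys.
    batch = b"".join(request(GETQ_META, k.encode(), vbucket_of(k), i)
                     for i, k in enumerate(keys))
    # Quiet commands typically reply only on success, so a trailing NOOP marks
    # the end of the batch once its response arrives.
    sock.sendall(batch + request(NOOP))

    metas = {}
    while True:
        hdr = recv_all(sock, 24)
        magic, opcode, keylen, extlen, dt, status, bodylen, opaque, cas = \
            struct.unpack(">BBHBBHIIQ", hdr)
        body = recv_all(sock, bodylen) if bodylen else b""
        if opcode == NOOP:  # all pipelined replies have been drained
            return metas
        # The extras carry the item metadata (deleted flag, flags, expiry, seqno).
        metas[keys[opaque]] = {"status": status, "cas": cas, "extras": body[:extlen]}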
Comment by Chiyoung Seo [ 30/Apr/14 ]
Changing the fix version to the feature backlog given that the 3.0 feature-complete date has already passed and this is requested for the QE testing framework.




[MB-10993] Cluster Overview - Usable Free Space documentation misleading Created: 29/Apr/14  Updated: 29/Apr/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Jim Walker Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: documentation
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Issue relates to:
 http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#viewing-cluster-summary

I was working through a support case and trying to explain the cluster overview free space and usable free space.

The following statement is from our documentation. After a code review of ns_server I concluded that it is incorrect.

Usable Free Space:
The amount of usable space for storing information on disk. This figure shows the amount of space available on the configured path after non-Couchbase files have been taken into account.

The correct statement should be

Usable Free Space:
The amount of usable space for storing information on disk. This figure is calculated from the node with the least amount of available storage in the cluster; the final value is calculated by multiplying that amount by the number of nodes in the cluster.


This change matters because it is important for users to understand why Usable Free Space can be less than Free Space. The cluster considers all nodes to be equal. If you actually have a "weak" node in the cluster, e.g. one with a small disk, then the cluster nodes all have to ensure they keep storage under the weaker node's limits; otherwise, for example, we could never fail over to the weak node, as it cannot take on the job of a stronger node. When Usable Free Space is less than Free Space, the user may actually want to see why a node has less storage available.
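
A worked example of that calculation (a quick sketch; node names and free-space figures are made up, and "Free Space" is taken as the simple sum across nodes):

free_per_node = {
    "node1": 500,  # GB free on the configured data path
    "node2": 480,
    "node3": 120,  # the "weak" node with a small disk
}

free_space = sum(free_per_node.values())                               # 1100 GB
usable_free_space = min(free_per_node.values()) * len(free_per_node)   # 360 GB

print("Free Space: %d GB, Usable Free Space: %d GB" % (free_space, usable_free_space))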




[MB-10920] unable to start tuq if there are no buckets Created: 22/Apr/14  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
node is initialized but has no buckets
[root@kiwi-r116 tuqtng]# ./tuqtng -couchbase http://localhost:8091
10:26:56.520415 Info line disabled false
10:26:56.522641 FATAL: Unable to run server, err: Unable to access site http://localhost:8091, err: HTTP error 401 Unauthorized getting "http://localhost:8091/pools": -- main.main() at main.go:76




[MB-10834] update the license.txt for enterprise edition for 2.5.1 Created: 10/Apr/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Microsoft Word 2014-04-07 EE Free Clickthru Breif License.docx    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
document attached.

 Comments   
Comment by Phil Labee [ 10/Apr/14 ]
2.5.1 has already been shipped, so this file can't be included.

Is this for 3.0.0 release?
Comment by Phil Labee [ 10/Apr/14 ]
voltron commit: 8044c51ad7c5bc046f32095921f712234e74740b

uses the contents of the attached file to update LICENSE-enterprise.txt on the master branch.




[MB-10821] optimize storage of larger binary object in couchbase Created: 10/Apr/14  Updated: 10/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10084] Sub-Task: Changes required for Data Encryption in Client SDK's Created: 30/Jan/14  Updated: 28/May/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Andrei Baranouski
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
depends on JCBC-441 add SSL support in support of Couchba... Open
depends on CCBC-344 add support for SSL to libcouchbase i... Resolved
depends on NCBC-424 Add SSL support in support of Couchba... Resolved

 Description   
Changes required for Data Encryption in Client SDK's

 Comments   
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
wanted to make sure we agree this will be in 3.0. Matt any concerns?
thanks
Comment by Matt Ingenthron [ 20/Mar/14 ]
This should be closed in favor of the specific project issues. That said, the description is a bit fuzzy. Is this SSL support for memcached && views && any cluster management?

Please clarify and then we can open specific issues. It'd be good to have a link to functional requirements.
Comment by Matt Ingenthron [ 20/Mar/14 ]
And Cihan: it can't be "in 3.0", unless you mean concurrent release or release prior to 3.0 GA. Is that what you mean? I'd actually aim to have this feature support in SDKs prior to 3.0's release and we are working on it right now, though it has some other dependencies. See CCBC-344, for example.
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
thanks Matt. I meant 3.0 paired client SDK release so prior or shortly after is all good for me.
context - we are doing a pass to clean up JIRA. Like to button up what's in and out for 3.0.
Comment by Cihan Biyikoglu [ 24/Mar/14 ]
Matt, is there a client side ref implementation you guys did for this one? would be good to pass that onto test folks for initial validation until you guys completely integrate so no regressions creep up while we march to GA.
thanks
Comment by Matt Ingenthron [ 24/Mar/14 ]
We did verification with a non-mainline client since that was the quickest way to do so and have provided that to QE. Also, Brett filed a bug around HTTPS with ns-server and streaming configuration replies. See MB-10519.

We'll do a mainline client with libcouchbase and the python client as soon as it's dependency for handling packet IO is done. This is under CCBC-298 and CCBC-301, among others.




[MB-10003] [Port-configurability] Non-root instances and multiple sudo instances in a box cannot be 'offline' upgraded Created: 24/Jan/14  Updated: 27/Mar/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Unix/Linux


 Description   
Scenario
------------
As of today, we do not support offline 'upgrade' per se for packages installed under non-root/sudo users. Upgrades are usually handled by package managers. Since these are absent for non-root users, and rpm cannot handle more than a single package upgrade (if there are many instances running), offline upgrades are not supported (confirmed with Bin).

ALL non-root installations will be affected by this limitation. Although a single instance running on a box under a sudo user can be offline upgraded, this cannot be extended to more than one such instance.

This is important

Workaround
-----------------
- Online upgrade (swap with nodes running latest build, take old nodes down and do clean install)
- Backup data and restore after fresh install (cbbackup and cbrestore)

Note: At this point, these are mere suggestions and neither of these workarounds has been tested yet.




[MB-10146] Document editor overwrites precision of long numbers Created: 06/Feb/14  Updated: 09/May/14

Status: Reopened
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Perry Krug Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged

 Description   
Just tested this out, not sure what diagnostics to capture so please let me know.

Simple test case:
-Create new document via document editor in UI
-Document contents are:
{"id": 18446744072866779556}
-As soon as you save, the above number is rewritten to:
{
  "id": 18446744072866780000
}
-The same effect occurs if you edit a document that was inserted with the above "long" number
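
The underlying limitation is that JavaScript stores every number as an IEEE-754 double, which can only represent integers exactly up to 2^53. A quick illustration in Python (whose float is the same double type), using the value from the test case above:

original = 18446744072866779556

print(original > 2**53)             # True: beyond the exact-integer range of a double
print(float(original) == original)  # False: the value is rounded to the nearest double
print(int(float(original)))         # the value a double actually stores; JavaScript
                                    # displays it as 18446744072866780000, as seen above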

 Comments   
Comment by Aaron Miller (Inactive) [ 06/Feb/14 ]
It's worth noting views will always suffer from this, as it is a limitation of JavaScript in general. Many JSON libraries have this behavior as well (even though they don't *have* to).
Comment by Aleksey Kondratenko [ 11/Apr/14 ]
Cannot fix it. Just closing. If you want to reopen, please pass it to somebody responsible for the overall design.
Comment by Perry Krug [ 11/Apr/14 ]
Reopening and assigning to docs, we need this to be release noted IMO.
Comment by Ruth Harris [ 14/Apr/14 ]
Reassigning to Anil. He makes the call on what we put in the release notes for known and fixed issues.
Comment by Anil Kumar [ 09/May/14 ]
Ruth - Let's release note this for 3.0.




[MB-11314] Enhanced Authentication model for Couchbase Server for Administrators, Users and Applications Created: 04/Jun/14  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Couchbase Server will add support for authentication using various techniques, for example Kerberos, LDAP, etc.







[MB-11282] Separate stats for internal memory allocation (application vs. data) Created: 02/Jun/14  Updated: 02/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Story Priority: Critical
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
AFAIK currently we track allocation for data and application together.

But sometimes application (memcached / ep-engine) overhead is huge and cannot be ignored.




[MB-11250] Go-Couchbase: Provide DML APIs using CAS Created: 29/May/14  Updated: 18/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-11247] Go-Couchbase: Use password to connect to SASL buckets Created: 29/May/14  Updated: 19/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Gerald Sangudi [ 19/Jun/14 ]
https://github.com/couchbaselabs/query/blob/master/docs/n1ql-authentication.md




[MB-11208] stats.org should be installed Created: 27/May/14  Updated: 27/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: techdebt-backlog
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
stats.org contains a description of the stats we're sending from ep-engine. It could be useful for people

 Comments   
Comment by Matt Ingenthron [ 27/May/14 ]
If it's "useful" shouldn't this be part of official documentation? I've often thought it should be. There's probably a duplicate here somewhere.

I also think the stats need stability labels applied as people may rely on stats when building their own integration/monitoring tools. COMMITTED, UNCOMMITTED, VOLATILE, etc. would be useful for the stats.

Relatedly, someone should document deprecation of TAP stats for 3.0.




[MB-11195] Support binary collation for views Created: 23/May/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
N1QL would benefit significantly if we could allow memcmp() collation for views it creates. So much so that we should consider this for a minor release after 3.0 so it can be available for N1QL beta.




[MB-11102] extended documentation about stats flowing out of CBSTATS and the correlation between them Created: 12/May/14  Updated: 12/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Update documentation about stats flowing out of CBSTATS and the correlation between them - Need this to be able to accurately predict capacity/other bottlenecks as well as detect trends.




[MB-11100] Ability to shutoff disk persistence for Couchbase bucket and still have replication, failover and other Couchbase bucket features Created: 12/May/14  Updated: 13/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by MB-8714 introduce vbucket based cache bucket ... Resolved

 Description   
Ability to shutoff disk persistence for Couchbase bucket and still have replication, failover and other Couchbase bucket features.




[MB-11101] supported go SDK for couchbase server Created: 12/May/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Matt Ingenthron
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
go client




[MB-11098] Ability to set block size written to storage for better alignment with SSDs and/or HDDs for better throughput performance Created: 12/May/14  Updated: 12/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Ability to set block size written to storage for better alignment with SSDs and/or HDDs for better throughput performance




[MB-10767] DOC: Misc - DITA conversion Created: 04/Apr/14  Updated: 04/Apr/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Ruth Harris Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10716] SSD IO throughput optimizations Created: 01/Apr/14  Updated: 01/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
ForestDB work to improve SSD IO throughput.




[MB-10662] _all_docs is no longer supported in 3.0 Created: 27/Mar/14  Updated: 01/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-10649 _all_docs view queries fails with err... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
As of 3.0, view engine will no longer support the special predefined view, _all_docs.

It was not a published feature, but as it has been around for a long time, it is possible it was actually utilized in some setups.

We should document that _all_docs queries will not work in 3.0.

 Comments   
Comment by Cihan Biyikoglu [ 27/Mar/14 ]
Thanks. Are there internal tools depending on this? Do you know if we have deprecated this in the past? I realize it isn't a supported API, but I want to make sure we keep the door open for feedback during the beta from large customers, etc.
Comment by Perry Krug [ 28/Mar/14 ]
We have a few (very few) customers who have used this. They've known it is unsupported...but that doesn't ever really stop anyone if it works for them.

Do we have a doc describing what the proposed replacement will look like and will that be available for 3.0?
Comment by Ruth Harris [ 01/May/14 ]
_all_docs is not mentioned anywhere in the 2.2+ documentation. Not sure how to handle this. It's not deprecated because it was never intended for use.
Comment by Perry Krug [ 01/May/14 ]
I think at the very least a prominent release note is appropriate.




[MB-10651] The guide for installing with user-defined ports doesn't work for the REST port change Created: 26/Mar/14  Updated: 17/Jun/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Larry Liu Assignee: Aruna Piravi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
http://docs.couchbase.com/couchbase-manual-2.5/cb-install/#install-user-defined-ports

I followed the instructions to change the admin port (REST port) by appending the following to the /opt/couchbase/etc/couchbase/static_config file:
{rest_port, 9000}.

[root@localhost bin]# netstat -an| grep 9000
[root@localhost bin]# netstat -an| grep :8091
tcp 0 0 0.0.0.0:8091 0.0.0.0:* LISTEN

logs:
https://s3.amazonaws.com/customers.couchbase.com/larry/output.zip

Larry



 Comments   
Comment by Larry Liu [ 26/Mar/14 ]
The log files show that the change was picked up by the server:

[ns_server:info,2014-03-26T19:13:24.063,nonode@nohost:<0.58.0>:ns_server:log_pending:30]Static config terms:
[{error_logger_mf_dir,"/opt/couchbase/var/lib/couchbase/logs"},
 {error_logger_mf_maxbytes,10485760},
 {error_logger_mf_maxfiles,20},
 {path_config_bindir,"/opt/couchbase/bin"},
 {path_config_etcdir,"/opt/couchbase/etc/couchbase"},
 {path_config_libdir,"/opt/couchbase/lib"},
 {path_config_datadir,"/opt/couchbase/var/lib/couchbase"},
 {path_config_tmpdir,"/opt/couchbase/var/lib/couchbase/tmp"},
 {nodefile,"/opt/couchbase/var/lib/couchbase/couchbase-server.node"},
 {loglevel_default,debug},
 {loglevel_couchdb,info},
 {loglevel_ns_server,debug},
 {loglevel_error_logger,debug},
 {loglevel_user,debug},
 {loglevel_menelaus,debug},
 {loglevel_ns_doctor,debug},
 {loglevel_stats,debug},
 {loglevel_rebalance,debug},
 {loglevel_cluster,debug},
 {loglevel_views,debug},
 {loglevel_mapreduce_errors,debug},
 {loglevel_xdcr,debug},
 {rest_port,9000}]
Comment by Aleksey Kondratenko [ 17/Apr/14 ]
This is because the rest_port entry in static_config is only taken into account on a fresh install.

There is a way to install our package without starting the server first, and that needs to be documented. I don't know who owns working with the docs people.
Comment by Anil Kumar [ 09/May/14 ]
Alk - Before it gets to documentation we need to test it and verify the instructions. Can you provide those instructions and assign this ticket to Aruna to test it.
Comment by Anil Kumar [ 03/Jun/14 ]
Alk - can you provide those instructions and assign this ticket to Aruna to test it.
Comment by Aleksey Kondratenko [ 04/Jun/14 ]
The instructions fail to mention that rest_port must be changed before config.dat is written, and config.dat is initialized on the first server start.

There is a way to install the server without starting it.

But here's what I managed to do:

# dpkg -i ~/Desktop/forReview/couchbase-server-enterprise_ubuntu_1204_x86_2.5.1-1086-rel.deb

# /etc/init.d/couchbase-server stop

# rm /opt/couchbase/var/lib/couchbase/config/config.dat

# emacs /opt/couchbase/etc/couchbase/static_config

# /etc/init.d/couchbase-server start

I.e. I stopped the service, removed config.dat, edited static_config, then started it back up and found the REST port to be updated.
Comment by Anil Kumar [ 04/Jun/14 ]
Thanks Alk. Assigning this to Aruna for verification and later please assign this ticket to Documentation (Ruth).




[MB-10531] No longer necessary to wait for persistence to issue stale=false query Created: 21/Mar/14  Updated: 25/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Sriram Melkote Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Matt pointed out that in the past we had to wait for an item to persist to disk before issuing a stale=false query to get correct results. In 3.0, this is no longer necessary: one can issue a stale=false view query at any time, and the results will reflect all changes made before the query was issued. This task is a placeholder to update the 3.0 docs to remove the now-unnecessary step of waiting for persistence.

 Comments   
Comment by Matt Ingenthron [ 21/Mar/14 ]
Correct. Thanks for making sure this is raised Siri. While I'm thinking of it, two points need to be in there:
1) if you have older code, you will need to change it to take advantage of the semantic change to the query
2) application developers still need to be a bit careful to ensure any modifications being done aren't async operations-- they'll have to wait for the responses before doing the stale=false query
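
To make the doc update concrete, here is a minimal sketch of the 3.0-style flow: perform the mutation, wait for its acknowledgement (not for persistence), then query with stale=false against the raw view REST API on port 8092. Host, bucket, design doc, and view names are placeholders:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // Precondition (Matt's point 2 above): the mutation has already been
    // acknowledged by the server -- no fire-and-forget async write in flight.
    base := "http://127.0.0.1:8092/default/_design/dev_users/_view/by_name"
    params := url.Values{}
    params.Set("stale", "false") // in 3.0 this alone is enough; no persistence wait
    params.Set("key", `"user::123"`)

    resp, err := http.Get(base + "?" + params.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status)
    fmt.Println(string(body))
}

The key difference from the pre-3.0 guidance is that no observe/persist step appears between the write and the query.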
Comment by Anil Kumar [ 25/Mar/14 ]
This is for 3.0 documentation.
Comment by Sriram Melkote [ 25/Mar/14 ]
Not an improvement. This is a task.




[MB-10511] Feature request for supporting rolling downgrades Created: 19/Mar/14  Updated: 11/Apr/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Abhishek Singh Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to

 Description   
Some customers are interested in Couchbase supporting rolling downgrades. Currently we can't add 2.2 nodes inside a cluster that has all nodes on 2.5.




[MB-10512] Update documentation to convey we don't support rolling downgrades Created: 19/Mar/14  Updated: 27/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Abhishek Singh Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Update documentation to convey we don't support rolling downgrades to 2.2 once all nodes are running on 2.5




[MB-10469] Support Couchbase Server on SuSE linux platform Created: 14/Mar/14  Updated: 17/Apr/14

Status: Open
Project: Couchbase Server
Component/s: build, installer
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: SuSE linux platform

Issue Links:
Duplicate

 Description   
Add support for SuSE Linux platform




[MB-10431] Removed ep_expiry_window stat/engine_parameter Created: 11/Mar/14  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Mike Wiederhold Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This parameter is no longer needed since we require everything to be persisted. In the past it was used to skip persistence on items that would be expiring very soon.




[MB-10432] Removed ep_max_txn_size stat/engine_parameter Created: 11/Mar/14  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Mike Wiederhold Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This value is no longer used in the server. Please note that you need to update the documentation for cbepctl, since this parameter could be set with that script.




[MB-10430] Add AWS AMI documentation to Installation and Upgrade Guide Created: 11/Mar/14  Updated: 25/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Brian Shumate Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
It would be useful to have some basic installation instructions for those who want to use the Couchbase Server Amazon Machine Images (AMIs) directly, without RightScale.

This particularly concerns the special case of the Administrator user and password, which can become a stumbling point for some users.


 Comments   
Comment by Anil Kumar [ 25/Mar/14 ]
Ruth - Please add reference to Couchbase on AWS Whitepaper - http://aws.typepad.com/aws/2013/08/running-couchbase-on-aws-new-white-paper.html that has all the information.




[MB-10379] index is not used for simple query Created: 06/Mar/14  Updated: 28/May/14  Due: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: 2.5.0
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Iryna Mironava
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 64-bit

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
I created a my_name index on the name field of bucket b0, and then a my_skill index on the skills field of b0:
cbq> select * from :system.indexes
{
    "resultset": [
        {
            "bucket_id": "b0",
            "id": "#alldocs",
            "index_key": [
                "META().id"
            ],
            "index_type": "view",
            "name": "#alldocs",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
        {
            "bucket_id": "b0",
            "id": "my_name",
            "index_key": [
                "name"
            ],
            "index_type": "view",
            "name": "my_name",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
       {
            "bucket_id": "b0",
            "id": "my_skill",
            "index_key": [
                "skills"
            ],
            "index_type": "view",
            "name": "my_skill",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
        {
            "bucket_id": "b1",
            "id": "#alldocs",
            "index_key": [
                "META().id"
            ],
            "index_type": "view",
            "name": "#alldocs",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
        {
            "bucket_id": "default",
            "id": "#alldocs",
            "index_key": [
                "META().id"
            ],
            "index_type": "view",
            "name": "#alldocs",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "4"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "1.185438ms"
        }
    ]
}

I can see my view in the UI and I can query it,
but EXPLAIN says the query is still using #alldocs:

cbq> explain select name from b0
{
    "resultset": [
        {
            "input": {
                "as": "b0",
                "bucket": "b0",
                "ids": null,
                "input": {
                    "as": "",
                    "bucket": "b0",
                    "cover": false,
                    "index": "#alldocs",
                    "pool": "default",
                    "ranges": null,
                    "type": "scan"
                },
                "pool": "default",
                "projection": null,
                "type": "fetch"
            },
            "result": [
                {
                    "as": "name",
                    "expr": {
                        "left": {
                            "path": "b0",
                            "type": "property"
                        },
                        "right": {
                            "path": "name",
                            "type": "property"
                        },
                        "type": "dot_member"
                    },
                    "star": false
                }
            ],
            "type": "projector"
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "1"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "1.236104ms"
        }
    ]
}
I see the same result for skills.


 Comments   
Comment by Sriram Melkote [ 07/Mar/14 ]
I think the current implementation considers secondary indexes only for filtering operations. When you do SELECT <anything> FROM <bucket>, it is a full bucket scan, and that is implemented by #alldocs and by #primary index only.

So the current behavior looks to be correct. Try running "CREATE PRIMARY INDEX USING VIEW" and please see if the query will then switch from #alldocs to #primary. Please also try adding a filter, like WHERE name > 'Mary' and see if the my_name index gets used for the filtering.

As a side note, what you're running is a covered query, where all the data necessary is held in a secondary index completely. However, this is not implemented. A secondary index is only used as an access path, and not as a source of data.
Comment by Gerald Sangudi [ 11/Mar/14 ]
This particular query will always use #primary or #alldocs. Even for documents without a "name" field, we return a result object that is missing the "name" field.

@Iryna, please test WHERE name IS NOT MISSING to see if it uses the index. If not, we'll fix that for DP4. Thanks.
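Putting the suggestions from the two comments above into concrete statements to try (the exact CREATE PRIMARY INDEX syntax in this developer preview may differ):

cbq> create primary index on b0 using view
cbq> explain select name from b0 where name > 'Mary'
cbq> explain select name from b0 where name is not missing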




[MB-10370] ep-engine deadlock in write-heavy DGM cases Created: 05/Mar/14  Updated: 06/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.2.0, 2.5.0, 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630
Memory = 64 GB
Disk = 2 x SSD

Attachments: JPEG File deadlock.jpeg    
Is this a Regression?: Yes

 Description   
This is not a new issue; we have discussed it many times.

In extremely write-heavy cases we overload the servers: memory usage reaches 95% of the bucket quota, we eject all replica items, and eventually the system becomes unusable.

I'm creating this ticket because of XDCR. In 3.0 we can achieve very high throughput of XDCR operations, and throttling it doesn't make sense. According to the PM team, some "users" deploy XDCR within the same data center, so this is quite a realistic scenario.

Feel free to close this ticket as a duplicate of existing bugs, though I didn't manage to find anything well-defined.

 Comments   
Comment by Cihan Biyikoglu [ 13/Mar/14 ]
I get that we should be able to recover from this, and I will open another issue to ensure XDCR also behaves as a good citizen when the destination is under stress.
Comment by Pavel Paulau [ 18/Mar/14 ]
It's really hard to achieve <1% resident ratio because of this issue.
Comment by Pavel Paulau [ 03/Apr/14 ]
Just spotted the same issue in Sync Gateway performance test.
Comment by Maria McDuff (Inactive) [ 08/Apr/14 ]
bumping up to Test Blocker per bug scrub.
Comment by Cihan Biyikoglu [ 08/Apr/14 ]
Hi Pavel, are you blocked? lets mark this a test blocker if so.
Comment by Li Yang [ 08/Apr/14 ]
This issue is blocking the Sync Gateway performance test. Even with a small workload of 5K users on one Sync Gateway, connecting to a two-node Couchbase cluster, the test eventually failed with the memory deadlock.
Comment by Chiyoung Seo [ 09/Apr/14 ]
Li,

I just discussed this issue with Pavel. This is not a new issue in the current value-only cache ejection policy. Let's discuss this issue more tomorrow with Pavel.

Thanks,
Comment by Chiyoung Seo [ 11/Apr/14 ]
The major reason for this issue was that all the SET operations were new insertions, which eventually inserted too many items and caused memory usage to reach the bucket memory quota, because we still maintain keys and metadata in cache for non-resident items. To address this architectural limitation of the value-only ejection policy, we added the full ejection feature in the 3.0 release, which ejects key, metadata, and value together from cache.

I recommend using full ejection in 3.0; otherwise, increase the cluster capacity based on the total number of items inserted and the desired resident ratio if you want to keep using the value ejection policy in 2.5.1.
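For readers hitting this, a rough sketch of switching an existing bucket to full ejection over the REST API. The endpoint and the evictionPolicy=fullEviction parameter are my recollection of the 3.0 API, and the bucket-edit call may also require the bucket's other settings (e.g. ramQuotaMB), so verify against the 3.0 documentation before relying on it:

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"
)

func main() {
    // Assumed endpoint and parameter; host, bucket name, and credentials are placeholders.
    form := url.Values{"evictionPolicy": {"fullEviction"}}
    req, err := http.NewRequest("POST",
        "http://127.0.0.1:8091/pools/default/buckets/default",
        strings.NewReader(form.Encode()))
    if err != nil {
        panic(err)
    }
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
    req.SetBasicAuth("Administrator", "password")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status) // the policy change typically requires a bucket restart to apply
}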
Comment by Chiyoung Seo [ 11/Apr/14 ]
As you said, we don't have a good way of recovering from this case once it happens. The workaround is to increase the bucket memory quota if there is any available memory on the existing nodes, and then add new nodes to the cluster in order to increase the cluster capacity.
Comment by Chiyoung Seo [ 06/Jun/14 ]
Moving it to post 3.0 as this issue is related to the current architectural limitation.




[MB-9883] High CPU utilization of SSL proxy (both source and dest) Created: 09/Jan/14  Updated: 17/Apr/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 2.5.0-1032

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630
Memory = 64 GB
Disk = 2 x SSD

Attachments: JPEG File cpu_agg_nossl.jpeg     JPEG File cpu_agg_ssl.jpeg    
Triage: Triaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/perf-dev/17/artifact/

 Description   
-- 4 -> 4, unidir, xmem, 1 bucket, moderate DGM
-- Initial replication

Based on manual observation, confirmed by your internal counters.

 Comments   
Comment by Wayne Siu [ 15/Jan/14 ]
Deferring from 2.5. Potential candidate for 2.5.1. ns_server team will make changes on top of 2.5.1.
Comment by Pavel Paulau [ 15/Jan/14 ]
Some benchmarks from our environment:

# openssl speed aes
Doing aes-128 cbc for 3s on 16 size blocks: 15847532 aes-128 cbc's in 2.99s
Doing aes-128 cbc for 3s on 64 size blocks: 4282124 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 256 size blocks: 1078408 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 274532 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 34227 aes-128 cbc's in 2.99s
Doing aes-192 cbc for 3s on 16 size blocks: 13432096 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 64 size blocks: 3576384 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 256 size blocks: 906793 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 1024 size blocks: 227850 aes-192 cbc's in 2.99s
Doing aes-192 cbc for 3s on 8192 size blocks: 28528 aes-192 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16 size blocks: 11684190 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 3014948 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 771234 aes-256 cbc's in 2.99s
Doing aes-256 cbc for 3s on 1024 size blocks: 190996 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 24076 aes-256 cbc's in 3.00s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Tue Dec 3 20:18:14 UTC 2013
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -m64 -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Wa,--noexecstack -DPURIFY -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128 cbc 84802.85k 91351.98k 92024.15k 93706.92k 93775.11k
aes-192 cbc 71637.85k 76296.19k 77379.67k 78032.91k 77900.46k
aes-256 cbc 62315.68k 64318.89k 66032.07k 65193.30k 65743.53k

This is what one core demonstrates. Notice that during the test single node was able to serve as max as 8K documents/sec (~2KB), utilizing on average 3-4 cores.
Comment by Aleksey Kondratenko [ 15/Jan/14 ]
BTW, one core can get much higher than that when AES-NI hardware support is enabled. But those benchmarks are not using it by default; you need to pass -evp, as noted for example here: http://stackoverflow.com/questions/19307909/how-do-i-enable-aes-ni-hardware-acceleration-for-node-js-crypto-on-linux

With AES-NI enabled I've seen my box show more than one _billion_ bytes per second for 128-bit AES! So there's definitely large potential here. And I bet Erlang is unable to utilize it; perf showed us that Erlang did not use the AES-NI-enabled versions of AES in OpenSSL, at least on my box.
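For reference, the AES-NI-enabled (EVP) path of the same benchmark can be exercised with:

# openssl speed -evp aes-128-cbc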
Comment by Pavel Paulau [ 20/Jan/14 ]
Just for quantification: in bidir scenario with encryption we should easily expect 3x higher CPU utilization on both source and destination sides. Apparently absolute numbers depend on rate of replication.
Comment by Wayne Siu [ 21/Jan/14 ]
Alk will provide a build to Pavel to test. Will review the results in the next meeting.
Wanted to check
a. how much CPU has improved.
b. if there is any added latency.
Comment by Aleksey Kondratenko [ 21/Jan/14 ]
>> b. if there is any added latency.

and c. if there's any throughput change.

Also if possible I'd like to see results with/without rc4 and GSO.
Comment by Pavel Paulau [ 22/Jan/14 ]
a. 1.5-2x lower CPU utilization than in build 2.5.0-1054.
b. No extra latency.
c. No change.

GSO/TSO results will be reported in MB-9896.
Comment by Wayne Siu [ 23/Jan/14 ]
Lowering the priority to Critical as the fix has helped/improved the CPU utilization in build 1054.
Keep this ticket open for further optimization in 3.0.
Comment by Cihan Biyikoglu [ 11/Mar/14 ]
I think we need to be below 5-10% SSL overhead. We should look for ways to ensure this is a feature we can recommend in general.
Comment by Aleksey Kondratenko [ 17/Apr/14 ]
Moved out of 3.0




[MB-9446] there's chance of starting janitor while not having latest version of config (was: On reboot entire cluster , see many conflicting bucket config changes frequently.) Created: 30/Oct/13  Updated: 04/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.0, 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ketaki Gangal Assignee: Aliaksey Artamonau
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: build 0.0.0-7040toy

Triage: Triaged
Is this a Regression?: Yes

 Description   

Load items on a cluster, build toy-000-704.
Reboot the cluster.

Post reboot, we see a lot of messages about conflicting bucket config in the web logs.

Cluster logs here: https://s3.amazonaws.com/bugdb/bug_9445/9435.tar

Sample

{fastForwardMap,undefined}]}]}]}, choosing the former, which looks newer.
ns_config003 ns_1@soursop-s11207.sc.couchbase.com 18:59:30 - Wed Oct 30, 2013
Conflicting configuration changes to field buckets:
{[{'ns_1@172.23.105.45',{5088,63550403967}},
{'ns_1@soursop-s11203.sc.couchbase.com',{1,63550403967}},
{'ns_1@soursop-s11204.sc.couchbase.com',{1764,63550403283}}],
[{'_vclock',[{'ns_1@172.23.105.45',{5088,63550403967}},
{'ns_1@soursop-s11203.sc.couchbase.com',{1,63550403967}},
{'ns_1@soursop-s11204.sc.couchbase.com',{1764,63550403283}}]},
{configs,[{"saslbucket",
[{uuid,<<"b51edfdad356db7e301d9b32c6ef47a3">>},
{num_replicas,1},
{replica_index,false},
{ram_quota,3355443200},
{auth_type,sasl},
{sasl_password,"password"},
{autocompaction,false},
{purge_interval,undefined},
{flush_enabled,false},
{num_threads,3},
{type,membase},
{num_vbuckets,1024},
{servers,['ns_1@soursop-s11203.sc.couchbase.com',
'ns_1@soursop-s11204.sc.couchbase.com',
'ns_1@soursop-s11205.sc.couchbase.com',
'ns_1@soursop-s11207.sc.couchbase.com']},
{map,[['ns_1@soursop-s11207.sc.couchbase.com',
'ns_1@soursop-s11205.sc.couchbase.com'],
['ns_1@soursop-s11207.sc.couchbase.com',
'ns_1@soursop-s11203.sc.couchbase.com'],
['ns_1@soursop-s11207.sc.couchbase.com',
'ns_1@soursop-s11204.sc.couchbase.com'],

 Comments   
Comment by Aleksey Kondratenko [ 30/Oct/13 ]
Very weird. But if this is indeed an issue, there's likely exactly the same issue on 2.5.0. And if that's the case, it looks pretty scary.
Comment by Aliaksey Artamonau [ 01/Nov/13 ]
I set the affected version to 2.5 because I know for certain that it affects 2.5, and actually many preceding releases.
Comment by Maria McDuff (Inactive) [ 31/Jan/14 ]
Alk,

Is this already merged in 2.5? Please confirm and mark as resolved if that's the case, and assign back to QE.
Thanks.
Comment by Aliaksey Artamonau [ 31/Jan/14 ]
No, it's not fixed in 2.5.
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - 06/04/2014 Alk, Wayne, Parag, Anil




[MB-9356] tuq crash during query + rebalance having 1M items Created: 16/Oct/13  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 64 bit

Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Initial setup:
2 buckets, 1M items each of them, 1 node

Steps:
1) start a query using curl
2) add a node and start rebalance
3) start same query using tuq_client console.


[root@localhost tuqtng]# ./tuqtng -couchbase http://localhost:8091
07:19:57.406786 tuqtng started...
07:19:57.406880 version: 0.0.0
07:19:57.406887 site: http://localhost:8091
panic: runtime error: index out of range

goroutine 283323 [running]:
github.com/couchbaselabs/go-couchbase.func·001(0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:151 +0x4f1
github.com/couchbaselabs/go-couchbase.(*Bucket).doBulkGet(0xc2000e8480, 0xc200c101fc, 0xc20043af70, 0x1, 0x1, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:188 +0x150
github.com/couchbaselabs/go-couchbase.func·002()
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:212 +0x115
created by github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:218 +0x1ef

goroutine 1 [chan receive]:
github.com/couchbaselabs/tuqtng/server.Server(0x8464a0, 0x5, 0x7fff3931ab76, 0x15, 0x8554a0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:66 +0x4f4
main.main()
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/main.go:71 +0x28a

goroutine 2 [syscall]:

goroutine 4 [syscall]:
os/signal.loop()
/usr/local/go/src/pkg/os/signal/signal_unix.go:21 +0x1c
created by os/signal.init·1
/usr/local/go/src/pkg/os/signal/signal_unix.go:27 +0x2f

goroutine 13 [chan send]:
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).SendResult(0xc2001cb930, 0x70d420, 0xc200a27880)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:47 +0x46
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).processItem(0xc2001aa080, 0xc2001c9a80, 0xc2001c99c0, 0xc200a27080, 0x2b5e52485d01, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:119 +0x119
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).executeInternal(0xc2001aa080, 0xc2001d7e10, 0xc2001c9a80, 0xc2001c99c0, 0xc2001e2720, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:90 +0x2a7
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).Execute(0xc2001aa080, 0xc2001d7e10, 0xc2001c9a80, 0xc2001c99c0, 0xc2000004f8, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:42 +0x100
github.com/couchbaselabs/tuqtng/server.Dispatch(0xc2001c9a80, 0xc2001c99c0, 0xc2001c1b10, 0xc2001ab000, 0xc2001c1b40, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:85 +0x191
created by github.com/couchbaselabs/tuqtng/server.Server
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:67 +0x59c

goroutine 6 [chan receive]:
main.dumpOnSignal(0x2b5e52484fa0, 0x1, 0x1)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/main.go:80 +0x7f
main.dumpOnSignalForPlatform()
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/dump.go:19 +0x80
created by main.main
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/main.go:62 +0x1d7

goroutine 7 [IO wait]:
net.runtime_pollWait(0x2aaaaabacf00, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001242c0, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).accept(0xc200124240, 0x90ae00, 0x0, 0xc200198660, 0xb, ...)
/usr/local/go/src/pkg/net/fd_unix.go:385 +0x2c1
net.(*TCPListener).AcceptTCP(0xc2000005f8, 0x4443f6, 0x2b5e52483e28, 0x4443f6)
/usr/local/go/src/pkg/net/tcpsock_posix.go:229 +0x45
net.(*TCPListener).Accept(0xc2000005f8, 0xc200125420, 0xc2001ac2b0, 0xc2001f46c0, 0x0, ...)
/usr/local/go/src/pkg/net/tcpsock_posix.go:239 +0x25
net/http.(*Server).Serve(0xc200107a50, 0xc2001488c0, 0xc2000005f8, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/server.go:1542 +0x85
net/http.(*Server).ListenAndServe(0xc200107a50, 0xc200107a50, 0xc2001985d0)
/usr/local/go/src/pkg/net/http/server.go:1532 +0x9e
net/http.ListenAndServe(0x846860, 0x5, 0xc2001985d0, 0xc200107960, 0x0, ...)
/usr/local/go/src/pkg/net/http/server.go:1597 +0x65
github.com/couchbaselabs/tuqtng/network/http.func·001()
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http.go:37 +0x6c
created by github.com/couchbaselabs/tuqtng/network/http.NewHttpEndpoint
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http.go:41 +0x2a0

goroutine 31 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).SendItem(0xc2001e29c0, 0xc200eb0a40, 0x85b2c0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:151 +0xbe
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).scanRange(0xc2001e29c0, 0x0, 0x8a7970)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:121 +0x640
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).Run(0xc2001e29c0, 0xc2001e2960)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:61 +0xdc
created by github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:97 +0xe3

goroutine 11 [IO wait]:
net.runtime_pollWait(0x2aaaaabace60, 0x77, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitWrite(0xc200124620, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:80 +0x31
net.(*netFD).Write(0xc2001245a0, 0xc2001af000, 0x44, 0x1000, 0x4, ...)
/usr/local/go/src/pkg/net/fd_unix.go:294 +0x3e6
net.(*conn).Write(0xc2000008f0, 0xc2001af000, 0x44, 0x1000, 0x452dd2, ...)
/usr/local/go/src/pkg/net/net.go:131 +0xc3
net/http.(*switchWriter).Write(0xc2001ad040, 0xc2001af000, 0x44, 0x1000, 0x4d5989, ...)
/usr/local/go/src/pkg/net/http/chunked.go:0 +0x62
bufio.(*Writer).Flush(0xc200148f40, 0xc20092e6b4, 0x34)
/usr/local/go/src/pkg/bufio/bufio.go:465 +0xb9
net/http.(*chunkWriter).flush(0xc2001cb8e0)
/usr/local/go/src/pkg/net/http/server.go:270 +0x59
net/http.(*response).Flush(0xc2001cb8c0)
/usr/local/go/src/pkg/net/http/server.go:953 +0x5d
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).ProcessResults(0xc2001cb930, 0x2, 0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:109 +0x16b
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).Process(0xc2001cb930, 0x40519c, 0x71b260)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:61 +0x52
github.com/couchbaselabs/tuqtng/network/http.(*HttpQuery).Process(0xc2001c99c0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_query.go:72 +0x29
github.com/couchbaselabs/tuqtng/network/http.(*HttpEndpoint).ServeHTTP(0xc200000508, 0xc2001b1140, 0xc2001cb8c0, 0xc2001cd000)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http.go:55 +0xcd
github.com/gorilla/mux.(*Router).ServeHTTP(0xc200107960, 0xc2001b1140, 0xc2001cb8c0, 0xc2001cd000)
/tmp/gocode/src/github.com/gorilla/mux/mux.go:90 +0x1e1
net/http.serverHandler.ServeHTTP(0xc200107a50, 0xc2001b1140, 0xc2001cb8c0, 0xc2001cd000)
/usr/local/go/src/pkg/net/http/server.go:1517 +0x16c
net/http.(*conn).serve(0xc200124630)
/usr/local/go/src/pkg/net/http/server.go:1096 +0x765
created by net/http.(*Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1564 +0x266

goroutine 21 [chan receive]:
github.com/couchbaselabs/tuqtng/catalog/couchbase.keepPoolFresh(0xc2001fa200)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/couchbase.go:157 +0x4b
created by github.com/couchbaselabs/tuqtng/catalog/couchbase.newPool
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/couchbase.go:149 +0x34c

goroutine 30 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).SendItem(0xc2001c1660, 0xc200a27940, 0xc200a27940)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:49 +0xbf
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).flushBatch(0xc2001e1c40, 0xc2001e2a00)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:141 +0x7cd
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).processItem(0xc2001e1c40, 0xc200390c00, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:78 +0xd9
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator(0xc2001c1660, 0xc2002015f0, 0xc2001e1c40, 0xc2001e2840)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:107 +0x1b0
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).Run(0xc2001e1c40, 0xc2001e2840)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:57 +0xa8
created by github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:97 +0xe3

goroutine 32 [chan send]:
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanRange(0xc200201500, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:179 +0x386
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanEntries(0xc200201500, 0x0, 0xc2001e2b40, 0xc2001e2ba0, 0xc2001250c0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:112 +0x78
created by github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).scanRange
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:82 +0x18b

goroutine 29 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).SendItem(0xc2001c1600, 0xc200a27140, 0x87ea70)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:49 +0xbf
github.com/couchbaselabs/tuqtng/xpipeline.(*Project).processItem(0xc2001c1630, 0xc200a27140, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/project.go:95 +0x33b
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator(0xc2001c1600, 0xc2002015a0, 0xc2001c1630, 0xc2001e2ae0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:107 +0x1b0
github.com/couchbaselabs/tuqtng/xpipeline.(*Project).Run(0xc2001c1630, 0xc2001e2ae0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/project.go:46 +0x91
created by github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).executeInternal
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:79 +0x1c7

goroutine 33 [chan send]:
github.com/couchbaselabs/tuqtng/catalog/couchbase.WalkViewInBatches(0xc2001251e0, 0xc2001252a0, 0xc2000e8480, 0x845260, 0x0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_util.go:90 +0x424
created by github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanRange
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:159 +0x209

goroutine 165124 [select]:
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).executeInternal(0xc2001aa080, 0xc2005d6340, 0xc2001c9a80, 0xc200282580, 0xc2005bba80, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:87 +0x667
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).Execute(0xc2001aa080, 0xc2005d6340, 0xc2001c9a80, 0xc200282580, 0xc2000004f8, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:42 +0x100
github.com/couchbaselabs/tuqtng/server.Dispatch(0xc2001c9a80, 0xc200282580, 0xc2001c1b10, 0xc2001ab000, 0xc2001c1b40, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:85 +0x191
created by github.com/couchbaselabs/tuqtng/server.Server
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:67 +0x59c

goroutine 283325 [chan receive]:
github.com/dustin/gomemcached/client.(*Client).GetBulk(0xc200d55990, 0xc200d50092, 0xc2001d7d60, 0x1, 0x1, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:228 +0x3c3
github.com/couchbaselabs/go-couchbase.func·001(0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:158 +0x1dc
github.com/couchbaselabs/go-couchbase.(*Bucket).doBulkGet(0xc2000e8480, 0xc200c10092, 0xc2001d7d60, 0x1, 0x1, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:188 +0x150
github.com/couchbaselabs/go-couchbase.func·002()
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:212 +0x115
created by github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:218 +0x1ef

goroutine 283623 [runnable]:
net.runtime_pollWait(0x2aaaaabac820, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001f46b0, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2001f4630, 0xc20063fda0, 0x18, 0x18, 0x0, ...)
/usr/local/go/src/pkg/net/fd_unix.go:195 +0x2b3
net.(*conn).Read(0xc2002f2658, 0xc20063fda0, 0x18, 0x18, 0x1, ...)
/usr/local/go/src/pkg/net/net.go:123 +0xc3
io.ReadAtLeast(0xc200198840, 0xc2002f2658, 0xc20063fda0, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:284 +0xf7
io.ReadFull(0xc200198840, 0xc2002f2658, 0xc20063fda0, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:302 +0x6f
github.com/dustin/gomemcached.(*MCResponse).Receive(0xc2002b4d80, 0xc200198840, 0xc2002f2658, 0xc20063fda0, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/mc_res.go:155 +0xc7
github.com/dustin/gomemcached/client.getResponse(0xc200198840, 0xc2002f2658, 0xc20063fda0, 0x18, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/transport.go:30 +0xc6
github.com/dustin/gomemcached/client.(*Client).Receive(0xc200b13f30, 0xc2002b4c00, 0x0, 0x0)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:81 +0x67
github.com/dustin/gomemcached/client.func·003()
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:193 +0xaf
created by github.com/dustin/gomemcached/client.(*Client).GetBulk
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:207 +0x1e6

goroutine 165127 [chan receive]:
github.com/couchbaselabs/go-couchbase.(*Bucket).GetBulk(0xc2000e8480, 0xc200977000, 0x3e8, 0x3e8, 0xc200000001, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:278 +0x341
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*bucket).BulkFetch(0xc2001c11e0, 0xc200977000, 0x3e8, 0x3e8, 0x15, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/couchbase.go:249 +0x83
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).flushBatch(0xc2006e8230, 0xc2005bbd00)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:113 +0x35d
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).processItem(0xc2006e8230, 0xc200c19840, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:78 +0xd9
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator(0xc200257300, 0xc2002015f0, 0xc2006e8230, 0xc2005bbba0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:107 +0x1b0
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).Run(0xc2006e8230, 0xc2005bbba0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:57 +0xa8
created by github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:97 +0xe3

goroutine 165126 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator(0xc2002572a0, 0xc2002015a0, 0xc2002572d0, 0xc2005bbe40)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:104 +0x32c
github.com/couchbaselabs/tuqtng/xpipeline.(*Project).Run(0xc2002572d0, 0xc2005bbe40)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/project.go:46 +0x91
created by github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).executeInternal
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:79 +0x1c7

goroutine 165123 [chan receive]:
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).ProcessResults(0xc2006e80e0, 0x2, 0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:88 +0x3c
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).Process(0xc2006e80e0, 0x40519c, 0x71b260)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:61 +0x52
github.com/couchbaselabs/tuqtng/network/http.(*HttpQuery).Process(0xc200282580)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_query.go:72 +0x29
github.com/couchbaselabs/tuqtng/network/http.(*HttpEndpoint).ServeHTTP(0xc200000508, 0xc2001b1140, 0xc2006e8070, 0xc2001cdea0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http.go:55 +0xcd
github.com/gorilla/mux.(*Router).ServeHTTP(0xc200107960, 0xc2001b1140, 0xc2006e8070, 0xc2001cdea0)
/tmp/gocode/src/github.com/gorilla/mux/mux.go:90 +0x1e1
net/http.serverHandler.ServeHTTP(0xc200107a50, 0xc2001b1140, 0xc2006e8070, 0xc2001cdea0)
/usr/local/go/src/pkg/net/http/server.go:1517 +0x16c
net/http.(*conn).serve(0xc2001f46c0)
/usr/local/go/src/pkg/net/http/server.go:1096 +0x765
created by net/http.(*Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1564 +0x266

goroutine 165129 [select]:
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanRange(0xc200201500, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:166 +0x6b9
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanEntries(0xc200201500, 0x0, 0xc2005bbea0, 0xc2005bbf00, 0xc2005bbf60, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:112 +0x78
created by github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).scanRange
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:82 +0x18b

goroutine 165128 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).scanRange(0xc2005bbd20, 0x0, 0x8a7970)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:99 +0x7a2
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).Run(0xc2005bbd20, 0xc2005bbcc0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:61 +0xdc
created by github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:97 +0xe3

goroutine 165130 [select]:
net/http.(*persistConn).roundTrip(0xc200372f00, 0xc200765660, 0xc200372f00, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/transport.go:857 +0x6c7
net/http.(*Transport).RoundTrip(0xc20012e080, 0xc2004dac30, 0xc2005a5808, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/transport.go:186 +0x396
net/http.send(0xc2004dac30, 0xc2000e7e70, 0xc20012e080, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/client.go:166 +0x3a1
net/http.(*Client).send(0xb7fcc0, 0xc2004dac30, 0x7c, 0x2b5e52429020, 0xc200fca2c0, ...)
/usr/local/go/src/pkg/net/http/client.go:100 +0xcd
net/http.(*Client).doFollowingRedirects(0xb7fcc0, 0xc2004dac30, 0x90ae80, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/client.go:282 +0x5ff
net/http.(*Client).Do(0xb7fcc0, 0xc2004dac30, 0xc20052cae0, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/client.go:129 +0x8d
github.com/couchbaselabs/go-couchbase.(*Bucket).ViewCustom(0xc2000e8480, 0x845260, 0x0, 0x873ab0, 0x9, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/views.go:115 +0x210
github.com/couchbaselabs/go-couchbase.(*Bucket).View(0xc2000e8480, 0x845260, 0x0, 0x873ab0, 0x9, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/views.go:155 +0xcc
github.com/couchbaselabs/tuqtng/catalog/couchbase.WalkViewInBatches(0xc20045b000, 0xc20045b060, 0xc2000e8480, 0x845260, 0x0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_util.go:80 +0x2ce
created by github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanRange
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:159 +0x209

goroutine 283324 [chan receive]:
github.com/dustin/gomemcached/client.(*Client).GetBulk(0xc200b13f30, 0xc200b1001b, 0xc200fce200, 0x2, 0x2, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:228 +0x3c3
github.com/couchbaselabs/go-couchbase.func·001(0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:158 +0x1dc
github.com/couchbaselabs/go-couchbase.(*Bucket).doBulkGet(0xc2000e8480, 0xc200c1001b, 0xc200fce200, 0x2, 0x2, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:188 +0x150
github.com/couchbaselabs/go-couchbase.func·002()
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:212 +0x115
created by github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:218 +0x1ef

goroutine 283622 [runnable]:
net.runtime_pollWait(0x2aaaaabacb40, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001f4e00, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2001f4d80, 0xc200eda4e0, 0x18, 0x18, 0x0, ...)
/usr/local/go/src/pkg/net/fd_unix.go:195 +0x2b3
net.(*conn).Read(0xc20080b948, 0xc200eda4e0, 0x18, 0x18, 0x1, ...)
/usr/local/go/src/pkg/net/net.go:123 +0xc3
io.ReadAtLeast(0xc200198840, 0xc20080b948, 0xc200eda4e0, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:284 +0xf7
io.ReadFull(0xc200198840, 0xc20080b948, 0xc200eda4e0, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:302 +0x6f
github.com/dustin/gomemcached.(*MCResponse).Receive(0xc2002b4ba0, 0xc200198840, 0xc20080b948, 0xc200eda4e0, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/mc_res.go:155 +0xc7
github.com/dustin/gomemcached/client.getResponse(0xc200198840, 0xc20080b948, 0xc200eda4e0, 0x18, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/transport.go:30 +0xc6
github.com/dustin/gomemcached/client.(*Client).Receive(0xc2004bfa20, 0xc2002b4a20, 0x0, 0x0)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:81 +0x67
github.com/dustin/gomemcached/client.func·003()
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:193 +0xaf
created by github.com/dustin/gomemcached/client.(*Client).GetBulk
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:207 +0x1e6

goroutine 283331 [select]:
net/http.(*persistConn).writeLoop(0xc200372f00)
/usr/local/go/src/pkg/net/http/transport.go:774 +0x26f
created by net/http.(*Transport).dialConn
/usr/local/go/src/pkg/net/http/transport.go:512 +0x58b

goroutine 283321 [chan receive]:
github.com/couchbaselabs/go-couchbase.errorCollector(0xc20101a900, 0xc200372c80)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:246 +0x9f
created by github.com/couchbaselabs/go-couchbase.(*Bucket).GetBulk
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:275 +0x2f2

goroutine 283544 [select]:
net/http.(*persistConn).writeLoop(0xc2006cbd80)
/usr/local/go/src/pkg/net/http/transport.go:774 +0x26f
created by net/http.(*Transport).dialConn
/usr/local/go/src/pkg/net/http/transport.go:512 +0x58b

goroutine 283322 [chan receive]:
github.com/dustin/gomemcached/client.(*Client).GetBulk(0xc2004bfa20, 0xc2004b01ec, 0xc200e8cd60, 0x2, 0x2, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:228 +0x3c3
github.com/couchbaselabs/go-couchbase.func·001(0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:158 +0x1dc
github.com/couchbaselabs/go-couchbase.(*Bucket).doBulkGet(0xc2000e8480, 0xc200c101ec, 0xc200e8cd60, 0x2, 0x2, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:188 +0x150
github.com/couchbaselabs/go-couchbase.func·002()
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:212 +0x115
created by github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:218 +0x1ef

goroutine 283621 [runnable]:
net.runtime_pollWait(0x2aaaaabac960, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001f4500, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2001f4480, 0xc200a83940, 0x18, 0x18, 0x0, ...)
/usr/local/go/src/pkg/net/fd_unix.go:195 +0x2b3
net.(*conn).Read(0xc20084a5d8, 0xc200a83940, 0x18, 0x18, 0x1, ...)
/usr/local/go/src/pkg/net/net.go:123 +0xc3
io.ReadAtLeast(0xc200198840, 0xc20084a5d8, 0xc200a83940, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:284 +0xf7
io.ReadFull(0xc200198840, 0xc20084a5d8, 0xc200a83940, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:302 +0x6f
github.com/dustin/gomemcached.(*MCResponse).Receive(0xc2002b49c0, 0xc200198840, 0xc20084a5d8, 0xc200a83940, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/mc_res.go:155 +0xc7
github.com/dustin/gomemcached/client.getResponse(0xc200198840, 0xc20084a5d8, 0xc200a83940, 0x18, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/transport.go:30 +0xc6
github.com/dustin/gomemcached/client.(*Client).Receive(0xc200d55990, 0xc2002b4720, 0x0, 0x0)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:81 +0x67
github.com/dustin/gomemcached/client.func·003()
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:193 +0xaf
created by github.com/dustin/gomemcached/client.(*Client).GetBulk
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:207 +0x1e6

goroutine 283543 [runnable]:
net/http.(*persistConn).readLoop(0xc2006cbd80)
/usr/local/go/src/pkg/net/http/transport.go:761 +0x64b
created by net/http.(*Transport).dialConn
/usr/local/go/src/pkg/net/http/transport.go:511 +0x574

goroutine 283320 [chan send]:
github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet(0xc2000e8480, 0xc200c19980, 0xc20101a8a0, 0xc20101a900)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:222 +0x26b
created by github.com/couchbaselabs/go-couchbase.(*Bucket).GetBulk
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:273 +0x2d1

goroutine 283330 [IO wait]:
net.runtime_pollWait(0x2aaaaabacdc0, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001f47d0, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2001f4750, 0xc20097b000, 0x1000, 0x1000, 0x0, ...)
/usr/local/go/src/pkg/net/fd_unix.go:195 +0x2b3
net.(*conn).Read(0xc20084a498, 0xc20097b000, 0x1000, 0x1000, 0x8, ...)
/usr/local/go/src/pkg/net/net.go:123 +0xc3
bufio.(*Reader).fill(0xc200b83180)
/usr/local/go/src/pkg/bufio/bufio.go:79 +0x10c
bufio.(*Reader).Peek(0xc200b83180, 0x1, 0xc200198840, 0x0, 0xc200eda4e0, ...)
/usr/local/go/src/pkg/bufio/bufio.go:107 +0xc9
net/http.(*persistConn).readLoop(0xc200372f00)
/usr/local/go/src/pkg/net/http/transport.go:670 +0xc4
created by net/http.(*Transport).dialConn
/usr/local/go/src/pkg/net/http/transport.go:511 +0x574
[root@localhost tuqtng]#


 Comments   
Comment by Marty Schoch [ 16/Oct/13 ]
Looks like a bug in go-couchbase. I have filed an issue there:

http://cbugg.hq.couchbase.com/bug/bug-906
Comment by Ketaki Gangal [ 16/Oct/13 ]
We can easily hit this on any rebalance; bumping this up to Critical.




[MB-9321] Get us off erlang's global facility and re-elect failed master quickly and safely Created: 10/Oct/13  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Duplicate
is duplicated by MB-9691 rebalance repeated failed when add no... Closed
Relates to
relates to MB-9691 rebalance repeated failed when add no... Closed
Triage: Triaged
Is this a Regression?: No

 Description   
We have a number of bugs due to Erlang's global facility, or the related issue of not being able to spawn a new master quickly, e.g.:

* MB-7282 (erlang's global naming facility apparently drops globally registered service with actual service still alive (was: impossible to change settings/autoFailover after rebalance))

* MB-7168 [Doc'd 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)

* MB-8682 start rebalance request is hanging sometimes (looks like another global facility issue)

* MB-5622 Crash of master node may lead to autofailover in 2 minutes instead of configured shorter autofailover period or similarly slow manual failover

By getting us off global, we will fix all of these issues.


 Comments   
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
This also includes making sure autofailover takes into account the time it takes for master election in case of a master crash.

Current thinking is that every node will run the autofailover service, but it will only be active on the master node. And we can have special code that speeds up master re-election if we detect that the master node is down.
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
Note that currently mb_master is the first thing that suffers when a timeout-heavy situation starts.

So we should look at making mb_master more robust if necessary.
Comment by Aleksey Kondratenko [ 17/Oct/13 ]
I'm _really_ curious who made the decision to move this into 2.5.0, why, and why they think we have the bandwidth to handle it.
Comment by Aleksey Kondratenko [ 09/Dec/13 ]
Workaround diag/eval snippet:

rpc:call(mb_master:master_node(), erlang, apply ,[fun () -> erlang:exit(erlang:whereis(mb_master), kill) end, []]).

Detection snippet:

F = (fun (Name) ->
         {Oks, NotOks} = rpc:multicall(ns_node_disco:nodes_actual(),
                                       global, whereis_name, [Name], 60000),
         case {lists:usort(Oks), NotOks} of
             {_, [_|_]} -> {failed_rpc, NotOks};
             {[_], _} -> ok;
             {L, _} -> {different_answers, L}
         end
     end),
[(catch {N, ok} = {N, F(N)}) || N <- [ns_orchestrator, ns_tick, auto_failover]].

The detection snippet should return:

 [{ns_orchestrator,ok},{ns_tick,ok},{auto_failover,ok}]

If not, there's a decent chance that we're hitting this issue.
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
As part of that we'll likely have to revamp autofailover. And John Liang suggested a nice idea: have the single ejected node disable memcached traffic on itself to signal to smart clients that something notable occurred.

On Fri, Dec 13, 2013 at 11:48 AM, John Liang <john.liang@couchbase.com> wrote:
>> I.e. consider a client that only updates its vbucket map if it receives "not my vbucket". And consider a 3-node cluster where one node is partitioned off from the other nodes but is still accessible from the client. Let's name this node C, and imagine that the remaining two nodes failed over that node. It can be seen that the client will happily continue using the old vbucket map and reading/writing to/from node C, because it'll never get a single "not my vbucket" reply.

Thanks Alk. In this case, is there a reason not to change the vbucket state on the singly-partitioned node on auto-failover? There would still be a window for "data loss", but this window should be much smaller.

Yes we can do it. Good idea.
Comment by Perry Krug [ 13/Dec/13 ]
But if that node (C) is partitioned off...how will we be able to tell it to set those vbucket states? IMO, wouldn't it be better for the clients to implement a true quorum approach to the map when they detect that something isn't right? Or am I being too naive and missing something?
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
It's entirely possible that I misunderstood the original text, but I understand it as follows:

* when autofailover is enabled, every node observes whether it is alone. If a node finds itself alone and the usual autofailover threshold passes, it can be somewhat sure that it was automatically failed over by the rest of the cluster

* when that happens, the node can either turn all vbuckets into replicas or disable traffic (similarly to what we're doing during flush).

There is of course a chance that all the other nodes have truly failed and that single node is all that's left. But it can be argued that in this case the amount of data loss is big enough anyway, and one node that artificially disables traffic doesn't change things much.

Regarding "quorum on clients": I've seen one proposal for that, and I don't think it's a good idea. Being in the majority and being right are almost completely independent things. We can do far better than that. Particularly with CCCP we have a rev field that gives sufficient ordering between bucket configurations.
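A minimal sketch of that observation loop, written in Python rather than ns_server's Erlang and with made-up helper names (can_reach_peer, disable_traffic, enable_traffic are assumptions, not real APIs):

import time

AUTO_FAILOVER_THRESHOLD = 30  # seconds; hypothetical, mirrors the cluster's autofailover setting

def watch_isolation(peers, can_reach_peer, disable_traffic, enable_traffic):
    """A node that stays isolated past the autofailover threshold assumes it
    may already have been failed over by the rest of the cluster and disables
    its memcached traffic; it re-enables traffic as soon as any peer is back."""
    alone_since = None
    traffic_disabled = False
    while True:
        alone = not any(can_reach_peer(p) for p in peers)
        now = time.monotonic()
        if alone:
            alone_since = alone_since or now
            if not traffic_disabled and now - alone_since >= AUTO_FAILOVER_THRESHOLD:
                disable_traffic()   # or: flip all vbuckets to replica state
                traffic_disabled = True
        else:
            alone_since = None
            if traffic_disabled:
                enable_traffic()    # rest of the cluster is reachable again
                traffic_disabled = False
        time.sleep(1)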
Comment by Perry Krug [ 13/Dec/13 ]
My concern is that our recent observations of false-positive autofailovers may lead lots of individual nodes to decide that they have been isolated and disable their traffic...whether they've been automatically failed over or not.

As you know, one of the very nice safety nets of our autofailover is that it will not activate if it sees more than one node down at once, which means that we can never do the wrong thing. If we allow one node to disable its traffic when it can't intelligently reason about the state of the rest of the cluster, IMO we move away from this safety net...no?
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
No. Because a node can only do that when it's sure that the other side of the cluster is not accessible. And it can re-enable its memcached traffic as soon as it detects that the rest of the cluster is back.
Comment by Perry Krug [ 13/Dec/13 ]
But it can't ever be sure that the other side of the cluster is actually not accessible...clients may still be able to reach it right?

I'm thinking about some extreme corner cases...but what about the situation where two nodes of a >2-node cluster are completely isolated via some weird networking situation and yet are still reachable to the clients. Both of them would decide that they were isolated from the whole cluster, both of them would disable all their vbuckets and yet neither would be auto-failed over because the rest of the cluster would see two nodes down and not trigger the autofailover. I realize it's rare...but I bet there are less convoluted scenarios that would lead the software to do something undesirable.

I think this is a good discussion...but not directly relevant to the purpose of this bug which I believe is still an important fix that needs to be made. Do you want to take this discussion offline from this bug?
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
There are definitely ways this can backfire, but the tradeoffs are quite clear. You "buy" the ability to detect autofailovers (and only autofailovers, in my words above, though this could potentially be extended to other cases) at the expense of a small chance of a node false-positively disabling its traffic, briefly and without data loss.

Thinking about this more, I now see that it's a less good idea than I thought, i.e. covering autofailover but not manual failover is not as interesting. But we can return to this discussion when the mb_master work is actually in progress.

Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Lowered to Critical. It's not blocking anyone.




[MB-9234] Failover message should take into account availability of replica vbuckets Created: 08/Oct/13  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
If a node goes down and its vbuckets do not have corresponding replicas available in the cluster, we should warn the user that pressing failover will result in perceived data loss. At the moment, we show the same failover message whether those replica vbuckets are available or not.
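One way to compute that warning condition, sketched in Python against the public bucket REST response (the vBucketServerMap/vBucketMap field names follow the documented bucket details JSON, but verify them on the target build; the helper name is made up):

import base64, json, urllib.request

def vbuckets_without_live_replica(rest_host, bucket, failed_node, user, password):
    """Return the active vbuckets on failed_node that have no replica anywhere
    else, i.e. the vbuckets whose data would be lost by pressing failover."""
    req = urllib.request.Request(
        "http://%s:8091/pools/default/buckets/%s" % (rest_host, bucket))
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    info = json.loads(urllib.request.urlopen(req).read().decode())
    servers = info["vBucketServerMap"]["serverList"]   # e.g. "10.1.2.31:11210"
    vbmap = info["vBucketServerMap"]["vBucketMap"]     # [active, replica1, ...] per vbucket
    node_idx = servers.index(failed_node)
    at_risk = []
    for vb, chain in enumerate(vbmap):
        if chain[0] == node_idx and not any(r >= 0 and r != node_idx for r in chain[1:]):
            at_risk.append(vb)                         # no live replica elsewhere
    return at_risk

If the returned list is non-empty, the failover confirmation could show the data-loss warning instead of the generic message.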




[MB-9143] Allow replica count to be edited Created: 17/Sep/13  Updated: 12/Jun/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 2.5.0, 3.0

Type: Task Priority: Critical
Reporter: Perry Krug Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-2512 Allow replica count to be edited Closed

 Description   
Currently the replication factor cannot be edited after a bucket has been created. It would be nice to have this functionality.
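For reference, the edit itself would be a POST of replicaNumber to the bucket endpoint; a hedged Python sketch (the parameter name, port, and the need for a subsequent rebalance are assumptions to verify against the target release):

import base64, urllib.parse, urllib.request

def set_replica_count(host, bucket, replicas, user, password):
    """Change the replica count on an existing bucket; the new count only
    takes effect for existing data after a rebalance."""
    url = "http://%s:8091/pools/default/buckets/%s" % (host, bucket)
    data = urllib.parse.urlencode({"replicaNumber": replicas}).encode()
    req = urllib.request.Request(url, data=data)   # POST
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    urllib.request.urlopen(req)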

 Comments   
Comment by Ruth Harris [ 06/Nov/13 ]
Currently, it's added to the 3.0 Eng branch by Alk. See MB-2512. This would be a 3.0 doc enhancement.
Comment by Perry Krug [ 25/Mar/14 ]
FYI, this is already in as of 2.5 and probably needs to be documented there as well...if possible before 3.0.
Comment by Amy Kurtzman [ 16/May/14 ]
Anil, Can you verify whether this was added in 2.5 or 3.0?
Comment by Anil Kumar [ 28/May/14 ]
Verified - as Perry mentioned, this was added in the 2.5 release. We need to document this soon for the 2.5 docs.




[MB-8686] CBHealthChecker - Fix fetching number of CPU processors Created: 23/Jul/13  Updated: 05/Jun/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.1.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Anil Kumar Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: customer
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-8817 REST API support to report number of ... Technical task Open Bin Cui  
Triage: Untriaged

 Description   
Issue reported by customer - the cbhealthchecker report shows incorrect information for 'Minimum CPU core number required'.


 Comments   
Comment by Bin Cui [ 07/Aug/13 ]
This depends on ns_server providing the number of CPU processors in the collected stats. I suggest pushing it to the next release.
Comment by Maria McDuff (Inactive) [ 01/Nov/13 ]
per Bin:
Suggest pushing the following two bugs to the next release:
1. MB-8686: it depends on ns_server providing the capability to retrieve the number of CPU cores
2. MB-8502: caused by async communication between the main installer thread and the API to get status. The change would be dramatic for the installer.
 
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Bin,

Raising to Critical.
If this is still dependent on ns_server, pls assign to Alk.
This needs to be fixed for 3.0.
Comment by Anil Kumar [ 05/Jun/14 ]
We need this information to be provided from ns_server. Created ticket MB-11334.

Triage - June 05 2014 Bin, Anil, Tony, Ashvinder
Comment by Aleksey Kondratenko [ 05/Jun/14 ]
Ehm. I don't think it's a good idea to treat ns_server as a "provider of random system-level stats". I believe you'll need to find another way of getting it.
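One such "other way", assuming the check may simply read the core count on the machine where the collector runs (illustrative Python only, not the actual healthchecker code):

import multiprocessing

def local_cpu_count():
    """Report the number of CPU cores visible locally instead of expecting
    ns_server to include it in the collected stats."""
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return None   # platform can't report it; treat as unknown in the report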




[MB-9045] [windows] cbworkloadgen hangs Created: 03/Sep/13  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.2.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: scrubbed
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.2.0-817
<manifest><remote name="couchbase" fetch="git://10.1.1.210/"/><remote name="membase" fetch="git://10.1.1.210/"/><remote name="apache" fetch="git://github.com/apache/"/><remote name="erlang" fetch="git://github.com/erlang/"/><default remote="couchbase" revision="master"/><project name="tlm" path="tlm" revision="862733cea3805cf8eba957a120a67986cd57e4e3"><copyfile dest="Makefile" src="Makefile.top"/></project><project name="bucket_engine" path="bucket_engine" revision="2a797a8d97f421587cce728f2e6aa2cd42c8fa26"/><project name="ep-engine" path="ep-engine" revision="864296f0b4068f9d8e3943fbea6e34c29cf0e903"/><project name="libconflate" path="libconflate" revision="c0d3e26a51f25a2b020713559cb344d43ce0b06c"/><project name="libmemcached" path="libmemcached" revision="ea579a523ca3af872c292b1e33d800e3649a8892" remote="membase"/><project name="libvbucket" path="libvbucket" revision="408057ec55da3862ab8d75b1ed25d2848afd640f"/><project name="couchbase-cli" path="couchbase-cli" revision="94b37190ece87b4386a93b64e62487370d268654" remote="couchbase"/><project name="memcached" path="memcached" revision="414d788f476a019cc5d2b05e0ce72504fe469c79" remote="membase"/><project name="moxi" path="moxi" revision="01bd2a5c0aff2ca35611ba3fb857198945cc84eb"/><project name="ns_server" path="ns_server" revision="8e533a59413ba98dd8a0bc31b409668ca886c560"/><project name="portsigar" path="portsigar" revision="2204847c85a3ccaecb2bb300306baf64824b2597"/><project name="sigar" path="sigar" revision="a402af5b6a30ea8e5e7220818208e2601cb6caba"/><project name="couchbase-examples" path="couchbase-examples" revision="cd9c8600589a1996c1ba6dbea9ac171b937d3379"/><project name="couchbase-python-client" path="couchbase-python-client" revision="f14c0f53b633b5313eca1ef64b0f241330cf02c4"/><project name="couchdb" path="couchdb" revision="386be73085c0b2a8e11cd771fc2ce367b62b7354"/><project name="couchdbx-app" path="couchdbx-app" revision="300031ab2e7e2fc20c59854cb065a7641e8654be"/><project name="couchstore" path="couchstore" revision="30f8f0872ef28f95765a7cad4b2e45e32b95dff8"/><project name="geocouch" path="geocouch" revision="000096996e57b2193ea8dde87e078e653a7d7b80"/><project name="healthchecker" path="healthchecker" revision="fd4658a69eec1dbe8a6122e71d2624c5ef56919c"/><project name="testrunner" path="testrunner" revision="8371aa1cc3a21650b3a9f81ba422ec9ac3151cfc"/><project name="cbsasl" path="cbsasl" revision="6ba4c36480e78569524fc38f6befeefb614951e6"/><project name="otp" path="otp" revision="b6dc1a844eab061d0a7153d46e7e68296f15a504" remote="erlang"/><project name="icu4c" path="icu4c" revision="26359393672c378f41f2103a8699c4357c894be7" remote="couchbase"/><project name="snappy" path="snappy" revision="5681dde156e9d07adbeeab79666c9a9d7a10ec95" remote="couchbase"/><project name="v8" path="v8" revision="447decb75060a106131ab4de934bcc374648e7f2" remote="couchbase"/><project name="gperftools" path="gperftools" revision="44a584d1de8c89addfb4f1d0522bdbbbed83ba48" remote="couchbase"/><project name="pysqlite" path="pysqlite" revision="0ff6e32ea05037fddef1eb41a648f2a2141009ea" remote="couchbase"/></manifest>

Attachments: Zip Archive cbcollect.zip    
Triage: Untriaged
Operating System: Windows 64-bit

 Description   
/cygdrive/c/Program\ Files/Couchbase/Server/bin/cbworkloadgen.exe -n localhost:8091 -r 0.9 -i 1000 -b default -s 256 -j -t 2 -u Administrator -p password
It loads only 369 items and then just hangs.

 Comments   
Comment by Bin Cui [ 03/Sep/13 ]
Looks like the parameter -s 256 causes the trouble; it asks for every doc to be at least 256 bytes.

When tested with -s less than 50 it always works fine, but we run into trouble beyond that value.

BTW, the default for -s is 10.
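For context, -s only asks for a minimum document size, roughly as in the Python sketch below (field names are made up, not cbworkloadgen's actual document schema):

import json

def make_doc(doc_id, min_size=256):
    """Build a JSON doc body of at least min_size bytes, the way the -s flag asks for."""
    doc = {"id": doc_id, "padding": ""}
    body = json.dumps(doc)
    if len(body) < min_size:
        doc["padding"] = "x" * (min_size - len(body))
    return json.dumps(doc)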
Comment by Thuan Nguyen [ 21/Jan/14 ]
Tested on build 2.5.0-1054; cbworkloadgen.exe still hangs with an item size of only 35 bytes.

cbworkloadgen.exe -n 10.1.2.31:8091 -r 0.9 -i 1000000 -b default -s 35 -j -t 2 -u Administrator -p password

Comment by Thuan Nguyen [ 21/Jan/14 ]
Checked the UI; it loaded only 639 items and then stopped on the default bucket.




[MB-8915] Tombstone purger needs to find a better home for lifetime of deletion Created: 21/Aug/13  Updated: 15/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication, storage-engine
Affects Version/s: 2.2.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Junyi Xie (Inactive) Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-8916 migration tool for offline upgrade Technical task Open Anil Kumar  
Triage: Untriaged

 Description   
=== copied and pasted from my email to a group of people; it should explain clearly why we need this ticket ===

Thanks for your comments. Probably it is easier to read in email than in a code review.

Let me explain a bit to see if we can get on the same page. First of all, the current resolution algorithm (comparing all fields) is still right; yes, there is a small chance we would touch the fields after CAS, but for correctness we should have them there.

The cause of MB-8825 is that the tombstone purger uses the expiration time field to store the purger-specific "lifetime of deletion". This is just a "temporary solution", because IMHO the expiration time of a key is not the right place for the "lifetime of deletion" (this is purely storage-specific metadata and IMHO should not be in ep-engine), but unfortunately today we cannot find a better place to put such info unless we change the storage format, which has too much overhead at this time. In the future, I think we need to figure out the best place for the "lifetime of deletion" and move it out of the key expiration time field.

In practice, this temporary solution in the tombstone purger is OK in most cases, because you rarely have a CAS collision for two deletions on the same key. But MB-8825 hit exactly that small dark area: when the destination tries to replicate a deletion from the source back to the source in bi-directional XDCR, both copies share the same (SeqNo, CAS) but differ in the expiration time field (which is not the exp time of the key, but the lifetime of deletion created by the tombstone purger). The exp time at the destination is sometimes bigger than that at the source, causing incorrect resolution results at the source. The problem exists for both CAPI and XMEM.
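To make the failure mode concrete, here is a toy Python sketch of "compare all fields" resolution for tombstones; the field order (rev seqno, CAS, expiration, flags) follows the description above and is an assumption, not the exact ep-engine/XDCR code:

def source_wins(source_meta, dest_meta):
    """Revision metadata compared as a tuple; the lexicographically larger one wins."""
    return source_meta > dest_meta

# Two tombstones for the same key after a bounce-back in bi-directional XDCR:
# identical (rev_seqno, cas), but the expiration field holds the purger's
# "lifetime of deletion", which was stamped later at the destination.
source = (5, 0x1234, 1400000000, 0)
dest   = (5, 0x1234, 1400003600, 0)
print(source_wins(dest, source))   # True: the bounced-back copy wins at the source, incorrectly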

For backward compatibility:
1) If both sides are 2.2, we use the new resolution algorithm for deletions and we are safe.
2) If both sides are pre-2.2, since they do not have the tombstone purger, the current algorithm (comparing all fields) should be safe.
3) For bi-directional XDCR between a pre-2.2 and a 2.2 cluster on CAPI, a deletion born at 2.2 replicating to pre-2.2 should be safe because there is no tombstone purger at pre-2.2. For deletions born at pre-2.2, we may see them bounced back from 2.2, but there should be no data loss since you just re-delete something already deleted.

This fix may not be perfect, but it is still much better than the issues in MB-8825. I hope in the near future we can find the right place for the "lifetime of deletion" in the tombstone purger.


Thanks,

Junyi

 Comments   
Comment by Junyi Xie (Inactive) [ 21/Aug/13 ]
Anil and Dipti,

Please determine the priority of this task, and comment if I missed anything. Thanks.


Comment by Anil Kumar [ 21/Aug/13 ]
Upgrade - we need the migration tool (which we talked about) to move the data in the case of an offline upgrade. Created a subtask for that.
Comment by Aaron Miller (Inactive) [ 17/Oct/13 ]
Considering that fixing this has lots of implications w.r.t. upgrade and all components that touch the file format, and that not fixing it is not causing any problems, I believe that this is not appropriate for 2.5.0
Comment by Junyi Xie (Inactive) [ 22/Oct/13 ]
I agree with Aaron that this may not be a small task and may have lots of implications to different components.

Anil, please reconsider if this is appropriate for 2.5. Thanks.
Comment by Anil Kumar [ 22/Oct/13 ]
Moved it to 3.0.
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
As "temporary head of xdcr for 3.0" I don't need this fixed in 3.0.

And my guess is that after 3.0, when "the plan" for XDCR is ready, we'll just close it as won't-fix, but let's wait and see.
Comment by Cihan Biyikoglu [ 15/Jun/14 ]
Aaron is no longer here. Assigning to Chiyoung for consideration.




[MB-8845] spend 5 days prototyping master-less cluster orchestration Created: 15/Aug/13  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to

 Description   
We have a number of issues caused by master election and related machinery.

We have some ideas about doing better than that, but they need to be prototyped.

 Comments   
Comment by Aleksey Kondratenko [ 26/Aug/13 ]
See MB-7282
Comment by Aleksey Kondratenko [ 16/Sep/13 ]
I've spent 1 day on that on Fri Sep 13
Comment by Andrei Baranouski [ 10/Feb/14 ]
Alk, could you provide cases to reproduce it once the task is finished? Because this problem occurs very rarely in our tests.
Comment by Aleksey Kondratenko [ 10/Feb/14 ]
No. That task will not lead to any commits into mainline code. It's just a prototype.

After the prototype is done we'll have a more specific plan for the mainline codebase.
Comment by Maria McDuff (Inactive) [ 14/Feb/14 ]
Removed from 3.0 Release.




[MB-8832] Allow for some back-end setting to override hard limit on server quota being 80% of RAM capacity Created: 14/Aug/13  Updated: 28/May/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.1.0, 2.2.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Perry Krug Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Relates to
relates to MB-10180 Server Quota: Inconsistency between d... Open
Triage: Untriaged
Is this a Regression?: Yes

 Description   
At the moment, there is no way to override the 80% of RAM limit for the server quota. At very large node sizes, this can end up leaving a lot of RAM unused.

 Comments   
Comment by Aleksey Kondratenko [ 14/Aug/13 ]
Passing this to Dipti.

We've seen memory fragmentation easily reach 50% of memory usage, so even with 80% you can get into swap and badness.

I'd recommend _against_ this until we solve the fragmentation issues we have today.

Also keep in mind that today you _can_ raise this above all limits with a simple /diag/eval snippet.
Comment by Perry Krug [ 14/Aug/13 ]
We have seen this I agree, but it's been fairly uncommon in production environments and is something that can be monitored and resolved when it does occur. In larger RAM systems, I think we would be better served for most use cases by allowing more RAM to be used.

For example, 80% of 60GB is 48GB...leaving 12GB unused. Even worse for 256GB (leaving 50+GB unused)
Comment by Aleksey Kondratenko [ 14/Aug/13 ]
And on a 256GB machine fragmentation can be as big as 128GB(!). IMHO this is not about absolute numbers but about percentages. Anyway, Dipti will tell us what to do, but your numbers above are just saying how bad our _expected_ fragmentation is.
Comment by Perry Krug [ 14/Aug/13 ]
But that's where I disagree...I think it _is_ about absolute numbers. If we leave fragmentation out of it (since it's something we will fix eventually, something that is specific to certain workloads and something that can be worked around via rebalancing), the point of this overhead was specifically to leave space available for the operating system and any other processes running outside of Couchbase. I'm sure you'd agree that Linux doesn't need anywhere near 50GB of RAM to run properly :) Even if we could decrease that by half it would provide huge savings in terms of hardware and costs to our users.

Is fragmentation the only concern of yours? If we were able to analyze a running production cluster to quantify the RAM fragmentation that exists and determine that it is within certain bounds...would it be okay to raise the quota above 80%?
Comment by Aleksey Kondratenko [ 14/Aug/13 ]
My point was that fragmentation is also % not absolute. So with larger ram, waste from fragmentation looks scarier.

Now that you're asking if that's my only concern I see that there's more.

Without sufficient space for page cache disk performance will suffer. How much we need to be at least on par with sqlite I cannot say. Nobody can, apparently. Things depend on whether you're going to do bgfetches or not.

Because if you do care about quick bgfetches (or, say, views and XDCR), then you may want to set the lowest possible quota and give as much RAM as possible to the page cache, hoping that at least all metadata is in the page cache.

If you do not care about residency of metadata, that means you don't care about btree leaves being page-cache-resident. But in order to remain IO-efficient you do need to keep the non-leaf nodes in the page cache. The issue is that with our append-only design nobody knows how well that works in practice, or exactly how much page cache you need to give to keep the few (perhaps hundreds of) megs of metadata-of-metadata page-cache resident. And quite possibly the "correct" recommendation is something like "you need XX percent of your data size as page cache to keep the disk subsystem efficient".
Comment by Perry Krug [ 14/Aug/13 ]
Okay, that does make a very good point.

But it also highlights the need for a flexible configuration on our end depending on the use case and customer's needs. i.e., certain customers want to enforce that they are 100% resident and to me that would mean giving Couchbase more than the default quota (while still keeping the potential for fragmentation in mind).
Comment by Patrick Varley [ 11/Feb/14 ]
MB-10180 is strongly related to this issue.
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Anil, pls see my comment on MB-10180.




[MB-8054] Couchstore's mergesort module, currently used for db compaction, can buffer too much data in memory Created: 10/Apr/13  Updated: 19/May/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 2.0.1, 2.1.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Filipe Manana Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
The size of the buffer used by the mergesort module is bounded exclusively by the number of elements. This is dangerous, because elements can have variable sizes, and a small number of elements does not necessarily mean that the buffer size (in bytes) is small.

Namely, the treewriter module, used by the database compactor to sort the temporary file containing records for the id btree, was specifying a buffer element count of 100 * 1024 * 1024. If, for example, there are 100 * 1024 * 1024 id records and each has an average size of 512 bytes, the merge sort module buffers 50GB of data in memory!

Although the id btree records are currently very small (under a hundred bytes or so), the use of other types of records may easily cause too much memory consumption - this will be the case for view records. Issue MB-8029 adds a module that uses the mergesort module to sort files containing view records.
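A byte-bounded alternative, sketched in Python rather than couchstore's C, showing how run generation for an external sort can be capped by bytes so variable-size records cannot blow past the limit the way a 100M-element bound can (helper names are made up):

import pickle, tempfile

def write_sorted_runs(records, max_buffer_bytes=10 * 1024 * 1024):
    """Accumulate records until the buffer reaches max_buffer_bytes, then sort
    and spill the run to a temporary file; returns the list of run files."""
    run_files, buf, buf_bytes = [], [], 0

    def flush():
        nonlocal buf, buf_bytes
        if not buf:
            return
        buf.sort()
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(buf, f)
        f.close()
        run_files.append(f.name)
        buf, buf_bytes = [], 0

    for rec in records:
        buf.append(rec)
        buf_bytes += len(pickle.dumps(rec))
        if buf_bytes >= max_buffer_bytes:
            flush()
    flush()
    return run_files   # each file holds one sorted run, merged in a later pass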


 Comments   
Comment by Filipe Manana [ 10/Apr/13 ]
http://review.couchbase.org/#/c/25588
Comment by Filipe Manana [ 11/Apr/13 ]
It turns out this is not a simple change.

Simply adding a buffer byte-size limit breaks the merge algorithm for some cases, particularly when the file to sort is larger than the specified buffer size. The mergesort.c merge phase relies on the fact that each sorted batch written to the tmp files always has the same number of elements - something that doesn't hold true when records have a variable size, such as with views (MB-8029).

For now it's not too bad, because for the current use of mergesort.c by the views the files to sort are small (up to 30MB max). Later this will have to change, as the files to sort can have any size, from a few KB to hundreds of MBs or GBs. I'll look for an alternative external mergesort implementation that allows controlling the max buffer size, merging only a group of already-sorted files (like Erlang's file_sorter allows), and is ideally more optimized as well (allowing N-way merges instead of a fixed 2-way merge, etc.).
Comment by Filipe Manana [ 16/May/13 ]
There's a new and improved on-disk file sorter (flexibility, error handling, some performance optimizations) now in the master branch.
It's being used for views already.

Introduced in:

https://github.com/couchbase/couchstore/commit/fdb0da52a1e3c059fef3fa7e74ec54b03e62d5db

Advantages:

1) Allow in-memory buffer sizes to be bounded by number of
   bytes, unlike mergesort.c which bounds buffers by number of
   records regardless of their sizes.

2) Allow for N-way merges, allowing for better performance
   due to a significant reduction in moving records between
   temporary files (see the sketch after this list);

3) Some optimizations to avoid unnecessary moving of records
   between temporary files (especially when the total number of
   records is smaller than the buffer size);

4) Allow specifying which directory is used to store temporary
   files. The mergesort.c uses the C stdlib function tmpfile()
   to create temporary files - the standard doesn't specify in
   which directory such files are created, but on GNU/Linux it
   seems to be in /tmp (see http://linux.die.net/man/3/tmpfile).
   For database compaction and index compaction, it's important
   to use a directory from within the configured database and
   index directories (settings database_dir and view_index_dir),
   because those directories are what the administrator configured
   and may be part of a disk drive that offers better performance
   or just has more available space for example.
   Further, on some systems /tmp might map to a tmpfs mount, which
   is an in-memory filesystem (http://en.wikipedia.org/wiki/Tmpfs);

5) Better and more fine-grained error handling. Compare with MB-8055 -
   the mergesort.c module completely ignored read errors when
   reading from the temporary files, which could lead to silent data loss.
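A rough illustration of point 2 above, continuing the hypothetical Python run-generation sketch from the description (heapq.merge streams every run once in a single pass, rather than repeatedly combining files two at a time as a fixed 2-way merge does):

import heapq, pickle

def merge_runs(run_files, emit):
    """N-way merge of the sorted run files produced by write_sorted_runs."""
    def stream(path):
        with open(path, "rb") as f:
            for rec in pickle.load(f):
                yield rec
    for rec in heapq.merge(*(stream(p) for p in run_files)):
        emit(rec)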
Comment by Filipe Manana [ 16/May/13 ]
See above.
Since this is core database, I believe it belongs to you.
Comment by Maria McDuff (Inactive) [ 10/Feb/14 ]
Aaron,

is this going to be in 3.0?
Comment by Aaron Miller (Inactive) [ 18/Feb/14 ]
I wouldn't count on it. This sort of thing affects views a lot more than the storage files, and the view code has already been modified to use the newer disk sort.

This is unlikely to cause any problems with storage file compaction, as the sizes of the records in storage files can't grow arbitrarily.

Using the newer sort will probably perform better, but *not* using it shouldn't cause any problems, making this issue more of a performance enhancement than a bug, and as such will probably lose to other issues I'm working on for 3.0 and 2.5.X




[MB-8022] Fsync optimizations (remove double fsyncs) Created: 05/Feb/13  Updated: 01/Apr/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 2.0, 2.0.1, 2.1.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dipti Borkar Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: PM-PRIORITIZED
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Aaron Miller (Inactive) [ 28/Mar/13 ]
There is a toy build that Ronnie is testing to see the potential performance impact of this (toy-aaron #1022).
Comment by Maria McDuff (Inactive) [ 10/Apr/13 ]
Jin will update use case scenario that QE will run.
Comment by Jin Lim [ 11/Apr/13 ]
This feature is to optimize disk writes from ep-engine/couchstore.

Any existing test that measures disk drain rate should show any tangible improvement from the feature.
Baseline:
* Heavy DGM
* Write heavy (read: 20%, write: 80%)
* Write I/O should be a mix of set/delete/update
* Measure disk drain rate and cbstats' kvtimings (writeTime, commit, save_documents)
Comment by Aaron Miller (Inactive) [ 11/Apr/13 ]
The most complicated part of this change is the addition of a corruption check that must be run the first time a file is opened after the server comes up, since we're buying these perf gains by playing a bit more fast and loose with the disk.

To check that this is behaving correctly we'll want to make sure that corrupting the most-recent transaction in a storage file rolls that transaction back.

This could be accomplished by updating an item that will land in a known vbucket, shutting down the server, and flipping some bits around the end of the file. The update should be rolled back when the server comes back up, and nothing should freak out :)

A position guaranteed to affect an item body from the most recent transaction is 4095 bytes behind the last position in the file that is a multiple of 4096, or: floor(file_length / 4096) * 4096 - 4095
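A test-helper sketch of exactly that bit flip, in Python with a hypothetical helper name (the offset formula is the one given above):

import os

def corrupt_recentmost_body(path):
    """Flip one byte at floor(file_length / 4096) * 4096 - 4095 so the most
    recent transaction's item body is damaged; the server should roll the
    transaction back on the next open."""
    size = os.path.getsize(path)
    pos = (size // 4096) * 4096 - 4095
    if pos < 0:
        raise ValueError("file too small to contain a full 4KB block")
    with open(path, "r+b") as f:
        f.seek(pos)
        original = f.read(1)
        f.seek(pos)
        f.write(bytes([original[0] ^ 0xFF]))   # invert every bit of that byte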
Comment by Maria McDuff (Inactive) [ 16/Apr/13 ]
Abhinav,
will you be able to craft a test that involves updating an item and manipulating the bits at EOF? This seems tricky; let's discuss with Jin/Aaron.
Comment by Dipti Borkar [ 19/Apr/13 ]
I don't think this is user visible and so doesn't make sense to include in the release notes.
Comment by Maria McDuff (Inactive) [ 19/Apr/13 ]
aaron, pls assign back to QE (Abhinav) once you've merged the fix.
Comment by kzeller [ 22/Apr/13 ]
Updated 4/22 - No docs needed
Comment by Maria McDuff (Inactive) [ 22/Apr/13 ]
Aaron, can you also include the code changes for review here as soon as you have checked-in the fix?
thanks.
Comment by Maria McDuff (Inactive) [ 23/Apr/13 ]
deferred.
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Hi Aaron, are you working on this for 3.0? If yes, could you push this to fixVersion=3.0?
Comment by Cihan Biyikoglu [ 01/Apr/14 ]
Chiyoung, pls close if this isn't relevant anymore, given this is a year old.




[MB-7177] lack of fsyncs in view engine may lead to silent index corruption Created: 13/Nov/12  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Rahim Yaseen (Inactive)
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
SUBJ. Found out about this in a discussion with Filipe about how views work.

If I understood correctly, the view engine doesn't fsync at all, silently assuming that if there's a valid header then the preceding data is valid as well - which is clearly not true.

IMHO that's a massive blocker that needs to be fixed sooner rather than later.
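The ordering being asked for is the usual write-then-sync-then-publish pattern; a minimal Python sketch of the idea (os-level calls only, not the actual couchstore/view-engine code):

import os

def append_commit(fd, data_block, header_block):
    """The data a header points at must be durable before the header is written;
    otherwise a crash can leave a valid header referencing garbage."""
    os.write(fd, data_block)
    os.fsync(fd)                # make the btree nodes durable first
    os.write(fd, header_block)
    os.fsync(fd)                # only then make the new header durable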

 Comments   
Comment by Steve Yen [ 14/Nov/12 ]
bug-scrub -- assigned to yaseen
Comment by Aleksey Kondratenko [ 14/Nov/12 ]
A comment was made that this cannot be silent index corruption due to the CRC-ing of all btree nodes. But my point still holds: if there's data corruption we'll only find out at query time, and people will have to experience downtime to manually rebuild the index.
Comment by Steve Yen [ 15/Nov/12 ]
per bug scrub
Comment by Farshid Ghods (Inactive) [ 26/Nov/12 ]
Deep and Iryna have tried a scenario where they rebooted the system and did not hit this issue.
Comment by Steve Yen [ 26/Nov/12 ]
to .next per bug-scrub.

QE reports that deep & iryna tried to reproduce this and couldn't yet.
Comment by Aleksey Kondratenko [ 26/Nov/12 ]
It appears that the move to .next was based on the same old "we cannot reproduce" logic. It appears that we continue to under-prioritize IMHO important bugs merely because they're hard to reproduce.

With that logic I'm sure we'll keep moving it to the next release forever. If we think we don't need to fix it, IMHO it would be better to just close it.
Comment by Filipe Manana [ 04/Jan/13 ]
Due to the CRC checks for every object written to a file (btree nodes), it certainly won't be silent.
Comment by Aleksey Kondratenko [ 04/Jan/13 ]
I agree. My earlier comment above (based on yours or Damien's verbal comment) has the same information.

But not being silent doesn't mean we can simply close it (or IMHO downgrade or forget it). Do we know what exactly will happen if querying or updating a view suddenly detects a corrupted index file?
Comment by Andrew DePue [ 21/May/13 ]
We just ran into this, or something like it. We have a development cluster and lost power to the entire cluster at once (it was a dev cluster so we didn't have backup power). The Couchbase cluster _seemed_ to start OK, but accessing certain views would result in strange behavior... mostly timeouts without any error or any indication as to what the problem could be.
Comment by Filipe Manana [ 21/May/13 ]
If there's a corruption issue with a file (either view or database), view queries will return an explicit file_corruption error if the index file is corrupted. If the corruption is in a database file, the error is only returned in a query response if the query is of type stale=false. For all cases, the error (and a stack trace) are logged.

Did you see such an error in your case? Example:
http://www.couchbase.com/forums/thread/filecorruption-error-executing-view




[MB-6746] separate disk path for replica index (and individual design doc) disk path Created: 26/Sep/12  Updated: 20/Jun/13

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-beta
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dipti Borkar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Add, in the UI and REST API, the ability to set a separate disk path for replica indexes (right under the replica index checkbox in the setup wizard).

This will allow users to have a better disk layout if the replica index is used.

In addition, add new REST APIs to enable a separate disk path for each design document (not in the UI, only in REST).

 Comments   
Comment by Aleksey Kondratenko [ 28/Sep/12 ]
Dipti, this checkbox in the setup wizard is for the default bucket, not a cluster-wide setting.

Also, are you really sure we need this? I mean, RAID 0 for views looks even better from a performance perspective.
Comment by Aleksey Kondratenko [ 04/Oct/12 ]
We discussed already that I can't do that without more instructions.
Comment by Peter Wansch (Inactive) [ 08/Oct/12 ]
Change too invasive for 2.0
Comment by Steve Yen [ 25/Oct/12 ]
alk would be a better assignee for this than peter
Comment by Aleksey Kondratenko [ 20/Jun/13 ]
Given this is per-bucket/per-node, we don't have a place for it in the current UI design.

And I'm not sure we really need this. I seriously doubt it, honestly speaking.




[MB-6527] Tools to Index and compact database/indexes when the server is offline Created: 05/Sep/12  Updated: 01/Apr/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Karan Kumar (Inactive) Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
This is from the supportability point of view.
If for whatever reason customers bring their nodes down, e.g. for maintenance, etc.

When they bring a node back up, we would hopefully have all the compaction/indexing finished for that particular node.

We need a way to index and compact data (database and index), if possible, while the nodes are offline.

 Comments   
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Anil, could you pull this into 3.0 if this is happening in the 3.0 timeline?




[MB-6450] Finalize doc editing API and implementation Created: 27/Aug/12  Updated: 19/May/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-beta
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
We need to:

* avoid loading and displaying blobs in the UI

* avoid seeing deleted docs

* handle warmup and nodes being down in the REST API implementation and in the UI


 Comments   
Comment by Tug Grall (Inactive) [ 09/Se