[MB-11203] SSL-enabled memcached will hang when given a large buffer containing many pipelined requests Created: 24/May/14  Updated: 09/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Test Blocker
Reporter: Mark Nunberg Assignee: Trond Norbye
Resolution: Unresolved Votes: 0
Labels: memcached
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Sample code that fills in a large number of pipelined requests and flushes them over a single buffer.

#include <libcouchbase/couchbase.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static int remaining = 0;

static void
get_callback(lcb_t instance, const void *cookie, lcb_error_t err,
    const lcb_get_resp_t *resp)
{
    printf("Remaining: %d \r", remaining);
    fflush(stdout);
    if (err != LCB_SUCCESS && err != LCB_KEY_ENOENT) {
        /* unexpected errors are simply ignored in this reproduction */
    }
    remaining--;
}

static void
stats_callback(lcb_t instance, const void *cookie, lcb_error_t err,
    const lcb_server_stat_resp_t *resp)
{
    printf("Remaining: %d \r", remaining);
    fflush(stdout);
    if (err != LCB_SUCCESS && err != LCB_KEY_ENOENT) {
        /* unexpected errors are simply ignored in this reproduction */
    }

    /* a NULL endpoint marks the final callback for a stats request */
    if (resp->v.v0.server_endpoint == NULL) {
        fflush(stdout);
        --remaining;
    }
}

#define ITERCOUNT 5000
static int use_stats = 1;

static void
do_stat(lcb_t instance)
{
    lcb_CMDSTATS cmd;
    memset(&cmd, 0, sizeof(cmd));
    lcb_error_t err = lcb_stats3(instance, NULL, &cmd);
    assert(err==LCB_SUCCESS);
}

static void
do_get(lcb_t instance)
{
    lcb_error_t err;
    lcb_CMDGET cmd;
    memset(&cmd, 0, sizeof cmd);
    LCB_KREQ_SIMPLE(&cmd.key, "foo", 3);
    err = lcb_get3(instance, NULL, &cmd);
    assert(err==LCB_SUCCESS);
}

int main(void)
{
    lcb_t instance;
    lcb_error_t err;
    struct lcb_create_st cropt = { 0 };
    cropt.version = 2;
    char *mode = getenv("LCB_SSL_MODE");
    if (mode && *mode == '3') {
        cropt.v.v2.mchosts = "localhost:11996";
    } else {
        cropt.v.v2.mchosts = "localhost:12000";
    }
    mode = getenv("USE_STATS");
    if (mode && *mode != '\0') {
        use_stats = 1;
    } else {
        use_stats = 0;
    }
    err = lcb_create(&instance, &cropt);
    assert(err == LCB_SUCCESS);


    err = lcb_connect(instance);
    assert(err == LCB_SUCCESS);
    err = lcb_wait(instance);
    assert(err == LCB_SUCCESS);
    lcb_set_get_callback(instance, get_callback);
    lcb_set_stat_callback(instance, stats_callback);
    lcb_cntl_setu32(instance, LCB_CNTL_OP_TIMEOUT, 20000000);
    int nloops = 0;

    while (1) {
        unsigned ii;
        lcb_sched_enter(instance);
        for (ii = 0; ii < ITERCOUNT; ++ii) {
            if (use_stats) {
                do_stat(instance);
            } else {
                do_get(instance);
            }
            remaining++;
        }
        printf("Done Scheduling.. L=%d\n", nloops++);
        lcb_sched_leave(instance);
        lcb_wait(instance);
        assert(!remaining);
    }
    return 0;
}


 Comments   
Comment by Mark Nunberg [ 24/May/14 ]
http://review.couchbase.org/#/c/37537/
Comment by Mark Nunberg [ 07/Jul/14 ]
Trond, I'm assigning it to you because you might be able to delegate this to another person. I can't see anything obvious in the diff since the original fix which would break it - of course my fix might not have fixed it completely but just made it work accidentally; or it may be flush-related.
Comment by Mark Nunberg [ 07/Jul/14 ]
Oh, and I found this on an older build of master; 837, and the latest checkout (currently 055b077f4d4135e39369d4c85a4f1b47ab644e22) -- I don't think anyone broke memcached - but rather the original fix was incomplete :(




[MB-10180] Server Quota: Inconsistency between documentation and CB behaviour Created: 11/Feb/14  Updated: 28/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Dave Rigby Assignee: Ruth Harris
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File MB-10180_max_quota.png    
Issue Links:
Relates to
relates to MB-2762 Default node quota is still too high Resolved
relates to MB-8832 Allow for some back-end setting to ov... Open
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
In the documentation for the product (and general sizing advice) we tell people to allocate no more than 80% of their memory for the Server Quota, to leave headroom for the views, disk write queues and general OS usage.

However on larger[1] nodes we don't appear to enforce this, and instead allow people to allocate up to 1GB less than the total RAM.

This is inconsistent, as we document and tell people one thing and let them do another.

This appears to be something inherited from MB-2762, the intent of which appeared to be to only allow relaxing this when joining a cluster; however, that doesn't appear to be how it works - I can successfully change the existing cluster quota from the CLI to a "large" value:

    $ /opt/couchbase/bin/couchbase-cli cluster-edit -c localhost:8091 -u Administrator -p dynam1te --cluster-ramsize=127872
    ERROR: unable to init localhost (400) Bad Request
    [u'The RAM Quota value is too large. Quota must be between 256 MB and 127871 MB (memory size minus 1024 MB).']

While I can see some logic to relax the 80% constraint on big machines, with the advent of 2.X features 1024MB seems far too small an amount of headroom.

Suggestions to resolve:

A) Revert to a straightforward 80% max, with a --force option or similar to allow specific customers to go higher if they know what they are doing.
B) Leave current behaviour, but document it.
C) Increase minimum headroom to something more reasonable for 2.X, *and* document the behaviour.

([1] On a machine with 128,895MB of RAM I get the "total-1024" behaviour, on a 1GB VM I get 80%. I didn't check in the code what the cutoff for 80% / total-1024 is).
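As an illustration only, here is a minimal sketch of the behaviour described above, assuming the effective cap is simply whichever of the two observed rules allows the larger quota (the real cutoff in the server code has not been checked):

#include <stdio.h>

/* Hypothetical reconstruction of the observed quota cap; not the actual
 * server code. Assumes the cap is whichever rule permits the larger quota. */
static unsigned max_quota_mb(unsigned total_ram_mb)
{
    unsigned eighty_pct = (total_ram_mb * 4) / 5;
    unsigned minus_headroom = total_ram_mb > 1024 ? total_ram_mb - 1024 : 0;
    return minus_headroom > eighty_pct ? minus_headroom : eighty_pct;
}

int main(void)
{
    /* 128,895 MB node: prints 127871, matching the CLI error message above */
    printf("128895 MB -> %u MB\n", max_quota_mb(128895));
    /* 1 GB VM: the 80% rule wins */
    printf("1024 MB   -> %u MB\n", max_quota_mb(1024));
    return 0;
}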


 Comments   
Comment by Dave Rigby [ 11/Feb/14 ]
Screenshot of initial cluster config: maximum quota is total_RAM-1024
Comment by Aleksey Kondratenko [ 11/Feb/14 ]
Do not agree with that logic.

There's IMHO quite a bit of difference between default settings, the recommended settings limit and the allowed settings limit. The latter can be wider for folks who really know what they're doing.
Comment by Aleksey Kondratenko [ 11/Feb/14 ]
Passed to Anil, because that's not my decision to change limits
Comment by Dave Rigby [ 11/Feb/14 ]
@Aleksey: I'm happy to resolve as something other than my (A,B,C), but the problem here is that many people haven't even been aware of this "extended" limit in the system - and moreover on a large system we actually advertise it in the GUI when specifying the allowed limit (see attached screenshot).

Furthermore, I *suspect* that this was originally only intended for upgrades for 1.6.X (see http://review.membase.org/#/c/4051/), but somehow is now being permitted for new clusters.

Ultimately I don't mind what our actual max quota value is, but the app behaviour should be consistent with the documentation (and the sizing advice we give people).
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
raising to product blocker.
this inconsistency has to be resolved - PM to re-align.
Comment by Anil Kumar [ 28/May/14 ]
Going with option B - Leave current behaviour, but document it.




[MB-10156] "XDCR - Cluster Compare" support tool Created: 07/Feb/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Cihan Biyikoglu Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: 2.5.1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
For the recent issues we have seen, we need a tool that can compare metadata (specifically revids) for a given replication definition in XDCR. To scale to large data sizes, being able to do this per vbucket or per doc range would be great, but we can do without these. For clarity, here is a high-level description.

Ideal case:
xdcr_compare cluster1_connectioninfo cluster1_bucketname cluster2_connectioninfo cluster2_bucketname [vbucketid] [keyrange]
should return one line per docID where the cluster1 metadata and cluster2 metadata for the given key differ:
docID - cluster1_metadata cluster2_metadata

Simplification: the tool is expected to return false positives in a moving system, but we will tackle that by rerunning the tool multiple times.
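For clarity, a rough sketch in C of the per-key comparison the tool would perform (the metadata fields shown are assumptions for illustration, not the tool's actual data structures):

#include <stdio.h>
#include <inttypes.h>

/* Hypothetical per-document metadata; the exact field set is an assumption. */
typedef struct {
    uint64_t rev_seqno; /* the "revid" */
    uint64_t cas;
    uint32_t flags;
    uint32_t expiry;
} doc_meta;

/* Emit one line per docID where the two clusters' metadata differ. */
static void report_if_different(const char *docid,
                                const doc_meta *c1, const doc_meta *c2)
{
    if (c1->rev_seqno != c2->rev_seqno || c1->cas != c2->cas ||
        c1->flags != c2->flags || c1->expiry != c2->expiry) {
        printf("%s - c1{rev=%" PRIu64 " cas=%" PRIu64 "}"
               " c2{rev=%" PRIu64 " cas=%" PRIu64 "}\n",
               docid, c1->rev_seqno, c1->cas, c2->rev_seqno, c2->cas);
    }
}

int main(void)
{
    doc_meta a = { 3, 111, 0, 0 }, b = { 2, 99, 0, 0 };
    report_if_different("doc::0001", &a, &b);
    return 0;
}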

 Comments   
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Aaron, do you have a timeline for this?
thanks
-cihan
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan,

For test automation/verification, can you list out the stats/metadata that we should be testing specifically?
we want to create/implement the tests accordingly.


Also -- is this tool de-coupled from the server package? or is this part of rpm/deb/.exe/osx build package?

Thanks,
Maria
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
This depends on the requirements; a tool that requires the manual collection of all data from all nodes in both clusters onto one machine (like we've done recently) could be done pretty quickly, but I imagine that may be difficult or entirely unfeasible for some users.

Better would be to be able to operate remotely on clusters and only look at metadata. Unfortunately there is no *currently exposed* interface to only extract metadata from the system without also retrieving values. I may be able to work around this, but the workaround is unlikely to be simple.

Also, for some users even the amount of *metadata* may be prohibitively large to transfer to one place; this too can be avoided, but again, it adds difficulty.

Q: Can the tool be JVM-based?
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
I think it would be more feasible for this to ship separately from the server package.
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan, Aaron,

If it's de-coupled, what older versions of Couchbase would this tool support? as far back as 1.8.x? pls confirm as this would expand our backward compatibility testing for this tool.
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
Well, 1.8.x didn't have XDCR or the rev field; it can't be compatible with anything older than 2.0 since it operates mostly to check things added since 2.0.

I don't know how far back it needs to go but it *definitely* needs to be able to run against 2.2
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Agree with Aaron, let's keep this lightweight. Can we depend on Aaron for testing if this will initially be just a support tool? For 3.0, we may graduate the tool to the server-shipped category.
thanks
Comment by Sangharsh Agarwal [ 27/Feb/14 ]
Cihan, Is the Spec finalized for this tool in version 2.5.1?
Comment by Cihan Biyikoglu [ 27/Feb/14 ]
Sangharsh, for 2.5.1, we wanted to make this a "Aaron tested" tool. I believe Aaron already has the tool. Aaron?
Comment by Aaron Miller (Inactive) [ 27/Feb/14 ]
Working on it; wanted to get my actually-in-the-package 2.5.1 stuff into review first.

What I do already have is a diff tool for *files*, but it is highly inconvenient to use; this should be a tool that doesn't require collecting all data files into one place in order to use it, and can instead work against a running cluster.
Comment by Maria McDuff (Inactive) [ 05/Mar/14 ]
Aaron,

Is the tool merged yet into the build? can you update pls?
Comment by Cihan Biyikoglu [ 06/Mar/14 ]
2.5.1 shiproom note: Phil raised a build concern on getting this packaged with 2.5.1. The initial bar we set was not to ship this as part of the server - it was intended to be a downloadable support tool. Aaron/Cihan will re-eval and get back to shiproom.
Comment by Cihan Biyikoglu [ 15/Jun/14 ]
Aaron no longer here. assigning to Xiaomei for consideration.




[MB-10086] Cluster-wide diagnostics gathering - collect_info from UI across cluster Created: 30/Jan/14  Updated: 09/Jun/14

Status: In Progress
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Anil Kumar Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
depends on MB-11202 [auto-collectinfo] we need to package... Resolved
depends on MB-11303 [auto-collectinfo] Rebuild Windows cu... Resolved
blocks MB-9564 Supportability: Replace "get diagnost... Resolved
Duplicate
duplicates MB-3140 generate diagnostic report should ret... Closed
duplicates MB-8519 collect info usage for large clusters Closed

 Description   
Cluster-wide diagnostics gathering - collect_info from UI across cluster.

Work required
- REST API
- UI

 Comments   
Comment by Aleksey Kondratenko [ 10/Feb/14 ]
Not happening for 3.0
Comment by Aleksey Kondratenko [ 10/Feb/14 ]
Moved out of sprint. Couldn't get this in 5 allocated days we had for logging changes
Comment by Dave Rigby [ 14/Mar/14 ]
Functional spec at: https://docs.google.com/document/d/1cPHNNIonFT33IfS5ae4_jsknjazP-gCEK69-qr9E4k8/edit#
Comment by Dave Rigby [ 30/Apr/14 ]
The relevant patches are:

ns_server:

MB-10086: Add python-requests 3rd party library - http://review.couchbase.org/36267
MB-10086: cbcollect_info upload support - http://review.couchbase.org/35896
MB-10086 [auto_collect]: Add REST endpoints & param validation - http://review.couchbase.org/34490
MB-10086 [auto_collect]: basic manager and per-node processes - http://review.couchbase.org/35456
MB-10086 [auto_collect]: Increment tasks version on start/complete - http://review.couchbase.org/36262
MB-10086 [auto_collect]: Include timestamp in zip filename; save to tmpdir - http://review.couchbase.org/36292
MB-10086 [auto_collect]: Add recommendedRefreshPeriod - http://review.couchbase.org/36293
MB-10086: auto collect-info UI - http://review.couchbase.org/34474

couchbase-cli:

MB-10086: Auto-collect logs CLI support - http://review.couchbase.org/36416
Comment by Maria McDuff (Inactive) [ 15/May/14 ]
Raising to Test Blocker as QE need to start implementing the tests.
Comment by Aleksey Kondratenko [ 03/Jun/14 ]
The new implementation is uploaded into gerrit for review. The chain ends here: http://review.couchbase.org/37826

Give us a bit more time to review the code.

Meanwhile QE can start writing tests based on the REST API (which is different from the original spec); the document is here: http://review.couchbase.org/37826

CLI support is still missing and will be added soon.
Comment by Aleksey Kondratenko [ 04/Jun/14 ]
The stuff uploaded yesterday for review is now in. It's ready for testing. But note that https support across all our platforms might not be fully functional yet. See MB-11202 and MB-11303 linked above.

Also, CLI support (both couchbase-cli support and cbcollect_info support for "human initiated" collect+upload) is still missing, and will be delivered soon.
Comment by Wayne Siu [ 09/Jun/14 ]
Removing the ticket from test blocker as QE has started testing on this feature.




[MB-10722] ep-engine gerrit jobs don't check out the latest change Created: 01/Apr/14  Updated: 04/Apr/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Mike Wiederhold Assignee: Tommie McAfee
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
I've had my changes marked verified and then merged them a few times now, and noticed that make simple-test doesn't pass when I run it. It appears that the actual change we want to test is not getting pulled into the test.

 Comments   
Comment by Thuan Nguyen [ 01/Apr/14 ]
It happens in testrunner too
Comment by Phil Labee [ 01/Apr/14 ]
Need more info. Please provide an example of a code review that passed the commit validation test, but failed in your testing after the change was submitted.
Comment by Mike Wiederhold [ 01/Apr/14 ]
http://factory.couchbase.com/job/ep-engine-gerrit-300/415/

http://review.couchbase.org/#/c/35035/
Comment by Maria McDuff (Inactive) [ 04/Apr/14 ]
Tommie,

can you take a look? looks like we may need to adjust the testrunner logic.
pls advise.




[MB-10719] Missing autoCompactionSettings during create bucket through REST API Created: 01/Apr/14  Updated: 19/Jun/14

Status: Reopened
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: michayu Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File bucket-from-API-attempt1.txt     Text File bucket-from-API-attempt2.txt     Text File bucket-from-API-attempt3.txt     PNG File bucket-from-UI.png     Text File bucket-from-UI.txt    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Unless I'm not using the API correctly, there seem to be some holes in the Couchbase API - particularly with autoCompaction.

The autoCompaction parameter can be set via the UI (as long as the bucketType is couchbase).

See the following attachments:
1) bucket-from-UI.png
2) bucket-from-UI.txt

And compare with creating the bucket (with autoCompaction) through the REST API:
1) bucket-from-API-attempt1.txt
    - Reference: http://docs.couchbase.com/couchbase-manual-2.5/cb-rest-api/#creating-and-editing-buckets
2) bucket-from-API-attempt2.txt
    - Reference: http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-admin-rest-auto-compaction
3) bucket-from-API-attempt3.txt
    - Setting autoCompaction globally
    - Reference: http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-admin-rest-auto-compaction

In all cases, autoCompactionSettings is still false.


 Comments   
Comment by Anil Kumar [ 19/Jun/14 ]
Triage - June 19 2014 Alk, parag, Anil
Comment by Aleksey Kondratenko [ 19/Jun/14 ]
It works, just apparently not properly documented:

# curl -u Administrator:asdasd -d name=other -d bucketType=couchbase -d ramQuotaMB=100 -d authType=sasl -d replicaNumber=1 -d replicaIndex=0 -d parallelDBAndViewCompaction=true -d purgeInterval=1 -d 'viewFragmentationThreshold[percentage]'=30 -d autoCompactionDefined=1 http://lh:9000/pools/default/buckets

And a general hint: you can watch what the browser POSTs when it creates a bucket (or does anything else) to figure out a working (but not necessarily publicly supported) way of doing things.
Comment by Anil Kumar [ 19/Jun/14 ]
Ruth - the documentation references above need to be fixed with the correct REST API.




[MB-10440] something isn't right with tcmalloc in build 1074 on at least rhel6 causing memcached to crash Created: 11/Mar/14  Updated: 02/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aleksey Kondratenko Assignee: Phil Labee
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Relates to
relates to MB-10371 tcmalloc must be compiled with -DTCMA... Reopened
relates to MB-10439 Upgrade:: 2.5.0-1059 to 2.5.1-1074 =>... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
SUBJ.

Just installing the latest 2.5.1 build on rhel6 and creating a bucket caused a segmentation fault (see also MB-10439).

When replacing tcmalloc with a copy I've built, it works.

I cannot be 100% sure it's tcmalloc, but the crash looks too easily reproducible to be something else.


 Comments   
Comment by Wayne Siu [ 12/Mar/14 ]
Phil,
Can you review whether this change (copied from MB-10371) has been applied properly?

voltron (2.5.1) commit: 73125ad66996d34e94f0f1e5892391a633c34d3f

    http://review.couchbase.org/#/c/34344/

passes "CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW" to each gprertools configure command
Comment by Andrei Baranouski [ 12/Mar/14 ]
see the same issue on centos 64
Comment by Phil Labee [ 12/Mar/14 ]
need more info:

1. What package did you install?

2. How did you build the tcmalloc which fixes the problem?
 
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
Build 1074. RHEL6 package.

You can see for yourself. It's easily reproducible, as Andrei also confirmed.

I've got the 2.1 tar.gz from googlecode, then did ./configure --prefix=/opt/couchbase --enable-minimal CPPFLAGS='-DTCMALLOC_SMALL_BUT_SLOW' and then make and make install. After that it works. I have no idea why.

Do you know the exact CFLAGS and CXXFLAGS that are used to build our tcmalloc? Those variables are likely set in voltron (or even from outside of voltron) and might affect optimization and therefore expose some bugs.

Comment by Aleksey Kondratenko [ 12/Mar/14 ]
And 64 bit.
Comment by Phil Labee [ 12/Mar/14 ]
We build out of:

    https://github.com/couchbase/gperftools

and for 2.5.1 use commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

compile using:

(cd /home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools \
&& ./autogen.sh \
        && ./configure --prefix=/opt/couchbase CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW --enable-minimal \
        && make \
        && make install-exec-am install-data-am)
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
That part I know. What I don't know is what cflags are being used.
Comment by Phil Labee [ 13/Mar/14 ]
from the 2.5.1 centos-6-x86 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x86-251-builder/builds/18/steps/couchbase-server%20make%20enterprise%20/logs/stdio

make[1]: Entering directory `/home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools'

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Phil Labee [ 13/Mar/14 ]
from a 2.5.1 centos-6-x64 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x64-251-builder/builds/16/steps/couchbase-server%20make%20enterprise%20/logs/stdio

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Ok. I'll try to exclude -O3 as a possible cause of the failure later today (in which case it might be an upstream bug). In the meantime I suggest you try lowering optimization to -O2. Unless you have other ideas, of course.
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Building tcmalloc with the exact same cflags (-O3) doesn't cause any crashes. At this time my guess is either a compiler bug or cosmic radiation hitting just this specific build.

Can we simply force a rebuild?
Comment by Phil Labee [ 13/Mar/14 ]
test with newer build 2.5.1-1075:

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_2.5.1-1075-rel.rpm

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_64_2.5.1-1075-rel.rpm
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Didn't help unfortunately. Is that still with -O3 ?
Comment by Phil Labee [ 14/Mar/14 ]
still using -O3. There are extensive comments in the voltron Makefile warning against changing to -O2
Comment by Phil Labee [ 14/Mar/14 ]
Did you try to build gperftools out of our repo?
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
The following is not true:

Got myself centos 6.4. And with its gcc and -O3 I'm finally able to reproduce the issue.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
So I've got myself centos 6.4 and the _exact same compiler version_. And when I build tcmalloc myself with all the right flags and replace the tcmalloc from the package, it works. Without replacing it, it crashes.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Phil, please clean ccache, reboot the builder host (to clean the page cache) and _then_ do another rebuild. Looking at the build logs it looks like ccache is being used, so my suspicion about RAM corruption is not fully excluded yet. And I don't have many other ideas.
Comment by Phil Labee [ 14/Mar/14 ]
cleared ccache and restarted centos-6-x86-builder, centos-6-x64-builder

started build 2.5.1-1076
Comment by Pavel Paulau [ 14/Mar/14 ]
2.5.1-1076 seems to be working, it warns about "SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER" as well.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Maybe I'm doing something wrong but it fails in exact same way on my VM
Comment by Pavel Paulau [ 14/Mar/14 ]
Sorry, it crashed eventually.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Confirmed again. Everything is exactly the same as before. Build 1076 centos 6.4 amd64 crashes very easily, both enterprise edition and community. And it doesn't crash if I replace tcmalloc with the one I've built - exact same source, exact same flags and exact same compiler version.

Build 1071 doesn't crash. All of this 100% consistently.
Comment by Phil Labee [ 17/Mar/14 ]
possibly a difference in build environment

reference env is described in voltron README.md file

for centos-6 X64 (6.4 final) we use the defaults for these tools:


gcc-4.4.7-3.el6 ( 4.4.7-4 available)
gcc-c++-4.4.7-3 ( 4.4.7-4 available)
kernel-devel-2.6.32-358 ( 2.6.32-431.5.1 available)
openssl-devel-1.0.0-27.el6_4.2 ( 1.0.1e-16.el6_5.4 available)
rpm-build-4.8.0-32 ( 4.8.0-37 available)

these tools do not have an update:

scons-2.0.1-1
libtool-2.2.6-15.5

For all centos these specific versions are installed:

gcc, g++ 4.4, currently 4.4.7-3, 4.4.7-4 available
autoconf 2.65, currently 2.63-5 (no update available)
automake 1.11.1
libtool 2.4.2
Comment by Phil Labee [ 17/Mar/14 ]
downloaded gperftools-2.1.tar.gz from

    http://gperftools.googlecode.com/files/gperftools-2.1.tar.gz

and expanded into directory: gperftools-2.1

cloned https://github.com/couchbase/gperftools.git at commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

into directory gperftools, and compared:

=> diff -r gperftools-2.1 gperftools
Only in gperftools: .git
Only in gperftools: autogen.sh
Only in gperftools/doc: pprof.see_also
Only in gperftools/src/windows: TODO
Only in gperftools/src/windows: google

Only in gperftools-2.1: Makefile.in
Only in gperftools-2.1: aclocal.m4
Only in gperftools-2.1: compile
Only in gperftools-2.1: config.guess
Only in gperftools-2.1: config.sub
Only in gperftools-2.1: configure
Only in gperftools-2.1: depcomp
Only in gperftools-2.1: install-sh
Only in gperftools-2.1: libtool
Only in gperftools-2.1: ltmain.sh
Only in gperftools-2.1/m4: libtool.m4
Only in gperftools-2.1/m4: ltoptions.m4
Only in gperftools-2.1/m4: ltsugar.m4
Only in gperftools-2.1/m4: ltversion.m4
Only in gperftools-2.1/m4: lt~obsolete.m4
Only in gperftools-2.1: missing
Only in gperftools-2.1/src: config.h.in
Only in gperftools-2.1: test-driver
Comment by Phil Labee [ 17/Mar/14 ]
Since the build files in your source are different than in the production build, we can't really say we're using the same source.

Please build from our repo and re-try your test.
Comment by Aleksey Kondratenko [ 17/Mar/14 ]
The difference is in the autotools products. I _cannot_ build using the same autotools that are present on the build machine unless I'm given access to that box.
Comment by Aleksey Kondratenko [ 17/Mar/14 ]
The _source_ is exactly the same.
Comment by Phil Labee [ 17/Mar/14 ]
I've given the versions of autotools to use, so you could make your build environment in line with the production builds.

As a shortcut, I've submitted a request for a clone of the builder VM that you can experiment with.

See CBIT-1053
Comment by Wayne Siu [ 17/Mar/14 ]
The cloned builder is available. Info in CBIT-1053.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built tcmalloc from the exact copy in the builder directory.

Installed the package from inside the builder directory (build 1077). Verified that the problem exists. Stopped the service. Replaced tcmalloc. Observed that everything is fine.

Something in the environment is causing this. Maybe unusual ldflags or something else. But _not_ the source.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built the full rpm package under the buildbot user, with the exact same make invocation as I see in the buildbot logs. And the resultant package works. Weird indeed.
Comment by Phil Labee [ 18/Mar/14 ]
some differences between test build and production build:


1) In gperftools, production calls "make install-exec-am install-data-am" while test calls "make install" which executes extra step "all-am"

2) In ep-engine, production uses "make install" while test uses "make"

3) The test builds as user "root" while production builds as user "buildbot", so PATH and other env.vars may be different.

In general it's hard to tell what steps were performed for the test build, as no output logfiles have been captured.
Comment by Wayne Siu [ 21/Mar/14 ]
Updated from Phil:
comment:
________________________________________

2.5.1-1082 was done without the tcmalloc flag: CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW

    http://review.couchbase.org/#/c/34755/


2.5.1-1083 was done with build step timeout increased from 60 minutes to 90

2.5.1-1084 was done with the tcmalloc flag restored:

    http://review.couchbase.org/#/c/34792/
Comment by Andrei Baranouski [ 23/Mar/14 ]
 2.5.1-1082 MB-10545 Vbucket map is not ready after 60 seconds
Comment by Meenakshi Goel [ 24/Mar/14 ]
A memcached crash with a segmentation fault is observed with build 2.5.1-1084-rel on Ubuntu 12.04 during Auto Compaction tests.

Jenkins Link:
http://qa.sc.couchbase.com/view/2.5.1%20centos/job/centos_x64--00_02--compaction_tests-P0/56/consoleFull

root@jackfruit-s12206:/tmp# gdb /opt/couchbase/bin/memcached core.memcached.8276
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /opt/couchbase/bin/memcached...done.
[New LWP 8301]
[New LWP 8302]
[New LWP 8599]
[New LWP 8303]
[New LWP 8604]
[New LWP 8299]
[New LWP 8601]
[New LWP 8600]
[New LWP 8602]
[New LWP 8287]
[New LWP 8285]
[New LWP 8300]
[New LWP 8276]
[New LWP 8516]
[New LWP 8603]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
Program terminated with signal 11, Segmentation fault.
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
298 src/central_freelist.cc: No such file or directory.
(gdb) t a a bt

Thread 15 (Thread 0x7f3568039700 (LWP 8603)):
#0 0x00007f356f01b9fa in __lll_unlock_wake () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f018104 in _L_unlock_644 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f018063 in pthread_mutex_unlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c663d6 in Mutex::release (this=0x5f68250) at src/mutex.cc:94
#4 0x00007f3569c9691f in unlock (this=<optimized out>) at src/locks.hh:58
#5 ~LockHolder (this=<optimized out>, __in_chrg=<optimized out>) at src/locks.hh:41
#6 fireStateChange (to=<optimized out>, from=<optimized out>, this=<optimized out>) at src/warmup.cc:707
#7 transition (force=<optimized out>, to=<optimized out>, this=<optimized out>) at src/warmup.cc:685
#8 Warmup::initialize (this=<optimized out>) at src/warmup.cc:413
#9 0x00007f3569c97f75 in Warmup::step (this=0x5f68258, d=..., t=...) at src/warmup.cc:651
#10 0x00007f3569c2644a in Dispatcher::run (this=0x5e7f180) at src/dispatcher.cc:184
#11 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5f68258) at src/dispatcher.cc:28
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 14 (Thread 0x7f356a705700 (LWP 8516)):
#0 0x00007f356ed0d83d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ed3b774 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f3569c65445 in updateStatsThread (arg=<optimized out>) at src/memory_tracker.cc:31
#3 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 13 (Thread 0x7f35703e8740 (LWP 8276)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e000, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e000, flags=<optimized out>) at event.c:1558
#3 0x000000000040c9e6 in main (argc=<optimized out>, argv=<optimized out>) at daemon/memcached.c:7996

Thread 12 (Thread 0x7f356c709700 (LWP 8300)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e280, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e280, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16814f8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 11 (Thread 0x7f356e534700 (LWP 8285)):
#0 0x00007f356ed348bd in read () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ecc8ff8 in _IO_file_underflow () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f356ecca03e in _IO_default_uflow () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f356ecbe18a in _IO_getline_info () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f356ecbd06b in fgets () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f356e535b19 in fgets (__stream=<optimized out>, __n=<optimized out>, __s=<optimized out>) at /usr/include/bits/stdio2.h:255
#6 check_stdin_thread (arg=<optimized out>) at extensions/daemon/stdin_check.c:37
#7 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 10 (Thread 0x7f356d918700 (LWP 8287)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
---Type <return> to continue, or q <return> to quit---

#1 0x00007f356db32176 in logger_thead_main (arg=<optimized out>) at extensions/loggers/file_logger.c:368
#2 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()

Thread 9 (Thread 0x7f3567037700 (LWP 8602)):
#0 SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:32
#1 0x00007f3569c6351c in lock (this=<optimized out>) at src/atomic.hh:282
#2 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#3 gimme (this=<optimized out>) at src/atomic.hh:396
#4 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#5 KVShard::getBucket (this=0x7a6e7c0, id=256) at src/kvshard.cc:58
#6 0x00007f3569c9231d in VBucketMap::getBucket (this=0x614a448, id=256) at src/vbucketmap.cc:40
#7 0x00007f3569c314ef in EventuallyPersistentStore::getVBucket (this=<optimized out>, vbid=256, wanted_state=<optimized out>) at src/ep.cc:475
#8 0x00007f3569c315f6 in EventuallyPersistentStore::firePendingVBucketOps (this=0x614a400) at src/ep.cc:488
#9 0x00007f3569c41bb1 in EventuallyPersistentEngine::notifyPendingConnections (this=0x5eb8a00) at src/ep_engine.cc:3474
#10 0x00007f3569c41d63 in EvpNotifyPendingConns (arg=0x5eb8a00) at src/ep_engine.cc:1182
#11 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f3565834700 (LWP 8600)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7e1c0) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7e204) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 7 (Thread 0x7f3566035700 (LWP 8601)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7fa40) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7fa84) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f356cf0a700 (LWP 8299)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e500, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e500, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x1681400) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f3567838700 (LWP 8604)):
#0 0x00007f356f01b89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f017065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f016eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c6635a in Mutex::acquire (this=0x5e7f890) at src/mutex.cc:79
#4 0x00007f3569c261f8 in lock (this=<optimized out>) at src/locks.hh:48
#5 LockHolder (m=..., this=<optimized out>) at src/locks.hh:26
---Type <return> to continue, or q <return> to quit---
#6 Dispatcher::run (this=0x5e7f880) at src/dispatcher.cc:138
#7 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5e7f898) at src/dispatcher.cc:28
#8 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f356af06700 (LWP 8303)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e780, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e780, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16817e0) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f3565033700 (LWP 8599)):
#0 0x00007f356ed18267 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f3569c13997 in SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:35
#2 0x00007f3569c63e57 in lock (this=<optimized out>) at src/atomic.hh:282
#3 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#4 gimme (this=<optimized out>) at src/atomic.hh:396
#5 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#6 KVShard::getVBucketsSortedByState (this=0x7a6e7c0) at src/kvshard.cc:75
#7 0x00007f3569c5d494 in Flusher::getNextVb (this=0x168d040) at src/flusher.cc:232
#8 0x00007f3569c5da0d in doFlush (this=<optimized out>) at src/flusher.cc:211
#9 Flusher::step (this=0x5ff7010, tid=21) at src/flusher.cc:152
#10 0x00007f3569c69034 in ExecutorThread::run (this=0x5e7e8c0) at src/scheduler.cc:159
#11 0x00007f3569c6963d in launch_executor_thread (arg=0x5ff7010) at src/scheduler.cc:36
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f356b707700 (LWP 8302)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8ea00, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8ea00, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16816e8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f356bf08700 (LWP 8301)):
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
#1 0x00007f356f23ef19 in tcmalloc::CentralFreeList::FetchFromSpansSafe (this=0x7f356f45d780) at src/central_freelist.cc:283
#2 0x00007f356f23efb7 in tcmalloc::CentralFreeList::RemoveRange (this=0x7f356f45d780, start=0x7f356bf07268, end=0x7f356bf07260, N=4) at src/central_freelist.cc:263
#3 0x00007f356f2430b5 in tcmalloc::ThreadCache::FetchFromCentralCache (this=0xf5d298, cl=9, byte_size=128) at src/thread_cache.cc:160
#4 0x00007f356f239fa3 in Allocate (this=<optimized out>, cl=<optimized out>, size=<optimized out>) at src/thread_cache.h:364
#5 do_malloc_small (size=128, heap=<optimized out>) at src/tcmalloc.cc:1088
#6 do_malloc_no_errno (size=<optimized out>) at src/tcmalloc.cc:1095
#7 (anonymous namespace)::cpp_alloc (size=128, nothrow=<optimized out>) at src/tcmalloc.cc:1423
#8 0x00007f356f249538 in tc_new (size=139867476842368) at src/tcmalloc.cc:1601
#9 0x00007f3569c2523e in Dispatcher::schedule (this=0x5e7f880,
    callback=<error reading variable: DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>, outtid=0x6127930, priority=...,
    sleeptime=<optimized out>, isDaemon=true, mustComplete=false) at src/dispatcher.cc:243
#10 0x00007f3569c84c1a in TapConnNotifier::start (this=0x6127920) at src/tapconnmap.cc:66
---Type <return> to continue, or q <return> to quit---
#11 0x00007f3569c42362 in EventuallyPersistentEngine::initialize (this=0x5eb8a00, config=<optimized out>) at src/ep_engine.cc:1415
#12 0x00007f3569c42616 in EvpInitialize (handle=0x5eb8a00,
    config_str=0x7f356bf07993 "ht_size=3079;ht_locks=5;tap_noop_interval=20;max_txn_size=10000;max_size=1491075072;tap_keepalive=300;dbname=/opt/couchbase/var/lib/couchbase/data/default;allow_data_loss_during_shutdown=true;backend="...) at src/ep_engine.cc:126
#13 0x00007f356cf0f86a in create_bucket_UNLOCKED (e=<optimized out>, bucket_name=0x7f356bf07b80 "default", path=0x7f356bf07970 "/opt/couchbase/lib/memcached/ep.so", config=<optimized out>,
    e_out=<optimized out>, msg=0x7f356bf07560 "", msglen=1024) at bucket_engine.c:711
#14 0x00007f356cf0faac in handle_create_bucket (handle=<optimized out>, cookie=0x5e4bc80, request=<optimized out>, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2168
#15 0x00007f356cf10229 in bucket_unknown_command (handle=0x7f356d1171c0, cookie=0x5e4bc80, request=0x5e44000, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2478
#16 0x0000000000412c35 in process_bin_unknown_packet (c=<optimized out>) at daemon/memcached.c:2911
#17 process_bin_packet (c=<optimized out>) at daemon/memcached.c:3238
#18 complete_nread_binary (c=<optimized out>) at daemon/memcached.c:3805
#19 complete_nread (c=<optimized out>) at daemon/memcached.c:3887
#20 conn_nread (c=0x5e4bc80) at daemon/memcached.c:5744
#21 0x0000000000406e45 in event_handler (fd=<optimized out>, which=<optimized out>, arg=0x5e4bc80) at daemon/memcached.c:6012
#22 0x00007f356fd9948c in event_process_active_single_queue (activeq=<optimized out>, base=<optimized out>) at event.c:1308
#23 event_process_active (base=<optimized out>) at event.c:1375
#24 event_base_loop (base=0x5e8ec80, flags=<optimized out>) at event.c:1572
#25 0x0000000000415584 in worker_libevent (arg=0x16815f0) at daemon/thread.c:301
#26 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#27 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#28 0x0000000000000000 in ?? ()
(gdb)
Comment by Aleksey Kondratenko [ 25/Mar/14 ]
Yesterday I took that consistently failing ubuntu build and played with it on my box.

It is exactly the same situation. Replacing libtcmalloc.so makes it work.

So I've spent the afternoon running what's in our actual package under a debugger.

I found several pieces of evidence that some object files linked into the libtcmalloc.so that we ship were built with -DTCMALLOC_SMALL_BUT_SLOW and some _were_ not.

That explains the weird crashes.

I'm unable to explain how it's possible that our builders produced such .so files. Yet.

Gut feeling is that it might be:

* something caused by ccache

* perhaps not full cleanup between builds

In order to verify that I'm asking the following:

* do a build with ccache completely disabled but with define

* do git clean -xfd inside gperftools checkout before doing build





[MB-9917] DOC - memcached should dynamically adjust the number of worker threads Created: 14/Jan/14  Updated: 25/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Trond Norbye Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
4 threads is probably not ideal for a 24 core system ;)
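For documentation purposes, here is one plausible (hypothetical) heuristic for what "dynamically adjust" could mean - scaling the worker pool with the online core count while never dropping below the old default of 4. This is only an illustration, not necessarily the exact logic memcached ships with:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical sizing heuristic, for illustration only. */
static int pick_worker_threads(void)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1) {
        cores = 1;
    }
    long threads = (cores * 3) / 4; /* leave headroom for other subsystems */
    return threads < 4 ? 4 : (int)threads;
}

int main(void)
{
    /* on a 24-core box this suggests 18 worker threads instead of 4 */
    printf("suggested worker threads: %d\n", pick_worker_threads());
    return 0;
}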

 Comments   
Comment by Anil Kumar [ 25/Mar/14 ]
Trond - can you explain whether this is a new feature in 3.0 or a fix to the older documentation?




[MB-9752] Our Redhat6 packages require /usr/bin/pkg-config dependency during installation Created: 16/Dec/13  Updated: 16/May/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aleksey Kondratenko Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
# ssh 10.6.2.115
root@10.6.2.115's password:
Last login: Mon Dec 16 12:03:06 2013 from 10.17.17.191
[root@centos-64-x64 ~]# cat /etc/redhat-release
CentOS release 6.4 (Final)
[root@centos-64-x64 ~]# wget http://builder.hq.couchbase.com/get/couchbase-server-enterprise_centos6_x86_64_2.5.0-1011-rel.rpm
--2013-12-16 12:15:08-- http://builder.hq.couchbase.com/get/couchbase-server-enterprise_centos6_x86_64_2.5.0-1011-rel.rpm
Resolving builder.hq.couchbase.com... 10.1.0.118
Connecting to builder.hq.couchbase.com|10.1.0.118|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://cbfs-ext.hq.couchbase.com/builds/couchbase-server-enterprise_centos6_x86_64_2.5.0-1011-rel.rpm [following]
--2013-12-16 12:15:08-- http://cbfs-ext.hq.couchbase.com/builds/couchbase-server-enterprise_centos6_x86_64_2.5.0-1011-rel.rpm
Resolving cbfs-ext.hq.couchbase.com... 10.1.0.118
Reusing existing connection to builder.hq.couchbase.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: “couchbase-server-enterprise_centos6_x86_64_2.5.0-1011-rel.rpm”

    [ <=> ] 86,702,380 40.4M/s in 2.0s

2013-12-16 12:15:10 (40.4 MB/s) - “couchbase-server-enterprise_centos6_x86_64_2.5.0-1011-rel.rpm” saved [86702380]

[root@centos-64-x64 ~]# ls -l /opt
total 0
[root@centos-64-x64 ~]# rpm -i couchbase-server-enterprise_centos6_x86_64_2.5.0-1011-rel.rpm
error: Failed dependencies:
/usr/bin/pkg-config is needed by couchbase-server-2.5.0-1011.x86_64
[root@centos-64-x64 ~]#


 Comments   
Comment by Aleksey Kondratenko [ 16/Dec/13 ]
cc-ed some key build people
Comment by Bin Cui [ 16/Dec/13 ]
I think this dependency was added in MB-8852, where we wanted to check whether the right openssl library is installed on redhat6 or not.
Comment by Phil Labee [ 16/Dec/13 ]
Can I just modify the server-rpm.spec.tmpl file to include:

    Requires: shadow-utils, pkg-config, openssl >= @@LIB_OPENSSL@@
Comment by Bin Cui [ 16/Dec/13 ]
When the rpm is installed, pkg-config is needed to interpret the Requires statement. So pkg-config is a prerequisite for rpm to run correctly.
Comment by Aleksey Kondratenko [ 16/Dec/13 ]
>> When the rpm is installed, pkg-config is needed to interpret the Requires statement. So pkg-config is a prerequisite for rpm to run correctly.

Can you elaborate please? Because:

a) our philosophy has _always_ been to have minimal dependencies, and certainly nothing beyond the default install

b) requiring a developer's tool just to install a production package sounds very weird

Comment by Phil Labee [ 16/Dec/13 ]
pkg-config has been a pre-requisite since 2.2.0. See:

    https://www.couchbase.com/issues/browse/MB-8925

Comment by Aleksey Kondratenko [ 16/Dec/13 ]
Perhaps I'm missing something but I see no technical reason to use pkg-config to check openssl version.
Comment by Aleksey Kondratenko [ 16/Dec/13 ]
And I actually failed to find any place where this is used. Plus rpm already requires openssl >= 1.0.0 and there's no point "checking" anything. IMO.
Comment by Wayne Siu [ 17/Dec/13 ]
Anil,
Please review the ticket to see if we want to include it to fix in the 2.5 release.
Comment by Aleksey Kondratenko [ 17/Dec/13 ]
Sorry, folks, but I just cannot agree with major here.

I've seen no evidence at all that pkgconfig is actually used. Let alone needed.
Comment by Bin Cui [ 17/Dec/13 ]
It is not that we use or refer to pkg-config somewhere in our codebase, but the rpm package manager does. As soon as we specify the dependency on openssl in our rpm package specification, the rpm package manager will need pkg-config to check whether the specified library is installed or not.
Comment by Aleksey Kondratenko [ 17/Dec/13 ]
Then we're doing something wrong. Tons of software that's installed out of the box does have dependencies. Some may even have a dependency on openssl. Yet the default installation does not have pkgconfig.
Comment by Aleksey Kondratenko [ 17/Dec/13 ]
Here's what I'm seeing in dependencies of libcurl on freshly installed centos 6 x86_64 box:

/sbin/ldconfig
/sbin/ldconfig
libc.so.6()(64bit)
libc.so.6(GLIBC_2.11)(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.3)(64bit)
libc.so.6(GLIBC_2.3.4)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libc.so.6(GLIBC_2.7)(64bit)
libcom_err.so.2()(64bit)
libcurl.so.4()(64bit)
libdl.so.2()(64bit)
libgssapi_krb5.so.2()(64bit)
libgssapi_krb5.so.2(gssapi_krb5_2_MIT)(64bit)
libidn.so.11()(64bit)
libidn.so.11(LIBIDN_1.0)(64bit)
libk5crypto.so.3()(64bit)
libkrb5.so.3()(64bit)
libldap-2.4.so.2()(64bit)
libnspr4.so()(64bit)
libnss3.so()(64bit)
libnss3.so(NSS_3.10)(64bit)
libnss3.so(NSS_3.12)(64bit)
libnss3.so(NSS_3.12.1)(64bit)
libnss3.so(NSS_3.12.5)(64bit)
libnss3.so(NSS_3.2)(64bit)
libnss3.so(NSS_3.3)(64bit)
libnss3.so(NSS_3.4)(64bit)
libnss3.so(NSS_3.5)(64bit)
libnss3.so(NSS_3.9.2)(64bit)
libnss3.so(NSS_3.9.3)(64bit)
libnssutil3.so()(64bit)
libplc4.so()(64bit)
libplds4.so()(64bit)
libpthread.so.0()(64bit)
libpthread.so.0(GLIBC_2.2.5)(64bit)
librt.so.1()(64bit)
librt.so.1(GLIBC_2.2.5)(64bit)
libsmime3.so()(64bit)
libssh2(x86-64) >= 1.4.2
libssh2.so.1()(64bit)
libssl3.so()(64bit)
libssl3.so(NSS_3.11.4)(64bit)
libssl3.so(NSS_3.2)(64bit)
libssl3.so(NSS_3.4)(64bit)
libz.so.1()(64bit)
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rtld(GNU_HASH)
rpmlib(PayloadIsXz) <= 5.2-1

You can see how it depends on openssl libraries but doesn't require pkgconfig
Comment by Aleksey Kondratenko [ 17/Dec/13 ]
Actually libcurl is built against libnss. But wget is linked to libssl. And here's what it's asking for:

bin/sh
/bin/sh
/sbin/install-info
/sbin/install-info
config(wget) = 1.12-1.8.el6
libc.so.6()(64bit)
libc.so.6(GLIBC_2.11)(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.3)(64bit)
libc.so.6(GLIBC_2.3.4)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libc.so.6(GLIBC_2.8)(64bit)
libcrypto.so.10()(64bit)
libdl.so.2()(64bit)
librt.so.1()(64bit)
librt.so.1(GLIBC_2.2.5)(64bit)
libssl.so.10()(64bit)
libz.so.1()(64bit)
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rtld(GNU_HASH)
rpmlib(PayloadIsXz) <= 5.2-1

[root@centos65-64 ~]# ldd `which wget`
linux-vdso.so.1 => (0x00007fff60aba000)
libssl.so.10 => /usr/lib64/libssl.so.10 (0x00007fdc46def000)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00007fdc46a10000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fdc4680b000)
libz.so.1 => /lib64/libz.so.1 (0x00007fdc465f5000)
librt.so.1 => /lib64/librt.so.1 (0x00007fdc463ed000)
libc.so.6 => /lib64/libc.so.6 (0x00007fdc46058000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007fdc45e14000)
libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007fdc45b2e000)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007fdc45929000)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007fdc456fd000)
/lib64/ld-linux-x86-64.so.2 (0x00007fdc4705f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdc454e0000)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007fdc452d4000)
libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007fdc450d1000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fdc44eb7000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fdc44c97000)
[root@centos65-64 ~]# rpm -qf /usr/lib64/libssl.so.10
openssl-1.0.1e-15.el6.x86_64
[root@centos65-64 ~]#




[MB-9632] diag / master events captured in log file Created: 22/Nov/13  Updated: 17/Feb/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Steve Yen Assignee: Ravi Mayuram
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
The information available in the diag / master events REST stream should be captured in a log (ALE?) file and hence be available to cbcollect_info and later analysis tools.

 Comments   
Comment by Aleksey Kondratenko [ 22/Nov/13 ]
It is already available in collectinfo
Comment by Dustin Sallings (Inactive) [ 26/Nov/13 ]
If it's only available in collectinfo, then it's not available at all. We lose most of the useful information if we don't run an http client to capture it continually throughout the entire course of a test.
Comment by Aleksey Kondratenko [ 26/Nov/13 ]
Feel free to submit a patch with exact behavior you need
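
For anyone who needs this data before a patch lands, a simple capture along these lines may work; the endpoint path and credentials are assumptions based on the ticket title, not a confirmed API:

    # capture the diag / master events REST stream into a local file (assumed endpoint)
    curl -s -u Administrator:password http://localhost:8091/diag/masterEvents > master_events.log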




[MB-9358] while running concurrent queries(3-5 queries) getting 'Bucket X not found.' error from time to time Created: 16/Oct/13  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 64 bit

Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
One thread gives the correct result:
[root@localhost tuqtng]# curl 'http://10.3.121.120:8093/query?q=SELECT+META%28%29.cas+as+cas+FROM+bucket2'
{
    "resultset": [
        {
            "cas": 4.956322522514292e+15
        },
        {
            "cas": 4.956322525999292e+15
        },
        {
            "cas": 4.956322554862292e+15
        },
        {
            "cas": 4.956322832498292e+15
        },
        {
            "cas": 4.956322835757292e+15
        },
        {
            "cas": 4.956322838836292e+15
...

    ],
    "info": [
        {
            "caller": "http_response:152",
            "code": 100,
            "key": "total_rows",
            "message": "0"
        },
        {
            "caller": "http_response:154",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "405.41885ms"
        }
    ]
}

but in another thread I see:
{
    "error":
        {
            "caller": "view_index:195",
            "code": 5000,
            "key": "Internal Error",
            "message": "Bucket bucket2 not found."
        }
}

cbcollect will be attached

 Comments   
Comment by Marty Schoch [ 16/Oct/13 ]
This is a duplicate, though I can't yet find the original.

We believe that under higher load the view queries time out, which we report as "bucket not found" (it may not be possible to distinguish the two).
Comment by Iryna Mironava [ 16/Oct/13 ]
https://s3.amazonaws.com/bugdb/jira/MB-9358/447a45ae/10.3.121.120-10162013-858-diag.zip
Comment by Ketaki Gangal [ 17/Oct/13 ]
Seeing these errors and frequent tuq-server crashes on concurrent queries during typical server operations like:
- failovers
- backups
- indexing

Similar server ops with single queries, however, seem to run okay.

Note: this is a very small number of concurrent queries (3-5); users will typically have a higher level of concurrency when queries are issued at the application level.




[MB-9145] Add option to download the manual in pdf format (as before) Created: 17/Sep/13  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 2.0, 2.1.0, 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
On the documentation site there is no option to download the manual in PDF format as before. We need to add this option back.

 Comments   
Comment by Maria McDuff (Inactive) [ 18/Sep/13 ]
Needed for the 2.2.1 bug fix release.




[MB-8838] Security Improvement - Connectors to implement security improvements Created: 14/Aug/13  Updated: 19/May/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Security Improvement - Connectors to implement security improvements

Spec ToDo.




[MB-7250] Mac OS X App should be signed by a valid developer key Created: 22/Nov/12  Updated: 16/May/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0-beta-2, 2.1.0, 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: J Chris Anderson Assignee: Wayne Siu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Build_2.5.0-950.png     PNG File Screen Shot 2013-02-17 at 9.17.16 PM.png     PNG File Screen Shot 2013-04-04 at 3.57.41 PM.png     PNG File Screen Shot 2013-08-22 at 6.12.00 PM.png     PNG File ss_2013-04-03_at_1.06.39 PM.png    
Issue Links:
Dependency
depends on MB-9437 macosx installer package fails during... Closed
Relates to
relates to CBLT-104 Enable Mac developer signing on Mac b... Open

 Description   
Currently, launching the Mac OS X version tells you it's from an unidentified developer, and you have to right-click to launch the app. We can fix this.

 Comments   
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
Chris,

do you know what needs to change on the build machine to embed our developer key ?
Comment by J Chris Anderson [ 22/Nov/12 ]
I have no idea. I could start researching how to get a key from Apple but maybe after the weekend. :)
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
we can discuss this next week : ) . Thanks for reporting the issue Chris.
Comment by Steve Yen [ 26/Nov/12 ]
we'll want separate, related bugs (tasks) for other platforms, too (windows, linux)
Comment by Jens Alfke [ 30/Nov/12 ]
We need to get a developer ID from Apple; this will give us some kind of cert, and a local private key for signing.
Then we need to figure out how to get that key and cert onto the build machine, in the Keychain of the account that runs the buildbot.
Comment by Farshid Ghods (Inactive) [ 02/Jan/13 ]
The build instructions are available here:
https://github.com/couchbase/couchdbx-app
We need to add codesign as a build step there.
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Phil,

do you have any update on this ticket?
Comment by Phil Labee [ 22/Jan/13 ]
I have signing cert installed on 10.17.21.150 (MacBuild).

Change to Makefile: http://review.couchbase.org/#/c/24149/
Comment by Phil Labee [ 23/Jan/13 ]
need to change master.cfg and pass env.var. to package-mac
Comment by Phil Labee [ 29/Jan/13 ]
disregard previous. Have added signing to Xcode projects.

see http://review.couchbase.org/#/c/24273/
Comment by Phil Labee [ 31/Jan/13 ]
To test this, go to System Preferences / Security & Privacy, and on the General tab set "Allow applications downloaded from" to "Mac App Store and Identified Developers". Set this before running Couchbase Server.app for the first time. Once an app has been allowed to run, this setting is no longer checked for that app, and there doesn't seem to be a way to reset that.

What is odd is that on my system, I allowed one unsigned build to run before restricting the app run setting, and then no other unsigned builds would be checked (they would all be allowed to run). Either there is a flaw in my testing methodology, or a serious weakness in this security setting: just because one app called Couchbase Server was allowed to run should not confer this privilege on other apps with the same name. A common malware tactic is to modify a trusted app and distribute it as an update, and if the security setting keys off the app name it will do nothing to prevent that.

I'm approving this change without having satisfactorily tested it.
Comment by Jens Alfke [ 31/Jan/13 ]
Strictly speaking it's not the app name but its bundle ID, i.e. "com.couchbase.CouchbaseServer" or whatever we use.

> I allowed one unsigned build to run before restricting the app run setting, and then no other unsigned builds would be checked

By OK'ing an unsigned app you're basically agreeing to toss security out the window, at least for that app. This feature is really just a workaround for older apps. By OK'ing the app you're not really saying "yes, I trust this build of this app" so much as "yes, I agree to run this app even though I don't trust it".

> A common malware tactic is to modify a trusted app and distribute it as update

If it's a trusted app it's hopefully been signed, so the user wouldn't have had to waive signature checking for it.
Comment by Jens Alfke [ 31/Jan/13 ]
Further thought: It might be a good idea to change the bundle ID in the new signed version of the app, because users of 2.0 with strict security settings have presumably already bypassed security on the unsigned version.
Comment by Jin Lim [ 04/Feb/13 ]
Per bug scrubs, keep this a blocker since customers ran into this issue (and originally reported it).
Comment by Phil Labee [ 06/Feb/13 ]
Reverting the change so that builds can complete. The app is currently not being signed.
Comment by Farshid Ghods (Inactive) [ 11/Feb/13 ]
I suggest that for the 2.0.1 release we do this build manually.
Comment by Jin Lim [ 11/Feb/13 ]
As a one-off fix, add the signature manually and automate the required steps later in 2.0.2 or beyond.
Comment by Jin Lim [ 13/Feb/13 ]
Please move this bug to 2.0.2 after populating the required signature manually. I am lowering the severity to Critical since it is no longer a blocking issue.
Comment by Farshid Ghods (Inactive) [ 15/Feb/13 ]
Phil to upload the binary to latestbuilds , ( 2.0.1-101-rel.zip )
Comment by Phil Labee [ 15/Feb/13 ]
Please verify:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee [ 15/Feb/13 ]
uploaded:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip

I can rename it when uploading for release.
Comment by Farshid Ghods (Inactive) [ 17/Feb/13 ]
I still get the error that it is from an unidentified developer.

Comment by Phil Labee [ 18/Feb/13 ]
operator error.

I rebuilt the app, this time verifying that the codesign step occurred.

Uploaded new file to the same location:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee [ 26/Feb/13 ]
still need to perform manual workaround
Comment by Phil Labee [ 04/Mar/13 ]
release candidate has been uploaded to:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip
Comment by Wayne Siu [ 03/Apr/13 ]
Phil, looks like version 172/185 is still getting the error. My Mac version is 10.8.2
Comment by Thuan Nguyen [ 03/Apr/13 ]
Installed Couchbase Server (build 2.0.1-172, community version) on my Mac OS X 10.7.4; I only see the warning message.
Comment by Wayne Siu [ 03/Apr/13 ]
Latest version (04.03.13) : http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.zip
Comment by Maria McDuff (Inactive) [ 03/Apr/13 ]
Works in 10.7 but not in 10.8.
If we can get the fix for 10.8 by tomorrow, end of day, QE is willing to test for release on Tuesday, April 9.
Comment by Phil Labee [ 04/Apr/13 ]
The mac builds are not being automatically signed, so build 185 is not signed. The original 172 is also not signed.

Did you try

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip

to see if that was signed correctly?

Comment by Wayne Siu [ 04/Apr/13 ]
Phil,
Yes, we did try the 172-signed version. It works on 10.7 but not 10.8. Can you take a look?
Comment by Phil Labee [ 04/Apr/13 ]
I rebuilt 2.0.1-185 and uploaded a signed app to:

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.SIGNED.zip

Test on a machine that has never had Couchbase Server installed, and has the security setting to only allow Appstore or signed apps.

If you get the "Couchbase Server.app was downloaded from the internet" warning and you can click OK and install it, then this bug is fixed. The quarantining of files downloaded by a browser is part of the operating system and is not controlled by signing.
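
As a side note on the quarantine behavior mentioned above, it can be inspected independently of code signing via the quarantine extended attribute; this is only a sketch and the app path is an example:

    xattr -l "/Applications/Couchbase Server.app"                          # shows com.apple.quarantine if the download was quarantined
    xattr -d com.apple.quarantine "/Applications/Couchbase Server.app"     # clears the flag, for local testing only

This only affects the downloaded-from-the-internet prompt and has nothing to do with the Gatekeeper signature check this ticket is about.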
Comment by Wayne Siu [ 04/Apr/13 ]
Tried the 185-signed version (see attached screen shot). Same error message.
Comment by Phil Labee [ 04/Apr/13 ]
This is not an error message related to this bug.

Comment by Maria McDuff (Inactive) [ 14/May/13 ]
Per bug triage, we need to have Mac OS X 10.8 working since it is a supported platform (published on the website).
Comment by Wayne Siu [ 29/May/13 ]
Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Anil Kumar [ 31/May/13 ]
We need to address the signing key for both Windows and Mac; deferring this to the next release.
Comment by Dipti Borkar [ 08/Aug/13 ]
Please let's make sure this is fixed in 2.2.
Comment by Phil Labee [ 16/Aug/13 ]
New keys will be created using new account.
Comment by Phil Labee [ 20/Aug/13 ]
iOS Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=iOS Distribution expires Aug 12, 2014

    ~buildbot/Desktop/appledeveloper.couchbase.com/certs/ios/ios_distribution_appledeveloper.couchbase.com.cer

Identifiers:
  App IDS:
    "Couchbase Server" id=com.couchbase.*

Provisioning Profiles:
  Distribution:
    "appledeveloper.couchbase.com" type=Distribution

  ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/ios/appledevelopercouchbasecom.mobileprovision
Comment by Phil Labee [ 20/Aug/13 ]
Mac Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)
    "Couchbase, Inc." type=Developer ID installer (Aug,16,2014)
    "Couchbase, Inc." type=Developer ID Application (Aug,16,2014)
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)

     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developerID_installer.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developererID_application.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution-2.cer

Identifiers:
  App IDs:
    "Couchbase Server" id=couchbase.com.* Prefix=N2Q372V7W2
    "Coucbase Server adhoc" id=couchbase.com.* Prefix=N2Q372V7W2
    .

Provisioning Profiles:
  Distribution:
    "appstore.couchbase.com" type=Distribution
    "Couchbase Server adhoc" type=Distribution

     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/appstorecouchbasecom.privisioningprofile
     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/Couchbase_Server_adhoc.privisioningprofile

Comment by Phil Labee [ 21/Aug/13 ]

As of build 2.2.0-806 the app is signed by a new provisioning profile
Comment by Phil Labee [ 22/Aug/13 ]
 Install version 2.2.0-806 on a macosx 10.8 machine that has never had Couchbase Server installed, which has the security setting to require applications to be signed with a developer ID.
Comment by Phil Labee [ 22/Aug/13 ]
please assign to tester
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
just tried this against newest build 809:
still getting restriction message. see attached.
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
restriction still exists.
Comment by Maria McDuff (Inactive) [ 28/Aug/13 ]
verified in rc1 (build 817). still not fixed. getting same msg:
“Couchbase Server” can’t be opened because it is from an unidentified developer.
Your security preferences allow installation of only apps from the Mac App Store and identified developers.

Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Phil Labee [ 03/Sep/13 ]
Need to create new certificates to replace these that were revoked:

Certificate: Mac Development
Team Name: Couchbase, Inc.

Certificate: Mac Installer Distribution
Team Name: Couchbase, Inc.

Certificate: iOS Development
Team Name: Couchbase, Inc.

Certificate: iOS Distribution
Team Name: Couchbase, Inc.
Comment by Maria McDuff (Inactive) [ 18/Sep/13 ]
candidate for 2.2.1 bug fix release.
Comment by Dipti Borkar [ 28/Oct/13 ]
Is this going to make it into 2.5? We seem to keep deferring it.
Comment by Phil Labee [ 29/Oct/13 ]
Cannot test changes with an installer that fails.
Comment by Phil Labee [ 11/Nov/13 ]
Installed certs as buildbot and signed app with "(recommended) 3rd Party Mac Developer Application", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-001.zip

Signed with "(Oct 30) 3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-002.zip

These zip files were made on the command line, not as a result of the make command. They are 2.5G in size, so they obviously include more than the zip files produced by the make command.

Both versions of the app appear to be signed correctly!

Note: cannot run make command from ssh session. Must Remote Desktop in and use terminal shell natively.
Comment by Phil Labee [ 11/Nov/13 ]
Finally, some progress: if the zip file is made using the --symlinks argument, the app appears to be un-signed. If the symlinked files are included as copies, the app appears to be signed correctly (see the sketch below).

The zip file with symlinks is 60M, while the zip file with copies of the files is 2.5G, more than 40X the size.
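
A minimal sketch of the two packaging variants described above, with illustrative archive names:

    # store symlinks as symlinks: small archive, but the app appears un-signed after unpacking
    zip -r --symlinks couchbase-server-symlinks.zip "Couchbase Server.app"

    # follow symlinks and archive copies of their targets: much larger archive, signature verifies
    zip -r couchbase-server-copies.zip "Couchbase Server.app"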
Comment by Phil Labee [ 25/Nov/13 ]
Fixed in 2.5.0-950
Comment by Dipti Borkar [ 25/Nov/13 ]
Maria, can QE please verify this?
Comment by Wayne Siu [ 28/Nov/13 ]
Tested with build 2.5.0-950. Still see the warning box (attached).
Comment by Wayne Siu [ 19/Dec/13 ]
Phil,
Can you give an update on this?
Comment by Ashvinder Singh [ 14/Jan/14 ]
I tested the code signature with the Apple utility "spctl -a -v /Applications/Couchbase\ Server.app/" and got this output:
>>> /Applications/Couchbase Server.app/: a sealed resource is missing or invalid

I also tried running the command:

codesign -dvvvv /Applications/Couchbase\ Server.app
>>>
Executable=/Applications/Couchbase Server.app/Contents/MacOS/Couchbase Server
Identifier=com.couchbase.couchbase-server
Format=bundle with Mach-O thin (x86_64)
CodeDirectory v=20100 size=639 flags=0x0(none) hashes=23+5 location=embedded
Hash type=sha1 size=20
CDHash=868e4659f4511facdf175b44a950b487fa790dc4
Signature size=4355
Authority=3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)
Authority=Apple Worldwide Developer Relations Certification Authority
Authority=Apple Root CA
Signed Time=Jan 8, 2014, 10:59:16 AM
Info.plist entries=31
Sealed Resources version=1 rules=4 files=5723
Internal requirements count=1 size=216

It looks like the code signature was present but became invalid when new files were added or modified in the project after signing. I suggest the build team rebuild and apply the code signature again.
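
If that is the case, re-signing and re-verifying the bundle could look roughly like this; this is only a sketch, reusing the certificate name from the codesign output above, and the path is an example:

    codesign --force --deep --sign "3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)" "/Applications/Couchbase Server.app"
    codesign --verify --deep --verbose=2 "/Applications/Couchbase Server.app"
    spctl -a -v "/Applications/Couchbase Server.app"

Note that for distribution outside the Mac App Store, Gatekeeper generally expects a "Developer ID Application" certificate rather than a "3rd Party Mac Developer Application" one, which may be worth checking here as well.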
Comment by Phil Labee [ 17/Apr/14 ]
need VM to clone for developer experimentation
Comment by Phil Labee [ 16/May/14 ]
now have the macosx build running on a VM, which has now been cloned as:

    macosx-x64-server-builder-01-clone (10.6.2.159)

    login as buildbot/buildbot, either through screen sharing or ssh.

There's a snapshot of the initial state so we can roll back any changes. If you want intermediate snapshots let me know.




[MB-9415] auto-failover in seconds - (reduced from minimum 30 seconds) Created: 21/May/12  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0, 1.8.1, 2.0, 2.0.1, 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Dipti Borkar Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 2
Labels: customer, ns_server-story
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-9416 Make auto-failover near immediate whe... Technical task Open Aleksey Kondratenko  

 Description   
including no false positives

http://www.pivotaltracker.com/story/show/25006101

 Comments   
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
At the very least it requires getting our timeout-prone cases under control. So at least splitting couchdb into a separate VM is a prerequisite for this. But not necessarily enough.
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
Still seeing misunderstanding on this one.

So we have a _different_ problem: even manual failover (let alone automatic) cannot succeed quickly if the master node fails. It can easily take up to 2 minutes because of our use of the erlang "global" facility, which requires us to detect that the node is dead, and erlang is tuned to detect that within 2 minutes.

Now _this_ problem is about lowering autofailover detection to 10 seconds. We can blindly make it happen today, but it will not be usable because of all sorts of timeouts happening in the cluster management layer. We have a significant proportion of CBSEs _today_ about false-positive autofailovers even with the 30-second threshold. Clearly, lowering it to 10 will only make it worse. Therefore my point above: we have to get those timeouts under control so that heartbeats (or whatever else we use to detect a node being unresponsive) are sent and received in a timely manner.

I would like to note, however, that especially in some virtualized environments (arguably, oversubscribed ones) we saw delays as high as low tens of seconds from virtualization _alone_. Given the relatively high cost of failover in our software, I'd like to point out that people could too easily abuse that feature.

The high cost of failover referred to above is this:

* You almost certainly and irrecoverably lose some recent mutations. _At least_ recent mutations, i.e. if replication is really working well. On a node that's on the edge of autofailover you can imagine replication not being "diamond-hard quick". That's cost 1.

* In order to return the node back to the cluster (say the node crashed and needed some time to recover, whatever that might mean) you need a rebalance. That type of rebalance is relatively quick by design, i.e. it only moves data back to this node and nothing else. But it's still a rebalance. With UPR we can possibly make it better, because its failover log is capable of rewinding just the conflicting mutations.

What I'm trying to say with "our approach appears to have a relatively high price for failover" is that this appears to be an inherent issue for a strongly consistent system. In many cases it might actually be better to wait up to a few minutes for the node to recover and restore its availability than to fail it over and pay the price of restoring cluster capacity (by rebalancing this node, or its replacement, back in). If somebody wants stronger availability, then other approaches that can "reconcile" changes from both the failed-over node and its replacement look like a fundamentally better choice _for those requirements_.




[MB-4030] enable traffic for ready nodes even if not all nodes are up/healthy/ready (aka partial janitor) (was: After two nodes crashed, curr_items remained 0 after warmup for extended period of time) Created: 06/Jul/11  Updated: 20/May/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.1, 2.0, 2.0.1, 2.2.0, 2.1.1, 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
we had two nodes crash at a customer, possibly related to a disk space issue, but I don't think so.

After they crashed, the nodes warmed up relatively quickly, but immediately "discarded" their items. I say that because I see that they warmed up ~10m items, but the current item counts were both 0.

I tried shutting down the service and had to kill memcached manually (kill -9). Restarting it went through the same process of warming up and then nothing.

While I was looking around, I left it sit for a little while and magically all of the items came back. I seem to recall this bug previously where a node wouldn't be told to be active until all the nodes in the cluster were active...and it got into trouble when not all of the nodes restarted.

Diags for all nodes will be attached

 Comments   
Comment by Perry Krug [ 06/Jul/11 ]
Full set of logs at \\corp-fs1\export_support_cases\bug_4030
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
It _is_ an ns_server issue, caused by the janitor needing all nodes to be up for vbucket activation. We planned a fix for 1.8.1 (now 1.8.2).
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
Fix would land as part of fast warmup integration
Comment by Perry Krug [ 18/Jul/12 ]
Peter, can we get a second look at this one? We've seen this before, and the problem is that the janitor did not run until all nodes had joined the cluster and warmed up. I'm not sure we've fixed that already...
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
The latest 2.0 will mark nodes as green and enable memcached traffic when all of them are up. So the easy part is done.

Partial janitor (i.e. enabling traffic for some nodes when others are still down/warming up) is something that is unlikely to be done soon.
Comment by Perry Krug [ 18/Jul/12 ]
Thanks Alk...what's the difference in behavior (in this area) between 1.x and 2.0? It "sounds" like they're the same, no?

And this bug should still remain open until we fix the primary issue which is the partial janitor...correct?
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
1.8.1 will show a node as green when ep-engine thinks it's warmed up, but confusingly it will not really be ready: all vbuckets will be in the dead state and curr_items will be 0.

2.0 fixes this confusion. A node is marked green when it's actually warmed up from the user's perspective, i.e. the right vbucket states are set and it'll serve client traffic.

2.0 is still very conservative about only making vbucket state changes when all nodes are up and warmed up. That's the "impartial" janitor. Whether it's a bug or a "lack of feature" is debatable, but I think the main concern, that users are confused by the green-ness of nodes, is resolved.
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
Closing as fixed. We'll get to the partial janitor some day in the future; it's a feature we lack today, not a bug we have, IMHO.
Comment by Perry Krug [ 12/Nov/12 ]
Reopening this for the need for a partial janitor. A recent customer had multiple nodes that needed to be hard-booted, and none returned to service until all were warmed up.
Comment by Steve Yen [ 12/Nov/12 ]
bug-scrub: moving out of 2.0, as this looks like a feature req.
Comment by Farshid Ghods (Inactive) [ 13/Nov/12 ]
In system testing we have noticed many times that if multiple nodes crash, then until all nodes are warmed up, the node status for those that are already warmed up appears as yellow.

The user won't be able to tell from the console which node has successfully warmed up, and if one node is actually not recovering, or not warming up in a reasonable time, they have to figure it out some other way (cbstats ...).

Another issue with this is that the user won't be able to perform a failover for 1 node even though the other N-1 nodes have warmed up already.

I am not sure if fixing this bug will impact cluster-restore functionality, but it is something important to fix, or we should suggest a workaround to the user (by workaround I mean a documented, tested and supported set of commands).
Comment by Mike Wiederhold [ 17/Mar/13 ]
Comments say this is an ns_server issue so I am removing couchbase-bucket from affected components. Please re-add if there is a couchbase-bucket task for this issue.
Comment by Aleksey Kondratenko [ 23/Feb/14 ]
Not going to happen for 3.0.




[MB-6972] distribute couchbase-server through yum and ubuntu package repositories Created: 19/Oct/12  Updated: 23/Jun/14

Status: In Progress
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Farshid Ghods (Inactive) Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 3
Labels: devX
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks MB-8693 [Doc when ready] distribute couchbase... Open
blocks MB-7821 yum install couchbase-server from cou... Resolved
Duplicate
duplicates MB-2299 Create signed RPM's Resolved
is duplicated by MB-9409 repository for deb packages (debian&u... Resolved
Flagged:
Release Note

 Description   
This helps us in handling dependencies that are needed for Couchbase Server. The SDK team has already implemented this for various SDK packages.

We might have to make some changes to our packaging metadata to work with this scheme.

 Comments   
Comment by Steve Yen [ 26/Nov/12 ]
to 2.0.2 per bug-scrub

first step is to set up the repositories?
Comment by Steve Yen [ 26/Nov/12 ]
back to 2.0.1, per bug-scrub
Comment by Farshid Ghods (Inactive) [ 19/Dec/12 ]
Phil,
please sync up with Farshid and get instructions that Sergey and Pavel sent
Comment by Farshid Ghods (Inactive) [ 28/Jan/13 ]
we should resolve this task once 2.0.1 is released.
Comment by Dipti Borkar [ 29/Jan/13 ]
Have we figured out the upgrade process moving forward? For example, from 2.0.1 to 2.0.2, or 2.0.1 to 2.1?
Comment by Jin Lim [ 04/Feb/13 ]
Please ensure that we also confirm/validate the upgrade process moving from 2.0.1 to 2.0.2. Thanks.
Comment by Phil Labee [ 06/Feb/13 ]
Now have the DEB repo working, but another issue has come up: we need to distribute the public key so that users can install it before running apt-get.

The wiki page has been updated.
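
For illustration, the user-facing steps would presumably look something like the following; the repo URL, distribution name and key location are placeholders, not the final published locations:

    # import the repository signing key (placeholder URL)
    wget -O - http://packages.couchbase.com/ubuntu/couchbase.key | sudo apt-key add -

    # add the repository (placeholder URL and distribution)
    echo "deb http://packages.couchbase.com/ubuntu precise precise/main" | sudo tee /etc/apt/sources.list.d/couchbase.list

    # install
    sudo apt-get update && sudo apt-get install couchbase-server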
Comment by kzeller [ 14/Feb/13 ]
Added to 2.0.1 RN as:

Fix:

We now provide Couchbase Server through yum and Debian package repositories.
Comment by Matt Ingenthron [ 09/Apr/13 ]
What are the public URLs for these repositories? This was mentioned in the release notes here:
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-server-rn_2-0-0l.html
Comment by Matt Ingenthron [ 09/Apr/13 ]
Reopening, since this isn't documented that I can find. Apologies if I'm just missing it.
Comment by Dipti Borkar [ 23/Apr/13 ]
Anil, can you work with Phil to see what are the next steps here?
Comment by Anil Kumar [ 24/Apr/13 ]
Yes I'll be having discussion with Phil and will update here with details.
Comment by Tim Ray [ 28/Apr/13 ]
Could we either remove the note about yum/deb repos in the release notes or get those repo locations / sample files / keys added to public pages? The only links that seem like they might contain the info point to internal pages I don't have access to.
Comment by Anil Kumar [ 14/May/13 ]
Thanks Tim, we have removed it from the release notes. We will add instructions about the yum/deb repo locations/files/keys to the documentation once it's available. Thanks!
Comment by kzeller [ 14/May/13 ]
Removing duplicate ticket:

http://www.couchbase.com/issues/browse/MB-7860
Comment by h0nIg [ 24/Oct/13 ]
Any update? I may have created a duplicate issue (http://www.couchbase.com/issues/browse/MB-9409), but it seems the repositories at http://hub.internal.couchbase.com/confluence/display/CR/How+to+Use+a+Linux+Repo+--+debian are outdated.
Comment by Sriram Melkote [ 22/Apr/14 ]
I tried to install on Debian today. It failed badly. One .deb package didn't match the libc version of stable. The other didn't match the openssl version. Changing libc or openssl is simply not an option for someone using Debian stable because it messes with the base OS too deeply. So as of 4/23/14, we don't have support for Debian.
Comment by Sriram Melkote [ 22/Apr/14 ]
Anil, we have accumulated a lot of input in this bug. I don't think this will realistically go anywhere for 3.0 unless we define specific goals and some considered platform support matrix expansion. Can you please create a goal for 3.0 more precisely?
Comment by Matt Ingenthron [ 22/Apr/14 ]
+1 on Siri's comments. Conversations I had with both Ubuntu (who recommend their PPAs) and Red Hat experts (who recommend setting up a repo or getting into EPEL or the like) indicated that's the best way to ensure coverage of all OSs. Binary packages built on one OS and deployed on another are risky and run into dependency issues.
Comment by Anil Kumar [ 28/Apr/14 ]
This ticket is specifically for distributing DEB and RPM packages through APT and YUM repositories. We have another ticket, MB-10960, for supporting the Debian platform.
Comment by Anil Kumar [ 23/Jun/14 ]
Assigning ticket to Tony for verification.




[MB-11382] XDCR: default replicators per bucket is now 16 (earlier: 32) Created: 10/Jun/14  Updated: 25/Jun/14

Status: In Progress
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File xdcrMaxConcurrentReps.png    
Issue Links:
Gantt: start-finish
is triggered by MB-11058 Failover during data load with enable... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Per Alk, this change was intentional and made to fix a performance problem reported by Pavel (MB-11058) with just 2 replications causing "heavy CPU contention between memcached/ep-engine and XDCR replicators."

Creating this issue to vet performance of 32 vs 16 replicators per bucket and document the same if 16 is found to be ideal.

 Comments   
Comment by Aleksey Kondratenko [ 10/Jun/14 ]
Specifically, we'll need to do some testing on EC2 across continents to see the effect of 32 versus 16 versus (maybe) 64.
Comment by Cihan Biyikoglu [ 17/Jun/14 ]
Ideally this would be auto-configured based on the machine resources. That is very tough to get right given there are many machine types in use by customers, on EC2 or otherwise. However, most customers using XDCR tend to have larger core counts per node (>16), and our sizing guides recommend that as well, so a higher value may be better in this case.
I would also test this: does changing the default regress XDCR latency? If it does, we should change it back.

Comment by Pavel Paulau [ 17/Jun/14 ]
We finally unblocked WAN testing and hopefully will get some results soon.

Also, there are many other aspects that impact XDCR latency, so I wouldn't worry about the number of replicators for now.
Comment by Pavel Paulau [ 20/Jun/14 ]
As expected, WAN setups require a greater xdcrMaxConcurrentReps.

The attached chart represents a set of quick experiments:
-- 5 -> 5 nodes (~200 vbuckets/node)
-- 500 mutations per source node
-- non-optimistic unidirectional replication
-- 80±4 ms RTT

As you can see, replication latency asymptotically approaches double the RTT. A small xdcrMaxConcurrentReps causes much higher latency.

Obviously, larger setups will demonstrate different characteristics, but deployments where 16 is sufficient are hardly common.
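
For reference, the knob being varied in these experiments can presumably be adjusted through the XDCR REST settings along these lines; host, credentials and value are examples:

    curl -u Administrator:password -X POST http://localhost:8091/settings/replications \
         -d xdcrMaxConcurrentReps=32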
Comment by Cihan Biyikoglu [ 25/Jun/14 ]
what was the original reasoning for lowering this to 16?
Comment by Pavel Paulau [ 25/Jun/14 ]
Ticket description should answer your question.
Comment by Cihan Biyikoglu [ 25/Jun/14 ]
Missed that, thanks.
If the heavy CPU gives us lower latency, I think we can slip this in, but if we are not getting any consistent latency benefit, we'll get escalations to support.
I suggest we move this into blocker territory.
Comment by Aruna Piravi [ 25/Jun/14 ]
marking this as "bug" to track the change better and blocker to ensure we revert to 32.




[MB-11060] Build and test 3.0 for 32-bit Windows Created: 06/May/14  Updated: 27/Jun/14  Due: 09/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build, ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Blocker
Reporter: Chris Hillery Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7/8 32-bit

Issue Links:
Dependency
Duplicate

 Description   
For the "Developer Edition" of Couchbase Server 3.0 on Windows 32-bit, we need to first ensure that we can build 32-bit-compatible binaries. It is not possible to build 3.0 on a 32-bit machine due to the MSVC 2013 requirement. Hence we need to configure MSVC as well as Erlang on a 64-bit machine to produce 32-bit compatible binaries.

 Comments   
Comment by Chris Hillery [ 06/May/14 ]
This is assigned to Trond who is already experimenting with this. He should:

 * test being able to start the server on a 32-bit Windows 7/8 VM

 * make whatever changes are necessary to the CMake configuration or other build scripts to produce this build on a 64-bit VM

 * thoroughly document the requirements for the build team to reproduce this build

Then he can assign this bug to Chris to carry out configuring our build jobs accordingly.
Comment by Trond Norbye [ 16/Jun/14 ]
Can you give me a 32-bit Windows installation I can test on? My MSDN license has expired and I don't have Windows media available (and the internal wiki page just has a limited set of licenses and no download links).

Then assign it back to me and I'll try it
Comment by Chris Hillery [ 16/Jun/14 ]
I think you can use 172.23.106.184 - it's a 32-bit Windows 2008 VM that we can't use for 3.0 builds anyway.
Comment by Trond Norbye [ 24/Jun/14 ]
I copied the full result of a build where I set target_platform=x86 on my 64-bit Windows server (the "install" directory) over to a 32-bit Windows machine and was able to start memcached, and it worked as expected.

Our installers do other magic, like installing the service, etc., that is needed in order to start the full server. Once we have such an installer I can do further testing.
Comment by Chris Hillery [ 24/Jun/14 ]
Bin - could you take a look at this (figuring out how to make InstallShield on a 64-bit machine create a 32-bit compatible installer)? I won't likely be able to get to it for at least a month, and I think you're the only person here who still has access to an InstallShield 2010 designer anyway.




[MB-10838] cbq-engine must work without all_docs Created: 11/Apr/14  Updated: 29/Jun/14  Due: 07/Jul/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: tried builds 3.0.0-555 and 3.0.0-554

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
WORKAROUND: Run "CREATE PRIMARY INDEX ON <bucket>" once per bucket, when using 3.0 server

SYMPTOM: tuq returns "Bucket default not found." (caller: view_index:200) for all queries.

single node cluster, 2 buckets(default and standard)
run simple query
q=FROM+default+SELECT+name%2C+email+ORDER+BY+name%2Cemail+ASC

got {u'code': 5000, u'message': u'Bucket default not found.', u'caller': u'view_index:200', u'key': u'Internal Error'}
tuq displays
[root@grape-001 tuqtng]# ./tuqtng -couchbase http://localhost:8091
22:36:07.549322 Info line disabled false
22:36:07.554713 tuqtng started...
22:36:07.554856 version: 0.0.0
22:36:07.554942 site: http://localhost:8091
22:47:06.915183 ERROR: Unable to access view - cause: error executing view req at http://127.0.0.1:8092/default/_all_docs?limit=1001: 500 Internal Server Error - {"error":"noproc","reason":"{gen_server,call,[undefined,bytes,infinity]}"}
 -- couchbase.(*viewIndex).ScanRange() at view_index.go:186
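
Given the workaround above, issuing the CREATE PRIMARY INDEX statement through the same HTTP query endpoint used in the reproduction would presumably look like this; the host and bucket name are examples:

    curl 'http://localhost:8093/query?q=CREATE+PRIMARY+INDEX+ON+default'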


 Comments   
Comment by Sriram Melkote [ 11/Apr/14 ]
Iryna, can you please add cbcollectinfo or at least the couchdb logs?

Also, all CBQ DP4 testing must be done against a 2.5.x server; please confirm that is the case in this bug.
Comment by Iryna Mironava [ 22/Apr/14 ]
cbcollect
https://s3.amazonaws.com/bugdb/jira/MB-10838/9c1cf39c/172.27.33.17-4222014-111-diag.zip

The bug is valid only for 3.0; 2.5.x versions are working fine.
Comment by Sriram Melkote [ 22/Apr/14 ]
Gerald, we need to update query code to not use _all_docs for 3.0

Iryna, workaround is to run "CREATE PRIMARY INDEX ON <bucket>" first before running any queries when using 3.0 server
Comment by Sriram Melkote [ 22/Apr/14 ]
Reducing severity with workaround. Please ping me if that doesn't work
Comment by Iryna Mironava [ 22/Apr/14 ]
works with workaround
Comment by Gerald Sangudi [ 22/Apr/14 ]
Manik,

Please modify the tuqtng / DP3 Couchbase catalog to return an error telling the user to CREATE PRIMARY INDEX. This should only happen with 3.0 server. For 2.5.1 or below, #all_docs should still work.

Thanks.




[MB-11405] ~2400% CPU consumption by memcached during ongoing workload with 5 buckets Created: 11/Jun/14  Updated: 08/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-805

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File cpu.png     PNG File max_threads_cpu.png     Text File perf_b829.log     Text File perf_b854_8threads.log     Text File perf.log    
Issue Links:
Relates to
relates to MB-11434 600-800% CPU utilization by memcached... Open
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/perf-dev/424/artifact/172.23.100.17.zip
http://ci.sc.couchbase.com/job/perf-dev/424/artifact/172.23.100.18.zip
Is this a Regression?: Yes

 Description   
2 nodes, 5 buckets
1M docs (clusterwise), equally distributed, non-DGM
10K mixed ops/sec (85% reads, 1% creates, 1% deletes, 13% updates; clusterwise), equally distributed

CPU utilization in 2.5.1: ~300%
CPU utilization in 3.0.0: ~2400%



 Comments   
Comment by Pavel Paulau [ 12/Jun/14 ]
Interesting chart that shows how CPU utilization depends on #buckets (2-18) and #nodes (2, 4, 8).
Comment by Sundar Sridharan [ 16/Jun/14 ]
More nodes means fewer vbuckets per node, resulting in fewer writer tasks, which may explain the lower CPU per node.
Here is a partial fix based on the attached perf.log, which I hope will help: http://review.couchbase.org/38337
More fixes may follow if needed. Thanks.
Comment by Sundar Sridharan [ 16/Jun/14 ]
Hi Pavel, the fix to reduce the getDescription() noise has been merged.
Could you please re-run the workload and see if we still have high CPU usage, and if so, what the new profiler output looks like? Thanks.
Comment by Pavel Paulau [ 18/Jun/14 ]
Still high CPU utilization.
Comment by Sundar Sridharan [ 18/Jun/14 ]
Thanks Pavel. Looks like the getDescription() noise has gone away. However, this performance result is quite interesting: 85% of the overhead is from the kernel, most likely context switching from the higher number of threads. This will require some more creative solutions to reduce CPU usage without suffering a performance overhead.
Comment by Sundar Sridharan [ 20/Jun/14 ]
Another fix, to reduce active system CPU usage by letting only one thread snooze while the others sleep, is here: http://review.couchbase.org/38620. Thanks.
Comment by Sundar Sridharan [ 20/Jun/14 ]
Pavel, the fix has been merged. Local testing showed marginal improvement. could you please retry the test and let me know if it helps in the larger setup? thanks
Comment by Pavel Paulau [ 20/Jun/14 ]
Ok, will do. Any expected side effects?
Comment by Pavel Paulau [ 21/Jun/14 ]
I tried build 3.0.0-854, which includes your change. No impact on performance, still very high CPU utilization.

Please note that CPU consumption drops to ~400% when I decrease the number of threads from 30 (auto-tuned) to 8.
Comment by Sundar Sridharan [ 23/Jun/14 ]
Reducing the number of threads should not be the solution. The main new thing in 3.0 is that we can have 4 writer threads per bucket, so with 5 buckets we may have 20 writer threads. In 2.5 there would only be 5 writer threads for 5 buckets.
This means we should not expect less than 4 times the CPU use of 2.5; the cost of the increased CPU is what buys us lower disk write latency.
Comment by Pavel Paulau [ 23/Jun/14 ]
Fair enough.

In this case resolution criterion for this ticket should be 600% CPU utilization by memcached.
Comment by Chiyoung Seo [ 26/Jun/14 ]
Another fix was merged:

http://review.couchbase.org/#/c/38756/
Comment by Pavel Paulau [ 26/Jun/14 ]
Sorry,

The same utilization - build 3.0.0-884.
Comment by Sundar Sridharan [ 30/Jun/14 ]
A debugging fix was merged here: http://review.couchbase.org/#/c/38909/. If possible, could you please leave the cluster with this change on for some time for debugging? Thanks.
Comment by Pavel Paulau [ 30/Jun/14 ]
There might be a delay in getting results due to limited h/w resources and the upcoming beta release.
Comment by Pavel Paulau [ 01/Jul/14 ]
Assigning back to Sundar because he is working on his own test.
Comment by Pavel Paulau [ 02/Jul/14 ]
Promoting to "Blocker", it currently seems to be one of the most severe performance issues in 3.0.
Comment by Sundar Sridharan [ 02/Jul/14 ]
Pavel, could you try setting max_threads=20 and re-trying the workload to see if this reduces the CPU overhead to unblock other performance testing? thanks
Comment by Pavel Paulau [ 02/Jul/14 ]
Will do, after beta release.

But please notice that performance testing is not blocked.
Comment by Pavel Paulau [ 04/Jul/14 ]
Some interesting observations...

For the same workload I compared number of scheduler wake ups.
3.0-beta with 4 front-end threads and 30 ep-engine threads (auto-tuned):

$ perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '47284':

         7,940,880 sched:sched_wakeup

      30.000548575 seconds time elapsed

2.5.1 with default settings:

$ perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '3677':

           117,003 sched:sched_wakeup

      30.000550702 seconds time elapsed
 
Not surprisingly more write heavy workload (all ops are updates) reduces CPU utilization (down to 600-800%) and scheduling overhead:

$ perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '22699':

         4,014,534 sched:sched_wakeup

      30.000556091 seconds time elapsed

Obviously the global IO design works nicely when the IO workload is pretty aggressive and there is always work to do.
And it's absolutely crazy when there is a need to constantly put threads to sleep and wake them up, which is not uncommon.
Comment by Sundar Sridharan [ 07/Jul/14 ]
Thanks Pavel, as discussed, could you please update the ticket with the results from thread throttling on your 48 core setup?
Comment by Pavel Paulau [ 07/Jul/14 ]
btw, it has only 40 cores/vCPU.
Comment by Sundar Sridharan [ 08/Jul/14 ]
Thanks for the graph Pavel - this confirms our theory that with higher number of threads our scheduling is not able to put threads to sleep in an efficient manner.




[MB-11624] Missing curr_items(1K) after rebalance out on cluster Created: 02/Jul/14  Updated: 08/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ketaki Gangal Assignee: Ketaki Gangal
Resolution: Unresolved Votes: 0
Labels: releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build : 3.0.0-918-rel

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Test to repro:
./testrunner -i b/resources/your.ini -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_with_ops,nodes_out=3,replicas=3,items=100000,doc_ops=create:update:delete,max_verify=100000,value_size=1024,dgm_run=true,eviction_policy=fullEviction,active_resident_threshold=90

Test fails on verification of curr_items.

======================================================================
ERROR: rebalance_out_with_ops (rebalance.rebalanceout.RebalanceOutTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/rebalance/rebalanceout.py", line 87, in rebalance_out_with_ops
    self.verify_cluster_stats(self.servers[:self.num_servers - self.nodes_out])
  File "pytests/basetestcase.py", line 629, in verify_cluster_stats
    self._verify_stats_all_buckets(servers, timeout=(timeout or 120))
  File "pytests/basetestcase.py", line 411, in _verify_stats_all_buckets
    raise Exception("unable to get expected stats during {0} sec".format(timeout))
Exception: unable to get expected stats during 120 sec

----------------------------------------------------------------------

cbstats displays similar information:

vb_replica_curr_items > curr_items.

root@ubu-12301:/opt/couchbase/bin# ./cbstats localhost:11211 all | grep curr_
 curr_connections: 28
 curr_conns_on_port_11207: 4
 curr_conns_on_port_11209: 51
 curr_conns_on_port_11210: 6
 curr_items: 249999
 curr_items_tot: 500033
 curr_temp_items: 0
 vb_active_curr_items: 249999
 vb_pending_curr_items: 0
 vb_replica_curr_items: 250034

There are no pending items on any intra-cluster replication queues.

Attaching logs from the cluster

 Comments   
Comment by Ketaki Gangal [ 02/Jul/14 ]
Logs https://s3.amazonaws.com/bugdb/11624/11624.tar

I see this on 2/2 runs so far.
Comment by Ketaki Gangal [ 03/Jul/14 ]
What this test does:

1. Create 3 buckets on a 3-node cluster.
2. Load items until the active resident ratio is 90%, ~250K items.
3. Rebalance out 2 nodes.
4. Once rebalance is complete, verify items on the cluster by:
-- active item count vs. item count from the initial testrunner kvstore
-- replica item count

The test fails above because active items (1k) are missing as compared to replica items.
Comment by Chiyoung Seo [ 07/Jul/14 ]
Ketaki,

When I ran the same test on my local machine, I had different errors:

2014-07-07 12:00:55 | ERROR | MainProcess | load_gen_task | [rest_client._http_request] http://127.0.0.1:9000/pools/default/buckets/default?basic_stats=true error 404 reason: status: 404, content: Requested resource not found.
 Requested resource not found.
2014-07-07 12:00:55 | ERROR | MainProcess | load_gen_task | [rest_client.get_bucket] try to get http://127.0.0.1:9000/pools/default/buckets/default?basic_stats=true again after 1 sec
2014-07-07 12:00:56 | ERROR | MainProcess | load_gen_task | [rest_client._http_request] http://127.0.0.1:9000/pools/default/buckets/default?basic_stats=true error 404 reason: status: 404, content: Requested resource not found.
 Requested resource not found.
[('/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py', 781, '__bootstrap', 'self.__bootstrap_inner()'), ('/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py', 808, '__bootstrap_inner', 'self.run()'), ('lib/tasks/task.py', 525, 'run', 'self.next()'), ('lib/tasks/task.py', 692, 'next', 'self._unlocked_delete(partition, key)'), ('lib/tasks/task.py', 623, '_unlocked_delete', 'self.set_exception(error)'), ('lib/tasks/future.py', 264, 'set_exception', 'print traceback.extract_stack()')]
Mon Jul 7 12:00:56 2014
2014-07-07 12:00:59 | INFO | MainProcess | test_thread | [bucket_helper.delete_all_buckets_or_assert] deleted bucket : default from 127.0.0.1
2014-07-07 12:00:59 | INFO | MainProcess | test_thread | [bucket_helper.wait_for_bucket_deletion] waiting for bucket deletion to complete....
2014-07-07 12:00:59 | INFO | MainProcess | test_thread | [rest_client.bucket_exists] existing buckets : []
2014-07-07 12:00:59 | INFO | MainProcess | test_thread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [] on 127.0.0.1
2014-07-07 12:00:59 | INFO | MainProcess | test_thread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [] on 127.0.0.1
2014-07-07 12:00:59 | INFO | MainProcess | test_thread | [cluster_helper.cleanup_cluster] rebalancing all nodes in order to remove nodes
2014-07-07 12:00:59 | INFO | MainProcess | test_thread | [rest_client.rebalance] rebalance params : password=asdasd&ejectedNodes=n_2%40127.0.0.1%2Cn_1%40127.0.0.1&user=Administrator&knownNodes=n_2%40127.0.0.1%2Cn_1%40127.0.0.1%2Cn_0%4010.17.42.241
2014-07-07 12:00:59 | INFO | MainProcess | test_thread | [rest_client.rebalance] rebalance operation started
2014-07-07 12:00:59 | INFO | MainProcess | test_thread | [rest_client._rebalance_progress] rebalance percentage : 0 %
2014-07-07 12:01:09 | INFO | MainProcess | test_thread | [rest_client.monitorRebalance] rebalance progress took 10.0119791031 seconds
2014-07-07 12:01:09 | INFO | MainProcess | test_thread | [rest_client.monitorRebalance] sleep for 10 seconds after rebalance...
2014-07-07 12:01:19 | INFO | MainProcess | test_thread | [cluster_helper.cleanup_cluster] removed all the nodes from cluster associated with ip:127.0.0.1 port:9000 ssh_username:Administrator ? [(u'n_2@127.0.0.1', 9002), (u'n_1@127.0.0.1', 9001)]
2014-07-07 12:01:19 | INFO | MainProcess | test_thread | [basetestcase.sleep] sleep for 10 secs. ...
2014-07-07 12:01:29 | INFO | MainProcess | test_thread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 127.0.0.1:9000
2014-07-07 12:01:29 | INFO | MainProcess | test_thread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 127.0.0.1:9000 is running
2014-07-07 12:01:29 | INFO | MainProcess | test_thread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 127.0.0.1:9001
2014-07-07 12:01:29 | INFO | MainProcess | test_thread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 127.0.0.1:9001 is running
2014-07-07 12:01:29 | INFO | MainProcess | test_thread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 127.0.0.1:9002
2014-07-07 12:01:29 | INFO | MainProcess | test_thread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 127.0.0.1:9002 is running
2014-07-07 12:01:29 | INFO | MainProcess | test_thread | [basetestcase.tearDown] ============== basetestcase cleanup was finished for test #1 rebalance_out_with_ops ==============
Cluster instance shutdown with force

======================================================================
ERROR: rebalance_out_with_ops (rebalance.rebalanceout.RebalanceOutTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/rebalance/rebalanceout.py", line 86, in rebalance_out_with_ops
    task.result()
  File "lib/tasks/future.py", line 160, in result
    return self.__get_result()
  File "lib/tasks/future.py", line 112, in __get_result
    raise self._exception
InvalidArgumentException: controller/rebalance failed when invoked with parameters: password=asdasd&ejectedNodes=n_2%40127.0.0.1%2Cn_1%40127.0.0.1%2Cn_0%4010.17.42.241&user=Administrator&knownNodes=n_2%40127.0.0.1%2Cn_1%40127.0.0.1%2Cn_0%4010.17.42.241

----------------------------------------------------------------------
Ran 1 test in 258.641s


Can you please check it again?
Comment by Anil Kumar [ 08/Jul/14 ]
[Triage] Raising this to Blocker since this could cause data loss.




[MB-11661] mem_used is increasing and dropping in basic setup with 5 buckets Created: 07/Jul/14  Updated: 08/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-928

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File bucket-1-mem_used-cluster-wide.png     PNG File bucket-2-mem_used.png     PNG File bucket-4-mem_used.png     PNG File memcached_rss-172.23.100.17.png     PNG File memcached_rss-172.23.100.18.png     PNG File mem_used_2.5.1_vs_3.0.0.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/perf-dev/479/artifact/
Is this a Regression?: Yes

 Description   
2 nodes, 5 buckets, 200K x 1KB docs per bucket, 2K updates per bucket.

You can see that mem_used for bucket-1 increased from ~600 MB to ~1250 MB after 5 hours.

It doesn't look like a fragmentation issue, at least allocator stats don't indicate that:

MALLOC: 1575414200 ( 1502.4 MiB) Bytes in use by application
MALLOC: + 24248320 ( 23.1 MiB) Bytes in page heap freelist
MALLOC: + 77763952 ( 74.2 MiB) Bytes in central cache freelist
MALLOC: + 3931648 ( 3.7 MiB) Bytes in transfer cache freelist
MALLOC: + 27337432 ( 26.1 MiB) Bytes in thread cache freelists
MALLOC: + 7663776 ( 7.3 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 1716359328 ( 1636.8 MiB) Actual memory used (physical + swap)
MALLOC: + 1581056 ( 1.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 1717940384 ( 1638.4 MiB) Virtual address space used
MALLOC:
MALLOC: 94773 Spans in use
MALLOC: 36 Thread heaps in use
MALLOC: 8192 Tcmalloc page size

Please notice that actual RAM usage (RSS) is pretty stable.

Another issue is the dropping mem_used for bucket-2 and bucket-4, along with these errors:

Mon Jul 7 10:24:59.559952 PDT 3: (bucket-2) Total memory in memoryDeallocated() >= GIGANTOR !!! Disable the memory tracker...
Mon Jul 7 10:54:58.109779 PDT 3: (bucket-4) Total memory in memoryDeallocated() >= GIGANTOR !!! Disable the memory tracker...


 Comments   
Comment by Matt Ingenthron [ 07/Jul/14 ]
Any time you see GIGANTOR, that indicates a stats underflow. That was added back in the 1.7 days to try to catch these kinds of underflow allocation problems early.
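For illustration only, here is a minimal, hypothetical sketch of the kind of underflow guard Matt describes; the names, the threshold value, and the structure are made up for this sketch and are not the actual ep-engine memory tracker.

/* Hypothetical sketch of a GIGANTOR-style underflow guard (assumes 64-bit size_t).
 * Not the real ep-engine implementation; names and threshold are illustrative. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define GIGANTOR ((size_t)1 << 60)   /* "impossibly large" sentinel */

static size_t mem_used = 0;
static bool tracker_enabled = true;

static void memory_allocated(size_t sz) {
    if (tracker_enabled) {
        mem_used += sz;
    }
}

static void memory_deallocated(size_t sz) {
    if (!tracker_enabled) {
        return;
    }
    mem_used -= sz;              /* a double-counted free makes the unsigned value wrap */
    if (mem_used >= GIGANTOR) {  /* wrapped value is astronomically large */
        fprintf(stderr, "Total memory in memoryDeallocated() >= GIGANTOR !!! "
                        "Disable the memory tracker...\n");
        tracker_enabled = false;
    }
}

int main(void) {
    memory_allocated(100);
    memory_deallocated(100);
    memory_deallocated(100);     /* stats underflow -> tracker disables itself */
    return 0;
}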
Comment by Pavel Paulau [ 07/Jul/14 ]
Just a comparison with 2.5.1.




[MB-9494] Support Windows Server 2012 R2 in Production Created: 07/Nov/13  Updated: 02/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Anil Kumar Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Triaged

 Description   
We need to support Windows Server 2012 R2 in production for 3.0.

 Comments   
Comment by Maria McDuff (Inactive) [ 14/Mar/14 ]
FYI, still waiting for the Windows build....
Comment by Wayne Siu [ 02/Jul/14 ]
Windows builds have been available for testing.
Comment by Anil Kumar [ 02/Jul/14 ]
Awesome.
Comment by Wayne Siu [ 02/Jul/14 ]
Let's wait for Tony to confirm before we close the ticket.
Comment by Anil Kumar [ 02/Jul/14 ]
Okay, I saw it was resolved, so I went ahead and closed it.




[MB-10273] View compaction should be triggered predictably, especially with parallel compaction enabled Created: 20/Feb/14  Updated: 02/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Artem Stemkovski
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-378.

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630
Memory = 64 GB
Disk = 2 x SSD

Attachments: PNG File 3.0.0-558_couch_views_fragmentation_dgm.png     PNG File 3.0.0-558_couch_views_fragmentation_non_dgm.png     PNG File couch_total_disk_size_b731.png     PNG File couch_views_fragmentation_b731.png     PNG File couch_views_fragmentation.png     PNG File disk_data_size_2.5.1-1083.png     PNG File disk_data_size_3.0.0-859.png     PNG File fragmentation_2.5.1-1083.png     PNG File fragmentation_3.0.0-819.png     PNG File fragmentation_3.0.0-859.png     PNG File fragmentation_gaps.png     PNG File latency_query_histo.png     PNG File latency_query_lt90.png     PNG File manual_compaction.png     PDF File view_queries_comp_2.5.0_vs_3.0.0.pdf    
Issue Links:
Dependency
depends on MB-11486 Erlang memory usage increases to 50GB... Closed
Duplicate
is duplicated by MB-11523 [System Test] Very uneven view compac... Resolved
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/apollo-64/996/artifact/
Is this a Regression?: Yes

 Description   
ORIGINAL TITLE: View compaction doesn't catch up in basic non-DGM tests with view queries

4 nodes, 1 bucket x 20M x 2KB, non-DGM, 4 ddocs, 1 view per ddoc.
10K op/sec (read-heavy), 400 queries/sec.

Data and index are on different SSD drives, parallel compaction is enabled.

See attached screenshot that compares index fragmentation between 2.5 and 3.0.

I observe a similar issue with data compaction, where a much higher drain rate causes IO saturation. However, in this case:
1. Isolated index compaction is known to be faster.
2. There is enough CPU and disk bandwidth.

 Comments   
Comment by Sarath Lakshman [ 20/Feb/14 ]
Let me see if I understood the statement correctly.
Does "view compaction doesn't catch up" mean that fragmentation grows much higher, or that the updater is so fast that the compactor is unable to keep up with delta application and switch to the new database?
Comment by Sarath Lakshman [ 20/Feb/14 ]
Given that we have made view compaction much faster, the compaction process should be able to run quickly. Is there a correlation, with respect to the fragmentation graph above, between when the view compactor started and ended during this fragmentation sample interval? It might help us figure out whether this is because the compaction process is not being triggered frequently enough, or whether something else related to the updater/writer is causing more fragmentation now.
Comment by Sarath Lakshman [ 20/Feb/14 ]
In showfast, I see that the QPS has come down in 3.0 somehow. Maybe this explains it - higher fragmentation causing longer disk seeks.
Comment by Pavel Paulau [ 20/Feb/14 ]
My summary means "the view compactor was not able to keep fragmentation below the specified threshold (30%), causing larger disk utilization". Most importantly, this is a regression from 2.5.

Results from 4 different tests are posted on the showfast dashboard in the "Query" section. In the last one we measure the maximum throughput that you are referencing. In the other cases we limit query throughput to a fixed value - 400 queries/sec. The attached graph and PDF report are from the test case with capped throughput and stale=update_after queries.

In the next runs I will try to capture/report/correlate index compaction progress, though it should also be visible from the logs.
Comment by Sarath Lakshman [ 22/Feb/14 ]
What kind of workload do we have for incremental updates? Is it updating the same keys or adding additional data? From the compaction logs it looks like you are updating the same keys.
Comment by Pavel Paulau [ 22/Feb/14 ]
10K ops/sec. 80% reads, 12% updates, 4% creates, 4% deletes.
Comment by Pavel Paulau [ 26/Feb/14 ]
Also notice that in DGM cases I'm observing a regression in the higher percentiles of query latency.

For example, the <80th percentile is the same but the 90th and 95th percentiles are much greater in 3.0.

I'm not creating a separate ticket for now since it can be explained by larger index files and more frequent disk reads.
Comment by Sarath Lakshman [ 26/Feb/14 ]
After discussing with Filipe, our preliminary thought is that since the updater became fast, during a compactor run the updater creates many small sort-record files, and the delta apply of many small sort-record files causes the compacted file to become fragmented again. We might need to introduce a way to merge many small delta files before applying the changes to the new compacted index file.
Comment by Sarath Lakshman [ 14/Mar/14 ]
We already have a minor workaround that slows down the updater while the compactor runs, but I'm not sure if that helps. The real fix would be to have the compactor apply merged delta files. That fix will take a while to implement.

commit 3615009f54a6e24d81ceb24c9bbd11c29d5b84f3
Author: Filipe David Borba Manana <fdmanana@gmail.com>
Date: Wed Feb 26 15:53:26 2014 +0000

    Add throttle option for incremental index updates

    Currently the updater seems to proceed significantly faster than the
    compactor, which has a retry phase not very efficient when there's many
    small updates during the initial compaction phase, which makes the
    compactor retry phase apply too many small log files to the new index,
    causing fragmentation on the new index to be high by the time compaction
    finishes. This in turn causes the compaction scheduler to schedule yet
    another compaction shortly after a compaction finishes.

    So for now make the updater sleep for a short period before applying each
    batch if compaction is currently running and in the retry phase. Not ideal
    and this is a temporary workaround which can be disabled or adjusted at
    any time by setting the sleep interval (couch_config setting throttle_period).

    A better solution is to make the retry phase more efficient, faster and
    causing much less fragmentation. Amongst other possibilities, merging log
    files into a single batch and applying it all at once makes a big difference
    in regards to efficiency and speed - basically restoring what we had until
    revision 1969a700dd9dfe072608c7b5fcf324eac8ab008d but in a fully correct
    way.
Comment by Filipe Manana [ 14/Mar/14 ]
You don't know if that will be a "real fix". It's certainly an improvement, but might not be enough, it might be something outside the control of view engine (like too many IO threads in ep-engine).
Comment by Pavel Paulau [ 14/Mar/14 ]
Problem is still there.

Also notice that the ep-engine drain rate is ~2x faster in 3.0. Thus there will be more persisted mutations and more work for the indexer to do.
Comment by Pavel Paulau [ 09/Apr/14 ]
Please notice that it's getting worse in recent UPR-based builds.
Comment by Sarath Lakshman [ 23/May/14 ]
Changes for merging compactor delta files are merged in master branch.
http://review.couchbase.org/#/c/34981/
http://review.couchbase.org/#/c/36893/
Comment by Pavel Paulau [ 25/May/14 ]
The problem still exists.
Comment by Sarath Lakshman [ 27/May/14 ]
I think we need to do some tuning related to upr snapshot size (or finite size buffering) to decrease fragmentation. Volker was planning to work on that.
Comment by Pavel Paulau [ 10/Jun/14 ]
Promoting to product blocker:
1. This is a 3.0 regression;
2. Disk usage is not acceptable for production cases;
3. Negative impact on latency of view queries;
4. It happens with any kind of enterprise-class server.
Comment by Sarath Lakshman [ 11/Jun/14 ]
Assigning to Volker since this depends on the incremental update batching that he is planning to implement.
Comment by Sarath Lakshman [ 11/Jun/14 ]
Pavel, it would be great if you could provide a link to the test logs for latest 3.0 build that you have tried.
Comment by Pavel Paulau [ 11/Jun/14 ]
E.g., http://ci.sc.couchbase.com/job/leto/21/artifact/.
Comment by Volker Mische [ 16/Jun/14 ]
Pavel, we merged change 38308 [1]. Please rerun the test.

[1]: http://review.couchbase.org/38308
Comment by Pavel Paulau [ 16/Jun/14 ]
I would love to retest it, but the buildbot has been stuck since Saturday and it looks like the build team doesn't care much about the issue.

http://builds.hq.northscale.net:8010/builders/mac-x64-300-builder/builds/826
Comment by Wayne Siu [ 16/Jun/14 ]
The build process has been updated so that an issue on one platform (for example Mac in this case) will not stop other platforms from building.
Comment by Pavel Paulau [ 16/Jun/14 ]
I'm afraid the issue still exists.

Disk usage is a little bit better than before but still way worse than 2.5.x:

http://bit.ly/1lv9uKD

Logs:
http://ci.sc.couchbase.com/job/leto/101/artifact/
Comment by Sarath Lakshman [ 17/Jun/14 ]
Pavel, do you have logs for a 2.5 test? It might help us compare updater characteristics.
Comment by Pavel Paulau [ 17/Jun/14 ]
Sure, logs are available for all runs. This is 2.5.1-1083:
http://ci.sc.couchbase.com/job/leto/22/artifact/
Comment by Volker Mische [ 17/Jun/14 ]
I propose that we push the de-duplication even one layer lower, into the tmp files on disk. This means that all changes are just pushed into the tmp files, and once we do the actual insertion into the btree, the items get de-duplicated during the sorting phase. I'll assign Sarath to the ticket as he knows that code best.

If anyone has a better idea (other than ep-engine just returning single snapshots for every request), please feel free to post it here.
Comment by Pavel Paulau [ 17/Jun/14 ]
For Sarath:

Cluster spec:
https://raw.githubusercontent.com/couchbaselabs/perfrunner/master/clusters/leto_ssd.spec

Interrupt all active and pending jobs (you need to create an account):
http://ci.sc.couchbase.com/job/leto/

From root@172.23.100.33:/root/workspace/leto:

killall -9 python celery (usually not required, just in case).

Modify tests/query_lat_20M.test, focus on:
* cluster / initial_nodes
* load / items
* access / items
* access / throughput
* access / query_throughput
* access / time

To re-install Couchbase Server:
/tmp/env/bin/python -m perfrunner.utils.install -c clusters/leto_ssd.spec -v 3.0.0-817

To configure a new cluster:
/tmp/env/bin/python -m perfrunner.utils.cluster -c clusters/leto_ssd.spec -t tests/query_lat_20M.test

// To run the workload
/tmp/env/bin/python -m perfrunner -c clusters/leto_ssd.spec -t tests/query_lat_20M.test --local --nodebug stats.enabled.0

It's also possible to override test options via CLI arguments, e.g.:

/tmp/env/bin/python -m perfrunner.utils.cluster -c clusters/leto_ssd.spec -t tests/query_lat_20M.test cluster.initial_nodes.2
Comment by Sarath Lakshman [ 19/Jun/14 ]
In the last test logs, I saw that the MB-11472 (Part 2) problem was happening frequently. I have merged a fix for it now.
The fixes made for MB-11387 are also relevant for this issue.

It would be great if you could rerun this test and see if there is any progress and provide logs.
Comment by Sarath Lakshman [ 20/Jun/14 ]
I tried a couple of tests to compare 2.5 and 3.0. In general there are a few aspects where the view engine changes in 3.0 have different performance characteristics.
In 3.0, we have made index update and compaction much faster than in 2.5. In 2.5 we used to consume mutations from disk. KV is always fast and persistence is slower (in 3.0 the KV persistence rate became faster as well). Earlier we had a slow reader (because of disk and Erlang) and we consumed mutations from disk at a much slower rate than the KV mutation rate.

When we started consuming mutations through UPR, we essentially began trying to operate at the KV mutation rate, which is harder for the view engine to keep up with. The meaning of stale=false is different in 3.0: stale=false means operating up to speed with KV, and the view engine cannot keep up with KV speed. We are currently bottlenecked by our storage layer for writes (earlier it was Erlang and disk reads). The current compaction technique duplicates write operations, and hence we can only operate as fast as the maximum speed of couchstore btree writes. I will be working on tuning compaction with respect to the delta file apply codepath. But it should not be a blocker.

So I think we should try running the same test with a lower KV ops rate (maybe 8k ops/sec).
Comment by Pavel Paulau [ 20/Jun/14 ]
Demoting to Critical for now. Will check less aggressive workloads (< 500 mutations/sec/node).

However we still need to confirm PM expectations.
Comment by Sarath Lakshman [ 22/Jun/14 ]
The ns_server compaction daemon sometimes doesn't obey the rule of checking fragmentation every 30 seconds.
[ns_server:debug,2014-06-22T7:22:39.813,n_0@127.0.0.1:<0.6712.1>:compaction_daemon:view_needs_compaction:1154]`default/_design/test/main` data_size is 220616906, disk_size is 216703094
[ns_server:debug,2014-06-22T7:22:39.817,n_0@127.0.0.1:<0.6713.1>:compaction_daemon:view_needs_compaction:1154]`default/_design/test/replica` data_size is 0, disk_size is 4152
[ns_server:debug,2014-06-22T7:29:18.606,n_0@127.0.0.1:<0.12267.1>:compaction_daemon:view_needs_compaction:1154]`default/_design/dev_test/main` data_size is 249994787, disk_size is 1116844208


It took almost 7 minutes to perform this check, so compaction is delayed.
Comment by Sarath Lakshman [ 23/Jun/14 ]
I have some changes that might reduce the rate of fragmentation. Before we reduce the kv ops rate, please try a performance run with 10k ops.
Please pick any build with the following change.

1962b54 MB-11462 Fix flow control ack handling for remove stream message
Comment by Pavel Paulau [ 23/Jun/14 ]
The most recent build with all your changes demonstrates the same characteristics.

Indeed, compaction is not triggered at all from time to time.
I just monitored the live system and I'm 99% confident that the server has the capacity to compact whatever the view engine produces.
There is no need to tune the workload. We just need to make sure that compaction works as expected.
Comment by Pavel Paulau [ 23/Jun/14 ]
...

debug.log:[ns_server:debug,2014-06-23T17:32:54.555,ns_1@172.23.100.29:<0.9854.69>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/A/main` data_size is 165770372, disk_size is 356258882
debug.log:[ns_server:debug,2014-06-23T17:33:20.189,ns_1@172.23.100.29:<0.28258.71>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/C/main` data_size is 247847546, disk_size is 550040661
debug.log:[ns_server:debug,2014-06-23T17:33:41.505,ns_1@172.23.100.29:<0.31926.73>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/B/main` data_size is 313009412, disk_size is 600708265
debug.log:[ns_server:debug,2014-06-23T17:33:59.959,ns_1@172.23.100.29:<0.26319.75>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/D/main` data_size is 325138091, disk_size is 686384207

<----------- gap ----------->

debug.log:[ns_server:debug,2014-06-23T17:44:00.089,ns_1@172.23.100.29:<0.5863.135>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/A/main` data_size is 165911209, disk_size is 1827173458
debug.log:[ns_server:debug,2014-06-23T17:45:00.191,ns_1@172.23.100.29:<0.6030.141>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/C/main` data_size is 248072061, disk_size is 3438285900
debug.log:[ns_server:debug,2014-06-23T17:45:18.832,ns_1@172.23.100.29:<0.32724.142>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/B/main` data_size is 313234474, disk_size is 3489248322
debug.log:[ns_server:debug,2014-06-23T17:45:38.496,ns_1@172.23.100.29:<0.28889.144>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/D/main` data_size is 325385141, disk_size is 3641594954

...

debug.log:[ns_server:debug,2014-06-23T17:50:59.745,ns_1@172.23.100.29:<0.20088.176>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/A/main` data_size is 165869250, disk_size is 337347730
debug.log:[ns_server:debug,2014-06-23T17:51:21.636,ns_1@172.23.100.29:<0.26689.178>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/C/main` data_size is 247974971, disk_size is 577549412
debug.log:[ns_server:debug,2014-06-23T17:51:38.692,ns_1@172.23.100.29:<0.13950.180>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/B/main` data_size is 313108140, disk_size is 642978883
debug.log:[ns_server:debug,2014-06-23T17:51:57.425,ns_1@172.23.100.29:<0.5877.182>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/D/main` data_size is 325235082, disk_size is 656163905

<----------- gap ----------->

debug.log:[ns_server:debug,2014-06-23T18:02:14.815,ns_1@172.23.100.29:<0.13499.243>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/A/main` data_size is 166047213, disk_size is 2737833051
debug.log:[ns_server:debug,2014-06-23T18:02:28.241,ns_1@172.23.100.29:<0.25196.244>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/C/main` data_size is 248145891, disk_size is 3067339852
debug.log:[ns_server:debug,2014-06-23T18:02:45.298,ns_1@172.23.100.29:<0.15161.246>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/B/main` data_size is 313261775, disk_size is 3200218194
debug.log:[ns_server:debug,2014-06-23T18:03:06.189,ns_1@172.23.100.29:<0.19036.248>:compaction_daemon:view_needs_compaction:1154]`bucket-1/_design/D/main` data_size is 325403731, disk_size is 2464662669

...

http://ci.sc.couchbase.com/job/leto/147/artifact/
Comment by Sarath Lakshman [ 23/Jun/14 ]
Thanks Pavel. So, as I suspected, we need to look further into the ns_server compaction daemon.
Comment by Sarath Lakshman [ 24/Jun/14 ]
Assigning the ticket to Alk to take a quick look at ns_server's compaction daemon logs.

Alk, we observed that even when fragmentation is high, sometimes compaction is not scheduled. A few log entries above show the delay between view fragmentation checks.

Ketaki has reported in ticket MB-11523 that she is observing a similar problem with Couchbase KV data files as well.

Could you take a look and let us know if there is something wrong around the compaction daemon?
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
The gap is due to bucket data file compaction. You can see it yourself in the log files.

Plus, you're probably already aware that by default we don't allow concurrent KV and views compaction (there's an autocompaction settings flag for that).

And it shouldn't be any different from 2.x. Whether it's indeed any different from 2.x or not, I cannot say.

Pavel, do you do anything to measure the speed or "quality" of compactions? My guess is that if 3.0's compaction is any worse, then it should be visible at least on the average fragmentation graphs.
Comment by Pavel Paulau [ 24/Jun/14 ]
Please notice that in my tests I enable parallel data and index compaction.
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
I see. In this case the gap can still be explained if kv compaction takes longer than view compaction.
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
So do we have any evidence that in 3.0 kv compaction is slower ?
Comment by Pavel Paulau [ 24/Jun/14 ]
We had many issues with KV compaction but they were addressed. I have no evidence that it is worse in 3.0 in the most recent builds.

Why does "slower" KV compaction block view compaction if parallel compaction is enabled in the global auto-compaction settings?
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
>> Why does "slower" KV compaction block view compaction if parallel compaction is enabled in global auto-compaction settings?

Because compaction works in passes, and in a single pass it will compact both KV and views. So even if you compact in parallel and views complete before KV, it will still have to wait for KV before it's able to start the next compaction pass. It has been like this forever. Making it more flexible is possible but would complicate the code greatly.
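To make the pass-based constraint concrete, here is a tiny illustrative C sketch (the real scheduler lives in ns_server and is written in Erlang; the timings and function names below are invented): even though KV and view compaction run in parallel within a pass, the next pass cannot start until the slower of the two finishes, which is the gap Pavel observes.

/* Illustrative sketch only; not ns_server code. Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *compact_kv(void *arg)    { (void)arg; sleep(10); puts("kv compaction done");   return NULL; }
static void *compact_views(void *arg) { (void)arg; sleep(2);  puts("view compaction done"); return NULL; }

int main(void) {
    for (int pass = 1; pass <= 2; ++pass) {
        pthread_t kv, views;
        printf("compaction pass %d: starting kv and views in parallel\n", pass);
        pthread_create(&kv, NULL, compact_kv, NULL);
        pthread_create(&views, NULL, compact_views, NULL);
        /* Views finish early, but the pass only ends when kv also finishes,
         * so view fragmentation keeps growing during the remaining ~8s gap. */
        pthread_join(views, NULL);
        pthread_join(kv, NULL);
    }
    return 0;
}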
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
Or at least the ns_server team believes that trying to make it more flexible is going to complicate it greatly. And it's already not the simplest of our code.
Comment by Pavel Paulau [ 24/Jun/14 ]
The attached charts compare per-node fragmentation in 2.5.1 and two different 3.0.0 builds.

You can see that individual compactions are longer and less frequent in 3.0.0-859. It still satisfies the target fragmentation threshold (20%), though view compaction is obviously blocked. The previous stable build demonstrates absolutely different characteristics...
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
CC-ed my team. If we have to do something with the compaction daemon we still have time.

Pavel, do you have any explanation from the ep-engine team about the longer compactions?
Comment by Pavel Paulau [ 24/Jun/14 ]
cc ep-engine folks.

I'm also adding charts with actual disk usage in addition to relative fragmentation. You can see that the compaction interval is 2-3 minutes in 2.5.1, while it takes more than 10 minutes to complete a compaction in 3.0.0.

But here is the problem: in 2.5.1 disk IO is not saturated and we have enough capacity to serve KV operations and compaction. In 3.0.0 the disk is fully utilized, there are relatively long queues, and compaction obviously doesn't catch up. Please notice that separate disks are used for views.

I don't have evidence that IO utilization is optimal; most likely there are many things that the ep-engine team can improve.

But in my opinion, issues with one component should not impact other components. From an end-user perspective it sounds like a design limitation.
Comment by Sarath Lakshman [ 24/Jun/14 ]
Alk, I also enabled parallel compaction in my tests, and I see exactly what you have mentioned. Whenever fragmentation was above the threshold for views and compaction was not automatically started, I was seeing KV compaction happening. I thought that KV compaction and view compaction could operate in parallel and independently even if one of them took longer. In 3.0 we also have IO improvements and we are able to write more to disk in a short span of time, which results in a need for more frequent compactions. It is critical for us to run compactions correctly at the right time to keep up with the KV mutations coming through UPR.
Comment by Pavel Paulau [ 26/Jun/14 ]
As discussed, I will try the following experiment:

-- data (KV) is managed by ns_server as before
-- automatic index compaction is disabled
-- the test harness will periodically trigger index compaction

That will allow us to justify the need for ns_server code changes.
Comment by Sriram Melkote [ 26/Jun/14 ]
Discussed with Alk, Chiyoung, Pavel, Sarath - there are two issues here:

(a) KV compaction is taking longer in 3.0 than 2.5 -- Pavel will open a new issue to track this.
(b) When parallel compaction is enabled, view compaction should trigger more frequently.

This bug will track (b), and Alk will take a first look to see what may need to be changed to ensure parallel compactions work closer to intention.
Comment by Pavel Paulau [ 26/Jun/14 ]
Works as expected.
Comment by Aleksey Kondratenko [ 26/Jun/14 ]
Let's discuss the specific implementation today after the 16:00 meeting.
Comment by Pavel Paulau [ 02/Jul/14 ]
Raising to "Blocker" because it may mask several issues in view-engine. Please prioritize accordingly.
Comment by Artem Stemkovski [ 02/Jul/14 ]
Alk:

* we want to be able to block/unblock index compactions of a specific bucket during rebalance, ideally without touching KV compactions at all.

* we want to be able to manually start a bucket compaction that will compact both KV and views. Possibly in parallel, possibly in sequence (depending on the current per-bucket flag "enable parallel index compaction").

* we want automatic bucket compactions to not compact multiple vbuckets at a time (both across multiple buckets and across vbuckets of the same bucket).

* we want automatic bucket compactions to not compact multiple indexes at a time.

* manual compactions should be started immediately and ignore the "one vbucket/index at a time" limitation of automatic compactions. The only manual compaction we currently refuse is manual compaction of a bucket that is already being _manually_ compacted.

* we currently do allow manual and automatic compaction of the same bucket to run concurrently. If that simplifies anything, feel free to keep this logic.

* concurrent manual compactions of different buckets are currently allowed (if somebody wants to compact "now" we obey).

* Anil is currently checking the possibility of a global, rather than per-bucket, "parallel compactions" flag. It will depend on whether we documented the REST API to change this flag at the per-bucket level (notably, this is a case where bad REST API documentation makes our job easier and our behavior as a whole more practical).

* we can tweak or improve these rules. For example, it can be noted that sometime later we might support a per-bucket storage path, so folks placing different buckets on different disks should be able to compact those buckets concurrently. But note that anything requiring incompatible API changes is subject to "we're enterprise" rules.


Maybe we can achieve this in the following way:

* manual compactions can be handled by a dedicated compaction daemon just for manual compactions. It doesn't seem to need to interact with automatic compactions at all.

* automatic compactions can be split into two daemons, one for views and one for KV, and the parallel/sequential flag can control concurrency between them, which can be implemented via concurrently_throttle as we discussed.
Comment by Artem Stemkovski [ 02/Jul/14 ]
[Anil] - Considering we didn't document a REST API to change this flag at the per-bucket level (and users are unlikely to be using this feature), and it also simplifies the changes needed for fixing http://www.couchbase.com/issues/browse/MB-10273, we're good to make this a global-only setting and not per-bucket.




[MB-11623] test for performance regressions with JSON detection Created: 02/Jul/14  Updated: 07/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Matt Ingenthron Assignee: Thomas Anderson
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Related to one of the changes in 3.0, we need to test what has been implemented to see if a performance regression or unexpected resource utilization has been introduced.

In 2.x, all JSON detection was handled at the time of persistence. Since persistence was done in batch and in background, with the then current document, it would limit the resource utilization of any JSON detection.

Starting in 3.x, with the datatype/HELLO changes introduced (and currently disabled), the JSON detection has moved to both memcached and ep-engine, depending on the type of mutation.

Just to paint the reason this is a concern, here's a possible scenario.

Imagine a cluster node that is happily accepting 100,000 sets/s for a given small JSON document, and it accounts for about 20mbit of the network (small enough to not notice). That node has a fast SSD at about 8k IOPS. That means that we'd only be doing JSON detection some 5,000 times per second with Couchbase Server 2.x.

With the changes already integrated, that JSON detection may be tried over 100k times/s. That's a 20x increase. The detection needs to occur somewhere other than on the persistence path, as the contract between DCP and view engine is such that the JSON detection needs to occur before DCP transfer.

This request is to test/assess if there is a performance change and/or any unexpected resource utilization when having fast mutating JSON documents.

I'll leave it to the team to decide what the right test is, but here's what I might suggest.

With a view defined, create a test that has a small to moderate load at steady state and one fast-changing item. Test it with a set of sizes and different complexity. For instance, permutations might look something like this:
non-JSON of 1k, 8k, 32k, 128k
simple JSON of 1k, 8k, 32k, 128k
complex JSON of 1k, 8k, 32k, 128k
metrics to gather:
throughput, CPU utilization by process, RSS by process, memory allocation requests by process (or minor faults or something)

Hopefully we won't see anything to be concerned with, but it is possible.

There are options to move JSON detection to somewhere later in processing (i.e., before DCP transfer) or other optimization thoughts if there is an issue.
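As one possible starting point for measuring the per-mutation detection cost, a standalone micro-benchmark along these lines could be used. This is only a sketch: looks_like_json() is a trivial stand-in for the real detector, so treat its numbers as an optimistic bound.

/* Sketch of a micro-benchmark for per-mutation JSON detection cost.
 * looks_like_json() is a trivial stand-in for the real detector. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static int looks_like_json(const char *doc, size_t len) {
    size_t i = 0;
    while (i < len && isspace((unsigned char)doc[i])) {
        i++;
    }
    return i < len && (doc[i] == '{' || doc[i] == '[');
}

int main(void) {
    const char *doc = "{\"name\":\"value\",\"n\":1}";
    const size_t len = strlen(doc);
    const long iterations = 10 * 1000 * 1000;
    volatile int hits = 0;

    clock_t start = clock();
    for (long i = 0; i < iterations; ++i) {
        hits += looks_like_json(doc, len);   /* one "detection" per simulated mutation */
    }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%ld detections in %.2fs (%.0f/s), hits=%d\n",
           iterations, secs, iterations / secs, hits);
    return 0;
}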

 Comments   
Comment by Cihan Biyikoglu [ 07/Jul/14 ]
This is no longer needed for 3.0, is that right? Ready to postpone to 3.0.1?
Comment by Pavel Paulau [ 07/Jul/14 ]
HELLO-based negotiation was disabled but detection still happens in ep-engine.
We need to understand the impact before the 3.0 release, sooner rather than later.




[MB-11332] APPEND/PREPEND returning KEY_NOT_FOUND instead of NOT_STORED Created: 05/Jun/14  Updated: 07/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Brett Lawson Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
When performing an APPEND or PREPEND operation on a key that does not exist, pre-3.0 servers return NOT_STORED as the error code. However, as of 3.0, it appears the server has begun responding with KEY_NOT_FOUND. This is a more logical error, but it has implications for end users.

 Comments   
Comment by Matt Ingenthron [ 05/Jun/14 ]
marking as a blocker as it's an unintentional interface change and breaks existing apps
Comment by Anil Kumar [ 17/Jun/14 ]
Trond, can you please take a look since Chiyoung is OOF.
Comment by Trond Norbye [ 17/Jun/14 ]
This was explicitly done as a change from MB-10778

This is one of the situations where we have to decide what we're going to do moving forward. Should we use unique error codes to allow the app to know what happened, or should we stick with backwards compatibility "forever". I would say that moving to a new major release like 3.0 would be a good time to add more specific error codes and properly document it. Previously you just knew that "something failed". It could be that the key didn't exist, or that we encountered an error storing the new key.

I can easily add back those lines, but that would also affect the bug report mentioned. Bigger question is: when should we start fixing our technical debt?

Let me know what you want me to do.
Comment by Trond Norbye [ 17/Jun/14 ]
Btw:

commit 869a66d1d08531af65169c59b640de4546974a34
Author: Sriram Ganesan <sriram@couchbase.com>
Date: Fri Apr 11 13:46:16 2014 -0700

    MB-10778: Return item not found instead of not stored

    When an application tried to append to an item that doesn't exist,
    ep-engine needs to return not found as opposed to not stored

    Change-Id: Ic4e50b069e41028cd879530a183d3ac43a3ebc1c
    Reviewed-on: http://review.couchbase.org/35619
    Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
    Tested-by: Chiyoung Seo <chiyoung@couchbase.com>
Comment by Anil Kumar [ 19/Jun/14 ]
Please check Trond's responses.
Comment by Matt Ingenthron [ 03/Jul/14 ]
I understand Trond's response, but I don't think I can speak to this. It's a reasonable thing to change responses between 2.x and 3.0, but is PM's expectation for users who upgrade that they will need to make (albeit minor) changes to their applications?

I think that's the decision that needs to be made. Then the decision if this is a bug or not can be made.

I'm good with either and there is zero client impact in either case. The impact is on the part of end users. They'll need to potentially change the error handling in applications before upgrade.

If we do want to make this kind of change, the time to do it is when doing a major version change.
Comment by Trond Norbye [ 03/Jul/14 ]
Part of the problem here is that we adapted the "error codes" from the old memcached environment when we implemented this originally. I do feel that we should try to provide "better" error codes in order to make life easier for the end user. The user would have to extend their code to try an "add" when key_not_found is returned, in addition to handling "not stored", to cover both cluster versions. If we think adding the error is the wrong thing to do, I'm happy to revert the change.
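For application authors, the upgrade-safe handling Trond describes might look roughly like the following sketch against libcouchbase's existing error constants (the helper name is made up for illustration; only the two error codes are real).

/* Sketch only: treat both the pre-3.0 and 3.0 "missing key" answers to an
 * append/prepend the same way, then fall back to an add/set of the full value. */
#include <libcouchbase/couchbase.h>

static int append_missed_key(lcb_error_t rc)
{
    /* Pre-3.0 servers answer NOT_STORED, 3.0 answers KEY_NOT_FOUND. */
    return rc == LCB_NOT_STORED || rc == LCB_KEY_ENOENT;
}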
Comment by Cihan Biyikoglu [ 07/Jul/14 ]
I assume there would be quite a few apps handling this error. I don't think we can take this change.

To be able to take a change like this we need a backward-compat flag. The flag would help admins modify the behavior of the server. For example, the flag could be set to be compliant with 2.5.x vs 3.x on a new version of the server: if it is set to 2.5.x, we continue to throw NOTSTORED, but if the behavior is set to 3.x we can throw KEYNOTFOUND. Until we get this we cannot take breaking changes gracefully.
Comment by Trond Norbye [ 07/Jul/14 ]
It may "easily" be handled inside the client with such a property (just toggling the error code back to the generic not_stored return value). I'd rather not do it on the server (since it would introduce a new command you would have to send to the server in order to set the "compatibility" level for the connection)
Comment by Cihan Biyikoglu [ 07/Jul/14 ]
We may need both. A future-facing facility like this can ensure we get a general-purpose setting that can hide backward-compat issues for all components, like N1QL, Indexes, Views and more. There will be a whole lot more than error codes in the future.
Comment by Brett Lawson [ 07/Jul/14 ]
I'd like to add that the Node.js and libcouchbase clients currently map both errors to a single error code, such that testing for either KEY_NOT_FOUND or NOT_STORED works (though this could cause compiler errors with switch statements handling said errors).




[MB-11548] Memcached does not handle going back in time. Created: 25/Jun/14  Updated: 09/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Patrick Varley Assignee: Jim Walker
Resolution: Unresolved Votes: 0
Labels: customer, memcached
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: No

 Description   
When you change the time of the server to a time in the past while the memcached process is running, it will start expiring all documents with a TTL.

To recreate, set the date to a time in the past, for example 2 hours ago:

sudo date --set="15:56:56"

You will see that time and uptime from cbstats jump to very large values:

time: 5698679116
uptime: 4294946592

Looking at the code we can see how this happens:
http://src.couchbase.org/source/xref/2.5.1/memcached/daemon/memcached.c#6462

When you change the time to a value in the past, "process_started" will be greater than "timer.tv_sec", and since current_time is unsigned it will wrap around.

What I do not understand from the code is why current_time is the number of seconds since memcached started and not just the epoch time (there is a comment about avoiding 64-bit).

http://src.couchbase.org/source/xref/2.5.1/memcached/daemon/memcached.c#117

In any case, we should check whether "process_started" is bigger than "timer.tv_sec" and do something smart.

I will let you decide what the smart thing is :)
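A minimal standalone demo of the wraparound described above (the variable names are borrowed from memcached.c, but this is not the server code itself):

/* Demonstrates the unsigned wraparound: if the wall clock is set back so that
 * timer.tv_sec < process_started, the subtraction wraps to a huge value. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/time.h>

typedef uint32_t rel_time_t;   /* matches the "avoid 64-bit" comment */

int main(void) {
    struct timeval timer;
    gettimeofday(&timer, NULL);

    time_t process_started = timer.tv_sec;          /* pretend we started just now */
    time_t clock_after_jump = timer.tv_sec - 7200;  /* admin sets the clock back 2h */

    rel_time_t current_time = (rel_time_t)(clock_after_jump - process_started);
    printf("current_time after the jump: %u\n", (unsigned)current_time);  /* ~4294960096 */
    return 0;
}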

 Comments   
Comment by Patrick Varley [ 07/Jul/14 ]
It would be good if we could get a fix for this into 3.0. Maybe a quick patch like this is good enough for now:

static void set_current_time(void) {
    struct timeval timer;

    gettimeofday(&timer, NULL);
    if (process_started < timer.tv_sec) {
        current_time = (rel_time_t) (timer.tv_sec - process_started);
    } else {
        settings.extensions.logger->log(EXTENSION_LOG_WARNING, NULL,
            "Time has gone backward; shutting down to protect data.\n");
        shutdown_server();
    }
}


More than happy to submit the code for review.
Comment by Chiyoung Seo [ 07/Jul/14 ]
Trond,

Can you see if we can address this issue in 3.0?
Comment by Jim Walker [ 08/Jul/14 ]
It looks to me like clock_handler (which wakes up every second) should be looking for time going backwards. It samples the time every second, so it can easily see big shifts in the clock and make appropriate adjustments.

I don't think we should shut down if we can deal with it, but it does open interesting questions about TTLs and gettimeofday going backwards.

Perhaps we need to adjust process_started by the shift?

Happy to pick this up, just doing some other stuff at the moment...
Comment by Patrick Varley [ 08/Jul/14 ]
clock_handler calls set_current_time, which is where all the damage is done.

I agree that if we can handle it better we should not shut down. I did think about changing process_started, but that seems a bit like a hack in my head, though I cannot explain why :).
I was also wondering what we should do when time shifts forward?

I think this has some interesting effects on the stats too.
Comment by Patrick Varley [ 08/Jul/14 ]
Silly question, but why not set current_time to epoch seconds instead of computing the offset from process_started?
Comment by Jim Walker [ 09/Jul/14 ]
@patrick, this is shared code used by both memcache and couchbase buckets. Note that memcache buckets store expiry as "seconds since process started" while couchbase buckets store expiry as seconds since epoch, hence why a lot of this number shuffling is occurring.
Comment by Jim Walker [ 09/Jul/14 ]
get_current_time() is used for a number of time-based lock checks (see getl) and for document expiry itself (both within memcached and couchbase buckets).

process_started is an absolute timestamp and can lead to incorrect expiry if the real clock jumps. Example:
 - 11:00am memcached started, process_started = 11:00am (ignoring the -2 second thing)
 - 11:05am ntp comes in and aligns the node to the correct data-centre time (let's say -1hr); time is now 10:05am
 - 10:10am clients now set documents with an absolute expiry of 10:45am
 - documents instantly expire because memcached thinks they're in the past... client scratches head.

Ultimately we need to ensure that the functions get_current_time(), realtime() and abstime() all do sensible things if the clock is changed, e.g. don’t return large unsigned values.
 
Given all this I think the requirements are:

R1 Define a memcached time tick interval (which is 1 second)
  - set_current_time() callback executes at this frequency.

R2 get_current_time() the value returned must be shielded from clock changes.
   - If clock goes backwards, the returned value still increases by R1.
   - If clock goes forwards, the returned value still increases by R1.
   - Really this returns process uptime in seconds and the stat “uptime” is just current_time.

R3 monitor the system time for jumps (forward or backward).
   - Reset process_started to be current time if there’s a change which is greater or less than R1 ticks.

R4 Ensure documentation describes the effect of system clock changes and the two ways you can set document expiry.
  

Overall, the code changes to address the issue are simple. I will also look at adding testrunner tests to ensure the system behaves.
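A rough sketch of the R2/R3 idea above: shield the returned uptime from wall-clock jumps and re-base process_started when a jump is detected. The names follow memcached.c, but this is only an outline under those assumptions, not the actual patch.

/* Sketch: the once-a-second clock callback keeps its own notion of uptime and
 * re-bases process_started whenever the wall clock jumps by more than one tick. */
#include <time.h>
#include <sys/time.h>

typedef unsigned int rel_time_t;

static rel_time_t current_time;      /* uptime in seconds, shielded from jumps (R2) */
static time_t process_started;
static time_t last_wall_clock;

static void set_current_time(void) { /* invoked by clock_handler every 1s (R1) */
    struct timeval timer;
    gettimeofday(&timer, NULL);

    time_t delta = timer.tv_sec - last_wall_clock;
    last_wall_clock = timer.tv_sec;

    if (delta < 0 || delta > 1) {
        /* R3: wall clock jumped (or the VM was suspended); advance uptime by
         * exactly one tick and re-base process_started so later ticks line up. */
        current_time += 1;
        process_started = timer.tv_sec - (time_t)current_time;
    } else {
        current_time = (rel_time_t)(timer.tv_sec - process_started);
    }
}

int main(void) {
    struct timeval timer;
    gettimeofday(&timer, NULL);
    process_started = last_wall_clock = timer.tv_sec;  /* as at process start */
    set_current_time();   /* would be driven by clock_handler once per second */
    return 0;
}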
Comment by Patrick Varley [ 09/Jul/14 ]
Sounds good. A small reminder about handling VMs that are suspended.




[MB-11573] replica items count mismatch on source cluster Created: 27/Jun/14  Updated: 09/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Sangharsh Agarwal Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-855

Ubuntu 12.04

Issue Links:
Duplicate
is duplicated by MB-11593 Active and replica items count mismat... Resolved
Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.3.3.144 : https://s3.amazonaws.com/bugdb/jira/MB-11573/357ca392/10.3.3.144-6232014-1234-diag.zip
10.3.3.144 : https://s3.amazonaws.com/bugdb/jira/MB-11573/7f58a87c/10.3.3.144-diag.txt.gz
10.3.3.144 : https://s3.amazonaws.com/bugdb/jira/MB-11573/b7ffa0dc/10.3.3.144-6232014-122-couch.tar.gz
10.3.3.146 : https://s3.amazonaws.com/bugdb/jira/MB-11573/501d443d/10.3.3.146-diag.txt.gz
10.3.3.146 : https://s3.amazonaws.com/bugdb/jira/MB-11573/b43882dd/10.3.3.146-6232014-122-couch.tar.gz
10.3.3.146 : https://s3.amazonaws.com/bugdb/jira/MB-11573/cf026514/10.3.3.146-6232014-1231-diag.zip
10.3.3.147 : https://s3.amazonaws.com/bugdb/jira/MB-11573/07012cdf/10.3.3.147-6232014-122-couch.tar.gz
10.3.3.147 : https://s3.amazonaws.com/bugdb/jira/MB-11573/66b29090/10.3.3.147-6232014-1238-diag.zip
10.3.3.147 : https://s3.amazonaws.com/bugdb/jira/MB-11573/cba12a00/10.3.3.147-diag.txt.gz

[Destination]
10.3.3.142 : https://s3.amazonaws.com/bugdb/jira/MB-11573/68cbcc2c/10.3.3.142-6232014-1230-diag.zip
10.3.3.142 : https://s3.amazonaws.com/bugdb/jira/MB-11573/cd5167da/10.3.3.142-diag.txt.gz
10.3.3.142 : https://s3.amazonaws.com/bugdb/jira/MB-11573/d75b9f56/10.3.3.142-6232014-122-couch.tar.gz
10.3.3.143 : https://s3.amazonaws.com/bugdb/jira/MB-11573/4882073a/10.3.3.143-diag.txt.gz
10.3.3.143 : https://s3.amazonaws.com/bugdb/jira/MB-11573/92b171e4/10.3.3.143-6232014-1226-diag.zip
10.3.3.143 : https://s3.amazonaws.com/bugdb/jira/MB-11573/b04e77a9/10.3.3.143-6232014-122-couch.tar.gz
10.3.3.145 : https://s3.amazonaws.com/bugdb/jira/MB-11573/68cd72b2/10.3.3.145-diag.txt.gz
10.3.3.145 : https://s3.amazonaws.com/bugdb/jira/MB-11573/c4455351/10.3.3.145-6232014-122-couch.tar.gz
10.3.3.145 : https://s3.amazonaws.com/bugdb/jira/MB-11573/d7e4cc1f/10.3.3.145-6232014-1228-diag.zip
10.3.3.148 : https://s3.amazonaws.com/bugdb/jira/MB-11573/297c9ed8/10.3.3.148-6232014-1237-diag.zip
10.3.3.148 : https://s3.amazonaws.com/bugdb/jira/MB-11573/44a7d424/10.3.3.148-6232014-123-couch.tar.gz
10.3.3.148 : https://s3.amazonaws.com/bugdb/jira/MB-11573/d2e4f811/10.3.3.148-diag.txt.gz
Is this a Regression?: Unknown

 Description   
http://qa.hq.northscale.net/job/ubuntu_x64--01_02--rebalanceXDCR-P0/17/consoleFull

[Test]
./testrunner -i ubuntu_x64--01_02--rebalanceXDCR-P0.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True -t xdcr.rebalanceXDCR.Rebalance.async_rebalance_in,items=100000,rdirection=bidirection,ctopology=chain,doc-ops=update-delete,doc-ops-dest=update-delete,expires=60,rebalance=destination,num_rebalance=1,GROUP=P1


[Test Error]
[2014-06-23 11:46:00,302] - [task:440] WARNING - Not Ready: vb_replica_curr_items 80001 == 80000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
[2014-06-23 11:46:05,339] - [task:440] WARNING - Not Ready: vb_replica_curr_items 80001 == 80000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
[2014-06-23 11:46:10,379] - [task:440] WARNING - Not Ready: vb_replica_curr_items 80001 == 80000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
[2014-06-23 11:46:15,433] - [task:440] WARNING - Not Ready: vb_replica_curr_items 80001 == 80000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
[2014-06-23 11:46:20,463] - [task:440] WARNING - Not Ready: vb_replica_curr_items 80001 == 80000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
[2014-06-23 11:46:25,498] - [task:440] WARNING - Not Ready: vb_replica_curr_items 80001 == 80000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
[2014-06-23 11:46:31,528] - [task:440] WARNING - Not Ready: vb_replica_curr_items 80001 == 80000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
[2014-06-23 11:46:36,566] - [task:440] WARNING - Not Ready: vb_replica_curr_items 80001 == 80000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket


[Test Steps]
1. Setup 3-3 node Src and Dest cluster.
2. Setup CAPI mode replication between default bucket.
3. Load 1M items on each cluster.
4. Add one node on Destination cluster.
5. Perform 30-30% update and delete on each cluster. Update items with expiration time of 60 seconds.
6. Wait for expiration time of 60 seconds.
8. Expecting 80000 items on each side.
9. There was 1 extra replica item on the Source side.

I ran this test twice on the same cluster but couldn't reproduce it on its own. Please see if you find anything suspicious in the logs.

 Comments   
Comment by Sangharsh Agarwal [ 27/Jun/14 ]
There is 1 more test that failed with this error in the same job.
Comment by Abhinav Dangeti [ 27/Jun/14 ]
I couldn't reproduce it either with the test case you pointed out.
If you were able to reproduce it in one of the Jenkins jobs or by yourself, I'd appreciate it if you could point me to the cluster in that state.
Comment by Sangharsh Agarwal [ 30/Jun/14 ]
Abhinav,
   I am trying to reproduce it. In addition, just as an update, this issue is occurring in various jobs on the latest build 3.0.0-884; approximately 5 tests have failed. If you need logs for those executions, please let me know and I can provide them now.
Comment by Abhinav Dangeti [ 30/Jun/14 ]
Sangharsh, I need the live cluster for debugging this. As I am not able to reproduce this issue and neither are you, please keep re-running the task or monitor the Jenkins job so that you can get a cluster into this state.
Comment by Sangharsh Agarwal [ 01/Jul/14 ]
Abhinav,
   The bug is reproduced on build 3.0.0-884.

[Test Logs]
https://friendpaste.com/5YBBomzEpWMeGiksM8dlHx

[Source]
10.5.2.231
10.5.2.232
10.5.2.233
10.5.2.234 -> Added node during test

[Destination]
10.5.2.228
10.5.2.229
10.5.2.230
10.3.5.68 -> Added node during test.

Cluster is Live for debugging.



[Test Error]
2014-07-01 00:02:05 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 80002 == 80000 expected on '10.5.2.232:8091''10.5.2.231:8091''10.5.2.234:8091''10.5.2.233:8091', default bucket
2014-07-01 00:02:11 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 80002 == 80000 expected on '10.5.2.232:8091''10.5.2.231:8091''10.5.2.234:8091''10.5.2.233:8091', default bucket
2014-07-01 00:02:18 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 80002 == 80000 expected on '10.5.2.232:8091''10.5.2.231:8091''10.5.2.234:8091''10.5.2.233:8091', default bucket
2014-07-01 00:02:23 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 80002 == 80000 expected on '10.5.2.232:8091''10.5.2.231:8091''10.5.2.234:8091''10.5.2.233:8091', default bucket
2014-07-01 00:02:29 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 80002 == 80000 expected on '10.5.2.232:8091''10.5.2.231:8091''10.5.2.234:8091''10.5.2.233:8091', default bucket
2014-07-01 00:02:35 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 80002 == 80000 expected on '10.5.2.232:8091''10.5.2.231:8091''10.5.2.234:8091''10.5.2.233:8091', default bucket
Comment by Abhinav Dangeti [ 01/Jul/14 ]
Thanks Sangharsh, I'll let you know once I'm done.
Replica vbuckets 770 & 801 have 1 fewer delete when compared to their active counterparts.
Comment by Aruna Piravi [ 01/Jul/14 ]
Hit a similar problem in system tests where there is a difference of 1 item in bi-XDCR when both clusters are compared. Total items: ~100M. Will attach cbcollect info. Chiyoung thinks the root cause could be the same, so attaching logs to this MB.

Live clusters available for investigation: http://172.23.105.44:8091/ http://172.23.105.54:8091/
Comment by Aruna Piravi [ 01/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11573/C1.tar
https://s3.amazonaws.com/bugdb/jira/MB-11573/C2.tar

Pls let me know if you need the clusters. Thanks.
Comment by Sundar Sridharan [ 01/Jul/14 ]
In Aruna's cluster we see more of this issue…
$ ./cbvdiff 172.23.105.54:11210,172.23.105.55:11210,172.23.105.57:11210,172.23.105.58:11210,172.23.105.60:11210,172.23.105.61:11210,172.23.105.62:11210,172.23.105.63:11210 -b standardbucket
VBucket 42: active count 96859 != 96860 replica count

VBucket 50: active count 96923 != 96924 replica count

VBucket 94: active count 96918 != 96919 replica count

VBucket 196: active count 96791 != 96792 replica count

VBucket 391: active count 96911 != 96912 replica count

VBucket 418: active count 97009 != 97010 replica count

VBucket 427: active count 96772 != 96773 replica count

VBucket 488: active count 96717 != 96718 replica count

VBucket 787: active count 96729 != 96730 replica count

Active item count = 99136544
---
Comment by Sundar Sridharan [ 01/Jul/14 ]
On node 10.5.2.234 we see one item in the vb_replica write queue that does not seem to decrement to zero in vbucket 770.
Comment by Sundar Sridharan [ 02/Jul/14 ]
Since this issue seems to be associated with deleteWithMeta not being replicated to the replica node, I have created a new toy build that logs deletes on the replica.
Could you please help reproduce this with the toy build
couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0.rpm?
Thanks in advance.
Comment by Sangharsh Agarwal [ 03/Jul/14 ]
Can you please merge your changes? I will verify from the updated RPM.
Comment by Sundar Sridharan [ 03/Jul/14 ]
Sangharsh, I cannot merge these changes because we do not want to log these per-document messages at any log level; otherwise they will easily mask other important messages.
The toy build for this is http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0-701-toy.rpm

If this cannot be done, please let me know. Thanks.
Comment by Sangharsh Agarwal [ 07/Jul/14 ]
Test cases are running, I will update you once finished.
Comment by Sangharsh Agarwal [ 07/Jul/14 ]
Sundar, XDCR encryption is disabled in the Community edition, so I cannot run tests on this toy build. Can you please provide a toy build of the Enterprise edition or for Ubuntu (Debian package)?
Comment by Sundar Sridharan [ 07/Jul/14 ]
Sangharsh, it looks like we do not have Ubuntu toy builders at the moment due to infrastructure issues, but we believe this problem should be seen on CentOS as well. Could you please help reproduce this issue on CentOS machines? Thanks.
Comment by Sundar Sridharan [ 07/Jul/14 ]
Sangharsh, there are a few CentOS machines that I would like to use to see if I can reproduce the issue on my own too. Could you please share the details of the file ubuntu_x64--01_02--rebalanceXDCR-P0.ini as mentioned in the command:
./testrunner -i ubuntu_x64--01_02--rebalanceXDCR-P0.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True -t xdcr.rebalanceXDCR.Rebalance.async_rebalance_in,items=100000,rdirection=bidirection,ctopology=chain,doc-ops=update-delete,doc-ops-dest=update-delete,expires=60,rebalance=destination,num_rebalance=1,GROUP=P1
Please also let me know if there are any specific test instructions other than the above.
Thanks in advance.
Comment by Sangharsh Agarwal [ 08/Jul/14 ]
Sundar,
    Sorry if I was not clear in my last comment:

1. This bug occurred on Ubuntu VMs initially and is still occurring with Ubuntu.
2. On build 3.0.0-918, this issue only occurred with Rebalance + XDCR SSL, not with normal XDCR. And SSL is not enabled in this community toy build.

You cannot verify the bug if the test runs without the above.

Additionally, ubuntu_x64--01_02--rebalanceXDCR-P0.ini contains the Ubuntu VMs.

Anyway, the CentOS VMs are below; please go ahead with verification.

[global]
username:root
password:couchbase
port:8091

[cluster1]
1:_1
2:_2
3:_3


[cluster2]
4:_4
5:_5
6:_6

[servers]
1:_1
2:_2
3:_3
4:_4
5:_5
6:_6
7:_7
8:_8

[_1]
ip:10.5.2.228

[_2]
ip:10.5.2.229

[_3]
ip:10.5.2.230

[_4]
ip:10.5.2.231

[_5]
ip:10.5.2.232

[_6]
ip:10.5.2.233

[_7]
ip:10.5.2.234

[_8]
ip:10.3.5.68

[membase]
rest_username:Administrator
rest_password:password
Comment by Sangharsh Agarwal [ 08/Jul/14 ]
Sundar,
    I am able to reproduce the bug with the toy build as well:

[Test Log]
test.log : https://s3.amazonaws.com/bugdb/jira/MB-11573/6f3ccc42/test.log

[Server Logs]

[Source]
10.5.2.231 : https://s3.amazonaws.com/bugdb/jira/MB-11573/3fa17d4c/10.5.2.231-782014-623-diag.zip
10.5.2.231 : https://s3.amazonaws.com/bugdb/jira/MB-11573/5116e212/10.5.2.231-diag.txt.gz
10.5.2.232 : https://s3.amazonaws.com/bugdb/jira/MB-11573/05bbeb62/10.5.2.232-diag.txt.gz
10.5.2.232 : https://s3.amazonaws.com/bugdb/jira/MB-11573/15b11b01/10.5.2.232-782014-67-couch.tar.gz
10.5.2.232 : https://s3.amazonaws.com/bugdb/jira/MB-11573/d388f019/10.5.2.232-782014-621-diag.zip
10.5.2.233 : https://s3.amazonaws.com/bugdb/jira/MB-11573/8b59760a/10.5.2.233-diag.txt.gz
10.5.2.233 : https://s3.amazonaws.com/bugdb/jira/MB-11573/bf4a6786/10.5.2.233-782014-626-diag.zip

[Destination]
10.5.2.228 : https://s3.amazonaws.com/bugdb/jira/MB-11573/0c3e02f3/10.5.2.228-782014-67-couch.tar.gz
10.5.2.228 : https://s3.amazonaws.com/bugdb/jira/MB-11573/cac82b8b/10.5.2.228-782014-616-diag.zip
10.5.2.228 : https://s3.amazonaws.com/bugdb/jira/MB-11573/e24d1ab4/10.5.2.228-diag.txt.gz
10.5.2.229 : https://s3.amazonaws.com/bugdb/jira/MB-11573/307bc81f/10.5.2.229-782014-67-couch.tar.gz
10.5.2.229 : https://s3.amazonaws.com/bugdb/jira/MB-11573/ac67e5f4/10.5.2.229-782014-619-diag.zip
10.5.2.229 : https://s3.amazonaws.com/bugdb/jira/MB-11573/cbcbada6/10.5.2.229-diag.txt.gz
10.5.2.230 : https://s3.amazonaws.com/bugdb/jira/MB-11573/0ee89b4a/10.5.2.230-782014-618-diag.zip
10.5.2.230 : https://s3.amazonaws.com/bugdb/jira/MB-11573/66d42064/10.5.2.230-782014-67-couch.tar.gz
10.5.2.230 : https://s3.amazonaws.com/bugdb/jira/MB-11573/7cbbe047/10.5.2.230-diag.txt.gz


[Test Steps]
1. Setup 3-3 node Source and Destination clusters.
2. Setup bi-directional XDCR in CAPI mode.
3. Load 1M items on each cluster asynchronously.
4. Rebalance out 2 nodes from the Source cluster during the data load.
5. After the rebalance, verify items on each cluster. The test failed because the Destination cluster has 2 fewer items than expected.

[2014-07-08 06:00:44,001] - [task:440] WARNING - Not Ready: vb_replica_curr_items 199998 == 200000 expected on '10.5.2.228:8091''10.5.2.230:8091''10.5.2.229:8091', default bucket
[2014-07-08 06:00:47,033] - [task:440] WARNING - Not Ready: curr_items 199998 == 200000 expected on '10.5.2.228:8091''10.5.2.230:8091''10.5.2.229:8091', default bucket
[2014-07-08 06:00:48,062] - [task:440] WARNING - Not Ready: vb_active_curr_items 199998 == 200000 expected on '10.5.2.228:8091''10.5.2.230:8091''10.5.2.229:8091', default bucket


Comment by Sangharsh Agarwal [ 08/Jul/14 ]
I am re-running the test to leave a live cluster for you to investigate.
Comment by Sundar Sridharan [ 08/Jul/14 ]
This is interesting; the symptoms here look quite different from the initial ones mentioned in the bug. There is no mismatch between active and replica item counts on the destination cluster.
./cbvdiff 10.5.2.228:11210,10.5.2.230:11210,10.5.2.229:11210
Active item count = 199999
Comment by Sangharsh Agarwal [ 08/Jul/14 ]
I logged https://www.couchbase.com/issues/browse/MB-11593 for this issue too, but it was marked as a duplicate of this one.
Comment by Sangharsh Agarwal [ 08/Jul/14 ]
The cluster is live for investigation. You can use the VMs to investigate or to run the test.
Comment by Sundar Sridharan [ 08/Jul/14 ]
Thanks Sangharsh. Just to confirm: 10.5.2.231 and 10.5.2.233 were the nodes from the source cluster that were rebalanced out, right?
I am on the cluster right now. Please let me know if you need it back.
Comment by Sundar Sridharan [ 08/Jul/14 ]
vbucket 303 has 212 items on the source and only 211 items on the destination.
The document with id loadOne5836 is present on 10.5.2.232 (src) but not on destination node 10.5.2.228.
Comment by Sangharsh Agarwal [ 08/Jul/14 ]
Yes, 10.5.2.231 and 10.5.2.233 were rebalanced out.
Comment by Sundar Sridharan [ 08/Jul/14 ]
Sangharsh, so were all keys with the prefix loadOne inserted into cluster 1 (comprising 10.5.2.231, 10.5.2.232 and 10.5.2.233), while all keys with the prefix loadTwo were inserted into cluster 2 (comprising 10.5.2.228, 10.5.2.229 and 10.5.2.230)?
Also, could you please tell us whether the workload overlaps the key space across the source and destination clusters (meaning loadOne keys can be inserted into both cluster 1 and cluster 2)?
Comment by Aruna Piravi [ 08/Jul/14 ]
Sangharsh is not available at this time, so I am answering Sundar's questions.

> Sangharsh, so were all keys with the prefix loadOne inserted into cluster 1 (comprising 10.5.2.231, 10.5.2.232 and 10.5.2.233), while all keys with the prefix loadTwo were inserted into cluster 2 (comprising 10.5.2.228, 10.5.2.229 and 10.5.2.230)?
Yes, you are correct. loadOne* goes to all servers listed under [cluster1] in the .ini, and loadTwo* gets loaded to the servers listed under [cluster2].

>Also, could you please tell us whether the workload overlaps the key space across the source and destination clusters (meaning loadOne keys can be inserted into both cluster 1 and cluster 2)?
I checked the code; we are not doing updates/deletes on an overlapping key space in this test.
Comment by Sundar Sridharan [ 08/Jul/14 ]
It looks like the Producer's start seqno has skipped one item:
memcached.log.14.txt:19313:Tue Jul 8 10:18:29.198224 PDT 3: (default) UPR (Producer) eq_uprq:xdcr:default-6098641df836bdbfff9953ad74a05bbe - (vb 303) stream created with start seqno 0 and end seqno 21
memcached.log.14.txt:21205:Tue Jul 8 10:18:35.825112 PDT 3: (default) UPR (Producer) eq_uprq:xdcr:default-0412be9c4abad09bd0892f2d827d7f5f - (vb 303) stream created with start seqno 22 and end seqno 24 <<<<<<<<<<<<<<<<---------------------!!
memcached.log.14.txt:23783:Tue Jul 8 10:18:49.391231 PDT 3: (default) UPR (Producer) eq_uprq:xdcr:default-21d7493b0123d12d6ac5723df80b154f - (vb 303) stream created with start seqno 24 and end seqno 26
memcached.log.14.txt:43257:Tue Jul 8 10:20:07.665858 PDT 3: (default) UPR (Producer) eq_uprq:xdcr:default-886c34d2cd6b30b88c77ededf205e086 - (vb 303) stream created with start seqno 26 and end seqno 105

And we see that the missing item (loadOne5836) also has seqno 21:
Doc seq: 21
     id: loadOne5836
     rev: 1
     content_meta: 131
     size (on disk): 40
     cas: 731155696354949, expiry: 0, flags: 0
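
A minimal illustration (not the actual ep-engine code; the field names and the resume rule are assumptions) of how a one-item gap like this can appear when a new stream resumes from the previously advertised end seqno instead of from the last seqno that was actually delivered:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Assume the previous stream advertised end seqno 21, but the item with
     * seqno 21 was never actually delivered (hypothetical scenario). */
    uint64_t advertised_end = 21;  /* what the previous stream claimed to cover */
    uint64_t last_received  = 20;  /* what actually arrived */

    /* Buggy resume point: trust the advertised end. Seqno 21 is skipped. */
    uint64_t buggy_start = advertised_end + 1;   /* 22 */

    /* Safe resume point: continue from the last seqno actually received. */
    uint64_t safe_start = last_received + 1;     /* 21 */

    printf("buggy resume start: %llu (seqno 21 is never requested)\n",
           (unsigned long long)buggy_start);
    printf("safe resume start:  %llu\n", (unsigned long long)safe_start);
    return 0;
}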
Comment by Sundar Sridharan [ 08/Jul/14 ]
fix uploaded at http://review.couchbase.org/#/c/39224/
Comment by Chiyoung Seo [ 08/Jul/14 ]
The fix was merged. Please retest it when the new build is ready.
Comment by Sangharsh Agarwal [ 09/Jul/14 ]
The issue occurred again on the latest build, 3.0.0-942:

[Jenkins]
http://qa.hq.northscale.net/job/ubuntu_x64--01_02--rebalanceXDCR-P0/23/consoleFull

[Test]
./testrunner -i ubuntu_x64--01_02--rebalanceXDCR-P0.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True -t xdcr.rebalanceXDCR.Rebalance.async_rebalance_in,items=100000,rdirection=unidirection,ctopology=chain,doc-ops=update-delete,expires=60,rebalance=source-destination,num_rebalance=1,GROUP=P1


[Test Logs]
[2014-07-08 23:58:42,637] - [task:456] WARNING - Not Ready: vb_replica_curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket
[2014-07-08 23:58:44,677] - [task:456] WARNING - Not Ready: curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket
[2014-07-08 23:58:46,718] - [task:456] WARNING - Not Ready: vb_active_curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket
[2014-07-08 23:58:47,743] - [task:456] WARNING - Not Ready: vb_replica_curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket
[2014-07-08 23:58:49,784] - [task:456] WARNING - Not Ready: curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket
[2014-07-08 23:58:51,827] - [task:456] WARNING - Not Ready: vb_active_curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket
[2014-07-08 23:58:52,873] - [task:456] WARNING - Not Ready: vb_replica_curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket
[2014-07-08 23:58:54,913] - [task:456] WARNING - Not Ready: curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket
[2014-07-08 23:58:56,952] - [task:456] WARNING - Not Ready: vb_active_curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket
[2014-07-08 23:58:57,994] - [task:456] WARNING - Not Ready: vb_replica_curr_items 54999 == 40000 expected on '10.3.3.143:8091''10.3.3.145:8091''10.3.3.142:8091''10.3.3.149:8091', default bucket

Metadata mismatches were found too, which shows that many deletions were not replicated to the destination cluster:

[2014-07-09 00:05:17,634] - [xdcrbasetests:1255] INFO - Verifying RevIds for 10.3.3.146 -> 10.3.3.143, bucket: default
[2014-07-09 00:05:17,950] - [data_helper:289] INFO - creating direct client 10.3.3.144:11210 default
[2014-07-09 00:05:18,421] - [data_helper:289] INFO - creating direct client 10.3.3.146:11210 default
[2014-07-09 00:05:18,823] - [data_helper:289] INFO - creating direct client 10.3.3.148:11210 default
[2014-07-09 00:05:19,226] - [data_helper:289] INFO - creating direct client 10.3.3.147:11210 default
[2014-07-09 00:05:19,876] - [data_helper:289] INFO - creating direct client 10.3.3.142:11210 default
[2014-07-09 00:05:20,316] - [data_helper:289] INFO - creating direct client 10.3.3.143:11210 default
[2014-07-09 00:05:20,771] - [data_helper:289] INFO - creating direct client 10.3.3.149:11210 default
[2014-07-09 00:05:21,188] - [data_helper:289] INFO - creating direct client 10.3.3.145:11210 default
[2014-07-09 00:06:30,464] - [task:1161] INFO - RevId Verification : 40000 existing items have been verified
[2014-07-09 00:06:30,478] - [task:1220] ERROR - ===== Verifying rev_ids failed for key: loadOne80026 =====
[2014-07-09 00:06:30,478] - [task:1221] ERROR - deleted mismatch: Source deleted:1, Destination deleted:0, Error Count:1
[2014-07-09 00:06:30,478] - [task:1221] ERROR - seqno mismatch: Source seqno:2, Destination seqno:1, Error Count:2
[2014-07-09 00:06:30,478] - [task:1221] ERROR - cas mismatch: Source cas:10784420840980343, Destination cas:10784420840980342, Error Count:3
[2014-07-09 00:06:30,479] - [task:1222] ERROR - Source meta data: {'deleted': 1, 'seqno': 2, 'cas': 10784420840980343, 'flags': 0, 'expiration': 1404888661}
[2014-07-09 00:06:30,479] - [task:1223] ERROR - Dest meta data: {'deleted': 0, 'seqno': 1, 'cas': 10784420840980342, 'flags': 0, 'expiration': 0}
[2014-07-09 00:06:30,487] - [task:1220] ERROR - ===== Verifying rev_ids failed for key: loadOne6230 =====
[2014-07-09 00:06:30,488] - [task:1221] ERROR - deleted mismatch: Source deleted:1, Destination deleted:0, Error Count:4
[2014-07-09 00:06:30,488] - [task:1221] ERROR - seqno mismatch: Source seqno:3, Destination seqno:1, Error Count:5
[2014-07-09 00:06:30,488] - [task:1221] ERROR - cas mismatch: Source cas:10784931188024077, Destination cas:10784393526456264, Error Count:6
[2014-07-09 00:06:30,489] - [task:1222] ERROR - Source meta data: {'deleted': 1, 'seqno': 3, 'cas': 10784931188024077, 'flags': 0, 'expiration': 1404889014}
[2014-07-09 00:06:30,489] - [task:1223] ERROR - Dest meta data: {'deleted': 0, 'seqno': 1, 'cas': 10784393526456264, 'flags': 0, 'expiration': 0}
[2014-07-09 00:06:30,494] - [task:1220] ERROR - ===== Verifying rev_ids failed for key: loadOne77329 =====
[2014-07-09 00:06:30,494] - [task:1221] ERROR - deleted mismatch: Source deleted:1, Destination deleted:0, Error Count:7
[2014-07-09 00:06:30,495] - [task:1221] ERROR - seqno mismatch: Source seqno:2, Destination seqno:1, Error Count:8
[2014-07-09 00:06:30,495] - [task:1221] ERROR - cas mismatch: Source cas:10784419815115230, Destination cas:10784419815115229, Error Count:9
[2014-07-09 00:06:30,495] - [task:1222] ERROR - Source meta data: {'deleted': 1, 'seqno': 2, 'cas': 10784419815115230, 'flags': 0, 'expiration': 1404888644}
[2014-07-09 00:06:30,495] - [task:1223] ERROR - Dest meta data: {'deleted': 0, 'seqno': 1, 'cas': 10784419815115229, 'flags': 0, 'expiration': 0}
[2014-07-09 00:06:30,508] - [task:1220] ERROR - ===== Verifying rev_ids failed for key: loadOne90011 =====
[2014-07-09 00:06:30,509] - [task:1221] ERROR - deleted mismatch: Source deleted:1, Destination deleted:0, Error Count:10
[2014-07-09 00:06:30,509] - [task:1221] ERROR - seqno mismatch: Source seqno:2, Destination seqno:1, Error Count:11
[2014-07-09 00:06:30,509] - [task:1221] ERROR - cas mismatch: Source cas:10784424284736794, Destination cas:10784424284736793, Error Count:12
[2014-07-09 00:06:30,509] - [task:1222] ERROR - Source meta data: {'deleted': 1, 'seqno': 2, 'cas': 10784424284736794, 'flags': 0, 'expiration': 1404888730}
[2014-07-09 00:06:30,510] - [task:1223] ERROR - Dest meta data: {'deleted': 0, 'seqno': 1, 'cas': 10784424284736793, 'flags': 0, 'expiration': 0}


[Test Steps]
1. Setup a 3-3 node Source and Destination cluster.
2. Setup uni-XDCR in CAPI mode, Source -> Destination.
3. Load 1M items on the Source.
4. Add 1 node to the Source cluster and 1 node to the Destination cluster.
5. Update 30% of items (with an expiration time of 60 seconds) and delete 30% of items on the Source.
6. Verify items. Fewer items were found on the Destination cluster.

[Server Logs]

[Source]
10.3.3.144 : https://s3.amazonaws.com/bugdb/jira/MB-11573/2956573a/10.3.3.144-diag.txt.gz
10.3.3.144 : https://s3.amazonaws.com/bugdb/jira/MB-11573/bc5a1183/10.3.3.144-792014-022-diag.zip
10.3.3.144 : https://s3.amazonaws.com/bugdb/jira/MB-11573/e54806e5/10.3.3.144-792014-06-couch.tar.gz
10.3.3.146 : https://s3.amazonaws.com/bugdb/jira/MB-11573/86ea590a/10.3.3.146-792014-019-diag.zip
10.3.3.146 : https://s3.amazonaws.com/bugdb/jira/MB-11573/95f08cd8/10.3.3.146-diag.txt.gz
10.3.3.146 : https://s3.amazonaws.com/bugdb/jira/MB-11573/bd342640/10.3.3.146-792014-06-couch.tar.gz
10.3.3.148 : https://s3.amazonaws.com/bugdb/jira/MB-11573/26c6ec6e/10.3.3.148-792014-06-couch.tar.gz
10.3.3.148 : https://s3.amazonaws.com/bugdb/jira/MB-11573/545c1e7c/10.3.3.148-792014-023-diag.zip
10.3.3.148 : https://s3.amazonaws.com/bugdb/jira/MB-11573/eb81965a/10.3.3.148-diag.txt.gz
10.3.3.147 : https://s3.amazonaws.com/bugdb/jira/MB-11573/53ab8e05/10.3.3.147-792014-024-diag.zip
10.3.3.147 : https://s3.amazonaws.com/bugdb/jira/MB-11573/c0a49871/10.3.3.147-diag.txt.gz
10.3.3.147 : https://s3.amazonaws.com/bugdb/jira/MB-11573/e05fb7f2/10.3.3.147-792014-06-couch.tar.gz

10.3.3.148 -> Added node at Source.

[Destination]
10.3.3.142 : https://s3.amazonaws.com/bugdb/jira/MB-11573/29a36a28/10.3.3.142-792014-06-couch.tar.gz
10.3.3.142 : https://s3.amazonaws.com/bugdb/jira/MB-11573/60d75961/10.3.3.142-diag.txt.gz
10.3.3.142 : https://s3.amazonaws.com/bugdb/jira/MB-11573/85967976/10.3.3.142-792014-018-diag.zip
10.3.3.143 : https://s3.amazonaws.com/bugdb/jira/MB-11573/704327af/10.3.3.143-792014-06-couch.tar.gz
10.3.3.143 : https://s3.amazonaws.com/bugdb/jira/MB-11573/7aeb0fe5/10.3.3.143-diag.txt.gz
10.3.3.143 : https://s3.amazonaws.com/bugdb/jira/MB-11573/81482a1f/10.3.3.143-792014-015-diag.zip
10.3.3.149 : https://s3.amazonaws.com/bugdb/jira/MB-11573/b88f45cf/10.3.3.149-792014-07-couch.tar.gz
10.3.3.149 : https://s3.amazonaws.com/bugdb/jira/MB-11573/d070e995/10.3.3.149-diag.txt.gz
10.3.3.149 : https://s3.amazonaws.com/bugdb/jira/MB-11573/e9d83bfb/10.3.3.149-792014-026-diag.zip
10.3.3.145 : https://s3.amazonaws.com/bugdb/jira/MB-11573/49a3f167/10.3.3.145-diag.txt.gz
10.3.3.145 : https://s3.amazonaws.com/bugdb/jira/MB-11573/666f3c03/10.3.3.145-792014-06-couch.tar.gz
10.3.3.145 : https://s3.amazonaws.com/bugdb/jira/MB-11573/c4ddf1ca/10.3.3.145-792014-017-diag.zip

10.3.3.149 -> Added node at destination
Comment by Sundar Sridharan [ 09/Jul/14 ]
Sangharsh, could you please try to reproduce this issue with the toy build couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0-702-toy.rpm, which contains the latest ep-engine fixes along with the logging for deleted items on the replica?
Also, it would be great if you could leave the cluster running when the issue reproduces.
Thanks




[MB-11672] Missing items in index after rebalance (Intermittent failure) Created: 08/Jul/14  Updated: 10/Jul/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Sarath Lakshman Assignee: Sarath Lakshman
Resolution: Unresolved Votes: 0
Labels: releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates MB-11641 {UPR}:: Reading from views timing out... Closed
Relates to
relates to MB-11371 Corruption in PartitionVersions struc... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Test:
NODES=4 TEST=rebalance.rebalanceinout.RebalanceInOutTests.measure_time_index_during_rebalance,items=200000,data_perc_add=30,nodes_init=3,nodes_in=1,skip_cleanup=True,nodes_out=1,num_ddocs=2,num_views=2,max_verify=50000,value_size=1024,GROUP=IN_OUT make any-test

Roughly once in every three or four runs, the view query results contain fewer items than the expected number.

Logs:
https://s3.amazonaws.com/bugdb/jira/MB-11371/f9ad56ee/172.23.107.24-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11371/07e24114/172.23.107.25-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11371/a9c9a36d/172.23.107.26-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11371/2517f70b/172.23.107.27-diag.zip

 Comments   
Comment by Parag Agarwal [ 08/Jul/14 ]
What is the output of the test run?
Comment by Parag Agarwal [ 08/Jul/14 ]
Sarath: Did you hit this issue while verifying https://www.couchbase.com/issues/browse/MB-11641 ?
Comment by Parag Agarwal [ 08/Jul/14 ]
Saw this in 935
Comment by Sriram Melkote [ 10/Jul/14 ]
Sarath mentioned that on today's codebase we're not hitting it; it's not clear whether it's just reduced in frequency or was fixed by recent changes. Will update again.




[MB-11597] KV+XDCR System test: Mutation replication rate for uni-xdcr is almost zero(900K items remaining) while another bi-xdcr to same cluster is ~10k ops/sec Created: 30/Jun/14  Updated: 10/Jul/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication, performance
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.x 8*8 clusters. Each node : 15GB RAM, 450Gb HDD

Attachments: PNG File Screen Shot 2014-06-30 at 11.27.42 AM.png     PNG File Screen Shot 2014-06-30 at 11.31.54 AM.png    
Triage: Untriaged
Is this a Regression?: No

 Description   
Build
--------
3.0.0-900(xdcr on upr, internal replication on upr)

Clusters
-----------
Source : http://172.23.105.44:8091/
Destination : http://172.23.105.54:8091/
There's currently a test running on this cluster. You can take a look if required.

Steps
--------
1. Load on both clusters till vb_active_resident_items_ratio < 50.
2. Setup bi-xdcr on "standardbucket", uni-xdcr on "standardbucket1"
3. Access phase with 50% gets and 50% deletes, running now for an hour.


Problem
-------------
See screenshot. Mutation replication rate is uneven for uni and bi-xdcr.
The bucket under discussion, standardbucket1 (which has uni-XDCR), has ~900K items remaining and its mutation replication rate is almost 0.
Another bucket, standardbucket, has bi-XDCR (to the same cluster) but its mutation replication rate is ~10k ops/sec; there is data moving, as can be seen from the UPR queue stats.

Bucket capacity: 5GB for both standard buckets.

Bucket priority
-----------------------
Both standardbucket and standardbucket1 have high priority.

Attaching cbcollect.

 Comments   
Comment by Aruna Piravi [ 30/Jun/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11597//C1.tar
https://s3.amazonaws.com/bugdb/jira/MB-11597/C2.tar

Please see if these logs help. When I was running cbcollect, the mutation replication rate picked up. If the logs don't help, I can point you to the cluster as soon as I see the issue again. Thanks.
Comment by Aruna Piravi [ 30/Jun/14 ]
Lowering to Critical as the mutation replication rate got better after a few minutes.
Comment by Aleksey Kondratenko [ 30/Jun/14 ]
Not seeing anything notable in logs.
Comment by Aruna Piravi [ 07/Jul/14 ]
Reproduced again; Alk looked at the cluster. Pausing and resuming replication helped bring the slower replication up to speed. We also noticed many unacked bytes in the UPR stream on a node where outbound mutations were 0.

New set of logs -
https://s3.amazonaws.com/bugdb/jira/MB-11597/C1.tar
https://s3.amazonaws.com/bugdb/jira/MB-11597/C2.tar




[MB-11554] cbrecovery mismatch in meta information after rebalance Created: 26/Jun/14  Updated: 10/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ashvinder Singh Assignee: David Liao
Resolution: Unresolved Votes: 0
Labels: releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: All OS

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
Test run on build 3.0.0-866
Steps to reproduce:
- Setup two clusters - src-A and dest-B (each with three nodes and one bucket each) with xdcr
- Setup two floating nodes on dest-B
- Create 80K items, ensure all items are replicated to the dest-B cluster
- Failover two nodes at dest-B
- Add one node at dest-B cluster
- run cbrecovery from src-A to dest-B
- Ensure cbrecovery completes successfully
- do rebalance on dest-B cluster
- ensure rebalance completes
- Do mutations on src-A cluster
- Verify meta data on dest-B cluster matches src-A cluster
Bug: Meta data does not match between the src-A and dest-B clusters.

Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 6811709964931922, 'flags': 0, 'expiration': 0}
[2014-06-25 01:10:42,636] - [task:1206] ERROR - Dest meta data: {'deleted': 0, 'seqno': 1, 'cas': 0, 'flags': 0, 'expiration': 0}

Found using jenkins job: http://qa.sc.couchbase.com/view/All/job/ubuntu_x64--38_01--cbrecovery-P1/44/consoleFull


 Comments   
Comment by Bin Cui [ 10/Jul/14 ]
I wonder if ep_engine by any chance will change the cas field?




[MB-11299] Upr replica streams cannot send items from partial snapshots Created: 03/Jun/14  Updated: 10/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
If items are sent from a replica vbucket and those items are from a partial snapshot then we might get holes in our data.
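
A minimal sketch of the invariant described above, with hypothetical structures (this is not the ep-engine implementation): a stream fed from a replica vbucket should never hand out seqnos beyond the end of the last snapshot that was received in full.

#include <stdint.h>
#include <stdio.h>

struct replica_vbucket {
    uint64_t high_seqno;                  /* highest seqno stored locally */
    uint64_t last_complete_snapshot_end;  /* end of the last fully received snapshot */
};

/* Clamp the requested stream end so items from a partial snapshot are never sent. */
static uint64_t clamp_stream_end(const struct replica_vbucket *vb, uint64_t requested_end)
{
    uint64_t safe_end = vb->last_complete_snapshot_end;
    return requested_end < safe_end ? requested_end : safe_end;
}

int main(void)
{
    /* Example: the replica holds items up to seqno 120, but only the snapshot
     * ending at seqno 100 arrived completely; seqnos 101-120 are partial. */
    struct replica_vbucket vb = { 120, 100 };
    printf("stream end clamped to %llu\n",
           (unsigned long long)clamp_stream_end(&vb, 120));
    return 0;
}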

 Comments   
Comment by Aleksey Kondratenko [ 02/Jul/14 ]
Raised to blocker. Data loss in xdcr or views is super critical IMO
Comment by Mike Wiederhold [ 10/Jul/14 ]
I agree with Alk on the severity of this issue, but I do want to note that hitting this problem will be rare. I'm planning to work on it soon, but I need to get another issue resolved before I address this one.




[MB-11048] Range queries result in thousands of GET operations/sec Created: 05/May/14  Updated: 18/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
A benchmark of range queries demonstrated very high latency. At the same time, I noticed an extremely high rate of GET operations.

Even a single query such as "SELECT name.f.f.f AS _name FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000 LIMIT 20" led to hundreds of memcached reads.

Explain:

https://gist.github.com/pavel-paulau/5e90939d6ab28034e3ed

Engine output:

https://gist.github.com/pavel-paulau/b222716934dfa3cb598e

I don't like to use JIRA as a forum, but why does this happen? Do you fetch the entire range before returning the limited output?

 Comments   
Comment by Gerald Sangudi [ 05/May/14 ]
Pavel,

Yes, the scan and fetch are performed before we do any LIMIT. This will be fixed in DP4, but it may not be easily fixable in DP3.

Can you please post the results of the following query:

SELECT COUNT(*) FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000

Thanks.
Comment by Pavel Paulau [ 05/May/14 ]
cbq> SELECT COUNT(*) FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000
{
    "resultset": [
        {
            "$1": 2134
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "1"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "547.545767ms"
        }
    ]
}
Comment by Pavel Paulau [ 05/May/14 ]
It also looks like we are leaking memory in this scenario.

The resident memory of cbq-engine grows very fast (several megabytes per second) and never goes down...




[MB-11033] was able to add node on src from dest cluster and got local replication Created: 03/May/14  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication, ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
System tests:

2 clusters with 4 buckets each, uni-directional XDCR replication

source:
172.23.105.158
172.23.105.156
172.23.105.157
172.23.105.22
destination:
172.23.105.160
172.23.105.159
172.23.105.206
172.23.105.207

1) remove node 172.23.105.159 on dest
2) remove node 172.23.105.22 on src
3) add 172.23.105.159 and 172.23.105.22 on src


results:
the src cluster started replicating to itself


AbRegNums Version 2 this cluster bucket "AbRegNums" on cluster "172.23.105.159" Replicating on change
Delete
Settings
MsgsCalls Version 2 this cluster bucket "MsgsCalls" on cluster "172.23.105.159" Replicating on change
Delete
Settings
RevAB Version 2 this cluster bucket "RevAB" on cluster "172.23.105.159" Replicating on change
Delete
Settings
UserInfo Version 2 this cluster bucket "UserInfo" on cluster "172.23.105.159" Replicating on change





logs on destination when node 172.23.105.159 ejected:
Starting rebalance, KeepNodes = ['ns_1@172.23.105.160','ns_1@172.23.105.206',
'ns_1@172.23.105.207'], EjectNodes = ['ns_1@172.23.105.159'], Failed over and being ejected nodes = []; no delta recovery nodes
ns_orchestrator004 ns_1@172.23.105.159 08:20:13 - Fri May 2, 2014


logs on src when node 172.23.105.159 & 172.23.105.22 added:
Starting rebalance, KeepNodes = ['ns_1@172.23.105.156','ns_1@172.23.105.157',
'ns_1@172.23.105.158','ns_1@172.23.105.22',
'ns_1@172.23.105.159'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes
ns_orchestrator004 ns_1@172.23.105.156 09:30:06 - Fri May 2, 2014

Starting rebalance, KeepNodes = ['ns_1@172.23.105.156','ns_1@172.23.105.157',
'ns_1@172.23.105.158'], EjectNodes = ['ns_1@172.23.105.22'], Failed over and being ejected nodes = []; no delta recovery nodes
ns_orchestrator004 ns_1@172.23.105.156 08:20:11 - Fri May 2, 2014


also on this cluster I see a lot of other problems:
1) stopped all loaders but still see ~6K set (update) ops for bucket RevAB
2) MB-11032: DISK QUEUES for Active & resident items are not plausible numbers; they are too big and do not correspond to the total
3) rebalance stuck (will file a separate ticket)
4)


will keep the clusters alive:
source: http://172.23.105.156:8091/
destination:
http://172.23.105.160:8091/


https://s3.amazonaws.com/bugdb/jira/MB-11032/371fc18e/172.23.105.156-532014-621-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11032/371fc18e/172.23.105.157-532014-612-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11032/371fc18e/172.23.105.158-532014-62-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11032/371fc18e/172.23.105.22-532014-631-diag.zip




 Comments   
Comment by Aleksey Kondratenko [ 05/May/14 ]
Hm. What makes you think the cluster is replicating into itself?
Comment by Aleksey Kondratenko [ 07/May/14 ]
Please elaborate at least a bit on why you think we're replicating into itself
Comment by Anil Kumar [ 19/Jun/14 ]
Andrei - can you please add more information to the ticket?
Comment by Andrei Baranouski [ 20/Jun/14 ]
yes, will try to reproduce it




[MB-11007] Request for Get Multi Meta Call for bulk meta data reads Created: 30/Apr/14  Updated: 30/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Parag Agarwal Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: All


 Description   
Currently we support a per-key call for getMetaData. As a result, our verification requires a per-key fetch during the verification phase. This request is to support a bulk get-metadata call that can return metadata per vbucket for all keys, or in batches. This would enhance our ability to verify per-document metadata over time or after operations like rebalance, as it would be faster. If there is a better alternative, please recommend one.

Current Behavior

https://github.com/couchbase/ep-engine/blob/master/src/ep.cc

ENGINE_ERROR_CODE EventuallyPersistentStore::getMetaData(
                                                        const std::string &key,
                                                        uint16_t vbucket,
                                                        const void *cookie,
                                                        ItemMetaData &metadata,
                                                        uint32_t &deleted,
                                                        bool trackReferenced)
{
    (void) cookie;
    RCPtr<VBucket> vb = getVBucket(vbucket);
    if (!vb || vb->getState() == vbucket_state_dead ||
        vb->getState() == vbucket_state_replica) {
        ++stats.numNotMyVBuckets;
        return ENGINE_NOT_MY_VBUCKET;
    }

    int bucket_num(0);
    deleted = 0;
    LockHolder lh = vb->ht.getLockedBucket(key, &bucket_num);
    StoredValue *v = vb->ht.unlocked_find(key, bucket_num, true,
                                          trackReferenced);

    if (v) {
        stats.numOpsGetMeta++;

        if (v->isTempInitialItem()) { // Need bg meta fetch.
            bgFetch(key, vbucket, -1, cookie, true);
            return ENGINE_EWOULDBLOCK;
        } else if (v->isTempNonExistentItem()) {
            metadata.cas = v->getCas();
            return ENGINE_KEY_ENOENT;
        } else {
            if (v->isTempDeletedItem() || v->isDeleted() ||
                v->isExpired(ep_real_time())) {
                deleted |= GET_META_ITEM_DELETED_FLAG;
            }
            metadata.cas = v->getCas();
            metadata.flags = v->getFlags();
            metadata.exptime = v->getExptime();
            metadata.revSeqno = v->getRevSeqno();
            return ENGINE_SUCCESS;
        }
    } else {
        // The key wasn't found. However, this may be because it was previously
        // deleted or evicted with the full eviction strategy.
        // So, add a temporary item corresponding to the key to the hash table
        // and schedule a background fetch for its metadata from the persistent
        // store. The item's state will be updated after the fetch completes.
        return addTempItemForBgFetch(lh, bucket_num, key, vb, cookie, true);
    }
}



 Comments   
Comment by Venu Uppalapati [ 30/Apr/14 ]
The server supports the quiet CMD_GETQ_META call, which can be used on the client side to build a multi-getMeta call similar to the multiGet implementation.
Comment by Parag Agarwal [ 30/Apr/14 ]
Please point to a working example of this call.
Comment by Venu Uppalapati [ 30/Apr/14 ]
Parag, you can find some relevant information on queuing requests using the quiet call at https://code.google.com/p/memcached/wiki/BinaryProtocolRevamped#Get,_Get_Quietly,_Get_Key,_Get_Key_Quietly
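
A rough sketch of that batching pattern (not a supported client API): queue one quiet getMeta request per key into a single buffer and terminate the batch with a NOOP, so the NOOP response marks the end of the batch. The CMD_GETQ_META opcode value below is an assumption and should be checked against protocol_binary.h; SASL auth, vbucket mapping and response parsing are omitted.

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define REQ_MAGIC     0x80
#define CMD_NOOP      0x0a   /* standard memcached binary NOOP */
#define CMD_GETQ_META 0xa1   /* ASSUMED opcode for the quiet getMeta extension */

/* Append one 24-byte binary protocol header plus the key; returns bytes written. */
static size_t pack_request(uint8_t *buf, uint8_t opcode, uint16_t vbucket,
                           const char *key, uint16_t keylen, uint32_t opaque)
{
    memset(buf, 0, 24);
    buf[0] = REQ_MAGIC;
    buf[1] = opcode;
    uint16_t klen = htons(keylen);
    memcpy(buf + 2, &klen, 2);            /* key length */
    uint16_t vb = htons(vbucket);
    memcpy(buf + 6, &vb, 2);              /* vbucket id */
    uint32_t blen = htonl(keylen);        /* total body = key only */
    memcpy(buf + 8, &blen, 4);
    uint32_t opq = htonl(opaque);
    memcpy(buf + 12, &opq, 4);            /* opaque lets us match responses to keys */
    memcpy(buf + 24, key, keylen);
    return (size_t)24 + keylen;
}

int main(void)
{
    const char *keys[] = { "key-1", "key-2", "key-3" };
    uint8_t buf[4096];
    size_t off = 0;

    /* Queue one quiet getMeta per key; quiet commands suppress "not found"
     * responses, so only existing keys produce replies. */
    for (uint32_t i = 0; i < 3; ++i)
        off += pack_request(buf + off, CMD_GETQ_META, 0,
                            keys[i], (uint16_t)strlen(keys[i]), i);

    /* Terminate with a NOOP; its response tells the client the batch is done. */
    off += pack_request(buf + off, CMD_NOOP, 0, "", 0, 0xffffffffu);

    printf("prepared a %zu-byte batch to send in one write()\n", off);
    return 0;
}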
Comment by Chiyoung Seo [ 30/Apr/14 ]
Changing the fix version to the feature backlog, given that the 3.0 feature-complete date has already passed and this is requested for the QE testing framework.




[MB-10993] Cluster Overview - Usable Free Space documentation misleading Created: 29/Apr/14  Updated: 29/Apr/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Jim Walker Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: documentation
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Issue relates to:
 http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#viewing-cluster-summary

I was working through a support case and trying to explain the cluster overview free space and usable free space.

The following statement is from our documentation. After a code review of ns_server, I concluded that it is incorrect.

Usable Free Space:
The amount of usable space for storing information on disk. This figure shows the amount of space available on the configured path after non-Couchbase files have been taken into account.

The correct statement should be

Usable Free Space:
The amount of usable space for storing information on disk. This figure is derived from the node with the least available storage in the cluster; the final value is that node's free space multiplied by the number of nodes in the cluster.


This change matters because users need to understand why Usable Free Space can be less than Free Space. The cluster considers all nodes to be equal. If you have a "weak" node in the cluster, e.g. one with a small disk, then all the cluster nodes have to keep their storage under the weaker node's limits; otherwise, for example, we could never fail over to the weak node, as it cannot take on the job of a stronger node. When Usable Free Space is less than Free Space, the user may actually want to see why a node has less storage available.
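
A small worked example of the calculation as described above (the values are made up): Usable Free Space is the free space on the node with the least available storage, multiplied by the number of nodes.

#include <stdint.h>
#include <stdio.h>

/* Usable Free Space = (smallest per-node free space) * (number of nodes). */
static uint64_t usable_free_space(const uint64_t *node_free, size_t n)
{
    uint64_t least = node_free[0];
    for (size_t i = 1; i < n; ++i)
        if (node_free[i] < least)
            least = node_free[i];
    return least * n;
}

int main(void)
{
    /* Three nodes with 400 GB, 350 GB and 100 GB free. Free Space sums to
     * 850 GB, but Usable Free Space is 3 * 100 GB = 300 GB, because the
     * cluster treats every node as no stronger than the weakest one. */
    uint64_t free_gb[] = { 400, 350, 100 };
    printf("usable free space: %llu GB\n",
           (unsigned long long)usable_free_space(free_gb, 3));
    return 0;
}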




[MB-10944] Support of stale=false queries Created: 23/Apr/14  Updated: 18/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3, cbq-DP4
Fix Version/s: cbq-DP3
Security Level: Public

Type: Story Priority: Critical
Reporter: Pavel Paulau Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
stale=false queries in the view engine are not really consistent, but they are critical for competitive benchmarking.

 Comments   
Comment by Gerald Sangudi [ 23/Apr/14 ]
Manik,

Please add a -stale parameter to the REST API for cbq-engine. The parameter should accept true, false, and update-after as values.

Please include this fix in the DP3 bugfix release.

Thanks.




[MB-10920] unable to start tuq if there are no buckets Created: 22/Apr/14  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
The node is initialized but has no buckets:
[root@kiwi-r116 tuqtng]# ./tuqtng -couchbase http://localhost:8091
10:26:56.520415 Info line disabled false
10:26:56.522641 FATAL: Unable to run server, err: Unable to access site http://localhost:8091, err: HTTP error 401 Unauthorized getting "http://localhost:8091/pools": -- main.main() at main.go:76




[MB-10914] {UPR} ::Control connection to memcached on 'ns_1@IP' disconnected with some other crashes before upr_replicator:init/1, upr_proxy:init/1, replication_manager:init/1 Created: 21/Apr/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Meenakshi Goel Assignee: Ketaki Gangal
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-594-rel

Attachments: Text File log.txt     File lsof_memcached.rtf    
Issue Links:
Dependency
depends on MB-11378 Cluster gets stuck in warmup state af... Closed
Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Link:
http://qa.sc.couchbase.com/job/ubuntu_x64--65_01--view_query_negative-P1/17/console

Notes:
The failed test will not deterministically reproduce the error.
No core dumps were observed on the machines.
Please refer log file log.txt attached.

Logs:
[user:info,2014-04-21T1:42:50.417,ns_1@172.23.106.196:ns_memcached-default<0.22998.7>:ns_memcached:terminate:821]Control connection to memcached on 'ns_1@172.23.106.196' disconnected: {badmatch,
                                                                        {error,
                                                                         couldnt_connect_to_memcached}}
[error_logger:error,2014-04-21T1:42:50.419,ns_1@172.23.106.196:error_logger<0.6.0>:ale_error_logger_handler:log_msg:119]** Generic server <0.22998.7> terminating
** Last message in was {'EXIT',<0.23030.7>,
                           {badmatch,{error,couldnt_connect_to_memcached}}}
** When Server state == {state,1,0,0,
                               {[],[]},
                               {[],[]},
                               {[],[]},
                               connected,
                               {1398,69765,399403},
                               "default",#Port<0.424036>,
                               {interval,#Ref<0.0.28.134215>},
                               [{<0.23031.7>,#Ref<0.0.28.136461>},
                                {<0.23029.7>,#Ref<0.0.28.134508>},
                                {<0.23032.7>,#Ref<0.0.28.134245>}],
                               []}
** Reason for termination ==
** {badmatch,{error,couldnt_connect_to_memcached}}

[error_logger:error,2014-04-21T1:42:50.420,ns_1@172.23.106.196:error_logger<0.6.0>:ale_error_logger_handler:log_report:115]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_memcached:init/1
    pid: <0.22998.7>
    registered_name: []
    exception exit: {badmatch,{error,couldnt_connect_to_memcached}}
      in function gen_server:init_it/6
    ancestors: ['single_bucket_sup-default',<0.22992.7>]
    messages: []
    links: [<0.23029.7>,<0.23031.7>,<0.23032.7>,<0.307.0>,<0.22993.7>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 196418
    stack_size: 24
    reductions: 22902
  neighbours:
    neighbour: [{pid,<0.23032.7>},
                  {registered_name,[]},
                  {initial_call,{erlang,apply,['Argument__1','Argument__2']}},
                  {current_function,{gen,do_call,4}},
                  {ancestors,['ns_memcached-default',
                              'single_bucket_sup-default',<0.22992.7>]},
                  {messages,[]},
                  {links,[<0.22998.7>,#Port<0.424043>]},
                  {dictionary,[]},
                  {trap_exit,false},
                  {status,waiting},
                  {heap_size,46368},
                  {stack_size,24},
                  {reductions,4663}]
    neighbour: [{pid,<0.23031.7>},
                  {registered_name,[]},
                  {initial_call,{erlang,apply,['Argument__1','Argument__2']}},
                  {current_function,{gen,do_call,4}},
                  {ancestors,['ns_memcached-default',
                              'single_bucket_sup-default',<0.22992.7>]},
                  {messages,[]},
                  {links,[<0.22998.7>,#Port<0.424044>]},
                  {dictionary,[]},
                  {trap_exit,false},
                  {status,waiting},
                  {heap_size,10946},
                  {stack_size,24},
                  {reductions,44938}]
    neighbour: [{pid,<0.23029.7>},
                  {registered_name,[]},
                  {initial_call,{erlang,apply,['Argument__1','Argument__2']}},
                  {current_function,{gen,do_call,4}},
                  {ancestors,['ns_memcached-default',
                              'single_bucket_sup-default',<0.22992.7>]},
                  {messages,[]},
                  {links,[<0.22998.7>,#Port<0.424045>]},
  {dictionary,[]},
                  {trap_exit,false},
                  {status,waiting},
                  {heap_size,10946},
                  {stack_size,24},
                  {reductions,11546}]


Uploading logs.

 Comments   
Comment by Meenakshi Goel [ 21/Apr/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-10914/fd10746b/172.23.106.196-4212014-20-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/e390d261/172.23.106.197-4212014-22-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/29d793bb/172.23.106.198-4212014-24-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/d5568c08/172.23.106.199-4212014-26-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/22245367/172.23.106.200-4212014-29-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/5fabfeeb/172.23.106.201-4212014-211-diag.zip
Comment by Meenakshi Goel [ 24/Apr/14 ]
Observing this issue with the latest build 3.0.0-605-rel too, along with some other errors; I am not sure if they are related. Uploading the latest logs in case they help.
Promoting it to Blocker because after this issue occurs the remaining tests fail.
http://qa.sc.couchbase.com/job/ubuntu_x64--65_01--view_query_negative-P1/19/consoleFull

https://s3.amazonaws.com/bugdb/jira/MB-10914/dde3511a/172.23.106.196-4242014-155-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/104c02cc/172.23.106.197-4242014-157-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/49a9d481/172.23.106.198-4242014-158-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/05abe0fe/172.23.106.199-4242014-158-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/84299c36/172.23.106.200-4242014-159-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/038a4375/172.23.106.201-4242014-20-diag.zip
Comment by Abhinav Dangeti [ 24/Apr/14 ]
The test that seems to be failing is failing with this exception:
DesignDocCreationException: Error occured design document _design/test_view-d9c3c3d
Comment by Abhinav Dangeti [ 24/Apr/14 ]
Hey Sarath, can you take a look at this please.
Comment by Sarath Lakshman [ 25/Apr/14 ]
I was trying to reproduce this locally, but the test was not failing. Meenakshi, is there a specific test that I can run repeatedly, other than the full negative_test conf file?

But looking at the log files, these are my observations:
When the view engine tries to start a replica design-document index process, it tries to open a UPR connection to memcached, but it seems that connection either gets closed by the server or cannot be opened at all.


165742 [couchdb:error,2014-04-24T1:40:31.367,ns_1@172.23.106.196:<0.10689.12>:couch_log:error:42]couch_set_view_group error opening set view group `_design/test_view-481e250` (prod), signature `55ed333c5e923978b87358684c4961b8', from set `default`: {badmatch,
165743 {error,
165744 closed}}
165745 [couchdb:info,2014-04-24T1:40:31.367,ns_1@172.23.106.196:<0.10666.12>:couch_log:info:39]Set view `default`, main (prod) group `_design/test_view-481e250`, signature `55ed333c5e923978b87358684c4961b8`, terminating with reason: {{badmatch,
165746 {error,
165747 {badmatch,
165748 {error,
165749 closed}}}},
165750 [{couch_set_view_group,
165751 open_replica_group,
165752 1,
165753 [{file,
165754 "/home/buildbot/buildbot_slave/ubuntu-1204-x64-300-builder/build/build/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},


Around the same time I can see the following "too many open connections" errors in the memcached logs, so I guess memcached ran out of sockets and the open_connection call from the view-engine UPR client did not succeed.
195207 Thu Apr 24 01:40:22.804213 PDT 3: Too many open connections
195208 Thu Apr 24 01:40:22.914615 PDT 3: Too many open connections
195209 Thu Apr 24 01:40:22.927568 PDT 3: Too many open connections
195210 Thu Apr 24 01:40:22.931671 PDT 3: Too many open connections
195211 Thu Apr 24 01:40:22.935193 PDT 3: Too many open connections
195212 Thu Apr 24 01:40:22.936648 PDT 3: Too many open connections
195213 Thu Apr 24 01:40:31.367115 PDT 3: Too many open connections
195214 Thu Apr 24 01:40:31.374450 PDT 3: Too many open connections
195215 Thu Apr 24 01:40:31.379162 PDT 3: Too many open connections
Comment by Meenakshi Goel [ 25/Apr/14 ]
No; as mentioned above, the failed test will not deterministically reproduce the error.
Yes, it is observed with this conf file, as shown by the shared results from the latest build 3.0.0-605-rel.
I can try running the conf file again and share the live cluster details if that helps.
Comment by Meenakshi Goel [ 25/Apr/14 ]
Link to latest Jenkins run : http://qa.sc.couchbase.com/job/ubuntu_x64--65_01--view_query_negative-P1/21/console

Cluster Details:
Default credentials for SSH and GUI
ip:172.23.106.196
ip:172.23.106.197
ip:172.23.106.198
ip:172.23.106.199
ip:172.23.106.200
ip:172.23.106.201

Second last test in progress is constantly displaying below errors:
[2014-04-25 03:11:27,788] - [rest_client:712] ERROR - http://172.23.106.196:8092//default/_design/test_view-0a6036f error 404 reason: not_found {"error":"not_found","reason":"missing"}
[2014-04-25 03:11:27,827] - [rest_client:457] INFO - index query url: http://172.23.106.196:8092/default/_design/test_view-0a6036f/_view/test_view-0a6036f?stale=ok
[2014-04-25 03:13:07,980] - [rest_client:712] ERROR - http://172.23.106.196:8092/default/_design/test_view-0a6036f/_view/test_view-0a6036f?stale=ok error 500 reason: error {"error":"error","reason":"inconsistent_state"}
Comment by Sarath Lakshman [ 28/Apr/14 ]
It would be great if you could attach the logs for the above 6 machines to this ticket (if you have log collection automated).
Comment by Meenakshi Goel [ 28/Apr/14 ]
Please find below logs collected during this Jenkins run.
https://s3.amazonaws.com/bugdb/jira/MB-10914/0153ab9e/172.23.106.196-4252014-35-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/53091f8a/172.23.106.197-4252014-36-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/838b4ee1/172.23.106.198-4252014-37-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/4767f5f8/172.23.106.199-4252014-38-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/e3f385fe/172.23.106.200-4252014-38-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/82f1fd14/172.23.106.201-4252014-39-diag.zip

Let me know if cluster is no longer required. Thanks.
Comment by Sarath Lakshman [ 28/Apr/14 ]
Sure. We do not need that cluster anymore.
Comment by Sarath Lakshman [ 28/Apr/14 ]
It seems to be an issue with memcached running out of fds. From the view-engine side, we do not think there is any fd leak.
What is the fd limit used? ulimit -n on one of the machines showed 1024. If the fd limit is set to a high value, the ep-engine team should take a look and figure out whether there is an fd leak in memcached/ep-engine.
Comment by Meenakshi Goel [ 28/Apr/14 ]
We haven't updated the fd limit, so it should be the default value, which is currently 1024, as checked with ulimit -n.
Comment by Sarath Lakshman [ 28/Apr/14 ]
I am not sure whether the Couchbase startup scripts set a higher fd limit before starting Couchbase. Please confirm with the QE team; if they don't, please run the tests with a higher fd limit.
Comment by Sarath Lakshman [ 29/Apr/14 ]
If you could increase the fd limit to 10k and rerun the test, that would be great.
Comment by Meenakshi Goel [ 02/May/14 ]
Yes, you are right: Couchbase automatically sets ulimit -n to 10240 before starting, as confirmed by the QE team, so there is no need to set it manually.
I re-ran the test after restarting the VMs with 3.0.0-628-rel and the issue is still reproducible.
http://qa.sc.couchbase.com/job/ubuntu_x64--65_01--view_query_negative-P1/27/consoleFull

Here is the output from all VMs:
root@mulberry-s10708:~# ps -eo pid,args | grep memcached
19157 /opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json
23397 grep --color=auto memcached
root@mulberry-s10708:~# cat /proc/19157/limits | grep "open files"
Max open files 10240 10240 files
root@mulberry-s10708:~# cat /opt/couchbase/etc/couchbase_init.d | grep ulimit
    ulimit -n 10240
    ulimit -c unlimited
    ulimit -l unlimited

root@mulberry-s10709:~# ps -eo pid,args | grep memcached
19148 /opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json
23342 grep --color=auto memcached
root@mulberry-s10709:~# cat /proc/19148/limits | grep "open files"
Max open files 10240 10240 files
root@mulberry-s10709:~# cat /opt/couchbase/etc/couchbase_init.d | grep ulimit
    ulimit -n 10240
    ulimit -c unlimited
    ulimit -l unlimited

root@mulberry-s10710:~# ps -eo pid,args | grep memcached
19184 /opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json
23375 grep --color=auto memcached
root@mulberry-s10710:~# cat /proc/19184/limits | grep "open files"
Max open files 10240 10240 files
root@mulberry-s10710:~# cat /opt/couchbase/etc/couchbase_init.d | grep ulimit
    ulimit -n 10240
    ulimit -c unlimited
    ulimit -l unlimited

root@mulbery-s10711:~# ps -eo pid,args | grep memcached
19133 /opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json
23357 grep --color=auto memcached
root@mulbery-s10711:~# cat /proc/19133/limits | grep "open files"
Max open files 10240 10240 files
root@mulbery-s10711:~# cat /opt/couchbase/etc/couchbase_init.d | grep ulimit
    ulimit -n 10240
    ulimit -c unlimited
    ulimit -l unlimited

root@mulbery-s10712:~# ps -eo pid,args | grep memcached
18672 /opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json
23020 grep --color=auto memcached
root@mulbery-s10712:~# cat /proc/18672/limits | grep "open files"
Max open files 10240 10240 files
root@mulbery-s10712:~# cat /opt/couchbase/etc/couchbase_init.d | grep ulimit
    ulimit -n 10240
    ulimit -c unlimited
    ulimit -l unlimited

root@mulberry-s10713:~# ps -eo pid,args | grep memcached
 3177 /opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json
17916 grep --color=auto memcached
root@mulberry-s10713:~# cat /proc/3177/limits | grep "open files"
Max open files 10240 10240 files
root@mulberry-s10713:~# cat /opt/couchbase/etc/couchbase_init.d | grep ulimit
    ulimit -n 10240
    ulimit -c unlimited
    ulimit -l unlimited
Comment by Sarath Lakshman [ 02/May/14 ]
From the view-engine side we have taken a look at the TCP connections that we open, and they are getting closed properly. Hence the ep-engine team should take a look at why memcached has a lot of open connections, causing the system to run out of fds. In this case the view engine is unable to open a TCP connection since the system is out of fds.
Comment by Meenakshi Goel [ 02/May/14 ]
Logs with build 3.0.0-628-rel if of any help:
https://s3.amazonaws.com/bugdb/jira/MB-10914/5545b032/172.23.106.201-522014-144-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/0e6c299a/172.23.106.197-522014-146-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/c62d4f6b/172.23.106.198-522014-148-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/7f14e0cd/172.23.106.199-522014-150-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/0d09f698/172.23.106.200-522014-152-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10914/8a5cd982/172.23.106.196-522014-154-diag.zip
Comment by Sriram Melkote [ 06/May/14 ]
So the issue that needs further investigation is why there are so many connections to memcached that the machine ran out of FDs.
Comment by Abhinav Dangeti [ 06/May/14 ]
Hi Siri, I will take a look at this, and will let you know as soon as I can.
Comment by Abhinav Dangeti [ 06/May/14 ]
Meenakshi, are you by any chance running this test right now, so I could look at a live cluster?
Once we hit the issue, we can look into the /proc/[memcacheds_pid]/fd directory to see what all the open descriptors link to. This could be helpful.

I ask this because I gather that it's difficult to reproduce this issue with any one test.
Comment by Meenakshi Goel [ 07/May/14 ]
Link to latest run with 3.0.0-651-rel
http://qa.sc.couchbase.com/job/ubuntu_x64--65_01--view_query_negative-P1/31/console

Cluster Details:
Default credentials for SSH and GUI
ip:172.23.106.196
ip:172.23.106.197
ip:172.23.106.198
ip:172.23.106.199
ip:172.23.106.200
ip:172.23.106.201
Comment by Meenakshi Goel [ 07/May/14 ]
Please let us know once you are done with the cluster. Thanks!
Comment by Maria McDuff (Inactive) [ 07/May/14 ]
Meenakshi, you can release this cluster. Abhinav doesn't need it.
Comment by Abhinav Dangeti [ 07/May/14 ]
I wasn't able to catch a node when the test actually hit this error, as the test suite had already completed its course.
Is there any way we can look into the details of the file descriptors when we hit this issue, and perhaps just log it all into a file?
I'm trying to reproduce this locally as well, but haven't been successful yet.
Comment by Abhinav Dangeti [ 07/May/14 ]
Also, since it's always the last 4 tests failing, I suppose I can do that myself tomorrow.
Comment by Abhinav Dangeti [ 12/May/14 ]
Hey Meenakshi, can you please run the entire suite with the following toy build:
http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent58-3.0.0-toy-couchstore-x86_64_3.0.0-727-toy.rpm

I understand that your current tests run on Ubuntu, but if you can get some CentOS VMs, run the entire test suite with a base build to make sure you hit the same error, and then with this toy build, that would be great, because I'm not sure we can create toy builds for Ubuntu right now.
Comment by Meenakshi Goel [ 14/May/14 ]
Tested with the toy build on CentOS and observed the same behaviour:
http://qa.sc.couchbase.com/job/ubuntu_x64--65_01--view_query_negative-P1/41/consoleFull
Comment by Meenakshi Goel [ 14/May/14 ]
I would also like to mention that this issue seems to occur in the case of a 6-node cluster, as all tests passed with build 3.0.0-674-rel on a 4-node CentOS cluster.
http://qa.sc.couchbase.com/job/ubuntu_x64--65_01--view_query_negative-P1/36/consoleFull
Comment by Ketaki Gangal [ 14/May/14 ]
This bug blocks a smaller number of the existing tests (about 4), so it has been downgraded from Blocker to Critical.
Comment by Abhinav Dangeti [ 15/May/14 ]
I tried running the entire suite on my machine, I saw the last 4 tests fail on another issue (but not a fd leak in memcached)
Tests failed after timing out on this message: "INFO - View result is still not expected "

Also noticed one other strange thing: one of the 6 memcached processes (as there are 6 nodes) never died over a bunch of these tests. Cleanup at the end of each test should have killed this process, like the others, and spawned a new one at the beginning of the following test in the suite. I was able to terminate this memcached process with a SIGKILL.
Comment by Abhinav Dangeti [ 15/May/14 ]
To make sure that a stagnating memcached process is not causing this issue, Meenakshi, Ketaki: can we instrument a kill command for every memcached process (on every node) at the end of every test (considered in this suite)?
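
A minimal sketch of such an instrumented kill, assuming passwordless root SSH to the nodes and that the harness supplies the node list (both assumptions; the IPs are the ones listed earlier in this ticket):

# Run at the end of every test, once per node; ns_server supervises memcached
# and will respawn it automatically after the kill.
for node in 172.23.106.196 172.23.106.197 172.23.106.198 172.23.106.199 172.23.106.200 172.23.106.201; do
    ssh root@$node 'pkill -9 memcached'
done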
Comment by Ketaki Gangal [ 16/May/14 ]
root@mulberry-s10708:/opt/couchbase/bin# ./cbstats localhost:11210 all | grep curr_
 curr_connections: 24
 curr_conns_on_port_11207: 2
 curr_conns_on_port_11209: 433
 curr_conns_on_port_11210: 3
 curr_items: 0
 curr_items_tot: 0
 curr_temp_items: 0
 vb_active_curr_items: 0
 vb_pending_curr_items: 0
 vb_replica_curr_items: 0

Between every test, port 11209 (ns_server) keeps showing a larger count of open connections.

The lsof output does not tally with this, however:

root@mulberry-s10708:/opt/couchbase/bin# lsof -np 18528 | wc
    128 1203 13653

Running this on the current cluster, will add more details.
Comment by Ketaki Gangal [ 20/May/14 ]
The curr_conns stat for port 11209 (used by ns_server to collect server info for replication, views, UPR, etc.) shows many open connections, ~1K:

root@mulberry-s10708:/opt/couchbase/bin# ./cbstats localhost:11210 all | grep curr_
 curr_connections: 8
 curr_conns_on_port_11207: 2
 curr_conns_on_port_11209: 999
 curr_conns_on_port_11210: 3
 curr_items: 0
 curr_items_tot: 0
 curr_temp_items: 0
 vb_active_curr_items: 0
 vb_pending_curr_items: 0
 vb_replica_curr_items: 0

ps aux | grep mem
1000 14271 19.9 3.1 791320 127420 ? Ssl 11:28 34:13 /opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json

The lsof output for memcached shows a much smaller number: https://www.couchbase.com/issues/secure/attachment/20551/lsof_memcached.rtf
The above stats and the actual open connections seem to be off by a large number.

Can someone from the ep-engine team take a look at why this is accounted differently?

While the above indicates some issue, QE will be changing the test cleanup: with the current implementation the tests always retain the same memcached pid across every test. Change this by either
1. Killing memcached explicitly between tests
2. Using self.input.param("forceEject", True) for the test cleanup, which ought to ensure correct cleanup of the node (master and non-master) at the end of each test.


Comment by Chiyoung Seo [ 29/May/14 ]
David,

Abhinav has other test blockers now. Can you please take a look at this issue?
Comment by David Liao [ 02/Jun/14 ]
I no longer see the "Too many open connections" or "disconnection error" when running the test. I think that's because the memcached process is restarted between tests and that resolved this issue.

Regarding the connection stats: port 11209 is used by ns_server internally to maintain its own accounting, which ep-engine is not aware of, so it's hard to conclude that there is anything wrong there. Anyway, the original issue is resolved.

Comment by Aleksey Kondratenko [ 02/Jun/14 ]
If the only thing we did is start restarting memcached between tests, then we cannot say the "original issue is fixed". Because if it's not, then folks in production will have this very critical problem.
Comment by David Liao [ 04/Jun/14 ]
I think you have a point. But as of now, the issue can't be reproduced with this particular test. QE may construct a dedicated test case to push memcached to its connection limit and make sure the client behaves properly, and then we'll have a better idea of whether it's a client or a server issue.
Comment by Aleksey Kondratenko [ 04/Jun/14 ]
Ok.

I'd suggest you make this a release blocker.

Plus, if you want any more tests, I'd suggest you assign this back to QE.

But note that I've seen with my own eyes that this is not a client issue. It cannot be, because the connection counts reported by lsof and netstat were not matching memcached's stats for the number of port 11209 connections.
Comment by David Liao [ 04/Jun/14 ]
QE, please construct a test case to reproduce the "too many connections" issue without having to run for a few hours -- maybe by configuring a smaller max connection limit.
Comment by Ketaki Gangal [ 05/Jun/14 ]
Hi David,

Can someone from dev let us know how this param can be tuned to a smaller number? We can work on a test case accordingly.
Comment by Abhinav Dangeti [ 05/Jun/14 ]
I think you can set the file descriptor limit to a lower value using ulimit?
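
A minimal sketch of that suggestion, assuming memcached is started manually for a test run (outside the normal service supervision) so that the shell's limit applies to it; the config path is the one shown in the ps output earlier in this ticket:

ulimit -n 256    # cap open file descriptors for this shell and its children
/opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json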
Comment by Aleksey Kondratenko [ 05/Jun/14 ]
This bug has _nothing_ to do with OS fd limit. It's some "refcount leak" in memcached.

Here's how you can change the connection limit for port 11209. Post the following to /diag/eval:

NewValue = 99,
ns_config:update_key(
    {node, node(), memcached_config},
    fun (V) ->
        misc:rewrite(
            fun ([{host, _}, {port, dedicated_port}, {maxconn, _}] = KV) ->
                    {stop, lists:keyreplace(maxconn, 1, KV, {maxconn, NewValue})};
                (Other) ->
                    continue
            end, V)
    end).
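
For reference, one way to post that snippet to the /diag/eval endpoint (host, credentials and file name are placeholders):

# Save the snippet above as maxconn.erl, then POST it to any node's diag/eval
# endpoint using the Administrator credentials.
curl -u Administrator:password -X POST --data-binary @maxconn.erl \
    http://127.0.0.1:8091/diag/eval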
Comment by Sriram Melkote [ 16/Jun/14 ]
Ketaki, will you retry this soon?




[MB-10907] Missing UPR config :: UUID difference observed in (active vs replica) vbuckets after online upgrade 2.5.1 ==> 3.0.0-593 Created: 19/Apr/14  Updated: 19/Jun/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Parag Agarwal Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: GZip Archive online_upgrade_logs_2.5.1_3.0.0.tar.gz    
Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: 1. Install 2.5.1 on 10.6.2.144, 10.6.2.145
2. Add 10.6.2.145 to 10.6.2.144 and rebalance
3. Add default bucket with 1024 vbuckets to cluster
4. Add ~ 1000 items to the buckets
5. Online upgrade cluster to 3.0.0-593 with 10.6.2.146 as our extra node
6. Finally cluster has node 10.6.2.145, 10.6.2.146

Check the vbucket UUIDs for the active and replica vbuckets after all replication is complete and the disk queue has drained.

Expectation: They should be the same, according to UPR.

Actual Result: Different, as observed. Without an upgrade they are the same.

Note: In the build we tested, UPR was not turned on due to the missing COUCHBASE_REPL_TYPE = upr setting, and we were operating with TAP. This case will occur during upgrade, and we have to fix the config before upgrading.

Example of difference in UUID

On 10.6.2.145 where vb_9 is active
 vb_9:high_seqno: 14

 vb_9:purge_seqno: 0

 vb_9:uuid: 18881518640852

On 10.6.2.146 where vb_9 is replica
 vb_9:high_seqno: 14

 vb_9:purge_seqno: 0

 vb_9:uuid: 120602843033209





Is this a Regression?: No

 Comments   
Comment by Aliaksey Artamonau [ 22/Apr/14 ]
I can't find any evidence that there was even an attempt to upgrade replications to UPR after rebalance. I assume that you forgot to set the COUCHBASE_REPL_TYPE environment variable accordingly.
Comment by Parag Agarwal [ 22/Apr/14 ]
Isn't UPR switched on by default for a version like 3.0.0-593, or do we need to set it explicitly?
Comment by Parag Agarwal [ 22/Apr/14 ]
I will re-run the scenario after adding it to the config file and let you know the results. But if this is not on by default, I think we should turn it on, to avoid this scenario at least.
Comment by Aliaksey Artamonau [ 22/Apr/14 ]
You need to set it explicitly.
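
For what it's worth, a rough sketch of setting it on a single test node, assuming the couchbase-server script propagates its environment to ns_server and default Linux paths (running in the foreground is only for a quick check):

sudo /etc/init.d/couchbase-server stop
# Start one node with the replication type forced to upr
sudo -u couchbase env COUCHBASE_REPL_TYPE=upr /opt/couchbase/bin/couchbase-server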
Comment by Aliaksey Artamonau [ 22/Apr/14 ]
Please also note that currently it's known that upgrade to UPR is broken: MB-10928.
Comment by Parag Agarwal [ 22/Apr/14 ]
Thanks for the update. I am going to change the scope of this bug, since the issue observed was the missing COUCHBASE_REPL_TYPE = upr setting.
Comment by Parag Agarwal [ 22/Apr/14 ]
Changing scope of the bug as per comments from the dev. We need to fix the config of our install package to switch on UPR by default.
Comment by Aleksey Kondratenko [ 08/May/14 ]
Given it's "retest after upr is default", I'm moving it from dev.

There is nothing dev needs to do with this right now
Comment by Parag Agarwal [ 08/May/14 ]
Re-assigned the bug to Tony since he handles functional upgrade tests. Thanks, Alk! Is there a bug open for this? Can you please add it here?
Comment by Anil Kumar [ 19/Jun/14 ]
Tony - Please update the ticket if you have tested with recent builds.




[MB-10898] [Doc] Password encryption between Client and Server for Admin ports credentials Created: 18/Apr/14  Updated: 29/May/14  Due: 23/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Anil Kumar Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: 3.0-Beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Flagged:
Release Note

 Description   
Password encryption between Client and Server for Admin ports credentials

http://www.couchbase.com/issues/browse/MB-10088
http://www.couchbase.com/issues/browse/MB-9198




[MB-10899] [Doc] Support immediate and eventual consistency level for indexes (stale=false) Created: 18/Apr/14  Updated: 29/May/14  Due: 23/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: 3.0-Beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Flagged:
Release Note

 Description   
Support immediate and eventual consistency level for indexes (stale=false)






[MB-10902] [Doc] Progress indicator for Warm-up Operation Created: 18/Apr/14  Updated: 29/May/14  Due: 23/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: 3.0-Beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Flagged:
Release Note

 Description   
Progress indicator for Warm-up Operation -

http://www.couchbase.com/issues/browse/MB-8989




[MB-10893] [Doc] XDCR - pause and resume Created: 18/Apr/14  Updated: 29/May/14  Due: 23/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: 3.0-Beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Flagged:
Release Note

 Description   
XDCR - pause and resume

https://www.couchbase.com/issues/browse/MB-5487




[MB-10834] update the license.txt for enterprise edition for 2.5.1 Created: 10/Apr/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Microsoft Word 2014-04-07 EE Free Clickthru Breif License.docx    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
document attached.

 Comments   
Comment by Phil Labee [ 10/Apr/14 ]
2.5.1 has already been shipped, so this file can't be included.

Is this for 3.0.0 release?
Comment by Phil Labee [ 10/Apr/14 ]
voltron commit: 8044c51ad7c5bc046f32095921f712234e74740b

uses the contents of the attached file to update LICENSE-enterprise.txt on the master branch.




[MB-10823] Log failed/successful login with source IP to detect brute force attacks Created: 10/Apr/14  Updated: 18/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Cihan Biyikoglu [ 18/Jun/14 ]
http://www.couchbase.com/issues/browse/MB-11463 for covering ports 11209 or 11211.




[MB-10821] optimize storage of larger binary object in couchbase Created: 10/Apr/14  Updated: 10/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10262] corrupted key in data file rolls backwards to an earlier version or disappears without detection Created: 19/Feb/14  Updated: 06/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Matt Ingenthron Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: corrupt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Observed on Mac OS X, but presumed to affect all versions

Triage: Triaged
Flagged:
Release Note

 Description   
By shutting down Couchbase Server, intentionally corrupting one recently stored key, then starting up the server and trying to read said key, an older version of that key is seen. The corruption wasn't logged (that I could find).

Note, the actual component here is couchstore.

Steps to reproduce:
1) Add a new document to a given bucket. Call the key something known, like "corruptme"
2) Edit the document once (so you'll have two versions of it)
3) Shut down the server
4) grep for that string in the vbucket data files
5) Edit the vbucket file for the given key. Change "corruptme" to "corruptm3" (see the sketch after these steps)
6) Start the server
7) Perform a get for the given key (with cbc or the like)
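
A minimal sketch of steps 4 and 5, assuming a default Linux install and a bucket named "default"; the file name is illustrative, and sed only works here because the replacement keeps the same length (a hex editor is safer):

cd /opt/couchbase/var/lib/couchbase/data/default
grep -l corruptme *.couch.*                  # step 4: find the vbucket file holding the key
sed -i 's/corruptme/corruptm3/' 123.couch.1  # step 5: corrupt the stored key in place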

Expected behavior: either the right key is returned (assumes replicated metadata) or an error is returned.

Observed behavior: the old version of the key is returned.


The probability of encountering this goes up dramatically in environments where there are many nodes and disks.

Related reading:
http://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf




[MB-10261] document a set of rules for how to handle various view requests Created: 19/Feb/14  Updated: 04/Apr/14

Status: In Progress
Project: Couchbase Server
Component/s: documentation, ns_server
Affects Version/s: 2.1.0, 2.2.0, 2.1.1, 2.5.0, 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Matt Ingenthron Assignee: Jeff Morris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-9915 capi layer is not sending view reques... Resolved
Triage: Untriaged

 Description   
With the initial 2.0 release, the understood contract between the client library and the cluster was that the client library would send requests and the cluster would handle executing those requests and sending responses. Over the development of the 2.0 series, that contract has changed to handle certain cases relating to new nodes, leaving nodes, and failures.

At this point in time, there are a few situations we may encounter (and presumed rules):
- 200 response (good, just pass results back)
- 301/302 response (follow the "redirect", possibly trigger a configuration update)
- 404 response (possibly retry on another node... see derived rules)
- 5xx response (possibly retry on another node with a backoff... see derived rules)

See the discussion in MB-9915, where a 500 was encountered, and the rules that have been derived in Java:
https://github.com/couchbase/couchbase-java-client/blob/master/src/main/java/com/couchbase/client/http/HttpResponseCallback.java#L144

This bug is to document the set of rules for clients, which should become part of this doc:
http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#querying-using-the-rest-api
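
For illustration only, a rough shell sketch of the derived rules above (host, bucket, design doc and view names are placeholders; real client libraries retry against a different node rather than the same one):

# Query a view on the CAPI port (8092) and branch on the status code
for attempt in 1 2 3; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
        "http://127.0.0.1:8092/default/_design/dev_ddoc/_view/by_id?limit=10")
    case "$code" in
        200)     echo "pass results back"; break ;;
        301|302) echo "follow redirect, refresh cluster config" ;;
        404)     echo "retry on another node" ;;
        5*)      sleep $attempt ;;   # back off, then retry
    esac
done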

 Comments   
Comment by Matt Ingenthron [ 19/Feb/14 ]
I've assigned this to Jeff initially since he needs to get to a set of rules for a particular user's needs. He'll draft this up and send it out for review. Once reviewed, then the docs team can incorporate it appropriately.
Comment by Jeff Morris [ 20/Feb/14 ]
Here is my first draft: https://docs.google.com/document/d/1GhRxvPb7xakLL4g00FUi6fhZjiDaP33DTJZW7wfSxrI/edit#

I used the rules provided in the Java HttpResponseCallback.java class as baseline.
Comment by Jeff Morris [ 27/Feb/14 ]
Patch set ticket: https://www.couchbase.com/issues/browse/NCBC-407
Comment by Jeff Morris [ 27/Feb/14 ]
Patchset: http://review.couchbase.org/#/c/34007/




[MB-10084] Sub-Task: Changes required for Data Encryption in Client SDK's Created: 30/Jan/14  Updated: 28/May/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Andrei Baranouski
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
depends on JCBC-441 add SSL support in support of Couchba... Open
depends on CCBC-344 add support for SSL to libcouchbase i... Resolved
depends on NCBC-424 Add SSL support in support of Couchba... Resolved

 Description   
Changes required for Data Encryption in Client SDK's

 Comments   
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Wanted to make sure we agree this will be in 3.0. Matt, any concerns?
Thanks.
Comment by Matt Ingenthron [ 20/Mar/14 ]
This should be closed in favor of the specific project issues. That said, the description is a bit fuzzy. Is this SSL support for memcached && views && any cluster management?

Please clarify and then we can open specific issues. It'd be good to have a link to functional requirements.
Comment by Matt Ingenthron [ 20/Mar/14 ]
And Cihan: it can't be "in 3.0", unless you mean concurrent release or release prior to 3.0 GA. Is that what you mean? I'd actually aim to have this feature support in SDKs prior to 3.0's release and we are working on it right now, though it has some other dependencies. See CCBC-344, for example.
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Thanks Matt. I meant the 3.0-paired client SDK release, so prior or shortly after is all good for me.
Context: we are doing a pass to clean up JIRA and would like to button up what's in and out for 3.0.
Comment by Cihan Biyikoglu [ 24/Mar/14 ]
Matt, is there a client-side reference implementation you guys did for this one? It would be good to pass that on to the test folks for initial validation until you guys completely integrate, so no regressions creep up while we march to GA.
Thanks.
Comment by Matt Ingenthron [ 24/Mar/14 ]
We did verification with a non-mainline client since that was the quickest way to do so and have provided that to QE. Also, Brett filed a bug around HTTPS with ns-server and streaming configuration replies. See MB-10519.

We'll do a mainline client with libcouchbase and the python client as soon as its dependency for handling packet IO is done. This is under CCBC-298 and CCBC-301, among others.




[MB-10012] cbrecovery hangs in the case of multi-bucket case Created: 24/Jan/14  Updated: 17/Jun/14

Status: Open
Project: Couchbase Server
Component/s: test-execution
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Venu Uppalapati Assignee: Ashvinder Singh
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive cbrecovery1.zip     Zip Archive cbrecovery2.zip     Zip Archive cbrecovery3.zip     Zip Archive cbrecovery4.zip     Zip Archive cbrecovery_source1.zip     Zip Archive cbrecovery_source2.zip     Zip Archive cbrecovery_source3.zip     Zip Archive cbrecovery_source4.zip     PNG File recovery.png    
Issue Links:
Relates to
Triage: Triaged
Operating System: Centos 64-bit

 Description   
2.5.0-1055

During verification of MB-9967 I performed the same steps:
source cluster: 3 nodes, 4 buckets
destination cluster: 3 nodes, 1 bucket
failover 2 nodes on destination cluster(without rebalance)

cbrecovery hangs on

[root@centos-64-x64 ~]# /opt/couchbase/bin/cbrecovery http://172.23.105.158:8091 http://172.23.105.159:8091 -u Administrator -U Administrator -p password -P password -b RevAB -B RevAB -v
Missing vbuckets to be recovered:[{"node": "ns_1@172.23.105.159", "vbuckets": [513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023]}]
2014-01-22 01:27:59,304: mt cbrecovery...
2014-01-22 01:27:59,304: mt source : http://172.23.105.158:8091
2014-01-22 01:27:59,305: mt sink : http://172.23.105.159:8091
2014-01-22 01:27:59,305: mt opts : {'username': '<xxx>', 'username_destination': 'Administrator', 'verbose': 1, 'dry_run': False, 'extra': {'max_retry': 10.0, 'rehash': 0.0, 'data_only': 1.0, 'nmv_retry': 1.0, 'conflict_resolve': 1.0, 'cbb_max_mb': 100000.0, 'try_xwm': 1.0, 'batch_max_bytes': 400000.0, 'report_full': 2000.0, 'batch_max_size': 1000.0, 'report': 5.0, 'design_doc_only': 0.0, 'recv_min_bytes': 4096.0}, 'bucket_destination': 'RevAB', 'vbucket_list': '{"172.23.105.159": [513]}', 'threads': 4, 'password_destination': 'password', 'key': None, 'password': '<xxx>', 'id': None, 'bucket_source': 'RevAB'}
2014-01-22 01:27:59,491: mt bucket: RevAB
2014-01-22 01:27:59,558: w0 source : http://172.23.105.158:8091(RevAB@172.23.105.156:8091)
2014-01-22 01:27:59,559: w0 sink : http://172.23.105.159:8091(RevAB@172.23.105.156:8091)
2014-01-22 01:27:59,559: w0 : total | last | per sec
2014-01-22 01:27:59,559: w0 batch : 1 | 1 | 15.7
2014-01-22 01:27:59,559: w0 byte : 0 | 0 | 0.0
2014-01-22 01:27:59,559: w0 msg : 0 | 0 | 0.0
2014-01-22 01:27:59,697: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210
2014-01-22 01:27:59,719: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210
2014-01-22 01:27:59,724: w2 source : http://172.23.105.158:8091(RevAB@172.23.105.158:8091)
2014-01-22 01:27:59,724: w2 sink : http://172.23.105.159:8091(RevAB@172.23.105.158:8091)
2014-01-22 01:27:59,727: w2 : total | last | per sec
2014-01-22 01:27:59,728: w2 batch : 1 | 1 | 64.0
2014-01-22 01:27:59,728: w2 byte : 0 | 0 | 0.0
2014-01-22 01:27:59,728: w2 msg : 0 | 0 | 0.0
2014-01-22 01:27:59,738: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210
2014-01-22 01:27:59,757: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210



 Comments   
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - June 04 2014: Bin, Ashvinder, Venu, Tony




[MB-10003] [Port-configurability] Non-root instances and multiple sudo instances in a box cannot be 'offline' upgraded Created: 24/Jan/14  Updated: 27/Mar/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Unix/Linux


 Description   
Scenario
------------
As of today, we do not support offline 'upgrade' per se for packages installed by non-root/sudo users. Upgrades are usually handled by package managers. Since these are absent for non-root users and rpm cannot handle more than a single package upgrade (if there are many instances running), offline upgrades are not supported (confirmed with Bin).

ALL non-root installations will be affected by this limitation. Although a single instance running on a box under a sudo user can be offline upgraded, this cannot be extended to more than one such instance.

This is important

Workaround
-----------------
- Online upgrade (swap with nodes running latest build, take old nodes down and do clean install)
- Backup data and restore after fresh install (cbbackup and cbrestore)

Note: At this point these are mere suggestions, and both workarounds haven't been tested yet.
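
A minimal sketch of the backup/restore workaround, assuming the default root-install paths (adjust for a non-root install location) and placeholder host, credentials and bucket name:

# Back up the old instance before removing it
/opt/couchbase/bin/cbbackup http://127.0.0.1:8091 /backups/node1 -u Administrator -p password
# ... perform the fresh install of the new version ...
# Restore the data into the new instance, one bucket at a time
/opt/couchbase/bin/cbrestore /backups/node1 http://127.0.0.1:8091 -u Administrator -p password -b default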




[MB-9982] XDCR should be incremental on topology changes Created: 22/Jan/14  Updated: 05/May/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.2.0, 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dipti Borkar Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Currently, XDCR checkpoints are not replicated to the replica nodes. This means that on any topology change, XDCR needs to re-check whether each item is needed on the other side. While it may not resend the data, rechecking a large number of items is quite expensive.

We need to replicate checkpoints so that XDCR is incremental on topology changes just as it work without topology changes.


 Comments   
Comment by Cihan Biyikoglu [ 28/Jan/14 ]
Hi Junyi, does UPR help with being more resume-able in XDCR?
Comment by Junyi Xie (Inactive) [ 28/Jan/14 ]
It should be helpful. But we may not have cycles to do that in 3.0
Comment by Dipti Borkar [ 29/Jan/14 ]
We have to consider this for 3.0. This is a major problem.

Also, backlog is a bottomless pit. Let's not use it.




[MB-10146] Document editor overwrites precision of long numbers Created: 06/Feb/14  Updated: 09/May/14

Status: Reopened
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Perry Krug Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged

 Description   
Just tested this out, not sure what diagnostics to capture so please let me know.

Simple test case:
-Create new document via document editor in UI
-Document contents are:
{"id": 18446744072866779556}
-As soon as you save, the above number is rewritten to:
{
  "id": 18446744072866780000
}
-The same effect occurs if you edit a document that was inserted with the above "long" number

 Comments   
Comment by Aaron Miller (Inactive) [ 06/Feb/14 ]
It's worth noting that views will always suffer from this, as it is a limitation of JavaScript in general. Many JSON libraries have this behavior as well (even though they don't *have* to).
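
For illustration, a quick check from a shell with node installed (the exact rounded digits depend on the double representation, so none are asserted here):

# 18446744072866779556 is far above Number.MAX_SAFE_INTEGER (2^53 - 1), so any
# JavaScript-based editor has to round it to the nearest representable double.
node -e 'console.log(Number.isSafeInteger(18446744072866779556))'                      # prints: false
node -e 'console.log(JSON.parse(process.argv[1]).id)' '{"id": 18446744072866779556}'   # prints a rounded value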
Comment by Aleksey Kondratenko [ 11/Apr/14 ]
cannot fix it. Just closing. If you want to reopen, please pass it to somebody responsible for overall design.
Comment by Perry Krug [ 11/Apr/14 ]
Reopening and assigning to docs, we need this to be release noted IMO.
Comment by Ruth Harris [ 14/Apr/14 ]
Reassigning to Anil. He makes the call on what we put in the release notes for known and fixed issues.
Comment by Anil Kumar [ 09/May/14 ]
Ruth - Let's release note this for 3.0.




[MB-11346] Audit logs for User/App actions Created: 06/Jun/14  Updated: 07/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Couchbase Server should be able to produce audit logs for all user/app actions, such as login/logout events, mutations, and other bucket and security changes.






[MB-11329] uninstall couchbase server 3.0.0 on windows did not delete files completely Created: 05/Jun/14  Updated: 17/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64-bit

Attachments: PNG File ss_2014-06-05_at_11.18.37 AM.png    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 3.0.0-779 on Windows Server 2008 R2 64-bit from
this link http://factory.hq.couchbase.com:8080/job/cs_300_win6408/186/artifact/voltron/couchbase-server-enterprise-3.0.0-779.setup.exe

Then uninstall couchbase server.
When the uninstall completes, there are many files left in c:/Program Files/Couchbase/Server/var/lib/couchbase

 Comments   
Comment by Bin Cui [ 17/Jun/14 ]
It essentially means that uninstallation doesn't proceed correctly, and I think it is related to erlang processes still running after uninstallation. We need to revisit the windows build.




[MB-11328] old erlang processes were still running after uninstall couchbase server 3.0.0 on windows Created: 05/Jun/14  Updated: 17/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64-bit

Attachments: PNG File ss_2014-06-05_at_10.48.16 AM.png    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 3.0.0-779 on Windows Server 2008 R2 64-bit from
this link http://factory.hq.couchbase.com:8080/job/cs_300_win6408/186/artifact/voltron/couchbase-server-enterprise-3.0.0-779.setup.exe

Then uninstall couchbase server. In the Windows task manager, the erlang processes were still there.
These leftover erlang processes would make the UI fail to run in the next install of couchbase server on windows.

 Comments   
Comment by Bin Cui [ 09/Jun/14 ]
Most likely, the erlang process got hung and didn't respond to the exit request from the service control manager when uninstallation happened. Do we have other erlang issues during the run?
Comment by Thuan Nguyen [ 09/Jun/14 ]
Yes, we have issues during the run, since there are some extra erlang processes running.
Comment by Sriram Melkote [ 10/Jun/14 ]
Bin, can we have the installer run:

taskkill.exe /im beam.smp /f
taskkill.exe /im epmd.exe /f
taskkill.exe /im memcached.exe /f

after stopping the service and before beginning the uninstall? The epmd one is the important one; the others are just to be safe.
Comment by Bin Cui [ 10/Jun/14 ]
This is definitely a band-aid kind of fix, and it will cover up the more fatal issue, i.e. a corrupted image in the erlang process. The installer can double-check and kill these unresponsive processes, but we still need to dig deeper to find the root cause.

Since we register erlang as a service, all these processes are under the control of erlang management. Only corrupted processes will not respond to the parent process.




[MB-11314] Enhaced Authentication model for Couchbase Server for Administrators, Users and Applications Created: 04/Jun/14  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Couchbase Server will add support for authentication using various techniques, for example Kerberos, LDAP, etc.







[MB-11282] Separate stats for internal memory allocation (application vs. data) Created: 02/Jun/14  Updated: 02/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Story Priority: Critical
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
AFAIK we currently track allocations for data and the application together.

But sometimes the application (memcached / ep-engine) overhead is huge and cannot be ignored.




[MB-11250] Go-Couchbase: Provide DML APIs using CAS Created: 29/May/14  Updated: 18/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-11247] Go-Couchbase: Use password to connect to SASL buckets Created: 29/May/14  Updated: 19/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Gerald Sangudi [ 19/Jun/14 ]
https://github.com/couchbaselabs/query/blob/master/docs/n1ql-authentication.md




[MB-11214] ORDER BY clause should require LIMIT clause Created: 27/May/14  Updated: 27/May/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-11208] stats.org should be installed Created: 27/May/14  Updated: 27/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: techdebt-backlog
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
stats.org contains a description of the stats we're sending from ep-engine. It could be useful for people

 Comments   
Comment by Matt Ingenthron [ 27/May/14 ]
If it's "useful" shouldn't this be part of official documentation? I've often thought it should be. There's probably a duplicate here somewhere.

I also think the stats need stability labels applied as people may rely on stats when building their own integration/monitoring tools. COMMITTED, UNCOMMITTED, VOLATILE, etc. would be useful for the stats.

Relatedly, someone should document deprecation of TAP stats for 3.0.




[MB-11195] Support binary collation for views Created: 23/May/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
N1QL would benefit significantly if we could allow memcmp() collation for views it creates. So much so that we should consider this for a minor release after 3.0 so it can be available for N1QL beta.




[MB-11192] Snooze for 1 second during the backfill task is causing significant pauses during backup Created: 23/May/14  Updated: 24/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Critical
Reporter: Daniel Owen Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: customer, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: cbbackup --single-node
Data all memory resident.

Attachments: PNG File dropout-screenshot.png     PNG File IOThroughput-magnified.png     PNG File ThroughputGraphfromlocalhostport11210.png    
Issue Links:
Duplicate

 Description   
When performing a backup - the cbbackup process repeatedly stalls waiting on the socket for data. This can be seen in the uploaded graphs. The uploaded TCPdump output also shows the delay.

Setting the backfill/tap queue snooze to zero makes the issue go away,
i.e. modifying the sleep to zero in ep-engine/src/ep.cc, function VBCBAdaptor::VBCBAdaptor:

VBCBAdaptor::VBCBAdaptor(EventuallyPersistentStore *s,
                         shared_ptr<VBucketVisitor> v,
                         const char *l, double sleep) :
    store(s), visitor(v), label(l), sleepTime(sleep), currentvb(0)
{
sleepTime = 0.0; /* workaround: override the configured snooze so the backfill visitor never pauses */
....

Description of the cause is provided by Abhinav:

We back off (snooze) for 1 second during the backfill task when the size of the backfill/tap queue crosses a limit (which we set to 5000 as part of the initial configuration); we snooze for a second to wait for the items in the queue to drain.
So what's happening here is that since all the items are in memory, this queue fills up really fast, causing the queue size to hit the limit and thereby snoozing.




[MB-11188] RemoteMachineShellConnection.extract_remote_info doesn't work on OSX Mavericks Created: 22/May/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: test-execution
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Artem Stemkovski Assignee: Parag Agarwal
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
2 problems:

1:
executing sw_vers on ssh returns:
/Users/artem/.bashrc: line 2: brew: command not found

2:
workbook:ns_server artem$ hostname -d
hostname: illegal option -- d
usage: hostname [-fs] [name-of-host]
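
For reference, commands that do work on Mavericks and could be used instead, assuming extract_remote_info only needs the OS version and host name (the -f/-s flags are valid per the usage message above):

sw_vers -productVersion   # e.g. 10.9.2; avoids parsing the full sw_vers output
hostname -s               # short host name (-d is not supported on OSX)
hostname -f               # fully qualified name, if the domain part is needed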




[MB-11171] mem_used stat exceeds the bucket memory quota in extremely heavy DGM and highly overloaded cluster Created: 20/May/14  Updated: 21/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.1.0, 2.2.0, 2.1.1, 2.5.0, 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This issue was reported by one of our customers. Their cluster was in extremely heavy DGM (resident ratio near zero in both active and replica vbuckets) and was highly overloaded when this memory bloating issue happened.

From the logs, we saw that the number of memcached connections spiked from 300 to 3K during the period with the memory issue. However, we have not yet been able to correlate the increased number of connections to the memory bloating issue, but we plan to keep investigating by running similar workload tests.





[MB-11154] Document proper way to detect a flush success from the SDK Created: 19/May/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Critical
Reporter: Michael Nitschinger Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: