[MB-11203] SSL-enabled memcached will hang when given a large buffer containing many pipelined requests Created: 24/May/14  Updated: 29/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Mark Nunberg Assignee: Jim Walker
Resolution: Unresolved Votes: 0
Labels: memcached
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Sample code that schedules a large number of pipelined requests and flushes them over a single buffer.

#include <libcouchbase/couchbase.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
static int remaining = 0;

static void
get_callback(lcb_t instance, const void *cookie, lcb_error_t err,
    const lcb_get_resp_t *resp)
{
    printf("Remaining: %d \r", remaining);
    fflush(stdout);
    if (err != LCB_SUCCESS && err != LCB_KEY_ENOENT) {
        /* unexpected errors are deliberately ignored in this reproducer */
    }
    remaining--;
}

static void
stats_callback(lcb_t instance, const void *cookie, lcb_error_t err,
    const lcb_server_stat_resp_t *resp)
{
    printf("Remaining: %d \r", remaining);
    fflush(stdout);
    if (err != LCB_SUCCESS && err != LCB_KEY_ENOENT) {
        /* unexpected errors are deliberately ignored in this reproducer */
    }

    /* a NULL endpoint marks the final response for this STATS command */
    if (resp->v.v0.server_endpoint == NULL) {
        fflush(stdout);
        --remaining;
    }
}

#define ITERCOUNT 5000
static int use_stats = 1;

static void
do_stat(lcb_t instance)
{
    lcb_CMDSTATS cmd;
    memset(&cmd, 0, sizeof(cmd));
    lcb_error_t err = lcb_stats3(instance, NULL, &cmd);
    assert(err==LCB_SUCCESS);
}

static void
do_get(lcb_t instance)
{
    lcb_error_t err;
    lcb_CMDGET cmd;
    memset(&cmd, 0, sizeof cmd);
    LCB_KREQ_SIMPLE(&cmd.key, "foo", 3);
    err = lcb_get3(instance, NULL, &cmd);
    assert(err==LCB_SUCCESS);
}

int main(void)
{
    lcb_t instance;
    lcb_error_t err;
    struct lcb_create_st cropt = { 0 };
    cropt.version = 2;
    char *mode = getenv("LCB_SSL_MODE");
    if (mode && *mode == '3') {
        cropt.v.v2.mchosts = "localhost:11996";
    } else {
        cropt.v.v2.mchosts = "localhost:12000";
    }
    mode = getenv("USE_STATS");
    if (mode && *mode != '\0') {
        use_stats = 1;
    } else {
        use_stats = 0;
    }
    err = lcb_create(&instance, &cropt);
    assert(err == LCB_SUCCESS);


    err = lcb_connect(instance);
    assert(err == LCB_SUCCESS);
    err = lcb_wait(instance);
    assert(err == LCB_SUCCESS);
    lcb_set_get_callback(instance, get_callback);
    lcb_set_stat_callback(instance, stats_callback);
    lcb_cntl_setu32(instance, LCB_CNTL_OP_TIMEOUT, 20000000);
    int nloops = 0;

    while (1) {
        unsigned ii;
        lcb_sched_enter(instance);
        for (ii = 0; ii < ITERCOUNT; ++ii) {
            if (use_stats) {
                do_stat(instance);
            } else {
                do_get(instance);
            }
            remaining++;
        }
        printf("Done Scheduling.. L=%d\n", nloops++);
        lcb_sched_leave(instance);
        lcb_wait(instance);
        assert(!remaining);
    }
    return 0;
}


 Comments   
Comment by Mark Nunberg [ 24/May/14 ]
http://review.couchbase.org/#/c/37537/
Comment by Mark Nunberg [ 07/Jul/14 ]
Trond, I'm assigning it to you because you might be able to delegate this to another person. I can't see anything obvious in the diff since the original fix which would break it - of course my fix might not have fixed it completely but just made it work accidentally; or it may be flush-related.
Comment by Mark Nunberg [ 07/Jul/14 ]
Oh, and I found this on an older build of master (837) and on the latest checkout (currently 055b077f4d4135e39369d4c85a4f1b47ab644e22) -- I don't think anyone broke memcached; rather, the original fix was incomplete :(




[MB-11857] [System Test] Indexing stuck on initial load Created: 31/Jul/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Ketaki Gangal Assignee: Volker Mische
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Live cluster available on 10.6.2.163:8091

Build : 3.0.0-1059

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
1. Create a 7 node cluster, 2 buckets, 1 ddoc, 2 Views
2. Load 120M, 180M items on the buckets.
3. Wait for indexing to complete.

-- Indexing appears to be stuck on the cluster (over 12 hours)
--- The couchdb logs show a couple of errors from couch_set_view_updater '-load_changes'

[couchdb:error,2014-07-30T14:10:37.494,ns_1@10.6.2.167:<0.17752.67>:couch_log:error:44]Set view `default`, main (prod) group `_design/ddoc1`, received error from updater: {error,
                                                                                     vbucket_stream_already_exists}
[couchdb:error,2014-07-30T14:10:37.499,ns_1@10.6.2.167:<0.4435.98>:couch_log:error:44]Set view `default`, main group `_design/ddoc1`, doc loader error
error: function_clause
stacktrace: [{couch_set_view_updater,'-load_changes/8-fun-0-',
                 [vbucket_stream_not_found,{8,149579}],
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_set_view/src/couch_set_view_updater.erl"},
                  {line,461}]},
             {couch_upr_client,receive_events,4,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                  {line,854}]},
             {couch_upr_client,enum_docs_since,8,


[root@centos-64-x64 bin]# ./cbstats localhost:11210 all | grep upr
 ep_upr_conn_buffer_size: 10485760
 ep_upr_enable_flow_control: 1
 ep_upr_enable_noop: 1
 ep_upr_max_unacked_bytes: 524288
 ep_upr_noop_interval: 180

Attaching logs.





 Comments   
Comment by Ketaki Gangal [ 31/Jul/14 ]
Logs https://s3.amazonaws.com/bugdb/11857/bug.tar
Comment by Volker Mische [ 31/Jul/14 ]
Sarath already has a test blocker, hence I'm taking this one.
Comment by Volker Mische [ 31/Jul/14 ]
First finding: node 10.6.2.163 OOM-kills things before the view errors occur; beam.smp takes 11GB of RAM.
Comment by Volker Mische [ 31/Jul/14 ]
I was wrong; the OOM kill seems to have happened before this test was run.

Would it be possible to also get the information at which time the test was started? Sometimes the logs are trimmed, so it's hard to tell.
Comment by Meenakshi Goel [ 31/Jul/14 ]
Test was started at 2014-07-30 08:48
Comment by Volker Mische [ 31/Jul/14 ]
As the system was still running I could inspect it. The DCP client is waiting for a message (probably a close-stream one) but never receives it. It is probably in an endless loop waiting for it.

I'll add a log message when this is happening.

That's the stack trace where it is stuck:

erlang:process_info(list_to_pid("<0.1985.19>"), current_stacktrace).
{current_stacktrace,[{couch_upr_client,get_stream_event_get_reply,
                                       3,
                                       [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                                        {line,201}]},
                     {couch_upr_client,get_stream_event,2,
                                       [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                                        {line,196}]},
                     {couch_upr_client,receive_events,4,
                                       [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                                        {line,846}]},
                     {couch_upr_client,enum_docs_since,8,
                                       [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                                        {line,248}]},
                     {couch_set_view_updater,'-load_changes/8-fun-2-',12,
                                             [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_set_view/src/couch_set_view_updater.erl"},
                                              {line,510}]},
                     {lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
                     {couch_set_view_updater,load_changes,8,
                                             [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_set_view/src/couch_set_view_updater.erl"},
                                              {line,574}]},
                     {couch_set_view_updater,'-update/8-fun-2-',14,
                                             [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_set_view/src/couch_set_view_updater.erl"},
                                              {line,267}]}]}
Comment by Volker Mische [ 31/Jul/14 ]
Here's the additional log message: http://review.couchbase.org/40107

If no one objects (perhaps someone wants to take a look at the cluster), I'll ask Meenakshi to re-run the test once this is merged.




[MB-11846] Compiling breakdancer test case exceeds available memory Created: 29/Jul/14  Updated: 31/Jul/14  Due: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Chris Hillery Assignee: Trond Norbye
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
1. With memcached change 4bb252a2a7d9a369c80f8db71b3b5dc1c9f47eb9, cc1 on ubuntu-1204 quickly uses up 100% of the available memory (4GB RAM, 512MB swap) and crashes with an internal error.

2. Without Trond's change, cc1 compiles fine and never takes up more than 12% memory, running on the same hardware.

 Comments   
Comment by Chris Hillery [ 29/Jul/14 ]
Ok, weird fact - on further investigation, it appears that this is NOT happening on the production build server, which is an identically-configured VM. It only appears to be happening on the commit validation server ci03. I'm going to temporarily disable that machine so the next make-simple-github-tap test runs on a different ci server and see if it is unique to ci03. If it is I will lower the priority of the bug. I'd still appreciate some help in understanding what's going on either way.
Comment by Trond Norbye [ 30/Jul/14 ]
Please verify that the two builders have the same patch level so that we're comparing apples with apples.

It does bring up another interesting topic: should our builders just use the compiler provided with the installation, or should we have a reference compiler we use to build our code? It does seem like a bad idea to have to support a ton of different compiler revisions (including the fact that they support different levels of C++11 that we have to work around).
Comment by Chris Hillery [ 31/Jul/14 ]
This is now occurring on other CI build servers in other tests - http://www.couchbase.com/issues/browse/CBD-1423

I am bumping this back to Test Blocker and I will revert the change as a work-around for now.
Comment by Chris Hillery [ 31/Jul/14 ]
Partial revert committed to memcached master: http://review.couchbase.org/#/c/40152/ and 3.0: http://review.couchbase.org/#/c/40153/




[MB-10156] "XDCR - Cluster Compare" support tool Created: 07/Feb/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Cihan Biyikoglu Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: 2.5.1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
For the recent issues we have seen, we need a tool that can compare metadata (specifically revids) for a given replication definition in XDCR. To scale to large data sizes, being able to do this per vbucket or per doc range would be great, but we can do without these. For clarity, here is a high-level description.

Ideal case:
xdcr_compare cluster1_connectioninfo cluster1_bucketname cluster2connectioninfo cluster2_bucketname [vbucketid] [keyrange]
should return one line per docID where the cluster1 metadata and cluster2 metadata for the given key differ.
docID - cluster1_metadata cluster2_metadata

Simplification: the tool is expected to return false positives in a moving system, but we will tackle that by rerunning the tool multiple times.

 Comments   
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Aaron, do you have a timeline for this?
thanks
-cihan
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan,

For test automation/verification, can you list out the stats/metadata that we should be testing specifically?
we want to create/implement the tests accordingly.


Also -- is this tool de-coupled from the server package? or is this part of rpm/deb/.exe/osx build package?

Thanks,
Maria
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
This depends on the requirements; a tool that requires the manual collection of all data from all nodes in both clusters onto one machine (like we've done recently) could be done pretty quickly, but I imagine that may be difficult or entirely infeasible for some users.

Better would be to be able to operate remotely on clusters and only look at metadata. Unfortunately there is no *currently exposed* interface to only extract metadata from the system without also retrieving values. I may be able to work around this, but the workaround is unlikely to be simple.

Also for some users, even the amount of *metadata* may be prohibitively large to transfer all to one place, this also can be avoided, but again, adds difficulty.

Q: Can the tool be JVM-based?
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
I think it would be more feasible for this to ship separately from the server package.
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan, Aaron,

If it's de-coupled, what older versions of Couchbase would this tool support? as far back as 1.8.x? pls confirm as this would expand our backward compatibility testing for this tool.
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
Well, 1.8.x didn't have XDCR or the rev field; it can't be compatible with anything older than 2.0 since it operates mostly to check things added since 2.0.

I don't know how far back it needs to go but it *definitely* needs to be able to run against 2.2
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Agree with Aaron, let's keep this lightweight. Can we depend on Aaron for testing if this will initially be just a support tool? For 3.0, we may graduate the tool to the server-shipped category.
thanks
Comment by Sangharsh Agarwal [ 27/Feb/14 ]
Cihan, Is the Spec finalized for this tool in version 2.5.1?
Comment by Cihan Biyikoglu [ 27/Feb/14 ]
Sangharsh, for 2.5.1, we wanted to make this an "Aaron tested" tool. I believe Aaron already has the tool. Aaron?
Comment by Aaron Miller (Inactive) [ 27/Feb/14 ]
Working on it; wanted to get my actually-in-the-package 2.5.1 stuff into review first.

What I do already have is a diff tool for *files*, but it is highly inconvenient to use; this should be a tool that doesn't require collecting all data files into one place in order to use it, and can instead work against a running cluster.
Comment by Maria McDuff (Inactive) [ 05/Mar/14 ]
Aaron,

Is the tool merged into the build yet? Can you give an update please?
Comment by Cihan Biyikoglu [ 06/Mar/14 ]
2.5.1 shiproom note: Phil raised a build concern on getting this packaged with 2.5.1. The initial bar we set was not to ship this as part of the server - it was intended to be a downloadable support tool. Aaron/Cihan will re-eval and get back to shiproom.
Comment by Cihan Biyikoglu [ 15/Jun/14 ]
Aaron no longer here. assigning to Xiaomei for consideration.




[MB-10719] Missing autoCompactionSettings during create bucket through REST API Created: 01/Apr/14  Updated: 19/Jun/14

Status: Reopened
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: michayu Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File bucket-from-API-attempt1.txt     Text File bucket-from-API-attempt2.txt     Text File bucket-from-API-attempt3.txt     PNG File bucket-from-UI.png     Text File bucket-from-UI.txt    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Unless I'm not using the API correctly, there seem to be some holes in the Couchbase API – particularly with autoCompaction.

The autoCompaction parameter can be set via the UI (as long as the bucketType is couchbase).

See the following attachments:
1) bucket-from-UI.png
2) bucket-from-UI.txt

And compare with creating the bucket (with autoCompaction) through the REST API:
1) bucket-from-API-attempt1.txt
    - Reference: http://docs.couchbase.com/couchbase-manual-2.5/cb-rest-api/#creating-and-editing-buckets
2) bucket-from-API-attempt2.txt
    - Reference: http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-admin-rest-auto-compaction
3) bucket-from-API-attempt3.txt
    - Setting autoCompaction globally
    - Reference: http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-admin-rest-auto-compaction

In all cases, autoCompactionSettings is still false.


 Comments   
Comment by Anil Kumar [ 19/Jun/14 ]
Triage - June 19 2014 Alk, parag, Anil
Comment by Aleksey Kondratenko [ 19/Jun/14 ]
It works, just apparently not properly documented:

# curl -u Administrator:asdasd -d name=other -d bucketType=couchbase -d ramQuotaMB=100 -d authType=sasl -d replicaNumber=1 -d replicaIndex=0 -d parallelDBAndViewCompaction=true -d purgeInterval=1 -d 'viewFragmentationThreshold[percentage]'=30 -d autoCompactionDefined=1 http://lh:9000/pools/default/buckets

And a general hint: you can watch what the browser POSTs when it creates a bucket (or does anything else) to figure out a working (but not necessarily publicly supported) way of doing things.
Comment by Anil Kumar [ 19/Jun/14 ]
Ruth - The documentation references above need to be fixed with the correct REST API.




[MB-9632] diag / master events captured in log file Created: 22/Nov/13  Updated: 17/Feb/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Steve Yen Assignee: Ravi Mayuram
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
The information available in the diag / master events REST stream should be captured in a log (ALE?) file and hence be available to cbcollect_info and later analysis tools.

 Comments   
Comment by Aleksey Kondratenko [ 22/Nov/13 ]
It is already available in collectinfo
Comment by Dustin Sallings (Inactive) [ 26/Nov/13 ]
If it's only available in collectinfo, then it's not available at all. We lose most of the useful information if we don't run an http client to capture it continually throughout the entire course of a test.
Comment by Aleksey Kondratenko [ 26/Nov/13 ]
Feel free to submit a patch with exact behavior you need




[MB-9358] while running concurrent queries (3-5 queries) getting 'Bucket X not found.' error from time to time Created: 16/Oct/13  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 64 bit

Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
one thread gives correct result:
[root@localhost tuqtng]# curl 'http://10.3.121.120:8093/query?q=SELECT+META%28%29.cas+as+cas+FROM+bucket2'
{
    "resultset": [
        {
            "cas": 4.956322522514292e+15
        },
        {
            "cas": 4.956322525999292e+15
        },
        {
            "cas": 4.956322554862292e+15
        },
        {
            "cas": 4.956322832498292e+15
        },
        {
            "cas": 4.956322835757292e+15
        },
        {
            "cas": 4.956322838836292e+15
...

    ],
    "info": [
        {
            "caller": "http_response:152",
            "code": 100,
            "key": "total_rows",
            "message": "0"
        },
        {
            "caller": "http_response:154",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "405.41885ms"
        }
    ]
}

but in another I see
{
    "error":
        {
            "caller": "view_index:195",
            "code": 5000,
            "key": "Internal Error",
            "message": "Bucket bucket2 not found."
        }
}

cbcollect will be attached

 Comments   
Comment by Marty Schoch [ 16/Oct/13 ]
This is a duplicate, though I can't yet find the original.

We believe that under higher load the view queries time out, which we report as bucket not found (it may not be possible to distinguish).
Comment by Iryna Mironava [ 16/Oct/13 ]
https://s3.amazonaws.com/bugdb/jira/MB-9358/447a45ae/10.3.121.120-10162013-858-diag.zip
Comment by Ketaki Gangal [ 17/Oct/13 ]
Seeing these errors and frequent tuq-server crashes on concurrent queries during typical server operations like
- w/ Failovers
- w/ Backups
- w/ Indexing.

Similar server ops for single queries however seem to run okay.

Note: This is a very small number of concurrent queries (3-5); typically users may have a higher level of concurrency at the application level.




[MB-9145] Add option to download the manual in pdf format (as before) Created: 17/Sep/13  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 2.0, 2.1.0, 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
On the documentation site there is no option to download the manual in PDF format as before. We need to add this option back.

 Comments   
Comment by Maria McDuff (Inactive) [ 18/Sep/13 ]
Needed for the 2.2.1 bug fix release.




[MB-8838] Security Improvement - Connectors to implement security improvements Created: 14/Aug/13  Updated: 19/May/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Security Improvement - Connectors to implement security improvements

Spec ToDo.




[MB-9415] auto-failover in seconds - (reduced from minimum 30 seconds) Created: 21/May/12  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0, 1.8.1, 2.0, 2.0.1, 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Dipti Borkar Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 2
Labels: customer, ns_server-story
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-9416 Make auto-failover near immediate whe... Technical task Open Aleksey Kondratenko  

 Description   
including no false positives

http://www.pivotaltracker.com/story/show/25006101

 Comments   
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
At the very least it requires getting our timeout-ful cases under control. So at least splitting couchdb into a separate VM is a requirement for this. But not necessarily enough.
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
Still seeing misunderstanding on this one.

So we have a _different_ problem: even manual failover (let alone automatic) cannot succeed quickly if the master node fails. It can easily take up to 2 minutes because of our use of the erlang "global" facility, which requires us to detect that the node is dead, and erlang is tuned to detect that within 2 minutes.

Now _this_ problem is about lowering autofailover detection to 10 seconds. We can blindly make it happen today, but it will not be usable because of all sorts of timeouts happening in the cluster management layer. We have a significant proportion of CBSEs _today_ about false-positive autofailovers even with the 30 second threshold; clearly lowering it to 10 will only make it worse. Therefore my point above: we have to get those timeouts under control so that heartbeats (or whatever else we use to detect a node being unresponsive) are sent/received in a timely manner.

I would like to note however that, especially in some virtualized environments (arguably, oversubscribed ones), we saw delays as high as low tens of seconds from virtualization _alone_. Given the relatively high cost of failover in our software I'd like to point out that people could too easily abuse that feature.

The high cost of failover referred to above is this:

* you almost certainly and irrecoverably lose some recent mutations. _At least_ recent mutations, i.e. if replication is really working well. On a node that's on the edge of autofailover you can imagine replication not being "diamond-hard quick". That's cost 1.

* in order to return a node to the cluster (say the node crashed and needed some time to recover, whatever that might mean) you need a rebalance. That type of rebalance is relatively quick by design, i.e. it only moves data back to this node and nothing else. But it's still a rebalance. With upr we can possibly make it better, because its failover log is capable of rewinding just the conflicting mutations.

What I'm trying to say by "our approach appears to have a relatively high price for failover" is that this appears to be an inherent issue for a strongly consistent system. In many cases it might actually be better to wait up to a few minutes for a node to recover and restore its availability than to fail it over and pay the price of restoring cluster capacity (by rebalancing this node back or its replacement, which is irrelevant here). If somebody wants stronger availability, then other approaches which can "reconcile" changes from both the failed-over node and its replacement node look like a fundamentally better choice _for those requirements_.




[MB-4030] enable traffic for ready nodes even if not all nodes are up/healthy/ready (aka partial janitor) (was: After two nodes crashed, curr_items remained 0 after warmup for extended period of time) Created: 06/Jul/11  Updated: 20/May/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.1, 2.0, 2.0.1, 2.2.0, 2.1.1, 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
we had two nodes crash at a customer, possibly related to a disk space issue, but I don't think so.

After they crashed, the nodes warmed up relatively quickly, but immediately "discarded" their items. I say that because I see that they warmed up ~10m items, but the current item counts were both 0.

I tried shutting down the service and had to kill memcached manually (kill -9). Restarting it went through the same process of warming up and then nothing.

While I was looking around, I left it sit for a little while and magically all of the items came back. I seem to recall this bug previously where a node wouldn't be told to be active until all the nodes in the cluster were active...and it got into trouble when not all of the nodes restarted.

Diags for all nodes will be attached

 Comments   
Comment by Perry Krug [ 06/Jul/11 ]
Full set of logs at \\corp-fs1\export_support_cases\bug_4030
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
It _is_ an ns_server issue caused by the janitor needing all nodes to be up for vbucket activation. We planned a fix for 1.8.1 (now 1.8.2).
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
Fix would land as part of fast warmup integration
Comment by Perry Krug [ 18/Jul/12 ]
Peter, can we get a second look at this one? We've seen this before, and the problem is that the janitor did not run until all nodes had joined the cluster and warmed up. I'm not sure we've fixed that already...
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
Latest 2.0 will mark nodes as green and enable memcached traffic when all of them are up. So the easy part is done.

Partial janitor (i.e. enabling traffic for some nodes when others are still down/warming up) is something that will unlikely be done soon
Comment by Perry Krug [ 18/Jul/12 ]
Thanks Alk...what's the difference in behavior (in this area) between 1.x and 2.0? It "sounds" like they're the same, no?

And this bug should still remain open until we fix the primary issue which is the partial janitor...correct?
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
1.8.1 will show a node as green when ep-engine thinks it's warmed up. But confusingly it will not be really ready: all vbuckets will be in state dead and curr_items will be 0.

2.0 fixes this confusion. A node is marked green when it's actually warmed up from the user's perspective, i.e. the right vbucket states are set and it'll serve client traffic.

2.0 is still very conservative about only making vbucket state changes when all nodes are up and warmed up. That's the "impartial" janitor. Whether it's a bug or a "lack of feature" is debatable. But I think the main concern, that users are confused by the green-ness of nodes, is resolved.
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
Closing as fixed. We'll get to the partial janitor some day in the future; it's a feature we lack today, not a bug we have, IMHO.
Comment by Perry Krug [ 12/Nov/12 ]
Reopening this for the need for partial janitor. Recent customer had multiple nodes need to be hard-booted and none returned to service until all were warmed up
Comment by Steve Yen [ 12/Nov/12 ]
bug-scrub: moving out of 2.0, as this looks like a feature req.
Comment by Farshid Ghods (Inactive) [ 13/Nov/12 ]
In system testing we have noticed many times that if multiple nodes crash, the node status for those that have already warmed up appears as yellow until all nodes are warmed up.


The user won't be able to tell from the console which node has successfully warmed up, and if one node is actually not recovering or not warming up in a reasonable time they have to figure it out some other way (cbstats ...).

Another issue is that the user won't be able to perform a failover of 1 node even though N-1 nodes have already warmed up.

I am not sure if fixing this bug will impact cluster-restore functionality, but it is something important to fix, or we should suggest a workaround to the user (by workaround I mean a documented, tested and supported set of commands).
Comment by Mike Wiederhold [ 17/Mar/13 ]
Comments say this is an ns_server issue so I am removing couchbase-bucket from affected components. Please re-add if there is a couchbase-bucket task for this issue.
Comment by Aleksey Kondratenko [ 23/Feb/14 ]
Not going to happen for 3.0.




[MB-11060] Build and test 3.0 for 32-bit Windows Created: 06/May/14  Updated: 27/Jun/14  Due: 09/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build, ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Blocker
Reporter: Chris Hillery Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7/8 32-bit

Issue Links:
Dependency
Duplicate

 Description   
For the "Developer Edition" of Couchbase Server 3.0 on Windows 32-bit, we need to first ensure that we can build 32-bit-compatible binaries. It is not possible to build 3.0 on a 32-bit machine due to the MSVC 2013 requirement. Hence we need to configure MSVC as well as Erlang on a 64-bit machine to produce 32-bit compatible binaries.

 Comments   
Comment by Chris Hillery [ 06/May/14 ]
This is assigned to Trond who is already experimenting with this. He should:

 * test being able to start the server on a 32-bit Windows 7/8 VM

 * make whatever changes are necessary to the CMake configuration or other build scripts to produce this build on a 64-bit VM

 * thoroughly document the requirements for the build team to reproduce this build

Then he can assign this bug to Chris to carry out configuring our build jobs accordingly.
Comment by Trond Norbye [ 16/Jun/14 ]
Can you give me a 32-bit Windows installation I can test on? My MSDN license has expired and I don't have Windows media available (and the internal wiki page just has a limited set of licenses and no download links).

Then assign it back to me and I'll try it
Comment by Chris Hillery [ 16/Jun/14 ]
I think you can use 172.23.106.184 - it's a 32-bit Windows 2008 VM that we can't use for 3.0 builds anyway.
Comment by Trond Norbye [ 24/Jun/14 ]
I copied the full result of a build where I set target_platform=x86 on my 64-bit Windows server (the "install" directory) over to a 32-bit Windows machine and was able to start memcached, and it worked as expected.

Our installers do other magic, like installing the service etc., that is needed in order to start the full server. Once we have such an installer I can do further testing.
Comment by Chris Hillery [ 24/Jun/14 ]
Bin - could you take a look at this (figuring out how to make InstallShield on a 64-bit machine create a 32-bit compatible installer)? I won't likely be able to get to it for at least a month, and I think you're the only person here who still has access to an InstallShield 2010 designer anyway.




[MB-10838] cbq-engine must work without all_docs Created: 11/Apr/14  Updated: 29/Jun/14  Due: 07/Jul/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: tried builds 3.0.0-555 and 3.0.0-554

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
WORKAROUND: Run "CREATE PRIMARY INDEX ON <bucket>" once per bucket, when using 3.0 server

SYMPTOM: tuq returns "Bucket default not found." (caller: view_index:200) for all queries

single node cluster, 2 buckets(default and standard)
run simple query
q=FROM+default+SELECT+name%2C+email+ORDER+BY+name%2Cemail+ASC

got {u'code': 5000, u'message': u'Bucket default not found.', u'caller': u'view_index:200', u'key': u'Internal Error'}
tuq displays
[root@grape-001 tuqtng]# ./tuqtng -couchbase http://localhost:8091
22:36:07.549322 Info line disabled false
22:36:07.554713 tuqtng started...
22:36:07.554856 version: 0.0.0
22:36:07.554942 site: http://localhost:8091
22:47:06.915183 ERROR: Unable to access view - cause: error executing view req at http://127.0.0.1:8092/default/_all_docs?limit=1001: 500 Internal Server Error - {"error":"noproc","reason":"{gen_server,call,[undefined,bytes,infinity]}"}
 -- couchbase.(*viewIndex).ScanRange() at view_index.go:186


 Comments   
Comment by Sriram Melkote [ 11/Apr/14 ]
Iryna, can you please add cbcollectinfo or at least the couchdb logs?

Also, all CBQ DP4 testing must be done against 2.5.x server, please confirm it is the case in this bug.
Comment by Iryna Mironava [ 22/Apr/14 ]
cbcollect
https://s3.amazonaws.com/bugdb/jira/MB-10838/9c1cf39c/172.27.33.17-4222014-111-diag.zip

The bug is valid only for 3.0; 2.5.x versions are working fine.
Comment by Sriram Melkote [ 22/Apr/14 ]
Gerald, we need to update query code to not use _all_docs for 3.0

Iryna, workaround is to run "CREATE PRIMARY INDEX ON <bucket>" first before running any queries when using 3.0 server
Comment by Sriram Melkote [ 22/Apr/14 ]
Reducing severity with workaround. Please ping me if that doesn't work
Comment by Iryna Mironava [ 22/Apr/14 ]
works with workaround
Comment by Gerald Sangudi [ 22/Apr/14 ]
Manik,

Please modify the tuqtng / DP3 Couchbase catalog to return an error telling the user to CREATE PRIMARY INDEX. This should only happen with 3.0 server. For 2.5.1 or below, #all_docs should still work.

Thanks.




[MB-11736] add client SSL to 3.0 beta documentation Created: 15/Jul/14  Updated: 15/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Matt Ingenthron Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
This is mostly a curation exercise. Add to the server 3.0 beta docs the configuration information for each of the following clients:
- Java
- .NET
- PHP
- Node.js
- C/C++

No other SDKs support SSL at the moment.

This is either in work-in-progress documentation or in the blogs from the various DPs. Please check in with the component owner if you can't find what you need.




[MB-11738] Evaluate GIO CPU utilization on systems with 16 vCPU Created: 16/Jul/14  Updated: 16/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, performance
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-11405 Shared thread pool: high CPU overhead... Open
relates to MB-11434 600-800% CPU consumption by memcached... Closed




[MB-11299] Upr replica streams cannot send items from partial snapshots Created: 03/Jun/14  Updated: 17/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
If items are sent from a replica vbucket and those items are from a partial snapshot then we might get holes in our data.

 Comments   
Comment by Aleksey Kondratenko [ 02/Jul/14 ]
Raised to blocker. Data loss in xdcr or views is super critical IMO
Comment by Mike Wiederhold [ 10/Jul/14 ]
I agree with Alk on the severity of this issue, but I do want to note that seeing this problem will be rare. I'm planning to work on it soon, but I need to get another issue resolved first before I address this problem.
Comment by Mike Wiederhold [ 16/Jul/14 ]
There are 3 sub-tasks to complete this issue.

1. Mark in each header whether or not this commit contains a full snapshot
2. Replica to active state transition means failover entry is created at the end of the last full snapshot
3. Replica vbuckets can only stream data from closed checkpoints
Comment by Aleksey Kondratenko [ 17/Jul/14 ]
Another thing (you might be already aware but let me note it here just in case).

If the replica later has to rewind, for example as part of replicating from a new master, it must rewind to a full (not partial) snapshot to maintain the correctness of upr. So your header markers should be useful there.

Comment by Mike Wiederhold [ 17/Jul/14 ]
Right, this will be addressed by the 2nd sub-task mentioned above. If you curious how I'm going to do this let me know and I'll explain it.
Comment by Aleksey Kondratenko [ 17/Jul/14 ]
Ok. Just one final thing. From your description the 2nd task looks like a slightly different thing, i.e. the 2nd task is about picking the seqno for the failover history entry. What I'm referring to is the case where a replica continues to be a replica after failover, but has to do a rollback. In that case, per the upr spec, it needs to roll back to the latest convenient point that is before the seqno it needs to revert to and that is also a full snapshot. So if the replica ever does a rollback/rewind/whatever-you-call-it, it will need to use that marker as well.
Comment by Mike Wiederhold [ 17/Jul/14 ]
Yes, this should happen once I get all of these changes in. The replica will connect to the new master first without rolling back. It will specify the start and end seqno of the snapshot it was trying to receive, and if it cannot receive the rest of the snapshot then the new master will tell it to roll back.




[MB-10180] Server Quota: Inconsistency between documentation and CB behaviour Created: 11/Feb/14  Updated: 21/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Dave Rigby Assignee: Ruth Harris
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File MB-10180_max_quota.png    
Issue Links:
Relates to
relates to MB-2762 Default node quota is still too high Resolved
relates to MB-8832 Allow for some back-end setting to ov... Open
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
In the documentation for the product (and general sizing advice) we tell people to allocate no more than 80% of their memory for the Server Quota, to leave headroom for the views, disk write queues and general OS usage.

However on larger[1] nodes we don't appear to enforce this, and instead allow people to allocate up to 1GB less than the total RAM.

This is inconsistent, as we document and tell people one thing and let them do another.

This appears to be something inherited from MB-2762, the intent of which appeared to be to only allow relaxing this when joining a cluster; however, this doesn't appear to be how it works - I can successfully change the existing cluster quota from the CLI to a "large" value:

    $ /opt/couchbase/bin/couchbase-cli cluster-edit -c localhost:8091 -u Administrator -p dynam1te --cluster-ramsize=127872
    ERROR: unable to init localhost (400) Bad Request
    [u'The RAM Quota value is too large. Quota must be between 256 MB and 127871 MB (memory size minus 1024 MB).']

While I can see some logic to relax the 80% constraint on big machines, with the advent of 2.X features 1024MB seems far too small an amount of headroom.

Suggestions to resolve:

A) Revert to a straightforward 80% max, with a --force option or similar to allow specific customers to go higher if they know what they are doing
B) Leave current behaviour, but document it.
C) Increase the minimum headroom to something more reasonable for 2.X, *and* document the behaviour.

([1] On a machine with 128,895MB of RAM I get the "total-1024" behaviour, on a 1GB VM I get 80%. I didn't check in the code what the cutoff for 80% / total-1024 is).


 Comments   
Comment by Dave Rigby [ 11/Feb/14 ]
Screenshot of initial cluster config: maximum quota is total_RAM-1024
Comment by Aleksey Kondratenko [ 11/Feb/14 ]
Do not agree with that logic.

There's IMHO quite a bit of difference between default settings, the recommended settings limit and the allowed settings limit. The latter can be wider for folks who really know what they're doing.
Comment by Aleksey Kondratenko [ 11/Feb/14 ]
Passed to Anil, because that's not my decision to change limits
Comment by Dave Rigby [ 11/Feb/14 ]
@Aleksey: I'm happy to resolve as something other than my (A,B,C), but the problem here is that many people haven't even been aware of this "extended" limit in the system - and moreover on a large system we actually advertise it in the GUI when specifying the allowed limit (see attached screenshot).

Furthermore, I *suspect* that this was originally only intended for upgrades for 1.6.X (see http://review.membase.org/#/c/4051/), but somehow is now being permitted for new clusters.

Ultimately I don't mind what our actual max quota value is, but the app behaviour should be consistent with the documentation (and the sizing advice we give people).
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
raising to product blocker.
this inconsistency has to be resolved - PM to re-align.
Comment by Anil Kumar [ 28/May/14 ]
Going with option B - Leave current behaviour, but document it.
Comment by Ruth Harris [ 17/Jul/14 ]
I only see the 80% number coming up as an example of setting the high water mark (85% suggested). The Server Quota section doesn't mention anything. The working set management & ejection section(s) and the item pager sub-section also mention the high water mark.

Can you be more specific about where this information is? Anyway, the best solution is to add a "note" in the applicable section(s).

--ruth

Comment by Dave Rigby [ 21/Jul/14 ]
@Ruth: So the current product behaviour is that the Server Quota limit depends on the maximum memory available:

* For machines with <= X MB of memory, the maximum server quota is 80% of total physical memory
* For machines with > X MB of memory, the maximum Server Quota is Total Physical Memory - 1024.

The value of 'X' is fixed in the code, but it wasn't obvious what it actually is (it's derived from a few different things). I suggest you ask Alk, who should be able to provide the value.




[MB-11770] Re-investigate impact of xdcrMaxConcurrentReps on XDCR latency (LAN/WAN) and choose safer value for 3.0 Created: 21/Jul/14  Updated: 23/Jul/14

Status: Open
Project: Couchbase Server
Component/s: performance
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-9917] DOC - memcached should dynamically adjust the number of worker threads Created: 14/Jan/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Trond Norbye Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
4 threads is probably not ideal for a 24 core system ;)

 Comments   
Comment by Anil Kumar [ 25/Mar/14 ]
Trond - Can you explain whether this is a new feature in 3.0 or a fix to the documentation for older releases?
Comment by Ruth Harris [ 17/Jul/14 ]
Trond, Could you provide more information here and then reassign to me? --ruth
Comment by Trond Norbye [ 24/Jul/14 ]
New in 3.0 is that memcached no longer defaults to 4 threads for the frontend, but uses 75% of the number of cores reported by the system (with a minimum of 4).

There are 3 ways to tune this:

* Export MEMCACHED_NUM_CPUS=number of threads you want before starting couchbase server

* Use the -t <number> command line argument (this will go away in the future)

* specify it in the configuration file read during startup (but when started from the full server this file is regenerated every time, so you'll lose the modifications)




[MB-11548] Memcached does not handle going back in time. Created: 25/Jun/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Patrick Varley Assignee: Jim Walker
Resolution: Unresolved Votes: 0
Labels: customer, memcached
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: No

 Description   
When you change the time of the server to a time in the past while the memcached process is running, it will start expiring all documents with a TTL.

To recreate, set the date to a time in the past, for example 2 hours ago:

sudo date --set="15:56:56"

You will see that the time and uptime values from cbstats change to very large numbers:

time: 5698679116
uptime: 4294946592

Looking at the code we can see how this happens:
http://src.couchbase.org/source/xref/2.5.1/memcached/daemon/memcached.c#6462

When you change the time to a value in the past, "process_started" will be greater than "timer.tv_sec", and current_time is unsigned, which means it will wrap around.

What I do not understand from the code is why current_time is the number of seconds since memcached started and not just the epoch time (there is a comment about avoiding 64-bit).

http://src.couchbase.org/source/xref/2.5.1/memcached/daemon/memcached.c#117

In any case, we should check whether "process_started" is bigger than "timer.tv_sec" and do something smart.

I will let you decide what the smart thing is :)

 Comments   
Comment by Patrick Varley [ 07/Jul/14 ]
It would be good if we can get this fix into 3.0. Maybe a quick patch like this is good enough for now:

static void set_current_time(void) {
    struct timeval timer;

    gettimeofday(&timer, NULL);
    if (process_started < timer.tv_sec) {
        current_time = (rel_time_t) (timer.tv_sec - process_started);
    } else {
        settings.extensions.logger->log(EXTENSION_LOG_WARNING, NULL,
            "Time has gone backward shutting down to protect data.\n");
        shutdown_server();
    }
}


More than happy to submit the code for review.
Comment by Chiyoung Seo [ 07/Jul/14 ]
Trond,

Can you see if we can address this issue in 3.0?
Comment by Jim Walker [ 08/Jul/14 ]
Looks to me like clock_handler (which wakes up every second) should be looking for time going backwards. It is sampling the time every second, so it can easily see big shifts in the clock and make appropriate adjustments.

I don't think we should be shutting down though if we can deal with it, but it does open interesting questions about TTLs and gettimeofday going backwards.

Perhaps we need to adjust process_started by the shift?

Happy to pick this up, just doing some other stuff at the moment...
Comment by Patrick Varley [ 08/Jul/14 ]
clock_handler calls set_current_time which is where all the damage is done.

I agree that if we can handle it better we should not shut down. I did think about changing process_started, but that seemed a bit like a hack in my head and I cannot explain why :).
I was also wondering what should we do when time shifts forward?

I think this has some interesting effects on the stats too.
Comment by Patrick Varley [ 08/Jul/14 ]
Silly question, but why not set current_time to epoch seconds instead of doing the offset from process_started?
Comment by Jim Walker [ 09/Jul/14 ]
@patrick, this is shared code used by memcache and couchbase buckets. Note that memcache buckets store expiry as "seconds since the process started" and couch buckets store expiry as seconds since the epoch, hence why a lot of this number shuffling is occurring.
Comment by Jim Walker [ 09/Jul/14 ]
get_current_time() is used for a number of time based lock checks (see getl) and document expiry itself (both within memcached and couchbase buckets).

process_started is an absolute time stamp and can lead to incorrect expiry if the real clock jumped. Example
 - 11:00am memcached started process_started = 11:00am (ignoring the - 2second thing)
 - 11:05am ntp comes in and aligns the node to the correct data-centre time (let’s say - 1hr) time is now 10:05am
 - 10:10am clients now set documents with absolute expiry of 10:45am
 - documents instantly expire because memcached thinks they’re in the past.. client scratches head.

Ultimately we need to ensure that the functions get_current_time(), realtime() and abstime() all do sensible things if the clock is changed, e.g. don’t return large unsigned values.
 
Given all this I think the requirements are:

R1 Define a memcached time tick interval (which is 1 second)
  - set_current_time() callback executes at this frequency.

R2 get_current_time() the value returned must be shielded from clock changes.
   - If clock goes backwards, the returned value still increases by R1.
   - If clock goes forwards, the returned value still increases by R1.
   - Really this returns process uptime in seconds and the stat “uptime” is just current_time.

R3 monitor the system time for jumps (forward or backward).
   - Reset process_started to be current time if there’s a change which is greater or less than R1 ticks.

R4 Ensure documentation describes the effect of system clock changes and the two ways you can set document expiry.
  

Overall the code changes to address the issue are simple. I will also look at making testrunner tests to ensure the system behaves.
Comment by Patrick Varley [ 09/Jul/14 ]
Sounds good, a small reminder about handling VMs that are suspended.
Comment by Jim Walker [ 24/Jul/14 ]
Patch for platform http://review.couchbase.org/39811
Patch for memcached http://review.couchbase.org/39813




[MB-11812] Need a read-only mode to startup the query server Created: 24/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Don Pinto Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This is required for the tutorial in production, as we don't want any user to blow away the data or add additional data.

All DML queries should be blocked when the server is started in this mode. Only the admin should be able to start the query server in read-only mode.






[MB-11733] One node is slow during indexing Created: 15/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Volker Mische Assignee: Volker Mische
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I don't know whether this is an environmental problem or not. On Pavel's performance run with 4 nodes, one node is slow, one is halfway slow and two are normal. You can find the logs of the slow run here [1].

If you look at the current ShowFast graph [2] of the "Initial index (min), 1 bucket x 50M x 2KB, DGM, 4 x 1 views, no mutations" run ("Linux", "View Indexing" -> "Initial", second graph), it's way slower in build 956 than in build 928 (46.1s vs. 22.6s). When looking at the logs, it's node *.31 that's way slower. It is either ep-engine not providing the UPR stream messages fast enough, or the view-engine consuming them slowly.

This node has been shown to be slow in several tests, so it might even be a problem in the environment (like a slow disk).

Here's the analysis from the 4 nodes, where you can see that one is clearly way slower. The numbers on the right are the seconds between the "Backfill complete" and "Stream closing" message, the left number is how often it occurred:

cat cbcollect_info_ns_1@172.23.100.31_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}' > /tmp/31
vmx@emil$ cat cbcollect_info_ns_1@172.23.100.29_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}'|sort -n|uniq -c
    301 2
    208 3
      1 4
      1 5
      1 8
vmx@emil$ cat cbcollect_info_ns_1@172.23.100.30_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}'|sort -n|uniq -c
    169 2
     87 3
     16 4
     82 5
    119 6
     28 7
      9 8
      2 9
vmx@emil$ cat cbcollect_info_ns_1@172.23.100.31_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}'|sort -n|uniq -c
      9 5
     41 6
    146 7
    124 8
     76 9
     67 10
     29 11
     15 12
      3 13
      1 14
      1 16
vmx@emil$ cat cbcollect_info_ns_1@172.23.100.32_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}'|sort -n|uniq -c
    317 2
    195 3

[1] http://localhost:3000/job/leto/298/
[2] http://showfast.sc.couchbase.com/#/timeline
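
(For readability, the same measurement as the shell pipeline above as a small Python sketch; it assumes the matching memcached.log lines strictly alternate "Backfill complete"/"Stream closing" and that the 4th whitespace-separated field is a parseable timestamp. Requires python-dateutil.)

    from collections import Counter
    from dateutil import parser

    def backfill_durations(path):
        stamps = []
        for line in open(path):
            if '_design/A' not in line:
                continue
            if 'Backfill complete' in line or 'Stream closing' in line:
                stamps.append(parser.parse(line.split()[3]).timestamp())
        # pair each "Backfill complete" with the following "Stream closing"
        return Counter(int(end - start) for start, end in zip(stamps[::2], stamps[1::2]))

    print(backfill_durations('memcached.log'))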

 Comments   
Comment by Pavel Paulau [ 15/Jul/14 ]
MB-9822 ?
Comment by Volker Mische [ 15/Jul/14 ]
Forgot a link to the report: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-956_bba_build_init_index
Comment by Volker Mische [ 15/Jul/14 ]
Pavel, yes this sounds a lot like MB-9822. Though I wonder whether it is really an Erlang VM problem; that's what I'm trying to find out. I guess I need to add some additional logging to see whether the view-engine doesn't receive the items as fast as possible, or doesn't process them as fast as the other servers.
Comment by Volker Mische [ 15/Jul/14 ]
I can't find the UPR flow control spec, where is it (I forgot the details)?
Comment by Sriram Melkote [ 15/Jul/14 ]
https://docs.google.com/document/d/1xm43fPU0pO3EkN5xlePqBLiWy7O7f1MGz8TUHdaEZlU/edit?pli=1

Wish it was in github like other docs.
Comment by Mike Wiederhold [ 15/Jul/14 ]
Once the backfill is complete all of the items from the backfill are in memory. This means that the slowness you are reporting is from items that are already in memory. I would recommend checking the flow control logic and also looking for view engine slowness. If you suspect that the slowness is caused by ep-engine it would be good to get some information showing that messages sent per second are low or that there are large gaps in time between messages being sent.
Comment by Volker Mische [ 16/Jul/14 ]
After looking at the graphs from older builds, it really seems to be not specific to a single physical machine.

Next step is that the view-engine will get some stat which tracks how full the flow control buffer is. We hope this will give us some insights.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - July 17

This issue is from before but is happening more often now. Performance issue; we should continue looking into it.
Comment by Sriram Melkote [ 24/Jul/14 ]
Due to the impact of this on initial indexing, I'm raising this to a blocker
Comment by Volker Mische [ 30/Jul/14 ]
I found the issue: it is the CPU governor. On the slow node the cpufreq kernel module wasn't loaded; on the others it was. If I set the CPU governor on the slow node to "performance" (it is currently still in that state), then the indexing is even faster than on the other nodes. It can e.g. be seen here [1] on the graph named [172.23.100.31] beam.smp_rss; the usage drops earlier than on the other nodes.

The other nodes do some CPU scaling, I saw different values when I did a `cat /sys/devices/system/cpu/cpu10/cpufreq/cpuinfo_cur_freq`.

[1]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1057_202_build_init_index
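
(For reference, a quick way to compare the governor across cores; the sysfs paths are the standard cpufreq ones, and writing "performance" to them requires root and the cpufreq driver loaded.)

    import glob

    for f in sorted(glob.glob('/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor')):
        print(f, open(f).read().strip())
        # to pin a core: open(f, 'w').write('performance')  # needs root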
Comment by Volker Mische [ 30/Jul/14 ]
Pavel, I assign the issue to you. You can either keep using this issue or create a new one.

I propose to set all nodes to CPU governor "performance". I leave the configuration of the machines to you. If you need any help, let me know.

Once there's a run with an updated system and its results on ShowFast I'll close the issue.
Comment by Pavel Paulau [ 30/Jul/14 ]
Hi Volker,

+ your comment from email thread:
"Somehow now 172.23.100.30 is slow. The only way I could imagine that happened is that I started "cpuspeed", perhaps it changed some setting. Though I'm pretty sure the issue will go away, once everything is setup properly in regards to the cpu governors."

172.23.100.31 did demonstrate weird characteristics sometimes. I'm 100% sure that there is/was an environment issue.

But CPU governors don't explain slowness of other servers. They don't explain cases when 172.23.100.31 was using a lot of CPU resources.

I have nothing more to do on this ticket. Regular indexing tests will be executed regardless of this issue. It's up to you to follow the results and close the ticket.
Comment by Volker Mische [ 30/Jul/14 ]
Pavel, can I take the environment again? I'm sure when I set the governor to "performance" on node 172.23.100.30 manually, it will just be fast again.

What I'm asking for is that *all* nodes use the same governor, so that they perform the same way ("performance" is obviously preferred to get the best numbers).
Comment by Volker Mische [ 30/Jul/14 ]
I just read your comment again. Does it mean that I should open a new ticket that says that all nodes should use the same governor?
Comment by Volker Mische [ 30/Jul/14 ]
I should concentrate a bit more. In the issue I'm describing here, node 172.23.100.31 was never using a lot of CPU; it was always using less CPU than the other nodes.
Comment by Pavel Paulau [ 30/Jul/14 ]
I just disabled "cpuspeed" on all machines and started a set of regular tests.
I don't understand the need for CPU scaling on production-like servers.

+example of beam.smp on 172.23.100.31 with ~1700% CPU utilization: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1045_389_access
Comment by Volker Mische [ 30/Jul/14 ]
I fully agree that there's no need for CPU scaling, that's exactly what I was after. Thanks.

+in this example all nodes take about the same amount of CPU.
Comment by Pavel Paulau [ 30/Jul/14 ]
@Volker,

With disabled cpuspeed on all machines:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1059_66d_build_init_index
Comment by Volker Mische [ 30/Jul/14 ]
@Pavel, I just saw it. Please let me know whenever the cluster is free again for me to experiment with.

Now another node is slow :(
Comment by Sriram Melkote [ 30/Jul/14 ]
In case you haven't noticed, node .32 is still running the userspace governor via the acpi_cpufreq module.
Comment by Volker Mische [ 30/Jul/14 ]
@Siri, yes, I just enabled it.
Comment by Volker Mische [ 30/Jul/14 ]
Results from a new run [1], this time with the "performance" governor enabled on all nodes. Node .32 is extremely slow, and this time my script can't reproduce the issue. Now I am back where I started.

[1] http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1059_ff3_build_init_index
Comment by Volker Mische [ 30/Jul/14 ]
And here's a run which is perfectly fine: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1057_49a_build_init_index

We're back to sporadic slowdowns.
Comment by Volker Mische [ 30/Jul/14 ]
It seems to be a bug in the view engine.

I just had a slow node which didn't take a lot of CPU. Again the transfer from UPR was slow (the time between the backfilling and closing the stream). I did run my escript [1] in parallel, which does the same UPR requests as the view engine does. I saw an increase in the memcached CPU usage. The script finishes as fast as when it is run without the test in the background. This means that ep-engine is not the bottleneck. It also isn't TCP (although it's a transfer on localhost, it still could've been an issue). This leads me to the conclusion that something is wrong within the view engine when processing the data it gets from UPR. Another indication is that during this run [2] the indexing progress suddenly decreased (around 30/Jul/2014 14:53:50).

In case you take a look at the build artifacts this run should upload, please take into account that I was running my escript on two of the nodes at around 30/Jul/2014 14:49:57 several times (just in case you wonder about additional log messages that seem unrelated).

[1]: https://gist.github.com/vmx/ec04d67416276e69a02d
[2]: http://ci.sc.couchbase.com/job/leto/419/console




[MB-7250] Mac OS X App should be signed by a valid developer key Created: 22/Nov/12  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0-beta-2, 2.1.0, 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: J Chris Anderson Assignee: Wayne Siu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Build_2.5.0-950.png     PNG File Screen Shot 2013-02-17 at 9.17.16 PM.png     PNG File Screen Shot 2013-04-04 at 3.57.41 PM.png     PNG File Screen Shot 2013-08-22 at 6.12.00 PM.png     PNG File ss_2013-04-03_at_1.06.39 PM.png    
Issue Links:
Dependency
depends on MB-9437 macosx installer package fails during... Closed
Relates to
relates to CBLT-104 Enable Mac developer signing on Mac b... Open

 Description   
Currently launching the Mac OS X version tells you it's from an unidentified developer. You have to right click to launch the app. We can fix this.

 Comments   
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
Chris,

do you know what needs to change on the build machine to embed our developer key ?
Comment by J Chris Anderson [ 22/Nov/12 ]
I have no idea. I could start researching how to get a key from Apple but maybe after the weekend. :)
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
we can discuss this next week : ) . Thanks for reporting the issue Chris.
Comment by Steve Yen [ 26/Nov/12 ]
we'll want separate, related bugs (tasks) for other platforms, too (windows, linux)
Comment by Jens Alfke [ 30/Nov/12 ]
We need to get a developer ID from Apple; this will give us some kind of cert, and a local private key for signing.
Then we need to figure out how to get that key and cert onto the build machine, in the Keychain of the account that runs the buildbot.
Comment by Farshid Ghods (Inactive) [ 02/Jan/13 ]
the instructions to build is available here :
https://github.com/couchbase/couchdbx-app
we need to add codesign as a build step there
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Phil,

do you have any update on this ticket. ?
Comment by Phil Labee [ 22/Jan/13 ]
I have signing cert installed on 10.17.21.150 (MacBuild).

Change to Makefile: http://review.couchbase.org/#/c/24149/
Comment by Phil Labee [ 23/Jan/13 ]
need to change master.cfg and pass env.var. to package-mac
Comment by Phil Labee [ 29/Jan/13 ]
disregard previous. Have added signing to Xcode projects.

see http://review.couchbase.org/#/c/24273/
Comment by Phil Labee [ 31/Jan/13 ]
To test this go to System Preferences / Security & Privacy, and on the General tab set "Allow applications downloaded from" to "Mac App Store and Identified Developers". Set this before running Couchbase Server.app the first time. Once an app has been allowed to run this setting is no longer checked for that app, and there doesn't seem to be a way to reset that.

What is odd is that on my system, I allowed one unsigned build to run before restricting the app run setting, and then no other unsigned builds would be checked (and would all be allowed to run). Either there is a flaw in my testing methodology, or a serious weakness in this security setting: just because one app called Couchbase Server was allowed to run shouldn't confer this privilege to other apps with the same name. A common malware tactic is to modify a trusted app and distribute it as an update, and if the security setting keys off the app name it will do nothing to prevent that.

I'm approving this change without having satisfactorily tested it.
Comment by Jens Alfke [ 31/Jan/13 ]
Strictly speaking it's not the app name but its bundle ID, i.e. "com.couchbase.CouchbaseServer" or whatever we use.

> I allowed one unsigned build to run before restricting the app run setting, and then no other unsigned builds would be checked

By OK'ing an unsigned app you're basically agreeing to toss security out the window, at least for that app. This feature is really just a workaround for older apps. By OK'ing the app you're not really saying "yes, I trust this build of this app" so much as "yes, I agree to run this app even though I don't trust it".

> A common malware tactic is to modify a trusted app and distribute it as update

If it's a trusted app it's hopefully been signed, so the user wouldn't have had to waive signature checking for it.
Comment by Jens Alfke [ 31/Jan/13 ]
Further thought: It might be a good idea to change the bundle ID in the new signed version of the app, because users of 2.0 with strict security settings have presumably already bypassed security on the unsigned version.
Comment by Jin Lim [ 04/Feb/13 ]
Per bug scrubs, keep this a blocker since customers ran into this issue (and originally reported it).
Comment by Phil Labee [ 06/Feb/13 ]
revert the change so that builds can complete. App is currently not being signed.
Comment by Farshid Ghods (Inactive) [ 11/Feb/13 ]
i suggest for 2.0.1 release we do this build manually.
Comment by Jin Lim [ 11/Feb/13 ]
As one-off fix, add the signature manually and automate the required steps later in 2.0.2 or beyond.
Comment by Jin Lim [ 13/Feb/13 ]
Please move this bug to 2.0.2 after populating the required signature manually. I am lowering the severity to critical as it is no longer a blocking issue.
Comment by Farshid Ghods (Inactive) [ 15/Feb/13 ]
Phil to upload the binary to latestbuilds , ( 2.0.1-101-rel.zip )
Comment by Phil Labee [ 15/Feb/13 ]
Please verify:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee [ 15/Feb/13 ]
uploaded:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip

I can rename it when uploading for release.
Comment by Farshid Ghods (Inactive) [ 17/Feb/13 ]
I still get the error that it is from an unidentified developer.

Comment by Phil Labee [ 18/Feb/13 ]
operator error.

I rebuilt the app, this time verifying that the codesign step occurred.

Uploaded new file to the same location:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee [ 26/Feb/13 ]
still need to perform manual workaround
Comment by Phil Labee [ 04/Mar/13 ]
release candidate has been uploaded to:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip
Comment by Wayne Siu [ 03/Apr/13 ]
Phil, looks like version 172/185 is still getting the error. My Mac version is 10.8.2
Comment by Thuan Nguyen [ 03/Apr/13 ]
Installed Couchbase Server (build 2.0.1-172, community version) on my Mac OS X 10.7.4; I only see the warning message.
Comment by Wayne Siu [ 03/Apr/13 ]
Latest version (04.03.13) : http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.zip
Comment by Maria McDuff (Inactive) [ 03/Apr/13 ]
works in 10.7 but not in 10.8.
if we can get the fix for 10.8 by tomorrow, end of day, QE is willing to test for release on tuesday, april 9.
Comment by Phil Labee [ 04/Apr/13 ]
The mac builds are not being automatically signed, so build 185 is not signed. The original 172 is also not signed.

Did you try

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip

to see if that was signed correctly?

Comment by Wayne Siu [ 04/Apr/13 ]
Phil,
Yes, we did try the 172-signed version. It works on 10.7 but not 10.8. Can you take a look?
Comment by Phil Labee [ 04/Apr/13 ]
I rebuilt 2.0.1-185 and uploaded a signed app to:

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.SIGNED.zip

Test on a machine that has never had Couchbase Server installed, and has the security setting to only allow Appstore or signed apps.

If you get the "Couchbase Server.app was downloaded from the internet" warning and you can click OK and install it, then this bug is fixed. The quarantining of files downloaded by a browser is part of the operating system and is not controlled by signing.
Comment by Wayne Siu [ 04/Apr/13 ]
Tried the 185-signed version (see attached screen shot). Same error message.
Comment by Phil Labee [ 04/Apr/13 ]
This is not an error message related to this bug.

Comment by Maria McDuff (Inactive) [ 14/May/13 ]
per bug triage, we need to have mac 10.8 osx working since it is a supported platform (published in the website).
Comment by Wayne Siu [ 29/May/13 ]
Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Anil Kumar [ 31/May/13 ]
We need to address the signing key for both Windows and Mac; deferring this to the next release.
Comment by Dipti Borkar [ 08/Aug/13 ]
Please let's make sure this is fixed in 2.2.
Comment by Phil Labee [ 16/Aug/13 ]
New keys will be created using new account.
Comment by Phil Labee [ 20/Aug/13 ]
iOS Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=iOS Distribution expires Aug 12, 2014

    ~buildbot/Desktop/appledeveloper.couchbase.com/certs/ios/ios_distribution_appledeveloper.couchbase.com.cer

Identifiers:
  App IDS:
    "Couchbase Server" id=com.couchbase.*

Provisining Profiles:
  Distribution:
    "appledeveloper.couchbase.com" type=Distribution

  ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/ios/appledevelopercouchbasecom.mobileprovision
Comment by Phil Labee [ 20/Aug/13 ]
Mac Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)
    "Couchbase, Inc." type=Developer ID installer (Aug,16,2014)
    "Couchbase, Inc." type=Developer ID Application (Aug,16,2014)
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)

     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developerID_installer.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developererID_application.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution-2.cer

Identifiers:
  App IDs:
    "Couchbase Server" id=couchbase.com.* Prefix=N2Q372V7W2
    "Coucbase Server adhoc" id=couchbase.com.* Prefix=N2Q372V7W2
    .

Provisioning Profiles:
  Distribution:
    "appstore.couchbase.com" type=Distribution
    "Couchbase Server adhoc" type=Distribution

     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/appstorecouchbasecom.privisioningprofile
     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/Couchbase_Server_adhoc.privisioningprofile

Comment by Phil Labee [ 21/Aug/13 ]

As of build 2.2.0-806 the app is signed by a new provisioning profile
Comment by Phil Labee [ 22/Aug/13 ]
 Install version 2.2.0-806 on a macosx 10.8 machine that has never had Couchbase Server installed, which has the security setting to require applications to be signed with a developer ID.
Comment by Phil Labee [ 22/Aug/13 ]
please assign to tester
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
just tried this against newest build 809:
still getting restriction message. see attached.
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
restriction still exists.
Comment by Maria McDuff (Inactive) [ 28/Aug/13 ]
verified in rc1 (build 817). still not fixed. getting same msg:
“Couchbase Server” can’t be opened because it is from an unidentified developer.
Your security preferences allow installation of only apps from the Mac App Store and identified developers.

Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Phil Labee [ 03/Sep/13 ]
Need to create new certificates to replace these that were revoked:

Certificate: Mac Development
Team Name: Couchbase, Inc.

Certificate: Mac Installer Distribution
Team Name: Couchbase, Inc.

Certificate: iOS Development
Team Name: Couchbase, Inc.

Certificate: iOS Distribution
Team Name: Couchbase, Inc.
Comment by Maria McDuff (Inactive) [ 18/Sep/13 ]
candidate for 2.2.1 bug fix release.
Comment by Dipti Borkar [ 28/Oct/13 ]
Is this going to make it into 2.5? We seem to keep deferring it.
Comment by Phil Labee [ 29/Oct/13 ]
cannot test changes with installer that fails
Comment by Phil Labee [ 11/Nov/13 ]
Installed certs as buildbot and signed app with "(recommended) 3rd Party Mac Developer Application", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-001.zip

Signed with "(Oct 30) 3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-002.zip

These zip files were made on the command line, not as a result of the make command. They are 2.5G in size, so they obviously include more than the zip files produced by the make command.

Both versions of the app appear to be signed correctly!

Note: cannot run make command from ssh session. Must Remote Desktop in and use terminal shell natively.
Comment by Phil Labee [ 11/Nov/13 ]
Finally, some progress: If the zip file is made using the --symlinks argument it appears to be un-signed. If the symlinked files are included, the app appears to be signed correctly.

The zip file with symlinks is 60M, while the zip file with copies of the files is 2.5G, more than 40X the size.
Comment by Phil Labee [ 25/Nov/13 ]
Fixed in 2.5.0-950
Comment by Dipti Borkar [ 25/Nov/13 ]
Maria, can QE please verify this?
Comment by Wayne Siu [ 28/Nov/13 ]
Tested with build 2.5.0-950. Still see the warning box (attached).
Comment by Wayne Siu [ 19/Dec/13 ]
Phil,
Can you give an update on this?
Comment by Ashvinder Singh [ 14/Jan/14 ]
I tested the code signature with apple utility "spctl -a -v /Applications/Couchbase\ Server.app/" and got the output :
>>> /Applications/Couchbase Server.app/: a sealed resource is missing or invalid

also tried running the command:
 
bash: codesign -dvvvv /Applications/Couchbase\ Server.app
>>>
Executable=/Applications/Couchbase Server.app/Contents/MacOS/Couchbase Server
Identifier=com.couchbase.couchbase-server
Format=bundle with Mach-O thin (x86_64)
CodeDirectory v=20100 size=639 flags=0x0(none) hashes=23+5 location=embedded
Hash type=sha1 size=20
CDHash=868e4659f4511facdf175b44a950b487fa790dc4
Signature size=4355
Authority=3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)
Authority=Apple Worldwide Developer Relations Certification Authority
Authority=Apple Root CA
Signed Time=Jan 8, 2014, 10:59:16 AM
Info.plist entries=31
Sealed Resources version=1 rules=4 files=5723
Internal requirements count=1 size=216

It looks like the code signature is present but became invalid as new files were added/modified in the project. I suggest the build team rebuild and add the code signature again.
Comment by Phil Labee [ 17/Apr/14 ]
need VM to clone for developer experimentation
Comment by Anil Kumar [ 18/Jul/14 ]
Any update on this? We need this for 3.0.0 GA.

Please update the ticket.

Triage - July 18th




[MB-6972] distribute couchbase-server through yum and ubuntu package repositories Created: 19/Oct/12  Updated: 30/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Farshid Ghods (Inactive) Assignee: Phil Labee
Resolution: Unresolved Votes: 3
Labels: devX
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks MB-8693 [Doc when ready] distribute couchbase... Reopened
blocks MB-7821 yum install couchbase-server from cou... Resolved
Duplicate
duplicates MB-2299 Create signed RPM's Resolved
is duplicated by MB-9409 repository for deb packages (debian&u... Resolved
Flagged:
Release Note

 Description   
This helps us in handling dependencies that are needed for Couchbase Server.
The SDK team has already implemented this for various SDK packages.

we might have to make some changes to our packaging metadata to work with this schema

 Comments   
Comment by Steve Yen [ 26/Nov/12 ]
to 2.0.2 per bug-scrub

first step is do the repositories?
Comment by Steve Yen [ 26/Nov/12 ]
back to 2.01, per bug-scrub
Comment by Steve Yen [ 26/Nov/12 ]
back to 2.01, per bug-scrub
Comment by Farshid Ghods (Inactive) [ 19/Dec/12 ]
Phil,
please sync up with Farshid and get instructions that Sergey and Pavel sent
Comment by Farshid Ghods (Inactive) [ 28/Jan/13 ]
we should resolve this task once 2.0.1 is released .
Comment by Dipti Borkar [ 29/Jan/13 ]
Have we figured out the upgrade process moving forward? For example, from 2.0.1 to 2.0.2, or 2.0.1 to 2.1?
Comment by Jin Lim [ 04/Feb/13 ]
Please ensure that we also confirm/validate the upgrade process moving from 2.0.1 to 2.0.2. Thanks.
Comment by Phil Labee [ 06/Feb/13 ]
Now have DEB repo working, but another issue has come up: We need to distribute the public key so that users can install the key before running apt-get.

wiki page has been updated.
Comment by kzeller [ 14/Feb/13 ]
Added to 2.0.1 RN as:

Fix:

We now provide Couchbase Server via yum and Debian package repositories.
Comment by Matt Ingenthron [ 09/Apr/13 ]
What are the public URLs for these repositories? This was mentioned in the release notes here:
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-server-rn_2-0-0l.html
Comment by Matt Ingenthron [ 09/Apr/13 ]
Reopening, since this isn't documented that I can find. Apologies if I'm just missing it.
Comment by Dipti Borkar [ 23/Apr/13 ]
Anil, can you work with Phil to see what are the next steps here?
Comment by Anil Kumar [ 24/Apr/13 ]
Yes I'll be having discussion with Phil and will update here with details.
Comment by Tim Ray [ 28/Apr/13 ]
could we either remove the note about yum/deb repo's in the release notes or get those repo locations / sample files / keys added to public pages? The only links that seem that they 'might' contain the info point to internal pages I don't have access to.
Comment by Anil Kumar [ 14/May/13 ]
thanks Tim, we have removed it from release notes. we will add instructions about yum/deb repo's locations/files/keys to documentation once its available. thanks!
Comment by kzeller [ 14/May/13 ]
Removing duplicate ticket:

http://www.couchbase.com/issues/browse/MB-7860
Comment by h0nIg [ 24/Oct/13 ]
any update? maybe i created a duplicate issue: http://www.couchbase.com/issues/browse/MB-9409 but it seems that the repositories are outdated on http://hub.internal.couchbase.com/confluence/display/CR/How+to+Use+a+Linux+Repo+--+debian
Comment by Sriram Melkote [ 22/Apr/14 ]
I tried to install on Debian today. It failed badly. One .deb package didn't match the libc version of stable. The other didn't match the openssl version. Changing libc or openssl is simply not an option for someone using Debian stable because it messes with the base OS too deeply. So as of 4/23/14, we don't have support for Debian.
Comment by Sriram Melkote [ 22/Apr/14 ]
Anil, we have accumulated a lot of input in this bug. I don't think this will realistically go anywhere for 3.0 unless we define specific goals and some considered platform support matrix expansion. Can you please create a goal for 3.0 more precisely?
Comment by Matt Ingenthron [ 22/Apr/14 ]
+1 on Siri's comments. Conversations I had with both Ubuntu (who recommend their PPAs) and Red Hat experts (who recommend setting up a repo or getting into EPEL or the like) indicated that's the best way to ensure coverage of all OSs. Binary packages built on one OS and deployed on another are risky, run into dependency issues.
Comment by Anil Kumar [ 28/Apr/14 ]
This ticket is specifically for distributing DEB and RPM packages through YUM and APT repositories. We have another ticket for supporting the Debian platform: MB-10960.
Comment by Anil Kumar [ 23/Jun/14 ]
Assigning ticket to Tony for verification.
Comment by Phil Labee [ 21/Jul/14 ]
Need to do before closing:

[ ] capture keys and process used for build that is currently posted (3.0.0-628), update tools and keys of record in build repo and wiki page
[ ] distribute 2.5.1 and 3.0.0-beta1 builds using same process, testing update capability
[ ] test update from 2.0.0 to 2.5.1 to 3.0.0
Comment by Phil Labee [ 21/Jul/14 ]
re-opening to assign to sprint to prepare the distribution repos for testing
Comment by Wayne Siu [ 30/Jul/14 ]
Phil,
has build 3.0.0-973 been updated in the repos for beta testing?




[MB-11675] 40-50% performance degradation on append-heavy workload compared to 2.5.1 Created: 09/Jul/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Dave Rigby Assignee: Dave Rigby
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: OS X Mavericks 10.9.3
CB server 3.0.0-918 (http://packages.northscale.com/latestbuilds/3.0.0/couchbase-server-enterprise_x86_64_3.0.0-918-rel.zip)
Haswell MacBook Pro (16GB RAM)

Attachments: PNG File CB 2.5.1 revAB_sim.png     PNG File CB 3.0.0-918 revAB_sim.png     JPEG File ep.251.jpg     JPEG File ep.300.jpg     JPEG File epso.251.jpg     JPEG File epso.300.jpg     Zip Archive MB-11675.trace.zip     Zip Archive perf_report_result.zip     Zip Archive revAB_sim_v2.zip     Zip Archive revAB_sim.zip    
Issue Links:
Relates to
relates to MB-11642 Intra-replication falling far behind ... Closed
relates to MB-11623 test for performance regressions with... In Progress

 Description   
When running an append-heavy workload (modelling a social network address book, see below) the performance of CB has dropped from ~100K ops down to 50K ops compared to 2.5.1-1083 on OS X.

Edit: I see a similar (but slightly smaller, around 40%) degradation on Linux (Ubuntu 14.04) - see comment below for details.

== Workload ==

revAB_sim - generates a model social network, then builds a representation of this in Couchbase. Keys are a set of phone numbers, values are lists of phone books which contain that phone number. (See attachment).

Configured for 8 client threads, 100,000 people (documents).

To run:

* pip install networkx
* Check revAB_sim.py for correct host, port, etc
* time ./revAB_sim.py
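
(For context, a stripped-down sketch of the access pattern, not the attached revAB_sim.py: networkx builds a scale-free "social network" and every edge turns into two appends, one per direction. The client calls assume the 1.x Python SDK's Couchbase.connect()/append(); treat that exact API, and the single-threaded loop, as simplifications.)

    import networkx as nx
    from couchbase import Couchbase, FMT_UTF8
    from couchbase.exceptions import NotFoundError

    cb = Couchbase.connect(bucket='default', host='localhost')
    graph = nx.barabasi_albert_graph(100000, 10)   # stand-in for the social graph

    # Key = a "phone number" (node id), value = the list of phone books that
    # contain it, so each edge appends the owner onto the contact's document.
    for a, b in graph.edges():
        for key, owner in ((str(a), str(b)), (str(b), str(a))):
            try:
                cb.append(key, owner + ';', format=FMT_UTF8)
            except NotFoundError:
                cb.set(key, owner + ';', format=FMT_UTF8)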

== Cluster ==

1 node, default bucket set to 1024MB quota.

== Runtimes for workload to complete ==


## CB-2.5.1-1083:

~107K op/s. Timings for workload (3 samples):

real 2m28.536s
real 2m28.820s
real 2m31.586s


## CB-3.0.0-918

~54K op/s. Timings for workload:

real 5m23.728s
real 5m22.129s
real 5m24.947s


 Comments   
Comment by Pavel Paulau [ 09/Jul/14 ]
I'm just curious, what does consume all CPU resources?
Comment by Dave Rigby [ 09/Jul/14 ]
I haven't had a chance to profile it yet; certainly in both instances (fast / slow) the CPU is at 100% between the client workload and the server.
Comment by Pavel Paulau [ 09/Jul/14 ]
Is memcached top consumer? or beam.smp? or client?
Comment by Dave Rigby [ 09/Jul/14 ]
memcached highest (as expected). From the 3.0.0 package (which I still have installed):

PID COMMAND %CPU TIME #TH #WQ #PORT #MREG MEM RPRVT PURG CMPRS VPRVT VSIZE PGRP PPID STATE UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH CSW
34046 memcached 476.9 01:34.84 17/7 0 36 419 278M+ 277M+ 0B 0B 348M 2742M 34046 33801 running 501 73397+ 160 67 26 13304643+ 879+ 4070244+
34326 Python 93.4 00:18.57 9/1 0 25 418 293M+ 293M+ 0B 0B 386M 2755M 34326 1366 running 501 77745+ 399 70 28 15441263+ 629 5754198+
0 kernel_task 71.8 00:14.29 95/9 0 2 949 1174M+ 30M 0B 0B 295M 15G 0 0 running 0 42409 0 57335763+ 52435352+ 0 0 278127194+
...
32800 beam.smp 8.5 00:05.61 30/4 0 49 330 155M- 152M- 0B 0B 345M- 2748M- 32800 32793 running 501 255057+ 468 149 30 6824071+ 1862753+ 1623911+


Python is the workload generator.

I shall try to collect an Instruments profile of 3.0 and 2.5.1 to compare...
Comment by Dave Rigby [ 09/Jul/14 ]
Instruments profile of two runs:

Run 1: 3.0.0 (slow)
Run 2: 2.5.1 (fast)

I can look into the differences tomorrow if no-one else gets there first.


Comment by Dave Rigby [ 10/Jul/14 ]
Running on Linux (Ubuntu 14.04), 24 core Xeon, I see a similar effect, but the magnitude is not as bad - 40% performance drop.

100,000 documents with 4 worker threads, same bucket size (1024MB). (Note: worker threads were dropped to 4 as I couldn't get the Python SDK to reliably connect with 8 threads at the same time).

## CB-3.0.0 (source build):

    83k op/s
    real 3m26.785s

## CB-2.5.1 (source build):

    133K op/s
    real 2m4.276s


Edit: Attached updated zip file as: revAB_sim_v2.zip
Comment by Dave Rigby [ 10/Jul/14 ]
Attaching the output of `perf report` for both 2.5.1 and 3.0.0 - perf_report_result.zip

There's nothing obvious jumping out at me, looks like quite a bit has changed between the two in ep_engine.
Comment by Dave Rigby [ 11/Jul/14 ]
I'm tempted to bump this to "blocker" considering it also affects Linux - any thoughts?
Comment by Pavel Paulau [ 11/Jul/14 ]
It's a product/release blocker, no doubt.

(though raising priority at this point will not move ticket to the top of the backlog due to other issues)
Comment by Dave Rigby [ 11/Jul/14 ]
@Pavel done :)
Comment by Abhinav Dangeti [ 11/Jul/14 ]
I think I should bring to people's notice that in 3.0 JSON detection has been moved to before items are set in memory. This could very well be the cause of this regression (previously we did do this JSON check, but only just before persistence).
This was part of the datatype-related change, now required by UPR.
A HELLO protocol was newly introduced in 3.0, which clients can invoke, thereby letting the server know that they will be setting the datatype themselves, in which case this JSON check wouldn't take place.
If a client doesn't invoke the HELLO command, then we do JSON detection to set the datatype correctly.

However, HELLO was recently disabled as we weren't ready to handle compressed documents in the view engine. This means we do a mandatory JSON check for every store operation, before the document is even set in memory.
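
(A rough illustration of the decision being described, not server code; the helper name is made up.)

    import json

    def datatype_for(value, hello_datatype_negotiated, client_datatype):
        # When the client negotiated datatype via HELLO, the server trusts the
        # datatype flag it sends and skips sniffing.
        if hello_datatype_negotiated:
            return client_datatype
        # With HELLO disabled (the current 3.0 state), every store operation
        # pays for a JSON parse before the item is even set in memory.
        try:
            json.loads(value)
            return 'json'
        except ValueError:
            return 'raw'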
Comment by Cihan Biyikoglu [ 11/Jul/14 ]
Thanks Abhinav. Can we quickly test whether this is indeed the cause, and if it is proven, revert this change?
Comment by David Liao [ 14/Jul/14 ]
I tried testing using the provided scripts with and without the json checking logic and there is no difference (on Mac and Ubuntu).

The total size of data is less than 200 MB with 100K items, it's about <2K per item which is not very big.
Comment by David Liao [ 15/Jul/14 ]
There might be an issue with general disk operation. I tested the set and it shows the same performance difference as append.
Pavel, have you seen any 'set' performance drop with 3.0? There is no rebalance involved, just a single node in this test.
Comment by Pavel Paulau [ 16/Jul/14 ]
3.0 performs worse in CPU bound scenarios.
However Dave observed the same issue on system with 24 vCPU, which is kind of confusing to me.
Comment by Pavel Paulau [ 16/Jul/14 ]
Meanwhile I tried that script in my environment. I see no difference between 2.5.1 and 3.0.

3.0.0-969: real 3m30.530s
2.5.1-1083: real 3m28.911s

Peak throughput is about 80K in both cases.

h/w configuration:

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

I used a standalone server as test client and regular packages.
Comment by Dave Rigby [ 16/Jul/14 ]
@Pavel: I was essentially maxing out the system, so that probably explains why even with 24 cores I could see the issue.
Comment by Pavel Paulau [ 16/Jul/14 ]
Does it mean that 83/133K ops/sec saturate system with 24 cores?
Comment by Dave Rigby [ 16/Jul/14 ]
@Pavel: yes (including the client workload generator which was running on the same machine). I could possibly push it higher by increasing the client worker threads, but as mentioned I had some python SDK connection issues then.

Comment by Pavel Paulau [ 16/Jul/14 ]
Weird, in my case CPU utilization was less than 500% (IRIX mode).
Comment by David Liao [ 16/Jul/14 ]
I am using a 4-core/4 GB ubuntu VM for the test.

3.0
real 11m16.530s
user 2m33.814s
sys 2m35.779s
<30k ops

2.5.1
real 7m6.843s
user 2m6.693s
sys 2m2.696s
40k ops


During today's test, I found out that the disk queue fill/drain rate of 3.0.0 is much smaller than 2.5.1's (<2k vs 30k). The CPU usage is ~8% higher too, but most of the increase is from system CPU usage (total CPU is almost maxed out on 3.0).

Pavel, can you check the disk queue fill/drain rate of your test and system vs user cpu usage?
Comment by Pavel Paulau [ 16/Jul/14 ]
David,

I will check disk stats tomorrow. In the meantime I would recommend you run the benchmark with persistence disabled.
Comment by Pavel Paulau [ 18/Jul/14 ]
In my case drain rate is higher in 2.5.1 (80K vs. 5K) but size of write queue and rate of actual disk creates/updates is pretty much the same.

CPU utilization is 2x higher in 3.0 (400% vs. 200%).

However I don't understand how this information helps.
Comment by David Liao [ 21/Jul/14 ]
The drain rate may not be accurate on 2.5.1.
 
'iostat' shows about 2x 'tps' and 'KB_wrtn/s' for 3.0.0 vs 2.5.1, so it indicates far more disk activity in 3.0.0.

We need to find out what the extra disk activity is. Since ep-engine issues "set" to couchstore, which then writes to disk, we should
benchmark couchstore separately to isolate the problem.

Pavel, is there a way to do a couchstore performance test?
Comment by Pavel Paulau [ 22/Jul/14 ]
Due to the increased number of flusher threads 3.0.0 persists data faster; that must explain the higher disk activity.

Once again, disabling disk persistence entirely will eliminate the "disk" factor (just as an experiment).

Also, I don't think we made any critical changes in couchstore, so I don't expect any regression there. Chiyoung may have some benchmarks.
Comment by David Liao [ 22/Jul/14 ]
I have played with different flusher threads but don't see any improvement in my own not-designed-for-serious-performance-testing environment.

Logically, if the flusher threads run faster, the total transfer to disk should finish in a shorter time. My observation is that the higher TPS lasted the entire testing period, which itself is much longer than on 2.5.1, which means the total disk activity (TPS and data written to disk) for the same amount of workload is much higher.

Do you mean using a memcached bucket when you say "disabling disk"? That test shows much less performance degradation, which means the majority of the problem is not in the memcached layer.

I am not familiar with the couchstore changes, but there are indeed quite a lot of them and I'm not sure who is responsible for that component. Still, it needs to be tested just like any other component.
Comment by Pavel Paulau [ 23/Jul/14 ]
I meant disabling persistence to disk in couchbase bucket. E.g., using cbepctl.
Comment by David Liao [ 23/Jul/14 ]
I disabled persistence with cbepctl and reran the tests and got the same performance degradation:

3.0.0:
real 6m3.988s
user 1m59.670s
sys 2m1.363s
ops: 50k

2.5.1
real 4m18.072s
user 1m45.940s
sys 1m39.775s
ops: 70k

So it's not the disk related operations that caused this.
Comment by David Liao [ 24/Jul/14 ]
Dave, what profiling tool did you use to collect the profiling data you attached?
Comment by Dave Rigby [ 24/Jul/14 ]
I used Linux perf - see for example http://www.brendangregg.com/perf.html
Comment by David Liao [ 25/Jul/14 ]
Attached perf report for ep.so 3.0.0.
Comment by David Liao [ 25/Jul/14 ]
Attached perf report for ep.so 2.5.1.
Comment by David Liao [ 25/Jul/14 ]
I attached memcached and ep.so CPU usage for both 3.0.0 and 2.5.1.

2.5.1 didn't use C++ atomics. I tested 3.0.0 without C++ atomics and saw the following improvement: ~20% difference.

Both with persistence disabled.

2.5.1
real 7m38.581s
user 2m11.771s
sys 2m27.968s
ops: 35k+

3.0.0
real 9m15.638s
user 2m31.642s
sys 2m56.154s
ops: ~30k

There could be multiple things that we still need to look at: the threading change in 3.0.0 (and thus figuring out the best number of threads for different workloads), and also why much more data is being written to disk in this workload.

I am using my laptop to do the perf testing, but this kind of test should be done in a dedicated/controlled testing environment.
So the perf team should try to test the following areas:
1. the C++ atomics change.
2. different threading configurations for different types of workload.
3. independent couchstore testing decoupled from ep-engine.

Comment by Pavel Paulau [ 26/Jul/14 ]
As I mentioned before, I don't see a difference between 2.5.1 and 3.0.0 using a "dedicated/controlled testing environment".

Anyways, thanks for your deep investigation. I will try to reproduce the issue on my laptop.

cc Thomas
Comment by Thomas Anderson [ 29/Jul/14 ]
In both cases (append-heavy workloads where sets are > 50K ops), performance degradation is seen in early 3.0.0 builds. Collateral symptoms were 1) an increase in bytes written to store/disk, approximately 20%, 2) increased frequency of bucket compression (the log shows the bucket ranges being compressed overlap), 3) drop-off of OPS over time.
Starting with build 3.0.0-1037, these performance metrics are generally aligned with/equivalent to 2.5.1: 1) frequency of bucket compression reduced, 2) expansion of bytes written reduced to almost 1-1, 3) OPS contention/slowdown does not occur.

The test is 10 concurrent loaders, 1024-byte documents (JSON or not-JSON), averaging ~80K OPS.
Comment by Dave Rigby [ 30/Jul/14 ]
TL;DR: 3.0 release debs appear to be built *without* optimisation (!)

On a hunch I thought I'd see how we are building 3.0.0, as it seemed a little surprising we saw symbols for C++ atomics as I would have expected them to be inlined. Looking at the build log [1], I see we are building the .deb package as Debug, without optimisation:

    (cd build && cmake -G "Unix Makefiles" -D CMAKE_INSTALL_PREFIX="/opt/couchbase" -D CMAKE_PREFIX_PATH=";/opt/couchbase" -D PRODUCT_VERSION=3.0.0-1059-rel -D BUILD_ENTERPRISE=TRUE -D CMAKE_BUILD_TYPE=Debug -D CB_DOWNLOAD_DEPS=1 ..)

Note: CMAKE_BUILD_TYPE=****Debug****

From my local Ubuntu build, I see that CXX flags are set to the following for each of Debug / Release / RelWithDebInfo:

    CMAKE_CXX_FLAGS_DEBUG:STRING=-g
    CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG
    CMAKE_CXX_FLAGS_RELEASE:STRING=-O3 -DNDEBUG

For comparision I checked the latest 2.5.1 build [2] (which may not be the same as the last 2.5.1 release) and I see we *did* compile that with -O3 - for example:

    libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -pipe -I./include -DHAVE_VISIBILITY=1 -fvisibility=hidden -I./src -I./include -I/opt/couchbase/include -pipe -O3 -O3 -ggdb3 -MT src/ep_la-ep_engine.lo -MD -MP -MF src/.deps/ep_la-ep_engine.Tpo -c src/ep_engine.cc -fPIC -DPIC -o src/.libs/ep_la-ep_engine.o


If someone from build / infrastructure could confirm that would be great, but all the evidence suggests we are building our release packages with no optimisation (!!)

I believe the solution here is to change the invocation of cmake to set CMAKE_BUILD_TYPE=Release.


[1]: http://builds.hq.northscale.net:8010/builders/ubuntu-1204-x64-300-builder/builds/1100/steps/couchbase-server%20make%20enterprise%20/logs/stdio
[2]: http://builds.hq.northscale.net:8010/builders/ubuntu-1204-x64-251-builder/builds/38/steps/couchbase-server%20make%20enterprise%20/logs/stdio
Comment by Dave Rigby [ 30/Jul/14 ]
Just checked RHEL - I see the same.

3.0:

    (cd build && cmake -G "Unix Makefiles" <cut> -D CMAKE_BUILD_TYPE=Debug <cut>

    Full logs: http://builds.hq.northscale.net:8010/builders/centos-6-x64-300-builder/builds/1095/steps/couchbase-server%20make%20enterprise%20/logs/stdio


2.5.1:

    libtool: compile: g++ <cut> -O3 -c src/ep_engine.cc -o src/.libs/ep_la-ep_engine.o
    
    Full logs: http://builds.hq.northscale.net:8010/builders/centos-6-x64-251-builder/builds/42/steps/couchbase-server%20make%20enterprise%20/logs/stdio


Comment by Dave Rigby [ 30/Jul/14 ]
I've separated the "packages built as debug" problem out into its own defect (MB-11854).
Comment by Sundar Sridharan [ 30/Jul/14 ]
Dave, I am unable to verify this at this time. Could you please let me know if you still see this issue on builds with optimizations enabled? Thanks.
Comment by Chiyoung Seo [ 30/Jul/14 ]
Thanks Dave for identifying the build issue.

Enabling "-O3" optimization will make a huge difference in the performance. We should set CMAKE_BUILD_TYPE=Release in 3.0 builds for a fair comparison.
Comment by Chris Hillery [ 30/Jul/14 ]
For those not watching CBD-1422 (nee MB-11854), I have pushed a fix for master and am verifying. I will update this bug when there is a 3.0 Release-mode build, hopefully later tonight.
Comment by Chris Hillery [ 31/Jul/14 ]
3.0.0 build 1068 is being built in Release mode. Should be uploaded in the next half-hour or so.




[MB-10440] something isn't right with tcmalloc in build 1074 on at least rhel6 causing memcached to crash Created: 11/Mar/14  Updated: 31/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aleksey Kondratenko Assignee: Phil Labee
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Relates to
relates to MB-10371 tcmalloc must be compiled with -DTCMA... In Progress
relates to MB-10439 Upgrade:: 2.5.0-1059 to 2.5.1-1074 =>... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
SUBJ.

Just installing the latest 2.5.1 build on RHEL 6 and creating a bucket caused a segmentation fault (see also MB-10439).

When replacing tcmalloc with a copy I've built myself, it works.

Cannot be 100% sure it's tcmalloc, but the crash looks too easily reproducible to be something else.


 Comments   
Comment by Wayne Siu [ 12/Mar/14 ]
Phil,
Can you review whether this change (copied from MB-10371) has been applied properly?

voltron (2.5.1) commit: 73125ad66996d34e94f0f1e5892391a633c34d3f

    http://review.couchbase.org/#/c/34344/

passes "CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW" to each gprertools configure command
Comment by Andrei Baranouski [ 12/Mar/14 ]
see the same issue on centos 64
Comment by Phil Labee [ 12/Mar/14 ]
need more info:

1. What package did you install?

2. How did you build the tcmalloc which fixes the problem?
 
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
build 1740. Rhel6 package.

You can see yourself. It's easily reproducible as Andrei also confirmed too.

I've got 2.1 tar.gz from googlecode. And then did ./configure --prefix=/opt/couchbase --enable-minimal CPPFLAGS='-DTCMALLOC_SMALL_BUT_SLOW' and then make and make install. After that it works. Have no idea why.

Do you know exact CFLAGS and CXXFLAGS that are used to build our tcmalloc ? Those variables are likely set in voltron (or even from outside of voltron) and might affect optimization and therefore expose some bugs.

Comment by Aleksey Kondratenko [ 12/Mar/14 ]
And 64 bit.
Comment by Phil Labee [ 12/Mar/14 ]
We build out of:

    https://github.com/couchbase/gperftools

and for 2.5.1 use commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

compile using:

(cd /home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools \
&& ./autogen.sh \
        && ./configure --prefix=/opt/couchbase CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW --enable-minimal \
        && make \
        && make install-exec-am install-data-am)
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
That part I know. What I don't know is what cflags are being used.
Comment by Phil Labee [ 13/Mar/14 ]
from the 2.5.1 centos-6-x86 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x86-251-builder/builds/18/steps/couchbase-server%20make%20enterprise%20/logs/stdio

make[1]: Entering directory `/home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools'

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Phil Labee [ 13/Mar/14 ]
from a 2.5.1 centos-6-x64 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x64-251-builder/builds/16/steps/couchbase-server%20make%20enterprise%20/logs/stdio

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Ok. I'll try to exclude -O3 as a possible reason for the failure later today (in which case it might be an upstream bug). In the meantime I suggest you try lowering optimization to -O2. Unless you have other ideas, of course.
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Building tcmalloc with the exact same cflags (-O3) doesn't cause any crashes. At this time my guess is either a compiler bug or cosmic radiation hitting just this specific build.

Can we simply force a rebuild?
Comment by Phil Labee [ 13/Mar/14 ]
test with newer build 2.5.1-1075:

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_2.5.1-1075-rel.rpm

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_64_2.5.1-1075-rel.rpm
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Didn't help unfortunately. Is that still with -O3 ?
Comment by Phil Labee [ 14/Mar/14 ]
still using -O3. There are extensive comments in the voltron Makefile warning against changing to -O2
Comment by Phil Labee [ 14/Mar/14 ]
Did you try to build gperftools out of our repo?
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
The following is not true:

Got myself centos 6.4. And with it's gcc and -O3 I'm finally able to reproduce issue.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
So I've got myself CentOS 6.4 and the _exact same compiler version_. When I build tcmalloc myself with all the right flags and replace the tcmalloc from the package, it works. Without replacing it, it crashes.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Phil, please clean ccache, reboot the builder host (to clean the page cache) and _then_ do another rebuild. Looking at the build logs it looks like ccache is being used, so my suspicion about RAM corruption is not fully excluded yet. And I don't have many other ideas.
Comment by Phil Labee [ 14/Mar/14 ]
cleared ccache and restarted centos-6-x86-builder, centos-6-x64-builder

started build 2.5.1-1076
Comment by Pavel Paulau [ 14/Mar/14 ]
2.5.1-1076 seems to be working, it warns about "SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER" as well.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Maybe I'm doing something wrong but it fails in exact same way on my VM
Comment by Pavel Paulau [ 14/Mar/14 ]
Sorry, it crashed eventually.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Confirmed again. Everything is exactly the same as before. Build 1076 on CentOS 6.4 amd64 crashes very easily, both enterprise edition and community. And it doesn't crash if I replace tcmalloc with the one I've built, which is the exact same source, exact same flags and exact same compiler version.

Build 1071 doesn't crash. All of this 100% consistently.
Comment by Phil Labee [ 17/Mar/14 ]
possibly a difference in build environment

reference env is described in voltron README.md file

for centos-6 X64 (6.4 final) we use the defaults for these tools:


gcc-4.4.7-3.el6 ( 4.4.7-4 available)
gcc-c++-4.4.7-3 ( 4.4.7-4 available)
kernel-devel-2.6.32-358 ( 2.6.32-431.5.1 available)
openssl-devel-1.0.0-27.el6_4.2 ( 1.0.1e-16.el6_5.4 available)
rpm-build-4.8.0-32 ( 4.8.0-37 available)

these tools do not have an update:

scons-2.0.1-1
libtool-2.2.6-15.5

For all centos these specific versions are installed:

gcc, g++ 4.4, currently 4.4.7-3, 4.4.7-4 available
autoconf 2.65, currently 2.63-5 (no update available)
automake 1.11.1
libtool 2.4.2
Comment by Phil Labee [ 17/Mar/14 ]
downloaded gperftools-2.1.tar.gz from

    http://gperftools.googlecode.com/files/gperftools-2.1.tar.gz

and expanded into directory: gperftools-2.1

cloned https://github.com/couchbase/gperftools.git at commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

into directory gperftools, and compared:

=> diff -r gperftools-2.1 gperftools
Only in gperftools: .git
Only in gperftools: autogen.sh
Only in gperftools/doc: pprof.see_also
Only in gperftools/src/windows: TODO
Only in gperftools/src/windows: google

Only in gperftools-2.1: Makefile.in
Only in gperftools-2.1: aclocal.m4
Only in gperftools-2.1: compile
Only in gperftools-2.1: config.guess
Only in gperftools-2.1: config.sub
Only in gperftools-2.1: configure
Only in gperftools-2.1: depcomp
Only in gperftools-2.1: install-sh
Only in gperftools-2.1: libtool
Only in gperftools-2.1: ltmain.sh
Only in gperftools-2.1/m4: libtool.m4
Only in gperftools-2.1/m4: ltoptions.m4
Only in gperftools-2.1/m4: ltsugar.m4
Only in gperftools-2.1/m4: ltversion.m4
Only in gperftools-2.1/m4: lt~obsolete.m4
Only in gperftools-2.1: missing
Only in gperftools-2.1/src: config.h.in
Only in gperftools-2.1: test-driver
Comment by Phil Labee [ 17/Mar/14 ]
Since the build files in your source are different from those in the production build, we can't really say we're using the same source.

Please build from our repo and re-try your test.
Comment by Aleksey Kondratenko [ 17/Mar/14 ]
The difference is in the autotools products. I _cannot_ build using the same autotools that are present on the build machine unless I'm given access to that box.
Comment by Aleksey Kondratenko [ 17/Mar/14 ]
The _source_ is exactly the same.
Comment by Phil Labee [ 17/Mar/14 ]
I've given the versions of autotools to use, so you can bring your build environment in line with the production builds.

As a shortcut, I've submitted a request for a clone of the builder VM that you can experiment with.

See CBIT-1053
Comment by Wayne Siu [ 17/Mar/14 ]
The cloned builder is available. Info in CBIT-1053.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built tcmalloc from the exact copy in the builder directory.

Installed the package from inside the builder directory (build 1077). Verified that the problem exists. Stopped the service. Replaced tcmalloc. Observed that everything is fine.

Something in the environment is causing this, maybe unusual LDFLAGS or something else. But _not_ the source.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built the full RPM package under the buildbot user, with the exact same make invocation as I see in the buildbot logs. The resultant package works. Weird indeed.
Comment by Phil Labee [ 18/Mar/14 ]
some differences between test build and production build:


1) In gperftools, production calls "make install-exec-am install-data-am" while the test calls "make install", which executes the extra step "all-am"

2) In ep-engine, production uses "make install" while the test uses "make"

3) The test was built as user "root" while production builds as user "buildbot", so PATH and other environment variables may be different.

In general it's hard to tell what steps were performed for the test build, as no output logfiles have been captured.
Comment by Wayne Siu [ 21/Mar/14 ]
Updated from Phil:
comment:
________________________________________

2.5.1-1082 was done without the tcmalloc flag: CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW

    http://review.couchbase.org/#/c/34755/


2.5.1-1083 was done with build step timeout increased from 60 minutes to 90

2.5.1-1084 was done with the tcmalloc flag restored:

    http://review.couchbase.org/#/c/34792/
Comment by Andrei Baranouski [ 23/Mar/14 ]
 2.5.1-1082 MB-10545 Vbucket map is not ready after 60 seconds
Comment by Meenakshi Goel [ 24/Mar/14 ]
A memcached crash with a segmentation fault is observed with build 2.5.1-1084-rel on Ubuntu 12.04 during the Auto Compaction tests.

Jenkins Link:
http://qa.sc.couchbase.com/view/2.5.1%20centos/job/centos_x64--00_02--compaction_tests-P0/56/consoleFull

root@jackfruit-s12206:/tmp# gdb /opt/couchbase/bin/memcached core.memcached.8276
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /opt/couchbase/bin/memcached...done.
[New LWP 8301]
[New LWP 8302]
[New LWP 8599]
[New LWP 8303]
[New LWP 8604]
[New LWP 8299]
[New LWP 8601]
[New LWP 8600]
[New LWP 8602]
[New LWP 8287]
[New LWP 8285]
[New LWP 8300]
[New LWP 8276]
[New LWP 8516]
[New LWP 8603]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
Program terminated with signal 11, Segmentation fault.
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
298 src/central_freelist.cc: No such file or directory.
(gdb) t a a bt

Thread 15 (Thread 0x7f3568039700 (LWP 8603)):
#0 0x00007f356f01b9fa in __lll_unlock_wake () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f018104 in _L_unlock_644 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f018063 in pthread_mutex_unlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c663d6 in Mutex::release (this=0x5f68250) at src/mutex.cc:94
#4 0x00007f3569c9691f in unlock (this=<optimized out>) at src/locks.hh:58
#5 ~LockHolder (this=<optimized out>, __in_chrg=<optimized out>) at src/locks.hh:41
#6 fireStateChange (to=<optimized out>, from=<optimized out>, this=<optimized out>) at src/warmup.cc:707
#7 transition (force=<optimized out>, to=<optimized out>, this=<optimized out>) at src/warmup.cc:685
#8 Warmup::initialize (this=<optimized out>) at src/warmup.cc:413
#9 0x00007f3569c97f75 in Warmup::step (this=0x5f68258, d=..., t=...) at src/warmup.cc:651
#10 0x00007f3569c2644a in Dispatcher::run (this=0x5e7f180) at src/dispatcher.cc:184
#11 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5f68258) at src/dispatcher.cc:28
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 14 (Thread 0x7f356a705700 (LWP 8516)):
#0 0x00007f356ed0d83d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ed3b774 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f3569c65445 in updateStatsThread (arg=<optimized out>) at src/memory_tracker.cc:31
#3 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 13 (Thread 0x7f35703e8740 (LWP 8276)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e000, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e000, flags=<optimized out>) at event.c:1558
#3 0x000000000040c9e6 in main (argc=<optimized out>, argv=<optimized out>) at daemon/memcached.c:7996

Thread 12 (Thread 0x7f356c709700 (LWP 8300)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e280, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e280, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16814f8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 11 (Thread 0x7f356e534700 (LWP 8285)):
#0 0x00007f356ed348bd in read () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ecc8ff8 in _IO_file_underflow () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f356ecca03e in _IO_default_uflow () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f356ecbe18a in _IO_getline_info () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f356ecbd06b in fgets () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f356e535b19 in fgets (__stream=<optimized out>, __n=<optimized out>, __s=<optimized out>) at /usr/include/bits/stdio2.h:255
#6 check_stdin_thread (arg=<optimized out>) at extensions/daemon/stdin_check.c:37
#7 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 10 (Thread 0x7f356d918700 (LWP 8287)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356db32176 in logger_thead_main (arg=<optimized out>) at extensions/loggers/file_logger.c:368
#2 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()

Thread 9 (Thread 0x7f3567037700 (LWP 8602)):
#0 SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:32
#1 0x00007f3569c6351c in lock (this=<optimized out>) at src/atomic.hh:282
#2 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#3 gimme (this=<optimized out>) at src/atomic.hh:396
#4 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#5 KVShard::getBucket (this=0x7a6e7c0, id=256) at src/kvshard.cc:58
#6 0x00007f3569c9231d in VBucketMap::getBucket (this=0x614a448, id=256) at src/vbucketmap.cc:40
#7 0x00007f3569c314ef in EventuallyPersistentStore::getVBucket (this=<optimized out>, vbid=256, wanted_state=<optimized out>) at src/ep.cc:475
#8 0x00007f3569c315f6 in EventuallyPersistentStore::firePendingVBucketOps (this=0x614a400) at src/ep.cc:488
#9 0x00007f3569c41bb1 in EventuallyPersistentEngine::notifyPendingConnections (this=0x5eb8a00) at src/ep_engine.cc:3474
#10 0x00007f3569c41d63 in EvpNotifyPendingConns (arg=0x5eb8a00) at src/ep_engine.cc:1182
#11 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f3565834700 (LWP 8600)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7e1c0) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7e204) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 7 (Thread 0x7f3566035700 (LWP 8601)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7fa40) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7fa84) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f356cf0a700 (LWP 8299)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e500, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e500, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x1681400) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f3567838700 (LWP 8604)):
#0 0x00007f356f01b89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f017065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f016eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c6635a in Mutex::acquire (this=0x5e7f890) at src/mutex.cc:79
#4 0x00007f3569c261f8 in lock (this=<optimized out>) at src/locks.hh:48
#5 LockHolder (m=..., this=<optimized out>) at src/locks.hh:26
#6 Dispatcher::run (this=0x5e7f880) at src/dispatcher.cc:138
#7 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5e7f898) at src/dispatcher.cc:28
#8 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f356af06700 (LWP 8303)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e780, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e780, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16817e0) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f3565033700 (LWP 8599)):
#0 0x00007f356ed18267 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f3569c13997 in SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:35
#2 0x00007f3569c63e57 in lock (this=<optimized out>) at src/atomic.hh:282
#3 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#4 gimme (this=<optimized out>) at src/atomic.hh:396
#5 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#6 KVShard::getVBucketsSortedByState (this=0x7a6e7c0) at src/kvshard.cc:75
#7 0x00007f3569c5d494 in Flusher::getNextVb (this=0x168d040) at src/flusher.cc:232
#8 0x00007f3569c5da0d in doFlush (this=<optimized out>) at src/flusher.cc:211
#9 Flusher::step (this=0x5ff7010, tid=21) at src/flusher.cc:152
#10 0x00007f3569c69034 in ExecutorThread::run (this=0x5e7e8c0) at src/scheduler.cc:159
#11 0x00007f3569c6963d in launch_executor_thread (arg=0x5ff7010) at src/scheduler.cc:36
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f356b707700 (LWP 8302)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8ea00, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8ea00, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16816e8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f356bf08700 (LWP 8301)):
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
#1 0x00007f356f23ef19 in tcmalloc::CentralFreeList::FetchFromSpansSafe (this=0x7f356f45d780) at src/central_freelist.cc:283
#2 0x00007f356f23efb7 in tcmalloc::CentralFreeList::RemoveRange (this=0x7f356f45d780, start=0x7f356bf07268, end=0x7f356bf07260, N=4) at src/central_freelist.cc:263
#3 0x00007f356f2430b5 in tcmalloc::ThreadCache::FetchFromCentralCache (this=0xf5d298, cl=9, byte_size=128) at src/thread_cache.cc:160
#4 0x00007f356f239fa3 in Allocate (this=<optimized out>, cl=<optimized out>, size=<optimized out>) at src/thread_cache.h:364
#5 do_malloc_small (size=128, heap=<optimized out>) at src/tcmalloc.cc:1088
#6 do_malloc_no_errno (size=<optimized out>) at src/tcmalloc.cc:1095
#7 (anonymous namespace)::cpp_alloc (size=128, nothrow=<optimized out>) at src/tcmalloc.cc:1423
#8 0x00007f356f249538 in tc_new (size=139867476842368) at src/tcmalloc.cc:1601
#9 0x00007f3569c2523e in Dispatcher::schedule (this=0x5e7f880,
    callback=<error reading variable: DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>, outtid=0x6127930, priority=...,
    sleeptime=<optimized out>, isDaemon=true, mustComplete=false) at src/dispatcher.cc:243
#10 0x00007f3569c84c1a in TapConnNotifier::start (this=0x6127920) at src/tapconnmap.cc:66
#11 0x00007f3569c42362 in EventuallyPersistentEngine::initialize (this=0x5eb8a00, config=<optimized out>) at src/ep_engine.cc:1415
#12 0x00007f3569c42616 in EvpInitialize (handle=0x5eb8a00,
    config_str=0x7f356bf07993 "ht_size=3079;ht_locks=5;tap_noop_interval=20;max_txn_size=10000;max_size=1491075072;tap_keepalive=300;dbname=/opt/couchbase/var/lib/couchbase/data/default;allow_data_loss_during_shutdown=true;backend="...) at src/ep_engine.cc:126
#13 0x00007f356cf0f86a in create_bucket_UNLOCKED (e=<optimized out>, bucket_name=0x7f356bf07b80 "default", path=0x7f356bf07970 "/opt/couchbase/lib/memcached/ep.so", config=<optimized out>,
    e_out=<optimized out>, msg=0x7f356bf07560 "", msglen=1024) at bucket_engine.c:711
#14 0x00007f356cf0faac in handle_create_bucket (handle=<optimized out>, cookie=0x5e4bc80, request=<optimized out>, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2168
#15 0x00007f356cf10229 in bucket_unknown_command (handle=0x7f356d1171c0, cookie=0x5e4bc80, request=0x5e44000, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2478
#16 0x0000000000412c35 in process_bin_unknown_packet (c=<optimized out>) at daemon/memcached.c:2911
#17 process_bin_packet (c=<optimized out>) at daemon/memcached.c:3238
#18 complete_nread_binary (c=<optimized out>) at daemon/memcached.c:3805
#19 complete_nread (c=<optimized out>) at daemon/memcached.c:3887
#20 conn_nread (c=0x5e4bc80) at daemon/memcached.c:5744
#21 0x0000000000406e45 in event_handler (fd=<optimized out>, which=<optimized out>, arg=0x5e4bc80) at daemon/memcached.c:6012
#22 0x00007f356fd9948c in event_process_active_single_queue (activeq=<optimized out>, base=<optimized out>) at event.c:1308
#23 event_process_active (base=<optimized out>) at event.c:1375
#24 event_base_loop (base=0x5e8ec80, flags=<optimized out>) at event.c:1572
#25 0x0000000000415584 in worker_libevent (arg=0x16815f0) at daemon/thread.c:301
#26 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#27 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#28 0x0000000000000000 in ?? ()
(gdb)
Comment by Aleksey Kondratenko [ 25/Mar/14 ]
Yesterday I took that consistently failing ubuntu build and played with it on my box.

It is exactly the same situation. Replacing libtcmalloc.so makes it work.

So I've spent the afternoon running what's in our actual package under a debugger.

I found several pieces of evidence that some of the object files linked into the libtcmalloc.so that we ship were built with -DTCMALLOC_SMALL_BUT_SLOW and some _were_ not.

That explains the weird crashes.

I'm unable to explain how it's possible that our builders produced such .so files. Yet.

Gut feeling is that it might be:

* something caused by ccache

* perhaps not full cleanup between builds

To verify that, I'm asking for the following:

* do a build with ccache completely disabled but with the define

* do git clean -xfd inside the gperftools checkout before doing the build

Comment by Phil Labee [ 29/Jul/14 ]
The failure was detected by

    http://qa.sc.couchbase.com/job/centos_x64--00_02--compaction_tests-P0/

Can I run this test on a 3.0.0 build to see if this bug still exists?
Comment by Meenakshi Goel [ 30/Jul/14 ]
Started a run with latest 3.0.0 build 1057.
http://qa.hq.northscale.net/job/centos_x64--44_01--auto_compaction_tests-P0/37/console

However, I haven't seen such crashes with compaction tests during 3.0.0 testing.
Comment by Meenakshi Goel [ 30/Jul/14 ]
Tests passed with 3.0.0-1057-rel.
Comment by Wayne Siu [ 31/Jul/14 ]
Pavel also helped verify that this is not an issue in 3.0 (3.0.0-1067).
Comment by Wayne Siu [ 31/Jul/14 ]
Reopening for 2.5.x.




[MB-11623] test for performance regressions with JSON detection Created: 02/Jul/14  Updated: 31/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Matt Ingenthron Assignee: Thomas Anderson
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: 0h
Time Spent: 120h
Original Estimate: Not Specified

Attachments: File JSONDoctPerfTest140728.rtf    
Issue Links:
Relates to
relates to MB-11675 40-50% performance degradation on app... Open

 Description   
Related to one of the changes in 3.0, we need to test what has been implemented to see if a performance regression or unexpected resource utilization has been introduced.

In 2.x, all JSON detection was handled at the time of persistence. Since persistence was done in batches and in the background, with the then-current document, this limited the resource utilization of JSON detection.

Starting in 3.x, with the datatype/HELLO changes introduced (and currently disabled), the JSON detection has moved to both memcached and ep-engine, depending on the type of mutation.

Just to paint the reason this is a concern, here's a possible scenario.

Imagine a cluster node that is happily accepting 100,000 sets/s for a given small JSON document, and it accounts for about 20 Mbit of the network (small enough not to notice). That node has a fast SSD at about 8K IOPS. That means we'd only be doing JSON detection some 5,000 times per second with Couchbase Server 2.x.

With the changes already integrated, that JSON detection may be tried over 100k times/s. That's a 20x increase. The detection needs to occur somewhere other than on the persistence path, as the contract between DCP and view engine is such that the JSON detection needs to occur before DCP transfer.
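For illustration, here is a minimal, hypothetical sketch in C of a parse-style JSON check (the function name and the heuristic are assumptions; this is not the actual memcached/ep-engine detector). The point is only where the cost lands: in 2.x a check like this ran at persistence time (~5K items/s in the scenario above), while with the 3.x changes it runs once per mutation (~100K/s), roughly 20x more often.

/*
 * Illustration only: NOT the detector used by memcached/ep-engine, just a
 * hypothetical stand-in.  A real detector parses the whole document; this
 * sketch only trims whitespace and checks the outermost delimiters, which
 * is enough to show that the work now happens once per mutation instead of
 * once per persisted item.
 */
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static int looks_like_json(const char *value, size_t nvalue)
{
    size_t begin = 0, end = nvalue;

    /* skip leading and trailing whitespace */
    while (begin < end && isspace((unsigned char)value[begin])) {
        begin++;
    }
    while (end > begin && isspace((unsigned char)value[end - 1])) {
        end--;
    }
    if (begin == end) {
        return 0;
    }
    if (value[begin] == '{') {
        return value[end - 1] == '}';
    }
    if (value[begin] == '[') {
        return value[end - 1] == ']';
    }
    return 0;
}

int main(void)
{
    const char *doc = "  {\"name\": \"foo\"}  ";
    printf("json=%d\n", looks_like_json(doc, strlen(doc)));
    return 0;
}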

This request is to test/assess if there is a performance change and/or any unexpected resource utilization when having fast mutating JSON documents.

I'll leave it to the team to decide what the right test is, but here's what I might suggest.

With a view defined, create a test that has a small to moderate load at steady state and one fast-changing item. Test it with a set of sizes and different complexities. For instance, permutations might be something like this:
non-JSON of 1k, 8k, 32k, 128k
simple JSON of 1k, 8k, 32k, 128k
complex JSON of 1k, 8k, 32k, 128k
metrics to gather:
throughput, CPU utilization by process, RSS by process, memory allocation requests by process (or minor faults or something)

Hopefully we won't see anything to be concerned with, but it is possible.

There are options to move JSON detection to somewhere later in processing (i.e., before DCP transfer) or other optimization thoughts if there is an issue.

 Comments   
Comment by Cihan Biyikoglu [ 07/Jul/14 ]
This is no longer needed for 3.0, is that right? Ready to postpone to 3.0.1?
Comment by Pavel Paulau [ 07/Jul/14 ]
HELLO-based negotiation was disabled, but detection still happens in ep-engine.
We need to understand the impact before the 3.0 release, sooner rather than later.
Comment by Matt Ingenthron [ 23/Jul/14 ]
I'm curious Thomas, when you say "increase in bytes appended", do you mean for the same workload the RSS is larger in the 'increase' case? Great to see you making progress.
Comment by Wayne Siu [ 24/Jul/14 ]
Pasted comment from Thomas:
Subject: Re: Couchbase Issues: (MB-11623) test for performance regressions with JSON detection
Yes, ~20% increase from 2.5.1 to 3.0 for the same load generator, as reported by the Couchbase server for the same input load. I'm verifying and 'isolating'. Will also be looking at if/how this contributes to the replication load increase (20% on top of a 20% increase ...)
The issues seem related. Same increase for 1K, 8K, 16K and 32K with some variance.
—thomas
Comment by Thomas Anderson [ 29/Jul/14 ]
initial results using JSON document load test.
Comment by Matt Ingenthron [ 29/Jul/14 ]
Tom: saw your notes in the work log, out of curiosity, what was deferred to 3.0.1? Also, from the comment above, 20% increase in what?




[MB-11779] Memory underflow in updates-only scenario with 5 buckets Created: 21/Jul/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-988

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: Text File gdb.log    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/view/lab/job/perf-dev/503/artifact/
Is this a Regression?: Yes

 Description   
Essentially re-opened MB-11661.

2 nodes, 5 buckets, 200K x 1KB docs per bucket (non-DGM), 2K updates per bucket.

Mon Jul 21 13:24:34.955935 PDT 3: (bucket-1) Total memory in memoryDeallocated() >= GIGANTOR !!! Disable the memory tracker...

 Comments   
Comment by Sriram Ganesan [ 22/Jul/14 ]
Pavel

How often would you say this reproduces in your environment? I tried this locally a few times and didn't hit this.
Comment by Pavel Paulau [ 23/Jul/14 ]
Pretty much every time.

It usually takes >10 hours before the test encounters the GIGANTOR failure, but the slowly decreasing mem_used clearly indicates the issue.
Comment by Pavel Paulau [ 26/Jul/14 ]
Just spotted again in different scenario, build 3.0.0-1024. Proof: https://s3.amazonaws.com/bugdb/jira/MB-11779/172.23.96.11.zip .
Comment by Sriram Ganesan [ 28/Jul/14 ]
Pavel

Thanks for uploading those logs. I see a bunch of vbucket deletion messages in the test:

Fri Jul 25 07:33:16.745484 PDT 3: (bucket-10) Deletion of vbucket 1023 was completed.
Fri Jul 25 07:33:16.745619 PDT 3: (bucket-10) Deletion of vbucket 1022 was completed.
Fri Jul 25 07:33:16.745739 PDT 3: (bucket-10) Deletion of vbucket 1021 was completed.
Fri Jul 25 07:33:16.745887 PDT 3: (bucket-10) Deletion of vbucket 1020 was completed.
Fri Jul 25 07:33:16.746005 PDT 3: (bucket-10) Deletion of vbucket 1019 was completed.
Fri Jul 25 07:33:16.746177 PDT 3: (bucket-10) Deletion of vbucket 1018 was completed.

This seems to be the case for all the buckets, but the GIGANTOR message only shows up for 5 of them. Are these logs from the same test? Are you doing any forced shutdown of any of the buckets in your test? Apparently, according to Chiyoung, there is a known issue in ep-engine at bucket shutdown time where the GIGANTOR message can manifest, affecting only the bucket that is shut down.
Comment by Sriram Ganesan [ 28/Jul/14 ]
Also, please confirm if any rebalance operations were done in the logs uploaded on the 25th.
Comment by Pavel Paulau [ 28/Jul/14 ]
Sriram,

Logs are from different test/setup (with 10 buckets).

There was only one rebalance event during initial cluster setup:

2014-07-25 07:33:07.970 ns_orchestrator:4:info:message(ns_1@172.23.96.11) - Starting rebalance, KeepNodes = ['ns_1@172.23.96.11','ns_1@172.23.96.12',
                                 'ns_1@172.23.96.13','ns_1@172.23.96.14'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes
2014-07-25 07:33:07.995 ns_orchestrator:1:info:message(ns_1@172.23.96.11) - Rebalance completed successfully.

10 buckets were created after that:

2014-07-25 07:33:13.674 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:13.784 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:14.005 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:14.005 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:14.006 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:14.031 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:14.082 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:14.384 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.14' in 1 seconds.
2014-07-25 07:33:14.384 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.385 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.588 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.588 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.682 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.11' in 1 seconds.
2014-07-25 07:33:15.107 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.12' in 1 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.111 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.111 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.218 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.219 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.219 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.219 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.303 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.303 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.304 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.304 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.304 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.305 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.312 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.610 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.716 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.802 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.811 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.11' in 0 seconds.

Basically, bucket shutdown wasn't forced. All those operations are quite normal.

Also, from the logs I can see the underflow issue only in "bucket-10".
Comment by Pavel Paulau [ 30/Jul/14 ]
Hi Sriram,

I can start the test which will reproduce the issue. Will a live cluster help?
Comment by Sriram Ganesan [ 30/Jul/14 ]
Pavel

I was planning on providing a toy build today; I need to do more local testing in my environment before I can provide it. The current theory is that the root cause actually happens much earlier, messing up the accounting and eventually leading to an underflow. I shall try to do so before noon today.
Comment by Sriram Ganesan [ 30/Jul/14 ]
Pavel

Please try the following toy build: http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent58-3.0.0-toy-sriram-x86_64_3.0.0-712-toy.rpm. The memcached process will crash once it hits a particular condition which could be the manifestation of the bug. You will likely hit that right after the updates start. Also, please run the test on a regular build but with just one node to see if we hit this problem there. If you don't hit it, we can rule out other areas and DCP/UPR is more likely to be a potential culprit.
Comment by Pavel Paulau [ 31/Jul/14 ]
It does crash almost immediately; backtraces are attached.

Logs:
http://ci.sc.couchbase.com/job/perf-dev/540/artifact/172.23.100.17.zip
http://ci.sc.couchbase.com/job/perf-dev/540/artifact/172.23.100.18.zip
Comment by Pavel Paulau [ 31/Jul/14 ]
Memory usage seems stable in the single-node setup. I made only one run though.
Comment by Sriram Ganesan [ 31/Jul/14 ]
Thanks Pavel. The crash probably establishes that an allocation for an object is accounted to one bucket (or to no bucket if it is in the memcached layer) while the deallocation is accounted to a different bucket; the fact that the single-node setup is quite stable might point to DCP being the more likely culprit here.
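For illustration only (the counter names are assumptions, not the actual memory-tracker code): with per-bucket byte counters, crediting an allocation to one bucket and the matching deallocation to another drives the second counter below zero; since the counters are unsigned they wrap to a huge value, which is the "GIGANTOR" symptom the guard reports.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-bucket byte counters, for illustration only. */
static uint64_t mem_used[2]; /* bucket 0 and bucket 1 */

static void on_alloc(int bucket, uint64_t bytes)   { mem_used[bucket] += bytes; }
static void on_dealloc(int bucket, uint64_t bytes) { mem_used[bucket] -= bytes; }

int main(void)
{
    on_alloc(0, 128);   /* allocation attributed to bucket 0 ... */
    on_dealloc(1, 128); /* ... but the free is attributed to bucket 1 */

    /* bucket 1 wraps around to a huge value (the "GIGANTOR" symptom) */
    printf("bucket 0: %llu bytes\n", (unsigned long long)mem_used[0]);
    printf("bucket 1: %llu bytes\n", (unsigned long long)mem_used[1]);
    return 0;
}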




[MB-11843] {UPR} :: View Query timesout after a rebalance-in-out operation Created: 29/Jul/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Sarath Lakshman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 1:172.23.107.210
2:172.23.107.211
3:172.23.107.212
4:172.23.107.213
5:172.23.107.214
6:172.23.107.215

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
1046, centos 6x

1. Create 5 node cluster
2. Create default bucket
3. Add 500 K Items
4. Create 3 ddocs with 3 views, start indexing and querying
5. Rebalance-in-out (out 2 node, in 1 node)
6. Query the views

Hit the exception

2014-07-29 05:38:01,573] - [rest_client:484] INFO - index query url: http://172.23.107.210:8092/default/_design/ddoc0/_view/view2?connectionTimeout=60000&full_set=true
ERROR
[('/usr/lib/python2.7/threading.py', 524, '__bootstrap', 'self.__bootstrap_inner()'), ('/usr/lib/python2.7/threading.py', 551, '__bootstrap_inner', 'self.run()'), ('./testrunner.py', 262, 'run', '**self._Thread__kwargs)'), ('/usr/lib/python2.7/unittest/runner.py', 151, 'run', 'test(result)'), ('/usr/lib/python2.7/unittest/case.py', 391, '__call__', 'return self.run(*args, **kwds)'), ('/usr/lib/python2.7/unittest/case.py', 327, 'run', 'testMethod()'), ('pytests/rebalance/rebalanceinout.py', 501, 'measure_time_index_during_rebalance', 'tasks[task].result(self.wait_timeout)'), ('lib/tasks/future.py', 162, 'result', 'self.set_exception(TimeoutError())'), ('lib/tasks/future.py', 264, 'set_exception', 'print traceback.extract_stack()')]

Looking at couchdb logs for 1 node

[couchdb:error,2014-07-29T10:24:25.367,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29759.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:25.572,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29762.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:25.821,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29770.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:27.556,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29821.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:27.685,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29827.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:28.105,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29840.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:28.575,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29852.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:28.805,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29857.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:28.985,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29875.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:29.143,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29878.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:29.393,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29881.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:29.533,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29894.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.040,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29910.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.177,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29913.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.333,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29918.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.524,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29925.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.687,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29937.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.802,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29945.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.994,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29956.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.160,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29960.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.325,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29963.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.455,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29966.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.556,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29969.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.719,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29972.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.831,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29975.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:25:13.438,ns_1@10.3.121.63:<0.19295.1>:couch_log:error:44]Cleanup process <0.30517.1> for set view `default`, main (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:25:46.471,ns_1@10.3.121.63:<0.19325.1>:couch_log:error:44]upr client (default, mapreduce_view: default _design/ddoc0 (prod/replica)): upr receive worker failed due to reason: closed. Restarting upr receive worker...
[couchdb:error,2014-07-29T10:25:46.471,ns_1@10.3.121.63:<0.19307.1>:couch_log:error:44]upr client (default, mapreduce_view: default _design/ddoc0 (prod/main)): upr receive worker failed due to reason: closed. Restarting upr receive worker...
[couchdb:error,2014-07-29T10:25:48.477,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Set view `default`, replica (prod) group `_design/ddoc0`, UPR process <0.19325.1> died with unexpected reason: vbucket_stream_not_found
[couchdb:error,2014-07-29T10:25:48.479,ns_1@10.3.121.63:<0.19295.1>:couch_log:error:44]Set view `default`, main (prod) group `_design/ddoc0`, UPR process <0.19307.1> died with unexpected reason: vbucket_stream_not_found


Test Case:: ./testrunner -i centos_x64_rebalance_in_out.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceinout.RebalanceInOutTests.measure_time_index_during_rebalance,items=500000,data_perc_add=50,nodes_init=5,nodes_in=1,nodes_out=2,num_ddocs=3,num_views=3,max_verify=50000,GROUP=IN_OUT;P1;FROM_2_0;PERFORMANCE


 Comments   
Comment by Parag Agarwal [ 29/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11843/1046_log.tar.gz
Comment by Aleksey Kondratenko [ 29/Jul/14 ]
This can go directly to view engine IMO.
Comment by Meenakshi Goel [ 30/Jul/14 ]
Observing a similar issue during Views tests with the latest build 3.0.0-1057-rel:
http://qa.sc.couchbase.com/job/centos_x64--29_01--create_view_all-P1/129/consoleFull
Comment by Sarath Lakshman [ 30/Jul/14 ]
I have identified the problem and will be posting a fix by tomorrow.
Comment by Meenakshi Goel [ 31/Jul/14 ]
Promoting to blocker as multiple tests are failing due to this issue.
http://qa.sc.couchbase.com/job/centos_x64--29_01--create_view_all-P1/129/consoleFull
Comment by Mike Wiederhold [ 31/Jul/14 ]
I found this in the memcached logs. Looks like a bucket authentication error.

10.3.121.63
Tue Jul 29 10:25:47.693563 PDT 3: 53: Invalid username/password combination

10.3.121.64
Tue Jul 29 10:25:47.710180 PDT 3: 53: Invalid username/password combination
Comment by Sarath Lakshman [ 31/Jul/14 ]
Tried to implement a fix and later found in my tests that the solution has pitfalls and is not clean enough.
Our current reconnection implementation in couch_upr_client needs some rework to handle all cases. It will probably take a bit longer to fix this problem.

Mike, could you tell us which are the cases in which ep-engine drops UPR connections?
Currently, we have a reconnection and repair mechanism within couch_upr_client for whenever ep-engine closes an UPR connection. We have some bugs in handling all the cases. It would be great if you could list the cases, especially rebalance scenarios.




[MB-10371] tcmalloc must be compiled with -DTCMALLOC_SMALL_BUT_SLOW [ 1 ] Created: 05/Mar/14  Updated: 31/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.0, 2.5.1
Fix Version/s: 2.5.1, 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aleksey Kondratenko Assignee: Phil Labee
Resolution: Unresolved Votes: 0
Labels: 2.5.1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-10440 something isn't right with tcmalloc i... Reopened
relates to MB-10439 Upgrade:: 2.5.0-1059 to 2.5.1-1074 =>... Resolved
Triage: Triaged
Is this a Regression?: Yes

 Description   
Possible candidate for 2.5.1, given it's an easy fix.

Based on MB-7887, and particularly on runs comparing older and newer 2.2.0 builds (after the 2.2.0 release; I guess those were purely for hotfixes or who knows what), we see a tcmalloc memory fragmentation regression. E.g. here: https://www.couchbase.com/issues/secure/attachment/19549/10KB-1MB_250Items_10KB_delta.log

We see that 2.2.0-821 (GA) has:

2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] detailed: NOTE: SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER.
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] ------------------------------------------------
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: 645831408 ( 615.9 MiB) Bytes in use by application
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 120201216 ( 114.6 MiB) Bytes in page heap freelist
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 10568336 ( 10.1 MiB) Bytes in central cache freelist
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 0 ( 0.0 MiB) Bytes in transfer cache freelist
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 2810496 ( 2.7 MiB) Bytes in thread cache freelists
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 1831064 ( 1.7 MiB) Bytes in malloc metadata
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: ------------
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: = 781242520 ( 745.1 MiB) Actual memory used (physical + swap)
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 26411008 ( 25.2 MiB) Bytes released to OS (aka unmapped)
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: ------------
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: = 807653528 ( 770.2 MiB) Virtual address space used
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC:
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: 6848 Spans in use
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: 18 Thread heaps in use
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: 8192 Tcmalloc page size
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] ------------------------------------------------
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Bytes released to the OS take up virtual address space but no physical memory.

I.e. about 750 MB of RAM is used,

while 2.2.0-840 (the later build I referred to above) eats 820 MB. There's a similar situation with 2.5.0 GA.

The only regression of 2.2.0-840 vs. 2.2.0-821 that's apparent here is the lack of -DTCMALLOC_SMALL_BUT_SLOW in the build (seen from the absence of the "NOTE: SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER." warning in later builds).

Therefore we should restore that define, which is known to be passed in earlier builds and which was somehow dropped in later builds. It's possible that this was intentional for some specific reason, but I'm not aware of that. It looks like a simple regression instead.

Related bugs are MB-7887 and MB-9930.

 Comments   
Comment by Phil Labee [ 07/Mar/14 ]
Abandoned change: http://review.couchbase.org/#/c/34270/

Instead, add this to the make command:

   libtcmalloc_EXTRA_OPTIONS="-DTCMALLOC_SMALL_BUT_SLOW"

This change is made in membase / buildbot-internal / Buildbot / master.cfg:

https://github.com/membase/buildbot-internal/commit/6d7da38047eaf8ba9fb89aa0544c3cdc8697f53b
Comment by Phil Labee [ 11/Mar/14 ]
voltron (2.5.1) commit: 73125ad66996d34e94f0f1e5892391a633c34d3f

    http://review.couchbase.org/#/c/34344/

passes "CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW" to each gperftools configure command
Comment by Phil Labee [ 11/Mar/14 ]
fixed in build: 2.5.1-1074
Comment by Maria McDuff (Inactive) [ 11/Mar/14 ]
Phil,

Is this the build that's ready for QE to test, 2.5.1-1074?
Comment by Maria McDuff (Inactive) [ 11/Mar/14 ]
Andrei,

Can you confirm that this message is no longer appearing on this new build, 1074? Thanks.
On all OSes:

2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] detailed: NOTE: SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER.

-win32,64 bit
-centos32,64 bit
-ubuntu32,64 bit
-mac64 - I can verify this since you don't have a mac
Comment by Andrei Baranouski [ 11/Mar/14 ]
Hi Maria,
I have a Mac as well to check it out. I have already talked with Alk about what scenarios to test the build with. I'll give the results tomorrow.
Comment by Maria McDuff (Inactive) [ 14/Mar/14 ]
FYI Andrei, a working tcmalloc build is still not available...
Comment by Wayne Siu [ 26/Mar/14 ]
There are some issues uncovered related to the fix.
Comment by Wayne Siu [ 31/Jul/14 ]
Pavel helped verify that this is still an issue in 3.0.
Raising the priority to blocker. Would like to have the fix before Aug 01, 2014.
Comment by Phil Labee [ 31/Jul/14 ]
These changes:

    http://review.couchbase.org/#/c/40129
    http://review.couchbase.org/#/c/40130

add tcmalloc configure flag: CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW




[MB-11405] Shared thread pool: high CPU overhead due to OS level context switches / scheduling Created: 11/Jun/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-805

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File cpu.png     PNG File max_threads_cpu.png     PNG File memcached_cpu_b988.png     PNG File memcached_cpu_toy.png     Text File perf_b829.log     Text File perf_b854_8threads.log     Text File perf.log    
Issue Links:
Relates to
relates to MB-11434 600-800% CPU consumption by memcached... Closed
relates to MB-11738 Evaluate GIO CPU utilization on syste... Open
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/perf-dev/424/artifact/172.23.100.17.zip
http://ci.sc.couchbase.com/job/perf-dev/424/artifact/172.23.100.18.zip
Is this a Regression?: Yes

 Description   
Originally reported as "~2400% CPU consumption by memcached during ongoing workload with five (5) buckets ".

The CPU usage of the memcached process is more than two times the usage in the previous release. This is due to increased scheduling overhead from the shared thread pool.
Workaround: reduce the number of threads on systems that have more than 30 cores.

2 nodes, 5 buckets
1M docs (clusterwise), equally distributed, non-DGM
10K mixed ops/sec (85% reads, 1% creates, 1% deletes, 13% updates; clusterwise), equally distributed

CPU utilization in 2.5.1: ~300%
CPU utilization in 3.0.0: ~2400%



 Comments   
Comment by Pavel Paulau [ 12/Jun/14 ]
Interesting chart that shows how CPU utilization depends on #buckets (2-18) and #nodes (2, 4, 8).
Comment by Sundar Sridharan [ 16/Jun/14 ]
More nodes mean fewer vbuckets per node, resulting in fewer writer tasks, which may explain the lower CPU per node.
Here is a partial fix based on the attached perf.log, http://review.couchbase.org/38337, which I hope will help.
More fixes may follow if needed. Thanks.
Comment by Sundar Sridharan [ 16/Jun/14 ]
Hi Pavel, the fix to reduce the getDescription() noise has been merged.
Could you please re-run the workload and see if we still have high CPU usage, and if so, what the new profiler output looks like? Thanks.
Comment by Pavel Paulau [ 18/Jun/14 ]
Still high CPU utilization.
Comment by Sundar Sridharan [ 18/Jun/14 ]
Thanks Pavel. Looks like the getDescription() noise has gone away. However, this performance result is quite interesting: 85% of the overhead is from the kernel, most likely context switching from the higher number of threads. This will require some more creative solutions to reduce the CPU usage without suffering a performance overhead.
Comment by Sundar Sridharan [ 20/Jun/14 ]
Another fix to reduce active system CPU usage, by letting only 1 thread snooze while the others sleep, is located here: http://review.couchbase.org/38620. Thanks.
Comment by Sundar Sridharan [ 20/Jun/14 ]
Pavel, the fix has been merged. Local testing showed marginal improvement. Could you please retry the test and let me know if it helps in the larger setup? Thanks.
Comment by Pavel Paulau [ 20/Jun/14 ]
Ok, will do. Any expected side effects?
Comment by Pavel Paulau [ 21/Jun/14 ]
I have tried build 3.0.0-854, which includes your change. No impact on performance, still very high CPU utilization.

Please notice that CPU consumption drops to ~400% when I decrease the number of threads from 30 (auto-tuned) to 8.
Comment by Sundar Sridharan [ 23/Jun/14 ]
Reducing the number of threads should not be the solution. The main new thing in 3.0 is that we can have 4 writer threads per bucket, so with 5 buckets we may have 20 writer threads. In 2.5 there would be only 5 writer threads for 5 buckets.
This means we should not expect lower than 4 times the CPU use of 2.5, simply because the cost of the increased CPU is buying us lower disk write latency.
Comment by Pavel Paulau [ 23/Jun/14 ]
Fair enough.

In this case the resolution criterion for this ticket should be 600% CPU utilization by memcached.
Comment by Chiyoung Seo [ 26/Jun/14 ]
Another fix was merged:

http://review.couchbase.org/#/c/38756/
Comment by Pavel Paulau [ 26/Jun/14 ]
Sorry,

The same utilization - build 3.0.0-884.
Comment by Sundar Sridharan [ 30/Jun/14 ]
A debugging fix was merged here: http://review.couchbase.org/#/c/38909/. If possible, could you please leave the cluster with this change on for some time for debugging? Thanks.
Comment by Pavel Paulau [ 30/Jun/14 ]
There might be a delay in getting results due to limited h/w resources and the upcoming beta release.
Comment by Pavel Paulau [ 01/Jul/14 ]
Assigning back to Sundar because he is working on his own test.
Comment by Pavel Paulau [ 02/Jul/14 ]
Promoting to "Blocker", it currently seems to be one of the most severe performance issues in 3.0.
Comment by Sundar Sridharan [ 02/Jul/14 ]
Pavel, could you try setting max_threads=20 and re-running the workload to see if this reduces the CPU overhead enough to unblock other performance testing? Thanks.
Comment by Pavel Paulau [ 02/Jul/14 ]
Will do, after beta release.

But please notice that performance testing is not blocked.
Comment by Pavel Paulau [ 04/Jul/14 ]
Some interesting observations...

For the same workload I compared the number of scheduler wake-ups.
3.0-beta with 4 front-end threads and 30 ep-engine threads (auto-tuned):

$ perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '47284':

         7,940,880 sched:sched_wakeup

      30.000548575 seconds time elapsed

2.5.1 with default settings:

$ perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '3677':

           117,003 sched:sched_wakeup

      30.000550702 seconds time elapsed
 
Not surprisingly, a more write-heavy workload (all ops are updates) reduces CPU utilization (down to 600-800%) and scheduling overhead:

$ perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '22699':

         4,014,534 sched:sched_wakeup

      30.000556091 seconds time elapsed

Obviously the global IO scheduler works well when the IO workload is aggressive and there is always work to do.
It is extremely costly when threads constantly need to be put to sleep and woken up, which is not uncommon.
Comment by Sundar Sridharan [ 07/Jul/14 ]
Thanks Pavel, as discussed, could you please update the ticket with the results from thread throttling on your 48 core setup?
Comment by Pavel Paulau [ 07/Jul/14 ]
btw, it has only 40 cores/vCPU.
Comment by Sundar Sridharan [ 08/Jul/14 ]
Thanks for the graph, Pavel - this confirms our theory that with a higher number of threads our scheduler is not able to put threads to sleep efficiently.
Comment by Sundar Sridharan [ 13/Jul/14 ]
Fix for distributed sleep uploaded for review; it is expected to lower the scheduling overhead: http://review.couchbase.org/#/c/39210/. Thanks.
Comment by Sundar Sridharan [ 15/Jul/14 ]
Hi Pavel, could you please let us know if the fix in this toy build shows any cpu improvement?
couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64.rpm
thanks
Comment by Pavel Paulau [ 16/Jul/14 ]
I assume you meant couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0-703-toy.rpm

See my comment in MB-11434.
Comment by Pavel Paulau [ 16/Jul/14 ]
Logs: http://ci.sc.couchbase.com/view/lab/job/perf-dev/498/artifact/
Comment by Sundar Sridharan [ 18/Jul/14 ]
Dynamically configurable thread limits fix uploaded for review: http://review.couchbase.org/#/c/39475/
It is expected to mitigate heavy CPU usage and allow tunable testing.
Comment by Chiyoung Seo [ 18/Jul/14 ]
The change was merged.

Pavel, please test it again when you have time.
Comment by Pavel Paulau [ 20/Jul/14 ]
This is how it looks now.

Logs:
http://ci.sc.couchbase.com/view/lab/job/perf-dev/501/artifact/
Comment by Sundar Sridharan [ 21/Jul/14 ]
From the uploaded performance logs, it looks like with the recent changes memcached's CPU usage dropped from
  85.20% memcached [kernel.kallsyms] [k] _spin_lock
down to
  16.01% memcached [kernel.kallsyms] [k] _spin_lock

That is a 5x improvement, which means we are looking at about 500% usage - only a marginal increase over the 300% CPU usage in 2.5, but with better consolidation.
Could you please close this bug if you find this satisfactory? Thanks
Comment by Sundar Sridharan [ 21/Jul/14 ]
Pavel, another fix, addressing a CPU hotspot in the persistence path, has been uploaded for review. Sorry to ask again, but could you please retest with this fix: http://review.couchbase.org/#/c/39645
Comment by Pavel Paulau [ 29/Jul/14 ]
I just tried build 3.0.0-1045 and test case with 5 buckets.

CPU utilization is still very high (~2400%) and resources are mostly spent in kernel space:

# sar -u 4
Linux 2.6.32-431.17.1.el6.x86_64 (atlas-s310) 07/28/2014 _x86_64_ (40 CPU)

10:57:44 PM CPU %user %nice %system %iowait %steal %idle
10:57:48 PM all 6.36 0.00 49.99 1.09 0.00 42.56
10:57:52 PM all 6.09 0.00 50.86 0.90 0.00 42.14
10:57:56 PM all 6.28 0.00 46.59 1.13 0.00 46.00
10:58:00 PM all 6.15 0.00 48.49 0.93 0.00 44.43
10:58:04 PM all 6.01 0.00 48.77 1.14 0.00 44.08
10:58:08 PM all 6.22 0.00 48.21 1.14 0.00 44.44

Rate of wakeups is high as well:

# perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '29970':

         8,888,980 sched:sched_wakeup

      30.013133143 seconds time elapsed

From perf profiler:

    82.33% memcached [kernel.kallsyms] [k] _spin_lock

https://s3.amazonaws.com/bugdb/jira/MB-11405/perf_b1045.log
Comment by Sundar Sridharan [ 31/Jul/14 ]
fix: http://review.couchbase.org/#/c/40080/ and
fix: http://review.couchbase.org/#/c/40084/
are expected to reduce CPU context-switching overhead and also bring bgfetch latencies back to 2.5.1 levels.
Pavel, could you please verify this in your setup?
Thanks
Comment by Chiyoung Seo [ 31/Jul/14 ]
Pavel,

The above two changes were just merged. I hope these finally resolve the issue :)




[MB-11799] Bucket compaction causes massive slowness of flusher and UPR consumers Created: 23/Jul/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1005

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File compaction_b1-vs-compaction_b2-vs-ep_upr_replica_items_remaining-vs_xdcr_lag.png    
Issue Links:
Duplicate
is duplicated by MB-11731 Persistence to disk suffers from buck... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/386/artifact/
Is this a Regression?: Yes

 Description   
5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec, LAN

Similar to MB-11731, which keeps getting worse. But now compaction affects intra-cluster replication and XDCR latency as well:

"ep_upr_replica_items_remaining" reaches 1M during compaction
"xdcr latency" reaches 5 minutes during compaction.

See attached charts for details. Full reports:

http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1005_a66_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1005_6d2_access

One important change that we made recently - http://review.couchbase.org/#/c/39647/.

The last known working build is 3.0.0-988.

 Comments   
Comment by Pavel Paulau [ 23/Jul/14 ]
Chiyoung,

This is a really critical regression. It affects many XDCR tests and also blocks many investigation/tuning efforts.
Comment by Sundar Sridharan [ 25/Jul/14 ]
Fix added for review at http://review.couchbase.org/39880. Thanks
Comment by Chiyoung Seo [ 25/Jul/14 ]
I made several fixes for this issue:

http://review.couchbase.org/#/c/39906/
http://review.couchbase.org/#/c/39907/
http://review.couchbase.org/#/c/39910/

We will provide the toy build for Pavel.
Comment by Pavel Paulau [ 26/Jul/14 ]
The toy build helps a lot.

It doesn't fix the problem but at least minimizes the regression:
-- ep_upr_replica_items_remaining is close to zero now
-- write queue is 10x lower
-- max xdcr latency is about 8-9 seconds

Logs: http://ci.sc.couchbase.com/view/lab/job/perf-dev/530/
Reports:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-785-toy_6ed_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-785-toy_269_access
Comment by Chiyoung Seo [ 26/Jul/14 ]
Thanks Pavel for the updates. We will merge the above changes soon.

Do you mean that both the disk write queue size and XDCR latency are still regressions? Or is XDCR your main concern?

As you pointed out above, the recent change parallelizing compaction (4 concurrent tasks by default) is most likely the main root cause of this issue. Do you still see compaction slowness in your tests? I guess "no", because we can now run 4 concurrent compaction tasks on each node.

I will talk to Aliaksey to understand that change more.
Comment by Chiyoung Seo [ 26/Jul/14 ]
Pavel,

I will continue to look at some more optimizations on the ep-engine side. In the meantime, you may want to test the toy build again by lowering compaction_number_of_kv_workers on the ns-server side from 4 to 1. As mentioned in http://review.couchbase.org/#/c/39647/ , that parameter is configurable on the ns-server side.
Comment by Chiyoung Seo [ 26/Jul/14 ]
Btw, all the changes above were merged. You can use the new build and lower the above compaction parameter.
Comment by Pavel Paulau [ 28/Jul/14 ]
Build 3.0.0-1035 with compaction_number_of_kv_workers = 1:

http://ci.sc.couchbase.com/job/perf-dev/533/artifact/

Source: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1035_276_access
Destination: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1035_624_access

Disk write queue is lower (max ~5-10K) but xdcr latency is still high (several seconds) and affected by compaction.
Comment by Chiyoung Seo [ 30/Jul/14 ]
Pavel,

The following change is merged:

http://review.couchbase.org/#/c/40043/

I plan to make another change for this issue today, but you may want to test it with the new build that includes the above fix
Comment by Chiyoung Seo [ 30/Jul/14 ]
I just pushed another important change in gerrit for review:

http://review.couchbase.org/#/c/40059/
Comment by Chiyoung Seo [ 30/Jul/14 ]
Pavel,

The above two changes were merged. Please retest it to see if they resolve this issue.
Comment by Pavel Paulau [ 31/Jul/14 ]
It does not.

Comparison with previously tested build 3.0.0-1045:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1045_f29_access&snapshot=atlas_c1_300-1061_8b3_access

http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1045_cf4_access&snapshot=atlas_c2_300-1061_3c8_access

Pretty much the same characteristics. Logs:
http://ci.sc.couchbase.com/job/xdcr-5x5/409/artifact/
Comment by Chiyoung Seo [ 31/Jul/14 ]
Thanks Pavel for the updates.

I debugged this issue further and found that a lot of UPR backfill tasks were being scheduled unnecessarily even when items could be read from in-memory checkpoints. Mike pushed a fix to address this:

http://review.couchbase.org/#/c/40145/




[MB-11048] Range queries result in thousands of GET operations/sec Created: 05/May/14  Updated: 18/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
A benchmark for range queries demonstrated very high latency. At the same time I noticed an extremely high rate of GET operations.

Even a single query such as "SELECT name.f.f.f AS _name FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000 LIMIT 20" led to hundreds of memcached reads.

Explain:

https://gist.github.com/pavel-paulau/5e90939d6ab28034e3ed

Engine output:

https://gist.github.com/pavel-paulau/b222716934dfa3cb598e

I don't like to use JIRA as a forum, but why does this happen? Do you fetch the entire range before returning the limited output?

 Comments   
Comment by Gerald Sangudi [ 05/May/14 ]
Pavel,

Yes, the scan and fetch are performed before we do any LIMIT. This will be fixed in DP4, but it may not be easily fixable in DP3.

Can you please post the results of the following query:

SELECT COUNT(*) FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000

Thanks.
Comment by Pavel Paulau [ 05/May/14 ]
cbq> SELECT COUNT(*) FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000
{
    "resultset": [
        {
            "$1": 2134
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "1"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "547.545767ms"
        }
    ]
}
Comment by Pavel Paulau [ 05/May/14 ]
Also, it looks like we are leaking memory in this scenario.

Resident memory of cbq-engine grows very fast (several megabytes per second) and never goes down...




[MB-11007] Request for Get Multi Meta Call for bulk meta data reads Created: 30/Apr/14  Updated: 30/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Parag Agarwal Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: All


 Description   
Currently we support a per-key call for getMetaData, so our verification requires a per-key fetch during the verification phase. This request is to support a bulk get-metadata call which can return metadata per vbucket, for all keys or in batches. This would make it faster to verify per-document metadata over time or after operations like rebalance. If there is a better alternative, please recommend one.

Current Behavior

https://github.com/couchbase/ep-engine/blob/master/src/ep.cc

ENGINE_ERROR_CODE EventuallyPersistentStore::getMetaData(
                                                        const std::string &key,
                                                        uint16_t vbucket,
                                                        const void *cookie,
                                                        ItemMetaData &metadata,
                                                        uint32_t &deleted,
                                                        bool trackReferenced)
{
    (void) cookie;
    RCPtr<VBucket> vb = getVBucket(vbucket);
    if (!vb || vb->getState() == vbucket_state_dead ||
        vb->getState() == vbucket_state_replica) {
        ++stats.numNotMyVBuckets;
        return ENGINE_NOT_MY_VBUCKET;
    }

    int bucket_num(0);
    deleted = 0;
    LockHolder lh = vb->ht.getLockedBucket(key, &bucket_num);
    StoredValue *v = vb->ht.unlocked_find(key, bucket_num, true,
                                          trackReferenced);

    if (v) {
        stats.numOpsGetMeta++;

        if (v->isTempInitialItem()) { // Need bg meta fetch.
            bgFetch(key, vbucket, -1, cookie, true);
            return ENGINE_EWOULDBLOCK;
        } else if (v->isTempNonExistentItem()) {
            metadata.cas = v->getCas();
            return ENGINE_KEY_ENOENT;
        } else {
            if (v->isTempDeletedItem() || v->isDeleted() ||
                v->isExpired(ep_real_time())) {
                deleted |= GET_META_ITEM_DELETED_FLAG;
            }
            metadata.cas = v->getCas();
            metadata.flags = v->getFlags();
            metadata.exptime = v->getExptime();
            metadata.revSeqno = v->getRevSeqno();
            return ENGINE_SUCCESS;
        }
    } else {
        // The key wasn't found. However, this may be because it was previously
        // deleted or evicted with the full eviction strategy.
        // So, add a temporary item corresponding to the key to the hash table
        // and schedule a background fetch for its metadata from the persistent
        // store. The item's state will be updated after the fetch completes.
        return addTempItemForBgFetch(lh, bucket_num, key, vb, cookie, true);
    }
}



 Comments   
Comment by Venu Uppalapati [ 30/Apr/14 ]
The server supports the quiet CMD_GETQ_META call, which can be used on the client side to build a multi-getMeta call similar to the multiGet implementation.
Comment by Parag Agarwal [ 30/Apr/14 ]
Please point to a working example of this call.
Comment by Venu Uppalapati [ 30/Apr/14 ]
Parag, you can find some relevant information on queuing requests using the quiet call at https://code.google.com/p/memcached/wiki/BinaryProtocolRevamped#Get,_Get_Quietly,_Get_Key,_Get_Key_Quietly
Comment by Chiyoung Seo [ 30/Apr/14 ]
Changing the fix version to the feature backlog, given that the 3.0 feature-complete date has already passed and this is requested for the QE testing framework.




[MB-10993] Cluster Overview - Usable Free Space documentation misleading Created: 29/Apr/14  Updated: 29/Apr/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Jim Walker Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: documentation
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Issue relates to:
 http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#viewing-cluster-summary

I was working through a support case and trying to explain the cluster overview free space and usable free space.

The following statement is from our documentation. After a code review of ns_server I concluded that it is incorrect.

Usable Free Space:
The amount of usable space for storing information on disk. This figure shows the amount of space available on the configured path after non-Couchbase files have been taken into account.

The correct statement should be

Usable Free Space:
The amount of usable space for storing information on disk. This figure is calculated from the node with the least amount of available storage in the cluster; the final value is obtained by multiplying that amount by the number of nodes in the cluster.


This change matters because users need to understand why Usable Free Space can be less than Free Space. The cluster considers all nodes to be equal. If you have a "weak" node in the cluster, e.g. one with a small disk, then all cluster nodes have to keep their storage under the weaker node's limits; otherwise we could never fail over to the weak node, because it could not take on the job of a stronger node. When Usable Free Space is less than Free Space, the user may actually want to investigate why a node has less storage available.
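For illustration, the corrected definition boils down to a simple calculation. The following is a minimal sketch under that definition; the function name is hypothetical and this is not ns_server code:

#include <algorithm>
#include <cstdint>
#include <vector>

// Usable Free Space = (free space on the weakest node) * (number of nodes).
// Illustrative only; ns_server performs the equivalent calculation internally.
uint64_t usable_free_space(const std::vector<uint64_t> &per_node_free_bytes)
{
    if (per_node_free_bytes.empty()) {
        return 0;
    }
    uint64_t weakest = *std::min_element(per_node_free_bytes.begin(),
                                         per_node_free_bytes.end());
    return weakest * per_node_free_bytes.size();
}

For example, a 3-node cluster with 100 GB, 100 GB and 20 GB free would report 60 GB of Usable Free Space even though 220 GB is free in total.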




[MB-10944] Support of stale=false queries Created: 23/Apr/14  Updated: 18/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3, cbq-DP4
Fix Version/s: cbq-DP3
Security Level: Public

Type: Story Priority: Critical
Reporter: Pavel Paulau Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
stale=false queries in the view engine are not truly consistent, but they are critical for competitive benchmarking.

 Comments   
Comment by Gerald Sangudi [ 23/Apr/14 ]
Manik,

Please add a -stale parameter to the REST API for cbq-engine. The parameter should accept true, false, and update-after as values.

Please include this fix in the DP3 bugfix release.

Thanks.




[MB-10920] unable to start tuq if there are no buckets Created: 22/Apr/14  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
The node is initialized but has no buckets:
[root@kiwi-r116 tuqtng]# ./tuqtng -couchbase http://localhost:8091
10:26:56.520415 Info line disabled false
10:26:56.522641 FATAL: Unable to run server, err: Unable to access site http://localhost:8091, err: HTTP error 401 Unauthorized getting "http://localhost:8091/pools": -- main.main() at main.go:76




[MB-10898] [Doc] Password encryption between Client and Server for Admin ports credentials Created: 18/Apr/14  Updated: 29/May/14  Due: 23/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Anil Kumar Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: 3.0-Beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Flagged:
Release Note

 Description   
Password encryption between Client and Server for Admin ports credentials

http://www.couchbase.com/issues/browse/MB-10088
http://www.couchbase.com/issues/browse/MB-9198




[MB-10899] [Doc] Support immediate and eventual consistency level for indexes (stale=false) Created: 18/Apr/14  Updated: 29/May/14  Due: 23/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: 3.0-Beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Flagged:
Release Note

 Description   
Support immediate and eventual consistency level for indexes (stale=false)






[MB-10902] [Doc] Progress indicator for Warm-up Operation Created: 18/Apr/14  Updated: 29/May/14  Due: 23/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: 3.0-Beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Flagged:
Release Note

 Description   
Progress indicator for Warm-up Operation -

http://www.couchbase.com/issues/browse/MB-8989




[MB-10893] [Doc] XDCR - pause and resume Created: 18/Apr/14  Updated: 29/May/14  Due: 23/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: 3.0-Beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Flagged:
Release Note

 Description   
XDCR - pause and resume

https://www.couchbase.com/issues/browse/MB-5487




[MB-10834] update the license.txt for enterprise edition for 2.5.1 Created: 10/Apr/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Microsoft Word 2014-04-07 EE Free Clickthru Breif License.docx    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
document attached.

 Comments   
Comment by Phil Labee [ 10/Apr/14 ]
2.5.1 has already been shipped, so this file can't be included.

Is this for 3.0.0 release?
Comment by Phil Labee [ 10/Apr/14 ]
voltron commit: 8044c51ad7c5bc046f32095921f712234e74740b

uses the contents of the attached file to update LICENSE-enterprise.txt on the master branch.




[MB-10823] Log failed/successful login with source IP to detect brute force attacks Created: 10/Apr/14  Updated: 18/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Cihan Biyikoglu [ 18/Jun/14 ]
http://www.couchbase.com/issues/browse/MB-11463 for covering ports 11209 or 11211.




[MB-10821] optimize storage of larger binary object in couchbase Created: 10/Apr/14  Updated: 10/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10261] document a set of rules for how to handle various view requests Created: 19/Feb/14  Updated: 04/Apr/14

Status: In Progress
Project: Couchbase Server
Component/s: documentation, ns_server
Affects Version/s: 2.1.0, 2.2.0, 2.1.1, 2.5.0, 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Matt Ingenthron Assignee: Jeff Morris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-9915 capi layer is not sending view reques... Resolved
Triage: Untriaged

 Description   
With the initial 2.0 release, the understood contract between the client library and the cluster was that the client library would send requests and the cluster would handle execution of those requests and send responses. Over the development of the 2.0 series, to handle certain cases relating to new nodes, leaving nodes, and failures, that contract has changed.

At this point in time, there are a few situations we may encounter (and presumed rules):
- 200 response (good, just pass results back)
- 301/302 response (follow the "redirect", possibly trigger a configuration update)
- 404 response (possibly retry on another node... see derived rules)
- 5xx response (possibly retry on another node with a backoff... see derived rules)

See the discussion in MB-9915, where a 500 was encountered, and the rules which have been derived in Java:
https://github.com/couchbase/couchbase-java-client/blob/master/src/main/java/com/couchbase/client/http/HttpResponseCallback.java#L144
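As a rough sketch of how these derived rules map onto client behaviour (illustrative C++ only; the enum and function below are hypothetical and not part of any SDK API):

enum class ViewRetryAction {
    ReturnToCaller,   // 200: pass the result straight back
    FollowRedirect,   // 301/302: follow the redirect, possibly refresh the config
    RetryOtherNode,   // 404: the design doc may be available on another node
    RetryWithBackoff  // 5xx: transient server-side error, retry with a backoff
};

ViewRetryAction classify_view_response(int status_code)
{
    if (status_code == 200) {
        return ViewRetryAction::ReturnToCaller;
    }
    if (status_code == 301 || status_code == 302) {
        return ViewRetryAction::FollowRedirect;
    }
    if (status_code == 404) {
        return ViewRetryAction::RetryOtherNode;
    }
    if (status_code >= 500) {
        return ViewRetryAction::RetryWithBackoff;
    }
    // Any other status is treated as a genuine client error and surfaced unchanged.
    return ViewRetryAction::ReturnToCaller;
}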

This bug is to document the set of rules for clients, which should become part of this doc:
http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#querying-using-the-rest-api

 Comments   
Comment by Matt Ingenthron [ 19/Feb/14 ]
I've assigned this to Jeff initially since he needs to get to a set of rules for a particular user's needs. He'll draft this up and send it out for review. Once reviewed, then the docs team can incorporate it appropriately.
Comment by Jeff Morris [ 20/Feb/14 ]
Here is my first draft: https://docs.google.com/document/d/1GhRxvPb7xakLL4g00FUi6fhZjiDaP33DTJZW7wfSxrI/edit#

I used the rules provided in the Java HttpResponseCallback.java class as baseline.
Comment by Jeff Morris [ 27/Feb/14 ]
Patch set ticket: https://www.couchbase.com/issues/browse/NCBC-407
Comment by Jeff Morris [ 27/Feb/14 ]
Patchset: http://review.couchbase.org/#/c/34007/




[MB-10084] Sub-Task: Changes required for Data Encryption in Client SDK's Created: 30/Jan/14  Updated: 28/May/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Andrei Baranouski
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
depends on JCBC-441 add SSL support in support of Couchba... Open
depends on CCBC-344 add support for SSL to libcouchbase i... Resolved
depends on NCBC-424 Add SSL support in support of Couchba... Resolved

 Description   
Changes required for Data Encryption in Client SDK's

 Comments   
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
wanted to make sure we agree this will be in 3.0. Matt any concerns?
thanks
Comment by Matt Ingenthron [ 20/Mar/14 ]
This should be closed in favor of the specific project issues. That said, the description is a bit fuzzy. Is this SSL support for memcached && views && any cluster management?

Please clarify and then we can open specific issues. It'd be good to have a link to functional requirements.
Comment by Matt Ingenthron [ 20/Mar/14 ]
And Cihan: it can't be "in 3.0", unless you mean concurrent release or release prior to 3.0 GA. Is that what you mean? I'd actually aim to have this feature support in SDKs prior to 3.0's release and we are working on it right now, though it has some other dependencies. See CCBC-344, for example.
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Thanks Matt. I meant the 3.0-paired client SDK release, so prior to or shortly after is all good for me.
Context - we are doing a pass to clean up JIRA and would like to button up what's in and out for 3.0.
Comment by Cihan Biyikoglu [ 24/Mar/14 ]
Matt, is there a client-side reference implementation you did for this one? It would be good to pass that on to the test folks for initial validation until you integrate completely, so no regressions creep up while we march to GA.
Thanks
Comment by Matt Ingenthron [ 24/Mar/14 ]
We did verification with a non-mainline client since that was the quickest way to do so and have provided that to QE. Also, Brett filed a bug around HTTPS with ns-server and streaming configuration replies. See MB-10519.

We'll do a mainline client with libcouchbase and the Python client as soon as its dependency for handling packet IO is done. This is under CCBC-298 and CCBC-301, among others.




[MB-10003] [Port-configurability] Non-root instances and multiple sudo instances in a box cannot be 'offline' upgraded Created: 24/Jan/14  Updated: 27/Mar/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Unix/Linux


 Description   
Scenario
------------
As of today, we do not support offline 'upgrade' per se for packages installed by non-root/sudo users. Upgrades are usually handled by package managers. Since these are absent for non-root users and rpm cannot handle more than a single package upgrade (if there are many instances running), offline upgrades are not supported (confirmed with Bin).

ALL non-root installations are affected by this limitation. Although a single instance running on a box under a sudo user can be offline upgraded, this cannot be extended to more than one such instance.

This is important

Workaround
-----------------
- Online upgrade (swap with nodes running latest build, take old nodes down and do clean install)
- Backup data and restore after fresh install (cbbackup and cbrestore)

Note : At this point, these are mere suggestions and both these workarounds haven't been tested yet.




[MB-9982] XDCR should be incremental on topology changes Created: 22/Jan/14  Updated: 05/May/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.2.0, 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dipti Borkar Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Currently, XDCR checkpoints are not replicated to the replica nodes. This means that on any topology change, XDCR needs to re-check whether each item is needed on the other side. While it may not resend the data, re-checking a large number of items is quite expensive.

We need to replicate checkpoints so that XDCR is incremental on topology changes, just as it is without topology changes.


 Comments   
Comment by Cihan Biyikoglu [ 28/Jan/14 ]
Hi Junyi, does UPR help with being more resume-able in XDCR?
Comment by Junyi Xie (Inactive) [ 28/Jan/14 ]
It should be helpful. But we may not have cycles to do that in 3.0
Comment by Dipti Borkar [ 29/Jan/14 ]
We have to consider this for 3.0; this is a major problem.

Also, the backlog is a bottomless pit. Let's not use it.




[MB-10146] Document editor overwrites precision of long numbers Created: 06/Feb/14  Updated: 09/May/14

Status: Reopened
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Perry Krug Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged

 Description   
Just tested this out; not sure what diagnostics to capture, so please let me know.

Simple test case:
-Create new document via document editor in UI
-Document contents are:
{"id": 18446744072866779556}
-As soon as you save, the above number is rewritten to:
{
  "id": 18446744072866780000
}
-The same effect is had if you edit a document that was inserted with the above "long" number

 Comments   
Comment by Aaron Miller (Inactive) [ 06/Feb/14 ]
It's worth noting that views will always suffer from this, as it is a limitation of JavaScript in general. Many JSON libraries have this behavior as well (even though they don't *have* to).
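To make the limitation concrete, here is a small standalone C++ demo (not Couchbase code) showing that an IEEE-754 double, which is what JavaScript and many JSON libraries use for numbers, cannot hold the 20-digit id from the test case:

#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t original = 18446744072866779556ULL;       // id from the test case
    const double as_double = static_cast<double>(original);  // what a JS engine stores
    const uint64_t round_trip = static_cast<uint64_t>(as_double);

    // The values differ because a double has only 53 bits of mantissa,
    // so integers this large cannot all be represented exactly.
    std::printf("original:   %llu\n", (unsigned long long)original);
    std::printf("round trip: %llu\n", (unsigned long long)round_trip);
    return 0;
}

The round-tripped value is no longer the original integer, which is why the document editor shows 18446744072866780000 after saving.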
Comment by Aleksey Kondratenko [ 11/Apr/14 ]
Cannot fix it. Just closing. If you want to reopen, please pass it to somebody responsible for the overall design.
Comment by Perry Krug [ 11/Apr/14 ]
Reopening and assigning to docs; we need this to be release-noted, IMO.
Comment by Ruth Harris [ 14/Apr/14 ]
Reassigning to Anil. He makes the call on what we put in the release notes for known and fixed issues.
Comment by Anil Kumar [ 09/May/14 ]
Ruth - Let's release note this for 3.0.




[MB-11346] Audit logs for User/App actions Created: 06/Jun/14  Updated: 07/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Couchbase Server should be able to produce audit logs for all User/App actions, such as login/logout events, mutations, and other bucket and security changes.






[MB-11329] uninstall couchbase server 3.0.0 on windows did not delete files completely Created: 05/Jun/14  Updated: 17/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64-bit

Attachments: PNG File ss_2014-06-05_at_11.18.37 AM.png    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Install Couchbase Server 3.0.0-779 on Windows Server from
this link http://factory.hq.couchbase.com:8080/job/cs_300_win6408/186/artifact/voltron/couchbase-server-enterprise-3.0.0-779.setup.exe

Then uninstall couchbase server.
When the uninstall completes, many files are left in c:/Program Files/Couchbase/Server/var/lib/couchbase

 Comments   
Comment by Bin Cui [ 17/Jun/14 ]
It essentially means that the uninstallation doesn't proceed correctly, and I think it is related to the erlang process still running after uninstallation. We need to revisit the Windows build.




[MB-11328] old erlang processes were still running after uninstall couchbase server 3.0.0 on windows Created: 05/Jun/14  Updated: 17/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64-bit

Attachments: PNG File ss_2014-06-05_at_10.48.16 AM.png    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Install Couchbase Server 3.0.0-779 on Windows Server from
this link http://factory.hq.couchbase.com:8080/job/cs_300_win6408/186/artifact/voltron/couchbase-server-enterprise-3.0.0-779.setup.exe

Then uninstall Couchbase Server. In the Windows Task Manager, erlang processes are still running.
These leftover erlang processes cause the UI to fail on the next Couchbase Server installation on Windows.

 Comments   
Comment by Bin Cui [ 09/Jun/14 ]
Most likely, the erlang process gets hung and does not exit on request from the service control manager when installation happens. Do we have other erlang issues during the run?
Comment by Thuan Nguyen [ 09/Jun/14 ]
Yes, we have issues during the run since there are some extra erlang processes running.
Comment by Sriram Melkote [ 10/Jun/14 ]
Bin, can we have the installer run:

taskkill.exe /im beam.smp /f
taskkill.exe /im epmd.exe /f
taskkill.exe /im memcached.exe /f

After stopping service and before beginning uninstall? The epmd is the important one, others are just to be safe.
Comment by Bin Cui [ 10/Jun/14 ]
This is definitely a band-aid kind of fix and it may mask the more fatal issue, i.e. a corrupted image in the erlang process. The installer can double-check and kill these unresponsive processes, but we still need to dig deeper to find the root cause.

Since we register erlang as a service, all these processes are under the control of erlang management. Only corrupted processes will not respond to the parent process.




[MB-11314] Enhaced Authentication model for Couchbase Server for Administrators, Users and Applications Created: 04/Jun/14  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Couchbase Server will add support for authentication using various techniques, for example Kerberos, LDAP, etc.







[MB-11282] Separate stats for internal memory allocation (application vs. data) Created: 02/Jun/14  Updated: 02/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Story Priority: Critical
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
AFAIK we currently track allocations for data and the application together.

But sometimes the application (memcached / ep-engine) overhead is huge and cannot be ignored.




[MB-11250] Go-Coucbase: Provide DML APIs using CAS Created: 29/May/14  Updated: 18/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-11247] Go-Couchbase: Use password to connect to SASL buckets Created: 29/May/14  Updated: 19/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Gerald Sangudi [ 19/Jun/14 ]
https://github.com/couchbaselabs/query/blob/master/docs/n1ql-authentication.md




[MB-11214] ORDER BY clause should require LIMIT clause Created: 27/May/14  Updated: 27/May/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-11208] stats.org should be installed Created: 27/May/14  Updated: 27/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: techdebt-backlog
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
stats.org contains a description of the stats we're sending from ep-engine. It could be useful for people.

 Comments   
Comment by Matt Ingenthron [ 27/May/14 ]
If it's "useful" shouldn't this be part of official documentation? I've often thought it should be. There's probably a duplicate here somewhere.

I also think the stats need stability labels applied as people may rely on stats when building their own integration/monitoring tools. COMMITTED, UNCOMMITTED, VOLATILE, etc. would be useful for the stats.

Relatedly, someone should document deprecation of TAP stats for 3.0.




[MB-11195] Support binary collation for views Created: 23/May/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
N1QL would benefit significantly if we could allow memcmp() collation for the views it creates - so much so that we should consider this for a minor release after 3.0 so it can be available for the N1QL beta.




[MB-11192] Snooze for 1 second during the backfill task is causing significant pauses during backup Created: 23/May/14  Updated: 24/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Critical
Reporter: Daniel Owen Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: customer, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: cbbackup --single-node
Data all memory resident.

Attachments: PNG File dropout-screenshot.png     PNG File IOThroughput-magnified.png     PNG File ThroughputGraphfromlocalhostport11210.png    
Issue Links:
Duplicate

 Description   
When performing a backup, the cbbackup process repeatedly stalls waiting on the socket for data. This can be seen in the uploaded graphs. The uploaded TCPdump output also shows the delay.

Setting the backfill/tap queue snooze to always be zero makes the issue go away,
i.e. modifying the sleep to zero in ep-engine/src/ep.cc, function VBCBAdaptor::VBCBAdaptor:

VBCBAdaptor::VBCBAdaptor(EventuallyPersistentStore *s,
                         shared_ptr<VBucketVisitor> v,
                         const char *l, double sleep) :
    store(s), visitor(v), label(l), sleepTime(sleep), currentvb(0)
{
    sleepTime = 0.0; // workaround: override the configured snooze so the adaptor never sleeps
    ....

Description of the cause is provided by Abhinav:

We back off (snooze) for 1 second during the backfill task when the size of the backfill/tap queue crosses its limit (set to 5000 as part of the initial configuration); we snooze for a second to wait for the items in the queue to drain.
What's happening here is that, since all the items are in memory, this queue fills up very quickly, so the queue size keeps hitting the limit and the task keeps snoozing.
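A minimal sketch of the throttling behaviour described above (illustrative only, not the actual ep-engine code; the constant and function names are made up):

#include <cstddef>

// When the backfill/tap queue exceeds its configured limit, the backfill task
// snoozes for a full second to let the queue drain before visiting more items.
const size_t BACKFILL_QUEUE_LIMIT = 5000; // from the initial configuration
const double BACKFILL_SNOOZE_SECS = 1.0;

double next_backfill_snooze(size_t queue_size)
{
    // With a fully memory-resident data set the queue fills almost instantly,
    // so this branch (and the 1-second pause) becomes the common case.
    return queue_size >= BACKFILL_QUEUE_LIMIT ? BACKFILL_SNOOZE_SECS : 0.0;
}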




[MB-11188] RemoteMachineShellConnection.extract_remote_info doesn't work on OSX Mavericks Created: 22/May/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: test-execution
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Artem Stemkovski Assignee: Parag Agarwal
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
2 problems:

1:
executing sw_vers on ssh returns:
/Users/artem/.bashrc: line 2: brew: command not found

2:
workbook:ns_server artem$ hostname -d
hostname: illegal option -- d
usage: hostname [-fs] [name-of-host]




[MB-11171] mem_used stat exceeds the bucket memory quota in extremely heavy DGM and highly overloaded cluster Created: 20/May/14  Updated: 21/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.1.0, 2.2.0, 2.1.1, 2.5.0, 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This issue was reported by one of our customers. Their cluster was in extremely heavy DGM (resident ratio near zero in both active and replica vbuckets) and was highly overloaded when the memory bloating issue happened.

From the logs, we saw that the number of memcached connections spiked from 300 to 3K during the period with the memory issue. However, we have not yet been able to correlate the increased number of connections with the memory bloating; we plan to keep investigating by running similar workload tests.





[MB-11154] Document proper way to detect a flush success from the SDK Created: 19/May/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Critical
Reporter: Michael Nitschinger Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Hi folks,

while implementing the 2.0 SDK for Java, I needed flush() again and thought let's do it right this time. Here is how the old SDK did it, more or less in a hacky way:

- Do the HTTP flush command against the bucket
- Then poll for ep_degraded_mode

Now I talked to Trond and he said polling those stats is just guessing, since the only authority for this can be ns_server. I guess the only reason to poll is that flush can take a long time and the HTTP request can time out before it completes?

We need to come up with:

1) A documented way to do this reliably for the 2.* series so we can provide good support for it.
2) If that is not good enough or has edge cases, something better for 3.*

I'm starting with Alk here since I guess ns_server has the coordination of all that.
Cheers,
Michael

 Comments   
Comment by Trond Norbye [ 19/May/14 ]
Polling for such a status change in ep-engine will never be "safe". It may enter and exit degraded mode between any two poll requests. You would have to have a stat with some sort of uuid in ep-engine in order to implement this. Given that this is a cluster-wide operation, the only component that knows the overall status of the operation is ns_server.

I don't think it is a good idea to spread ep-engine's internal logic to the clients (since that may make it hard to change the implementation logic inside the engine).
Comment by Matt Ingenthron [ 19/May/14 ]
A couple of high-level points (much of this has been discussed before in email):
- Since this is a cluster, the thing managing the cluster is the place to ask for the 'flush', that is ns-server as Trond mentions
- With REST, any long running operations are supposed to return an HTTP 201 with a location to check status on that operation. This is something we really need for many things beyond flush(). For instance, bucket create... how should a client (doesn't matter if it's an SDK) know when that operation is done?
- Connected clients (those who did not request the flush) should have very simple interaction with the cluster (to Trond's other point). If it's a flush, during the duration of the flushing activity there should be TMPFAIL replies and we should make the flush as low latency as possible. I know it can't be as fast as memcached, but I also know it can be pretty fast.

Mike: I assume there must be some other reason you're asking about this now? Related to UPR work?
Comment by Michael Nitschinger [ 19/May/14 ]
Matt,

I just asked because I wanted to implement flush in the new JVM core so that I can support my own unit tests properly. I then dug into the dusty corners of the old SDK and wondered if there is a better way than how we do it right now. And also to bring it up so we get better semantics moving forward.
Comment by Aleksey Kondratenko [ 19/May/14 ]
Unfortunately there is no clean and bullet-proof way of doing it. Here's what I could come up with which should be usable for tests:

* upload some "marker doc". Say empty doc with key __flush_marker

* send flush via REST API.

* if it returned 200 then you're done

* if it returned 201 poll for __flush_marker until you get a miss (note: not a temp error and not a hit, but a miss)

* if it returned anything else assume that request failed and restart by sending another flush request
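A minimal C++ sketch of this procedure; http_post(), kv_upsert() and kv_get() are hypothetical stand-ins for an SDK's REST and key-value operations, not real API calls:

#include <chrono>
#include <string>
#include <thread>

enum class KvStatus { Hit, Miss, TempFail };

int http_post(const std::string &path);   // hypothetical: returns the HTTP status code
void kv_upsert(const std::string &key);   // hypothetical: stores an empty doc
KvStatus kv_get(const std::string &key);  // hypothetical: reads a doc

bool flush_and_wait(const std::string &bucket)
{
    const std::string marker = "__flush_marker";
    kv_upsert(marker);                     // 1. upload the marker doc

    for (;;) {
        // 2. send flush via the REST API
        int status = http_post("/pools/default/buckets/" + bucket + "/controller/doFlush");
        if (status == 200) {
            return true;                   // flush already finished
        }
        if (status != 201) {
            continue;                      // anything else: assume failure, resend the flush
        }
        // poll until the marker is a genuine miss (not a hit, not a temp error)
        while (kv_get(marker) != KvStatus::Miss) {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        return true;
    }
}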
Comment by Brett Lawson [ 18/Jun/14 ]
@Alk: Will this method of detecting a flush degrade on larger clusters, where many nodes may not be done flushing even if the doc has been flushed from a particular node?
Comment by Aleksey Kondratenko [ 18/Jun/14 ]
No. Flush is done in a 2PC fashion. If you stop seeing the marker doc in some vbucket, then you know that the other vbuckets are already rejecting ops or are already done flushing.
Comment by Aleksey Kondratenko [ 18/Jun/14 ]
Let me clarify. Once you start seeing the _absence_ of the marker doc, then as pointed out above the flush is guaranteed to be done. "Done" means that you may still see tmperrors for some time afterwards, but you will not see any docs from before the flush.




[MB-11102] extended documentation about stats flowing out of CBSTATS and the correlation between them Created: 12/May/14  Updated: 12/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Update the documentation about the stats flowing out of CBSTATS and the correlation between them. We need this to be able to accurately predict capacity and other bottlenecks, as well as detect trends.




[MB-11100] Ability to shutoff disk persistence for Couchbase bucket and still have replication, failover and other Couchbase bucket features Created: 12/May/14  Updated: 13/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by MB-8714 introduce vbucket based cache bucket ... Resolved

 Description   
Ability to shutoff disk persistence for Couchbase bucket and still have replication, failover and other Couchbase bucket features.




[MB-11101] supported go SDK for couchbase server Created: 12/May/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Matt Ingenthron
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
go client




[MB-11098] Ability to set block size written to storage for better alignment with SSDs and/or HDDs for better throughput performance Created: 12/May/14  Updated: 12/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Ability to set the block size written to storage for better alignment with SSDs and/or HDDs, for better throughput performance.




[MB-11084] Build python snappy module on windows Created: 09/May/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Critical
Reporter: Bin Cui Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows


 Description   
To deal with the compressed datatype, we need Python support for snappy. We need to build https://github.com/andrix/python-snappy on Windows and make it part of the package.

 Comments   
Comment by Bin Cui [ 09/May/14 ]
I implemented the related logic for CentOS 5.x, 6.x and Ubuntu. Please look at http://review.couchbase.org/#/c/36902/
Comment by Trond Norbye [ 16/Jun/14 ]
I've updated the Windows build depot with the modules built for Python 2.7.6.

Please populate the depot to the builder and reassign the bug to Bin for verification.




[MB-10789] Bloom Filter based optimization to reduce the I/O overhead Created: 07/Apr/14  Updated: 07/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
A bloom filter can be considered an optimization to reduce the disk IO overhead. Basically, we maintain a separate bloom filter per vbucket database file and rebuild the bloom filter (e.g., increasing the filter size to reduce the false positive error rate) as part of vbucket database compaction.

As we know the number of items in a vBucket database file, we can determine the number of hash functions and the size of the bloom filter needed to achieve the desired false positive error rate. Note that Murmur hash is widely used in Hadoop and Cassandra because it is much faster than MD5 and Jenkins. It is well known that fewer than 10 bits per element are required for a 1% false positive probability, independent of the number of elements in the set.

We expect that having a bloom filter will enhance both XDCR and full-ejection cache management performance at the expense of the filter's memory overhead.
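For reference, the standard Bloom filter sizing formulas mentioned above can be computed as follows (a standalone sketch, not ep-engine code):

#include <cmath>
#include <cstdint>
#include <cstdio>

struct BloomParams {
    uint64_t bits;    // filter size m
    uint32_t hashes;  // number of hash functions k
};

// For n items and a target false-positive rate p:
//   m = ceil(-n * ln(p) / (ln 2)^2),  k = round((m / n) * ln 2)
BloomParams size_bloom_filter(uint64_t num_items, double false_positive_rate)
{
    double n = static_cast<double>(num_items);
    double m = std::ceil(-n * std::log(false_positive_rate) /
                         (std::log(2.0) * std::log(2.0)));
    double k = std::round((m / n) * std::log(2.0));
    return { static_cast<uint64_t>(m), static_cast<uint32_t>(k) };
}

int main()
{
    // e.g. 1M items in a vbucket file at a 1% false-positive rate
    BloomParams p = size_bloom_filter(1000000, 0.01);
    std::printf("bits per item: %.2f, hash functions: %u\n",
                static_cast<double>(p.bits) / 1000000.0, p.hashes);
    return 0;
}

This works out to roughly 9.6 bits per item and 7 hash functions, consistent with the "fewer than 10 bits per element for a 1% false positive probability" figure above.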






[MB-10790] Transaction log support for individual mutations Created: 07/Apr/14  Updated: 07/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
There is always a time window during which we can lose a given mutation from an application, because we do both persistence and replication asynchronously. To address this limitation, we need to consider supporting a transaction (commit) log for individual mutations from applications, and later extend it to support full transactions on multiple documents across different nodes.
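For illustration only (not ep-engine code): the core idea is that a mutation is appended to a commit log and made durable before it is acknowledged, closing the window described above. A minimal sketch; the commitlog_append name is hypothetical:

#include <unistd.h>

/* Hypothetical sketch: append one mutation record to an already-open commit
 * log fd and fsync it. The client ack would only be sent after this returns 0;
 * a real implementation would batch records and group-commit rather than
 * calling fsync per mutation. */
static int commitlog_append(int fd, const void *rec, size_t nrec)
{
    if (write(fd, rec, nrec) != (ssize_t)nrec) {
        return -1;              /* short write: treat as failure */
    }
    return fsync(fd);
}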


 Comments   
Comment by Matt Ingenthron [ 07/Apr/14 ]
+1

There were some earlier thoughts on how to accomplish this that I can share if it'd be useful.
Comment by Chiyoung Seo [ 07/Apr/14 ]
Thanks Matt. Please feel free to share them with me. We can schedule a separate meeting if necessary.
Comment by Matt Ingenthron [ 07/Apr/14 ]
Sure, this week is bad, but want to grab 30 mins next week?
Comment by Chiyoung Seo [ 07/Apr/14 ]
Sure, I will then schedule a meeting sometime next week. Thanks!




[MB-10788] Support synchronous replication through UPR for mutations (SET, DEL, etc.) from applications Created: 07/Apr/14  Updated: 08/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
There is growing demand for synchronous replication support for mutation events from applications. As a starting point, we need to investigate how this can be supported for replication between master and slave nodes in the same cluster (e.g., quorum based).
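For illustration of the "quorum based" idea only (not a design): a mutation would be acknowledged to the client only once a majority of the copies of the vBucket have confirmed it. A trivial sketch with hypothetical names:

/* Hypothetical sketch: with num_copies total copies of a vBucket (active plus
 * replicas), acknowledge the mutation once a majority has confirmed it. */
static int quorum_reached(int acks_received, int num_copies)
{
    int quorum = (num_copies / 2) + 1;   /* e.g. 2 of 3 copies */
    return acks_received >= quorum;
}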


 Comments   
Comment by Cihan Biyikoglu [ 08/Apr/14 ]
There are a number of reasons why this comes up regularly with customers:
- they see replication as a better way to provide durability than local-node disk persistence.
- it also allows replica reads without compromising consistency.




[MB-10767] DOC: Misc - DITA conversion Created: 04/Apr/14  Updated: 04/Apr/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Ruth Harris Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10718] Change Capture API and 3rd party consumable Created: 01/Apr/14  Updated: 02/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10716] SSD IO throughput optimizations Created: 01/Apr/14  Updated: 01/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
forestdb work




[MB-10662] _all_docs is no longer supported in 3.0 Created: 27/Mar/14  Updated: 01/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-10649 _all_docs view queries fails with err... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
As of 3.0, view engine will no longer support the special predefined view, _all_docs.

It was not a published feature, but as it has been around for a long time, it is possible it was actually utilized in some setups.

We should document that _all_docs queries will not work in 3.0

 Comments   
Comment by Cihan Biyikoglu [ 27/Mar/14 ]
Thanks. Are there internal tools depending on this? Do you know if we have deprecated this in the past? I realize it isn't a supported API, but I want to make sure we keep the door open for feedback during beta from large customers, etc.
Comment by Perry Krug [ 28/Mar/14 ]
We have a few (very few) customers who have used this. They've known it is unsupported...but that doesn't ever really stop anyone if it works for them.

Do we have a doc describing what the proposed replacement will look like and will that be available for 3.0?
Comment by Ruth Harris [ 01/May/14 ]
_all_docs is not mentioned anywhere in the 2.2+ documentation. Not sure how to handle this. It's not deprecated because it was never intended for use.
Comment by Perry Krug [ 01/May/14 ]
I think at the very least a prominent release note is appropriate.




[MB-10651] The guide for installing with user-defined ports doesn't work for the REST port change Created: 26/Mar/14  Updated: 17/Jun/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Larry Liu Assignee: Aruna Piravi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
http://docs.couchbase.com/couchbase-manual-2.5/cb-install/#install-user-defined-ports

I followed the instructions to change the admin port (REST port) by appending the following to the /opt/couchbase/etc/couchbase/static_config file:
{rest_port, 9000}.

[root@localhost bin]# netstat -an| grep 9000
[root@localhost bin]# netstat -an| grep :8091
tcp 0 0 0.0.0.0:8091 0.0.0.0:* LISTEN

logs:
https://s3.amazonaws.com/customers.couchbase.com/larry/output.zip

Larry



 Comments   
Comment by Larry Liu [ 26/Mar/14 ]
The log files show that the change was picked up by the server:

[ns_server:info,2014-03-26T19:13:24.063,nonode@nohost:<0.58.0>:ns_server:log_pending:30]Static config terms:
[{error_logger_mf_dir,"/opt/couchbase/var/lib/couchbase/logs"},
 {error_logger_mf_maxbytes,10485760},
 {error_logger_mf_maxfiles,20},
 {path_config_bindir,"/opt/couchbase/bin"},
 {path_config_etcdir,"/opt/couchbase/etc/couchbase"},
 {path_config_libdir,"/opt/couchbase/lib"},
 {path_config_datadir,"/opt/couchbase/var/lib/couchbase"},
 {path_config_tmpdir,"/opt/couchbase/var/lib/couchbase/tmp"},
 {nodefile,"/opt/couchbase/var/lib/couchbase/couchbase-server.node"},
 {loglevel_default,debug},
 {loglevel_couchdb,info},
 {loglevel_ns_server,debug},
 {loglevel_error_logger,debug},
 {loglevel_user,debug},
 {loglevel_menelaus,debug},
 {loglevel_ns_doctor,debug},
 {loglevel_stats,debug},
 {loglevel_rebalance,debug},
 {loglevel_cluster,debug},
 {loglevel_views,debug},
 {loglevel_mapreduce_errors,debug},
 {loglevel_xdcr,debug},
 {rest_port,9000}]
Comment by Aleksey Kondratenko [ 17/Apr/14 ]
This is because the rest_port entry in static_config is only taken into account on a fresh install.

There's a way to install our package without starting the server first, and that has to be documented. I don't know who owns working with the docs people.
Comment by Anil Kumar [ 09/May/14 ]
Alk - Before it goes to documentation, we need to test and verify the instructions. Can you provide those instructions and assign this ticket to Aruna to test?
Comment by Anil Kumar [ 03/Jun/14 ]
Alk - can you provide those instructions and assign this ticket to Aruna to test it.
Comment by Aleksey Kondratenko [ 04/Jun/14 ]
The instructions fail to mention that rest_port must be changed before config.dat is written, and config.dat is initialized on the first server start.

There's a way to install the server without starting it.

But here's what I managed to do:

# dpkg -i ~/Desktop/forReview/couchbase-server-enterprise_ubuntu_1204_x86_2.5.1-1086-rel.deb

# /etc/init.d/couchbase-server stop

# rm /opt/couchbase/var/lib/couchbase/config/config.dat

# emacs /opt/couchbase/etc/couchbase/static_config

# /etc/init.d/couchbase-server start

I.e., I stopped the service, removed config.dat, edited static_config, then started it back up and found the REST port to be updated.
Comment by Anil Kumar [ 04/Jun/14 ]
Thanks Alk. Assigning this to Aruna for verification and later please assign this ticket to Documentation (Ruth).




[MB-10531] No longer necessary to wait for persistence to issue stale=false query Created: 21/Mar/14  Updated: 25/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Sriram Melkote Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Matt pointed out that in the past, we had to wait for an item to persist to disk before issuing a stale=false query to get correct results. In 3.0, this is no longer necessary. One can issue a stale=false view query at any time, and the results will reflect all changes made before the query was issued. This task is a placeholder to update the 3.0 docs to remove the unnecessary step of waiting for persistence.
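For reference, a rough libcouchbase 2.x-style sketch of issuing such a query over the view HTTP interface; the design document and view names are made up, and the point is only that no wait-for-persistence step precedes the query:

#include <libcouchbase/couchbase.h>
#include <stdio.h>
#include <string.h>

/* Sketch only: the "_design/dd/_view/by_type" path is hypothetical. */
static void http_done(lcb_http_request_t req, lcb_t instance, const void *cookie,
                      lcb_error_t err, const lcb_http_resp_t *resp)
{
    (void)req; (void)instance; (void)cookie;
    if (err == LCB_SUCCESS) {
        fwrite(resp->v.v0.bytes, 1, resp->v.v0.nbytes, stdout);
    }
}

static void query_stale_false(lcb_t instance)
{
    lcb_http_cmd_t cmd;
    lcb_http_request_t req;
    const char *path = "_design/dd/_view/by_type?stale=false";

    memset(&cmd, 0, sizeof cmd);
    cmd.v.v0.path = path;
    cmd.v.v0.npath = strlen(path);
    cmd.v.v0.method = LCB_HTTP_METHOD_GET;
    cmd.v.v0.content_type = "application/json";

    lcb_set_http_complete_callback(instance, http_done);
    /* In 3.0, no observe/persistence wait is needed: mutations already
     * acknowledged to this client are reflected in the result. */
    if (lcb_make_http_request(instance, NULL, LCB_HTTP_TYPE_VIEW, &cmd, &req)
            == LCB_SUCCESS) {
        lcb_wait(instance);
    }
}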

 Comments   
Comment by Matt Ingenthron [ 21/Mar/14 ]
Correct. Thanks for making sure this is raised, Siri. While I'm thinking of it, two points need to be in there:
1) if you have older code, you will need to change it to take advantage of the semantic change to the query
2) application developers still need to be a bit careful to ensure any modifications being made aren't still in-flight async operations -- they'll have to wait for the responses before issuing the stale=false query
Comment by Anil Kumar [ 25/Mar/14 ]
This is for 3.0 documentation.
Comment by Sriram Melkote [ 25/Mar/14 ]
Not an improvement. This is a task.




[MB-10511] Feature request for supporting rolling downgrades Created: 19/Mar/14  Updated: 11/Apr/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Abhishek Singh Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to

 Description   
Some customers are interested in Couchbase supporting rolling downgrades. Currently, we can't add 2.2 nodes to a cluster that has all nodes on 2.5.




[MB-10512] Update documentation to convey we don't support rolling downgrades Created: 19/Mar/14  Updated: 27/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Abhishek Singh Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Update the documentation to convey that we don't support rolling downgrades to 2.2 once all nodes are running on 2.5.




[MB-10469] Support Couchbase Server on SuSE linux platform Created: 14/Mar/14  Updated: 17/Apr/14

Status: Open
Project: Couchbase Server
Component/s: build, installer
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: SuSE linux platform

Issue Links:
Duplicate

 Description   
Add support for SuSE Linux platform




[MB-10431] Removed ep_expiry_window stat/engine_parameter Created: 11/Mar/14  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Mike Wiederhold Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This parameter is no longer needed since we require everything to be persisted. In the past it was used to skip persistence on items that would be expiring very soon.




[MB-10432] Removed ep_max_txn_size stat/engine_parameter Created: 11/Mar/14  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Mike Wiederhold Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This value is no longer used in the server. Please note that you need to update the documentation for cbepctl, since this setting could previously be set with that script.




[MB-10430] Add AWS AMI documentation to Installation and Upgrade Guide Created: 11/Mar/14  Updated: 25/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: