[MB-11846] Compiling breakdancer test case exceeds available memory Created: 29/Jul/14  Updated: 12/Aug/14  Due: 30/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Chris Hillery Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
1. With memcached change 4bb252a2a7d9a369c80f8db71b3b5dc1c9f47eb9, cc1 on ubuntu-1204 quickly uses up 100% of the available memory (4GB RAM, 512MB swap) and crashes with an internal error.

2. Without Trond's change, cc1 compiles fine and never uses more than 12% of memory, running on the same hardware.

 Comments   
Comment by Chris Hillery [ 29/Jul/14 ]
Ok, weird fact - on further investigation, it appears that this is NOT happening on the production build server, which is an identically-configured VM. It only appears to be happening on the commit validation server ci03. I'm going to temporarily disable that machine so the next make-simple-github-tap test runs on a different ci server and see if it is unique to ci03. If it is I will lower the priority of the bug. I'd still appreciate some help in understanding what's going on either way.
Comment by Trond Norbye [ 30/Jul/14 ]
Please verify that the two builders have the same patch level so that we're comparing apples with apples.

It does bring up another interesting topic: should our builders just use the compiler provided with the installation, or should we have a reference compiler we're using to build our code? It does seem like a bad idea having to support a ton of various compiler revisions (including the fact that they support different levels of C++11 that we have to work around).
Comment by Chris Hillery [ 31/Jul/14 ]
This is now occurring on other CI build servers in other tests - http://www.couchbase.com/issues/browse/CBD-1423

I am bumping this back to Test Blocker and I will revert the change as a work-around for now.
Comment by Chris Hillery [ 31/Jul/14 ]
Partial revert committed to memcached master: http://review.couchbase.org/#/c/40152/ and 3.0: http://review.couchbase.org/#/c/40153/
Comment by Trond Norbye [ 01/Aug/14 ]
That review in memcached should NEVER have been pushed through. Its subject line is too long
Comment by Chris Hillery [ 01/Aug/14 ]
If there's a documented standard out there for commit messages, my apologies; it was never revealed to me.
Comment by Trond Norbye [ 01/Aug/14 ]
When it doesn't fit within a terminal window there is a problem. It is way better to use multiple lines.

In addition, I'm not happy with the fix. Instead of deleting the line it should have been checking for an environment variable so that people could explicitly disable it. This is why we have review cycles.
Comment by Chris Hillery [ 01/Aug/14 ]
I don't think I want to get into style arguments. If there's a standard I'll use it. In the meantime I'll try to keep things to 72-character lines.

As to the content of the change, it was not intended to be a "fix"; it was a simple revert of a change that was provably breaking other jobs. I returned the code to its previous state, nothing more or less. And especially given the time crunch of the beta (which is supposed to be built tomorrow), waiting for a code review on a reversion is not in the cards.
Comment by Trond Norbye [ 01/Aug/14 ]
The normal way of doing a revert is to use git revert (which, as an extra bonus, records the reverted commit in the commit message).
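For illustration, a minimal sketch of that flow, using the memcached change hash from the description and the usual Gerrit refs/for push convention (the exact branch and remote are assumptions):

    # revert the offending commit; git records the reverted commit's hash
    # and subject in the generated commit message
    git revert 4bb252a2a7d9a369c80f8db71b3b5dc1c9f47eb9
    # then push the revert through the normal review flow
    git push origin HEAD:refs/for/master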
Comment by Trond Norbye [ 01/Aug/14 ]
http://review.couchbase.org/#/c/40165/
Comment by Chris Hillery [ 01/Aug/14 ]
1. Your fix is not correct, because simply adding -D to cmake won't cause any preprocessor defines to be created. You need to have some CONFIGURE_FILE() or similar to create a config.h using #cmakedefine. As it is there is no way to compile with your change.

2. The default behaviour should not be the one that is known to cause problems. Until and unless there is an actual fix for the problem (whether or not that is in the code), the default should be to keep the optimization, with an option to let individuals bypass that if they desire and accept the risks.

3. Characterizing the problem as "misconfigured VMs" is, at best, premature.

I will revert this change again on the 3.0 branch shortly, unless you have a better suggestion (I'm definitely all ears for a better suggestion!).
Comment by Trond Norbye [ 01/Aug/14 ]
If you look at the comment, it passes the -D over into CMAKE_C_FLAGS, causing it to be set in the compiler flags and passed on to the compilation cycle.

As for misconfiguration, it is either insufficient resources on the VM or a "broken" compiler version installed there.
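For illustration, a rough sketch of the two mechanisms being debated here, using a hypothetical DISABLE_BREAKDANCER_GEN symbol rather than whatever define the memcached change actually uses:

    # 1) Injecting the define via the compiler flags (what the review above does);
    #    the symbol reaches every compilation unit, but it means setting
    #    CMAKE_C_FLAGS, which other parts of the build may also set:
    cmake -DCMAKE_C_FLAGS="-DDISABLE_BREAKDANCER_GEN" ..

    # 2) Passing it as a plain CMake cache variable; by itself this creates no
    #    preprocessor define -- it only takes effect if CMakeLists.txt consumes
    #    it, e.g. via CONFIGURE_FILE() and a config.h.in line such as
    #    "#cmakedefine DISABLE_BREAKDANCER_GEN":
    cmake -DDISABLE_BREAKDANCER_GEN=1 ..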
Comment by Trond Norbye [ 01/Aug/14 ]
Can I get login credentials to the server where it fails and to an identical VM where it succeeds?
Comment by Chris Hillery [ 01/Aug/14 ]
[CMAKE_C_FLAGS] Fair enough, I did misread that. That's not really a sufficient workaround, though. Doing that may overwrite other CFLAGS set by other parts of the build process.

I still maintain that the default behaviour should be the known-working version. However, for the moment I have temporarily locked the rel-3.0.0.xml manifest to the revision before my revert (ie, to 5cc2f8d928f0eef8bddbcb2fcb796bc5e9768bb8), so I won't revert anything else until that has been tested.

The only VM I know of at the moment where we haven't seen build failures is the production build slave. I can't give you access to that tonight as we're in crunch mode to produce a beta build. Let's plan to hook up next week and do some exploration.
Comment by Volker Mische [ 01/Aug/14 ]
There are commit message guidelines. The bottom of

http://www.couchbase.com/wiki/display/couchbase/Contributing+Changes

links to:

http://en.wikibooks.org/wiki/Git/Introduction#Good_commit_messages
Comment by Trond Norbye [ 01/Aug/14 ]
I've not done anything on the 3.0.0 branch; the fix going forward is for 3.0.1 and trunk. Hopefully the 3.0 branch will die relatively soon since we've got a lot of good stuff in the 3.0.1 branch.

The "workaround" is not intended as a permanent solution; it's just until the VMs are fixed. I've not been able to reproduce this issue on my CentOS, Ubuntu, Fedora or SmartOS builders. They're running in the following VMs:

[root@00-26-b9-85-bd-92 ~]# vmadm list
UUID TYPE RAM STATE ALIAS
04bf8284-9c23-4870-9510-0224e7478f08 KVM 2048 running centos-6
7bcd48a8-dcc2-43a6-a1d8-99fbf89679d9 KVM 2048 running ubuntu
c99931d7-eaa3-47b4-b7f0-cb5c4b3f5400 KVM 2048 running fedora
921a3571-e1f6-49f3-accb-354b4fa125ea OS 4096 running compilesrv
Comment by Trond Norbye [ 01/Aug/14 ]
I need access to two identical configured builders where one may reproduce the error and one where it succeeds.
Comment by Volker Mische [ 01/Aug/14 ]
I would also add that I think it is about bad VMs. On the commit validation side we have 6 VMs; it has only ever failed on ubuntu-1204-64-ci-01 due to this error and never on the others (ubuntu-1204-64-ci-02 - 06).
Comment by Chris Hillery [ 01/Aug/14 ]
That's not correct. The problem originally occurred on ci-03.
Comment by Volker Mische [ 01/Aug/14 ]
Then I need to correct myself: my comment only holds true for the couchdb-gerrit-300 job.
Comment by Trond Norbye [ 01/Aug/14 ]
Can I get login creds to one that it fails on while I'm waiting for access to one that it works on?
Comment by Volker Mische [ 01/Aug/14 ]
I don't know about creds (I think my normal user login works). The machine details are here: http://factory.couchbase.com/computer/ubuntu-1204-64-ci-01/
Comment by Chris Hillery [ 01/Aug/14 ]
Volker - it was initially detected in the make-simple-github-tap job, so it's not unique to couchdb-gerrit-300 either. Both jobs pretty much just check out the code and build it, though; they're pretty similar.
Comment by Trond Norbye [ 01/Aug/14 ]
Adding swap space to the builder makes the compilation pass. I've been trying to figure out how to get gcc to print more information about each step (the memory usage reported by -ftime-report didn't at all match the process usage ;-))
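For reference, a minimal sketch of what "adding swap space to the builder" can look like on an Ubuntu VM; the file path and size below are illustrative, not what was actually done on the CI machine:

    # create and enable an extra 2GB swap file
    sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # verify the additional swap is visible
    free -m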
Comment by Anil Kumar [ 12/Aug/14 ]
Adding the component as "build". Let me know if that's not correct.




[MB-11948] [Windows]: Simple-test broken - Rebalance exited with reason {unexpected_exit..{dcp_wait_for_data_move_failed,"default",254,..wrong_rebalancer_pid}}} Created: 13/Aug/14  Updated: 18/Aug/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Meenakshi Goel Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1143-rel

Issue Links:
Duplicate
is duplicated by MB-11981 [3.0.0-1166-Windows] items are stucke... Resolved
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Ref Link:
http://qa.hq.northscale.net/job/win_2008_x64--01_00--qe-sanity-P0/68/console
http://qa.hq.northscale.net/job/win_2008_x64--01_00--qe-sanity-P0/66/consoleFull

Test to Reproduce:
./testrunner -i <yourfile>.ini -t rebalance.rebalancein.RebalanceInTests.rebalance_in_with_ops,nodes_in=3,replicas=1,items=50000,doc_ops=create;update;delete

Logs:
[ns_server:error,2014-08-13T7:46:46.007,ns_1@10.3.3.213:janitor_agent-default<0.18311.0>:janitor_agent:handle_call:639]Rebalance call failed due to the wrong rebalancer pid <0.18169.0>. Should be undefined.
[ns_server:error,2014-08-13T7:46:46.007,ns_1@10.3.3.213:<0.18299.0>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.18312.0>,
                               {dcp_wait_for_data_move_failed,"default",254,
                                   'ns_1@10.3.3.213',
                                   ['ns_1@10.1.2.66'],
                                   wrong_rebalancer_pid}}
[ns_server:error,2014-08-13T7:46:46.007,ns_1@10.3.3.213:<0.18299.0>:misc:sync_shutdown_many_i_am_trapping_exits:1430]Shutdown of the following failed: [{<0.18312.0>,
                                    {dcp_wait_for_data_move_failed,"default",
                                     254,'ns_1@10.3.3.213',
                                     ['ns_1@10.1.2.66'],
                                     wrong_rebalancer_pid}}]
[ns_server:error,2014-08-13T7:46:46.007,ns_1@10.3.3.213:<0.18245.0>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.18289.0>,
                            {bulk_set_vbucket_state_failed,
                             [{'ns_1@10.3.3.213',
                               {'EXIT',
                                {{{{case_clause,
                                    {error,
                                     {{{badmatch,
                                        {error,
                                         {{badmatch,{error,enobufs}},
                                          [{mc_replication,connect,1,
                                            [{file,"src/mc_replication.erl"},
                                             {line,30}]},
                                           {mc_replication,connect,1,
                                            [{file,"src/mc_replication.erl"},
                                             {line,49}]},
                                           {dcp_proxy,connect,4,
                                            [{file,"src/dcp_proxy.erl"},
                                             {line,179}]},
                                           {dcp_proxy,maybe_connect,1,
                                            [{file,"src/dcp_proxy.erl"},
                                             {line,166}]},
                                           {dcp_consumer_conn,init,2,
                                            [{file,
                                              "src/dcp_consumer_conn.erl"},
                                             {line,55}]},
                                           {dcp_proxy,init,1,
                                            [{file,"src/dcp_proxy.erl"},
                                             {line,48}]},
                                           {gen_server,init_it,6,
                                            [{file,"gen_server.erl"},
                                             {line,304}]},
                                           {proc_lib,init_p_do_apply,3,
                                            [{file,"proc_lib.erl"},
                                             {line,239}]}]}}},
                                       [{dcp_replicator,init,1,
                                         [{file,"src/dcp_replicator.erl"},
                                          {line,47}]},
                                        {gen_server,init_it,6,
                                         [{file,"gen_server.erl"},{line,304}]},
                                        {proc_lib,init_p_do_apply,3,
                                         [{file,"proc_lib.erl"},{line,239}]}]},
                                      {child,undefined,'ns_1@10.1.2.66',
                                       {dcp_replicator,start_link,
                                        ['ns_1@10.1.2.66',"default"]},
                                       temporary,60000,worker,
                                       [dcp_replicator]}}}},
                                   [{dcp_sup,start_replicator,2,
                                     [{file,"src/dcp_sup.erl"},{line,78}]},
                                    {dcp_sup,
                                     '-set_desired_replications/2-lc$^2/1-2-',
                                     2,
                                     [{file,"src/dcp_sup.erl"},{line,55}]},
                                    {dcp_sup,set_desired_replications,2,
                                     [{file,"src/dcp_sup.erl"},{line,55}]},
                                    {replication_manager,handle_call,3,
                                     [{file,"src/replication_manager.erl"},
                                      {line,130}]},
                                    {gen_server,handle_msg,5,
                                     [{file,"gen_server.erl"},{line,585}]},
                                    {proc_lib,init_p_do_apply,3,
                                     [{file,"proc_lib.erl"},{line,239}]}]},
                                  {gen_server,call,
                                   ['replication_manager-default',
                                    {change_vbucket_replication,255,
                                     'ns_1@10.1.2.66'},
                                    infinity]}},
                                 {gen_server,call,
                                  [{'janitor_agent-default','ns_1@10.3.3.213'},
                                   {if_rebalance,<0.18169.0>,
                                    {update_vbucket_state,255,replica,
                                     undefined,'ns_1@10.1.2.66'}},
                                   infinity]}}}}]}}

Uploading Logs.

Live Cluster
1:10.3.3.213
2:10.3.2.21
3:10.3.2.23
4:10.1.2.66

 Comments   
Comment by Meenakshi Goel [ 13/Aug/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11948/f806d72b/10.3.3.213-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11948/11dd43ca/10.3.2.21-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11948/75af6ef6/10.1.2.66-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11948/9dc45b16/10.3.2.23-diag.zip
Comment by Aleksey Kondratenko [ 13/Aug/14 ]
You dug out the right error message.

ENOBUFS is mapped from WSAENOBUFS, and googling for it found me http://support.microsoft.com/kb/196271, which reminds me of the registry setting we used to apply in that installer thing we did. It looks like you still need to do it.
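For reference, a sketch of the MaxUserPort registry tweak that KB 196271 describes, applied by hand rather than by the installer; the value shown is the KB's maximum, not a tested recommendation, and a reboot is needed for it to take effect:

    REM raise the upper bound of ephemeral ports above the old 5000 default
    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v MaxUserPort /t REG_DWORD /d 65534 /f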
Comment by Sriram Melkote [ 13/Aug/14 ]
I think we support only Windows 2008, and they fixed that issue; it has 16k ports by default.
http://support.microsoft.com/kb/929851

Alk, any idea why we initiate so many outgoing connections?
Comment by Aleksey Kondratenko [ 13/Aug/14 ]
We shouldn't. And I don't know what "many" is.
Comment by Aleksey Kondratenko [ 14/Aug/14 ]
Rebalance is stuck waiting for seqno persistence. It's something in ep-engine.
Comment by Aleksey Kondratenko [ 14/Aug/14 ]
And memcached logs are full of this:

Thu Aug 14 10:51:05.630859 Pacific Daylight Time 3: (default) Warning: failed to open database file for vBucket = 1013 rev = 1
Thu Aug 14 10:51:05.633789 Pacific Daylight Time 3: (default) Warning: couchstore_open_db failed, name=c:/Program Files/Couchbase/Server/var/lib/couchbase/data/default/246.couch.1 option=2 rev=1 error=no such file [errno = 0: 'The operation completed successfully.

So indeed persistence is not working.
Comment by Chiyoung Seo [ 14/Aug/14 ]
Sriram,

I think this issue was caused by the file path name issue on windows that we discussed the other day.
Comment by Sriram Ganesan [ 15/Aug/14 ]
From bug MB-11934, Meenakshi had run the simple test on build 3.0.0-1130-rel (http://qa.hq.northscale.net/job/win_2008_x64--01_00--qe-sanity-P0/65/consoleFull) and it looks like all the tests passed except for warmup. So, it looks like persistence was okay at that point. Did this manifest in any build earlier than 3.0.0-1143-rel? It might help pinpoint the check-ins that caused the regression.
Comment by Meenakshi Goel [ 18/Aug/14 ]
I had my last successful run with 3.0.0-1137-rel and after that picked up 3.0.0-1143-rel, in which I observed this issue.
Also, the sanity tests seem to have worked fine up to 3.0.0-1139-rel:
http://qa.sc.couchbase.com/job/CouchbaseServer-SanityTest-4Node-Windows_2012_x64/213/consoleFull
Comment by Sriram Ganesan [ 18/Aug/14 ]
Also observing a lot of couch notifier errors

memcached<0.81.0>: Wed Aug 13 07:45:39.978617 Pacific Daylight Time 3: (default) Resetting connection to mccouch, lastReceivedCommand = select_bucket lastSentCommand = select_bucket currentCommand =unknown
memcached<0.81.0>: Wed Aug 13 07:45:39.979593 Pacific Daylight Time 3: (default) Trying to connect to mccouch: "127.0.0.1:11213"
memcached<0.81.0>: Wed Aug 13 07:45:39.979593 Pacific Daylight Time 3: (default) Connected to mccouch: "127.0.0.1:11213"
memcached<0.81.0>: Wed Aug 13 07:45:39.980570 Pacific Daylight Time 3: (default) Failed to read from mccouch for select_bucket: "The operation completed successfully.

Before all those, there is an error from moxi

[ns_server:info,2014-08-13T7:45:38.812,babysitter_of_ns_1@127.0.0.1:<0.79.0>:ns_port_server:log:169]moxi<0.79.0>: 2014-08-13 07:45:40: (C:\Jenkins\workspace\cs_300_win6408\couchbase\moxi\src\agent_config.c.721) ERROR: bad JSON configuration from http://127.0.0.1:8091/pools/default/saslBucketsStreaming: No vBuckets available; service maybe still initializing





[MB-11955] [windows] could not re-create default bucket after delete default bucket Created: 13/Aug/14  Updated: 18/Aug/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows server 2008 R2 64-bit

Attachments: Zip Archive 172.23.107.124-8132014-1238-diag.zip     Zip Archive 172.23.107.125-8132014-1239-diag.zip     Zip Archive 172.23.107.126-8132014-1241-diag.zip    
Triage: Untriaged
Operating System: Windows 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build (I don't see manifest file for windows)

http://latestbuilds.hq.couchbase.com/couchbase-server-enterprise_x86_64_3.0.0-1144-rel.rpm.manifest.xml
Is this a Regression?: Yes

 Description   
Ran the sanity test on Windows 2008 R2 64-bit on build 3.0.0-1144; there were a lot of failed tests.
Checking the error log shows the tests failed because the default bucket could not be created.

[2014-08-13 11:54:31,069] - [rest_client:1524] INFO - http://172.23.107.124:8091/pools/default/buckets with param: bucketType=membase&evictionPolicy=valueOnly&threadsNumber=3&ramQuotaMB=200&proxyPort=12211&authType=sasl&name=default&flushEnabled=1&replicaNumber=1&replicaIndex=1&saslPassword=
[2014-08-13 11:54:31,078] - [rest_client:747] ERROR - http://172.23.107.124:8091/pools/default/buckets

Sanity tests passed in build 3.0.0-1139

 Comments   
Comment by Abhinav Dangeti [ 15/Aug/14 ]
Hey Tony, can I get one windows vm for testing here?
Comment by Thuan Nguyen [ 15/Aug/14 ]
I gave Sriram some physical Windows servers.
Can you check with him whether he still needs the server with IP 10.17.1.166?
Comment by Abhinav Dangeti [ 15/Aug/14 ]
Bucket deletion doesn't complete, which is why you cannot recreate the bucket.
Comment by Thuan Nguyen [ 15/Aug/14 ]
This bug does not happen in build 3.0.0-1139 and previous builds, as shown in
http://qa.sc.couchbase.com/job/CouchbaseServer-SanityTest-4Node-Windows_2008_x64/
Comment by Abhinav Dangeti [ 15/Aug/14 ]
Changes between 1140 and 1142 to be precise.
Comment by Abhinav Dangeti [ 18/Aug/14 ]
Seeing multiple mcCouch connection failures in the logs:

Mon Aug 18 11:28:22.741780 Pacific Daylight Time 3: (default) Trying to connect to mccouch: "127.0.0.1:11213"
Mon Aug 18 11:28:22.741780 Pacific Daylight Time 3: (default) Connected to mccouch: "127.0.0.1:11213"
Mon Aug 18 11:28:22.748780 Pacific Daylight Time 3: (default) Failed to read from mccouch for select_bucket: "The operation completed successfully.

"
Mon Aug 18 11:28:22.748780 Pacific Daylight Time 3: (default) Resetting connection to mccouch, lastReceivedCommand = select_bucket lastSentCommand = select_bucket currentCommand =unknown
Mon Aug 18 11:28:22.748780 Pacific Daylight Time 3: (default) Trying to connect to mccouch: "127.0.0.1:11213"
Mon Aug 18 11:28:22.748780 Pacific Daylight Time 3: (default) Connected to mccouch: "127.0.0.1:11213"
Mon Aug 18 11:28:22.758780 Pacific Daylight Time 3: (No Engine) Bucket default registered with low priority
Mon Aug 18 11:28:22.758780 Pacific Daylight Time 3: (No Engine) Spawning zu readers, zu writers, zu auxIO, zu nonIO threads
Mon Aug 18 11:28:22.761781 Pacific Daylight Time 3: (default) metadata loaded in 1000 usec
Mon Aug 18 11:28:22.761781 Pacific Daylight Time 3: (default) Enough number of items loaded to enable traffic
Mon Aug 18 11:28:22.761781 Pacific Daylight Time 3: (default) warmup completed in 1000 usec
Mon Aug 18 11:28:23.839842 Pacific Daylight Time 3: (default) Failed to read from mccouch for notify_vbucket_update: "The operation completed successfully.

"
Mon Aug 18 11:28:23.839842 Pacific Daylight Time 3: (default) Resetting connection to mccouch, lastReceivedCommand = notify_vbucket_update lastSentCommand = notify_vbucket_update currentCommand =unknown
Mon Aug 18 11:28:23.839842 Pacific Daylight Time 3: (default) Trying to connect to mccouch: "127.0.0.1:11213"
Mon Aug 18 11:28:23.839842 Pacific Daylight Time 3: (default) Connected to mccouch: "127.0.0.1:11213"
Mon Aug 18 11:28:23.888845 Pacific Daylight Time 3: (default) Failed to read from mccouch for notify_vbucket_update: "The operation completed successfully.

"

...




[MB-10156] "XDCR - Cluster Compare" support tool Created: 07/Feb/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Cihan Biyikoglu Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: 2.5.1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
For the recent issues we have seen, we need a tool that can compare metadata (specifically revids) for a given replication definition in XDCR. To scale to large data sizes, being able to do this per vbucket or per doc range would be great, but we can do without these. For clarity, here is a high-level description.

Ideal case:
xdcr_compare cluster1_connectioninfo cluster1_bucketname cluster2_connectioninfo cluster2_bucketname [vbucketid] [keyrange]
It should return a line per docid for each row where the cluster1 metadata and cluster2 metadata for the given key differ:
docID - cluster1_metadata cluster2_metadata

Simplification: the tool is expected to return false positives in a moving system, but we will tackle that by rerunning the tool multiple times.
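A hypothetical invocation, just to make the synopsis above concrete (the tool does not exist yet; host names and bucket names are made up):

    # compare metadata for vbucket 512 of bucketA between the two clusters
    xdcr_compare cluster1.example.com:8091 bucketA cluster2.example.com:8091 bucketA 512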

 Comments   
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Aaron, do you have a timeline for this?
thanks
-cihan
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan,

For test automation/verification, can you list out the stats/metadata that we should be testing specifically?
we want to create/implement the tests accordingly.


Also -- is this tool de-coupled from the server package? or is this part of rpm/deb/.exe/osx build package?

Thanks,
Maria
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
This depends on the requirements; a tool that requires the manual collection of all data from all nodes in both clusters onto one machine (like we've done recently) could be done pretty quickly, but I imagine that may be difficult or entirely infeasible for some users.

Better would be to be able to operate remotely on clusters and only look at metadata. Unfortunately there is no *currently exposed* interface to only extract metadata from the system without also retrieving values. I may be able to work around this, but the workaround is unlikely to be simple.

Also for some users, even the amount of *metadata* may be prohibitively large to transfer all to one place, this also can be avoided, but again, adds difficulty.

Q: Can the tool be JVM-based?
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
I think it would be more feasible for this to ship separately from the server package.
Comment by Maria McDuff (Inactive) [ 19/Feb/14 ]
Cihan, Aaron,

If it's de-coupled, what older versions of Couchbase would this tool support? as far back as 1.8.x? pls confirm as this would expand our backward compatibility testing for this tool.
Comment by Aaron Miller (Inactive) [ 19/Feb/14 ]
Well, 1.8.x didn't have XDCR or the rev field; It can't be compatible with anything older than 2.0 since it operates mostly to check things added since 2.0.

I don't know how far back it needs to go but it *definitely* needs to be able to run against 2.2
Comment by Cihan Biyikoglu [ 19/Feb/14 ]
Agree with Aaron, let's keep this lightweight. Can we depend on Aaron for testing if this will initially be just a support tool? For 3.0, we may graduate the tool to the server-shipped category.
thanks
Comment by Sangharsh Agarwal [ 27/Feb/14 ]
Cihan, Is the Spec finalized for this tool in version 2.5.1?
Comment by Cihan Biyikoglu [ 27/Feb/14 ]
Sangharsh, for 2.5.1, we wanted to make this an "Aaron tested" tool. I believe Aaron already has the tool. Aaron?
Comment by Aaron Miller (Inactive) [ 27/Feb/14 ]
Working on it; wanted to get my actually-in-the-package 2.5.1 stuff into review first.

What I do already have is a diff tool for *files*, but it is highly inconvenient to use; this should be a tool that doesn't require collecting all data files into one place in order to use it, and instead can work against a running cluster.
Comment by Maria McDuff (Inactive) [ 05/Mar/14 ]
Aaron,

Is the tool merged into the build yet? Can you update, please?
Comment by Cihan Biyikoglu [ 06/Mar/14 ]
2.5.1 shiproom note: Phil raised a build concern on getting this packaged with 2.5.1. The initial bar we set was not to ship this as part of the server - it was intended to be a downloadable support tool. Aaron/Cihan will re-eval and get back to shiproom.
Comment by Cihan Biyikoglu [ 15/Jun/14 ]
Aaron is no longer here. Assigning to Xiaomei for consideration.




[MB-10719] Missing autoCompactionSettings during create bucket through REST API Created: 01/Apr/14  Updated: 19/Jun/14

Status: Reopened
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: michayu Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File bucket-from-API-attempt1.txt     Text File bucket-from-API-attempt2.txt     Text File bucket-from-API-attempt3.txt     PNG File bucket-from-UI.png     Text File bucket-from-UI.txt    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Unless I'm not using the API correctly, there seem to be some holes in the Couchbase API – particularly with autoCompaction.

The autoCompaction parameter can be set via the UI (as long as the bucketType is couchbase).

See the following attachments:
1) bucket-from-UI.png
2) bucket-from-UI.txt

And compare with creating the bucket (with autoCompaction) through the REST API:
1) bucket-from-API-attempt1.txt
    - Reference: http://docs.couchbase.com/couchbase-manual-2.5/cb-rest-api/#creating-and-editing-buckets
2) bucket-from-API-attempt2.txt
    - Reference: http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-admin-rest-auto-compaction
3) bucket-from-API-attempt3.txt
    - Setting autoCompaction globally
    - Reference: http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-admin-rest-auto-compaction

In all cases, autoCompactionSettings is still false.


 Comments   
Comment by Anil Kumar [ 19/Jun/14 ]
Triage - June 19 2014 Alk, parag, Anil
Comment by Aleksey Kondratenko [ 19/Jun/14 ]
It works, just apparently not properly documented:

# curl -u Administrator:asdasd -d name=other -d bucketType=couchbase -d ramQuotaMB=100 -d authType=sasl -d replicaNumber=1 -d replicaIndex=0 -d parallelDBAndViewCompaction=true -d purgeInterval=1 -d 'viewFragmentationThreshold[percentage]'=30 -d autoCompactionDefined=1 http://lh:9000/pools/default/buckets

And a general hint: you can see what the browser is POSTing when it creates a bucket or does anything else, to figure out a working (but not necessarily publicly supported) way of doing things.
Comment by Anil Kumar [ 19/Jun/14 ]
Ruth - The above documentation references need to be fixed with the correct REST API.




[MB-9632] diag / master events captured in log file Created: 22/Nov/13  Updated: 17/Feb/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Steve Yen Assignee: Ravi Mayuram
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
The information available in the diag / master events REST stream should be captured in a log (ALE?) file and hence be available to cbcollect_info and later analysis tools.

 Comments   
Comment by Aleksey Kondratenko [ 22/Nov/13 ]
It is already available in collectinfo
Comment by Dustin Sallings (Inactive) [ 26/Nov/13 ]
If it's only available in collectinfo, then it's not available at all. We lose most of the useful information if we don't run an http client to capture it continually throughout the entire course of a test.
Comment by Aleksey Kondratenko [ 26/Nov/13 ]
Feel free to submit a patch with exact behavior you need




[MB-9358] while running concurrent queries(3-5 queries) getting 'Bucket X not found.' error from time to time Created: 16/Oct/13  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 64 bit

Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
One thread gives the correct result:
[root@localhost tuqtng]# curl 'http://10.3.121.120:8093/query?q=SELECT+META%28%29.cas+as+cas+FROM+bucket2'
{
    "resultset": [
        {
            "cas": 4.956322522514292e+15
        },
        {
            "cas": 4.956322525999292e+15
        },
        {
            "cas": 4.956322554862292e+15
        },
        {
            "cas": 4.956322832498292e+15
        },
        {
            "cas": 4.956322835757292e+15
        },
        {
            "cas": 4.956322838836292e+15
...

    ],
    "info": [
        {
            "caller": "http_response:152",
            "code": 100,
            "key": "total_rows",
            "message": "0"
        },
        {
            "caller": "http_response:154",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "405.41885ms"
        }
    ]
}

but in another I see
{
    "error":
        {
            "caller": "view_index:195",
            "code": 5000,
            "key": "Internal Error",
            "message": "Bucket bucket2 not found."
        }
}

cbcollect will be attached

 Comments   
Comment by Marty Schoch [ 16/Oct/13 ]
This is a duplicate, though I can't yet find the original.

We believe that under higher load the view queries time out, which we report as bucket not found (it may not be possible to distinguish).
Comment by Iryna Mironava [ 16/Oct/13 ]
https://s3.amazonaws.com/bugdb/jira/MB-9358/447a45ae/10.3.121.120-10162013-858-diag.zip
Comment by Ketaki Gangal [ 17/Oct/13 ]
Seeing these errors and frequent tuq-server crashes on concurrent queries during typical server operations like
- w/ Failovers
- w/ Backups
- w/ Indexing.

Similar server ops with single queries, however, seem to run okay.

Note: This is a very small number of concurrent queries (3-5); users may typically have a higher level of concurrency at the application level.




[MB-9145] Add option to download the manual in pdf format (as before) Created: 17/Sep/13  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 2.0, 2.1.0, 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Anil Kumar Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
On the documentation site there is no option to download the manual in PDF format as before. We need to add this option back.

 Comments   
Comment by Maria McDuff (Inactive) [ 18/Sep/13 ]
Needed for the 2.2.1 bug fix release.




[MB-8838] Security Improvement - Connectors to implement security improvements Created: 14/Aug/13  Updated: 19/May/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Security Improvement - Connectors to implement security improvements

Spec ToDo.




[MB-9415] auto-failover in seconds - (reduced from minimum 30 seconds) Created: 21/May/12  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0, 1.8.1, 2.0, 2.0.1, 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Dipti Borkar Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 2
Labels: customer, ns_server-story
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-9416 Make auto-failover near immediate whe... Technical task Open Aleksey Kondratenko  

 Description   
including no false positives

http://www.pivotaltracker.com/story/show/25006101

 Comments   
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
At the very least it requires getting our timeout-ful cases under control. So at least splitting couchdb into a separate VM is a requirement for this. But not necessarily enough.
Comment by Aleksey Kondratenko [ 25/Oct/13 ]
Still seeing misunderstanding on this one.

So we have a _different_ problem: even manual failover (let alone automatic) cannot succeed quickly if the master node fails. It can easily take up to 2 minutes because of our use of the erlang "global" facility, which requires us to detect that the node is dead, and erlang is tuned to detect that within 2 minutes.

Now _this_ problem is about lowering autofailover detection to 10 seconds. We can blindly make it happen today. But it will not be usable because of all sorts of timeouts happening in the cluster management layer. We have a significant proportion of CBSEs _today_ about false positive autofailovers even with the 30 second threshold. Clearly lowering it to 10 will only make it worse. Therefore my point above: we have to get those timeouts under control so that heartbeats are sent/received timely. Or whatever else we use to detect a node being unresponsive.

I would like to note however that especially in some virtualized environments (arguably, oversubscribed) we saw delays as high as low tens of seconds from virtualization _alone_. Given the relatively high cost of failover in our software, I'd like to point out that people could too easily abuse this feature.

The high cost of failover referred to above is this:

* You almost certainly and irrecoverably lose some recent mutations. _At least_ recent mutations, that is, if replication is really working well. On a node that's on the edge of autofailover you can imagine replication not being "diamond-hard quick". That's cost 1.

* In order to return the node back to the cluster (say the node crashed and needed some time to recover, whatever that might mean) you need a rebalance. That type of rebalance is relatively quick by design, i.e. it only moves data back to this node and nothing else. But it's still a rebalance. With UPR we can possibly make it better, because its failover log is capable of rewinding just conflicting mutations.

What I'm trying to say with "our approach appears to have a relatively high price for failover" is that it appears to be an inherent issue for a strongly consistent system. I'm trying to say that in many cases it might actually be better to wait up to a few minutes for a node to recover and restore its availability than to fail it over and pay the price of restoring cluster capacity (with rebalancing this node back or its replacement, which one is irrelevant). If somebody wants stronger availability then some other approaches, which in particular can "reconcile" changes from both the failed-over node and its replacement node, look like a fundamentally better choice _for those requirements_.




[MB-4030] enable traffic for ready nodes even if not all nodes are up/healthy/ready (aka partial janitor) (was: After two nodes crashed, curr_items remained 0 after warmup for extended period of time) Created: 06/Jul/11  Updated: 20/May/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.1, 2.0, 2.0.1, 2.2.0, 2.1.1, 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We had two nodes crash at a customer, possibly related to a disk space issue, but I don't think so.

After they crashed, the nodes warmed up relatively quickly, but immediately "discarded" their items. I say that because I see that they warmed up ~10m items, but the current item counts were both 0.

I tried shutting down the service and had to kill memcached manually (kill -9). Restarting it went through the same process of warming up and then nothing.

While I was looking around, I left it sit for a little while and magically all of the items came back. I seem to recall this bug previously where a node wouldn't be told to be active until all the nodes in the cluster were active...and it got into trouble when not all of the nodes restarted.

Diags for all nodes will be attached

 Comments   
Comment by Perry Krug [ 06/Jul/11 ]
Full set of logs at \\corp-fs1\export_support_cases\bug_4030
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
It _is_ an ns_server issue caused by the janitor needing all nodes to be up for vbucket activation. We planned a fix for 1.8.1 (now 1.8.2).
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
The fix would land as part of the fast warmup integration.
Comment by Perry Krug [ 18/Jul/12 ]
Peter, can we get a second look at this one? We've seen this before, and the problem is that the janitor did not run until all nodes had joined the cluster and warmed up. I'm not sure we've fixed that already...
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
Latest 2.0 will mark nodes as green and enable memcached traffic when all of them are up. So the easy part is done.

Partial janitor (i.e. enabling traffic for some nodes when others are still down/warming up) is something that is unlikely to be done soon.
Comment by Perry Krug [ 18/Jul/12 ]
Thanks Alk...what's the difference in behavior (in this area) between 1.x and 2.0? It "sounds" like they're the same, no?

And this bug should still remain open until we fix the primary issue which is the partial janitor...correct?
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
1.8.1 will show a node as green when ep-engine thinks it's warmed up. But confusingly it will not really be ready. All vbuckets will be in state dead and curr_items will be 0.

2.0 fixes this confusion. Node is marked green when it's actually warmed up from user's perspective. I.e. right vbucket states are set and it'll serve clients traffic.

2.0 is still very conservative about only making vbucket state changes when all nodes are up and warmed up. That's the "impartial" janitor. Whether it's a bug or a "lack of feature" is debatable. But I think the main concern, that users are confused by the green-ness of nodes, is resolved.
Comment by Aleksey Kondratenko [ 18/Jul/12 ]
Closing as fixed. We'll get to partial janitor some day in future which is feature we lack today, not bug we have IMHO
Comment by Perry Krug [ 12/Nov/12 ]
Reopening this for the need for partial janitor. Recent customer had multiple nodes need to be hard-booted and none returned to service until all were warmed up
Comment by Steve Yen [ 12/Nov/12 ]
bug-scrub: moving out of 2.0, as this looks like a feature req.
Comment by Farshid Ghods (Inactive) [ 13/Nov/12 ]
In system testing we have noticed many times that if multiple nodes crash, then until all nodes are warmed up the node status for those that have already warmed up appears as yellow.


The user won't be able to tell from the console which node has successfully warmed up, and if one node is actually not recovering or not warming up in a reasonable time they have to figure it out some other way (cbstats ...).

Another issue with this is that the user won't be able to perform a failover for 1 node even though N-1 nodes have warmed up already.

I am not sure if fixing this bug will impact cluster-restore functionality, but it is important to fix it or to suggest a workaround to the user (by workaround I mean a documented, tested and supported set of commands).
Comment by Mike Wiederhold [ 17/Mar/13 ]
Comments say this is an ns_server issue so I am removing couchbase-bucket from affected components. Please re-add if there is a couchbase-bucket task for this issue.
Comment by Aleksey Kondratenko [ 23/Feb/14 ]
Not going to happen for 3.0.




[MB-10838] cbq-engine must work without all_docs Created: 11/Apr/14  Updated: 29/Jun/14  Due: 07/Jul/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: tried builds 3.0.0-555 and 3.0.0-554

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
WORKAROUND: Run "CREATE PRIMARY INDEX ON <bucket>" once per bucket, when using 3.0 server

SYMPTOM: tuq returns Bucket default not found.', u'caller': u'view_index:200 for all queries

single node cluster, 2 buckets(default and standard)
run simple query
q=FROM+default+SELECT+name%2C+email+ORDER+BY+name%2Cemail+ASC

got {u'code': 5000, u'message': u'Bucket default not found.', u'caller': u'view_index:200', u'key': u'Internal Error'}
tuq displays
[root@grape-001 tuqtng]# ./tuqtng -couchbase http://localhost:8091
22:36:07.549322 Info line disabled false
22:36:07.554713 tuqtng started...
22:36:07.554856 version: 0.0.0
22:36:07.554942 site: http://localhost:8091
22:47:06.915183 ERROR: Unable to access view - cause: error executing view req at http://127.0.0.1:8092/default/_all_docs?limit=1001: 500 Internal Server Error - {"error":"noproc","reason":"{gen_server,call,[undefined,bytes,infinity]}"}
 -- couchbase.(*viewIndex).ScanRange() at view_index.go:186
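For reference, a sketch of applying the workaround noted above, assuming the statement can be submitted through the same /query endpoint used in this ticket (otherwise run it from the cbq shell):

    # create the primary index once per bucket before querying a 3.0 server
    curl 'http://localhost:8093/query?q=CREATE+PRIMARY+INDEX+ON+default'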


 Comments   
Comment by Sriram Melkote [ 11/Apr/14 ]
Iryna, can you please add cbcollectinfo or at least the couchdb logs?

Also, all CBQ DP4 testing must be done against 2.5.x server, please confirm it is the case in this bug.
Comment by Iryna Mironava [ 22/Apr/14 ]
cbcollect
https://s3.amazonaws.com/bugdb/jira/MB-10838/9c1cf39c/172.27.33.17-4222014-111-diag.zip

bug is valid only for 3.0. 2.5.x versions are working fine
Comment by Sriram Melkote [ 22/Apr/14 ]
Gerald, we need to update query code to not use _all_docs for 3.0

Iryna, workaround is to run "CREATE PRIMARY INDEX ON <bucket>" first before running any queries when using 3.0 server
Comment by Sriram Melkote [ 22/Apr/14 ]
Reducing severity with workaround. Please ping me if that doesn't work
Comment by Iryna Mironava [ 22/Apr/14 ]
Works with the workaround.
Comment by Gerald Sangudi [ 22/Apr/14 ]
Manik,

Please modify the tuqtng / DP3 Couchbase catalog to return an error telling the user to CREATE PRIMARY INDEX. This should only happen with 3.0 server. For 2.5.1 or below, #all_docs should still work.

Thanks.




[MB-11736] add client SSL to 3.0 beta documentation Created: 15/Jul/14  Updated: 15/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Matt Ingenthron Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
This is mostly a curation exercise. Add to the server 3.0 beta docs the configuration information for each of the following clients:
- Java
- .NET
- PHP
- Node.js
- C/C++

No other SDKs support SSL at the moment.

This is either in work-in-progress documentation or in the blogs from the various DPs. Please check in with the component owner if you can't find what you need.




[MB-10180] Server Quota: Inconsistency between documentation and CB behaviour Created: 11/Feb/14  Updated: 21/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Dave Rigby Assignee: Ruth Harris
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File MB-10180_max_quota.png    
Issue Links:
Relates to
relates to MB-2762 Default node quota is still too high Resolved
relates to MB-8832 Allow for some back-end setting to ov... Open
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
In the documentation for the product (and general sizing advice) we tell people to allocate no more than 80% of their memory for the Server Quota, to leave headroom for the views, disk write queues and general OS usage.

However on larger[1] nodes we don't appear to enforce this, and instead allow people to allocate up to 1GB less than the total RAM.

This is inconsistent, as we document and tell people one thing and let them do another.

This appears to be something inherited from MB-2762, the intent of which appeared to be to only allow relaxing this when joining a cluster; however, this doesn't appear to be how it works - I can successfully change the existing cluster quota from the CLI to a "large" value:

    $ /opt/couchbase/bin/couchbase-cli cluster-edit -c localhost:8091 -u Administrator -p dynam1te --cluster-ramsize=127872
    ERROR: unable to init localhost (400) Bad Request
    [u'The RAM Quota value is too large. Quota must be between 256 MB and 127871 MB (memory size minus 1024 MB).']

While I can see some logic to relax the 80% constraint on big machines, with the advent of 2.X features 1024MB seems far too small an amount of headroom.

Suggestions to resolve:

A) Revert to a straightforward 80% max, with a --force option or similar to allow specific customers to go higher if they know what they are doing
B) Leave current behaviour, but document it.
C) Increase minimum headroom to something more reasonable for 2.X, *and* document the behaviour.

([1] On a machine with 128,895MB of RAM I get the "total-1024" behaviour, on a 1GB VM I get 80%. I didn't check in the code what the cutoff for 80% / total-1024 is).


 Comments   
Comment by Dave Rigby [ 11/Feb/14 ]
Screenshot of initial cluster config: maximum quota is total_RAM-1024
Comment by Aleksey Kondratenko [ 11/Feb/14 ]
Do not agree with that logic.

There's IMHO quite a bit of difference between default settings, the recommended settings limit and the allowed settings limit. The latter can be wider for folks who really know what they're doing.
Comment by Aleksey Kondratenko [ 11/Feb/14 ]
Passed to Anil, because that's not my decision to change limits
Comment by Dave Rigby [ 11/Feb/14 ]
@Aleksey: I'm happy to resolve as something other than my (A,B,C), but the problem here is that many people haven't even been aware of this "extended" limit in the system - and moreover on a large system we actually advertise it in the GUI when specifying the allowed limit (see attached screenshot).

Furthermore, I *suspect* that this was originally only intended for upgrades for 1.6.X (see http://review.membase.org/#/c/4051/), but somehow is now being permitted for new clusters.

Ultimately I don't mind what our actual max quota value is, but the app behaviour should be consistent with the documentation (and the sizing advice we give people).
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Raising to product blocker.
This inconsistency has to be resolved - PM to re-align.
Comment by Anil Kumar [ 28/May/14 ]
Going with option B - Leave current behaviour, but document it.
Comment by Ruth Harris [ 17/Jul/14 ]
I only see the 80% number coming up as an example of setting the high water mark (85% suggested). The Server Quota section doesn't mention anything. The working set management & ejection section(s) and item pager sub-section also mention the high water mark.

Can you be more specific about where this information is? Anyway, the best solution is to add a "note" in the applicable section(s).

--ruth

Comment by Dave Rigby [ 21/Jul/14 ]
@Ruth: So the current product behaviour is that the ServerQuota limit depends on the maximum memory available:

* For machines with <= X MB of memory, the maximum server quota is 80% of total physical memory
* For machines with > X MB of memory, the maximum Server Quota is Total Physical Memory - 1024.

The value of 'X' is fixed in the code, but it wasn't obvious what it actually is (it's derived from a few different things). I suggest you ask Alk, who should be able to provide its value.




[MB-9917] DOC - memcached should dynamically adjust the number of worker threads Created: 14/Jan/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Trond Norbye Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
4 threads is probably not ideal for a 24 core system ;)

 Comments   
Comment by Anil Kumar [ 25/Mar/14 ]
Trond - Can you explain: is this a new feature in 3.0, or a fix to the older documentation?
Comment by Ruth Harris [ 17/Jul/14 ]
Trond, Could you provide more information here and then reassign to me? --ruth
Comment by Trond Norbye [ 24/Jul/14 ]
New in 3.0 is that memcached no longer defaults to 4 threads for the frontend, but uses 75% of the number of cores reported by the system (with a minimum of 4).

There are 3 ways to tune this:

* Export MEMCACHED_NUM_CPUS=number of threads you want before starting couchbase server

* Use the -t <number> command line argument (this will go away in the future)

* Specify it in the configuration file read during startup (but when started from the full server this file is regenerated every time, so you'll lose the modifications)
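A minimal sketch of the first option, assuming a Linux install started via the init script (the path and thread count are illustrative):

    # pin the memcached frontend to 8 worker threads instead of the 75%-of-cores default
    export MEMCACHED_NUM_CPUS=8
    /etc/init.d/couchbase-server restart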




[MB-6972] distribute couchbase-server through yum and ubuntu package repositories Created: 19/Oct/12  Updated: 06/Aug/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Farshid Ghods (Inactive) Assignee: Phil Labee
Resolution: Unresolved Votes: 3
Labels: devX
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks MB-8693 [Doc when ready] distribute couchbase... Reopened
blocks MB-7821 yum install couchbase-server from cou... Resolved
Duplicate
duplicates MB-2299 Create signed RPM's Resolved
is duplicated by MB-9409 repository for deb packages (debian&u... Resolved
Flagged:
Release Note

 Description   
This helps us in handling dependencies that are needed for Couchbase Server; the SDK team has already implemented this for various SDK packages.

We might have to make some changes to our packaging metadata to work with this scheme.

 Comments   
Comment by Steve Yen [ 26/Nov/12 ]
To 2.0.2 per bug-scrub.

First step is to do the repositories?
Comment by Steve Yen [ 26/Nov/12 ]
back to 2.01, per bug-scrub
Comment by Farshid Ghods (Inactive) [ 19/Dec/12 ]
Phil,
please sync up with Farshid and get the instructions that Sergey and Pavel sent.
Comment by Farshid Ghods (Inactive) [ 28/Jan/13 ]
We should resolve this task once 2.0.1 is released.
Comment by Dipti Borkar [ 29/Jan/13 ]
Have we figured out the upgrade process moving forward? For example, from 2.0.1 to 2.0.2, or 2.0.1 to 2.1?
Comment by Jin Lim [ 04/Feb/13 ]
Please ensure that we also confirm/validate the upgrade process moving from 2.0.1 to 2.0.2. Thanks.
Comment by Phil Labee [ 06/Feb/13 ]
Now have DEB repo working, but another issue has come up: We need to distribute the public key so that users can install the key before running apt-get.

The wiki page has been updated.
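For illustration, what the user-side steps might look like once the key and repo are published; the URLs and suite name below are hypothetical placeholders, not the final locations:

    # import the repository signing key so apt can verify packages
    wget -O - http://packages.couchbase.com/ubuntu/couchbase.key | sudo apt-key add -
    # add the repository and install
    echo "deb http://packages.couchbase.com/ubuntu precise precise/main" | sudo tee /etc/apt/sources.list.d/couchbase.list
    sudo apt-get update
    sudo apt-get install couchbase-server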
Comment by kzeller [ 14/Feb/13 ]
Added to 2.0.1 RN as:

Fix:

We now provide Couchbase Server through yum and Debian package repositories.
Comment by Matt Ingenthron [ 09/Apr/13 ]
What are the public URLs for these repositories? This was mentioned in the release notes here:
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-server-rn_2-0-0l.html
Comment by Matt Ingenthron [ 09/Apr/13 ]
Reopening, since this isn't documented that I can find. Apologies if I'm just missing it.
Comment by Dipti Borkar [ 23/Apr/13 ]
Anil, can you work with Phil to see what are the next steps here?
Comment by Anil Kumar [ 24/Apr/13 ]
Yes I'll be having discussion with Phil and will update here with details.
Comment by Tim Ray [ 28/Apr/13 ]
Could we either remove the note about yum/deb repos in the release notes, or get those repo locations / sample files / keys added to public pages? The only links that seem like they 'might' contain the info point to internal pages I don't have access to.
Comment by Anil Kumar [ 14/May/13 ]
Thanks Tim, we have removed it from the release notes. We will add instructions about yum/deb repo locations/files/keys to the documentation once they are available. Thanks!
Comment by kzeller [ 14/May/13 ]
Removing duplicate ticket:

http://www.couchbase.com/issues/browse/MB-7860
Comment by h0nIg [ 24/Oct/13 ]
Any update? Maybe I created a duplicate issue: http://www.couchbase.com/issues/browse/MB-9409 but it seems that the repositories listed on http://hub.internal.couchbase.com/confluence/display/CR/How+to+Use+a+Linux+Repo+--+debian are outdated
Comment by Sriram Melkote [ 22/Apr/14 ]
I tried to install on Debian today. It failed badly. One .deb package didn't match the libc version of stable. The other didn't match the openssl version. Changing libc or openssl is simply not an option for someone using Debian stable because it messes with the base OS too deeply. So as of 4/23/14, we don't have support for Debian.
Comment by Sriram Melkote [ 22/Apr/14 ]
Anil, we have accumulated a lot of input in this bug. I don't think this will realistically go anywhere for 3.0 unless we define specific goals and a considered expansion of the platform support matrix. Can you please define the 3.0 goal more precisely?
Comment by Matt Ingenthron [ 22/Apr/14 ]
+1 on Siri's comments. Conversations I had with both Ubuntu (who recommend their PPAs) and Red Hat experts (who recommend setting up a repo or getting into EPEL or the like) indicated that's the best way to ensure coverage of all OSs. Binary packages built on one OS and deployed on another are risky and run into dependency issues.
Comment by Anil Kumar [ 28/Apr/14 ]
This ticket is specifically for distributing DEB and RPM packages through APT and YUM repositories. We have another ticket, MB-10960, for supporting the Debian platform.
Comment by Anil Kumar [ 23/Jun/14 ]
Assigning ticket to Tony for verification.
Comment by Phil Labee [ 21/Jul/14 ]
Need to do before closing:

[ ] capture keys and process used for build that is currently posted (3.0.0-628), update tools and keys of record in build repo and wiki page
[ ] distribute 2.5.1 and 3.0.0-beta1 builds using same process, testing update capability
[ ] test update from 2.0.0 to 2.5.1 to 3.0.0
Comment by Phil Labee [ 21/Jul/14 ]
re-opening to assign to sprint to prepare the distribution repos for testing
Comment by Wayne Siu [ 30/Jul/14 ]
Phil,
has build 3.0.0-973 been updated in the repos for beta testing?




[MB-10214] Mac version update check is incorrectly identifying newest version Created: 14/Feb/14  Updated: 12/Aug/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0.1, 2.2.0, 2.1.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: David Haikney Assignee: Phil Labee
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Mac OS X

Attachments: PNG File upgrade_check.png    
Is this a Regression?: Yes

 Description   
Running the 2.1.1 version of Couchbase on a Mac, "check for latest version" reports that the latest version is already running (e.g. see attached screenshot)


 Comments   
Comment by Aleksey Kondratenko [ 14/Feb/14 ]
Definitely not a UI bug. It's using phone-home to find out about upgrades. And I have no idea who owns that now.
Comment by Steve Yen [ 12/Jun/14 ]
got an email from ravi to look into this
Comment by Steve Yen [ 12/Jun/14 ]
Not sure if this is correct analysis, but I did a quick scan of what I think is the mac installer, which I think is...

  https://github.com/couchbase/couchdbx-app

It gets its version string by running a "git describe", in the Makefile here...

  https://github.com/couchbase/couchdbx-app/blob/master/Makefile#L1

Currently, a "git describe" on master branch returns...

  $ git describe
  2.1.1r-35-gf6646fa

...which is *kinda* close to the reported version string in the screenshot ("2.1.1-764-rel").

So, I'm thinking one fix needed would be a tagging (e.g., "git tag -a FOO -m FOO") of the couchdbx-app repository.

So, reassigning to Phil to do that appropriately.

Also, it looks like our mac installer is using an open-source packaging / installer / runtime library called "sparkle" (which might be a little under-maintained -- not sure).

  https://github.com/andymatuschak/Sparkle/wiki

The sparkle library seems to check for version updates by looking at the URL here...

  https://github.com/couchbase/couchdbx-app/blob/master/cb.plist.tmpl#L42

Which seems to either be...

  http://appcast.couchbase.com/membasex.xml

Or, perhaps...

  http://appcast.couchbase.com/couchbasex.xml

The appcast.couchbase.com appears to be actually an S3 bucket, off of our production couchbase AWS account. So those *.xml files need to be updated, as they currently have content that has older versions. For example, http://appcast.couchbase.com/couchbase.xml looks currently like...

    <rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle" version="2.0">
    <channel>
    <title>Updates for Couchbase Server</title>
    <link>http://appcast.couchbase.com/couchbase.xml&lt;/link>
    <description>Recent changes to Couchbase Server.</description>
    <language>en</language>
    <item>
    <title>Version 1.8.0</title>
    <sparkle:releaseNotesLink>
    http://www.couchbase.org/wiki/display/membase/Couchbase+Server+1.8.0
    </sparkle:releaseNotesLink>
    <!-- date -u +"%a, %d %b %Y %H:%M:%S GMT" -->
    <pubDate>Fri, 06 Jan 2012 16:11:17 GMT</pubDate>
    <enclosure url="http://packages.couchbase.com/1.8.0/Couchbase-Server-Community.dmg" sparkle:version="1.8.0" sparkle:dsaSignature="MCwCFAK8uknVT3WOjPw/3LkQpLBadi2EAhQxivxe2yj6EU6hBlg9YK/5WfPa5Q==" length="33085691" type="application/octet-stream"/>
    </item>
    </channel>
    </rss>

Not updating the xml files, though, probably causes no harm. Just that our osx users won't be pushed news on updates.
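One quick way to confirm what the appcast currently advertises (a sketch using the two candidate URLs above; read-only, nothing is modified):

    # show the sparkle version entries in each candidate feed
    curl -s http://appcast.couchbase.com/membasex.xml | grep 'sparkle:version'
    curl -s http://appcast.couchbase.com/couchbasex.xml | grep 'sparkle:version'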
Comment by Phil Labee [ 12/Jun/14 ]
This has nothing to do with "git describe". There should be no place in the product where "git describe" is used to determine version info. See:

    http://hub.internal.couchbase.com/confluence/display/CR/Branching+and+Tagging

so there's definitely a bug in the Makefile.

The version update check seems to be out of date. The phone-home file is generated during:

    http://factory.hq.couchbase.com:8080/job/Product_Staging_Server/

but the process of uploading it is not automated.
Comment by Steve Yen [ 12/Jun/14 ]
Thanks for the links.

> This has nothing to do with "git describe".

My read of the Makefile makes me think, instead, that "git describe" is the default behavior unless it's overridden by the invoker of the make.

> There should be no place in the product that "git describe" should be used to determine version info. See:
> http://hub.internal.couchbase.com/confluence/display/CR/Branching+and+Tagging

It appears all this couchdbx-app / sparkle stuff predates that wiki page by a few years, so I guess it's inherited legacy.

Perhaps voltron / buildbot are not setting the PRODUCT_VERSION correctly before invoking the couchdbx-app make, which makes the Makefile default to 'git describe'?

    commit 85710d16b1c52497d9f12e424a22f3efaeed61e4
    Date: Mon Jun 4 14:38:58 2012 -0700

    Apply correct product version number
    
    Get version number from $PRODUCT_VERSION if it's set.
    (Buildbot and/or voltron will set this.)
    If not set, default to `git describe` as before.
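If that's what is happening, the build-side fix is just to make sure the variable is set before couchdbx-app's make runs. A minimal sketch (the version string is a placeholder, and I'm assuming make is invoked from the directory containing the couchdbx-app checkout):

    # hypothetical buildbot/voltron step
    export PRODUCT_VERSION=3.0.0-1234-rel
    make -C couchdbx-app   # with PRODUCT_VERSION set, the Makefile should not fall back to `git describe`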
    
> The version update check seems to be out of date.

Yes, that's right. The appcast files are out of date.

> The phone-home file is generated during:
> http://factory.hq.couchbase.com:8080/job/Product_Staging_Server/

I think appcast files for OSX / sparkle are a _different_ mechanism than the phone-home file, and an appcast XML file does not appear to be generated/updated by the Product_Staging_Server job.

But, I'm not an expert or really qualified on the details here -- these are just my opinions from a quick code scan, not from actually doing/knowing.

Comment by Wayne Siu [ 01/Aug/14 ]
Per PM (Anil), we should get this fixed by 3.0 RC1.
Raising the priority to Critical.
Comment by Wayne Siu [ 07/Aug/14 ]
Phil,
Please provide update.
Comment by Anil Kumar [ 12/Aug/14 ]
Triage - Upgrading to 3.0 Blocker





[MB-11060] Build and test 3.0 for 32-bit Windows Created: 06/May/14  Updated: 13/Aug/14  Due: 09/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build, ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Blocker
Reporter: Chris Hillery Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7/8 32-bit

Issue Links:
Dependency
Duplicate

 Description   
For the "Developer Edition" of Couchbase Server 3.0 on Windows 32-bit, we need to first ensure that we can build 32-bit-compatible binaries. It is not possible to build 3.0 on a 32-bit machine due to the MSVC 2013 requirement. Hence we need to configure MSVC as well as Erlang on a 64-bit machine to produce 32-bit compatible binaries.

 Comments   
Comment by Chris Hillery [ 06/May/14 ]
This is assigned to Trond who is already experimenting with this. He should:

 * test being able to start the server on a 32-bit Windows 7/8 VM

 * make whatever changes are necessary to the CMake configuration or other build scripts to produce this build on a 64-bit VM

 * thoroughly document the requirements for the build team to reproduce this build

Then he can assign this bug to Chris to carry out configuring our build jobs accordingly.
Comment by Trond Norbye [ 16/Jun/14 ]
Can you give me a 32-bit Windows installation I can test on? My MSDN license has expired and I don't have Windows media available (and the internal wiki page just has a limited set of licenses and no download links).

Then assign it back to me and I'll try it
Comment by Chris Hillery [ 16/Jun/14 ]
I think you can use 172.23.106.184 - it's a 32-bit Windows 2008 VM that we can't use for 3.0 builds anyway.
Comment by Trond Norbye [ 24/Jun/14 ]
I copied the full result of a build where I set target_platform=x86 on my 64 bit windows server (the "install" directory) over to a 32 bit windows machine and was able to start memcached and it worked as expected.

Our installers do other magic, like installing the service etc., that is needed in order to start the full server. Once we have such an installer I can do further testing.
Comment by Chris Hillery [ 24/Jun/14 ]
Bin - could you take a look at this (figuring out how to make InstallShield on a 64-bit machine create a 32-bit compatible installer)? I won't likely be able to get to it for at least a month, and I think you're the only person here who still has access to an InstallShield 2010 designer anyway.




[MB-11930] [windows] all data lost after offline upgrade from 2.0.0 to 3.0.0-1134 Created: 11/Aug/14  Updated: 13/Aug/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: Windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64-bit

Attachments: Zip Archive 12.11.10.137-8112014-1532-diag.zip    

 Description   
Install couchbase server 2.0.0 on one node running windows server 2008 R2 64-bit
Create a default bucket and load 10K items into it
Offline upgrade to 3.0.0-1134.
After the upgrade is done, couchbase server 3.0.0 is up but no items are present in the default bucket




[MB-11812] Need a read-only mode to startup the query server Created: 24/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Don Pinto Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This is required for the tutorial in production, as we don't want any user to blow away the data or add additional data.

All DML queries should be blocked when the server is started in this mode. Only the admin should be able to start the query server in read-only mode.






[MB-10440] something isn't right with tcmalloc in build 1074 on at least rhel6 causing memcached to crash Created: 11/Mar/14  Updated: 04/Aug/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aleksey Kondratenko Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Relates to
relates to MB-10371 tcmalloc must be compiled with -DTCMA... Resolved
relates to MB-10439 Upgrade:: 2.5.0-1059 to 2.5.1-1074 =>... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
SUBJ.

Just installing latest 2.5.1 build on rhel6 and creating bucket caused segmentation fault (see also MB-10439).

When replacing tcmalloc with a copy I've built it works.

Cannot be 100% sure it's tcmalloc but crash looks too easily reproducible to be something else.


 Comments   
Comment by Wayne Siu [ 12/Mar/14 ]
Phil,
Can you review if this change has been (copied from MB-10371) applied properly?

voltron (2.5.1) commit: 73125ad66996d34e94f0f1e5892391a633c34d3f

    http://review.couchbase.org/#/c/34344/

passes "CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW" to each gprertools configure command
Comment by Andrei Baranouski [ 12/Mar/14 ]
see the same issue on centos 64
Comment by Phil Labee [ 12/Mar/14 ]
need more info:

1. What package did you install?

2. How did you build the tcmalloc which fixes the problem?
 
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
build 1740. Rhel6 package.

You can see for yourself. It's easily reproducible, as Andrei also confirmed.

I've got the 2.1 tar.gz from googlecode. Then did ./configure --prefix=/opt/couchbase --enable-minimal CPPFLAGS='-DTCMALLOC_SMALL_BUT_SLOW' and then make and make install. After that it works. Have no idea why.
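Spelled out, the replacement procedure was roughly the following (a sketch; the init-script path is an assumption for a standard Linux install, and the server must be stopped before overwriting the library):

    # rebuild tcmalloc from the gperftools 2.1 tarball with the define we ship
    ./configure --prefix=/opt/couchbase --enable-minimal CPPFLAGS='-DTCMALLOC_SMALL_BUT_SLOW'
    make
    # stop the server, install the freshly built library over the shipped one, restart
    /etc/init.d/couchbase-server stop
    make install
    /etc/init.d/couchbase-server start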

Do you know the exact CFLAGS and CXXFLAGS that are used to build our tcmalloc? Those variables are likely set in voltron (or even from outside of voltron) and might affect optimization and therefore expose some bugs.

Comment by Aleksey Kondratenko [ 12/Mar/14 ]
And 64 bit.
Comment by Phil Labee [ 12/Mar/14 ]
We build out of:

    https://github.com/couchbase/gperftools

and for 2.5.1 use commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

compile using:

(cd /home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools \
&& ./autogen.sh \
        && ./configure --prefix=/opt/couchbase CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW --enable-minimal \
        && make \
        && make install-exec-am install-data-am)
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
That part I know. What I don't know is what cflags are being used.
Comment by Phil Labee [ 13/Mar/14 ]
from the 2.5.1 centos-6-x86 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x86-251-builder/builds/18/steps/couchbase-server%20make%20enterprise%20/logs/stdio

make[1]: Entering directory `/home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools'

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Phil Labee [ 13/Mar/14 ]
from a 2.5.1 centos-6-x64 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x64-251-builder/builds/16/steps/couchbase-server%20make%20enterprise%20/logs/stdio

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Ok. I'll try to exclude -O3 as a possible reason for the failure later today (in which case it might be an upstream bug). In the meantime I suggest you try lowering optimization to -O2. Unless you have other ideas, of course.
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Building tcmalloc with the exact same cflags (-O3) doesn't cause any crashes. At this time my guess is either a compiler bug or cosmic radiation hitting just this specific build.

Can we simply force a rebuild?
Comment by Phil Labee [ 13/Mar/14 ]
test with newer build 2.5.1-1075:

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_2.5.1-1075-rel.rpm

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_64_2.5.1-1075-rel.rpm
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Didn't help, unfortunately. Is that still with -O3?
Comment by Phil Labee [ 14/Mar/14 ]
Still using -O3. There are extensive comments in the voltron Makefile warning against changing to -O2.
Comment by Phil Labee [ 14/Mar/14 ]
Did you try to build gperftools out of our repo?
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
The following is not true:

Got myself centos 6.4. And with it's gcc and -O3 I'm finally able to reproduce issue.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
So I've got myself centos 6.4 and the _exact same compiler version_. And when I build tcmalloc myself with all the right flags and replace the tcmalloc from the package, it works. Without replacing it, it crashes.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Phil, please clean ccache, reboot the builder host (to clean the page cache) and _then_ do another rebuild. Looking at the build logs it looks like ccache is being used. So my suspicion about RAM corruption is not fully excluded yet. And I don't have many other ideas.
Comment by Phil Labee [ 14/Mar/14 ]
cleared ccache and restarted centos-6-x86-builder, centos-6-x64-builder

started build 2.5.1-1076
Comment by Pavel Paulau [ 14/Mar/14 ]
2.5.1-1076 seems to be working, it warns about "SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER" as well.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Maybe I'm doing something wrong but it fails in exact same way on my VM
Comment by Pavel Paulau [ 14/Mar/14 ]
Sorry, it crashed eventually.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Confirmed again. Everything is exactly the same as before. Build 1076 on centos 6.4 amd64 crashes very easily. Both enterprise edition and community. And it doesn't crash if I replace tcmalloc with the one I've built, from the exact same source, exact same flags and exact same compiler version.

Build 1071 doesn't crash. All of this 100% consistently.
Comment by Phil Labee [ 17/Mar/14 ]
possibly a difference in build environment

reference env is described in voltron README.md file

for centos-6 X64 (6.4 final) we use the defaults for these tools:


gcc-4.4.7-3.el6 ( 4.4.7-4 available)
gcc-c++-4.4.7-3 ( 4.4.7-4 available)
kernel-devel-2.6.32-358 ( 2.6.32-431.5.1 available)
openssl-devel-1.0.0-27.el6_4.2 ( 1.0.1e-16.el6_5.4 available)
rpm-build-4.8.0-32 ( 4.8.0-37 available)

these tools do not have an update:

scons-2.0.1-1
libtool-2.2.6-15.5

For all centos these specific versions are installed:

gcc, g++ 4.4, currently 4.4.7-3, 4.4.7-4 available
autoconf 2.65, currently 2.63-5 (no update available)
automake 1.11.1
libtool 2.4.2
Comment by Phil Labee [ 17/Mar/14 ]
downloaded gperftools-2.1.tar.gz from

    http://gperftools.googlecode.com/files/gperftools-2.1.tar.gz

and expanded into directory: gperftools-2.1

cloned https://github.com/couchbase/gperftools.git at commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

into directory gperftools, and compared:

=> diff -r gperftools-2.1 gperftools
Only in gperftools: .git
Only in gperftools: autogen.sh
Only in gperftools/doc: pprof.see_also
Only in gperftools/src/windows: TODO
Only in gperftools/src/windows: google

Only in gperftools-2.1: Makefile.in
Only in gperftools-2.1: aclocal.m4
Only in gperftools-2.1: compile
Only in gperftools-2.1: config.guess
Only in gperftools-2.1: config.sub
Only in gperftools-2.1: configure
Only in gperftools-2.1: depcomp
Only in gperftools-2.1: install-sh
Only in gperftools-2.1: libtool
Only in gperftools-2.1: ltmain.sh
Only in gperftools-2.1/m4: libtool.m4
Only in gperftools-2.1/m4: ltoptions.m4
Only in gperftools-2.1/m4: ltsugar.m4
Only in gperftools-2.1/m4: ltversion.m4
Only in gperftools-2.1/m4: lt~obsolete.m4
Only in gperftools-2.1: missing
Only in gperftools-2.1/src: config.h.in
Only in gperftools-2.1: test-driver
Comment by Phil Labee [ 17/Mar/14 ]
Since the build files in your source are different from those in the production build, we can't really say we're using the same source.

Please build from our repo and re-try your test.
Comment by Aleksey Kondratenko [ 17/Mar/14 ]
The difference is in autotools products. I _cannot_ build using the same autotools that are present on the build machine unless I'm given access to that box.

The _source_ is exactly the same.
Comment by Phil Labee [ 17/Mar/14 ]
I've given the versions of autotools to use, so you can bring your build environment in line with the production builds.

As a shortcut, I've submitted a request for a clone of the builder VM that you can experiment with.

See CBIT-1053
Comment by Wayne Siu [ 17/Mar/14 ]
The cloned builder is available. Info in CBIT-1053.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built tcmalloc from the exact copy in the builder directory.

Installed the package from inside the builder directory (build 1077). Verified that the problem exists. Stopped the service. Replaced tcmalloc. Observed that everything is fine.

Something in the environment is causing this. Like maybe unusual ldflags or something else. But _not_ the source.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built the full rpm package under the buildbot user, with the exact same make invocation as I see in the buildbot logs. And the resultant package works. Weird indeed.
Comment by Phil Labee [ 18/Mar/14 ]
some differences between test build and production build:


1) In gperftools, production calls "make install-exec-am install-data-am" while test calls "make install", which executes the extra step "all-am"

2) In ep-engine, production uses "make install" while test uses "make"

3) Test builds as user "root" while production builds as user "buildbot", so PATH and other env vars may be different.

In general it's hard to tell what steps were performed for the test build, as no output logfiles have been captured.
Comment by Wayne Siu [ 21/Mar/14 ]
Updated from Phil:
comment:
________________________________________

2.5.1-1082 was done without the tcmalloc flag: CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW

    http://review.couchbase.org/#/c/34755/


2.5.1-1083 was done with build step timeout increased from 60 minutes to 90

2.5.1-1084 was done with the tcmalloc flag restored:

    http://review.couchbase.org/#/c/34792/
Comment by Andrei Baranouski [ 23/Mar/14 ]
 2.5.1-1082 MB-10545 Vbucket map is not ready after 60 seconds
Comment by Meenakshi Goel [ 24/Mar/14 ]
A memcached crash with segmentation fault is observed with build 2.5.1-1084-rel on ubuntu 12.04 during Auto Compaction tests.

Jenkins Link:
http://qa.sc.couchbase.com/view/2.5.1%20centos/job/centos_x64--00_02--compaction_tests-P0/56/consoleFull

root@jackfruit-s12206:/tmp# gdb /opt/couchbase/bin/memcached core.memcached.8276
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /opt/couchbase/bin/memcached...done.
[New LWP 8301]
[New LWP 8302]
[New LWP 8599]
[New LWP 8303]
[New LWP 8604]
[New LWP 8299]
[New LWP 8601]
[New LWP 8600]
[New LWP 8602]
[New LWP 8287]
[New LWP 8285]
[New LWP 8300]
[New LWP 8276]
[New LWP 8516]
[New LWP 8603]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
Program terminated with signal 11, Segmentation fault.
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
298 src/central_freelist.cc: No such file or directory.
(gdb) t a a bt

Thread 15 (Thread 0x7f3568039700 (LWP 8603)):
#0 0x00007f356f01b9fa in __lll_unlock_wake () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f018104 in _L_unlock_644 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f018063 in pthread_mutex_unlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c663d6 in Mutex::release (this=0x5f68250) at src/mutex.cc:94
#4 0x00007f3569c9691f in unlock (this=<optimized out>) at src/locks.hh:58
#5 ~LockHolder (this=<optimized out>, __in_chrg=<optimized out>) at src/locks.hh:41
#6 fireStateChange (to=<optimized out>, from=<optimized out>, this=<optimized out>) at src/warmup.cc:707
#7 transition (force=<optimized out>, to=<optimized out>, this=<optimized out>) at src/warmup.cc:685
#8 Warmup::initialize (this=<optimized out>) at src/warmup.cc:413
#9 0x00007f3569c97f75 in Warmup::step (this=0x5f68258, d=..., t=...) at src/warmup.cc:651
#10 0x00007f3569c2644a in Dispatcher::run (this=0x5e7f180) at src/dispatcher.cc:184
#11 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5f68258) at src/dispatcher.cc:28
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 14 (Thread 0x7f356a705700 (LWP 8516)):
#0 0x00007f356ed0d83d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ed3b774 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f3569c65445 in updateStatsThread (arg=<optimized out>) at src/memory_tracker.cc:31
#3 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 13 (Thread 0x7f35703e8740 (LWP 8276)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e000, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e000, flags=<optimized out>) at event.c:1558
#3 0x000000000040c9e6 in main (argc=<optimized out>, argv=<optimized out>) at daemon/memcached.c:7996

Thread 12 (Thread 0x7f356c709700 (LWP 8300)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e280, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e280, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16814f8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 11 (Thread 0x7f356e534700 (LWP 8285)):
#0 0x00007f356ed348bd in read () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ecc8ff8 in _IO_file_underflow () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f356ecca03e in _IO_default_uflow () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f356ecbe18a in _IO_getline_info () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f356ecbd06b in fgets () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f356e535b19 in fgets (__stream=<optimized out>, __n=<optimized out>, __s=<optimized out>) at /usr/include/bits/stdio2.h:255
#6 check_stdin_thread (arg=<optimized out>) at extensions/daemon/stdin_check.c:37
#7 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 10 (Thread 0x7f356d918700 (LWP 8287)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
---Type <return> to continue, or q <return> to quit---

#1 0x00007f356db32176 in logger_thead_main (arg=<optimized out>) at extensions/loggers/file_logger.c:368
#2 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()

Thread 9 (Thread 0x7f3567037700 (LWP 8602)):
#0 SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:32
#1 0x00007f3569c6351c in lock (this=<optimized out>) at src/atomic.hh:282
#2 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#3 gimme (this=<optimized out>) at src/atomic.hh:396
#4 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#5 KVShard::getBucket (this=0x7a6e7c0, id=256) at src/kvshard.cc:58
#6 0x00007f3569c9231d in VBucketMap::getBucket (this=0x614a448, id=256) at src/vbucketmap.cc:40
#7 0x00007f3569c314ef in EventuallyPersistentStore::getVBucket (this=<optimized out>, vbid=256, wanted_state=<optimized out>) at src/ep.cc:475
#8 0x00007f3569c315f6 in EventuallyPersistentStore::firePendingVBucketOps (this=0x614a400) at src/ep.cc:488
#9 0x00007f3569c41bb1 in EventuallyPersistentEngine::notifyPendingConnections (this=0x5eb8a00) at src/ep_engine.cc:3474
#10 0x00007f3569c41d63 in EvpNotifyPendingConns (arg=0x5eb8a00) at src/ep_engine.cc:1182
#11 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f3565834700 (LWP 8600)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7e1c0) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7e204) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 7 (Thread 0x7f3566035700 (LWP 8601)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7fa40) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7fa84) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f356cf0a700 (LWP 8299)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e500, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e500, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x1681400) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f3567838700 (LWP 8604)):
#0 0x00007f356f01b89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f017065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f016eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c6635a in Mutex::acquire (this=0x5e7f890) at src/mutex.cc:79
#4 0x00007f3569c261f8 in lock (this=<optimized out>) at src/locks.hh:48
#5 LockHolder (m=..., this=<optimized out>) at src/locks.hh:26
---Type <return> to continue, or q <return> to quit---
#6 Dispatcher::run (this=0x5e7f880) at src/dispatcher.cc:138
#7 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5e7f898) at src/dispatcher.cc:28
#8 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f356af06700 (LWP 8303)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e780, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e780, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16817e0) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f3565033700 (LWP 8599)):
#0 0x00007f356ed18267 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f3569c13997 in SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:35
#2 0x00007f3569c63e57 in lock (this=<optimized out>) at src/atomic.hh:282
#3 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#4 gimme (this=<optimized out>) at src/atomic.hh:396
#5 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#6 KVShard::getVBucketsSortedByState (this=0x7a6e7c0) at src/kvshard.cc:75
#7 0x00007f3569c5d494 in Flusher::getNextVb (this=0x168d040) at src/flusher.cc:232
#8 0x00007f3569c5da0d in doFlush (this=<optimized out>) at src/flusher.cc:211
#9 Flusher::step (this=0x5ff7010, tid=21) at src/flusher.cc:152
#10 0x00007f3569c69034 in ExecutorThread::run (this=0x5e7e8c0) at src/scheduler.cc:159
#11 0x00007f3569c6963d in launch_executor_thread (arg=0x5ff7010) at src/scheduler.cc:36
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f356b707700 (LWP 8302)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8ea00, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8ea00, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16816e8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f356bf08700 (LWP 8301)):
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
#1 0x00007f356f23ef19 in tcmalloc::CentralFreeList::FetchFromSpansSafe (this=0x7f356f45d780) at src/central_freelist.cc:283
#2 0x00007f356f23efb7 in tcmalloc::CentralFreeList::RemoveRange (this=0x7f356f45d780, start=0x7f356bf07268, end=0x7f356bf07260, N=4) at src/central_freelist.cc:263
#3 0x00007f356f2430b5 in tcmalloc::ThreadCache::FetchFromCentralCache (this=0xf5d298, cl=9, byte_size=128) at src/thread_cache.cc:160
#4 0x00007f356f239fa3 in Allocate (this=<optimized out>, cl=<optimized out>, size=<optimized out>) at src/thread_cache.h:364
#5 do_malloc_small (size=128, heap=<optimized out>) at src/tcmalloc.cc:1088
#6 do_malloc_no_errno (size=<optimized out>) at src/tcmalloc.cc:1095
#7 (anonymous namespace)::cpp_alloc (size=128, nothrow=<optimized out>) at src/tcmalloc.cc:1423
#8 0x00007f356f249538 in tc_new (size=139867476842368) at src/tcmalloc.cc:1601
#9 0x00007f3569c2523e in Dispatcher::schedule (this=0x5e7f880,
    callback=<error reading variable: DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>, outtid=0x6127930, priority=...,
    sleeptime=<optimized out>, isDaemon=true, mustComplete=false) at src/dispatcher.cc:243
#10 0x00007f3569c84c1a in TapConnNotifier::start (this=0x6127920) at src/tapconnmap.cc:66
---Type <return> to continue, or q <return> to quit---
#11 0x00007f3569c42362 in EventuallyPersistentEngine::initialize (this=0x5eb8a00, config=<optimized out>) at src/ep_engine.cc:1415
#12 0x00007f3569c42616 in EvpInitialize (handle=0x5eb8a00,
    config_str=0x7f356bf07993 "ht_size=3079;ht_locks=5;tap_noop_interval=20;max_txn_size=10000;max_size=1491075072;tap_keepalive=300;dbname=/opt/couchbase/var/lib/couchbase/data/default;allow_data_loss_during_shutdown=true;backend="...) at src/ep_engine.cc:126
#13 0x00007f356cf0f86a in create_bucket_UNLOCKED (e=<optimized out>, bucket_name=0x7f356bf07b80 "default", path=0x7f356bf07970 "/opt/couchbase/lib/memcached/ep.so", config=<optimized out>,
    e_out=<optimized out>, msg=0x7f356bf07560 "", msglen=1024) at bucket_engine.c:711
#14 0x00007f356cf0faac in handle_create_bucket (handle=<optimized out>, cookie=0x5e4bc80, request=<optimized out>, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2168
#15 0x00007f356cf10229 in bucket_unknown_command (handle=0x7f356d1171c0, cookie=0x5e4bc80, request=0x5e44000, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2478
#16 0x0000000000412c35 in process_bin_unknown_packet (c=<optimized out>) at daemon/memcached.c:2911
#17 process_bin_packet (c=<optimized out>) at daemon/memcached.c:3238
#18 complete_nread_binary (c=<optimized out>) at daemon/memcached.c:3805
#19 complete_nread (c=<optimized out>) at daemon/memcached.c:3887
#20 conn_nread (c=0x5e4bc80) at daemon/memcached.c:5744
#21 0x0000000000406e45 in event_handler (fd=<optimized out>, which=<optimized out>, arg=0x5e4bc80) at daemon/memcached.c:6012
#22 0x00007f356fd9948c in event_process_active_single_queue (activeq=<optimized out>, base=<optimized out>) at event.c:1308
#23 event_process_active (base=<optimized out>) at event.c:1375
#24 event_base_loop (base=0x5e8ec80, flags=<optimized out>) at event.c:1572
#25 0x0000000000415584 in worker_libevent (arg=0x16815f0) at daemon/thread.c:301
#26 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#27 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#28 0x0000000000000000 in ?? ()
(gdb)
Comment by Aleksey Kondratenko [ 25/Mar/14 ]
Yesterday I took that consistently failing ubuntu build and played with it on my box.

It is exactly the same situation. Replacing libtcmalloc.so makes it work.

So I've spent the afternoon running what's in our actual package under a debugger.

I found several pieces of evidence that some object files linked into the libtcmalloc.so that we ship were built with -DTCMALLOC_SMALL_BUT_SLOW and some _were_ not.

That explains the weird crashes.

I'm unable to explain how it's possible that our builders produced such .so files. Yet.

Gut feeling is that it might be:

* something caused by ccache

* perhaps not full cleanup between builds

In order to verify that, I'm asking for the following (a rough sketch of the commands follows the list):

* do a build with ccache completely disabled but with the define still set

* do git clean -xfd inside the gperftools checkout before doing the build
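Roughly, on the builder that means something like this (a sketch; CCACHE_DISABLE is ccache's standard off switch, and the checkout path is taken from the build log quoted earlier):

    # disable ccache for this build and scrub the gperftools checkout
    export CCACHE_DISABLE=1
    cd /home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools
    git clean -xfd
    # then run the normal build with CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW and compare the resulting libtcmalloc.so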

Comment by Phil Labee [ 29/Jul/14 ]
The failure was detected by

    http://qa.sc.couchbase.com/job/centos_x64--00_02--compaction_tests-P0/

Can I run this test on a 3.0.0 build to see if this bug still exists?
Comment by Meenakshi Goel [ 30/Jul/14 ]
Started a run with latest 3.0.0 build 1057.
http://qa.hq.northscale.net/job/centos_x64--44_01--auto_compaction_tests-P0/37/console

However, we haven't seen such crashes with compaction tests during 3.0.0 testing.
Comment by Meenakshi Goel [ 30/Jul/14 ]
Tests passed with 3.0.0-1057-rel.
Comment by Wayne Siu [ 31/Jul/14 ]
Pavel also helped verify that this is not an issue in 3.0 (3.0.0-1067).
Comment by Wayne Siu [ 31/Jul/14 ]
Reopening for 2.5.x.
Comment by Aleksey Kondratenko [ 01/Aug/14 ]
We still have _exactly_ the same problem as in 2.5.1. Enabling -DTCMALLOC_SMALL_BUT_SLOW causes mis-compilation. And this is _not_ an upstream bug.
Comment by Aleksey Kondratenko [ 01/Aug/14 ]
Looking at build log here: http://builds.hq.northscale.net:8010/builders/centos-6-x64-300-builder/builds/1111/steps/couchbase-server%20make%20enterprise%20/logs/stdio

I see that just a few files were rebuilt with the new define. And the previous build did not have CPPFLAGS set to -DTCMALLOC_SMALL_BUT_SLOW.

So at least in this case I'm adamant that the builder did not rebuild tcmalloc when it should have.
Comment by Phil Labee [ 01/Aug/14 ]
Check to see if ccache is causing the failure to rebuild components under changed configure settings.
Comment by Aleksey Kondratenko [ 01/Aug/14 ]
Build logs indicate that it is unrelated to ccache, i.e. lots of files are not getting built at all.
Comment by Chris Hillery [ 01/Aug/14 ]
The quick-n-dirty solution would be to delete the buildslave directories for the next build. I would do that now, but there is a build on-going so possibly Phil has already taken care of it. If this build (1091) doesn't work, then we'll clean the world and try again.
Comment by Chris Hillery [ 01/Aug/14 ]
build 1091 is wrapping up, and visually it doesn't appear that gperftools got recompiled. I am waiting for each builder to finish and deleting the buildslave directories. When that is done I'll start a new build.
Comment by Chris Hillery [ 01/Aug/14 ]
build 1092 should start shortly.
Comment by Chris Hillery [ 01/Aug/14 ]
build 1092 is wrapping up (bits are already on latestbuilds I believe). Please test.




[MB-11985] couchbase server 2.x.x fails to start on a server that was upgraded from 2.x to 3.0.0 and then had couchbase server 3.0.0 uninstalled Created: 18/Aug/14  Updated: 18/Aug/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 2.5.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows server 2008 R2 64-bit

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Install couchbase server 2.x.x on a windows server 2008 R2 64-bit
Upgrade couchbase server to 3.0.0-1159
Uninstall this couchbase server 3.0.0
Install any couchbase server 2.x.x on this server; couchbase server 2.x.x will not start.

 Comments   
Comment by Thuan Nguyen [ 18/Aug/14 ]
This will block all windows upgrade jobs from 2.x.x to 3.0.0




[MB-11623] test for performance regressions with JSON detection Created: 02/Jul/14  Updated: 19/Aug/14

Status: In Progress
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Matt Ingenthron Assignee: Thomas Anderson
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: 0h
Time Spent: 120h
Original Estimate: Not Specified

Attachments: File JSONDoctPerfTest140728.rtf     File JSONPerfTestV3.uos    
Issue Links:
Relates to
relates to MB-11675 20-30% performance degradation on app... Closed

 Description   
Related to one of the changes in 3.0, we need to test what has been implemented to see if a performance regression or unexpected resource utilization has been introduced.

In 2.x, all JSON detection was handled at the time of persistence. Since persistence was done in batch and in background, with the then current document, it would limit the resource utilization of any JSON detection.

Starting in 3.x, with the datatype/HELLO changes introduced (and currently disabled), the JSON detection has moved to both memcached and ep-engine, depending on the type of mutation.

Just to paint the reason this is a concern, here's a possible scenario.

Imagine a cluster node that is happily accepting 100,000 sets/s for a given small JSON document, and it accounts for about 20mbit of the network (small enough to not notice). That node has a fast SSD at about 8k IOPS. That means that we'd only be doing JSON detection some 5000 times per second with Couchbase Server 2.x

With the changes already integrated, that JSON detection may be tried over 100k times/s. That's a 20x increase. The detection needs to occur somewhere other than on the persistence path, as the contract between DCP and view engine is such that the JSON detection needs to occur before DCP transfer.

This request is to test/assess if there is a performance change and/or any unexpected resource utilization when having fast mutating JSON documents.

I'll leave it to the team to decide what the right test is, but here's what I might suggest.

With a view defined create a test that has a small to moderate load at steady state and one fast-changing item. Test it with a set of sizes and different complexity. For instance, permutations that might be something like this:
non-JSON of 1k, 8k, 32k, 128k
simple JSON of 1k, 8k, 32k, 128k
complex JSON of 1k, 8k, 32k, 128k
metrics to gather:
throughput, CPU utilization by process, RSS by process, memory allocation requests by process (or minor faults or something)

Hopefully we won't see anything to be concerned with, but it is possible.

There are options to move JSON detection to somewhere later in processing (i.e., before DCP transfer) or other optimization thoughts if there is an issue.

 Comments   
Comment by Cihan Biyikoglu [ 07/Jul/14 ]
This is no longer needed for 3.0, is that right? Ready to postpone to 3.0.1?
Comment by Pavel Paulau [ 07/Jul/14 ]
HELLO-based negotiation was disabled but detection still happens in ep-engine.
We need to understand impact before 3.0 release. Sooner than later.
Comment by Matt Ingenthron [ 23/Jul/14 ]
I'm curious Thomas, when you say "increase in bytes appended", do you mean for the same workload the RSS is larger in the 'increase' case? Great to see you making progress.
Comment by Wayne Siu [ 24/Jul/14 ]
Pasted comment from Thomas:
Subject: Re: Couchbase Issues: (MB-11623) test for performance regressions with JSON detection
Yes, ~20% increase from 2.5.1 to 3.0 for the same load generator, as reported by the CB server for the same input load. I’m verifying and ‘isolating’. Will also be looking at if/how this contributes to the replication load increase (20% on top of a 20% increase …)
The issues seem related. Same increase for 1K, 8K, 16K and 32K with some variance.
—thomas
Comment by Thomas Anderson [ 29/Jul/14 ]
initial results using JSON document load test.
Comment by Matt Ingenthron [ 29/Jul/14 ]
Tom: saw your notes in the work log, out of curiosity, what was deferred to 3.0.1? Also, from the comment above, 20% increase in what?
Comment by Anil Kumar [ 13/Aug/14 ]
Thomas - As discussed please update the ticket with % or regression it has caused with JSON detection now in memcached. I will open separate ticket to document it.
Comment by Thomas Anderson [ 19/Aug/14 ]
A comparison of non-JSON to JSON in 2.5.1 and 3.0.0-1105 showed statistically similar performance, i.e., the minimal overhead of handling a JSON document over a similar KV document stayed consistent from 2.5.1 to 3.0.0 pre-RC1. See the attached file JSONPerfTestV3.uos. To be re-run with the official RC1 candidate. The feature to load complex JSON documents is now modified to 4 levels of JSON complexity (for each document size in bytes): {simpleJSON:: 1 element-attribute value pair; smallJSON:: 10 elements - no array, no nesting; mediumJSON:: 100 elements - arrays & nesting; largeJSON:: 10000 elements, mix of element types}.

Note: the original seed of this issue was a detected performance issue with JSON documents, ~20-30%. The code/architectural change which caused this was deferred to 3.0.1. Additional modifications to the server to address simple append-mode performance degradation further lessened the question of whether the document type was the cause of the degradation. The tests did, however, show a positive change in compaction, i.e., 3.x compacts documents ~5-7% better than 2.5.1.

 
Comment by Thomas Anderson [ 19/Aug/14 ]
Re-run with build 1105. Regression comparing the same document size and same document load for non-JSON to simple-JSON.
2.5.1:: a 1024-byte document, 10 loaders, 1.25M documents for non-JSON to JSON showed a < 4% performance degradation; 3.0:: shows a < 3% degradation. Many other factors seem to dominate.
Comment by Matt Ingenthron [ 19/Aug/14 ]
Just for the comments here, the original seed wasn't an observed performance regression but rather an architectural concern that there could be a space/CPU/throughput cost for the new JSON detection. That's why I opened it.




[MB-11048] Range queries result in thousands of GET operations/sec Created: 05/May/14  Updated: 18/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
The benchmark for range queries demonstrated very high latency. At the same time I noticed an extremely high rate of GET operations.

Even a single query such as "SELECT name.f.f.f AS _name FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000 LIMIT 20" led to hundreds of memcached reads.

Explain:

https://gist.github.com/pavel-paulau/5e90939d6ab28034e3ed

Engine output:

https://gist.github.com/pavel-paulau/b222716934dfa3cb598e

I don't like to use JIRA as a forum, but why does this happen? Do you fetch the entire range before returning the limited output?

 Comments   
Comment by Gerald Sangudi [ 05/May/14 ]
Pavel,

Yes, the scan and fetch are performed before we do any LIMIT. This will be fixed in DP4, but it may not be easily fixable in DP3.

Can you please post the results of the following query:

SELECT COUNT(*) FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000

Thanks.
Comment by Pavel Paulau [ 05/May/14 ]
cbq> SELECT COUNT(*) FROM bucket-1 WHERE coins.f > 224.210000 AND coins.f < 448.420000
{
    "resultset": [
        {
            "$1": 2134
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "1"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "547.545767ms"
        }
    ]
}
Comment by Pavel Paulau [ 05/May/14 ]
Also it looks like we are leaking memory in this scenario.

Resident memory of cbq-engine grows very fast (several megabytes per second) and never goes down...




[MB-11007] Request for Get Multi Meta Call for bulk meta data reads Created: 30/Apr/14  Updated: 30/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Parag Agarwal Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: All


 Description   
Currently we support a per-key call for getMetaData. As a result, our verification requires a per-key fetch during the verification phase. This request is to support a bulk get-metadata call which can get us metadata per vbucket for all keys, or in batches. This would enhance our ability to verify metadata per document over time or after operations like rebalance, as it will be faster. If there is a better alternative, please recommend one.

Current Behavior

https://github.com/couchbase/ep-engine/blob/master/src/ep.cc

ENGINE_ERROR_CODE EventuallyPersistentStore::getMetaData(
                                                        const std::string &key,
                                                        uint16_t vbucket,
                                                        const void *cookie,
                                                        ItemMetaData &metadata,
                                                        uint32_t &deleted,
                                                        bool trackReferenced)
{
    (void) cookie;
    RCPtr<VBucket> vb = getVBucket(vbucket);
    if (!vb || vb->getState() == vbucket_state_dead ||
        vb->getState() == vbucket_state_replica) {
        ++stats.numNotMyVBuckets;
        return ENGINE_NOT_MY_VBUCKET;
    }

    int bucket_num(0);
    deleted = 0;
    LockHolder lh = vb->ht.getLockedBucket(key, &bucket_num);
    StoredValue *v = vb->ht.unlocked_find(key, bucket_num, true,
                                          trackReferenced);

    if (v) {
        stats.numOpsGetMeta++;

        if (v->isTempInitialItem()) { // Need bg meta fetch.
            bgFetch(key, vbucket, -1, cookie, true);
            return ENGINE_EWOULDBLOCK;
        } else if (v->isTempNonExistentItem()) {
            metadata.cas = v->getCas();
            return ENGINE_KEY_ENOENT;
        } else {
            if (v->isTempDeletedItem() || v->isDeleted() ||
                v->isExpired(ep_real_time())) {
                deleted |= GET_META_ITEM_DELETED_FLAG;
            }
            metadata.cas = v->getCas();
            metadata.flags = v->getFlags();
            metadata.exptime = v->getExptime();
            metadata.revSeqno = v->getRevSeqno();
            return ENGINE_SUCCESS;
        }
    } else {
        // The key wasn't found. However, this may be because it was previously
        // deleted or evicted with the full eviction strategy.
        // So, add a temporary item corresponding to the key to the hash table
        // and schedule a background fetch for its metadata from the persistent
        // store. The item's state will be updated after the fetch completes.
        return addTempItemForBgFetch(lh, bucket_num, key, vb, cookie, true);
    }
}



 Comments   
Comment by Venu Uppalapati [ 30/Apr/14 ]
The server supports the quiet CMD_GETQ_META call, which can be used on the client side to build a multi-getMeta call similar to the multiGet implementation.
Comment by Parag Agarwal [ 30/Apr/14 ]
Please point to a working example for this call
Comment by Venu Uppalapati [ 30/Apr/14 ]
Parag, you can find some relevant information on queuing requests using the quiet call at https://code.google.com/p/memcached/wiki/BinaryProtocolRevamped#Get,_Get_Quietly,_Get_Key,_Get_Key_Quietly
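A minimal client-side sketch of how such a bulk getMeta could be pipelined over the binary protocol (illustrative only; the CMD_GETQ_META opcode value 0xA1 is an assumption taken from ep-engine's protocol extensions and should be verified, and response/error handling is omitted):

import socket
import struct

GETQ_META = 0xA1  # quiet variant: the server sends nothing for misses (assumed opcode)
NOOP = 0x0A       # standard binary-protocol NOOP, used as a barrier to flush responses

def request(opcode, key, vbucket, opaque):
    # 24-byte memcached binary request header followed by the key (no extras, no value)
    header = struct.pack(">BBHBBHIIQ",
                         0x80,        # request magic
                         opcode,
                         len(key),    # key length
                         0, 0,        # extras length, data type
                         vbucket,
                         len(key),    # total body length
                         opaque,
                         0)           # CAS
    return header + key

def bulk_get_meta(sock, keys_with_vbuckets):
    buf = bytearray()
    for opaque, (key, vb) in enumerate(keys_with_vbuckets):
        buf += request(GETQ_META, key, vb, opaque)
    buf += request(NOOP, b"", 0, 0xFFFFFFFF)  # barrier: its reply guarantees all earlier responses arrived
    sock.sendall(buf)
    # ... then read responses until the NOOP's opaque is seen; quiet misses produce no response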
Comment by Chiyoung Seo [ 30/Apr/14 ]
Changing the fix version to the feature backlog, given that the 3.0 feature-complete date has already passed and this is requested for the QE testing framework.




[MB-10993] Cluster Overview - Usable Free Space documentation misleading Created: 29/Apr/14  Updated: 29/Apr/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Jim Walker Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: documentation
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Issue relates to:
 http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#viewing-cluster-summary

I was working through a support case and trying to explain the cluster overview free space and usable free space.

The following statement is from our documentation. After a code review of ns_server, I concluded that it is incorrect.

Usable Free Space:
The amount of usable space for storing information on disk. This figure shows the amount of space available on the configured path after non-Couchbase files have been taken into account.

The correct statement should be:

Usable Free Space:
The amount of usable space for storing information on disk. This figure is derived from the node with the least available storage in the cluster; the final value is that node's free space multiplied by the number of nodes in the cluster.


This change matters because users need to understand why Usable Free Space can be less than Free Space. The cluster considers all nodes to be equal. If you have a "weak" node in the cluster, e.g. one with a small disk, then all the cluster nodes have to keep their storage under the weaker node's limits; otherwise, for example, we could never fail over to the weak node, because it could not take on the job of a stronger node. When Usable Free Space is less than Free Space, the user may actually want to investigate why a node has less storage available.
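A minimal sketch of the corrected calculation (illustrative only, not the actual ns_server code; node names and numbers are made up):

node_free_space_gb = {"node1": 500, "node2": 480, "node3": 120}  # node3 is the "weak" node

free_space_gb = sum(node_free_space_gb.values())                                    # 1100 GB, the raw sum
usable_free_space_gb = min(node_free_space_gb.values()) * len(node_free_space_gb)   # 120 * 3 = 360 GB

# Usable Free Space (360 GB) is much lower than Free Space (1100 GB) because
# every node must stay within the limits of the weakest node.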




[MB-10944] Support of stale=false queries Created: 23/Apr/14  Updated: 18/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3, cbq-DP4
Fix Version/s: cbq-DP3
Security Level: Public

Type: Story Priority: Critical
Reporter: Pavel Paulau Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
stale=false queries in the view engine are not truly consistent, but they are critical for competitive benchmarking.

 Comments   
Comment by Gerald Sangudi [ 23/Apr/14 ]
Manik,

Please add a -stale parameter to the REST API for cbq-engine. The parameter should accept true, false, and update-after as values.

Please include this fix in the DP3 bugfix release.

Thanks.




[MB-10920] unable to start tuq if there are no buckets Created: 22/Apr/14  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
The node is initialized but has no buckets:
[root@kiwi-r116 tuqtng]# ./tuqtng -couchbase http://localhost:8091
10:26:56.520415 Info line disabled false
10:26:56.522641 FATAL: Unable to run server, err: Unable to access site http://localhost:8091, err: HTTP error 401 Unauthorized getting "http://localhost:8091/pools": -- main.main() at main.go:76




[MB-10834] update the license.txt for enterprise edition for 2.5.1 Created: 10/Apr/14  Updated: 19/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Microsoft Word 2014-04-07 EE Free Clickthru Breif License.docx    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
document attached.

 Comments   
Comment by Phil Labee [ 10/Apr/14 ]
2.5.1 has already been shipped, so this file can't be included.

Is this for 3.0.0 release?
Comment by Phil Labee [ 10/Apr/14 ]
voltron commit: 8044c51ad7c5bc046f32095921f712234e74740b

uses the contents of the attached file to update LICENSE-enterprise.txt on the master branch.




[MB-10823] Log failed/successful login with source IP to detect brute force attacks Created: 10/Apr/14  Updated: 18/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Cihan Biyikoglu [ 18/Jun/14 ]
http://www.couchbase.com/issues/browse/MB-11463 for covering ports 11209 or 11211.




[MB-10821] optimize storage of larger binary object in couchbase Created: 10/Apr/14  Updated: 10/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10084] Sub-Task: Changes required for Data Encryption in Client SDK's Created: 30/Jan/14  Updated: 28/May/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Andrei Baranouski
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
depends on JCBC-441 add SSL support in support of Couchba... Open
depends on CCBC-344 add support for SSL to libcouchbase i... Resolved
depends on NCBC-424 Add SSL support in support of Couchba... Resolved

 Description   
Changes required for Data Encryption in Client SDK's

 Comments   
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Wanted to make sure we agree this will be in 3.0. Matt, any concerns?
Thanks
Comment by Matt Ingenthron [ 20/Mar/14 ]
This should be closed in favor of the specific project issues. That said, the description is a bit fuzzy. Is this SSL support for memcached && views && any cluster management?

Please clarify and then we can open specific issues. It'd be good to have a link to functional requirements.
Comment by Matt Ingenthron [ 20/Mar/14 ]
And Cihan: it can't be "in 3.0", unless you mean a concurrent release or a release prior to 3.0 GA. Is that what you mean? I'd actually aim to have this feature supported in SDKs prior to 3.0's release, and we are working on it right now, though it has some other dependencies. See CCBC-344, for example.
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Thanks Matt. I meant the 3.0-paired client SDK release, so prior or shortly after is all good for me.
Context: we are doing a pass to clean up JIRA and would like to button up what's in and out for 3.0.
Comment by Cihan Biyikoglu [ 24/Mar/14 ]
Matt, is there a client-side reference implementation you did for this one? It would be good to pass that on to the test folks for initial validation until you completely integrate, so no regressions creep up while we march to GA.
Thanks
Comment by Matt Ingenthron [ 24/Mar/14 ]
We did verification with a non-mainline client, since that was the quickest way to do so, and have provided that to QE. Also, Brett filed a bug around HTTPS with ns_server and streaming configuration replies. See MB-10519.

We'll do a mainline client with libcouchbase and the Python client as soon as its dependency for handling packet IO is done. This is under CCBC-298 and CCBC-301, among others.




[MB-10003] [Port-configurability] Non-root instances and multiple sudo instances in a box cannot be 'offline' upgraded Created: 24/Jan/14  Updated: 27/Mar/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Unix/Linux


 Description   
Scenario
------------
As of today, we do not support offline 'upgrade' per se for packages installed as non-root/sudo users. Upgrades are usually handled by package managers. Since these are absent for non-root users, and rpm cannot handle more than a single package upgrade (if there are many instances running), offline upgrades are not supported (confirmed with Bin).

ALL non-root installations are affected by this limitation. Although a single instance running on a box under a sudo user can be offline upgraded, this cannot be extended to more than one such instance.

This is important

Workaround
-----------------
- Online upgrade (swap with nodes running latest build, take old nodes down and do clean install)
- Backup data and restore after fresh install (cbbackup and cbrestore)

Note: at this point these are mere suggestions; neither workaround has been tested yet.




[MB-10146] Document editor overwrites precision of long numbers Created: 06/Feb/14  Updated: 09/May/14

Status: Reopened
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Perry Krug Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged

 Description   
Just tested this out, not sure what diagnostics to capture so please let me know.

Simple test case:
-Create new document via document editor in UI
-Document contents are:
{"id": 18446744072866779556}
-As soon as you save, the above number is rewritten to:
{
  "id": 18446744072866780000
}
-The same effect occurs if you edit a document that was inserted with the above "long" number

 Comments   
Comment by Aaron Miller (Inactive) [ 06/Feb/14 ]
It's worth noting that views will always suffer from this, as it is a limitation of JavaScript in general. Many JSON libraries have this behavior as well (even though they don't *have* to).
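A small sketch of why the precision is lost (illustrative only; JavaScript stores every number as an IEEE-754 double, which can only represent integers exactly up to 2^53):

original = 18446744072866779556

print(original > 2**53)           # True: beyond the exactly-representable integer range of a double
rounded = int(float(original))    # what a double (and hence a JS engine) actually stores
print(rounded == original)        # False: the low-order digits have been rounded away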
Comment by Aleksey Kondratenko [ 11/Apr/14 ]
Cannot fix it, so just closing. If you want to reopen, please pass it to somebody responsible for overall design.
Comment by Perry Krug [ 11/Apr/14 ]
Reopening and assigning to docs, we need this to be release noted IMO.
Comment by Ruth Harris [ 14/Apr/14 ]
Reassigning to Anil. He makes the call on what we put in the release notes for known and fixed issues.
Comment by Anil Kumar [ 09/May/14 ]
Ruth - Let's release note this for 3.0.




[MB-11346] Audit logs for User/App actions Created: 06/Jun/14  Updated: 07/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Couchbase Server should be able to produce audit logs for all user/app actions, such as login/logout events, mutations, and other bucket and security changes.






[MB-11314] Enhanced Authentication model for Couchbase Server for Administrators, Users and Applications Created: 04/Jun/14  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Couchbase Server will add support for authentication using various techniques, for example Kerberos, LDAP, etc.







[MB-11282] Separate stats for internal memory allocation (application vs. data) Created: 02/Jun/14  Updated: 02/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Story Priority: Critical
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
AFAIK we currently track allocations for data and the application together.

But sometimes the application (memcached / ep-engine) overhead is huge and cannot be ignored.




[MB-11250] Go-Couchbase: Provide DML APIs using CAS Created: 29/May/14  Updated: 18/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-11247] Go-Couchbase: Use password to connect to SASL buckets Created: 29/May/14  Updated: 19/Jun/14  Due: 30/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Gerald Sangudi Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Gerald Sangudi [ 19/Jun/14 ]
https://github.com/couchbaselabs/query/blob/master/docs/n1ql-authentication.md




[MB-11208] stats.org should be installed Created: 27/May/14  Updated: 27/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: techdebt-backlog
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
stats.org contains a description of the stats we're sending from ep-engine. It could be useful for people.

 Comments   
Comment by Matt Ingenthron [ 27/May/14 ]
If it's "useful" shouldn't this be part of official documentation? I've often thought it should be. There's probably a duplicate here somewhere.

I also think the stats need stability labels applied as people may rely on stats when building their own integration/monitoring tools. COMMITTED, UNCOMMITTED, VOLATILE, etc. would be useful for the stats.

Relatedly, someone should document deprecation of TAP stats for 3.0.




[MB-11195] Support binary collation for views Created: 23/May/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
N1QL would benefit significantly if we could allow memcmp() collation for the views it creates. So much so that we should consider this for a minor release after 3.0, so that it can be available for the N1QL beta.




[MB-11192] Snooze for 1 second during the backfill task is causing significant pauses during backup Created: 23/May/14  Updated: 24/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Critical
Reporter: Daniel Owen Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: customer, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: cbbackup --single-node
Data all memory resident.

Attachments: PNG File dropout-screenshot.png     PNG File IOThroughput-magnified.png     PNG File ThroughputGraphfromlocalhostport11210.png    
Issue Links:
Duplicate

 Description   
When performing a backup, the cbbackup process repeatedly stalls waiting on the socket for data. This can be seen in the uploaded graphs; the uploaded tcpdump output also shows the delay.

Setting the backfill/tap queue snooze to zero makes the issue go away,
i.e. modifying the sleep to zero in the function VBCBAdaptor::VBCBAdaptor in ep-engine/src/ep.cc:

VBCBAdaptor::VBCBAdaptor(EventuallyPersistentStore *s,
                         shared_ptr<VBucketVisitor> v,
                         const char *l, double sleep) :
    store(s), visitor(v), label(l), sleepTime(sleep), currentvb(0)
{
sleepTime = 0.0; // workaround: override the configured snooze so the visitor never backs off
....

Description of the cause, provided by Abhinav:

We back off, or snooze, for 1 second during the backfill task when the size of the backfill/tap queue crosses the limit (which we set to 5000 as part of the initial configuration); we snooze for a second to wait for the items in the queue to drain.
What's happening here is that, since all the items are in memory, this queue fills up really fast, causing the queue size to hit the limit and thereby triggering the snooze.
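A minimal sketch of the back-off behaviour described above (illustrative pseudologic only, not the actual ep-engine scheduler; names are made up):

BACKFILL_QUEUE_LIMIT = 5000   # the limit from the initial configuration mentioned above
SNOOZE_SECONDS = 1.0

def next_sleep(queue_size):
    # When the backfill/tap queue is over the limit, the task snoozes for a second
    # to let the queue drain; with fully memory-resident data the queue refills
    # almost instantly, so this branch is hit repeatedly and the backup stalls in ~1s steps.
    if queue_size >= BACKFILL_QUEUE_LIMIT:
        return SNOOZE_SECONDS
    return 0.0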




[MB-11171] mem_used stat exceeds the bucket memory quota in extremely heavy DGM and highly overloaded cluster Created: 20/May/14  Updated: 21/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.1.0, 2.2.0, 2.1.1, 2.5.0, 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This issue was reported by one of our customers. Their cluster was in extremely heavy DGM (resident ratio near zero in both active and replica vbuckets) and was highly overloaded when this memory bloating issue happened.

From the logs, we saw that the number of memcached connections spiked from 300 to 3K during the period of the memory issue. However, we have not yet been able to correlate the increased number of connections with the memory bloating; we plan to keep investigating this issue by running similar workload tests.





[MB-11102] extended documentation about stats flowing out of CBSTATS and the correlation between them Created: 12/May/14  Updated: 12/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Update the documentation about the stats flowing out of CBSTATS and the correlation between them. We need this to be able to accurately predict capacity and other bottlenecks, as well as detect trends.




[MB-11100] Ability to shutoff disk persistence for Couchbase bucket and still have replication, failover and other Couchbase bucket features Created: 12/May/14  Updated: 13/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by MB-8714 introduce vbucket based cache bucket ... Resolved

 Description   
Ability to shut off disk persistence for a Couchbase bucket and still have replication, failover, and other Couchbase bucket features.




[MB-11101] supported go SDK for couchbase server Created: 12/May/14  Updated: 16/Jun/14

Status: Open
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Matt Ingenthron
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
go client




[MB-11098] Ability to set block size written to storage for better alignment with SSDs and/or HDDs for better throughput performance Created: 12/May/14  Updated: 12/May/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Ability to set block size written to storage for better alignment with SSDs and/or HDDs for better throughput performance




[MB-10789] Bloom Filter based optimization to reduce the I/O overhead Created: 07/Apr/14  Updated: 07/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
A Bloom filter can be considered an optimization to reduce the disk I/O overhead. Basically, we maintain a separate Bloom filter per vbucket database file, and rebuild the Bloom filter (e.g., increasing the filter size to reduce the false positive error rate) as part of vbucket database compaction.

As we know the number of items in a vbucket database file, we can determine the number of hash functions and the size of the Bloom filter needed to achieve the desired false positive error rate. Note that Murmur hash has been widely used in Hadoop and Cassandra because it is much faster than MD5 and Jenkins. It is widely known that fewer than 10 bits per element are required for a 1% false positive probability, independent of the number of elements in the set.

We expect that having a Bloom filter will enhance both XDCR and full-ejection cache management performance, at the expense of the filter's memory overhead.
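For reference, a minimal sizing sketch using the standard Bloom filter formulas (illustrative only, not ep-engine code):

import math

def bloom_parameters(n_items, false_positive_rate):
    # total bits: m = -n * ln(p) / (ln 2)^2; hash functions: k = (m / n) * ln 2
    m = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

bits, hashes = bloom_parameters(1_000_000, 0.01)   # e.g. 1M items per vbucket file at a 1% FPR
print(bits / 1_000_000, hashes)                    # roughly 9.6 bits per element and 7 hash functions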






[MB-10790] Transaction log support for individual mutations Created: 07/Apr/14  Updated: 07/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
There is always a time window in which we can lose a given mutation from an application, because we do both persistence and replication asynchronously. To address this limitation, we need to consider supporting a transaction (commit) log for individual mutations from applications, and later extend it to support full transactions on multiple documents across different nodes.


 Comments   
Comment by Matt Ingenthron [ 07/Apr/14 ]
+1

There were some earlier thoughts on how to accomplish this that I can share if it'd be useful.
Comment by Chiyoung Seo [ 07/Apr/14 ]
Thanks Matt. Please feel free to share them with me. We can schedule a separate meeting if necessary.
Comment by Matt Ingenthron [ 07/Apr/14 ]
Sure, this week is bad, but want to grab 30 mins next week?
Comment by Chiyoung Seo [ 07/Apr/14 ]
Sure, I will then schedule a meeting sometime next week. Thanks!




[MB-10767] DOC: Misc - DITA conversion Created: 04/Apr/14  Updated: 04/Apr/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Ruth Harris Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10718] Change Capture API and 3rd party consumable Created: 01/Apr/14  Updated: 02/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-10716] SSD IO throughput optimizations Created: 01/Apr/14  Updated: 01/Apr/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
forestdb work




[MB-10662] _all_docs is no longer supported in 3.0 Created: 27/Mar/14  Updated: 01/May/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-10649 _all_docs view queries fails with err... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
As of 3.0, the view engine will no longer support the special predefined view, _all_docs.

It was not a published feature, but as it has been around for a long time, it is possible it was actually used in some setups.

We should document that _all_docs queries will not work in 3.0.

 Comments   
Comment by Cihan Biyikoglu [ 27/Mar/14 ]
Thanks. Are there internal tools depending on this? Do you know if we have deprecated this in the past? I realize it isn't a supported API, but I want to make sure we keep the door open for feedback during beta from large customers etc.
Comment by Perry Krug [ 28/Mar/14 ]
We have a few (very few) customers who have used this. They've known it is unsupported...but that doesn't ever really stop anyone if it works for them.

Do we have a doc describing what the proposed replacement will look like and will that be available for 3.0?
Comment by Ruth Harris [ 01/May/14 ]
_all_docs is not mentioned anywhere in the 2.2+ documentation. Not sure how to handle this. It's not deprecated because it was never intended for use.
Comment by Perry Krug [ 01/May/14 ]
I think at the very least a prominent release note is appropriate.




[MB-10651] The guide for install user defined port doesn't work for Rest port change Created: 26/Mar/14  Updated: 17/Jun/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Larry Liu Assignee: Aruna Piravi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
http://docs.couchbase.com/couchbase-manual-2.5/cb-install/#install-user-defined-ports

I followed the instructions to change the admin port (REST port):
append to the /opt/couchbase/etc/couchbase/static_config file:
{rest_port, 9000}.

[root@localhost bin]# netstat -an| grep 9000
[root@localhost bin]# netstat -an| grep :8091
tcp 0 0 0.0.0.0:8091 0.0.0.0:* LISTEN

logs:
https://s3.amazonaws.com/customers.couchbase.com/larry/output.zip

Larry



 Comments   
Comment by Larry Liu [ 26/Mar/14 ]
The log files shows that the change was taken by server:

[ns_server:info,2014-03-26T19:13:24.063,nonode@nohost:<0.58.0>:ns_server:log_pending:30]Static config terms:
[{error_logger_mf_dir,"/opt/couchbase/var/lib/couchbase/logs"},
 {error_logger_mf_maxbytes,10485760},
 {error_logger_mf_maxfiles,20},
 {path_config_bindir,"/opt/couchbase/bin"},
 {path_config_etcdir,"/opt/couchbase/etc/couchbase"},
 {path_config_libdir,"/opt/couchbase/lib"},
 {path_config_datadir,"/opt/couchbase/var/lib/couchbase"},
 {path_config_tmpdir,"/opt/couchbase/var/lib/couchbase/tmp"},
 {nodefile,"/opt/couchbase/var/lib/couchbase/couchbase-server.node"},
 {loglevel_default,debug},
 {loglevel_couchdb,info},
 {loglevel_ns_server,debug},
 {loglevel_error_logger,debug},
 {loglevel_user,debug},
 {loglevel_menelaus,debug},
 {loglevel_ns_doctor,debug},
 {loglevel_stats,debug},
 {loglevel_rebalance,debug},
 {loglevel_cluster,debug},
 {loglevel_views,debug},
 {loglevel_mapreduce_errors,debug},
 {loglevel_xdcr,debug},
 {rest_port,9000}]
Comment by Aleksey Kondratenko [ 17/Apr/14 ]
This is because the rest_port entry in static_config is only taken into account on a fresh install.

There's a way to install our package without starting the server first, and that has to be documented. I don't know who owns working with the docs people.
Comment by Anil Kumar [ 09/May/14 ]
Alk - Before it gets to documentation we need to test it and verify the instructions. Can you provide those instructions and assign this ticket to Aruna to test it.
Comment by Anil Kumar [ 03/Jun/14 ]
Alk - can you provide those instructions and assign this ticket to Aruna to test it.
Comment by Aleksey Kondratenko [ 04/Jun/14 ]
The instructions fail to mention the fact that rest_port must be changed before config.dat is written, and config.dat is initialized on the first server start.

There's some way to install the server without starting it.

But here's what I managed to do:

# dpkg -i ~/Desktop/forReview/couchbase-server-enterprise_ubuntu_1204_x86_2.5.1-1086-rel.deb

# /etc/init.d/couchbase-server stop

# rm /opt/couchbase/var/lib/couchbase/config/config.dat

# emacs /opt/couchbase/etc/couchbase/static_config

# /etc/init.d/couchbase-server start

I.e. I stopped the service, removed config.dat, edited static_config, then started it back up and found the REST port to be updated.
Comment by Anil Kumar [ 04/Jun/14 ]
Thanks Alk. Assigning this to Aruna for verification and later please assign this ticket to Documentation (Ruth).




[MB-10531] No longer necessary to wait for persistence to issue stale=false query Created: 21/Mar/14  Updated: 25/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Sriram Melkote Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Matt pointed out that in the past, we had to wait for an item to persist to disk before issuing a stale=false query to get correct results. In 3.0, this is not necessary. One can issue a stale=false view query at any time, and the results will reflect all changes that had been made when the query was issued. This task is a placeholder to update the 3.0 docs to remove the unnecessary step of waiting for persistence.

 Comments   
Comment by Matt Ingenthron [ 21/Mar/14 ]
Correct. Thanks for making sure this is raised, Siri. While I'm thinking of it, two points need to be in there:
1) if you have older code, you will need to change it to take advantage of the semantic change to the query
2) application developers still need to be a bit careful to ensure any modifications being done aren't async operations; they'll have to wait for the responses before doing the stale=false query
Comment by Anil Kumar [ 25/Mar/14 ]
This is for 3.0 documentation.
Comment by Sriram Melkote [ 25/Mar/14 ]
Not an improvement. This is a task.




[MB-10511] Feature request for supporting rolling downgrades Created: 19/Mar/14  Updated: 11/Apr/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Abhishek Singh Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to

 Description   
Some customers are interested in Couchbase supporting rolling downgrades. Currently we can't add 2.2 nodes inside a cluster that has all nodes on 2.5.




[MB-10512] Update documentation to convey we don't support rolling downgrades Created: 19/Mar/14  Updated: 27/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Abhishek Singh Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Update documentation to convey we don't support rolling downgrades to 2.2 once all nodes are running on 2.5




[MB-10469] Support Couchbase Server on SuSE linux platform Created: 14/Mar/14  Updated: 17/Apr/14

Status: Open
Project: Couchbase Server
Component/s: build, installer
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: SuSE linux platform

Issue Links:
Duplicate

 Description   
Add support for SuSE Linux platform




[MB-10431] Removed ep_expiry_window stat/engine_parameter Created: 11/Mar/14  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Mike Wiederhold Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This parameter is no longer needed since we require everything to be persisted. In the past it was used to skip persistence on items that would be expiring very soon.




[MB-10432] Removed ep_max_txn_size stat/engine_parameter Created: 11/Mar/14  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Mike Wiederhold Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This value is no longer used in the server. Please note that you need to update the documentation for cbepctl, since this stat could be set with that script.




[MB-10430] Add AWS AMI documentation to Installation and Upgrade Guide Created: 11/Mar/14  Updated: 25/Mar/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Brian Shumate Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
It would be useful to have some basic installation instructions for those who want to use the Couchbase Server Amazon Machine Images (AMIs) directly, without RightScale.

This is particularly with regard to the special case of the Administrator user and password, which can become a stumbling point for some users.


 Comments   
Comment by Anil Kumar [ 25/Mar/14 ]
Ruth - Please add reference to Couchbase on AWS Whitepaper - http://aws.typepad.com/aws/2013/08/running-couchbase-on-aws-new-white-paper.html that has all the information.




[MB-10379] index is not used for simple query Created: 06/Mar/14  Updated: 28/May/14  Due: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: 2.5.0
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Iryna Mironava
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 64-bit

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
I created an index on the name field of bucket b0 (my_name) and then the my_skill index for b0:
cbq> select * from :system.indexes
{
    "resultset": [
        {
            "bucket_id": "b0",
            "id": "#alldocs",
            "index_key": [
                "META().id"
            ],
            "index_type": "view",
            "name": "#alldocs",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
        {
            "bucket_id": "b0",
            "id": "my_name",
            "index_key": [
                "name"
            ],
            "index_type": "view",
            "name": "my_name",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
       {
            "bucket_id": "b0",
            "id": "my_skill",
            "index_key": [
                "skills"
            ],
            "index_type": "view",
            "name": "my_skill",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
        {
            "bucket_id": "b1",
            "id": "#alldocs",
            "index_key": [
                "META().id"
            ],
            "index_type": "view",
            "name": "#alldocs",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        },
        {
            "bucket_id": "default",
            "id": "#alldocs",
            "index_key": [
                "META().id"
            ],
            "index_type": "view",
            "name": "#alldocs",
            "pool_id": "default",
            "site_id": "http://localhost:8091"
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "4"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "1.185438ms"
        }
    ]
}

I can see my view in the UI and I can query it,
but EXPLAIN says I am still using #alldocs:

cbq> explain select name from b0
{
    "resultset": [
        {
            "input": {
                "as": "b0",
                "bucket": "b0",
                "ids": null,
                "input": {
                    "as": "",
                    "bucket": "b0",
                    "cover": false,
                    "index": "#alldocs",
                    "pool": "default",
                    "ranges": null,
                    "type": "scan"
                },
                "pool": "default",
                "projection": null,
                "type": "fetch"
            },
            "result": [
                {
                    "as": "name",
                    "expr": {
                        "left": {
                            "path": "b0",
                            "type": "property"
                        },
                        "right": {
                            "path": "name",
                            "type": "property"
                        },
                        "type": "dot_member"
                    },
                    "star": false
                }
            ],
            "type": "projector"
        }
    ],
    "info": [
        {
            "caller": "http_response:160",
            "code": 100,
            "key": "total_rows",
            "message": "1"
        },
        {
            "caller": "http_response:162",
            "code": 101,
            "key": "total_elapsed_time",
            "message": "1.236104ms"
        }
    ]
}
I see the same result for skills.


 Comments   
Comment by Sriram Melkote [ 07/Mar/14 ]
I think the current implementation considers secondary indexes only for filtering operations. When you do SELECT <anything> FROM <bucket>, it is a full bucket scan, and that is implemented by #alldocs and by #primary index only.

So the current behavior looks to be correct. Try running "CREATE PRIMARY INDEX USING VIEW" and please see if the query will then switch from #alldocs to #primary. Please also try adding a filter, like WHERE name > 'Mary' and see if the my_name index gets used for the filtering.

As a side note, what you're running is a covered query, where all the data necessary is held in a secondary index completely. However, this is not implemented. A secondary index is only used as an access path, and not as a source of data.
Comment by Gerald Sangudi [ 11/Mar/14 ]
This particular query will always use #primary or #alldocs. Even for documents without a "name" field, we return a result object that is missing the "name" field.

@Iryna, please test WHERE name IS NOT MISSING to see if it uses the index. If not, we'll fix that for DP4. Thanks.




[MB-10370] ep-engine deadlock in write-heavy DGM cases Created: 05/Mar/14  Updated: 06/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.2.0, 2.5.0, 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630
Memory = 64 GB
Disk = 2 x SSD

Attachments: JPEG File deadlock.jpeg    
Is this a Regression?: Yes

 Description   
This is not a new issue; we have discussed it many times.

In extremely write-heavy cases we overload servers: memory usage reaches 95% of the bucket quota, we eject all replica items... and eventually the system becomes unusable.

I'm creating this ticket because of XDCR. In 3.0 we can achieve very high XDCR throughput, so throttling it doesn't make sense. According to the PM team, some "users" deploy XDCR within the same data center, so this is quite a realistic scenario.

Feel free to close this ticket as a duplicate of existing bugs, though I didn't manage to find anything well-defined.

 Comments   
Comment by Cihan Biyikoglu [ 13/Mar/14 ]
I get that we should be able to recover from this, and will open another issue to ensure XDCR also behaves as a good citizen when the destination is under stress.
Comment by Pavel Paulau [ 18/Mar/14 ]
It's really hard to achieve <1% resident ratio because of this issue.
Comment by Pavel Paulau [ 03/Apr/14 ]
Just spotted the same issue in Sync Gateway performance test.
Comment by Maria McDuff (Inactive) [ 08/Apr/14 ]
bumping up to Test Blocker per bug scrub.
Comment by Cihan Biyikoglu [ 08/Apr/14 ]
Hi Pavel, are you blocked? lets mark this a test blocker if so.
Comment by Li Yang [ 08/Apr/14 ]
This issue is blocking the sync-gateway performance test. Even with a small workload of 5K users on one sync-gateway, connecting to a two-node Couchbase database, the test eventually failed with the memory deadlock.
Comment by Chiyoung Seo [ 09/Apr/14 ]
Li,

I just discussed this issue with Pavel. This is not a new issue in the current value-only cache ejection policy. Let's discuss this issue more tomorrow with Pavel.

Thanks,
Comment by Chiyoung Seo [ 11/Apr/14 ]
The major reason for this issue was that all the SET operations were new insertions, which eventually inserted too many items and caused the memory usage to reach the bucket memory quota, because we still maintain keys and metadata in cache for non-resident items. To address this architectural limitation of the value-only ejection policy, we added the full ejection feature in the 3.0 release, which ejects key, metadata, and value together from the cache.

I recommend using full ejection in 3.0; otherwise, if you want to use the value ejection policy in 2.5.1, increase the cluster capacity based on the total number of items inserted and the desired resident ratio.
Comment by Chiyoung Seo [ 11/Apr/14 ]
As you said, we don't have a good way of recovering from this case once it happens. The workaround is to increase the bucket memory quota if there is any available memory on the existing nodes, and then add new nodes to the cluster in order to increase its capacity.
Comment by Chiyoung Seo [ 06/Jun/14 ]
Moving it to post 3.0 as this issue is related to the current architectural limitation.




[MB-9890] xdcr should not log documents contents Created: 10/Jan/14  Updated: 20/May/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication, ns_server
Affects Version/s: 2.2.0, 2.1.1, 2.5.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
Created general ticket for this problem. See also https://www.couchbase.com/issues/browse/MB-9879

 Comments   
Comment by Aleksey Kondratenko [ 10/Jan/14 ]
This is replacement of MB-9879. Created to de-confuse things.
Comment by Junyi Xie (Inactive) [ 10/Jan/14 ]
Not sure the title is correct. XDCR used to dump the document body as part of the vb replicator info in the concurrency throttle, which was fixed in MB-9879.
Comment by Maria McDuff (Inactive) [ 20/May/14 ]
Alk, I am confused about this ticket. MB-9879 is resolved/closed.
Can this ticket be closed?




[MB-9883] High CPU utilization of SSL proxy (both source and dest) Created: 09/Jan/14  Updated: 17/Apr/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 2.5.0-1032

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630
Memory = 64 GB
Disk = 2 x SSD

Attachments: JPEG File cpu_agg_nossl.jpeg     JPEG File cpu_agg_ssl.jpeg    
Triage: Triaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/perf-dev/17/artifact/

 Description   
-- 4 -> 4, unidir, xmem, 1 bucket, moderate DGM
-- Initial replication

Based on manual observation, confirmed by your internal counters.

 Comments   
Comment by Wayne Siu [ 15/Jan/14 ]
Deferring from 2.5. Potential candidate for 2.5.1. ns_server team will make changes on top of 2.5.1.
Comment by Pavel Paulau [ 15/Jan/14 ]
Some benchmarks from our environment:

# openssl speed aes
Doing aes-128 cbc for 3s on 16 size blocks: 15847532 aes-128 cbc's in 2.99s
Doing aes-128 cbc for 3s on 64 size blocks: 4282124 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 256 size blocks: 1078408 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 274532 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 34227 aes-128 cbc's in 2.99s
Doing aes-192 cbc for 3s on 16 size blocks: 13432096 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 64 size blocks: 3576384 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 256 size blocks: 906793 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 1024 size blocks: 227850 aes-192 cbc's in 2.99s
Doing aes-192 cbc for 3s on 8192 size blocks: 28528 aes-192 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16 size blocks: 11684190 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 3014948 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 771234 aes-256 cbc's in 2.99s
Doing aes-256 cbc for 3s on 1024 size blocks: 190996 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 24076 aes-256 cbc's in 3.00s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Tue Dec 3 20:18:14 UTC 2013
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -m64 -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Wa,--noexecstack -DPURIFY -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128 cbc 84802.85k 91351.98k 92024.15k 93706.92k 93775.11k
aes-192 cbc 71637.85k 76296.19k 77379.67k 78032.91k 77900.46k
aes-256 cbc 62315.68k 64318.89k 66032.07k 65193.30k 65743.53k

This is what one core demonstrates. Notice that during the test a single node was able to serve at most 8K documents/sec (~2KB), utilizing on average 3-4 cores.
Comment by Aleksey Kondratenko [ 15/Jan/14 ]
BTW, one core can get much higher than that when AES-NI hardware support is enabled, but those benchmarks are not using it by default. You need to pass -evp, as noted for example here: http://stackoverflow.com/questions/19307909/how-do-i-enable-aes-ni-hardware-acceleration-for-node-js-crypto-on-linux

With AES-NI enabled I've seen my box show more than one _billion_ bytes per second in 128-bit AES! So there's definitely large potential here. And I bet Erlang is unable to utilize it; i.e. perf showed us that Erlang did not use the AES-NI-enabled versions of AES in OpenSSL, at least on my box.
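For reference, the EVP-based benchmark that exercises the AES-NI code path is run like this (unlike the plain "openssl speed aes" run above):

# openssl speed -evp aes-128-cbc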
Comment by Pavel Paulau [ 20/Jan/14 ]
Just for quantification: in a bidirectional scenario with encryption we should easily expect 3x higher CPU utilization on both the source and destination sides. Naturally, the absolute numbers depend on the rate of replication.
Comment by Wayne Siu [ 21/Jan/14 ]
Alk will provide a build to Pavel to test. Will review the results in the next meeting.
Wanted to check
a. how much CPU has improved.
b. if there is any added latency.
Comment by Aleksey Kondratenko [ 21/Jan/14 ]
>> b. if there is any added latency.

and c. if there's any throughput change.

Also if possible I'd like to see results with/without rc4 and GSO.
Comment by Pavel Paulau [ 22/Jan/14 ]
a. 1.5-2x lower CPU utilization than in build 2.5.0-1054.
b. No extra latency.
c. No change.

GSO/TSO results will be reported in MB-9896.
Comment by Wayne Siu [ 23/Jan/14 ]
Lowering the priority to Critical, as the fix has helped/improved the CPU utilization in build 1054.
Keeping this ticket open for further optimization in 3.0.
Comment by Cihan Biyikoglu [ 11/Mar/14 ]
I think we need to be below 5-10% SSL overhead. We should look for ways to ensure this is a feature we can recommend in general.
Comment by Aleksey Kondratenko [ 17/Apr/14 ]
Moved out of 3.0




[MB-9446] there's chance of starting janitor while not having latest version of config (was: On reboot entire cluster , see many conflicting bucket config changes frequently.) Created: 30/Oct/13  Updated: 04/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.0, 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ketaki Gangal Assignee: Aliaksey Artamonau
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: build 0.0.0-7040toy

Triage: Triaged
Is this a Regression?: Yes

 Description   

Load items on a cluster, build toy-000-704.
Reboot the cluster.

Post reboot, see a lot of messages about conflicting bucket config in the web logs.

Cluster logs here: https://s3.amazonaws.com/bugdb/bug_9445/9435.tar

Sample

{fastForwardMap,undefined}]}]}]}, choosing the former, which looks newer.
ns_config003 ns_1@soursop-s11207.sc.couchbase.com 18:59:30 - Wed Oct 30, 2013
Conflicting configuration changes to field buckets:
{[{'ns_1@172.23.105.45',{5088,63550403967}},
{'ns_1@soursop-s11203.sc.couchbase.com',{1,63550403967}},
{'ns_1@soursop-s11204.sc.couchbase.com',{1764,63550403283}}],
[{'_vclock',[{'ns_1@172.23.105.45',{5088,63550403967}},
{'ns_1@soursop-s11203.sc.couchbase.com',{1,63550403967}},
{'ns_1@soursop-s11204.sc.couchbase.com',{1764,63550403283}}]},
{configs,[{"saslbucket",
[{uuid,<<"b51edfdad356db7e301d9b32c6ef47a3">>},
{num_replicas,1},
{replica_index,false},
{ram_quota,3355443200},
{auth_type,sasl},
{sasl_password,"password"},
{autocompaction,false},
{purge_interval,undefined},
{flush_enabled,false},
{num_threads,3},
{type,membase},
{num_vbuckets,1024},
{servers,['ns_1@soursop-s11203.sc.couchbase.com',
'ns_1@soursop-s11204.sc.couchbase.com',
'ns_1@soursop-s11205.sc.couchbase.com',
'ns_1@soursop-s11207.sc.couchbase.com']},
{map,[['ns_1@soursop-s11207.sc.couchbase.com',
'ns_1@soursop-s11205.sc.couchbase.com'],
['ns_1@soursop-s11207.sc.couchbase.com',
'ns_1@soursop-s11203.sc.couchbase.com'],
['ns_1@soursop-s11207.sc.couchbase.com',
'ns_1@soursop-s11204.sc.couchbase.com'],

 Comments   
Comment by Aleksey Kondratenko [ 30/Oct/13 ]
Very weird. But if this is indeed an issue, there's likely exactly the same issue in 2.5.0, and if that's the case it looks pretty scary.
Comment by Aliaksey Artamonau [ 01/Nov/13 ]
I set the affected version to 2.5 because I know that it affects 2.5, and actually many preceding releases.
Comment by Maria McDuff (Inactive) [ 31/Jan/14 ]
Alk,

is this already merged in 2.5? Please confirm and mark as resolved if that's the case, and assign back to QE.
Thanks.
Comment by Aliaksey Artamonau [ 31/Jan/14 ]
No, it's not fixed in 2.5.
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - 06/04/2014 Alk, Wayne, Parag, Anil




[MB-9356] tuq crash during query + rebalance having 1M items Created: 16/Oct/13  Updated: 18/Jun/14  Due: 23/Jun/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP3
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Manik Taneja
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 64 bit

Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Initial setup:
2 buckets with 1M items each, 1 node

Steps:
1) start a query using curl (a hedged HTTP sketch of issuing such a query appears right after these steps)
2) add a node and start rebalance
3) start the same query using the tuq_client console.
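
For reference, one way to issue such a query over HTTP from Python; the port 8093 and the /query endpoint are assumptions about the tuqtng developer-preview query service, not details taken from this report:

# Hypothetical repro helper, not from the original report.
import urllib.request

def run_query(statement, host="localhost", port=8093):
    req = urllib.request.Request(
        url="http://%s:%d/query" % (host, port),
        data=statement.encode("utf-8"),   # query text as the POST body
        method="POST")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8")

print(run_query("SELECT * FROM default LIMIT 10"))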


[root@localhost tuqtng]# ./tuqtng -couchbase http://localhost:8091
07:19:57.406786 tuqtng started...
07:19:57.406880 version: 0.0.0
07:19:57.406887 site: http://localhost:8091
panic: runtime error: index out of range

goroutine 283323 [running]:
github.com/couchbaselabs/go-couchbase.func·001(0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:151 +0x4f1
github.com/couchbaselabs/go-couchbase.(*Bucket).doBulkGet(0xc2000e8480, 0xc200c101fc, 0xc20043af70, 0x1, 0x1, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:188 +0x150
github.com/couchbaselabs/go-couchbase.func·002()
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:212 +0x115
created by github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:218 +0x1ef

goroutine 1 [chan receive]:
github.com/couchbaselabs/tuqtng/server.Server(0x8464a0, 0x5, 0x7fff3931ab76, 0x15, 0x8554a0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:66 +0x4f4
main.main()
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/main.go:71 +0x28a

goroutine 2 [syscall]:

goroutine 4 [syscall]:
os/signal.loop()
/usr/local/go/src/pkg/os/signal/signal_unix.go:21 +0x1c
created by os/signal.init·1
/usr/local/go/src/pkg/os/signal/signal_unix.go:27 +0x2f

goroutine 13 [chan send]:
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).SendResult(0xc2001cb930, 0x70d420, 0xc200a27880)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:47 +0x46
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).processItem(0xc2001aa080, 0xc2001c9a80, 0xc2001c99c0, 0xc200a27080, 0x2b5e52485d01, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:119 +0x119
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).executeInternal(0xc2001aa080, 0xc2001d7e10, 0xc2001c9a80, 0xc2001c99c0, 0xc2001e2720, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:90 +0x2a7
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).Execute(0xc2001aa080, 0xc2001d7e10, 0xc2001c9a80, 0xc2001c99c0, 0xc2000004f8, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:42 +0x100
github.com/couchbaselabs/tuqtng/server.Dispatch(0xc2001c9a80, 0xc2001c99c0, 0xc2001c1b10, 0xc2001ab000, 0xc2001c1b40, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:85 +0x191
created by github.com/couchbaselabs/tuqtng/server.Server
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:67 +0x59c

goroutine 6 [chan receive]:
main.dumpOnSignal(0x2b5e52484fa0, 0x1, 0x1)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/main.go:80 +0x7f
main.dumpOnSignalForPlatform()
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/dump.go:19 +0x80
created by main.main
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/main.go:62 +0x1d7

goroutine 7 [IO wait]:
net.runtime_pollWait(0x2aaaaabacf00, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001242c0, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).accept(0xc200124240, 0x90ae00, 0x0, 0xc200198660, 0xb, ...)
/usr/local/go/src/pkg/net/fd_unix.go:385 +0x2c1
net.(*TCPListener).AcceptTCP(0xc2000005f8, 0x4443f6, 0x2b5e52483e28, 0x4443f6)
/usr/local/go/src/pkg/net/tcpsock_posix.go:229 +0x45
net.(*TCPListener).Accept(0xc2000005f8, 0xc200125420, 0xc2001ac2b0, 0xc2001f46c0, 0x0, ...)
/usr/local/go/src/pkg/net/tcpsock_posix.go:239 +0x25
net/http.(*Server).Serve(0xc200107a50, 0xc2001488c0, 0xc2000005f8, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/server.go:1542 +0x85
net/http.(*Server).ListenAndServe(0xc200107a50, 0xc200107a50, 0xc2001985d0)
/usr/local/go/src/pkg/net/http/server.go:1532 +0x9e
net/http.ListenAndServe(0x846860, 0x5, 0xc2001985d0, 0xc200107960, 0x0, ...)
/usr/local/go/src/pkg/net/http/server.go:1597 +0x65
github.com/couchbaselabs/tuqtng/network/http.func·001()
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http.go:37 +0x6c
created by github.com/couchbaselabs/tuqtng/network/http.NewHttpEndpoint
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http.go:41 +0x2a0

goroutine 31 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).SendItem(0xc2001e29c0, 0xc200eb0a40, 0x85b2c0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:151 +0xbe
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).scanRange(0xc2001e29c0, 0x0, 0x8a7970)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:121 +0x640
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).Run(0xc2001e29c0, 0xc2001e2960)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:61 +0xdc
created by github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:97 +0xe3

goroutine 11 [IO wait]:
net.runtime_pollWait(0x2aaaaabace60, 0x77, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitWrite(0xc200124620, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:80 +0x31
net.(*netFD).Write(0xc2001245a0, 0xc2001af000, 0x44, 0x1000, 0x4, ...)
/usr/local/go/src/pkg/net/fd_unix.go:294 +0x3e6
net.(*conn).Write(0xc2000008f0, 0xc2001af000, 0x44, 0x1000, 0x452dd2, ...)
/usr/local/go/src/pkg/net/net.go:131 +0xc3
net/http.(*switchWriter).Write(0xc2001ad040, 0xc2001af000, 0x44, 0x1000, 0x4d5989, ...)
/usr/local/go/src/pkg/net/http/chunked.go:0 +0x62
bufio.(*Writer).Flush(0xc200148f40, 0xc20092e6b4, 0x34)
/usr/local/go/src/pkg/bufio/bufio.go:465 +0xb9
net/http.(*chunkWriter).flush(0xc2001cb8e0)
/usr/local/go/src/pkg/net/http/server.go:270 +0x59
net/http.(*response).Flush(0xc2001cb8c0)
/usr/local/go/src/pkg/net/http/server.go:953 +0x5d
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).ProcessResults(0xc2001cb930, 0x2, 0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:109 +0x16b
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).Process(0xc2001cb930, 0x40519c, 0x71b260)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:61 +0x52
github.com/couchbaselabs/tuqtng/network/http.(*HttpQuery).Process(0xc2001c99c0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_query.go:72 +0x29
github.com/couchbaselabs/tuqtng/network/http.(*HttpEndpoint).ServeHTTP(0xc200000508, 0xc2001b1140, 0xc2001cb8c0, 0xc2001cd000)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http.go:55 +0xcd
github.com/gorilla/mux.(*Router).ServeHTTP(0xc200107960, 0xc2001b1140, 0xc2001cb8c0, 0xc2001cd000)
/tmp/gocode/src/github.com/gorilla/mux/mux.go:90 +0x1e1
net/http.serverHandler.ServeHTTP(0xc200107a50, 0xc2001b1140, 0xc2001cb8c0, 0xc2001cd000)
/usr/local/go/src/pkg/net/http/server.go:1517 +0x16c
net/http.(*conn).serve(0xc200124630)
/usr/local/go/src/pkg/net/http/server.go:1096 +0x765
created by net/http.(*Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1564 +0x266

goroutine 21 [chan receive]:
github.com/couchbaselabs/tuqtng/catalog/couchbase.keepPoolFresh(0xc2001fa200)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/couchbase.go:157 +0x4b
created by github.com/couchbaselabs/tuqtng/catalog/couchbase.newPool
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/couchbase.go:149 +0x34c

goroutine 30 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).SendItem(0xc2001c1660, 0xc200a27940, 0xc200a27940)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:49 +0xbf
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).flushBatch(0xc2001e1c40, 0xc2001e2a00)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:141 +0x7cd
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).processItem(0xc2001e1c40, 0xc200390c00, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:78 +0xd9
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator(0xc2001c1660, 0xc2002015f0, 0xc2001e1c40, 0xc2001e2840)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:107 +0x1b0
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).Run(0xc2001e1c40, 0xc2001e2840)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:57 +0xa8
created by github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:97 +0xe3

goroutine 32 [chan send]:
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanRange(0xc200201500, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:179 +0x386
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanEntries(0xc200201500, 0x0, 0xc2001e2b40, 0xc2001e2ba0, 0xc2001250c0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:112 +0x78
created by github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).scanRange
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:82 +0x18b

goroutine 29 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).SendItem(0xc2001c1600, 0xc200a27140, 0x87ea70)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:49 +0xbf
github.com/couchbaselabs/tuqtng/xpipeline.(*Project).processItem(0xc2001c1630, 0xc200a27140, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/project.go:95 +0x33b
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator(0xc2001c1600, 0xc2002015a0, 0xc2001c1630, 0xc2001e2ae0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:107 +0x1b0
github.com/couchbaselabs/tuqtng/xpipeline.(*Project).Run(0xc2001c1630, 0xc2001e2ae0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/project.go:46 +0x91
created by github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).executeInternal
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:79 +0x1c7

goroutine 33 [chan send]:
github.com/couchbaselabs/tuqtng/catalog/couchbase.WalkViewInBatches(0xc2001251e0, 0xc2001252a0, 0xc2000e8480, 0x845260, 0x0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_util.go:90 +0x424
created by github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanRange
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:159 +0x209

goroutine 165124 [select]:
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).executeInternal(0xc2001aa080, 0xc2005d6340, 0xc2001c9a80, 0xc200282580, 0xc2005bba80, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:87 +0x667
github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).Execute(0xc2001aa080, 0xc2005d6340, 0xc2001c9a80, 0xc200282580, 0xc2000004f8, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:42 +0x100
github.com/couchbaselabs/tuqtng/server.Dispatch(0xc2001c9a80, 0xc200282580, 0xc2001c1b10, 0xc2001ab000, 0xc2001c1b40, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:85 +0x191
created by github.com/couchbaselabs/tuqtng/server.Server
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/server/server.go:67 +0x59c

goroutine 283325 [chan receive]:
github.com/dustin/gomemcached/client.(*Client).GetBulk(0xc200d55990, 0xc200d50092, 0xc2001d7d60, 0x1, 0x1, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:228 +0x3c3
github.com/couchbaselabs/go-couchbase.func·001(0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:158 +0x1dc
github.com/couchbaselabs/go-couchbase.(*Bucket).doBulkGet(0xc2000e8480, 0xc200c10092, 0xc2001d7d60, 0x1, 0x1, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:188 +0x150
github.com/couchbaselabs/go-couchbase.func·002()
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:212 +0x115
created by github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:218 +0x1ef

goroutine 283623 [runnable]:
net.runtime_pollWait(0x2aaaaabac820, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001f46b0, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2001f4630, 0xc20063fda0, 0x18, 0x18, 0x0, ...)
/usr/local/go/src/pkg/net/fd_unix.go:195 +0x2b3
net.(*conn).Read(0xc2002f2658, 0xc20063fda0, 0x18, 0x18, 0x1, ...)
/usr/local/go/src/pkg/net/net.go:123 +0xc3
io.ReadAtLeast(0xc200198840, 0xc2002f2658, 0xc20063fda0, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:284 +0xf7
io.ReadFull(0xc200198840, 0xc2002f2658, 0xc20063fda0, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:302 +0x6f
github.com/dustin/gomemcached.(*MCResponse).Receive(0xc2002b4d80, 0xc200198840, 0xc2002f2658, 0xc20063fda0, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/mc_res.go:155 +0xc7
github.com/dustin/gomemcached/client.getResponse(0xc200198840, 0xc2002f2658, 0xc20063fda0, 0x18, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/transport.go:30 +0xc6
github.com/dustin/gomemcached/client.(*Client).Receive(0xc200b13f30, 0xc2002b4c00, 0x0, 0x0)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:81 +0x67
github.com/dustin/gomemcached/client.func·003()
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:193 +0xaf
created by github.com/dustin/gomemcached/client.(*Client).GetBulk
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:207 +0x1e6

goroutine 165127 [chan receive]:
github.com/couchbaselabs/go-couchbase.(*Bucket).GetBulk(0xc2000e8480, 0xc200977000, 0x3e8, 0x3e8, 0xc200000001, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:278 +0x341
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*bucket).BulkFetch(0xc2001c11e0, 0xc200977000, 0x3e8, 0x3e8, 0x15, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/couchbase.go:249 +0x83
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).flushBatch(0xc2006e8230, 0xc2005bbd00)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:113 +0x35d
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).processItem(0xc2006e8230, 0xc200c19840, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:78 +0xd9
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator(0xc200257300, 0xc2002015f0, 0xc2006e8230, 0xc2005bbba0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:107 +0x1b0
github.com/couchbaselabs/tuqtng/xpipeline.(*Fetch).Run(0xc2006e8230, 0xc2005bbba0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/fetch.go:57 +0xa8
created by github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:97 +0xe3

goroutine 165126 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator(0xc2002572a0, 0xc2002015a0, 0xc2002572d0, 0xc2005bbe40)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:104 +0x32c
github.com/couchbaselabs/tuqtng/xpipeline.(*Project).Run(0xc2002572d0, 0xc2005bbe40)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/project.go:46 +0x91
created by github.com/couchbaselabs/tuqtng/executor/interpreted.(*InterpretedExecutor).executeInternal
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/executor/interpreted/interpreted.go:79 +0x1c7

goroutine 165123 [chan receive]:
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).ProcessResults(0xc2006e80e0, 0x2, 0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:88 +0x3c
github.com/couchbaselabs/tuqtng/network/http.(*HttpResponse).Process(0xc2006e80e0, 0x40519c, 0x71b260)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_response.go:61 +0x52
github.com/couchbaselabs/tuqtng/network/http.(*HttpQuery).Process(0xc200282580)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http_query.go:72 +0x29
github.com/couchbaselabs/tuqtng/network/http.(*HttpEndpoint).ServeHTTP(0xc200000508, 0xc2001b1140, 0xc2006e8070, 0xc2001cdea0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/network/http/http.go:55 +0xcd
github.com/gorilla/mux.(*Router).ServeHTTP(0xc200107960, 0xc2001b1140, 0xc2006e8070, 0xc2001cdea0)
/tmp/gocode/src/github.com/gorilla/mux/mux.go:90 +0x1e1
net/http.serverHandler.ServeHTTP(0xc200107a50, 0xc2001b1140, 0xc2006e8070, 0xc2001cdea0)
/usr/local/go/src/pkg/net/http/server.go:1517 +0x16c
net/http.(*conn).serve(0xc2001f46c0)
/usr/local/go/src/pkg/net/http/server.go:1096 +0x765
created by net/http.(*Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1564 +0x266

goroutine 165129 [select]:
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanRange(0xc200201500, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:166 +0x6b9
github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanEntries(0xc200201500, 0x0, 0xc2005bbea0, 0xc2005bbf00, 0xc2005bbf60, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:112 +0x78
created by github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).scanRange
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:82 +0x18b

goroutine 165128 [select]:
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).scanRange(0xc2005bbd20, 0x0, 0x8a7970)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:99 +0x7a2
github.com/couchbaselabs/tuqtng/xpipeline.(*Scan).Run(0xc2005bbd20, 0xc2005bbcc0)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/scan.go:61 +0xdc
created by github.com/couchbaselabs/tuqtng/xpipeline.(*BaseOperator).RunOperator
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/xpipeline/base.go:97 +0xe3

goroutine 165130 [select]:
net/http.(*persistConn).roundTrip(0xc200372f00, 0xc200765660, 0xc200372f00, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/transport.go:857 +0x6c7
net/http.(*Transport).RoundTrip(0xc20012e080, 0xc2004dac30, 0xc2005a5808, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/transport.go:186 +0x396
net/http.send(0xc2004dac30, 0xc2000e7e70, 0xc20012e080, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/client.go:166 +0x3a1
net/http.(*Client).send(0xb7fcc0, 0xc2004dac30, 0x7c, 0x2b5e52429020, 0xc200fca2c0, ...)
/usr/local/go/src/pkg/net/http/client.go:100 +0xcd
net/http.(*Client).doFollowingRedirects(0xb7fcc0, 0xc2004dac30, 0x90ae80, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/client.go:282 +0x5ff
net/http.(*Client).Do(0xb7fcc0, 0xc2004dac30, 0xc20052cae0, 0x0, 0x0, ...)
/usr/local/go/src/pkg/net/http/client.go:129 +0x8d
github.com/couchbaselabs/go-couchbase.(*Bucket).ViewCustom(0xc2000e8480, 0x845260, 0x0, 0x873ab0, 0x9, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/views.go:115 +0x210
github.com/couchbaselabs/go-couchbase.(*Bucket).View(0xc2000e8480, 0x845260, 0x0, 0x873ab0, 0x9, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/views.go:155 +0xcc
github.com/couchbaselabs/tuqtng/catalog/couchbase.WalkViewInBatches(0xc20045b000, 0xc20045b060, 0xc2000e8480, 0x845260, 0x0, ...)
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_util.go:80 +0x2ce
created by github.com/couchbaselabs/tuqtng/catalog/couchbase.(*viewIndex).ScanRange
/tmp/gocode/src/github.com/couchbaselabs/tuqtng/catalog/couchbase/view_index.go:159 +0x209

goroutine 283324 [chan receive]:
github.com/dustin/gomemcached/client.(*Client).GetBulk(0xc200b13f30, 0xc200b1001b, 0xc200fce200, 0x2, 0x2, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:228 +0x3c3
github.com/couchbaselabs/go-couchbase.func·001(0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:158 +0x1dc
github.com/couchbaselabs/go-couchbase.(*Bucket).doBulkGet(0xc2000e8480, 0xc200c1001b, 0xc200fce200, 0x2, 0x2, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:188 +0x150
github.com/couchbaselabs/go-couchbase.func·002()
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:212 +0x115
created by github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:218 +0x1ef

goroutine 283622 [runnable]:
net.runtime_pollWait(0x2aaaaabacb40, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001f4e00, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2001f4d80, 0xc200eda4e0, 0x18, 0x18, 0x0, ...)
/usr/local/go/src/pkg/net/fd_unix.go:195 +0x2b3
net.(*conn).Read(0xc20080b948, 0xc200eda4e0, 0x18, 0x18, 0x1, ...)
/usr/local/go/src/pkg/net/net.go:123 +0xc3
io.ReadAtLeast(0xc200198840, 0xc20080b948, 0xc200eda4e0, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:284 +0xf7
io.ReadFull(0xc200198840, 0xc20080b948, 0xc200eda4e0, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:302 +0x6f
github.com/dustin/gomemcached.(*MCResponse).Receive(0xc2002b4ba0, 0xc200198840, 0xc20080b948, 0xc200eda4e0, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/mc_res.go:155 +0xc7
github.com/dustin/gomemcached/client.getResponse(0xc200198840, 0xc20080b948, 0xc200eda4e0, 0x18, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/transport.go:30 +0xc6
github.com/dustin/gomemcached/client.(*Client).Receive(0xc2004bfa20, 0xc2002b4a20, 0x0, 0x0)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:81 +0x67
github.com/dustin/gomemcached/client.func·003()
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:193 +0xaf
created by github.com/dustin/gomemcached/client.(*Client).GetBulk
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:207 +0x1e6

goroutine 283331 [select]:
net/http.(*persistConn).writeLoop(0xc200372f00)
/usr/local/go/src/pkg/net/http/transport.go:774 +0x26f
created by net/http.(*Transport).dialConn
/usr/local/go/src/pkg/net/http/transport.go:512 +0x58b

goroutine 283321 [chan receive]:
github.com/couchbaselabs/go-couchbase.errorCollector(0xc20101a900, 0xc200372c80)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:246 +0x9f
created by github.com/couchbaselabs/go-couchbase.(*Bucket).GetBulk
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:275 +0x2f2

goroutine 283544 [select]:
net/http.(*persistConn).writeLoop(0xc2006cbd80)
/usr/local/go/src/pkg/net/http/transport.go:774 +0x26f
created by net/http.(*Transport).dialConn
/usr/local/go/src/pkg/net/http/transport.go:512 +0x58b

goroutine 283322 [chan receive]:
github.com/dustin/gomemcached/client.(*Client).GetBulk(0xc2004bfa20, 0xc2004b01ec, 0xc200e8cd60, 0x2, 0x2, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:228 +0x3c3
github.com/couchbaselabs/go-couchbase.func·001(0x0, 0x0)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:158 +0x1dc
github.com/couchbaselabs/go-couchbase.(*Bucket).doBulkGet(0xc2000e8480, 0xc200c101ec, 0xc200e8cd60, 0x2, 0x2, ...)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:188 +0x150
github.com/couchbaselabs/go-couchbase.func·002()
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:212 +0x115
created by github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:218 +0x1ef

goroutine 283621 [runnable]:
net.runtime_pollWait(0x2aaaaabac960, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001f4500, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2001f4480, 0xc200a83940, 0x18, 0x18, 0x0, ...)
/usr/local/go/src/pkg/net/fd_unix.go:195 +0x2b3
net.(*conn).Read(0xc20084a5d8, 0xc200a83940, 0x18, 0x18, 0x1, ...)
/usr/local/go/src/pkg/net/net.go:123 +0xc3
io.ReadAtLeast(0xc200198840, 0xc20084a5d8, 0xc200a83940, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:284 +0xf7
io.ReadFull(0xc200198840, 0xc20084a5d8, 0xc200a83940, 0x18, 0x18, ...)
/usr/local/go/src/pkg/io/io.go:302 +0x6f
github.com/dustin/gomemcached.(*MCResponse).Receive(0xc2002b49c0, 0xc200198840, 0xc20084a5d8, 0xc200a83940, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/mc_res.go:155 +0xc7
github.com/dustin/gomemcached/client.getResponse(0xc200198840, 0xc20084a5d8, 0xc200a83940, 0x18, 0x18, ...)
/tmp/gocode/src/github.com/dustin/gomemcached/client/transport.go:30 +0xc6
github.com/dustin/gomemcached/client.(*Client).Receive(0xc200d55990, 0xc2002b4720, 0x0, 0x0)
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:81 +0x67
github.com/dustin/gomemcached/client.func·003()
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:193 +0xaf
created by github.com/dustin/gomemcached/client.(*Client).GetBulk
/tmp/gocode/src/github.com/dustin/gomemcached/client/mc.go:207 +0x1e6

goroutine 283543 [runnable]:
net/http.(*persistConn).readLoop(0xc2006cbd80)
/usr/local/go/src/pkg/net/http/transport.go:761 +0x64b
created by net/http.(*Transport).dialConn
/usr/local/go/src/pkg/net/http/transport.go:511 +0x574

goroutine 283320 [chan send]:
github.com/couchbaselabs/go-couchbase.(*Bucket).processBulkGet(0xc2000e8480, 0xc200c19980, 0xc20101a8a0, 0xc20101a900)
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:222 +0x26b
created by github.com/couchbaselabs/go-couchbase.(*Bucket).GetBulk
/tmp/gocode/src/github.com/couchbaselabs/go-couchbase/client.go:273 +0x2d1

goroutine 283330 [IO wait]:
net.runtime_pollWait(0x2aaaaabacdc0, 0x72, 0x0)
/usr/local/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2001f47d0, 0xb, 0xc200198660)
/usr/local/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2001f4750, 0xc20097b000, 0x1000, 0x1000, 0x0, ...)
/usr/local/go/src/pkg/net/fd_unix.go:195 +0x2b3
net.(*conn).Read(0xc20084a498, 0xc20097b000, 0x1000, 0x1000, 0x8, ...)
/usr/local/go/src/pkg/net/net.go:123 +0xc3
bufio.(*Reader).fill(0xc200b83180)
/usr/local/go/src/pkg/bufio/bufio.go:79 +0x10c
bufio.(*Reader).Peek(0xc200b83180, 0x1, 0xc200198840, 0x0, 0xc200eda4e0, ...)
/usr/local/go/src/pkg/bufio/bufio.go:107 +0xc9
net/http.(*persistConn).readLoop(0xc200372f00)
/usr/local/go/src/pkg/net/http/transport.go:670 +0xc4
created by net/http.(*Transport).dialConn
/usr/local/go/src/pkg/net/http/transport.go:511 +0x574
[root@localhost tuqtng]#


 Comments   
Comment by Marty Schoch [ 16/Oct/13 ]
Looks like a bug in go-couchbase. I have filed an issue there:

http://cbugg.hq.couchbase.com/bug/bug-906
Comment by Ketaki Gangal [ 16/Oct/13 ]
We can easily hit this on any rebalance, so bumping this up to critical.




[MB-9321] Get us off erlang's global facility and re-elect failed master quickly and safely Created: 10/Oct/13  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Duplicate
is duplicated by MB-9691 rebalance repeated failed when add no... Closed
Relates to
relates to MB-9691 rebalance repeated failed when add no... Closed
Triage: Triaged
Is this a Regression?: No

 Description   
We have a number of bugs due to erlang's global facility, or the related issue of not being able to spawn a new master quickly. E.g.:

* MB-7282 (erlang's global naming facility apparently drops globally registered service with actual service still alive (was: impossible to change settings/autoFailover after rebalance))

* MB-7168 [Doc'd 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)

* MB-8682 start rebalance request is hunging sometimes (looks like another global facility issue)

* MB-5622 Crash of master node may lead to autofailover in 2 minutes instead of configured shorter autofailover period or similarly slow manual failover

By getting us off global, we will fix all these issues.


 Comments   
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
This also includes making sure autofailover takes into account the time it takes for master election in case of a master crash.

Current thinking is that every node will run the autofailover service, but it will run only if it's on the master node. And we can have special code that speeds up master re-election if we detect that the master node is down.
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
Note that currently mb_master is the thing that suffers first when a timeout-heavy situation starts.

So we should look at making mb_master more robust if necessary.
Comment by Aleksey Kondratenko [ 17/Oct/13 ]
I'm _really_ curious who makes decisions to move this into 2.5.0. Why. And why they think we have bandwidth to handle it.
Comment by Aleksey Kondratenko [ 09/Dec/13 ]
Workaround diag/eval snippet:

rpc:call(mb_master:master_node(), erlang, apply ,[fun () -> erlang:exit(erlang:whereis(mb_master), kill) end, []]).

Detection snippet:

F = (fun (Name) -> {Oks, NotOks} = rpc:multicall(ns_node_disco:nodes_actual(), global, whereis_name, [Name], 60000), case {lists:usort(Oks), NotOks} of {_, [_|_]} -> {failed_rpc, NotOks}; {[_], _} -> ok; {L, _} -> {different_answers, L} end end), [(catch {N, ok} = {N, F(N)}) || N <- [ns_orchestrator, ns_tick, auto_failover]].

The detection snippet should return:

 [{ns_orchestrator,ok},{ns_tick,ok},{auto_failover,ok}]

If not, there's a decent chance that we're hitting this issue.
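
One way to run the snippets above against a node's /diag/eval REST endpoint from Python is sketched below; host, port and credentials are placeholders, and this is an illustration rather than part of the original workaround:

# Illustration only: POST an Erlang snippet to ns_server's /diag/eval.
# Host, port and admin credentials below are placeholders.
import urllib.request

def diag_eval(snippet, host="127.0.0.1", port=8091,
              user="Administrator", password="password"):
    url = "http://%s:%d/diag/eval" % (host, port)
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(mgr))
    req = urllib.request.Request(url, data=snippet.encode("utf-8"))
    with opener.open(req, timeout=60) as resp:
        return resp.read().decode("utf-8")

# Usage: pass the full detection snippet from the comment above as one
# string and check the reply for [{ns_orchestrator,ok},{ns_tick,ok},{auto_failover,ok}].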
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
As part of that we'll likely have to revamp autofailover. And John Liang suggested a nice idea: a single ejected node could disable memcached traffic on itself to signal to smart clients that something notable occurred.

On Fri, Dec 13, 2013 at 11:48 AM, John Liang <john.liang@couchbase.com> wrote:
>> I.e. consider client that only updates vbucket map if it receives "not my vbucket". And consider 3 node cluster where 1 node is partitioned off other nodes but is accessible from client. Lets name this node C. And imagine that remaining two nodes did failover that node. It can be seen that client will happily continue using old vbucket map and reading/writing to/from node C, because it'll never get a single "not my vbucket" reply.

Thanks Alk. In this case, is there a reason not to change the vbucket state on the singly-partitioned node on auto-failover? There will still be a window for "data loss", but this window should be much smaller.

Yes we can do it. Good idea.
Comment by Perry Krug [ 13/Dec/13 ]
But if that node (C) is partitioned off...how will we be able to tell it to set those vbucket states? IMO, wouldn't it be better for the clients to implement a true quorum approach to the map when they detect that something isn't right? Or am I being too naive and missing something?
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
It's entirely possible that I misunderstood the original text, but I understand it as follows:

* when autofailover is enabled, every node observes whether it's alone. If a node finds itself alone and the usual autofailover threshold passes, that node can be somewhat sure that it was automatically failed over by the rest of the cluster

* when that happens, the node can either turn all vbuckets into replicas or disable traffic (similarly to what we're doing during flush).

There is of course a chance that all other nodes have truly failed and that single node is all that's left. But it can be argued that in that case the amount of data loss is big enough anyway, and one node that artificially disables traffic doesn't change things much.

Regarding "quorum on clients". I've seen one proposal for that. And I don't think it's good idea. I.e. being in majority and being right are almost completely independent things. We can do far better than that. Particularly with CCCP we have rev field that gives sufficient ordering between bucket configurations.
Comment by Perry Krug [ 13/Dec/13 ]
My concern is that our recent observations of false-positive autofailovers may lead lots of individual nodes to decide that they have been isolated and disable their traffic...whether they've been automatically failed over or not.

As you know, one of the very nice safety nets of our autofailover is that it will not activate if it sees more than one node down at once which means that we can never do the wrong thing. If we allow one node to disable its traffic when it can't intelligently reason about the state of the rest of the cluster, IMO we go away from this safety net...no?
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
No. Because a node can only do that when it's sure that the other side of the cluster is not accessible. And it can recover its memcached traffic ASAP after it detects that the rest of the cluster is back.
Comment by Perry Krug [ 13/Dec/13 ]
But it can't ever be sure that the other side of the cluster is actually not accessible...clients may still be able to reach it right?

I'm thinking about some extreme corner cases...but what about the situation where two nodes of a >2-node cluster are completely isolated via some weird networking situation and yet are still reachable to the clients. Both of them would decide that they were isolated from the whole cluster, both of them would disable all their vbuckets and yet neither would be auto-failed over because the rest of the cluster would see two nodes down and not trigger the autofailover. I realize it's rare...but I bet there are less convoluted scenarios that would lead the software to do something undesirable.

I think this is a good discussion...but not directly relevant to the purpose of this bug which I believe is still an important fix that needs to be made. Do you want to take this discussion offline from this bug?
Comment by Aleksey Kondratenko [ 13/Dec/13 ]
There are definitely ways this can backfire. But the tradeoffs are quite clear. You "buy" the ability to detect autofailovers (and only autofailovers in my words above, though this can potentially be extended to other cases), at the expense of a small chance of a node false-positively disabling its traffic, briefly and without data loss.

Thinking about this more, I now see that it's a less good idea than I thought; i.e. covering only autofailover but not manual failover is not as interesting. But we can return to this discussion when the mb_master work is actually in progress.

Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Lowered to critical. It's not blocking anyone.




[MB-9234] Failover message should take into account availability of replica vbuckets Created: 08/Oct/13  Updated: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
If a node goes down and its vbuckets do not have corresponding replicas available in the cluster, we should warn the user that pressing failover will result in perceived data loss. At the moment, we show the same failover message whether those replica vbuckets are available or not.




[MB-9143] Allow replica count to be edited Created: 17/Sep/13  Updated: 12/Jun/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 2.5.0, 3.0

Type: Task Priority: Critical
Reporter: Perry Krug Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-2512 Allow replica count to be edited Closed

 Description   
Currently the replication factor cannot be edited after a bucket has been created. It would be nice to have this functionality.

 Comments   
Comment by Ruth Harris [ 06/Nov/13 ]
Currently, it's added to the 3.0 Eng branch by Alk. See MB-2512. This would be a 3.0 doc enhancement.
Comment by Perry Krug [ 25/Mar/14 ]
FYI, this is already in as of 2.5 and probably needs to be documented there as well...if possible before 3.0.
Comment by Amy Kurtzman [ 16/May/14 ]
Anil, Can you verify whether this was added in 2.5 or 3.0?
Comment by Anil Kumar [ 28/May/14 ]
Verified - as Perry mentioned, this was added in the 2.5 release. We need to document this soon for the 2.5 docs.
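
Since the capability exists as of 2.5, the docs could show the bucket-settings REST call. A hedged Python sketch follows; host, credentials and bucket name are placeholders, and the parameter name should be confirmed against the 2.5 REST reference:

# Hedged illustration: edit the replica count of an existing bucket via
# the bucket-settings REST endpoint. Placeholders throughout.
import urllib.parse
import urllib.request

def set_replica_count(bucket, replicas, host="127.0.0.1", port=8091,
                      user="Administrator", password="password"):
    url = "http://%s:%d/pools/default/buckets/%s" % (host, port, bucket)
    data = urllib.parse.urlencode({"replicaNumber": replicas}).encode()
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(mgr))
    opener.open(urllib.request.Request(url, data=data), timeout=60)

# A rebalance is still required afterwards for the new replica vbuckets
# to be created and populated.
set_replica_count("default", 2)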




[MB-8686] CBHealthChecker - Fix fetching number of CPU processors Created: 23/Jul/13  Updated: 05/Jun/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.1.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Anil Kumar Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: customer
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-8817 REST API support to report number of ... Technical task Open Bin Cui  
Triage: Untriaged

 Description   
Issue reported by a customer - the cbhealthchecker report shows incorrect information for 'Minimum CPU core number required'.


 Comments   
Comment by Bin Cui [ 07/Aug/13 ]
It will depend on ns_server to provide the number of CPU processors in the collected stats. Suggest pushing this to the next release.
Comment by Maria McDuff (Inactive) [ 01/Nov/13 ]
per Bin:
Suggest pushing the following two bugs to the next release:
1. MB-8686: it depends on ns_server providing the capability to retrieve the number of CPU cores
2. MB-8502: caused by async communication between the main installer thread and the API that gets status. The change would be dramatic for the installer.
 
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Bin,

Raising to Critical.
If this is still dependent on ns_server, please assign to Alk.
This needs to be fixed for 3.0.
Comment by Anil Kumar [ 05/Jun/14 ]
We need this information to be provided by ns_server. Created ticket MB-11334.

Triage - June 05 2014 Bin, Anil, Tony, Ashvinder
Comment by Aleksey Kondratenko [ 05/Jun/14 ]
Ehm. I don't think it's a good idea to treat ns_server as a "provider of random system-level stats". I believe you'll need to find another way of getting it.
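
For comparison, the logical core count can also be read locally on each node without any help from ns_server; a minimal Python sketch, illustrative only and not the healthchecker's actual collection path:

# Illustrative only -- not healthchecker code. Reads the local logical
# CPU count without relying on ns_server-collected stats.
import multiprocessing

def local_cpu_count():
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return None  # platform could not report a count

print("logical CPUs:", local_cpu_count())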




[MB-9045] [windows] cbworkloadgen hangs Created: 03/Sep/13  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.2.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: scrubbed
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.2.0-817
<manifest><remote name="couchbase" fetch="git://10.1.1.210/"/><remote name="membase" fetch="git://10.1.1.210/"/><remote name="apache" fetch="git://github.com/apache/"/><remote name="erlang" fetch="git://github.com/erlang/"/><default remote="couchbase" revision="master"/><project name="tlm" path="tlm" revision="862733cea3805cf8eba957a120a67986cd57e4e3"><copyfile dest="Makefile" src="Makefile.top"/></project><project name="bucket_engine" path="bucket_engine" revision="2a797a8d97f421587cce728f2e6aa2cd42c8fa26"/><project name="ep-engine" path="ep-engine" revision="864296f0b4068f9d8e3943fbea6e34c29cf0e903"/><project name="libconflate" path="libconflate" revision="c0d3e26a51f25a2b020713559cb344d43ce0b06c"/><project name="libmemcached" path="libmemcached" revision="ea579a523ca3af872c292b1e33d800e3649a8892" remote="membase"/><project name="libvbucket" path="libvbucket" revision="408057ec55da3862ab8d75b1ed25d2848afd640f"/><project name="couchbase-cli" path="couchbase-cli" revision="94b37190ece87b4386a93b64e62487370d268654" remote="couchbase"/><project name="memcached" path="memcached" revision="414d788f476a019cc5d2b05e0ce72504fe469c79" remote="membase"/><project name="moxi" path="moxi" revision="01bd2a5c0aff2ca35611ba3fb857198945cc84eb"/><project name="ns_server" path="ns_server" revision="8e533a59413ba98dd8a0bc31b409668ca886c560"/><project name="portsigar" path="portsigar" revision="2204847c85a3ccaecb2bb300306baf64824b2597"/><project name="sigar" path="sigar" revision="a402af5b6a30ea8e5e7220818208e2601cb6caba"/><project name="couchbase-examples" path="couchbase-examples" revision="cd9c8600589a1996c1ba6dbea9ac171b937d3379"/><project name="couchbase-python-client" path="couchbase-python-client" revision="f14c0f53b633b5313eca1ef64b0f241330cf02c4"/><project name="couchdb" path="couchdb" revision="386be73085c0b2a8e11cd771fc2ce367b62b7354"/><project name="couchdbx-app" path="couchdbx-app" revision="300031ab2e7e2fc20c59854cb065a7641e8654be"/><project name="couchstore" path="couchstore" revision="30f8f0872ef28f95765a7cad4b2e45e32b95dff8"/><project name="geocouch" path="geocouch" revision="000096996e57b2193ea8dde87e078e653a7d7b80"/><project name="healthchecker" path="healthchecker" revision="fd4658a69eec1dbe8a6122e71d2624c5ef56919c"/><project name="testrunner" path="testrunner" revision="8371aa1cc3a21650b3a9f81ba422ec9ac3151cfc"/><project name="cbsasl" path="cbsasl" revision="6ba4c36480e78569524fc38f6befeefb614951e6"/><project name="otp" path="otp" revision="b6dc1a844eab061d0a7153d46e7e68296f15a504" remote="erlang"/><project name="icu4c" path="icu4c" revision="26359393672c378f41f2103a8699c4357c894be7" remote="couchbase"/><project name="snappy" path="snappy" revision="5681dde156e9d07adbeeab79666c9a9d7a10ec95" remote="couchbase"/><project name="v8" path="v8" revision="447decb75060a106131ab4de934bcc374648e7f2" remote="couchbase"/><project name="gperftools" path="gperftools" revision="44a584d1de8c89addfb4f1d0522bdbbbed83ba48" remote="couchbase"/><project name="pysqlite" path="pysqlite" revision="0ff6e32ea05037fddef1eb41a648f2a2141009ea" remote="couchbase"/></manifest>

Attachments: Zip Archive cbcollect.zip    
Triage: Untriaged
Operating System: Windows 64-bit

 Description   
/cygdrive/c/Program\ Files/Couchbase/Server/bin/cbworkloadgen.exe -n localhost:8091 -r 0.9 -i 1000 -b default -s 256 -j -t 2 -u Administrator -p password
It loads only 369 items and then just hangs.

 Comments   
Comment by Bin Cui [ 03/Sep/13 ]
Looks like the parameter -s 256 causes the trouble; it asks for every doc to be at least 256 bytes.

When tested with -s less than 50 it always works fine, but we hit trouble beyond that value.

BTW, the default for -s is 10.
Comment by Thuan Nguyen [ 21/Jan/14 ]
Tested on build 2.5.0-1054; cbworkloadgen.exe still hangs with an item size of only 35 bytes

cbworkloadgen.exe -n 10.1.2.31:8091 -r 0.9 -i 1000000 -b default -s 35 -j -t 2 -u Administrator -p password

Comment by Thuan Nguyen [ 21/Jan/14 ]
Checked the UI: it loaded only 639 items and stopped on the default bucket




[MB-9004] Frontend ops/sec drops by 5% - 15% during rebalance Created: 29/Aug/13  Updated: 13/Mar/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.2.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged

 Description   
This ticket is created to further address the frontend ops/sec drop during rebalance. Previously, we saw more than 40-50% drop in the frontend ops/sec during rebalance. Please refer to MB-7972 for more details.

We recently made a fix in the ep-engine side to address this issue for 2.2.0 release, but still observed 5% - 15% drop. We plan to address this issue furthermore in the next major release.

 Comments   
Comment by Chiyoung Seo [ 01/Nov/13 ]
Moving this to the 3.0 release or later as it requires some thread scheduling changes.
Comment by Maria McDuff (Inactive) [ 10/Feb/14 ]
Chiyoung,

are there any fixes related to this issue that went into 3.0?




[MB-8915] Tombstone purger needs to find a better home for lifetime of deletion Created: 21/Aug/13  Updated: 15/Jun/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication, storage-engine
Affects Version/s: 2.2.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Junyi Xie (Inactive) Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-8916 migration tool for offline upgrade Technical task Open Anil Kumar  
Triage: Untriaged

 Description   
=== copy and paste my email to a group of people, it should explain clearly why we need this ticket ===

Thanks for your comments. Probably it is easier to read in email than code review.

Let me explain a bit to see if we can be on the same page. First of all, the current resolution algorithm (comparing all fields) is still right; yes, there is a small chance we would touch fields after CAS, but for correctness we should have them there.

The cause of MB-8825 is that the tombstone purger uses the expiration time field to store the purger-specific "lifetime of deletion". This is just a "temporary solution", because IMHO the expiration time of a key is not the right place for the "lifetime of deletion" (this is purely storage-specific metadata and IMHO should not be in ep_engine), but unfortunately today we cannot find a better place to put such info unless we change the storage format, which has too much overhead at this time. In the future, I think we need to figure out the best place for the "lifetime of deletion" and move it out of the key expiration time field.

In practice, today this temporary solution in the tombstone purger is OK in most cases because you rarely have a CAS collision for two deletions on the same key. But MB-8825 just hit the small dark area: when the destination tries to replicate a deletion from the source back to the source in bi-directional XDCR, both copies share the same (SeqNo, CAS) but differ in the expiration time field (which is not the exp time of the key, but the lifetime of deletion created by the tombstone purger); the exp time at the destination is sometimes bigger than that at the source, causing incorrect resolution results at the source. The problem exists for both CAPI and XMEM.

For backward compatibility,
1) If both sides are 2.2, we use the new resolution algorithm for deletions and we are safe.
2) If both sides are pre-2.2, since they do not have the tombstone purger, the current algorithm (comparing all fields) should be safe.
3) For bi-dir XDCR between a pre-2.2 and a 2.2 cluster on CAPI, a deletion born at 2.2 and replicated to pre-2.2 should be safe because there is no tombstone purger at pre-2.2. For deletions born at pre-2.2, we may see them bounced back from 2.2, but there should be no data loss since you just re-delete something already deleted.

This fix may not be perfect, but it is still much better than the issues in MB-8825. I hope in the near future we can find the right place for the "lifetime of deletion" in the tombstone purger.


Thanks,

Junyi
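
To make the failure mode concrete, here is a minimal sketch of a field-by-field revision comparison in the style described above. The field names and their ordering (seq_no, cas, exp_time, flags) are assumptions for illustration, and the metadata values are made up:

# Sketch only: lexicographic comparison over (seq_no, cas, exp_time, flags).
from collections import namedtuple

Meta = namedtuple("Meta", "seq_no cas exp_time flags")

def remote_wins(remote, local):
    # Compare field by field; the first difference decides the winner.
    return (remote.seq_no, remote.cas, remote.exp_time, remote.flags) > \
           (local.seq_no, local.cas, local.exp_time, local.flags)

# The same deletion on both sides (same seq_no and CAS), but the tombstone
# purger stamped a later "lifetime of deletion" into exp_time at the
# destination, so the copy bounced back from the destination spuriously
# looks newer than the local one at the source.
local  = Meta(seq_no=7, cas=0x1122334455667788, exp_time=1376000000, flags=0)
remote = Meta(seq_no=7, cas=0x1122334455667788, exp_time=1376090000, flags=0)
print(remote_wins(remote, local))   # True: incorrect resolution at the source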

 Comments   
Comment by Junyi Xie (Inactive) [ 21/Aug/13 ]
Anil and Dipti,

Please determine the priority of this task, and comment if I miss anything. Thanks.


Comment by Anil Kumar [ 21/Aug/13 ]
Upgrade - We need migration tool (which we talked about) in case of Offline upgrade to move the data. Created a subtask for that.
Comment by Aaron Miller (Inactive) [ 17/Oct/13 ]
Considering that fixing this has lots of implications w.r.t. upgrade and all components that touch the file format, and that not fixing it is not causing any problems, I believe that this is not appropriate for 2.5.0
Comment by Junyi Xie (Inactive) [ 22/Oct/13 ]
I agree with Aaron that this may not be a small task and may have lots of implications to different components.

Anil, please reconsider if this is appropriate for 2.5. Thanks.
Comment by Anil Kumar [ 22/Oct/13 ]
Moved it to 3.0.
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
As "temporary head of xdcr for 3.0" I don't need this fixed in 3.0

And my guess is that after 3.0 when "the plan" for xdcr will be ready, we'll just close it as won't fix, but lets wait and see.
Comment by Cihan Biyikoglu [ 15/Jun/14 ]
Aaron is no longer here. Assigning to Chiyoung for consideration.




[MB-8845] spend 5 days prototyping master-less cluster orchestration Created: 15/Aug/13  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to

 Description   
We have a number of issues caused by master election and stuff like that.

We have some ideas about doing better than that, but it needs to be prototyped.

 Comments   
Comment by Aleksey Kondratenko [ 26/Aug/13 ]
See MB-7282
Comment by Aleksey Kondratenko [ 16/Sep/13 ]
I've spent 1 day on that on Fri Sep 13
Comment by Andrei Baranouski [ 10/Feb/14 ]
Alk, could you provide cases to reproduce it when you finish the task? Because this problem occurs very rarely in our tests.
Comment by Aleksey Kondratenko [ 10/Feb/14 ]
No. That task will not lead to any commits into mainline code. It's just a prototype.

After the prototype is done we'll have a more specific plan for the mainline codebase.
Comment by Maria McDuff (Inactive) [ 14/Feb/14 ]
Removed from 3.0 Release.




[MB-8832] Allow for some back-end setting to override hard limit on server quota being 80% of RAM capacity Created: 14/Aug/13  Updated: 28/May/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.1.0, 2.2.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Perry Krug Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Relates to
relates to MB-10180 Server Quota: Inconsistency between d... Open
Triage: Untriaged
Is this a Regression?: Yes

 Description   
At the moment, there is no way to override the 80% of RAM limit for the server quota. At very large node sizes, this can end up leaving a lot of RAM unused.

 Comments   
Comment by Aleksey Kondratenko [ 14/Aug/13 ]
Passing this to Dipti.

We've seen memory fragmentation easily be 50% of memory usage. So even with 80% you can get into swap and badness.

I'd recommend _against_ this until we solve the fragmentation issues we have today.

Also keep in mind that today you _can_ raise this above all limits with a simple /diag/eval snippet
Comment by Perry Krug [ 14/Aug/13 ]
We have seen this I agree, but it's been fairly uncommon in production environments and is something that can be monitored and resolved when it does occur. In larger RAM systems, I think we would be better served for most use cases by allowing more RAM to be used.

For example, 80% of 60GB is 48GB...leaving 12GB unused. Even worse for 256GB (leaving 50+GB unused)
Comment by Aleksey Kondratenko [ 14/Aug/13 ]
And on a 256 GB machine fragmentation can be as big as 128 GB (!). IMHO this is not about absolute numbers but about percentages. Anyway, Dipti will tell us what to do, but your numbers above are just saying how bad our _expected_ fragmentation is.
Comment by Perry Krug [ 14/Aug/13 ]
But that's where I disagree...I think it _is_ about absolute numbers. If we leave fragmentation out of it (since it's something we will fix eventually, something that is specific to certain workloads and something that can be worked around via rebalancing), the point of this overhead was specifically to leave space available for the operating system and any other processes running outside of Couchbase. I'm sure you'd agree that Linux doesn't need anywhere near 50GB of RAM to run properly :) Even if we could decrease that by half it would provide huge savings in terms of hardware and costs to our users.

Is fragmentation the only concern of yours? If we were able to analyze a running production cluster to quantify the RAM fragmentation that exists and determine that it is within a certain bounds...would it be okay to raise the quota about 80%?
Comment by Aleksey Kondratenko [ 14/Aug/13 ]
My point was that fragmentation is also a percentage, not an absolute number. So with larger RAM, waste from fragmentation looks scarier.

Now that you're asking if that's my only concern I see that there's more.

Without sufficient space for the page cache, disk performance will suffer. How much we need to be at least on par with sqlite I cannot say. Nobody can, apparently. Things depend on whether you're going to do bgfetches or not.

Because if you do care about quick bgfetches (or, say, views and xdcr), then you may want to set the lowest possible quota and give as much RAM as possible to the page cache, hoping that at least all metadata is in the page cache.

If you do not care about residency of metadata, that means you don't care about btree leaves being page-cache-resident. But in order to remain io-efficient you do need to keep non-leaf nodes in the page cache. The issue is that with our append-only design nobody knows how well it works in practice and exactly how much page cache you need to give to keep the few, perhaps hundreds of, megs of metadata-of-metadata page-cache resident. And quite possibly the "correct" recommendation is something like "you need XX percent of your data size for page cache to keep the disk subsystem efficient".
Comment by Perry Krug [ 14/Aug/13 ]
Okay, that does make a very good point.

But it also highlights the need for a flexible configuration on our end depending on the use case and customer's needs. i.e., certain customers want to enforce that they are 100% resident and to me that would mean giving Couchbase more than the default quota (while still keeping the potential for fragmentation in mind).
Comment by Patrick Varley [ 11/Feb/14 ]
MB-10180 is strongly related to this issue.
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Anil, pls see my comment on MB-10180.




[MB-8054] Couchstore's mergesort module, currently used for db compaction, can buffer too much data in memory Created: 10/Apr/13  Updated: 19/May/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 2.0.1, 2.1.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Filipe Manana Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
The size of the buffer used by the mergesort module is exclusively bounded by number of elements. This is dangerous, because elements can have variable sizes, and a small number of elements does not necessarily mean that the buffer size (byte size) is small.

Namely, the treewriter module, used by the database compactor to sort the temporary file containing records for the id btree, was specifying a buffer element count of 100 * 1024 * 1024. If for example there are 100 * 1024 * 1024 id records and each has an average size of 512 bytes, the merge sort module buffers 50GB of data in memory!

Although the id btree records are currently very small (under a hundred bytes or so), use of other types of records may easily cause too much memory consumption - this will be the case for view records. Issue MB-8029 adds a module that uses the mergesort module to sort files containing view records.
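
The core point is that flush decisions should be driven by accumulated bytes rather than by element count. A minimal sketch of that difference in Python, illustrative only and not the couchstore implementation:

# Illustrative only: bound an external-sort buffer by accumulated bytes
# instead of by record count, so variable-sized records cannot blow up
# memory use.
def sorted_runs(records, max_buffer_bytes=10 * 1024 * 1024):
    # Yield sorted in-memory runs whose total size stays under the cap;
    # the runs would then be merged on disk.
    buf, buf_bytes = [], 0
    for rec in records:                 # rec is a bytes-like record
        buf.append(rec)
        buf_bytes += len(rec)
        if buf_bytes >= max_buffer_bytes:
            yield sorted(buf)
            buf, buf_bytes = [], 0
    if buf:
        yield sorted(buf)

# A count-based bound of 100 * 1024 * 1024 records with 512-byte records
# would accumulate roughly 50GB before a flush; the byte bound above
# flushes at 10MB regardless of record size.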


 Comments   
Comment by Filipe Manana [ 10/Apr/13 ]
http://review.couchbase.org/#/c/25588
Comment by Filipe Manana [ 11/Apr/13 ]
It turns out this is not a simple change.

Simply adding a buffer byte-size limit breaks the merge algorithm in some cases, particularly when the file to sort is larger than the specified buffer size. The mergesort.c merge phase relies on the fact that each sorted batch written to the tmp files always has the same number of elements - something that doesn't hold true when records have a variable size, such as with views (MB-8029).

For now it's not too bad because, for the current use of mergesort.c by the views, the files to sort are small (up to 30 MB max). Later this will have to change, as the files to sort can have any size, from a few KB to hundreds of MBs or GBs. I'll look for an alternative external mergesort implementation that allows controlling the max buffer size, merging only a group of already-sorted files (like Erlang's file_sorter allows), and that is ideally more optimized as well (N-way merge instead of a fixed 2-way merge, etc.).
Comment by Filipe Manana [ 16/May/13 ]
There's a new and improved on-disk file sorter (better flexibility, error handling, and some performance optimizations) now in the master branch.
It's being used for views already.

Introduced in:

https://github.com/couchbase/couchstore/commit/fdb0da52a1e3c059fef3fa7e74ec54b03e62d5db

Advantages:

1) Allow in-memory buffer sizes to be bounded by number of
   bytes, unlike mergesort.c which bounds buffers by number of
   records regardless of their sizes.

2) Allow for N-way merges, giving better performance due to a
   significant reduction in moving records between temporary
   files (see the sketch after this list);

3) Some optimizations to avoid unnecessary moving of records
   between temporary files (especially when the total number of
   records is smaller than the buffer size);

4) Allow specifying which directory is used to store temporary
   files. mergesort.c uses the C stdlib function tmpfile() to
   create temporary files - the standard doesn't specify in which
   directory such files are created, but on GNU/Linux it seems to
   be /tmp (see http://linux.die.net/man/3/tmpfile).
   For database compaction and index compaction, it's important
   to use a directory within the configured database and index
   directories (settings database_dir and view_index_dir),
   because those directories are what the administrator configured
   and may be on a disk drive that offers better performance or
   simply has more available space, for example.
   Further, on some systems /tmp might map to a tmpfs mount, which
   is an in-memory filesystem (http://en.wikipedia.org/wiki/Tmpfs);

5) Better and more fine-grained error handling. Compare with MB-8055:
   the mergesort.c module completely ignored read errors when reading
   from the temporary files, which could lead to silent data loss.
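
To illustrate advantage (2), here is a minimal sketch of an N-way merge of already-sorted runs. It uses Python's standard heapq.merge purely to show the idea; it is not the couchstore implementation, and write_record is a hypothetical callback:

    import heapq

    def merge_runs(runs, write_record):
        """runs: list of iterables, each already sorted; write_record:
        hypothetical callback that appends one record to the final sorted
        output. Merging all runs in a single pass avoids the repeated
        copying of records between temporary files that a fixed 2-way
        merge requires."""
        for rec in heapq.merge(*runs):
            write_record(rec)

    # Example: merge_runs([[1, 4, 7], [2, 5], [3, 6]], print) prints 1..7 in order.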
Comment by Filipe Manana [ 16/May/13 ]
See above.
Since this is core database code, I believe it belongs to you.
Comment by Maria McDuff (Inactive) [ 10/Feb/14 ]
Aaron,

is this going to be in 3.0?
Comment by Aaron Miller (Inactive) [ 18/Feb/14 ]
I wouldn't count on it. This sort of thing affects views a lot more than the storage files, and the view code has already been modified to use the newer disk sort.

This is unlikely to cause any problems with storage file compaction, as the sizes of the records in storage files can't grow arbitrarily.

Using the newer sort will probably perform better, but *not* using it shouldn't cause any problems, making this issue more of a performance enhancement than a bug; as such it will probably lose out to other issues I'm working on for 3.0 and 2.5.x.




[MB-8022] Fsync optimizations (remove double fsyncs) Created: 05/Feb/13  Updated: 01/Apr/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 2.0, 2.0.1, 2.1.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dipti Borkar Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: PM-PRIORITIZED
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Aaron Miller (Inactive) [ 28/Mar/13 ]
There is a toy build that Ronnie is testing to see the potential performance impact of this (toy-aaron #1022).
Comment by Maria McDuff (Inactive) [ 10/Apr/13 ]
Jin will update the use-case scenario that QE will run.
Comment by Jin Lim [ 11/Apr/13 ]
This feature is to optimize disk writes from the ep-engine/couchstore.

Any existing test that measures disk drain rate should show any tangible improvement from the feature.
Baseline:
* Heavy DGM
* Write-heavy (read: 20%, write: 80%)
* Write I/O should be a mix of set/delete/update
* Measure disk drain rate and cbstats' kvtimings (writeTime, commit, save_documents)
Comment by Aaron Miller (Inactive) [ 11/Apr/13 ]
The most complicated part of this change is the addition of a corruption check that must be run the first time a file is opened after the server comes up, since we're buying these perf gains by playing a bit more fast and loose with the disk.

To check that this is behaving correctly we'll want to make sure that corrupting the most-recent transaction in a storage file rolls that transaction back.

This could be accomplished by updating an item that will land in a known vbucket, shutting down the server, and flipping some bits around the end of the file. The update should be rolled back when the server comes back up, and nothing should freak out :)

A position guaranteed to affect an item body from the most recent transaction is 4095 bytes behind the last position in the file that is a multiple of 4096, or: floor(file_length / 4096) * 4096 - 4095
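
A rough sketch of that corruption step (the offset formula is the one above; the file path argument and the choice to XOR the byte with 0xFF are assumptions for illustration):

    import os

    def corrupt_last_transaction(path):
        """Flip every bit of the byte at floor(file_length / 4096) * 4096 - 4095,
        which the comment above says is guaranteed to hit an item body from
        the most recent transaction."""
        size = os.path.getsize(path)
        assert size >= 4096, "file too small for this sketch"
        offset = (size // 4096) * 4096 - 4095
        with open(path, "r+b") as f:
            f.seek(offset)
            original = f.read(1)
            f.seek(offset)
            f.write(bytes([original[0] ^ 0xFF]))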
Comment by Maria McDuff (Inactive) [ 16/Apr/13 ]
abhinav,
will you be able to craft a test that involves updating an item and manipulating the bits at EOF? This seems tricky; let's discuss with Jin/Aaron.
Comment by Dipti Borkar [ 19/Apr/13 ]
I don't think this is user visible and so doesn't make sense to include in the release notes.
Comment by Maria McDuff (Inactive) [ 19/Apr/13 ]
aaron, pls assign back to QE (Abhinav) once you've merged the fix.
Comment by kzeller [ 22/Apr/13 ]
Updated 4/22 - No docs needed
Comment by Maria McDuff (Inactive) [ 22/Apr/13 ]
Aaron, can you also include the code changes for review here as soon as you have checked-in the fix?
thanks.
Comment by Maria McDuff (Inactive) [ 23/Apr/13 ]
deferred.
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Hi Aaron, are you working on this for 3.0? If yes, could you push this to fix version 3.0?
Comment by Cihan Biyikoglu [ 01/Apr/14 ]
Chiyoung, pls close if this isn't relevant anymore, given this is a year old.




[MB-7177] lack of fsyncs in view engine may lead to silent index corruption Created: 13/Nov/12  Updated: 11/Mar/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Rahim Yaseen
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
SUBJ. I found out about this in a discussion with Filipe about how views work.

If I understood correctly, it doesn't fsync at all, silently assuming that if there's a valid header then the preceding data is valid as well. Which is clearly not true.

IMHO that's a massive blocker that needs to be fixed sooner rather than later.

 Comments   
Comment by Steve Yen [ 14/Nov/12 ]
bug-scrub -- assigned to yaseen
Comment by Aleksey Kondratenko [ 14/Nov/12 ]
A comment was made that this cannot be silent index corruption due to CRC-ing of all btree nodes. But my point still holds: if there's data corruption we'll only know at query time, and people will have to experience downtime to manually rebuild the index.
Comment by Steve Yen [ 15/Nov/12 ]
per bug scrub
Comment by Farshid Ghods (Inactive) [ 26/Nov/12 ]
Deep and Iryna have tried a scenario where they rebooted the system and did not hit this issue.
Comment by Steve Yen [ 26/Nov/12 ]
to .next per bug-scrub.

QE reports that deep & iryna tried to reproduce this and couldn't yet.
Comment by Aleksey Kondratenko [ 26/Nov/12 ]
It appears that the move to .next was based on the same old "we cannot reproduce" logic. It appears that we continue to under-prioritize IMHO important bugs merely because they're hard to reproduce.

With that logic I'm sure we'll forever keep moving it to the next release. If we think we don't need to fix it, IMHO it would be better to just close it.
Comment by Filipe Manana [ 04/Jan/13 ]
Due to CRC checks on every object written to a file (btree nodes), it certainly won't be silent.
Comment by Aleksey Kondratenko [ 04/Jan/13 ]
I agree. My earlier comment above (based on yours or Damien's verbal comment) has the same information.

But not being silent doesn't mean we can simply close it (or, IMHO, downgrade or forget it). Do we know exactly what will happen if querying or updating a view suddenly detects a corrupted index file?
Comment by Andrew DePue [ 21/May/13 ]
We just ran into this, or something like it. We have a development cluster and lost power to the entire cluster at once (it was a dev cluster so we didn't have backup power). The Couchbase cluster _seemed_ to start OK, but accessing certain views would result in strange behavior... mostly timeouts without any error or any indication as to what the problem could be.
Comment by Filipe Manana [ 21/May/13 ]
If there's a corruption issue with a view index file, view queries will return an explicit file_corruption error. If the corruption is in a database file, the error is only returned in a query response if the query uses stale=false. In all cases, the error (and a stack trace) is logged.

Did you see such an error in your case? Example:
http://www.couchbase.com/forums/thread/filecorruption-error-executing-view
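
For reference, a stale=false view query of the kind described above might look like the following. The bucket, design document, and view names are hypothetical, and the exact error fields in the response are an assumption; port 8092 is the view API port:

    import json
    import urllib.error
    import urllib.request

    # Hypothetical names: bucket "default", design doc "dev_test", view "by_id".
    url = ("http://127.0.0.1:8092/default/_design/dev_test/"
           "_view/by_id?stale=false")
    try:
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
    except urllib.error.HTTPError as err:
        # On errors (e.g. a corrupted file) the error body is still readable JSON.
        body = json.load(err)

    # A corrupted index is expected to surface an explicit file_corruption
    # error here; a healthy index returns the view rows instead.
    print(body.get("error"), body.get("reason"))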




[MB-6746] separate disk path for replica index (and individual design doc) disk path Created: 26/Sep/12  Updated: 20/Jun/13

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-beta
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dipti Borkar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Add, on the UI and in the REST API, the ability to set a separate disk path for replica indexes (right under the replica index check box in the setup wizard).

This will allow users to have a better disk solution if the replica index is used.

In addition, add new REST APIs to enable a separate disk path for each design document (not in the UI, only in REST).

 Comments   
Comment by Aleksey Kondratenko [ 28/Sep/12 ]
Dipti, this checkbox in the setup wizard is for the default bucket, not a cluster-wide setting.

Also, are you really sure we need this? I mean, RAID 0 for views looks even better from a performance perspective.
Comment by Aleksey Kondratenko [ 04/Oct/12 ]
We discussed already that I can't do that without more instructions.
Comment by Peter Wansch (Inactive) [ 08/Oct/12 ]
Change too invasive for 2.0
Comment by Steve Yen [ 25/Oct/12 ]
alk would be a better assignee for this than peter
Comment by Aleksey Kondratenko [ 20/Jun/13 ]
Given this is per-bucket/per-node, we don't have a place for it in the current UI design.

And I'm not sure we really need this. I seriously doubt it, honestly speaking.




[MB-6527] Tools to Index and compact database/indexes when the server is offline Created: 05/Sep/12  Updated: 01/Apr/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Karan Kumar (Inactive) Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
This is from the supportability point of view.
Customers may, for whatever reason, bring their nodes down, e.g. for maintenance.

When they bring the node back up, we would hopefully have all the compaction/indexing finished for that particular node.

We need a way to index and compact data (database and index) if possible when the nodes are offline.

 Comments   
Comment by Cihan Biyikoglu [ 20/Mar/14 ]
Anil, could you pull this into 3.0 if this is happening in 3.0 timeline?




[MB-6450] Finalize doc editing API and implementation Created: 27/Aug/12  Updated: 19/May/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-beta
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
We need to:

* avoid loading and displaying blobs on UI

* avoid seeing deleted docs

* handle warmup, node being down in REST API implementation and UI


 Comments   
Comment by Tug Grall (Inactive) [ 09/Sep/13 ]
As a follow up to our discussion...

We have an issue with the console where, for example, the user stores the following (using the PHP SDK, but it's more or less the same with other SDKs):

$cb->set("T2","2.0");

The data on disk looks like (couchstore dump)
     id: T2
     rev: 6
     content_meta: 128
     cas: 76870971085289, expiry: 0, flags: 0
     data: (snappy) 2.0

When I read it back, the value is:

echo( $cb->get("T2") );
2.0

But in the Couchbase console the value is shown as:
2




Comment by Tug Grall (Inactive) [ 09/Sep/13 ]
Another issue that I have found that is related to this global work:

If you store:
$cb->set("T3","2,0");

On disk it is:
     id: T3
     rev: 8
     content_meta: 131
     cas: 76870971228863, expiry: 0, flags: 0
     data: (snappy) 2,0

In the console, document view:
"SyntaxError: JSON.parse: unexpected non-whitespace character after JSON data (Document is invalid JSON)"
It does not show any value (where it should show a "base64" string).

In the console, preview document in the view editor:
"Lost connection to server at 127.0.0.1:8091. Repeating in 1 seconds. Retry now" and nothing is shown.

The XHR (for example http://127.0.0.1:8091/couchBase/default/T3?_=1378774035807) returns 2,0 where it "should" be base64, not a JSON "thing".
Comment by Aleksey Kondratenko [ 10/Sep/13 ]
Tug, the issue we discussed above indeed applies to the UI. _But_ there's also a full-stack issue, and IMHO a larger one.

I've cc-ed some bosses so that they're aware too. So let me elaborate and raise it.

Our product is positioned as a "document database". Yet what you've shown me the PHP SDK doing is:

* when the number 2.0 is passed to set(), it gets silently converted to the array of bytes "2", which is stored on the server. And we treat that as valid JSON (well, at least in views; we've long had plans for a flag that would allow memcached to refuse non-JSON values). So a map function will see it as 2.0 (JS numbers are always floating point). So far, good.

* when the string "2.0" is sent, it still gets sent as the array of bytes "2.0", which is stored on the server. This time the map function will see a _number_ again. Which is arguably not good: the application is seeing a string, yet the view is seeing a number.

* when the string "2garbage" is sent, it gets to the server as "2garbage". And this is not valid JSON (note, the quotes are _mine_ and are not part of the value). So views will see it (due to some arguably questionable decision) as the base64 encoding of that octet string.

My point is: if we seriously want to be a JSON store, we should consider doing something on the clients as well, so that, for example, "asdasd" is sent as "\"asdasd\"" (quoted to be a JSON string), and views see roughly the same value as your app.

This is in my opinion a larger issue than the UI problem, which I'm going to fix soon. And let me note again that there is _no plan at all_ to support displaying, let alone editing, arbitrary values (blobs) in the UI. We'll limit ourselves to JSON values only, and quite possibly not even all types of JSON values.
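
A minimal sketch of the client-side behaviour being suggested, in plain Python with no SDK involved: always JSON-encode the application value before storing and decode it after reading, so the bytes on the server are valid JSON and views see the same value as the app.

    import json

    def encode_for_store(value):
        # The string "2.0" becomes '"2.0"', so views see a JSON string,
        # while the number 2.0 becomes '2.0' and views see a number.
        return json.dumps(value)

    def decode_from_store(raw):
        return json.loads(raw)

    assert encode_for_store("2.0") == '"2.0"'
    assert encode_for_store(2.0) == '2.0'
    assert decode_from_store(encode_for_store("asdasd")) == "asdasd"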




[MB-4840] hotfix release should reflect the change(version#) on the web console Created: 27/Feb/12  Updated: 11/Mar/14

Status: In Progress
Project: Couchbase Server
Component/s: build
Affects Version/s: None
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Farshid Ghods (Inactive) Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Comments   
Comment by Phil Labee [ 04/Mar/13 ]
When a hotfix is generated, capture the manifest.xml file from the build. Diff it with the GA manifest and update the installed manifest by appending the new commit info.
Comment by Steve Yen [ 18/Mar/13 ]
Related to this, need to update the hotfix creation procedures / docs.
Comment by Steve Yen [ 18/Mar/13 ]
Also, easier to wait for the next hotfix before updating this.
Comment by Phil Labee [ 17/Apr/13 ]
I need the diagnostic info that customers send to support, the one that includes the manifest.xml file.

If the customer ran the update.sh script the manifest.xml should have been updated in a way that is reflected in the diagnostic info.

I'd also like general feedback on the hot-fix installation process, as I am trying to make it easier and more reliable.
Comment by Maria McDuff (Inactive) [ 03/Jul/13 ]
tony,

can u quickly verify this?
you have to run a hotfix release -- get it from phil.




[MB-4785] Meaningful alert when low-level packet corruption on node Created: 08/Feb/12  Updated: 21/Nov/13

Status: Reopened
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 1.7.2
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Tim Smith (Inactive) Assignee: Anil Kumar
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: See http://www.couchbase.com/issues/browse/CBSE-101 for complete details


 Description   
Logs showed that some low-level corruption in network data was apparent. The symptom is that nodes are going up and down. It is not clear in the UI that this is happening only on 2 nodes, that it's low-level corruption, or that these nodes are consistently having a problem and need to be failed over. No info bubbles up about why the node flaps up and down, or how to report this up to the data center or Amazon (in this case on EC2).

We need a clear alert to the user, suggesting failing over the troublesome node. Ideally it would include concrete examples of the corrupt data to pass on to data center ops.

 Comments   
Comment by Aleksey Kondratenko [ 08/Feb/12 ]
are you sure this is really critical ?
Comment by Tim Smith (Inactive) [ 08/Feb/12 ]
My priority calibration may be off here. It is OK for product management or whoever to re-triage this request based on a larger picture of priorities.

Tim
Comment by Farshid Ghods (Inactive) [ 08/Feb/12 ]
I would actually rephrase this bug to say that the node status should become red when ns_server detects a corruption during send/receive, and change the issue type from enhancement to bug.

And the fact that this happened in an EC2 environment makes it more important.
Comment by Peter Wansch (Inactive) [ 19/Jul/12 ]
Maybe this has been resolved because of recent infinity fixes in Erlang
Comment by Aleksey Kondratenko [ 07/Sep/12 ]
Not fixed.

We do think we've fixed the cause of this.

But if this happens again, the only thing we'll see is the node being red for a moment in the UI.

Unfortunately Erlang doesn't provide us a way to monitor and react to this particular condition. It'll just be a disconnect, and you cannot know why.

So the only way to fix this seems to be extending Erlang's VM.
Comment by Aleksey Kondratenko [ 20/Sep/12 ]
We haven't fixed it.

We think we _have_ fixed CBSE-whatever by working around some unknown, subtle bug in infinity trapping via signals that's specific to Linux on EC2 (or any Linux, or any Xen, we don't know).

This particular request is to make the condition where low-level Erlang code detects packet corruption and disconnects a pair of nodes _visible to the end user_, via an alert in particular. It makes sense to me.

Regarding what Farshid said: we _do_ mark the node as red, but the next second we re-establish the connection and things work again, until this happens the next time.
Comment by Aleksey Kondratenko [ 20/Sep/12 ]
Also, I think we can fix it, but in a not necessarily future-proof or pleasant way: we can grep for the log message that Erlang logs via the error-logging facility that our logger implementation intercepts. That seems like the only path (excluding Erlang VM modification) that can produce alerts from this kind of event.
Comment by Aleksey Kondratenko [ 12/Aug/13 ]
Depends on non-poor-man's alerts
Comment by Brian Shumate [ 21/Nov/13 ]
It would be helpful if network errors or network partition conditions could
be logged and represented almost in the same manner as the uptime command's
representation of load average, i.e. number of network issues in the last
5/15/30 minutes or similar somewhere in the web console UI.




[MB-9199] Dynamic centralized configuration management Created: 21/May/12  Updated: 25/Apr/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.1, 2.0, 2.0.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dipti Borkar Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links: