[MB-12197] [Windows]: Bucket deletion failing with error 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue."} Created: 16/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Meenakshi Goel Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: windows, windows-3.0-beta, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.1-1299-rel

Attachments: Text File test.txt    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Ref Link:
http://qa.hq.northscale.net/job/win_2008_x64--14_01--replica_read-P0/32/consoleFull
http://qa.hq.northscale.net/job/win_2008_x64--59--01--bucket_flush-P1/14/console
http://qa.hq.northscale.net/job/win_2008_x64--59_01--warmup-P1/6/consoleFull

Test to Reproduce:
newmemcapable.GetrTests.getr_test,nodes_init=4,GROUP=P0,expiration=60,wait_expiration=true,error=Not found for vbucket,descr=#simple getr replica_count=1 expiration=60 flags = 0 docs_ops=create cluster ops = None
flush.bucketflush.BucketFlushTests.bucketflush,items=20000,nodes_in=3,GROUP=P0

*Note that the test itself doesn't fail, but subsequent tests fail with "error 400 reason: unknown ["Prepare join failed. Node is already part of cluster."]" because cleanup wasn't successful.

Logs:
[rebalance:error,2014-09-15T9:36:01.989,ns_1@10.3.121.182:<0.6938.0>:ns_rebalancer:do_wait_buckets_shutdown:307]Failed to wait deletion of some buckets on some nodes: [{'ns_1@10.3.121.182',
                                                         {'EXIT',
                                                          {old_buckets_shutdown_wait_failed,
                                                           ["default"]}}}]

[error_logger:error,2014-09-15T9:36:01.989,ns_1@10.3.121.182:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.6938.0>
    registered_name: []
    exception exit: {buckets_shutdown_wait_failed,
                        [{'ns_1@10.3.121.182',
                             {'EXIT',
                                 {old_buckets_shutdown_wait_failed,
                                     ["default"]}}}]}
      in function ns_rebalancer:do_wait_buckets_shutdown/1 (src/ns_rebalancer.erl, line 308)
      in call from ns_rebalancer:rebalance/5 (src/ns_rebalancer.erl, line 361)
    ancestors: [<0.811.0>,mb_master_sup,mb_master,ns_server_sup,
                  ns_server_cluster_sup,<0.57.0>]
    messages: []
    links: [<0.811.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 46422
    stack_size: 27
    reductions: 5472
  neighbours:

[user:info,2014-09-15T9:36:01.989,ns_1@10.3.121.182:<0.811.0>:ns_orchestrator:handle_info:483]Rebalance exited with reason {buckets_shutdown_wait_failed,
                              [{'ns_1@10.3.121.182',
                                {'EXIT',
                                 {old_buckets_shutdown_wait_failed,
                                  ["default"]}}}]}
[ns_server:error,2014-09-15T9:36:09.645,ns_1@10.3.121.182:ns_memcached-default<0.4908.0>:ns_memcached:terminate:798]Failed to delete bucket "default": {error,{badmatch,{error,closed}}}

Uploading Logs

 Comments   
Comment by Meenakshi Goel [ 16/Sep/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-12197/11dd43ca/10.3.121.182-9152014-938-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/e7795065/10.3.121.183-9152014-940-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/6442301b/10.3.121.102-9152014-942-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/10edf209/10.3.121.107-9152014-943-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/9f16f503/10.1.2.66-9152014-945-diag.zip
Comment by Ketaki Gangal [ 16/Sep/14 ]
Assigning to ns_server team for a first look.
Comment by Aleksey Kondratenko [ 16/Sep/14 ]
For cases like this it's very useful to get a sample of backtraces from memcached on the bad node. Is it still running?
Comment by Aleksey Kondratenko [ 16/Sep/14 ]
Eh. It's windows....
Comment by Aleksey Kondratenko [ 17/Sep/14 ]
I've merged a diagnostics commit (http://review.couchbase.org/41463). Please rerun, reproduce, and give me a new set of logs.
Comment by Meenakshi Goel [ 18/Sep/14 ]
Tested with 3.0.1-1307-rel. Please find logs below.
https://s3.amazonaws.com/bugdb/jira/MB-12197/c2191900/10.3.121.182-9172014-2245-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/28bc4a83/10.3.121.183-9172014-2246-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/8f1efbe5/10.3.121.102-9172014-2248-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/91a89d6a/10.3.121.107-9172014-2249-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/2d272074/10.1.2.66-9172014-2251-diag.zip
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
BTW I am indeed quite interested if this is specific to windows or not.
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
This continues to be superweird. Possibly another erlang bug. I need somebody to answer the following:

* can we reliably reproduce this on Windows?

* 100% of the time?

* if not, (roughly) how often?

* can we reproduce this (at all) on GNU/Linux? How frequently?




[MB-12158] erlang gets stuck in gen_tcp:send despite socket being closed (was: Replication queue grows unbounded after graceful failover) Created: 09/Sep/14  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File dcp_proxy.beam    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
After speaking with Mike briefly, it sounds like this may be a known issue. My apologies if there is a duplicate issue already filed.

Logs are here:
 https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-176-128-88.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-193-231-33.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-219-111-249.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-219-84-241.us-west-1.compute.amazonaws.com.zip

 Comments   
Comment by Mike Wiederhold [ 10/Sep/14 ]
Perry,

The stats seem to be missing for dcp streams so I cannot look further into this. If you can still reproduce this on 3.0 build 1209 then assign it back to me and include the logs.
Comment by Perry Krug [ 11/Sep/14 ]
Mike, does the cbcollect_info include these stats or do you need me to gather something specifically when the problem occurs?

If not, let's also get them included for future builds...
Comment by Perry Krug [ 11/Sep/14 ]
Hey Mike, I'm having a hard time reproducing this on build 1209 where it seemed rather easy on previous builds. Do you think any of the changes from the "bad_replicas" bug would have affected this? Is it worth reproducing on a previous build where it was easier in order to get the right logs/stats or do you think it may be fixed already?
Comment by Mike Wiederhold [ 11/Sep/14 ]
This very well could be related to MB-12137. I'll take a look at the cluster and if I don't find anything worth investigating further then I think we should close this as cannot reproduce since it doesn't seem to happen anymore on build 1209. If there is still a problem I'm sure it will be reproduced again later in one of our performance tests.
Comment by Mike Wiederhold [ 11/Sep/14 ]
It looks like one of the dcp connections to the failed over node was still active. My guess is that the node went down and came back up quickly. As a result it's possible that ns_server re-established the connection with the downed node. Can you attach the logs and assign this to Alk so he can take a look?
Comment by Perry Krug [ 11/Sep/14 ]
Thanks Mike.

Alk, logs are attached from the first time this was reproduced. Let me know if you need me to do so again.

Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Mike, btw for the future: if you could post exact details (i.e. node and name of connection) of the stuff you want me to double-check/explain, it could have saved me time.

Also, let me note that it's replica and node master who establishes replication. I.e. we're "pulling" rather than "pushing" replication.

I'll look at all this and see if I can find something.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Sorry, replica instead of master, who initiates replication.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Indeed I'm seeing a dcp connection from memcached on .33 to beam of .88. And it appears that something in the dcp replicator is stuck. I'll need a bit more time to figure this out.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Looks like socket send gets blocked somehow despite socket actually being closed already.

Might be serious enough to be a show stopper for 3.0.

Do you by any chance still have nodes running? Or if not, can you easily reproduce this? Having direct access to bad node might be very handy to diagnose this further.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Moved back to 3.0, because if it's indeed an erlang bug it might be very hard to fix, and because it may happen not just during failover.
Comment by Cihan Biyikoglu [ 12/Sep/14 ]
triage - need an update pls.
Comment by Perry Krug [ 12/Sep/14 ]
I'm reproducing now and will post both the logs and the live systems momentarily
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Able to reproduce this condition with erlang outside of our product (which is great news):

* connect gen_tcp socket to nc or irb process listening

* spawn erlang process that will send stuff infinitely on that socket and will eventually block

* from erlang console do gen_tcp:close (i.e. while other erlang process is blocked writing)

* observe how erlang process that's blocked is still blocked

* observe with lsof that socket isn't really closed

* close the socket on the other end (by killing nc)

* observe with lsof that socket is closed

* observe how erlang process is still blocked (!) despite underlying socket fully dead

The fact that it's not a race is really great, because dealing with a deterministic bug (even if it's a "feature" from erlang's point of view) is much easier.
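A minimal sketch of that repro from an erlang shell (illustrative only: it assumes a plain TCP listener such as "nc -l 9999" on localhost, and the port number and payload size are arbitrary):

{ok, Sock} = gen_tcp:connect("127.0.0.1", 9999, [binary, {active, false}]),
Payload = binary:copy(<<0>>, 1024 * 1024),
SendLoop = fun(Loop) -> ok = gen_tcp:send(Sock, Payload), Loop(Loop) end,
Sender = spawn(fun() -> SendLoop(SendLoop) end),
%% once the listener stops reading, Sender blocks inside gen_tcp:send
ok = gen_tcp:close(Sock),          %% Sender stays blocked; lsof still shows the socket
%% now kill the nc process; lsof shows the socket fully gone, yet:
process_info(Sender, current_function).
%% still shows the process sitting in the gen_tcp/prim_inet send path, i.e. blocked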
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Fix is at: http://review.couchbase.org/41396

I need approval to get this in 3.0.0.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Attaching fixed dcp_proxy.beam if somebody wants to be able to test the fix without waiting for build
Comment by Perry Krug [ 12/Sep/14 ]
Awesome as usual Alk, thanks very much.

I'll give this a try on my side for verification.
Comment by Parag Agarwal [ 12/Sep/14 ]
Alk, will this issue occur in TAP as well, e.g. during upgrades?
Comment by Mike Wiederhold [ 12/Sep/14 ]
Alk,

I apologize for not including a better description of what happened. In the future I'll make sure to leave better details before assigning bugs to others so that we don't have multiple people duplicating the same work.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
>> Alk, will this issue occur in TAP as well? during upgrades.

No.
Comment by Perry Krug [ 12/Sep/14 ]
As of yet unable to reproduce this on build 1209+dcp_proxy.beam.

Thanks for the quick turnaround Alk.
Comment by Cihan Biyikoglu [ 12/Sep/14 ]
triage discussion:
under load this may happen frequently -
there is a good chance that this recovers itself in a few mins - it should, but we should validate.
if we are in this state, we can restart erlang to get out of the situation - no app unavailability required
fix could be risky to take at this point

decision: not taking this for 3.0
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Mike, need you ACK on this:

Because of dcp nops between replicators, the dcp producer should, after a few minutes, close its side of the socket and release all resources.

Am I right? I said this in the meeting just a few minutes ago and it affected the decision. If I'm wrong (say, if you decided to disable nops in the end, or if you know it's broken, etc.), then we need to know.
Comment by Perry Krug [ 12/Sep/14 ]
FWIW, I have seen that this does not recover after a few minutes. However, I agree that it is workaround-able either by restarting beam or by bringing the node back into the cluster. Unless we think this will happen much more often, I agree it could be deferred out of 3.0.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Well, if it does not recover then it can be argued that we have another bug on the ep-engine side that may lead to similar badness (queue size and resources eaten) _without_ a clean workaround.

Mike, we'll need your input on DCP NOPs.
Comment by Mike Wiederhold [ 12/Sep/14 ]
I was curious about this myself. As far as I know the noop code is working properly and we have some tests to make sure it is. I can work with Perry to try to figure out what is going on on the ep-engine side and see if the noops are actually being sent. I know this sounds unlikely, but I was curious whether or not the noops were making it through to the failed over node for some reason.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
>> I know this sounds unlikely, but I was curious whether or not the noops were making it through to the failed over node for some reason.

I can rule this out. We do have a connection between the destination's beam and the source's memcached. And we _don't_ have the beam's connection to the destination's memcached anymore. Erlang is stuck writing to a dead socket. So there's no way you could get nop acks back.
Comment by Perry Krug [ 15/Sep/14 ]
I've confirmed that this state persists for much longer than a few minutes...I've not ever seen it recover itself, and have left it to run for 15-20 minutes at least.

Do you need a live system to diagnose?
Comment by Cihan Biyikoglu [ 15/Sep/14 ]
Thanks for the update. Mike, sounds like we should open an issue for DCP to reliably detect these conditions. We should add this in for 3.0.1.
Perry, could you confirm that restarting the erlang process resolves the issue?
thanks
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41410

Mike will open different ticket for NOPs in DCP.




[MB-11998] Working set is screwed up during rebalance with delta recovery (>95% cache miss rate) Created: 18/Aug/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Venu Uppalapati
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1169

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File cache_miss_rate.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares-dev/45/artifact/
Is this a Regression?: No

 Description   
1 of 4 nodes is being re-added after failover.
500M x 2KB items, 10K mixed ops/sec.

Steps:
1. Failover one of nodes.
2. Add it back.
3. Enable delta recovery.
4. Sleep 20 minutes.
5. Rebalance cluster.

 Comments   
Comment by Abhinav Dangeti [ 17/Sep/14 ]
Warming up during the delta recovery without an access log seems to be the cause for this.
Comment by Abhinav Dangeti [ 18/Sep/14 ]
Venu, my suspicion here is that there was no access log generated during the course of this test. Can you set the access log task time to zero, and its sleep interval to say 5-10 minutes and retest this scenario? I think you will need to be using the performance framework to be able to plot the cache miss ratio.




[MB-12210] xdcr related services sometimes log debug and error messages to non-xdcr logs (was: XDCR Error Logging Improvement) Created: 18/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Chris Malarky Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: logging
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
When debugging an XDCR issue, some very useful information was in ns_server.error.log but not in ns_server.xdcr_errors.log.

ns_server.xdcr_errors.log:

[xdcr:error,2014-09-18T7:02:12.674,ns_1@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:<0.8020.1657>:xdc_vbucket_rep:init_replication_state:496]Error in fetching remot bucket, error: timeout,sleep for 30 secs before retry.
[xdcr:error,2014-09-18T7:02:12.674,ns_1@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:<0.8021.1657>:xdc_vbucket_rep:init_replication_state:503]Error in fetching remot bucket, error: all_nodes_failed, msg: <<"Failed to grab remote bucket `wi_backup_bucket_` from any of known nodes">>sleep for 30 secs before retry

ns_server.error.log:

[ns_server:error,2014-09-18T7:02:12.674,ns_1@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:<0.8022.1657>:remote_clusters_info: do_mk_json_get:1460]Request to http://Administrator:****@10.x.x.x:8091/pools failed:
{error,rest_error,
       <<"Error connect_timeout happened during REST call get to http://10.x.x.x:8091/pools.">>,
       {error,connect_timeout}}
[ns_server:error,2014-09-18T7:02:12.674,ns_1@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:remote_clusters_info<0.20250.6>: remote_clusters_info:handle_info:435]Failed to grab remote bucket `wi_backup_bucket_`: {error,rest_error,
                                                   <<"Error connect_timeout happened during REST call get to http://10.x.x.x:8091/pools.">>,
                                                   {error,connect_timeout}}

Is there any way these messages could appear in the xdcr_errors.log?

 Comments   
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
Yes, valid request. Some of that, but not all, has been addressed in 3.0.
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
Good candidate for 3.0.1, but not necessarily important enough, i.e. in light of the ongoing rewrite.




[MB-12211] Investigate noop not closing connection in case where a dead connection is still attached to a failed node Created: 18/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
See MB-12158 for information on how to reproduce this issue and why it needs to be looked at on the ep-engine side.




[MB-12209] [windows] failed to offline upgrade from 2.5.x to 3.0.1-1299 Created: 18/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows server 2008 r2 64-bit

Attachments: Zip Archive 12.11.10.145-9182014-1010-diag.zip     Zip Archive 12.11.10.145-9182014-922-diag.zip    
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Install couchbase server 2.5.1 on one node
Create default bucket
Load 1000 items to bucket
Offline upgrade from 2.5.1 to 3.0.1-1299
After the upgrade, the node was reset to initial setup


 Comments   
Comment by Thuan Nguyen [ 18/Sep/14 ]
I got the same issue when offline upgrading from 2.5.0 to 3.0.1-1299. Updated the title.
Comment by Thuan Nguyen [ 18/Sep/14 ]
cbcollectinfo of the node that failed to offline upgrade from 2.5.0 to 3.0.1-1299.
Comment by Bin Cui [ 18/Sep/14 ]
http://review.couchbase.org/#/c/41473/




[MB-6972] distribute couchbase-server through yum and ubuntu package repositories Created: 19/Oct/12  Updated: 18/Sep/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Anil Kumar Assignee: Phil Labee
Resolution: Unresolved Votes: 3
Labels: devX
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks MB-8693 [Doc] distribute couchbase-server thr... Reopened
blocks MB-7821 yum install couchbase-server from cou... Resolved
Duplicate
duplicates MB-2299 Create signed RPM's Resolved
is duplicated by MB-9409 repository for deb packages (debian&u... Resolved
Flagged:
Release Note

 Description   
This helps us in handling dependencies that are needed for Couchbase Server.
The SDK team has already implemented this for various SDK packages.

We might have to make some changes to our packaging metadata to work with this schema.

 Comments   
Comment by Steve Yen [ 26/Nov/12 ]
to 2.0.2 per bug-scrub

first step is to do the repositories?
Comment by Steve Yen [ 26/Nov/12 ]
back to 2.0.1, per bug-scrub
Comment by Farshid Ghods (Inactive) [ 19/Dec/12 ]
Phil,
please sync up with Farshid and get instructions that Sergey and Pavel sent
Comment by Farshid Ghods (Inactive) [ 28/Jan/13 ]
We should resolve this task once 2.0.1 is released.
Comment by Dipti Borkar [ 29/Jan/13 ]
Have we figured out the upgrade process moving forward, for example from 2.0.1 to 2.0.2 or 2.0.1 to 2.1?
Comment by Jin Lim [ 04/Feb/13 ]
Please ensure that we also confirm/validate the upgrade process moving from 2.0.1 to 2.0.2. Thanks.
Comment by Phil Labee [ 06/Feb/13 ]
Now have DEB repo working, but another issue has come up: We need to distribute the public key so that users can install the key before running apt-get.

wiki page has been updated.
Comment by kzeller [ 14/Feb/13 ]
Added to 2.0.1 RN as:

Fix:

We now provide Couchbase Server through yum and Debian package repositories.
Comment by Matt Ingenthron [ 09/Apr/13 ]
What are the public URLs for these repositories? This was mentioned in the release notes here:
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-server-rn_2-0-0l.html
Comment by Matt Ingenthron [ 09/Apr/13 ]
Reopening, since this isn't documented that I can find. Apologies if I'm just missing it.
Comment by Dipti Borkar [ 23/Apr/13 ]
Anil, can you work with Phil to see what are the next steps here?
Comment by Anil Kumar [ 24/Apr/13 ]
Yes I'll be having discussion with Phil and will update here with details.
Comment by Tim Ray [ 28/Apr/13 ]
Could we either remove the note about yum/deb repos in the release notes or get those repo locations / sample files / keys added to public pages? The only links that seem like they 'might' contain the info point to internal pages I don't have access to.
Comment by Anil Kumar [ 14/May/13 ]
Thanks Tim, we have removed it from the release notes. We will add instructions about the yum/deb repo locations/files/keys to the documentation once it's available. Thanks!
Comment by kzeller [ 14/May/13 ]
Removing duplicate ticket:

http://www.couchbase.com/issues/browse/MB-7860
Comment by h0nIg [ 24/Oct/13 ]
Any update? Maybe I created a duplicate issue: http://www.couchbase.com/issues/browse/MB-9409, but it seems that the repositories are outdated on http://hub.internal.couchbase.com/confluence/display/CR/How+to+Use+a+Linux+Repo+--+debian
Comment by Sriram Melkote [ 22/Apr/14 ]
I tried to install on Debian today. It failed badly. One .deb package didn't match the libc version of stable. The other didn't match the openssl version. Changing libc or openssl is simply not an option for someone using Debian stable because it messes with the base OS too deeply. So as of 4/23/14, we don't have support for Debian.
Comment by Sriram Melkote [ 22/Apr/14 ]
Anil, we have accumulated a lot of input in this bug. I don't think this will realistically go anywhere for 3.0 unless we define specific goals and a considered platform support matrix expansion. Can you please define the goal for 3.0 more precisely?
Comment by Matt Ingenthron [ 22/Apr/14 ]
+1 on Siri's comments. Conversations I had with both Ubuntu (who recommend their PPAs) and Red Hat experts (who recommend setting up a repo or getting into EPEL or the like) indicated that's the best way to ensure coverage of all OSs. Binary packages built on one OS and deployed on another are risky and run into dependency issues.
Comment by Anil Kumar [ 28/Apr/14 ]
This ticket is specifically for distributing DEB and RPM packages through YUM and APT repos. We have another ticket for supporting the Debian platform, MB-10960.
Comment by Anil Kumar [ 23/Jun/14 ]
Assigning ticket to Tony for verification.
Comment by Phil Labee [ 21/Jul/14 ]
Need to do before closing:

[ ] capture keys and process used for build that is currently posted (3.0.0-628), update tools and keys of record in build repo and wiki page
[ ] distribute 2.5.1 and 3.0.0-beta1 builds using same process, testing update capability
[ ] test update from 2.0.0 to 2.5.1 to 3.0.0
Comment by Phil Labee [ 21/Jul/14 ]
re-opening to assign to sprint to prepare the distribution repos for testing
Comment by Wayne Siu [ 30/Jul/14 ]
Phil,
has build 3.0.0-973 been updated in the repos for beta testing?
Comment by Wayne Siu [ 29/Aug/14 ]
Phil,
Please refresh it with build 3.0.0-1205. Thanks.
Comment by Phil Labee [ 04/Sep/14 ]
Due to loss of private keys used to post 3.0.0-628, created new key pairs. Upgrade testing was never done, so starting with 2.5.1 release version (2.5.1-1100).

upload and test using location http://packages.couchbase.com/linux-repos/TEST/:

  [X] ubuntu-12.04 x86_64
  [X] ubuntu-10.04 x86_64

  [X] centos-6-x86_64
  [X] centos-5-x86_64
Comment by Anil Kumar [ 04/Sep/14 ]
Phil / Wayne - Not sure what's happening here, please clarify.
Comment by Wayne Siu [ 16/Sep/14 ]
Please refresh with the build 3.0.0-1209.
Comment by Phil Labee [ 17/Sep/14 ]
upgrade to 3.0.0-1209:

  [ ] ubuntu-12.04 x86_64
  [ ] ubuntu-10.04 x86_64

  [X] centos-6-x86_64
  [ ] centos-5-x86_64

  [ ] debian-7-x86_64




[MB-12185] update to "couchbase" from "membase" in gerrit mirroring and manifests Created: 14/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.0, 2.5.1, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Matt Ingenthron Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-8297 Some key projects are still hosted at... Open

 Description   
One of the key components of Couchbase is still only at github.com/membase and not at github.com/couchbase. I think it's okay to mirror to both locations (not that there's an advantage), but for sure it should be at couchbase and the manifest for Couchbase Server releases should be pointing to Couchbase.

I believe the steps here are as follows:
- Set up a github.com/couchbase/memcached project (I've done that)
- Update gerrit's commit hook to update that repository
- Change the manifests to start using that repository

Assigning this to build as a component, as gerrit is handled by the build team. Then I'm guessing it'll need to be handed over to Trond or another developer to do the manifest change once gerrit is up to date.

Since memcached is slow changing now, perhaps the third item can be done earlier.

 Comments   
Comment by Chris Hillery [ 15/Sep/14 ]
Actually manifests are owned by build team too so I will do both parts.

However, the manifest for the hopefully-final release candidate already exists, and I'm a teensy bit wary about changing it after the fact. The manifest change may need to wait for 3.0.1.
Comment by Matt Ingenthron [ 15/Sep/14 ]
I'll leave it to you to work out how to fix it, but I'd just point out that manifest files are mutable.
Comment by Chris Hillery [ 15/Sep/14 ]
The manifest we build from is mutable. The historical manifests recording what we have already built really shouldn't be.
Comment by Matt Ingenthron [ 15/Sep/14 ]
True, but they are. :) That was half me calling back to our discussion about tagging and mutability of things in the Mountain View office. I'm sure you remember that late night conversation.

If you can help here Ceej, that'd be great. I'm just trying to make sure we have the cleanest project possible out there on the web. One wart less will bring me to 999,999 or so. :)
Comment by Trond Norbye [ 15/Sep/14 ]
Just a FYI, we've been ramping up the changes to memcached, so it's no longer a slow moving component ;-)
Comment by Matt Ingenthron [ 15/Sep/14 ]
Slow moving w.r.t. 3.0.0 though, right? That means the current github.com/couchbase/memcached probably has the commit planned to be released, so it's low risk to update github.com/couchbase/manifest with the couchbase repo instead of membase.

That's all I meant. :)
Comment by Trond Norbye [ 15/Sep/14 ]
_all_ components should be slow moving with respect to 3.0.0 ;)
Comment by Chris Hillery [ 16/Sep/14 ]
Matt, it appears that couchbase/memcached is a *fork* of membase/memcached, which is probably undesirable. We can actively rename the membase/memcached project to couchbase/memcached, and github will automatically forward requests from the old name to the new so it is seamless. It also means that we don't have to worry about migrating any commits, etc.

Does anything refer to couchbase/memcached already? Could we delete that one outright and then rename membase/memcached instead?
Comment by Matt Ingenthron [ 16/Sep/14 ]
Ah, that would be my fault. I propose deleting the couchbase/memcached and then transferring ownership from membase/memcached to couchbase/memcached. I think that's what you meant by "actively rename", right? Sounds like a great plan.

I think that's all in your hands Ceej, but I'd be glad to help if needed.

I still think in the interest of reducing warts, it'd be good to fix the manifest.
Comment by Chris Hillery [ 16/Sep/14 ]
I will do that (rename the repo), just please confirm explicitly that temporarily deleting couchbase/memcached won't cause the world to end. :)
Comment by Matt Ingenthron [ 16/Sep/14 ]
It won't since it didn't exist until this last Sunday when I created this ticket. If something world-ending happens as a result, I'll call it a bug to have depended on it. ;)
Comment by Chris Hillery [ 18/Sep/14 ]
I deleted couchbase/memcached and then transferred ownership of membase/memcached to couchbase. The original membase/memcached repository had a number of collaborators, most of which I think were historical. For now, couchbase/memcached only has "Owners" and "Robots" listed as collaborators, which is generally the desired configuration.

http://review.couchbase.org/#/c/41470/ proposes changes to the active manifests. I see no problem with committing that.

As for the historical manifests, there are two:

1. Sooner or later we will add a "released/3.0.0.xml" manifest to the couchbase/manifest repository, representing the exact SHAs which were built. I think it's probably OK to retroactively change the remote on that manifest since the two repositories are aliases for each other. This will affect any 3.0.0 hotfixes which are built, etc.

2. However, all of the already-built 3.0 packages (.deb / .rpm / .zip files) have embedded in them the manifest which was used to build them. Those, unfortunately, cannot be changed at this time. Doing so would require re-packaging the deliverables which have already undergone QE validation. While it is technically possible to do so, it would be a great deal of manual work, and IMHO a non-trivial and unnecessary risk. The only safe solution would be to trigger a new build, but in that case I would argue we would need to re-validate the deliverables, which I'm sure is a non-starter for PM. I'm afraid this particular sub-wart will need to wait for 3.0.1 to be fully addressed.
Comment by Matt Ingenthron [ 18/Sep/14 ]
Excellent, thanks Ceej. I think this is a great improvement, especially if 3.0.0's release manifest no longer references membase.

I'll leave it to the build team to manage, but I might suggest that gerrit and various other things pointing to membase should slowly change as well, in case someone decides someday to cancel the membase organization subscription to github.




[MB-4593] Windows Installer hangs on "Computing Space Requirements" Created: 27/Dec/11  Updated: 18/Sep/14

Status: Reopened
Project: Couchbase Server
Component/s: installer
Affects Version/s: 2.0-developer-preview-3, 2.0-developer-preview-4
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Bin Cui Assignee: Don Pinto
Resolution: Unresolved Votes: 3
Labels: windows, windows-3.0-beta, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7 Ultimate 64. Sony Vaio, i3 with 4GB RAM and 200 GB of 500 GB free. Also on a Sony Vaio, Windows 7 Ultimate 64, i7, 6 GB RAM and a 750GB drive with about 600 GB free.

Attachments: PNG File couchbase-installer.png     PNG File image001.png     PNG File ss 2014-08-28 at 4.16.09 PM.png    
Triage: Triaged

 Description   
When installing the Community Server 2.0 DP3 on Windows, the installer hangs on the "Computing space requirements" screen. There is no additional feedback from the installer. After 90-120 minutes or so, it does move forward and complete. The same issue was reported on Google Groups a few months back - http://groups.google.com/group/couchbase/browse_thread/thread/37dbba592a9c150b/f5e6d80880f7afc8?lnk=gst&q=msi.

Executable: couchbase-server-community_x86_64_2.0.0-dev-preview-3.setup.exe

WORKAROUND IN 3.0 - Create a registry key HKLM\SOFTWARE\Couchbase, name=SkipVcRuntime, type=DWORD, value=1 to skip the VC redistributable installation which is causing this issue. If the VC redistributable is necessary, it must be installed manually when the registry key is set to skip the automatic install.
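For example, such a key can be created from an elevated command prompt with the stock reg.exe tool (an illustrative invocation; on 64-bit Windows the installer reads the key under HKLM\SOFTWARE\Wow6432Node\Couchbase instead, as noted in the comments below):

reg add "HKLM\SOFTWARE\Couchbase" /v SkipVcRuntime /t REG_DWORD /d 1 /f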


 Comments   
Comment by Filip Stas [ 23/Feb/12 ]
Is there any solution for this? I'm experiencing the same problem. Running the unpacked msi does not seem to work because the Installshield setup has been configured to require installation through the exe.

Comment by Farshid Ghods (Inactive) [ 22/Mar/12 ]
from Bin:

Looks like it is related to the installshield engine. Maybe installshield tries to access the system registry and it is locked by another process. The suggestion is to shut down other running programs and try again if such a problem pops up.
Comment by Farshid Ghods (Inactive) [ 22/Mar/12 ]
we were unable to reproduce this on windows 2008 64-bit

the bug mentions this happened on windows 7 64-bit which is not a supported platform but that should not make any difference
Comment by Farshid Ghods (Inactive) [ 23/Mar/12 ]
From Bin:

Windows 7 is my dev environment, and I have no problem installing and testing it there. From your description, I cannot tell whether it failed during the installation, or whether the installation finishes but couchbase server cannot start.
 
If it is due to an installshield failure, you can generate a log file for debugging as:
setup.exe /debuglog"C:\PathToLog\setupexe.log"
 
If Couchbase server fails to start, the most likely reason is a missing or incompatible Microsoft runtime library. You can manually run service_start.bat under the bin directory and check what is going on. And you can run cbbrowse_log.bat to generate a log file for further debugging.
Comment by John Zablocki (Inactive) [ 23/Mar/12 ]
This is an installation-only problem. There's not much more to it other than the installer hanging on the screen (see attachment).

However, after a failed install, I did get it to work by:

a) deleting C:\Program Files\Couchbase\*

b) deleting all registry keys with Couchbase Server left over from the failed install

c) rebooting

Next time I see this problem, I'll run it again with the /debuglog

I think the problem might be that a previous install of DP3 or DP4 (nightly build) failed and left some bits in place somewhere.
Comment by Steve Yen [ 05/Apr/12 ]
from Perry...
Comment by Thuan Nguyen [ 05/Apr/12 ]
I can not reproduce this bug. I tested on Windows 7 Professional 64-bit and Windows Server 2008 64-bit.
Here are steps:
- Install couchbase server 2.0.0r-388 (dp3)
- Open web browser and go to initial setup in web console.
- Uninstall couchbase server 2.0.0r-388
- Install couchbase server 2.0.0dp4r-722
- Open web browser and go to initial setup in web console.
Install and uninstall of couchbase server went smoothly without any problem.
Comment by Bin Cui [ 25/Apr/12 ]
Maybe we need to get the installer verbose log file to get some clues.

setup.exe /verbose"c:\temp\logfile.txt"
Comment by John Zablocki (Inactive) [ 06/Jul/12 ]
Not sure if this is useful or not, but without fail, every time I encounter this problem, simply shutting down apps (usually Chrome for some reason) causes the hanging to stop. Right after closing Chrome, the C++ redistributable dialog pops open and installation completes.
Comment by Matt Ingenthron [ 10/Jul/12 ]
Workarounds/troubleshooting for this issue:


On installshield's website, there are similar problems reported for installshield. There are several possible reasons behind it:

1. The installation of the Microsoft C++ redistributable is blocked by some other running program, sometimes Chrome.
2. There are some remote network drives that are mapped to local system. Installshield may not have enough network privileges to access them.
3. Couchbase server was installed on the machine before and it was not totally uninstalled and/or removed. Installshield tried to recover from those old images.

To determine where to go next, run setup with debugging mode enabled:
setup.exe /debuglog"C:\temp\setupexe.log"

The contents of the log will tell you where it's getting stuck.
Comment by Bin Cui [ 30/Jul/12 ]
Matt's explanation should be included in the documentation and on the Q&A website. I reproduced the hanging problem during installation when the Chrome browser is running.
Comment by Farshid Ghods (Inactive) [ 30/Jul/12 ]
So does that mean the installer should wait until chrome and other browsers are terminated before proceeding?

I see this as a very common approach with many installers: they ask the user to stop those applications, and if the user does not follow the instructions the setup process does not continue until these conditions are met.
Comment by Dipti Borkar [ 31/Jul/12 ]
Is there no way to fix this? At the least we need to provide an error or guidance that chrome needs to be quit before continuing. Is chrome the only one we have seen causing this problem?
Comment by Steve Yen [ 13/Sep/12 ]
http://review.couchbase.org/#/c/20552/
Comment by Steve Yen [ 13/Sep/12 ]
See CBD-593
Comment by Øyvind Størkersen [ 17/Dec/12 ]
Same bug when installing 2.0.0 (build-1976) on Windows 7. Stopping Chrome did not help, but killing the process "Logitech ScrollApp" (KhalScroll.exe) did.
Comment by Joseph Lam [ 13/Sep/13 ]
It's happening to me when installing 2.1.1 on Windows 7. What is this step for, and is it really necessary? I see that it happens after the files have been copied to the installation folder. Not entirely sure what it's computing space requirements for.
Comment by MikeOliverAZ [ 16/Nov/13 ]
Same problem on 2.2.0 x86_64. I have tried everything, closing down chrome and torch from Task Manager to ensure no other apps are competing. Tried removing registry entries, but there are so many; my time, please. As is noted above, this doesn't seem to be preventing writing the files under Program Files, so what's it doing? So I cannot install; it now complains it cannot upgrade when I run the installer again.

BS....giving up and going to MongoDB....it installs no sweat.

Comment by Sriram Melkote [ 18/Nov/13 ]
Reopening. Testing on VMs is a problem because they are all clones. We miss many problems like these.
Comment by Sriram Melkote [ 18/Nov/13 ]
Please don't close this bug until we have clear understanding of:

(a) What is the Runtime Library that we're trying to install that conflicts with all these other apps
(b) Why we need it
(c) A prioritized task to someone to remove that dependency on 3.0 release requirements

Until we have these, please do not close the bug.

We should not do any fixes on the lines of checking for known apps that conflict etc, as that is treating the symptom and not fixing the cause.
Comment by Bin Cui [ 18/Nov/13 ]
We install the Windows runtime library because the erlang runtime libraries depend on it. Not just any runtime library, but the one that comes with the erlang distribution package. Without it, or with an incompatible version, erl.exe won't run.

Instead of checking for any particular applications, the current solution is:
Run an erlang test script. If it runs correctly, no runtime library needs to be installed. Otherwise, the installer has to install the runtime library.

Please see CBD-593.
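For illustration, such a check can be as simple as starting erl in non-interactive mode and testing its exit status (a hypothetical sketch, not the installer's actual test script):

erl -noshell -s init stop

If erl.exe starts and exits cleanly, the required runtime library is present; if it fails to launch, the installer needs to install the VC redistributable.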

Comment by Sriram Melkote [ 18/Nov/13 ]
My suggestion is that we not attempt to install MSVCRT ourselves.

Let us check the library we need is present or not prior to starting the install (via appropriate registry keys).

If it is absent, let us direct the user to download and install it and exit.
Comment by Bin Cui [ 18/Nov/13 ]
The approach is not totally right. Even if the msvcrt exists, we still need to install it. Here the key is the exact same msvcrt package that comes with the erlang distribution. We had problems before where, with the same version but a different build of msvcrt installed, erlang won't run.

One possible solution is to ask the user to download the msvcrt library from our website and make it a prerequisite for installing couchbase server.
Comment by Sriram Melkote [ 18/Nov/13 ]
OK. It looks like MS distributes some versions of VC runtime with the OS itself. I doubt that Erlang needs anything newer.

So let us rebuild Erlang and have it link to the OS supplied version of MSVCRT (i.e., msvcr70.dll) in Couchbase 3.0 onwards

In the meanwhile, let us point the user to the vcredist we ship in Couchbase 2.x versions and ask them to install it from there.
Comment by Steve Yen [ 23/Dec/13 ]
Saw this in the email inboxes...

From: Tal V
Date: December 22, 2013 at 1:19:36 AM PST
Subject: Installing Couchbase on Windows 7

Hi CouchBase support,
I would like to get your assist on an issue I’m having. I have a windows 7 machine on which I tried to install Couchbase, the installation is stuck on the “Computing space requirements”.
I tried several things without success:

1. I tried to download a new installation package.
2. I deleted all records of the software from the Registry.
3. I deleted the folder that was created under C:\Program Files\Couchbase.
4. I restarted the computer.
5. Opened only the installation package.
6. Re-installed it again.
And again it was stuck on the same step.
What is the solution for it?

Thank you very much,


--
Tal V
Comment by Steve Yen [ 23/Dec/13 ]
Hi Bin,
Not knowing much about installshield here, but one idea - are there ways of forcibly, perhaps optionally, skipping the computing space requirements step? Some environment variable flag, perhaps?
Thanks,
Steve

Comment by Bin Cui [ 23/Dec/13 ]
This "Computing space requirements" is quite misleading. It happens at the post install step while GUI still shows that message. Within the step, we run the erlang test script and fails and the installer runs "vcredist.exe" for microsoft runtime library which gets stuck.

For the time being, the most reliable way is not to run this vcredist.exe from installer. Instead, we should provide a link in our download web site.

1. During installation, if we fails to run the erlang test script, we can pop up a warning dialog and ask customers to download and run it after installation.
 
Comment by Bin Cui [ 23/Dec/13 ]
To work around the problem, we can instruct the customer to download the vcredist.exe and run it manually before setting up couchbase server. If the runtime environment is set up correctly, the installer will bypass that step.
Comment by Bin Cui [ 30/Dec/13 ]
Use windows registry key to install/skip the vcredist.exe step:

On 32bit windows, Installer will check HKEY_LOCAL_MACHINE\SOFTWARE\Couchbase\SkipVcRuntime
On 64bit windows, Installer will check HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Couchbase\SkipVcRuntime,
where SkipVcRuntime is a DWORD (32-bit) value.

When SkipVcRuntime is set to 1, installer will skip the step to install vcredist.exe. Otherwise, installer will follow the same logic as before.
vcredist_x86.exe can be found in the root directory of couchbase server. It can be run as:
c:\<couchbase_root>\vcredist_x86.exe

http://review.couchbase.org/#/c/31501/
Comment by Bin Cui [ 02/Jan/14 ]
Check into branch 2.5 http://review.couchbase.org/#/c/31558/
Comment by Iryna Mironava [ 22/Jan/14 ]
tested with Win 7 and Win Server 2008
I am unable to reproduce this issue(build 2.0.0-1976, dp3 is no longer available)
Installed/uninstalled couchbase several times
Comment by Sriram Melkote [ 22/Jan/14 ]
Unfortunately, for this problem, if it did not reproduce, we can't say it is fixed. We have to find a machine where it reproduces and then verify a fix.

Anyway, no change made actually addresses the underlying problem (the registry key just gives a way to workaround it when it happens), so reopening the bug and targeting for 3.0
Comment by Sriram Melkote [ 23/Jan/14 ]
Bin - I just noticed that the Erlang installer itself (when downloaded from their website) installs the VC redistributable in non-silent mode. The Microsoft runtime installer dialog pops up, indicates it will install the VC redistributable, and then completes. Why do we run it in silent mode (and hence assume liability for it running properly)? Why do we not run the MSI in interactive mode like the ESL Erlang installer itself does?
Comment by Wayne Siu [ 05/Feb/14 ]
If we could get the information on the exact software version, it could be helpful.
From registry, Computer\HKLM\Software\Microsoft\WindowsNT\CurrentVersion
Comment by Wayne Siu [ 12/Feb/14 ]
Bin, looks like the erl.ini was locked when this issue happened.
Comment by Pavel Paulau [ 19/Feb/14 ]
Just happened to me in 2.2.0-837.
Comment by Anil Kumar [ 18/Mar/14 ]
Triaged by Don and Anil as per Windows Developer plan.
Comment by Bin Cui [ 08/Apr/14 ]
http://review.couchbase.org/#/c/35463/
Comment by Chris Hillery [ 13/May/14 ]
I'm new here, but it seems to me that vcredist_x64.exe does exactly the same thing as the corresponding MS-provided merge module for MSVC2013. If that's true, we should be able to just include that merge module in our project, and not need to fork out to install things. In fact, as of a few weeks ago, the 3.0 server installers are doing just that.

http://msdn.microsoft.com/en-us/library/dn501987.aspx

Is my understanding incomplete in some way?
Comment by Chris Hillery [ 14/May/14 ]
I can confirm that the most recent installers do install msvcr120.dll and msvcp120.dll in apparently the correct places, and the server can start with them. I *believe* this means that we no longer need to fork out vcredist_x64.exe, or have any of the InstallShield tricks to detect whether it is needed and/or skip installing it, etc. I'm leaving this bug open to both verify that the current merge module-based solution works, and to track removal of the unwanted code.
Comment by Sriram Melkote [ 16/May/14 ]
I've also verified that 3.0 build installed VCRT (msvcp100) is sufficient for Erlang R16.
Comment by Bin Cui [ 15/Sep/14 ]
Recently I happened to reproduce this problem on my own laptop. Using setup.exe /verbose"c:\temp\verbose.log", I generated a log file with more verbose debugging information. At the end of the file, it looks something like:

MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: OVERVIEW.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_admin\overview\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: BUCKETS.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_admin\buckets\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: MN_DIALOGS.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_dialogs\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: ABOUT.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_dialogs\about\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: ALLUSERSPROFILE , Object: Q:\
MSI (c) (C4:C0) [10:51:36:274]: PROPERTY CHANGE: Adding INSTALLLEVEL property. Its value is '1'.

It means that the installer tried to populate some property values for the all-users profile after it had copied all data to the install location, even though it still shows this notorious "Computing space requirements" message.

For every installation, the installer uses the user temp directory to populate installer-related data. After I deleted (or renamed) the temp data under c:\Users\<logonuser>\AppData\Temp and rebooted the machine, the problem was solved, at least for my laptop.

Conclusion:

1. After the installer has copied the files, it needs to set the all-users profiles. This action is synchronous: it waits and checks the exit code. And certainly it will hang if this action never returns.

2. This is an issue related to the setup environment, i.e. caused by other running applications, etc.

Suggestion:

1. Stop any other browsers and applications when you install couchbase.
2. Kill the installation process and uninstall the failed setup.
3. Delete/rename the temp location under c:\Users\<logonuser>\AppData\Temp
4. Reboot and try again.

Comment by Bin Cui [ 17/Sep/14 ]
Turns out, it is really about the installation environment, not about a particular installation step.

Suggest documenting the workaround method.
Comment by Don Pinto [ 17/Sep/14 ]
Bin, some installers kill conflicting processes before installation starts so that it can complete. Why can't we do this?

(Maybe using something like this - http://stackoverflow.com/questions/251218/how-to-stop-a-running-process-during-an-msi-based-un-install)

Thanks,
Don




[MB-12208] Security Risk: XDCR logs emit entire Document contents in a error situations Created: 17/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 2.2.0, 2.5.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Major
Reporter: Gokul Krishnan Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Per recent discussions with the CFO and contract teams, we need to ensure that customers' data (document keys and values) isn't emitted in the logs. This poses a security risk, and we need default logging levels that don't emit document data in readable format.

The support team has noticed this in 2.2; we are verifying whether this behavior still exists in 2.5.1.

Example posted in a private comment below

 Comments   
Comment by Patrick Varley [ 18/Sep/14 ]
At the same time we need the ability to increase the log level on the fly and include this information, for when we hit a wall and need that extra detail.

To summarise:

Default setting: do not expose customer data.

Allow increasing logging on the fly to a level that might include customer data, which the support team will explain to the end-user.




[MB-12126] there is no manifest file on windows 3.0.1-1253 Created: 03/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 r2 64-bit

Attachments: PNG File ss 2014-09-03 at 12.05.41 PM.png    
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.1-1253 on windows server 2008 r2 64-bit. There is no manifest file in the directory c:\Program Files\Couchbase\Server\



 Comments   
Comment by Chris Hillery [ 03/Sep/14 ]
Also true for 3.0 RC2 build 1205.
Comment by Chris Hillery [ 03/Sep/14 ]
(Side note: While fixing this, log onto build slaves and delete stale "server-overlay/licenses.tgz" file so we stop shipping that)
Comment by Anil Kumar [ 17/Sep/14 ]
Ceej - Any update on this?
Comment by Chris Hillery [ 18/Sep/14 ]
No, not yet.




[MB-9897] Implement upr cursor dropping Created: 13/Jan/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Major
Reporter: Mike Wiederhold Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Chiyoung Seo [ 17/Sep/14 ]
This requires some significant changes in DCP and checkpointing in ep-engine. Moving this to post-3.0.1.




[MB-12201] Hotfix Rollup Release Created: 16/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Cihan Biyikoglu Assignee: Raju Suravarjjala
Resolution: Unresolved Votes: 0
Labels: hotfix
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
Representing the rollup hotfix for 2.5.1 that includes all hotfixes (without the V8 change) released to date (Sept 2014).

 Comments   
Comment by Dipti Borkar [ 16/Sep/14 ]
Is this rollup still 2.5.1? It will create lots of confusion. Can we tag it 2.5.2? Or does that lead to another round of testing? There are way too many hotfixes, so we really need a new dot release.
Comment by Cihan Biyikoglu [ 17/Sep/14 ]
Hi Dipti, to improve hotfix management, we are changing the way we'll do hotfixes. The rollup will bring in more hotfixes together and ensure we provide customers all the fixes we know about. If we had already fixed an issue at the time you requested your hotfix, there is no reason why we should risk exposing you to known and already-fixed issues in the version you are using. A side effect of this should also be an easier life for support.
-cihan




[MB-12084] Create 3.0.0 chef-based rightscale template for EE and CE Created: 27/Aug/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: cloud
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Anil Kumar Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Need this before 3.0 GA




[MB-12083] Create 3.0.0 legacy rightscale templates for Enterprise and Community Edition (non-chef) Created: 27/Aug/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: cloud
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Anil Kumar Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We need this before 3.0 GA




[MB-10789] Bloom Filter based optimization to reduce the I/O overhead Created: 07/Apr/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
A Bloom filter can be considered an optimization to reduce the disk I/O overhead. Basically, we maintain a separate Bloom filter per vbucket database file, and rebuild the Bloom filter (e.g., increasing the filter size to reduce the false positive error rate) as part of vbucket database compaction.

As we know the number of items in a vBucket database file, we can determine the number of hash functions and the size of the Bloom filter needed to achieve the desired false positive error rate. Note that Murmur hash has been widely used in Hadoop and Cassandra because it is much faster than MD5 and Jenkins. It is widely known that fewer than 10 bits per element are required for a 1% false positive probability, independent of the number of elements in the set.

We expect that having a Bloom filter will enhance both XDCR and full-ejection cache management performance at the expense of the filter's memory overhead.
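As a rough illustration of the sizing math above (not Couchbase code), here is a minimal Python sketch that computes the filter size and hash-function count for a given item count and target false positive rate, using the standard formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2:

import math

def bloom_filter_params(num_items, false_positive_rate):
    # Standard Bloom filter sizing:
    #   m = -n * ln(p) / (ln 2)^2   -> filter size in bits
    #   k = (m / n) * ln 2          -> number of hash functions
    bits = math.ceil(-num_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round((bits / num_items) * math.log(2)))
    return bits, hashes

# Example: a vbucket file holding 100,000 items with a 1% target error rate
bits, hashes = bloom_filter_params(100000, 0.01)
print(bits / 100000.0, hashes)   # roughly 9.6 bits per item and 7 hash functions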



 Comments   
Comment by Abhinav Dangeti [ 17/Sep/14 ]
Design Document:
https://docs.google.com/document/d/13ryBkiLltJDry1WZV3UHttFhYkwwWsmyE1TJ_6tKddQ




[MB-11999] Resident ratio of active items drops from 3% to 0.06% during rebalance with delta recovery Created: 18/Aug/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1169

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File vb_active_resident_items_ratio.png     PNG File vb_replica_resident_items_ratio.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares-dev/45/artifact/
Is this a Regression?: No

 Description   
1 of 4 nodes is being re-added after failover.
500M x 2KB items, 10K mixed ops/sec.

Steps:
1. Failover one of nodes.
2. Add it back.
3. Enable delta recovery.
4. Sleep 20 minutes.
5. Rebalance cluster.

Most importantly, this happens due to excessive memory usage.

 Comments   
Comment by Abhinav Dangeti [ 17/Sep/14 ]
http://review.couchbase.org/#/c/41468/




[MB-12054] [windows] [2.5.1] cluster hang when flush beer-sample bucket Created: 22/Aug/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Abhinav Dangeti
Resolution: Cannot Reproduce Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows server 2008 R2

Attachments: Zip Archive 172.23.107.124-8222014-1546-diag.zip     Zip Archive 172.23.107.125-8222014-1547-diag.zip     Zip Archive 172.23.107.126-8222014-1548-diag.zip     Zip Archive 172.23.107.127-8222014-1549-diag.zip    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Install Couchbase Server 2.5.1 on 4 Windows Server 2008 R2 64-bit nodes.
Create a cluster of 4 nodes.
Create the beer-sample bucket.
Enable flush in the bucket settings.
Flush the beer-sample bucket. The cluster becomes hung.

 Comments   
Comment by Abhinav Dangeti [ 11/Sep/14 ]
I wasn't able to reproduce this issue with a 2.5.1 build with 2 nodes.

From your logs on one of the nodes I see some couchNotifier logs, where we are waiting for mcCouch:
..
Fri Aug 22 14:00:03.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
Fri Aug 22 14:21:53.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
Fri Aug 22 14:43:43.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
Fri Aug 22 15:05:33.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
...

This won't be a problem in 3.0.1, as mcCouch has been removed. Please re-open if you see this issue in your testing again.




[MB-11426] API for compact-in-place operation Created: 13/Jun/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Jens Alfke Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
It would be convenient to have an explicit API for compacting the database in place, i.e. to the same file. This is what auto-compact does, but if auto-compact isn't enabled, or if the caller wants to run it immediately instead of on a schedule, then the caller has to use fdb_compact, which compacts to a separate file.

I assume the workaround is to compact to a temporary file, then replace the original file with the temporary. But this is several more steps. Since forestdb already contains the logic to compact in place, it'd be convenient if calling fdb_compact(handle, NULL) would do that.

 Comments   
Comment by Chiyoung Seo [ 10/Sep/14 ]
The change is in gerrit for review:

http://review.couchbase.org/#/c/41337/
Comment by Jens Alfke [ 10/Sep/14 ]
The notes on Gerrit say "a new file name will be automatically created by appending a file revision number to the original file name. …. Note that this new compacted file can be still opened by using the original file name"

I don't understand what's going on here — after the compaction is complete, does the old file still exist or am I responsible for deleting it? When does the file get renamed back to the original filename, or does it ever? Should my code ignore the fact that the file is now named "test.fdb.173" and always open it as "test.fdb"?
Comment by Chiyoung Seo [ 10/Sep/14 ]
>I don't understand what's going on here — after the compaction is complete, does the old file still exist or am I responsible for deleting it?

The old file is automatically removed by ForestDB after the compaction is completed.

>When does the file get renamed back to the original filename, or does it ever?

The file won't be renamed to the original name in the current implementation. But, I will adapt the current implementation so that when the file is closed and its ref counter becomes zero, the file can be renamed to its original name.

>Should my code ignore the fact that the file is now named "test.fdb.173" and always open it as "test.fdb"?

Yes, you can still open "test.fdb.173" by passing the "test.fdb" file name.

Note that renaming it to the original file name right after finishing the compaction becomes complicated, as other threads might still traverse the old file's blocks (through the buffer cache or OS page cache).

Comment by Chiyoung Seo [ 11/Sep/14 ]
I incorporated those answers into the commit message and API header file. Let me know if you have any suggestions / concerns.
Comment by Chiyoung Seo [ 12/Sep/14 ]
The change was merged into the master branch.




[MB-12082] Marketplace AMI - Enterprise Edition and Community Edition - provide AMI id to PM Created: 27/Aug/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: cloud
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Anil Kumar Assignee: Wei-Li Liu
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Need AMIs before 3.0.0 GA

 Comments   
Comment by Wei-Li Liu [ 17/Sep/14 ]
3.0.0 EE AMI: ami-283a9440 Snapshots: snap-308fc192
3.0.0 CE AMI: ami-3237995a




[MB-12186] If flush can not be completed because of a timeout, we should not display a message "Failed to flush bucket" when it's still in progress Created: 15/Sep/14  Updated: 17/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Andrei Baranouski Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1208

Attachments: PNG File MB-12186.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
When I tried to flush a heavily loaded cluster I received a "Failed To Flush Bucket" popup. In fact it did not fail; it simply had not completed within the set period of time (30 sec).

Expected behaviour: a message like "flush is not yet complete, but will continue..."

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
A timeout is a timeout. We can say "it timed out", but we cannot be sure whether it is continuing or not.
Comment by Andrei Baranouski [ 15/Sep/14 ]
Hm, we also get a timeout when bucket removal takes too long, but in that case we inform the user that the removal is still in progress, right?
Comment by Aleksey Kondratenko [ 17/Sep/14 ]
You're right. I don't think we're entirely precise in the bucket deletion timeout message either. It's one of our mid-term goals to do better with these longer-running ops and how their progress or results are exposed to the user. I see little value in tweaking messages; instead we'll just make this entire thing work "right".




[MB-12202] UI shows a cbrestore as XDCR ops Created: 17/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: [info] OS Name : Linux 3.13.0-30-generic
[info] OS Version : Ubuntu 14.04 LTS
[info] CB Version : 2.5.1-1083-rel-enterprise

Attachments: PNG File cbrestoreXDCRops.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I noticed while doing a cbrestore of a backup on a cluster that doesn't have any XDCR configured that the stats in the UI showed ongoing ops for XDCR. (screenshot attached)

The stats code at
http://src.couchbase.org/source/xref/2.5.1/ns_server/src/stats_collector.erl#334 counts all set-with-meta operations as XDCR ops.

 Comments   
Comment by Aleksey Kondratenko [ 17/Sep/14 ]
That's the way it is. We have no way to distinguish sources of set-with-metas.




[MB-12189] (misunderstanding) XDCR REST API "max-concurrency" only works for 1 of 3 documented end-points. Created: 15/Sep/14  Updated: 17/Sep/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server, RESTful-APIs
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Jim Walker Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: supportability, xdcr
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Couchbase Server 2.5.1
RHEL 6.4
VM (VirtualBox0
1 node "cluster"

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
This defect relates to the following REST APIs:

* xdcrMaxConcurrentReps (default 32) http://localhost:8091/internalSettings/
* maxConcurrentReps (default 32) http://localhost:8091/settings/replications/
* maxConcurrentReps (default 32) http://localhost:8091/settings/replications/ <replication_id>

The documentation suggests these all do the same thing, but with the scope of change being different.

<docs>
/settings/replications/ — global settings applied to all replications for a cluster
settings/replications/<replication_id> — settings for specific replication for a bucket
/internalSettings - settings applied to all replications for a cluster. Endpoint exists in Couchbase 2.0 and onward.
</docs>

This defect is because only "settings/replications/<replication_id>" has any effect. The other REST endpoints have no effect.

Out of these APIs I can confirm that changing "/settings/replications/<replication_id>" has an effect. The XDCR code shows that the concurrent reps setting feeds into the concurrency throttle as the number of available tokens. I use the xdcr log files, where we print the concurrency throttle token data, to observe that the setting has an effect.

For example, a cluster in the default configuration has a total tokens of 32. We can grep to see this.

[root@localhost logs]# grep "is done normally, total tokens:" xdcr.*
2014-09-15T13:09:03.886,ns_1@127.0.0.1:<0.32370.0>:concurrency_throttle:clean_concurr_throttle_state:275]rep <0.33.1> to node "192.168.69.102:8092" is done normally, total tokens: 32, available tokens: 32,(active reps: 0, waiting reps: 0)

Now, changing the setting to 42, the log file shows the change taking effect.

curl -u Administrator:password http://localhost:8091/settings/replications/01d38792865ba2d624edb4b2ad2bf07f%2fdefault%2fdefault -d maxConcurrentReps=42

[root@localhost logs]# grep "is done normally, total tokens:" xdcr.*
dcr.1:[xdcr:debug,2014-09-15T13:17:41.112,ns_1@127.0.0.1:<0.32370.0>:concurrency_throttle:clean_concurr_throttle_state:275]rep <0.2321.1> to node "192.168.69.102:8092" is done normally, total tokens: 42, available tokens: 42,(active reps: 0, waiting reps: 0)

Since this defect is that the other two REST endpoints don't appear to have any effect, here's an example changing "/settings/replications/". This example was on a clean cluster, i.e. no other settings had been changed; only bucket and replication creation plus client writes had been performed.

root@localhost logs]# curl -u Administrator:password http://localhost:8091/settings/replications/ -d maxConcurrentReps=48
{"maxConcurrentReps":48,"checkpointInterval":1800,"docBatchSizeKb":2048,"failureRestartInterval":30,"workerBatchSize":500,"connectionTimeout":180,"workerProcesses":4,"httpConnections":20,"retriesPerRequest":2,"optimisticReplicationThreshold":256,"socketOptions":{"keepalive":true,"nodelay":false},"supervisorMaxR":25,"supervisorMaxT":5,"traceDumpInvprob":1000}

Above shows that the JSON has acknowledged the value of 48 but the log files show no change. After much waiting and re-checking grep shows no evidence.

[root@localhost logs]# grep "is done normally, total tokens:" xdcr.* | grep "total tokens: 48" | wc -l
0
[root@localhost logs]# grep "is done normally, total tokens:" xdcr.* | grep "total tokens: 32" | wc -l
7713

The same was observed for /internalSettings/

Found on both 2.5.1 and 3.0.
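For reference, here is a minimal Python sketch (standard library only) of the same checks done with curl above. The cluster address, credentials, and replication id are placeholders taken from the examples in this description, and the endpoints are the ones listed above.

import base64
import urllib.parse
import urllib.request

BASE = "http://localhost:8091"                    # placeholder cluster address
AUTH = b"Administrator:password"                  # placeholder credentials
REP_ID = "01d38792865ba2d624edb4b2ad2bf07f/default/default"   # replication id from the example above

def post_settings(path, **params):
    # POST form-encoded settings to an ns_server REST endpoint and return the response body.
    req = urllib.request.Request(
        BASE + path,
        data=urllib.parse.urlencode(params).encode(),
        headers={"Authorization": "Basic " + base64.b64encode(AUTH).decode()},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# Global default: only picked up by replications without per-replication settings.
print(post_settings("/settings/replications/", maxConcurrentReps=48))

# Per-replication setting: the replication id itself must be URL-encoded.
print(post_settings("/settings/replications/" + urllib.parse.quote(REP_ID, safe=""),
                    maxConcurrentReps=42))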

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
This is because global settings only affect new replications or replications without per-replication settings defined. The UI always defines all per-replication settings.
Comment by Jim Walker [ 16/Sep/14 ]
Have you pushed a documentation update for this?
Comment by Aleksey Kondratenko [ 16/Sep/14 ]
No. I don't own docs.
Comment by Jim Walker [ 17/Sep/14 ]
Then this issue is not resolved.

Closing/resolving this defect with breadcrumbs to the opening of an issue on a different project would suffice as a satisfactory resolution.

You can also very easily put a pull request into docs on github with the correct behaviour.

Can you please perform *one* of those tasks so that the REST API here is correctly documented with the behaviours you are aware of and this matter can be closed.
Comment by Jim Walker [ 17/Sep/14 ]
Resolution requires either:

* Corrected documentation pushed to documentation repository.
* Enough accurate API information placed into a documentation defect so docs-team can correct.





[MB-11917] One node slow probably due to the Erlang scheduler Created: 09/Aug/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Volker Mische Assignee: Harsha Havanur
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File crash_toy_701.rtf     PNG File leto_ssd_300-1105_561_build_init_indexleto_ssd_300-1105_561172.23.100.31beam.smp_cpu.png    
Issue Links:
Duplicate
duplicates MB-12200 Seg fault during indexing on view-toy... Resolved
duplicates MB-9822 One of nodes is too slow during indexing Closed
is duplicated by MB-12183 View Query Thruput regression compare... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
One node is slow, that's probably due to the "scheduler collapse" bug in the Erlang VM R16.

I will try to find a way to verify that it is really the scheduler and not some other problem. This is basically a duplicate of MB-9822, but that bug has a long history, hence I dared to create a new one.

 Comments   
Comment by Volker Mische [ 09/Aug/14 ]
I forgot to add that our issue sounds exactly like that one: http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
Comment by Sriram Melkote [ 11/Aug/14 ]
Upgrading to blocker as this is doubling initial index time in recent runs on showfast.
Comment by Volker Mische [ 12/Aug/14 ]
I verified that it's the "scheduler collapse". Have a look at the chart I've attached (it's from [1], [172.23.100.31] beam.smp_cpu). It starts with a utilization of around 400%; at around 120 I reduced the online schedulers to 1 (by running erlang:system_flag(schedulers_online, 1) via a remote shell). I then increased schedulers_online again at around 150 to the original value of 24. You can see that it got back to normal.

[1]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1105_561_build_init_index
Comment by Volker Mische [ 12/Aug/14 ]
I would try to run on R16 and see how often it happens with COUCHBASE_NS_SERVER_VM_EXTRA_ARGS=["+swt", "low", "+sfwi", "100"] set (as suggested in MB-9822 [1]).

[1]: https://www.couchbase.com/issues/browse/MB-9822?focusedCommentId=89219&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-89219
Comment by Pavel Paulau [ 12/Aug/14 ]
We agreed to try:

+sfwi 100/500 and +sbwt long

Will run test 5 times with these options.
Comment by Pavel Paulau [ 13/Aug/14 ]
5 runs of tests/index_50M_dgm.test with -sfwi 100 -sbwt long:

http://ci.sc.couchbase.com/job/leto-dev/19/
http://ci.sc.couchbase.com/job/leto-dev/20/
http://ci.sc.couchbase.com/job/leto-dev/21/
http://ci.sc.couchbase.com/job/leto-dev/22/
http://ci.sc.couchbase.com/job/leto-dev/23/

3 normal runs, 2 with slowness.
Comment by Volker Mische [ 13/Aug/14 ]
I see only one slow run (22): http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1137_6a0_build_init_index

But still :-/
Comment by Pavel Paulau [ 13/Aug/14 ]
See (20), incremental indexing: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1137_ed9_build_incr_index
Comment by Volker Mische [ 13/Aug/14 ]
Oh, I was only looking at the initial building.
Comment by Volker Mische [ 13/Aug/14 ]
I got a hint in the #erlang IRC channel. I'll try to use the erlang:bump_reductions(2000) and see if that helps.
Comment by Volker Mische [ 13/Aug/14 ]
Let's see if bumping the reductions makes it work: http://review.couchbase.org/40591
Comment by Aleksey Kondratenko [ 13/Aug/14 ]
merged that commit.
Comment by Pavel Paulau [ 13/Aug/14 ]
Just tested build 3.0.0-1150, rebalance test but with initial indexing phase.

2 nodes are super slow and utilize only single core.
Comment by Volker Mische [ 18/Aug/14 ]
I can't reproduce it locally. I tend towards closing this issue as "won't fix". We should really not have long-running NIFs.

I also think that it won't happen much under real workloads. And even if it does, the workaround would be to reduce the number of online schedulers to 1 and then immediately increase it back to the original number.
Comment by Volker Mische [ 18/Aug/14 ]
Assigning to Siri to make the call on whether we close it or not.
Comment by Anil Kumar [ 18/Aug/14 ]
Triage - Not blocking 3.0 RC1
Comment by Raju Suravarjjala [ 19/Aug/14 ]
Triage: Siri will put additional information and this bug is being retargeted to 3.0.1
Comment by Sriram Melkote [ 19/Aug/14 ]
Folks, for too long we've had trouble that gets pinned to our NIFs. In 3.5, let's solve this with whatever is the correct Erlang approach to running heavy, high-performance code: a port, reporting reductions, moving to R17 with dirty schedulers, or some other option I missed. Whatever the best solution is, let us implement it in 3.5 and be done.
Comment by Volker Mische [ 09/Sep/14 ]
I think we should close this issue and rather create a new one for whatever we come up with (e.g. the async mapreduce NIF).
Comment by Harsha Havanur [ 10/Sep/14 ]
Toy Build for this change at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_ubunt12-3.0.0-toy-hhs-x86_64_3.0.0-702-toy.deb

Review in progress at
http://review.couchbase.org/#/c/41221/4
Comment by Harsha Havanur [ 12/Sep/14 ]
Please find the updated toy build for this:
http://latestbuilds.hq.couchbase.com/couchbase-server-community_ubunt12-3.0.0-toy-hhs-x86_64_3.0.0-704-toy.deb
Comment by Sriram Melkote [ 12/Sep/14 ]
Another occurrence of this, MB-12183.

I'm making this a blocker.
Comment by Harsha Havanur [ 13/Sep/14 ]
Centos build at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-hhs-x86_64_3.0.0-700-toy.rpm
Comment by Ketaki Gangal [ 16/Sep/14 ]
Filed bug MB-12200 for this toy-build
Comment by Ketaki Gangal [ 17/Sep/14 ]
Attaching stack from toy build 701: crash_toy_701.rtf

Access to machine is as mentioned previously on MB-12200.




[MB-11060] Build and test 3.0 for 32-bit Windows Created: 06/May/14  Updated: 17/Sep/14  Due: 09/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build, ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Blocker
Reporter: Chris Hillery Assignee: Phil Labee
Resolution: Unresolved Votes: 0
Labels: windows-3.0-beta, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7/8 32-bit

Issue Links:
Dependency
Duplicate

 Description   
For the "Developer Edition" of Couchbase Server 3.0 on Windows 32-bit, we need to first ensure that we can build 32-bit-compatible binaries. It is not possible to build 3.0 on a 32-bit machine due to the MSVC 2013 requirement. Hence we need to configure MSVC as well as Erlang on a 64-bit machine to produce 32-bit compatible binaries.

 Comments   
Comment by Chris Hillery [ 06/May/14 ]
This is assigned to Trond who is already experimenting with this. He should:

 * test being able to start the server on a 32-bit Windows 7/8 VM

 * make whatever changes are necessary to the CMake configuration or other build scripts to produce this build on a 64-bit VM

 * thoroughly document the requirements for the build team to reproduce this build

Then he can assign this bug to Chris to carry out configuring our build jobs accordingly.
Comment by Trond Norbye [ 16/Jun/14 ]
Can you give me a 32-bit Windows installation I can test on? My MSDN license has expired and I don't have Windows media available (and the internal wiki page just has a limited set of licenses and no download links).

Then assign it back to me and I'll try it.
Comment by Chris Hillery [ 16/Jun/14 ]
I think you can use 172.23.106.184 - it's a 32-bit Windows 2008 VM that we can't use for 3.0 builds anyway.
Comment by Trond Norbye [ 24/Jun/14 ]
I copied the full result of a build where I set target_platform=x86 on my 64-bit Windows server (the "install" directory) over to a 32-bit Windows machine and was able to start memcached; it worked as expected.

Our installers do other magic, like installing the service, that is needed in order to start the full server. Once we have such an installer I can do further testing.
Comment by Chris Hillery [ 24/Jun/14 ]
Bin - could you take a look at this (figuring out how to make InstallShield on a 64-bit machine create a 32-bit compatible installer)? I won't likely be able to get to it for at least a month, and I think you're the only person here who still has access to an InstallShield 2010 designer anyway.
Comment by Bin Cui [ 04/Sep/14 ]
PM should make the call on whether or not we want to have 32-bit support for Windows.
Comment by Anil Kumar [ 05/Sep/14 ]
Bin - As confirmed back in March-April for the Couchbase Server 3.0 supported platforms, we decided to continue to build 32-bit Windows for development-only support, as mentioned in our documentation deprecation page http://docs.couchbase.com/couchbase-manual-2.5/deprecated/#platforms.

Comment by Bin Cui [ 17/Sep/14 ]
1. Create a 64-bit builder with a 32-bit target.
2. Create a 32-bit builder.
3. Transfer the 64-bit staging image to the 32-bit builder.
4. Run the packaging steps and generate the final package from the 32-bit builder.




[MB-11084] Build python snappy module on windows Created: 09/May/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Task Priority: Minor
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows


 Description   
To deal with the compressed datatype, we need Python support for snappy compression. We need to build https://github.com/andrix/python-snappy on Windows and make it part of the package.
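Once the module is built and shipped, a quick sanity check could look like the minimal Python sketch below (assuming the package imports as snappy, as python-snappy does on other platforms):

import snappy  # provided by the python-snappy package

original = b"compressed datatype payload " * 64

# Round-trip the payload through snappy to confirm the module works on this platform.
compressed = snappy.compress(original)
assert snappy.uncompress(compressed) == original
print("ok: %d -> %d bytes" % (len(original), len(compressed)))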

 Comments   
Comment by Bin Cui [ 09/May/14 ]
I implemented the related logic for CentOS 5.x, 6.x and Ubuntu. Please look at http://review.couchbase.org/#/c/36902/
Comment by Trond Norbye [ 16/Jun/14 ]
I've updated the Windows build depot with the modules built for Python 2.7.6.

Please populate the depot to the builder and reassign the bug to Bin for verification.
Comment by Chris Hillery [ 13/Aug/14 ]
Depot was updated yesterday, so pysnappy is expanded into the install directory before the Couchbase build is started. I'm not sure what needs to be done to then use this package; passing off to Bin.
Comment by Don Pinto [ 03/Sep/14 ]
Question: Given that the compressed datatype is not in 3.0, is this still a requirement?

Thanks,




[MB-8508] installer - windows packages should be signed Created: 26/Nov/12  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0, 2.1.0, 2.2.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Critical
Reporter: Steve Yen Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-5577 print out Couchbase in the warning sc... Open
relates to MB-9165 Windows 8 Smartscreen blocks Couchbas... Resolved

 Description   
see also: http://www.couchbase.com/issues/browse/MB-7250
see also: http://www.couchbase.com/issues/browse/MB-49


 Comments   
Comment by Steve Yen [ 10/Dec/12 ]
Part of the challenge here would be figuring out the key-ownership process. Perhaps PMs should go create, register and own the signing keys/certs.
Comment by Steve Yen [ 31/Jan/13 ]
Reassigning as I think Phil has been tracking down the keys to the company.
Comment by Phil Labee [ 01/May/13 ]
Need more information:

Why do we need to sign the Windows app?
What problems are we addressing?
Do you want to release through the Windows Store?
What versions of Windows do we need to support?
Comment by Phil Labee [ 01/May/13 ]
need to know what problem we're trying to solve
Comment by Wayne Siu [ 06/Sep/13 ]
No security warning box is the objective.
Comment by Wayne Siu [ 20/Jun/14 ]
Anil,
I assume this is out of 3.0. Please update if it's not.
Comment by Anil Kumar [ 20/Jun/14 ]
We should still consider it for 3.0; if there is no time to fix it, then it's a candidate for punting.
Comment by Wayne Siu [ 30/Jul/14 ]
Moving it out of 3.0.
Comment by Anil Kumar [ 17/Sep/14 ]
We need this in the Windows 3.0 GA timeframe.




[MB-9825] Rebalance exited with reason bad_replicas Created: 06/Jan/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Venu Uppalapati
Resolution: Unresolved Votes: 0
Labels: performance, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.5.0 enterprise edition (build-1015)

Platform = Physical
OS = Windows Server 2012
CPU = Intel Xeon E5-2630
Memory = 64 GB
Disk = 2 x HDD

Triage: Triaged
Operating System: Windows 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/zeus-64/564/artifact/

 Description   
Rebalance-out, 4 -> 3, 1 bucket x 50M x 2KB, DGM, 1 x 1 views

Bad replicators after rebalance:
Missing = [{'ns_1@172.23.96.27','ns_1@172.23.96.26',597},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',598},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',599},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',600},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',601},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',602},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',603},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',604},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',605},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',606},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',607},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',608},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',609},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',610},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',611},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',612},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',613},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',614},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',615},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',616},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',617},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',618},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',619},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',620},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',621},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',622},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',623},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',624},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',625},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',626},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',627},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',628},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',629},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',630},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',631},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',632},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',633},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',634},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',635},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',636},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',637},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',638},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',639},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',640},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',641},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',642},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',643},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',644},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',645},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',646},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',647},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',648},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',649},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',650},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',651},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',652},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',653},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',654},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',655},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',656},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',657},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',658},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',659},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',660},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',661},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',662},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',663},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',664},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',665},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',666},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',667},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',668},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',669},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',670},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',671},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',672},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',673},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',674},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',675},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',676},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',677},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',678},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',679},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',680},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',681}]
Extras = []

 Comments   
Comment by Aleksey Kondratenko [ 06/Jan/14 ]
Looks like producer node simply closed socket.

Most likely duplicate of old issue where both socket sides suddenly see connection as closed.

Relevant log messages:

[error_logger:info,2014-01-06T10:30:00.231,ns_1@172.23.96.26:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
          supervisor: {local,'ns_vbm_new_sup-bucket-1'}
             started: [{pid,<0.1169.0>},
                       {name,
                           {new_child_id,
                               [597,598,599,600,601,602,603,604,605,606,607,
                                608,609,610,611,612,613,614,615,616,617,618,
                                619,620,621,622,623,624,625,626,627,628,629,
                                630,631,632,633,634,635,636,637,638,639,640,
                                641,642,643,644,645,646,647,648,649,650,651,
                                652,653,654,655,656,657,658,659,660,661,662,
                                663,664,665,666,667,668,669,670,671,672,673,
                                674,675,676,677,678,679,680,681],
                               'ns_1@172.23.96.27'}},
                       {mfargs,
                           {ebucketmigrator_srv,start_link,
                               [{"172.23.96.27",11209},
                                {"172.23.96.26",11209},
                                [{on_not_ready_vbuckets,
                                     #Fun<tap_replication_manager.2.133536719>},
                                 {username,"bucket-1"},
                                 {password,get_from_config},
                                 {vbuckets,
                                     [597,598,599,600,601,602,603,604,605,606,
                                      607,608,609,610,611,612,613,614,615,616,
                                      617,618,619,620,621,622,623,624,625,626,
                                      627,628,629,630,631,632,633,634,635,636,
                                      637,638,639,640,641,642,643,644,645,646,
                                      647,648,649,650,651,652,653,654,655,656,
                                      657,658,659,660,661,662,663,664,665,666,
                                      667,668,669,670,671,672,673,674,675,676,
                                      677,678,679,680,681]},
                                 {set_to_pending_state,false},
                                 {takeover,false},
                                 {suffix,"ns_1@172.23.96.26"}]]}},
                       {restart_type,temporary},
                       {shutdown,60000},
                       {child_type,worker}]



[rebalance:debug,2014-01-06T12:12:33.870,ns_1@172.23.96.26:<0.1169.0>:ebucketmigrator_srv:terminate:737]Dying with reason: normal

Mon Jan 06 12:12:44.371917 Pacific Standard Time 3: (bucket-1) TAP (Producer) eq_tapq:replication_ns_1@172.23.96.26 - disconnected, keep alive for 300 seconds
Comment by Maria McDuff (Inactive) [ 10/Jan/14 ]
Looks like a dupe of the memcached connection issue.
Will close this as a dupe.
Comment by Wayne Siu [ 15/Jan/14 ]
Chiyoung to add more debug logging to 2.5.1.
Comment by Chiyoung Seo [ 17/Jan/14 ]
I added more warning-level logs for disconnection events in the memcached layer. We will continue to investigate this issue for 2.5.1 or 3.0 release.

http://review.couchbase.org/#/c/32567/

merged.
Comment by Cihan Biyikoglu [ 08/Apr/14 ]
Given we have more verbose logging, can we reproduce the issue again and see if we can get a better idea on where the problem is?
thanks
Comment by Pavel Paulau [ 08/Apr/14 ]
This issue happened only on Windows so far.
I wasn't able to reproduce it in 2.5.1 and obviously we haven't tested 3.0 yet.
Comment by Cihan Biyikoglu [ 25/Jun/14 ]
Pavel, do you have the repro with the detailed logs now? If yes, could we assign it to a dev for fixing?
Comment by Pavel Paulau [ 25/Jun/14 ]
This is a Windows-specific bug. We are not testing Windows yet.
Comment by Pavel Paulau [ 27/Jun/14 ]
Just FYI.

I have finally tried a Windows build. It's absolutely unstable and not ready for performance testing yet.
Please don't expect news any time soon.




[MB-9874] [Windows] Couchstore drop and reopen of file handle fails Created: 09/Jan/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: windows, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows


 Description   
The unit test doing couchstore_drop_file and couchstore_reopen_file fails due to COUCHSTORE_READ_ERROR when it tries to reopen the file.

The commit http://review.couchbase.org/#/c/31767/ disabled the test to allow the rest of the unit tests to be executed.

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th




[MB-9635] Audit logs for Admin actions Created: 22/Nov/13  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Don Pinto
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Duplicate

 Description   
Couchbase Server should be able to produce audit logs for all admin actions, such as login/logout events, significant events (rebalance, failover, etc.), and so on.



 Comments   
Comment by Matt Ingenthron [ 13/Mar/14 ]
Note there isn't exactly a "login/logout" event. This is mostly by design. A feature like this could be added, but there may be better ways to achieve the underlying requirement. One suggestion would be to log initial activities instead of every activity and have a 'cache' for having seen that user agent within a particular window. That would probably meet most auditing requirements and is, I think, relatively straightforward to implement.
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
We have access.log implemented now, but it's not exactly the same as a full-blown audit. In particular, we do log in access.log that a certain POST was handled, but we do not log any parameters of that action. So it doesn't count as a fully-featured audit log, I think.
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
The ns_server access.log and ep-engine's access.log do not conflict, as they are necessarily in different directories.
Comment by Perry Krug [ 06/Jun/14 ]
They may not conflict in terms of unique names in the same directory, but to our customers it may be a little bit too close to remember which access.log does what...
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
Ok. Any specific proposals ?
Comment by Perry Krug [ 06/Jun/14 ]
Yes, as mentioned above, login.log would be one proposal but I'm not tied to it.
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
access.log has very little to do with logins. It's a full-blown equivalent of Apache's access.log.
Comment by Perry Krug [ 06/Jun/14 ]
Oh sorry, I misread this specific section.

How about audit.log? I know it's not fully "audit" but I'm just trying to avoid the name clash in our customer's minds...
Comment by Anil Kumar [ 09/Jun/14 ]
Agreed, we should rename this file to audit.log to avoid any confusion. Updating MB-10020 to make that change.
Comment by Larry Liu [ 10/Jun/14 ]
Hi, Anil

Does this feature satisfy PCI compliance?

Larry
Comment by Cihan Biyikoglu [ 11/Jun/14 ]
Hi Larry, PCI is a comprehensive set of requirements that go beyond database features. This does help with some parts of PCI, but PCI compliance involves many additional controls, most of which can be handled at the operational level or at the app level.
thanks




[MB-12200] Seg fault during indexing on view-toy build testing Created: 16/Sep/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ketaki Gangal Assignee: Harsha Havanur
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: -3.0.0-700-hhs-toy
-Cen 64 Machines
- 7 Node cluster, 2 Buckets, 2 Views

Attachments: Zip Archive 10.6.2.168-9162014-106-diag.zip     Zip Archive 10.6.2.187-9162014-1010-diag.zip     File crash_beam.smp.rtf     File crash_toybuild.rtf    
Issue Links:
Duplicate
is duplicated by MB-11917 One node slow probably due to the Erl... Open
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
1. Load 70M and 100M items onto the two buckets.
2. Wait for initial indexing to complete.
3. Start updates on the cluster: 1K gets, 7K sets across the cluster.

Seeing numerous cores from beam.smp.

Stack is attached.

Adding logs from the nodes.


 Comments   
Comment by Sriram Melkote [ 16/Sep/14 ]
Harsha, this appears to clearly be a NIF-related regression. We need to discuss why our own testing didn't find this after you figure out the problem.
Comment by Volker Mische [ 16/Sep/14 ]
Siri, I haven't checked if it's the same issue, but the current patch doesn't pass our unit tests. See my comment at http://review.couchbase.org/41221
Comment by Ketaki Gangal [ 16/Sep/14 ]
Logs https://s3.amazonaws.com/bugdb/bug-12200/bug_12200.tar
Comment by Harsha Havanur [ 17/Sep/14 ]
The issue Volker mentioned is one of queue size. I suspect that if a context stays in the queue beyond 5 seconds, the terminator loop destroys the context, and when the doMapDoc loop dequeues the task it results in a SEGV because the ctx has already been destroyed. Trying a fix that both increases the queue size and handles destroyed contexts.
Comment by Sriram Melkote [ 17/Sep/14 ]
Folks, let's follow this on MB-11917, as it's now clear that this bug is caused by the toy build, i.e. a result of the proposed fix for MB-11917.




[MB-12206] New 3.0 Doc Site, View and query pattern samples unparsed markup Created: 17/Sep/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Ruth Harris
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
On the page

http://draft.docs.couchbase.com/prebuilt/couchbase-manual-3.0/Views/views-querySample.html

The view code examples under 'General advice' are not displayed properly.

 Comments   
Comment by Ruth Harris [ 17/Sep/14 ]
Fixed. Legacy formatting issues from previous source code.




[MB-9656] XDCR destination endpoints for "getting xdcr stats via rest" in url encoding Created: 29/Nov/13  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.1.0, 2.2.0, 2.1.1, 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Patrick Varley Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: customer, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: http://docs.couchbase.com/couchbase-manual-2.2/#getting-xdcr-stats-via-rest


 Description   
In our documentation the destination endpoints are not URL-encoded, where "/" should be "%2F". This has misled customers. That section should use the following format:

replications%2F[UUID]%2F[source_bucket]%2F[destination_bucket]%2Fdocs_written

If this change is made, we should also remove this line:

You need to provide properly URL-encoded /[UUID]/[source_bucket]/[destination_bucket]/[stat_name]. To get the number of documents written:
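To illustrate the encoding the docs should show, here is a minimal Python sketch (the UUID and bucket names are placeholders) that builds the URL-encoded stat name:

import urllib.parse

def xdcr_stat_name(uuid, source_bucket, destination_bucket, stat_name):
    # Produces e.g. replications%2F[UUID]%2F[source_bucket]%2F[destination_bucket]%2Fdocs_written
    raw = "replications/%s/%s/%s/%s" % (uuid, source_bucket, destination_bucket, stat_name)
    return urllib.parse.quote(raw, safe="")

# Placeholder values, for illustration only.
print(xdcr_stat_name("some-remote-cluster-uuid", "default", "default", "docs_written"))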



 Comments   
Comment by Amy Kurtzman [ 16/May/14 ]
The syntax and example code in this whole REST section needs to be cleaned up and tested. It is a bigger job than just fixing this one.
Comment by Patrick Varley [ 17/Sep/14 ]
I fell down this hole again, and so did another Support Engineer. We really need to get this fixed in all versions.

The 3.0 documentation has this problem too.
Comment by Ruth Harris [ 17/Sep/14 ]
Why are you suggesting that the slash in the syntax be %2F?
This is not a blocker.




[MB-12192] XDCR : After warmup, replica items are not deleted in destination cluster Created: 15/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, DCP
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.x, 3.0.1-1297-rel

Attachments: Zip Archive 172.23.106.45-9152014-1553-diag.zip     GZip Archive 172.23.106.45-9152014-1623-couch.tar.gz     Zip Archive 172.23.106.46-9152014-1555-diag.zip     GZip Archive 172.23.106.46-9152014-1624-couch.tar.gz     Zip Archive 172.23.106.47-9152014-1558-diag.zip     GZip Archive 172.23.106.47-9152014-1624-couch.tar.gz     Zip Archive 172.23.106.48-9152014-160-diag.zip     GZip Archive 172.23.106.48-9152014-1624-couch.tar.gz    
Triage: Untriaged
Is this a Regression?: Yes

 Description   
Steps
--------
1. Set up uni-directional XDCR between 2 clusters with at least 2 nodes each.
2. Load 5000 items onto 3 buckets at the source; they get replicated to the destination.
3. Reboot a non-master node on the destination (in this test .48).
4. After warmup, perform 30% updates and 30% deletes on the source cluster.
5. Deletes get propagated to active vbuckets on the destination, but replica vbuckets only see partial deletion.

Important note
--------------------
This test had passed on 3.0.0-1208-rel and 3.0.0-1209-rel. However I'm able to reproduce this consistently on 3.0.1. Unsure if this is a recent regression.

2014-09-15 14:43:50 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 4250 == 3500 expected on '172.23.106.47:8091''172.23.106.48:8091', sasl_bucket_1 bucket
2014-09-15 14:43:51 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 4250 == 3500 expected on '172.23.106.47:8091''172.23.106.48:8091', standard_bucket_1 bucket
2014-09-15 14:43:51 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 4250 == 3500 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket

Testcase
------------
./testrunner -i /tmp/bixdcr.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,reboot=dest_node,items=5000,rdirection=unidirection,replication_type=xmem,standard_buckets=1,sasl_buckets=1,pause=source,doc-ops=update-delete,doc-ops-dest=update-delete

On destination cluster
-----------------------------

Arunas-MacBook-Pro:bin apiravi$ ./cbvdiff 172.23.106.47:11210,172.23.106.48:11210
VBucket 512: active count 4 != 6 replica count

VBucket 513: active count 2 != 4 replica count

VBucket 514: active count 8 != 11 replica count

VBucket 515: active count 3 != 4 replica count

VBucket 516: active count 8 != 10 replica count

VBucket 517: active count 5 != 6 replica count

VBucket 521: active count 0 != 1 replica count

VBucket 522: active count 7 != 11 replica count

VBucket 523: active count 3 != 5 replica count

VBucket 524: active count 6 != 10 replica count

VBucket 525: active count 4 != 6 replica count

VBucket 526: active count 4 != 6 replica count

VBucket 528: active count 7 != 10 replica count

VBucket 529: active count 3 != 4 replica count

VBucket 530: active count 3 != 4 replica count

VBucket 532: active count 0 != 2 replica count

VBucket 533: active count 1 != 2 replica count

VBucket 534: active count 8 != 10 replica count

VBucket 535: active count 5 != 6 replica count

VBucket 536: active count 7 != 11 replica count

VBucket 537: active count 3 != 5 replica count

VBucket 540: active count 3 != 4 replica count

VBucket 542: active count 6 != 10 replica count

VBucket 543: active count 4 != 6 replica count

VBucket 544: active count 6 != 10 replica count

VBucket 545: active count 3 != 4 replica count

VBucket 547: active count 0 != 1 replica count

VBucket 548: active count 6 != 7 replica count

VBucket 550: active count 7 != 10 replica count

VBucket 551: active count 4 != 5 replica count

VBucket 552: active count 9 != 11 replica count

VBucket 553: active count 4 != 6 replica count

VBucket 554: active count 4 != 5 replica count

VBucket 555: active count 1 != 2 replica count

VBucket 558: active count 7 != 10 replica count

VBucket 559: active count 3 != 4 replica count

VBucket 562: active count 6 != 10 replica count

VBucket 563: active count 4 != 5 replica count

VBucket 564: active count 7 != 10 replica count

VBucket 565: active count 4 != 5 replica count

VBucket 566: active count 4 != 5 replica count

VBucket 568: active count 3 != 4 replica count

VBucket 570: active count 8 != 10 replica count

VBucket 571: active count 4 != 6 replica count

VBucket 572: active count 7 != 10 replica count

VBucket 573: active count 3 != 4 replica count

VBucket 574: active count 0 != 1 replica count

VBucket 575: active count 0 != 1 replica count

VBucket 578: active count 8 != 10 replica count

VBucket 579: active count 4 != 6 replica count

VBucket 580: active count 8 != 11 replica count

VBucket 581: active count 3 != 4 replica count

VBucket 582: active count 3 != 4 replica count

VBucket 583: active count 1 != 2 replica count

VBucket 584: active count 3 != 4 replica count

VBucket 586: active count 6 != 10 replica count

VBucket 587: active count 3 != 4 replica count

VBucket 588: active count 7 != 10 replica count

VBucket 589: active count 4 != 5 replica count

VBucket 591: active count 0 != 2 replica count

VBucket 592: active count 8 != 10 replica count

VBucket 593: active count 4 != 6 replica count

VBucket 594: active count 0 != 1 replica count

VBucket 595: active count 0 != 1 replica count

VBucket 596: active count 4 != 6 replica count

VBucket 598: active count 7 != 10 replica count

VBucket 599: active count 3 != 4 replica count

VBucket 600: active count 6 != 10 replica count

VBucket 601: active count 3 != 4 replica count

VBucket 602: active count 4 != 6 replica count

VBucket 606: active count 7 != 10 replica count

VBucket 607: active count 4 != 5 replica count

VBucket 608: active count 7 != 11 replica count

VBucket 609: active count 3 != 5 replica count

VBucket 610: active count 3 != 4 replica count

VBucket 613: active count 0 != 1 replica count

VBucket 614: active count 6 != 10 replica count

VBucket 615: active count 4 != 6 replica count

VBucket 616: active count 7 != 10 replica count

VBucket 617: active count 3 != 4 replica count

VBucket 620: active count 3 != 4 replica count

VBucket 621: active count 1 != 2 replica count

VBucket 622: active count 9 != 11 replica count

VBucket 623: active count 5 != 6 replica count

VBucket 624: active count 5 != 6 replica count

VBucket 626: active count 7 != 11 replica count

VBucket 627: active count 3 != 5 replica count

VBucket 628: active count 6 != 10 replica count

VBucket 629: active count 4 != 6 replica count

VBucket 632: active count 0 != 1 replica count

VBucket 633: active count 0 != 1 replica count

VBucket 634: active count 7 != 10 replica count

VBucket 635: active count 3 != 4 replica count

VBucket 636: active count 8 != 10 replica count

VBucket 637: active count 5 != 6 replica count

VBucket 638: active count 5 != 6 replica count

VBucket 640: active count 2 != 4 replica count

VBucket 641: active count 7 != 11 replica count

VBucket 643: active count 5 != 7 replica count

VBucket 646: active count 3 != 5 replica count

VBucket 647: active count 7 != 10 replica count

VBucket 648: active count 4 != 6 replica count

VBucket 649: active count 8 != 10 replica count

VBucket 651: active count 0 != 1 replica count

VBucket 653: active count 4 != 6 replica count

VBucket 654: active count 3 != 4 replica count

VBucket 655: active count 7 != 10 replica count

VBucket 657: active count 4 != 5 replica count

VBucket 658: active count 2 != 4 replica count

VBucket 659: active count 7 != 11 replica count

VBucket 660: active count 3 != 5 replica count

VBucket 661: active count 7 != 10 replica count

VBucket 662: active count 0 != 2 replica count

VBucket 666: active count 4 != 6 replica count

VBucket 667: active count 8 != 10 replica count

VBucket 668: active count 3 != 4 replica count

VBucket 669: active count 7 != 10 replica count

VBucket 670: active count 1 != 2 replica count

VBucket 671: active count 2 != 3 replica count

VBucket 673: active count 0 != 1 replica count

VBucket 674: active count 3 != 4 replica count

VBucket 675: active count 7 != 10 replica count

VBucket 676: active count 5 != 6 replica count

VBucket 677: active count 8 != 10 replica count

VBucket 679: active count 5 != 6 replica count

VBucket 681: active count 6 != 7 replica count

VBucket 682: active count 3 != 5 replica count

VBucket 683: active count 8 != 12 replica count

VBucket 684: active count 3 != 6 replica count

VBucket 685: active count 7 != 11 replica count

VBucket 688: active count 3 != 4 replica count

VBucket 689: active count 7 != 10 replica count

VBucket 692: active count 1 != 2 replica count

VBucket 693: active count 2 != 3 replica count

VBucket 694: active count 5 != 6 replica count

VBucket 695: active count 8 != 10 replica count

VBucket 696: active count 3 != 5 replica count

VBucket 697: active count 8 != 12 replica count

VBucket 699: active count 4 != 5 replica count

VBucket 700: active count 0 != 1 replica count

VBucket 702: active count 3 != 6 replica count

VBucket 703: active count 7 != 11 replica count

VBucket 704: active count 3 != 5 replica count

VBucket 705: active count 8 != 12 replica count

VBucket 709: active count 4 != 5 replica count

VBucket 710: active count 3 != 6 replica count

VBucket 711: active count 7 != 11 replica count

VBucket 712: active count 3 != 4 replica count

VBucket 713: active count 7 != 10 replica count

VBucket 715: active count 3 != 4 replica count

VBucket 716: active count 1 != 2 replica count

VBucket 717: active count 0 != 2 replica count

VBucket 718: active count 5 != 6 replica count

VBucket 719: active count 8 != 10 replica count

VBucket 720: active count 0 != 1 replica count

VBucket 722: active count 3 != 5 replica count

VBucket 723: active count 8 != 12 replica count

VBucket 724: active count 3 != 6 replica count

VBucket 725: active count 7 != 11 replica count

VBucket 727: active count 5 != 7 replica count

VBucket 728: active count 2 != 4 replica count

VBucket 729: active count 3 != 5 replica count

VBucket 730: active count 3 != 4 replica count

VBucket 731: active count 7 != 10 replica count

VBucket 732: active count 5 != 6 replica count

VBucket 733: active count 8 != 10 replica count

VBucket 737: active count 3 != 4 replica count

VBucket 738: active count 4 != 6 replica count

VBucket 739: active count 8 != 10 replica count

VBucket 740: active count 3 != 4 replica count

VBucket 741: active count 7 != 10 replica count

VBucket 743: active count 0 != 1 replica count

VBucket 746: active count 2 != 4 replica count

VBucket 747: active count 7 != 11 replica count

VBucket 748: active count 3 != 5 replica count

VBucket 749: active count 7 != 10 replica count

VBucket 751: active count 3 != 4 replica count

VBucket 752: active count 4 != 6 replica count

VBucket 753: active count 9 != 11 replica count

VBucket 754: active count 1 != 2 replica count

VBucket 755: active count 4 != 5 replica count

VBucket 758: active count 3 != 4 replica count

VBucket 759: active count 7 != 10 replica count

VBucket 760: active count 2 != 4 replica count

VBucket 761: active count 7 != 11 replica count

VBucket 762: active count 0 != 1 replica count

VBucket 765: active count 6 != 7 replica count

VBucket 766: active count 3 != 5 replica count

VBucket 767: active count 7 != 10 replica count

VBucket 770: active count 3 != 5 replica count

VBucket 771: active count 7 != 11 replica count

VBucket 772: active count 4 != 6 replica count

VBucket 773: active count 6 != 10 replica count

VBucket 775: active count 3 != 4 replica count

VBucket 777: active count 3 != 4 replica count

VBucket 778: active count 3 != 4 replica count

VBucket 779: active count 7 != 10 replica count

VBucket 780: active count 5 != 6 replica count

VBucket 781: active count 8 != 10 replica count

VBucket 782: active count 1 != 2 replica count

VBucket 783: active count 0 != 2 replica count

VBucket 784: active count 3 != 5 replica count

VBucket 785: active count 7 != 11 replica count

VBucket 786: active count 0 != 1 replica count

VBucket 789: active count 4 != 6 replica count

VBucket 790: active count 4 != 6 replica count

VBucket 791: active count 6 != 10 replica count

VBucket 792: active count 3 != 4 replica count

VBucket 793: active count 8 != 11 replica count

VBucket 794: active count 2 != 4 replica count

VBucket 795: active count 4 != 6 replica count

VBucket 798: active count 5 != 6 replica count

VBucket 799: active count 8 != 10 replica count

VBucket 800: active count 4 != 6 replica count

VBucket 801: active count 8 != 10 replica count

VBucket 803: active count 3 != 4 replica count

VBucket 804: active count 0 != 1 replica count

VBucket 805: active count 0 != 1 replica count

VBucket 806: active count 3 != 4 replica count

VBucket 807: active count 7 != 10 replica count

VBucket 808: active count 3 != 4 replica count

VBucket 809: active count 6 != 10 replica count

VBucket 813: active count 4 != 5 replica count

VBucket 814: active count 4 != 5 replica count

VBucket 815: active count 7 != 10 replica count

VBucket 816: active count 1 != 2 replica count

VBucket 817: active count 4 != 5 replica count

VBucket 818: active count 4 != 6 replica count

VBucket 819: active count 8 != 10 replica count

VBucket 820: active count 3 != 4 replica count

VBucket 821: active count 7 != 10 replica count

VBucket 824: active count 0 != 1 replica count

VBucket 826: active count 3 != 4 replica count

VBucket 827: active count 6 != 10 replica count

VBucket 828: active count 4 != 5 replica count

VBucket 829: active count 7 != 10 replica count

VBucket 831: active count 6 != 7 replica count

VBucket 833: active count 4 != 6 replica count

VBucket 834: active count 3 != 4 replica count

VBucket 835: active count 6 != 10 replica count

VBucket 836: active count 4 != 5 replica count

VBucket 837: active count 7 != 10 replica count

VBucket 840: active count 0 != 1 replica count

VBucket 841: active count 0 != 1 replica count

VBucket 842: active count 4 != 6 replica count

VBucket 843: active count 8 != 10 replica count

VBucket 844: active count 3 != 4 replica count

VBucket 845: active count 7 != 10 replica count

VBucket 847: active count 4 != 6 replica count

VBucket 848: active count 3 != 4 replica count

VBucket 849: active count 6 != 10 replica count

VBucket 851: active count 3 != 4 replica count

VBucket 852: active count 0 != 2 replica count

VBucket 854: active count 4 != 5 replica count

VBucket 855: active count 7 != 10 replica count

VBucket 856: active count 4 != 6 replica count

VBucket 857: active count 8 != 10 replica count

VBucket 860: active count 1 != 2 replica count

VBucket 861: active count 3 != 4 replica count

VBucket 862: active count 3 != 4 replica count

VBucket 863: active count 8 != 11 replica count

VBucket 864: active count 3 != 4 replica count

VBucket 865: active count 7 != 10 replica count

VBucket 866: active count 0 != 1 replica count

VBucket 867: active count 0 != 1 replica count

VBucket 869: active count 5 != 6 replica count

VBucket 870: active count 5 != 6 replica count

VBucket 871: active count 8 != 10 replica count

VBucket 872: active count 3 != 5 replica count

VBucket 873: active count 7 != 11 replica count

VBucket 875: active count 5 != 6 replica count

VBucket 878: active count 4 != 6 replica count

VBucket 879: active count 6 != 10 replica count

VBucket 882: active count 3 != 4 replica count

VBucket 883: active count 7 != 10 replica count

VBucket 884: active count 5 != 6 replica count

VBucket 885: active count 9 != 11 replica count

VBucket 886: active count 1 != 2 replica count

VBucket 887: active count 3 != 4 replica count

VBucket 889: active count 3 != 4 replica count

VBucket 890: active count 3 != 5 replica count

VBucket 891: active count 7 != 11 replica count

VBucket 892: active count 4 != 6 replica count

VBucket 893: active count 6 != 10 replica count

VBucket 894: active count 0 != 1 replica count

VBucket 896: active count 8 != 10 replica count

VBucket 897: active count 4 != 6 replica count

VBucket 900: active count 2 != 3 replica count

VBucket 901: active count 2 != 3 replica count

VBucket 902: active count 7 != 10 replica count

VBucket 903: active count 3 != 4 replica count

VBucket 904: active count 7 != 11 replica count

VBucket 905: active count 2 != 4 replica count

VBucket 906: active count 4 != 5 replica count

VBucket 909: active count 0 != 2 replica count

VBucket 910: active count 7 != 10 replica count

VBucket 911: active count 3 != 5 replica count

VBucket 912: active count 0 != 1 replica count

VBucket 914: active count 8 != 10 replica count

VBucket 915: active count 4 != 6 replica count

VBucket 916: active count 7 != 10 replica count

VBucket 917: active count 3 != 4 replica count

VBucket 918: active count 4 != 6 replica count

VBucket 920: active count 5 != 7 replica count

VBucket 922: active count 7 != 11 replica count

VBucket 923: active count 2 != 4 replica count

VBucket 924: active count 7 != 10 replica count

VBucket 925: active count 3 != 5 replica count

VBucket 928: active count 4 != 5 replica count

VBucket 930: active count 8 != 12 replica count

VBucket 931: active count 3 != 5 replica count

VBucket 932: active count 7 != 11 replica count

VBucket 933: active count 3 != 6 replica count

VBucket 935: active count 0 != 1 replica count

VBucket 938: active count 7 != 10 replica count

VBucket 939: active count 3 != 4 replica count

VBucket 940: active count 8 != 10 replica count

VBucket 941: active count 5 != 6 replica count

VBucket 942: active count 2 != 3 replica count

VBucket 943: active count 1 != 2 replica count

VBucket 944: active count 8 != 12 replica count

VBucket 945: active count 3 != 5 replica count

VBucket 946: active count 6 != 7 replica count

VBucket 950: active count 7 != 11 replica count

VBucket 951: active count 3 != 6 replica count

VBucket 952: active count 7 != 10 replica count

VBucket 953: active count 3 != 4 replica count

VBucket 954: active count 0 != 1 replica count

VBucket 956: active count 5 != 6 replica count

VBucket 958: active count 8 != 10 replica count

VBucket 959: active count 5 != 6 replica count

VBucket 960: active count 7 != 10 replica count

VBucket 961: active count 3 != 4 replica count

VBucket 962: active count 3 != 5 replica count

VBucket 963: active count 2 != 4 replica count

VBucket 966: active count 8 != 10 replica count

VBucket 967: active count 5 != 6 replica count

VBucket 968: active count 8 != 12 replica count

VBucket 969: active count 3 != 5 replica count

VBucket 971: active count 0 != 1 replica count

VBucket 972: active count 5 != 7 replica count

VBucket 974: active count 7 != 11 replica count

VBucket 975: active count 3 != 6 replica count

VBucket 976: active count 3 != 4 replica count

VBucket 978: active count 7 != 10 replica count

VBucket 979: active count 3 != 4 replica count

VBucket 980: active count 8 != 10 replica count

VBucket 981: active count 5 != 6 replica count

VBucket 982: active count 0 != 2 replica count

VBucket 983: active count 1 != 2 replica count

VBucket 986: active count 8 != 12 replica count

VBucket 987: active count 3 != 5 replica count

VBucket 988: active count 7 != 11 replica count

VBucket 989: active count 3 != 6 replica count

VBucket 990: active count 4 != 5 replica count

VBucket 993: active count 0 != 1 replica count

VBucket 994: active count 7 != 11 replica count

VBucket 995: active count 2 != 4 replica count

VBucket 996: active count 7 != 10 replica count

VBucket 997: active count 3 != 5 replica count

VBucket 998: active count 5 != 6 replica count

VBucket 1000: active count 4 != 5 replica count

VBucket 1001: active count 1 != 2 replica count

VBucket 1002: active count 9 != 11 replica count

VBucket 1003: active count 4 != 6 replica count

VBucket 1004: active count 7 != 10 replica count

VBucket 1005: active count 3 != 4 replica count

VBucket 1008: active count 7 != 11 replica count

VBucket 1009: active count 2 != 4 replica count

VBucket 1012: active count 4 != 5 replica count

VBucket 1014: active count 7 != 10 replica count

VBucket 1015: active count 3 != 5 replica count

VBucket 1016: active count 8 != 10 replica count

VBucket 1017: active count 4 != 6 replica count

VBucket 1018: active count 3 != 4 replica count

VBucket 1020: active count 0 != 1 replica count

VBucket 1022: active count 7 != 10 replica count

VBucket 1023: active count 3 != 4 replica count

Active item count = 3500

Same at source
----------------------
Arunas-MacBook-Pro:bin apiravi$ ./cbvdiff 172.23.106.45:11210,172.23.106.46:11210
Active item count = 3500

Will attach cbcollect and data files.


 Comments   
Comment by Mike Wiederhold [ 15/Sep/14 ]
This is not a bug. We no longer do this because a replica vbucket cannot delete items on its own due to DCP.
Comment by Aruna Piravi [ 15/Sep/14 ]
I do not understand why this is not a bug. This is a case where replica items = 4250 and active = 3500. Both were initially 5000 before warmup. However, only 50% of the actual deletes have happened on the replica bucket (5000 -> 4250), and so I would expect the other 750 items to be deleted too so that active = replica. If this is not a bug, then in case of failover the cluster will end up having more items than it did before the failover.
Comment by Aruna Piravi [ 15/Sep/14 ]
> We no longer do this because a replica vbucket cannot delete items on its own due to DCP
Then I would expect the deletes to be propagated from active vbuckets through DCP, but these never get propagated. If you run cbvdiff even now, you can see the mismatch.
Comment by Sriram Ganesan [ 17/Sep/14 ]
Aruna

If there is a testrunner script available for steps (1) - (5), please update the bug. Thanks.
Comment by Aruna Piravi [ 17/Sep/14 ]
Done.




[MB-12138] {Windows - DCP}:: View Query fails with error 500 reason: error {"error":"error","reason":"{index_builder_exit,89,<<>>}"} Created: 05/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: windows, windows-3.0-beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.1-1267, Windows 2012, 64 x, machine:: 172.23.105.112

Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-12138/172.23.105.112-952014-1511-diag.zip
Is this a Regression?: Yes

 Description   


1. Create 1 Node cluster
2. Create default bucket and add 100k items
3. Create views and query it

Seeing the following exceptions

http://172.23.105.112:8092/default/_design/ddoc1/_view/default_view0?connectionTimeout=60000&full_set=true&limit=100000&stale=false error 500 reason: error {"error":"error","reason":"{index_builder_exit,89,<<>>}"}

We cannot run any view tests as a result


 Comments   
Comment by Anil Kumar [ 16/Sep/14 ]
Nimish/Siri - Any update on this?
Comment by Meenakshi Goel [ 17/Sep/14 ]
Seeing similar issue in Views DGM test http://qa.hq.northscale.net/job/win_2008_x64--69_06_view_dgm_tests-P1/1/console
Test : view.createdeleteview.CreateDeleteViewTests.test_view_ops,ddoc_ops=update,test_with_view=True,num_ddocs=4,num_views_per_ddoc=10,items=200000,active_resident_threshold=10,dgm_run=True,eviction_policy=fullEviction
Comment by Nimish Gupta [ 17/Sep/14 ]
We have found the root cause and working on the fix.




[MB-12207] Related links could be clearer. Created: 17/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Patrick Varley Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I think it would be better if the "Related links" section at the bottom of the page were laid out a little differently and we added the ability to navigate (MB-12205) from the bottom of a page (think long pages).

Maybe something like this could work:

Links

Parent Topic:
    Installation and upgrade
Previous Topic:
    Welcome to couchbase
Next Topic:
    uninstalling couchbase
Related Topics:
    Initial server setup
    Testing Couchbase Server
    Upgrading




[MB-12195] Update notifications does not seem to be working Created: 15/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Ian McCloy
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Centos 5.8
2.5.0

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I have installed a 2.5.0 build and enabled Update Notifications.
Even though I enabled "Enable software Update Notifications", I keep getting "No Updates available".
I thought I would be notified in the UI that 2.5.1 is available.

I consulted Tony to see if I had done something wrong, but he also confirmed that this seems to be an issue and is a bug.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
Based on dev tools, we're getting "no new version" from phone-home requests, so it's not a UI bug.
Comment by Ian McCloy [ 17/Sep/14 ]
Added the missing available upgrade paths to the database,

2.5.0-1059-rel-enterprise -> 2.5.1-1083-rel-enterprise
2.2.0-837-rel-enterprise -> 2.5.1-1083-rel-enterprise
2.1.0-718-rel-enterprise -> 2.2.0-837-rel-enterprise

but it looks like the code that parses http://ph.couchbase.net/v2?callback=jQueryxxx isn't checking the database.




[MB-12205] Doc-system: does not have a next page button. Created: 17/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Patrick Varley Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
When reading a manual you normally want to go to the next page. It would be good to have a "next" button at the bottom of the page. Here is a good example:

http://draft.docs.couchbase.com/prebuilt/couchbase-manual-3.0/Views/views-operation.html




[MB-12204] New doc-system does not have anchors Created: 17/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Patrick Varley Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
The support team uses anchors all the time to link customers directly to the section that has the information they require.

I know that we have broken a number of sections out into their own pages, but there are still some long pages, for example:

http://draft.docs.couchbase.com/prebuilt/couchbase-manual-3.0/Misc/security-client-ssl.html


It would be good if we could link the customer directly to: "Configuring the PHP client for SSL"

I have marked this as a blocker as it will affect the way the support team works today.




[MB-12203] Available-stats table formatted incorrectly Created: 17/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Task Priority: Minor
Reporter: Patrick Varley Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: http://docs.couchbase.com/couchbase-manual-2.5/cb-cli/#available-stats


 Description   
See the pending_ops cell in the link below.

http://docs.couchbase.com/couchbase-manual-2.5/cb-cli/#available-stats

I believe "client connections blocked for operations in pending vbuckets" should all be in one cell.




[MB-11938]  N1QL developer preview does not work with couchbase 3.0 beta. Created: 12/Aug/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Patrick Varley Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This came in on IRC; the user dropped offline before I could point them at Jira. I have created this defect on their behalf:

N1QL makes use of _all_docs which we have removed in 3.0.

The error from the query engine:

couchbase-query_dev_preview3_x86_64_mac ► ./cbq-engine -couchbase http://127.0.0.1:8091/
19:13:38.355197 Info line disabled false
19:13:38.367261 tuqtng started...
19:13:38.367282 version: v0.7.2
19:13:38.367287 site: http://127.0.0.1:8091/
19:14:24.179252 ERROR: Unable to access view - cause: error executing view req at http://127.0.0.1:8092/free/_all_docs?limit=1001: 400 Bad Request - {"error":"bad_request","reason":"_all_docs is no longer supported"}
 -- couchbase.(*viewIndex).ScanRange() at view_index.go:186
19:14:24.179272 Checking bucket URI: /pools/default/buckets/free?bucket_uuid=660ff64e9d1fdfee0c41017e89a4fe72
19:14:24.179315 ERROR: Get /pools/default/buckets/free?bucket_uuid=660ff64e9d1fdfee0c41017e89a4fe72: unsupported protocol scheme "" -- couchbase.(*viewIndex).ScanRange() at view_index.go:192

 Comments   
Comment by Gerald Sangudi [ 12/Aug/14 ]
Please use

CREATE PRIMARY INDEX

before issuing queries against 3.0.
Comment by Brett Lawson [ 17/Sep/14 ]
Hey Gerald,
I assume this is just a temporary workaround?
Cheers, Brett
Comment by Gerald Sangudi [ 17/Sep/14 ]
Hi Brett,

It may not be temporary. User would need to issue

CREATE PRIMARY INDEX

once per bucket. After that, they can query the bucket as often as needed. Subsequent calls to CREATE PRIMARY INDEX will notice the existing index and return immediately.

Maintaining the primary index is not cost-free, so we may not want to automatically create it for every bucket (e.g. a very large KV bucket with no N1QL or view usage).

Thanks,
Gerald




[MB-10662] _all_docs is no longer supported in 3.0 Created: 27/Mar/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-10649 _all_docs view queries fails with err... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
As of 3.0, view engine will no longer support the special predefined view, _all_docs.

It was not a published feature, but as it has been around for a long time, it is possible it was actually utilized in some setups.

We should document that _all_docs queries will not work in 3.0

 Comments   
Comment by Cihan Biyikoglu [ 27/Mar/14 ]
Thanks. Are there internal tools depending on this? Do you know if we have deprecated this in the past? I realize it isn't a supported API, but I want to make sure we keep the door open for feedback during beta from large customers etc.
Comment by Perry Krug [ 28/Mar/14 ]
We have a few (very few) customers who have used this. They've known it is unsupported...but that doesn't ever really stop anyone if it works for them.

Do we have a doc describing what the proposed replacement will look like and will that be available for 3.0?
Comment by Ruth Harris [ 01/May/14 ]
_all_docs is not mentioned anywhere in the 2.2+ documentation. Not sure how to handle this. It's not deprecated because it was never intended for use.
Comment by Perry Krug [ 01/May/14 ]
I think at the very least a prominent release note is appropriate.
Comment by Gerald Sangudi [ 17/Sep/14 ]
For N1QL, please advise customers to do

CREATE PRIMARY INDEX on --bucket-name--.




[MB-12101] A tool to restore corrupt vbucket file Created: 29/Aug/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Larry Liu Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Relates to

 Description   
Occasionally, a vbucket file might become corrupted. It would be good to have a tool that can restore the data from a vbucket file.

 Comments   
Comment by Chiyoung Seo [ 03/Sep/14 ]
I'm not sure exactly what this ticket means. We can't fully restore the up-to-date state from a corrupted database file, but we can instead write a tool that allows us to restore one of the latest versions that is not corrupted.




[MB-12176] Missing port number on the network ports documentation for 3.0 Created: 12/Sep/14  Updated: 16/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Cihan Biyikoglu Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Ruth Harris [ 16/Sep/14 ]
The Network Ports section of the Couchbase Server 3.0 beta doc has been updated with the new SSL port, 11207, and the table with the details for all of the ports has been updated.

http://docs.couchbase.com/prebuilt/couchbase-manual-3.0/Install/install-networkPorts.html
The site (and network ports section) should be refreshed soon.

thanks, Ruth




[MB-8297] Some key projects are still hosted at Membase GitHub account Created: 16/May/13  Updated: 16/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.1.0, 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Major
Reporter: Pavel Paulau Assignee: Trond Norbye
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-12185 update to "couchbase" from "membase" ... Open

 Description   
memcached, libmemcached, grommit, buildbot-internal...

They are important components of the build workflow. For instance, repo manifests have multiple references to these projects.

This is very confusing legacy; I believe we can avoid it.

 Comments   
Comment by Chris Hillery [ 16/Sep/14 ]
buildbot-internal is at github.com/couchbase/buildbot-internal.

grommit will hopefully be retired in the 3.5 timeframe, and until then I don't want the disruption of moving it; it's private and internal.

Matt has opened MB-12185 to track moving memcached, which is the only project still referenced in the Couchbase server manifest from the membase remote.

libmemcached has been moved inside the "moxi" package for the Couchbase server build. Trond, two questions:

1. Does the project github.com/membase/libmemcached still have a purpose?

2. Do you think there are any projects under github.com/membase (including libmemcached) that should be retired, moved, or deprecated?




[MB-12199] curl -H arguments need to use double quotes Created: 16/Sep/14  Updated: 16/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0, 2.5.1, 3.0.1, 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Matt Ingenthron Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Current documentation states:

Indicates that an HTTP PUT operation is requested.
-H 'Content-Type: application/json'

And that will fail, seemingly owing to the single quotes. See also:
https://twitter.com/RamSharp/status/511739806528077824


 Comments   
Comment by Ruth Harris [ 16/Sep/14 ]
TASK for TECHNICAL WRITER
Fix in 3.0 == FIXED: Added single quotes or removed quotes from around the http string in appropriate examples.
Design Doc rest file - added single quotes, Compaction rest file ok, Trbl design doc file ok

FIX in 2.5: TBD

-----------------------

CONCLUSION:
At least with PUT, both single and double quotes work around: Content-Type: application/json. Didn't check GET or DELETE.
With PUT and DELETE, no quotes and single quotes around the http string work. Note: Some of the examples are missing a single quote around the http string. Meaning, one quote is present, but either the ending or beginning quote is missing. Didn't check GET.

Perhaps a missing single quote around the http string was the problem?
Perhaps there were formatting tags associated with ZlatRam's byauth.ddoc code that were causing the problem?

----------------------

TEST ONE:
1. create a ddoc and view from the UI = testview and testddoc
2. retrieve the ddoc using GET
3. use single quotes around Content-Type: application/json and around the http string. Note: Some of the examples are missing single quotes around the http string.
code: curl -X GET -H 'Content-Type: application/json' 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_testddoc'
results: {
    "views": {
        "testview": {
            "map": "function (doc, meta) {\n emit(meta.id, null);\n}"
        }
    }
}

TEST TWO:
1. delete testddoc
2. use single quotes around Content-Type: application/json and around the http string
code: curl -X DELETE -H 'Content-Type: application/json' 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_testddoc'
results: {"ok":true,"id":"_design/dev_testddoc"}
visual check via UI: Yep, it's gone


TEST THREE:
1. create a myauth.ddoc text file using the code in the Couchbase design doc documentation page.
2. Use PUT to create a dev_myauth design doc
3. use single quotes around Content-Type: application/json and around the http string. Note: I used "| python -m json.tool" to get pretty print output

myauth.ddoc contents: {"views":{"byloc":{"map":"function (doc, meta) {\n if (meta.type == \"json\") {\n emit(doc.city, doc.sales);\n } else {\n emit([\"blob\"]);\n }\n}"}}}
code: curl -X PUT -H 'Content-Type: application/json' 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_myauth' -d @myauth.ddoc | python -m json.tool
results: {
    "id": "_design/dev_myauth",
    "ok": true
}
visual check via UI: Yep, it's there.

TEST FOUR:
1. copy myauth.ddoc to zlat.ddoc
2. Use PUT to create a dev_zlat design doc
3. use double quotes around Content-Type: application/json and single quotes around the http string.

zlat.ddoc contents: {"views":{"byloc":{"map":"function (doc, meta) {\n if (meta.type == \"json\") {\n emit(doc.city, doc.sales);\n } else {\n emit([\"blob\"]);\n }\n}"}}}
code: curl -X PUT -H "Content-Type: application/json" 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_zlat' -d @zlat.ddoc | python -m json.tool
results: {
    "id": "_design/dev_zlat",
    "ok": true
}
visual check via UI: Yep, it's there.


TEST FIVE:
1. create a ddoc text file using ZlatRam's ddoc code
2. flattened the formatting so it reflected the code in the Couchbase example (used above)
3. Use PUT and single quotes.

zlatram contents: {"views":{"byauth":{"map":"function (doc, username) {\n if (doc.type == \"session\" && doc.user == username && Date.Parse(doc.expires) > Date.Parse(Date.Now()) ) {\n emit(doc.token, null);\n }\n}"}}}
code: curl -X PUT -H 'Content-Type: application/json' 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_zlatram' -d @zlatram.ddoc | python -m json.tool
results: {
    "id": "_design/dev_zlatram",
    "ok": true
}
visual check via UI: Yep, it's there.

TEST SIX:
1. delete zlatram ddoc but without quotes around the http string: curl -X DELETE -H 'Content-Type: application/json' http://Administrator:password@10.5.2.54:8092/test/_design/dev_zlatram
2. results: {
    "id": "_design/dev_zlatram",
    "ok": true
}
3. verify via UI: Yep, it gone
4. add zlatram but without quotes around the http string: curl -X PUT -H 'Content-Type: application/json' http://Administrator:password@10.5.2.54:8092/test/_design/dev_zlatram
5. results: {
    "id": "_design/dev_zlatram",
    "ok": true
}
6. verify via UI: Yep, it back.




[MB-12167] Remove Minor / Major / Page faults graphs from the UI Created: 10/Sep/14  Updated: 16/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Trivial
Reporter: Ian McCloy Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 1
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Customers often ask what is wrong with their system when they see anything greater than 0 page faults in the UI graphs. What are customers supposed to do with the information? This isn't a useful metric for customers and we shouldn't show it in the UI. If needed for development debugging we can query it from the REST API.

 Comments   
Comment by Matt Ingenthron [ 10/Sep/14 ]
Just to opine: +1. There are a number of things in the UI that aren't actionable. I know they help us when we look back over time, but as presented it's not useful.
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
So it's essentially an expression of our belief that the majority of our customers are ignorant enough to be confused by "fault" in the name of this stat?

Just want to make sure that there's no misunderstanding on this.

On Matt's point, I'd like to say that none of our stats are actionable. They're just information that might end up being helpful occasionally. And yes, major page faults especially are a _tremendously_ helpful sign of issues.
Comment by Matt Ingenthron [ 10/Sep/14 ]
I don't think the word "fault" is at issue, but maybe others do. I know there are others that aren't actionable and to be honest, I take issue with them too. This one is just one of the more egregious examples. :) The problem is, in my opinion, it's not clear what one would do with minor page fault data. One can't really know what's good or bad without looking at trends or doing further analysis.

While I'm tossing out opinions, similarly visualizing everything as a queue length isn't always good. To the app, latency and throughput matter-- how many queues and where they are affects this, but doesn't define it. A big queue length with fast storage can still have very good latency/throughput and equally a short queue length with slow or variable (i.e., EC2 EBS) storage can have poor latency/throughput. An app that will slow down with higher latencies won't make the queue length any bigger.

Anyway, pardon the wide opinion here-- I know you know all of this and I look forward to improvements when we get to them.

You raise a good point on major faults though.

If it only helps occasionally, then it's consistent with the request (to remove it from the UI, but still have it in there). I'm merely a user here, so please discount my opinion accordingly!
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
>> If it only helps occasionally, then it's consistent with the request (to remove it from the UI, but still have it in there).

Well, but then it's true for almost all of our stats, isn't it? Doesn't that mean we need to hide them all then?
Comment by Matt Ingenthron [ 10/Sep/14 ]
>> Well but then it's true for almost all of our stats isn't? Doesn't it mean that we need to hide them all then ?

I don't think so. That's an extreme argument. I'd put ops/s which is directly proportional to application load and minor faults which is affected by other things on the system in very different categories. Do we account for minor faults at a per-bucket level? ;)
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
>> I'd put ops/s which is directly proportional to application load and minor faults which is affected by other things on the system in very different categories.

True.

>> Do we account for minor faults at a per-bucket level? ;)

No. And good point. Indeed lacking better UI we show all system stats (including some high-usefulness category things like count of memcached connections) as part of showing any bucket's stats. Despite gathering and storing system stats separately.

In any case, I'm not totally against hiding page fault stats. It's indeed a minor topic.

But I'd like to see a good reason for that. Because for _anything_ that we do there will always be at least one user who's confused, which isn't IMO a valid reason for "let's hide it".

My team has spent some effort getting these stats, and we did so specifically because we knew that major page faults are important to be aware of. And we also know that on Linux even minor page faults might be "major" in terms of latency impact. We've seen it with our own eyes.

I.e. when you're running out of free pages, one might think that Linux is just supposed to grab one of the clean pages from the page cache, but we've seen this take seconds for reasons I'm not quite sure of. It does look like Linux might routinely delay a minor page fault for IO (perhaps due to some locking effects). And things like huge-page "minor" page faults may have an even more obviously hard effect (i.e. because you need a physically contiguous run of memory, and getting that might require "memory compaction", locking, etc.). And our system, doing constant non-direct-IO writes, routinely hits this hard condition, because nearly every write from ep-engine or the view engine has to allocate brand new page(s) for that data due to the append-onlyness of our design (ForestDB's direct IO path plus custom buffer cache management should help dramatically here).

Comment by Patrick Varley [ 10/Sep/14 ]
I think there are three main consumers of stats:

* Customers (cmd_get)
* Support (ep_bg_load_avg)
* Developers of the component (erlang memory atom_used)

As a result we display and collect these stats in different ways, i.e. UI, cbstats, ns_doctor, etc.

A number of our users find the amount of stats in the UI overwhelming; a lot of the time they do not know which ones are important.

Some of our users do not even understand what a virtual memory system is, let alone what a page fault is.

I do not think we should display the page faults in the UI, but we should still collect them. I believe we can make better use of the space in the UI, for example: network usage (byte_written or byte_read), TCP retransmissions, disk performance.
Comment by David Haikney [ 11/Sep/14 ]
+ 1 for removing page faults. The justification:
* We put them front and centre of the UI. Customers see Minor faults, Major Faults and Total faults before # gets, # sets.
* They have not proven useful for support in diagnosing an issue. In fact they cause more "false positive" questions ("my minor faults look high, is that a problem?")
* Overall this constitutes "noise" that our customers can do without. The stats can quite readily be captured elsewhere if we want to record them.

It would be easy to expand this into a wider discussion of how we might like to reorder / expand all of the current graphs in the UI - and that's a useful discussion. But I propose we keep this ticket to the question of removing the page fault stats.
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41333
Comment by Ian McCloy [ 16/Sep/14 ]
Which version of Couchbase Server is this fixed in?




[MB-11939] Bucket configuration dialog should mention that fullEviction policy doesn't retain keys Created: 12/Aug/14  Updated: 16/Sep/14  Resolved: 16/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Pavel Blagodov
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
Current wording:
"Retain metadata in memory even for non-resident items"
"Don't retain metadata in memory for non-resident items"

"Metadata" is kind of ambiguous. Please mention keys explicitly.


 Comments   
Comment by Ilam Siva [ 12/Aug/14 ]
Change Radiobutton options to read:
Value Eviction (radiobutton selected by default)
Full Eviction

Change "Whats this?" hint:
Value Eviction - During eviction, only the value will be evicted (key and metadata will remain in memory)
Full Eviction - During eviction, everything (including key, metadata and value) will be evicted
Value Eviction needs more system memory but provides the best performance. Full Eviction reduces memory overhead requirement.
Comment by Anil Kumar [ 12/Aug/14 ]
Pavel - This changes will go in to Bucket Creation and Bucket Edit UI
Comment by Pavel Blagodov [ 14/Aug/14 ]
http://review.couchbase.org/40576
Comment by Anil Kumar [ 09/Sep/14 ]
Looks like we made a typo that needs to be corrected.

Change Radiobutton options to read:
Value Ejection (radiobutton selected by default)
Full Ejection

Change "Whats this?" hint:
Value Ejection - During ejection, only the value will be ejected (key and metadata will remain in memory)
Full Ejection - During ejection, everything (including key, metadata and value) will be ejected
Value Ejection needs more system memory but provides the best performance. Full Ejection reduces memory overhead requirement.
Comment by Pavel Blagodov [ 11/Sep/14 ]
http://review.couchbase.org/41358




[MB-12141] Try to delete a Server group that is empty. The error message needs to be descriptive Created: 05/Sep/14  Updated: 16/Sep/14  Resolved: 16/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Pavel Blagodov
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows build 3.0.1_1261
Environment: Windows 7 64 bit

Attachments: PNG File Screen Shot 2014-09-05 at 5.22.08 PM.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Login to the Couchbase console
http://10.2.2.52:8091/ (Administrator/Password)
Click on Server Nodes
Try to create a group and then click to delete
You will see the error as seen in the screenshot
Expected behavior: "Removing Server Group" as the title, and it should say "Are you sure you want to remove the server group?" or something like that

 Comments   
Comment by Pavel Blagodov [ 11/Sep/14 ]
http://review.couchbase.org/41359




[MB-11612] mapreduce: terminator thread can be up to maxTaskDuration late Created: 02/Jul/14  Updated: 16/Sep/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Dave Rigby Assignee: Harsha Havanur
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
From investigation of how long-running map functions are terminated I noticed that the code to terminate them (mapreduce_nif.cc:terminatorLoop) only checks for long-running tasks every maxTaskDuration seconds.

Therefore if a task is close to (but not exceeding) its timeout period, the terminatorLoop thread will sleep again for maxTaskDuration seconds, and will not detect the long-running task until almost 2x the timeout. Excerpt from the code: https://github.com/couchbase/couchdb/blob/master/src/mapreduce/mapreduce_nif.cc#L459

    while (!shutdownTerminator) {
        enif_mutex_lock(terminatorMutex);
        // due to truncation of second's fraction lets pretend we're one second before
        now = time(NULL) - 1;

        for (it = contexts.begin(); it != contexts.end(); ++it) {
            map_reduce_ctx_t *ctx = (*it).second;

            if (ctx->taskStartTime >= 0) {
                if (ctx->taskStartTime + maxTaskDuration < now) {
                    terminateTask(ctx);
                }
            }
        }

        enif_mutex_unlock(terminatorMutex);
        doSleep(maxTaskDuration * 1000);
    }


We should either check more frequently, or calculate how far away the "oldest" task is from hitting its deadline and sleep for that period.
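
A minimal sketch of the second option, modelled on the excerpt above (it reuses the same names from that code, such as contexts, terminatorMutex, maxTaskDuration, terminateTask and doSleep, so it is not standalone code and is not the actual fix): track the nearest pending deadline while scanning the contexts, then sleep only until that deadline instead of a full maxTaskDuration.

    while (!shutdownTerminator) {
        enif_mutex_lock(terminatorMutex);
        // same one-second allowance as the original loop
        time_t now = time(NULL) - 1;
        // worst case we still wake up after maxTaskDuration seconds
        time_t nextDeadline = now + maxTaskDuration;

        for (it = contexts.begin(); it != contexts.end(); ++it) {
            map_reduce_ctx_t *ctx = (*it).second;

            if (ctx->taskStartTime >= 0) {
                time_t deadline = ctx->taskStartTime + maxTaskDuration;
                if (deadline < now) {
                    terminateTask(ctx);
                } else if (deadline < nextDeadline) {
                    // remember the oldest task that has not yet expired
                    nextDeadline = deadline;
                }
            }
        }

        enif_mutex_unlock(terminatorMutex);
        // sleep only until the nearest deadline (at least one second)
        time_t sleepSecs = (nextDeadline > now) ? (nextDeadline - now) : 1;
        doSleep(sleepSecs * 1000);
    }

With something along these lines the thread wakes up roughly when the oldest task actually hits its timeout, so a task would not run much past maxTaskDuration before being terminated.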



 Comments   
Comment by Sriram Melkote [ 07/Jul/14 ]
Good catch, thanks! We'll fix this in 3.0.1, as we're limiting changes for 3.0 now that we've hit beta.




[MB-12128] Stale=false may not ensure RYOW property (Regression) Created: 03/Sep/14  Updated: 16/Sep/14  Resolved: 16/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Sarath Lakshman Assignee: Sarath Lakshman
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
For performance reasons, we tried to reply to stale=false query readers immediately after the updater's internal checkpoint. This may result in sending index updates after partial snapshot reads, and the user may not observe RYOW. To ensure RYOW, we should always return results after processing a complete UPR snapshot.

We just need to revert this commit to fix the problem, https://github.com/couchbase/couchdb/commit/e866fe9330336ab1bda92743e0bd994530532cc8

We are fairly confident that reverting this change will not break anything. It was added as a pure performance improvement.

 Comments   
Comment by Sarath Lakshman [ 04/Sep/14 ]
Added a unit test to prove this case
http://review.couchbase.org/#/c/41192

Here is the change for reverting corresponding commit
http://review.couchbase.org/#/c/41193/
Comment by Wayne Siu [ 04/Sep/14 ]
As discussed in the release meeting on 09.04.14, this is scheduled for 3.0.1.
Comment by Sarath Lakshman [ 16/Sep/14 ]
Merged




[MB-10012] cbrecovery hangs in the case of multi-instance case Created: 24/Jan/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Venu Uppalapati Assignee: Ashvinder Singh
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive cbrecovery1.zip     Zip Archive cbrecovery2.zip     Zip Archive cbrecovery3.zip     Zip Archive cbrecovery4.zip     Zip Archive cbrecovery_source1.zip     Zip Archive cbrecovery_source2.zip     Zip Archive cbrecovery_source3.zip     Zip Archive cbrecovery_source4.zip     PNG File recovery.png    
Issue Links:
Relates to
Triage: Triaged
Operating System: Centos 64-bit

 Description   
2.5.0-1055

During verification of MB-9967 I performed the same steps:
source cluster: 3 nodes, 4 buckets
destination cluster: 3 nodes, 1 bucket
failed over 2 nodes on the destination cluster (without rebalance)

cbrecovery hangs on

[root@centos-64-x64 ~]# /opt/couchbase/bin/cbrecovery http://172.23.105.158:8091 http://172.23.105.159:8091 -u Administrator -U Administrator -p password -P password -b RevAB -B RevAB -v
Missing vbuckets to be recovered:[{"node": "ns_1@172.23.105.159", "vbuckets": [513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023]}]
2014-01-22 01:27:59,304: mt cbrecovery...
2014-01-22 01:27:59,304: mt source : http://172.23.105.158:8091
2014-01-22 01:27:59,305: mt sink : http://172.23.105.159:8091
2014-01-22 01:27:59,305: mt opts : {'username': '<xxx>', 'username_destination': 'Administrator', 'verbose': 1, 'dry_run': False, 'extra': {'max_retry': 10.0, 'rehash': 0.0, 'data_only': 1.0, 'nmv_retry': 1.0, 'conflict_resolve': 1.0, 'cbb_max_mb': 100000.0, 'try_xwm': 1.0, 'batch_max_bytes': 400000.0, 'report_full': 2000.0, 'batch_max_size': 1000.0, 'report': 5.0, 'design_doc_only': 0.0, 'recv_min_bytes': 4096.0}, 'bucket_destination': 'RevAB', 'vbucket_list': '{"172.23.105.159": [513]}', 'threads': 4, 'password_destination': 'password', 'key': None, 'password': '<xxx>', 'id': None, 'bucket_source': 'RevAB'}
2014-01-22 01:27:59,491: mt bucket: RevAB
2014-01-22 01:27:59,558: w0 source : http://172.23.105.158:8091(RevAB@172.23.105.156:8091)
2014-01-22 01:27:59,559: w0 sink : http://172.23.105.159:8091(RevAB@172.23.105.156:8091)
2014-01-22 01:27:59,559: w0 : total | last | per sec
2014-01-22 01:27:59,559: w0 batch : 1 | 1 | 15.7
2014-01-22 01:27:59,559: w0 byte : 0 | 0 | 0.0
2014-01-22 01:27:59,559: w0 msg : 0 | 0 | 0.0
2014-01-22 01:27:59,697: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210
2014-01-22 01:27:59,719: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210
2014-01-22 01:27:59,724: w2 source : http://172.23.105.158:8091(RevAB@172.23.105.158:8091)
2014-01-22 01:27:59,724: w2 sink : http://172.23.105.159:8091(RevAB@172.23.105.158:8091)
2014-01-22 01:27:59,727: w2 : total | last | per sec
2014-01-22 01:27:59,728: w2 batch : 1 | 1 | 64.0
2014-01-22 01:27:59,728: w2 byte : 0 | 0 | 0.0
2014-01-22 01:27:59,728: w2 msg : 0 | 0 | 0.0
2014-01-22 01:27:59,738: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210
2014-01-22 01:27:59,757: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210



 Comments   
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - June 04 2014 Bin, Ashivinder, Venu, Tony
Comment by Cihan Biyikoglu [ 27/Aug/14 ]
Does this need to be considered for 3.0, or is this a test issue only?
Comment by Cihan Biyikoglu [ 29/Aug/14 ]
Please rerun on RC2 and validate that this is a test execution issue and not a tools issue.
Thanks.
Comment by Ashvinder Singh [ 15/Sep/14 ]
could not reproduce this issue.
Comment by Ashvinder Singh [ 15/Sep/14 ]
cannot reproduce




[MB-12164] UI: Cancelling a pending add should not show "reducing capacity" dialog Created: 10/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Improvement Priority: Trivial
Reporter: David Haikney Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
3.0.0 Beta build 2

Steps to reproduce:
In the UI click "Server add".
Add the credentials for a server to be added
In the Pending Rebalance pane click "Cancel"

Actual Behaviour:
See a dialog stating "Warning – Removing this server from the cluster will reduce cache capacity across all data buckets. Are you sure you want to remove this server?"

Expected behaviour:
The dialog is not applicable in this context, since cancelling the addition of a node that was never added will do nothing to the cluster capacity. Would expect either no dialog or a dialog acknowledging that "This node will no longer be added to the cluster on next rebalance".

 Comments   
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
But it _is_ applicable, because you're returning the node to the "pending remove" state.
Comment by David Haikney [ 10/Sep/14 ]
A node that has never held any data or actively participated in the cluster cannot possibly reduce the cluster's capacity.
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
It looks like I misunderstood this request as referring to cancelling add-back after failover. Which it isn't.

Makes sense now.
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41428




[MB-12147] {UI} :: Memcached Bucket with 0 items indicates NaNB / NaNB for Data/Disk Usage Created: 08/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Parag Agarwal Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Any environment

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
On a 1-node cluster, create a memcached bucket with 0 items. The UI says NaNB / NaNB for Data/Disk Usage.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41379




[MB-12156] time of check/time of use race in data path change code of ns_server may lead to deletion of all buckets after adding node to cluster Created: 09/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0, 1.8.1, 2.0, 2.1.0, 2.2.0, 2.1.1, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: No

 Description   
SUBJ.

In the code that changes the data path we first check if the node is provisioned (without preventing its provisioned state from changing after that) and then proceed with the change of data path. As part of changing the data path we delete buckets.

So if the node gets added to a cluster after the check but before the data path is actually changed, we'll delete all of the cluster's buckets.

As improbable as it may seem, it actually occurred in practice. See CBSE-1387.
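
To illustrate the shape of the race (and of the fix), here is a small sketch written in C++ for brevity with hypothetical names (provisioned, delete_buckets, change_data_path); it does not correspond to the actual ns_server Erlang code, it only shows why the provisioned check and the destructive action must sit in the same critical section:

    #include <mutex>

    struct Node {
        std::mutex lock;
        bool provisioned = false;   // becomes true once the node joins a cluster
    };

    void delete_buckets() { /* destructive cleanup done while changing the data path */ }

    // Racy version: time-of-check and time-of-use are separate critical
    // sections, so the node can join a cluster in between and the cleanup
    // then deletes the cluster's buckets.
    void change_data_path_racy(Node &node) {
        bool not_provisioned;
        {
            std::lock_guard<std::mutex> guard(node.lock);
            not_provisioned = !node.provisioned;    // time of check
        }
        if (not_provisioned) {
            delete_buckets();                       // time of use
        }
    }

    // Fixed version: the check and the action happen under the same lock,
    // so provisioning cannot be interleaved between them.
    void change_data_path_fixed(Node &node) {
        std::lock_guard<std::mutex> guard(node.lock);
        if (!node.provisioned) {
            delete_buckets();
        }
    }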


 Comments   
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
Whether it's a must have for 3.0.0 is not for me to decide but here's my thinking.

* the bug has been there at least since 2.0.0, and it really requires something outstanding in a customer's environment to actually occur

* 3.0.1 is just couple months away

* 3.0.0 is done

But if we're still open to adding this fix to 3.0.0, my team will surely be glad to do it.
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41332
http://review.couchbase.org/41333




[MB-12196] [Windows] When I run cbworkloadgen.exe, I see a Warning message Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7
Build 1299

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Install 3.0.1_1299 build
Go to bin directory on the installation directory, run cbworkloadgen.exe
You will see the following warning:
WARNING:root:could not import snappy module. Compress/uncompress function will be skipped.

Expected behavior: The above warning should not appear





[MB-11094] add python-snappy to server builds Created: 12/May/14  Updated: 15/Sep/14  Resolved: 13/Jun/14

Status: Resolved
Project: Couchbase Server
Component/s: 3rd-party, build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Bin Cui Assignee: Trond Norbye
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt: finish-start
has to be done after MB-10957 cbtransfer and cbrestore for document... Resolved

 Description   
In order to fix MB-10597, we need to

[X] clone https://github.com/andrix/python-snappy under github.com/couchbase
[ ] add python-snappy to Windows builds
[X] check to see if python 2.4 is installed on all build machines
[X] verify that adding python 2.4 does not modify build environment for 2.x server builds
[X] add python 2.4 to any build machines it's missing from


 Comments   
Comment by Bin Cui [ 12/May/14 ]
Since this is a third-party project, I don't think we need to add it to Gerrit or add any commit-validation jobs for it. We just use it like other third-party modules such as snappy and v8.

We only need Python 2.4 for the CentOS build, and we already use it for the current CentOS build.
Comment by Phil Labee [ 12/May/14 ]
see:

    https://github.com/couchbase/python-snappy
Comment by Phil Labee [ 12/May/14 ]
Add more tasks under the description, or link to another bug, if you decide to prebuild and store in gerrit or depot.zip.


Comment by Chris Hillery [ 12/May/14 ]
Something else for the depot, Trond. This needs to be built for Windows 64-bit and (eventually) 32-bit.
Comment by Trond Norbye [ 13/Jun/14 ]
MB-11084




[MB-12194] [Windows] When you try to uninstall CB server it comes up with Installer wizard instead of uninstall Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7
Build: 3.0.1_1299

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Install the Windows 3.0.1_1299 build.
Try to uninstall the CB server.
You will see the CB InstallShield Installation Wizard, which then prompts you to remove the selected application and all of its features.

Expected result: It would be better to show an Uninstall Wizard instead of the confusing Installation Wizard.




[MB-12193] Docs should explicitly state that we don't support online downgrades in the installation guide Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Critical
Reporter: Gokul Krishnan Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
In the installation guide, we should call out the fact that online downgrades (from 3.0 to 2.5.1) are not supported and that downgrades will require servers to be taken offline.

 Comments   
Comment by Ruth Harris [ 15/Sep/14 ]
In the 3.0 documentation:

Upgrading >
<note type="important">Online downgrades from 3.0 to 2.5.1 are not supported. Downgrades require that servers be taken offline.</note>

Should this be in the release notes too?
Comment by Matt Ingenthron [ 15/Sep/14 ]
"online" or "any"?




[MB-12191] forestdb needs an fdb_destroy() api to clean up a db Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Sundar Sridharan Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged
Is this a Regression?: Unknown

 Description   
ForestDB does not have an option to clean up a database.
The current workaround is to manually delete the database files after fdb_close() and fdb_shutdown().
An fdb_destroy() API needs to be added that cleanly erases all ForestDB files.




[MB-12187] Webinterface is not displaying items above 2.5kb in size Created: 15/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Philipp Fehre Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: MacOS, Webinterface

Attachments: PNG File document_size_couchbase.png    

 Description   
When trying to display a document larger than 2.5 KB, the web interface blocks the display. 2.5 KB is easily reached by regular documents, which makes using the web interface inefficient, especially when a bucket contains many documents close to this limit.
It makes sense to have a limit so that really big documents are not loaded into the interface, but 2.5 KB seems like a very low threshold.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
By design. Older browsers have trouble with larger docs. There must be a duplicate of this somewhere.




[MB-12188] we should not duplicate log messages if we already have logs with "repeated n times" template Created: 15/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Andrei Baranouski Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File MB-12188.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Please see the screenshot.

I think the log entries without the "repeated n times" suffix are unnecessary.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
They _are_ necessary. The logic (and it's the same logic many logging products have) is: _if_ within a short period of time (say 5 minutes) you get a bunch of identical messages, they are logged once. But if the period between messages is larger, they are logged separately.




[MB-12190] Typo in the output of couchbase-cli bucket-flush Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Patrick Varley Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: cli
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
There should be a space between the full stop and Do.

[patrick:~] 2 $ couchbase-cli bucket-flush -b Test -c localhost
Running this command will totally PURGE database data from disk.Do you really want to do it? (Yes/No)

Another typo when the command times out:

Running this command will totally PURGE database data from disk.Do you really want to do it? (Yes/No)TIMED OUT: command: bucket-flush: localhost:8091, most likely bucket is not flushed
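
The fix is just to build the prompt with the missing spaces; a hedged sketch of the intended wording (the real couchbase-cli string and confirmation logic live elsewhere):

# Hypothetical prompt construction; illustrative only.
PROMPT = ("Running this command will totally PURGE database data from disk. "
          "Do you really want to do it? (Yes/No)")

def confirm_flush(read_input=input):
    # Ask for confirmation before flushing the bucket.
    answer = read_input(PROMPT)
    return answer.strip().lower() in ("yes", "y")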





[MB-11485] pending... statuses scattered on the UI(Pending Rebalance tab) Created: 19/Jun/14  Updated: 15/Sep/14  Resolved: 18/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-837

Attachments: PNG File pending_rebalance.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
please, see screenshot

 Comments   
Comment by Anil Kumar [ 19/Jun/14 ]
Triage - June 19 2014 Alk, Parag, Anil
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Anil, Wayne, Parag, Tony .. July 17th
Comment by Pavel Blagodov [ 18/Jul/14 ]
This was fixed somewhere before.
Comment by Pavel Blagodov [ 18/Jul/14 ]
Please provide me with an environment description if this is not fixed.
Comment by Aleksey Kondratenko [ 18/Jul/14 ]
Assuming it's done. Please reopen with more info as described above.




[MB-10686] rebalance hangs in many tests (~3.0.0-523 build) Created: 29/Mar/14  Updated: 15/Sep/14  Resolved: 31/Mar/14

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Seen in many jobs.

in builds 3.0.0-523-rel, 3.0.0-526-rel

http://qa.hq.northscale.net/job/centos_x64--03_01--failover_tests-P0/20/console
http://qa.hq.northscale.net/view/All/job/ubuntu_x64--30_02--swap_rebalance-P1/14/consoleFull





 Comments   
Comment by Andrei Baranouski [ 29/Mar/14 ]
logs from http://qa.hq.northscale.net/view/All/job/ubuntu_x64--30_02--swap_rebalance-P1/14/consoleFull with vbuckets=16


https://s3.amazonaws.com/bugdb/jira/MB-10686/42a396cc/10.3.121.93-3292014-353-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/42a396cc/10.3.121.94-3292014-354-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/42a396cc/10.3.121.95-3292014-355-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/42a396cc/10.3.121.96-3292014-355-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/42a396cc/10.3.121.97-3292014-356-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/42a396cc/10.3.121.98-3292014-354-diag.zip

Comment by Aleksey Kondratenko [ 29/Mar/14 ]
Unfortunately the logs are quite empty. I saw you filed a different bug that explains it.

Given that cbcollectinfo doesn't work, in order to diagnose this I need /diag output from at least one node, preferably the master.
Comment by Andrei Baranouski [ 30/Mar/14 ]
build 3.0.0-527-rel

http://qa.hq.northscale.net/job/centos_x86--07_01--swap_rebalance_tests-P0/13/consoleFull

https://s3.amazonaws.com/bugdb/jira/MB-10686/74d75d32/10.3.2.145-3302014-45-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/74d75d32/10.3.2.152-3302014-47-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/74d75d32/10.3.2.146-3302014-48-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/74d75d32/10.3.2.148-3302014-49-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/74d75d32/10.3.2.149-3302014-410-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10686/74d75d32/10.3.2.147-3302014-411-diag.zip

[2014-03-30 03:57:45,228] - [rest_client:986] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.2.146%2Cns_1%4010.3.2.147%2Cns_1%4010.3.2.149&user=Administrator&knownNodes=ns_1%4010.3.2.146%2Cns_1%4010.3.2.147%2Cns_1%4010.3.2.149%2Cns_1%4010.3.2.148%2Cns_1%4010.3.2.145%2Cns_1%4010.3.2.152
[2014-03-30 03:57:45,237] - [rest_client:990] INFO - rebalance operation started
[2014-03-30 03:57:45,241] - [rest_client:1091] INFO - rebalance percentage : 0 %
..
[2014-03-30 04:00:15,699] - [rest_client:1091] INFO - rebalance percentage : 47.9123173278 %
[2014-03-30 04:00:20,713] - [rest_client:1091] INFO - rebalance percentage : 49.0953375087 %
[2014-03-30 04:00:25,723] - [rest_client:1091] INFO - rebalance percentage : 50 %
...
[2014-03-30 04:04:41,261] - [rest_client:1091] INFO - rebalance percentage : 50 %
[2014-03-30 04:04:41,262] - [rest_client:1034] ERROR - apparently rebalance progress code in infinite loop: 50
Comment by Aleksey Kondratenko [ 30/Mar/14 ]
BTW it would be much, much better if you could grab diags/collectinfos for me while the rebalance is stuck.

It would be nice if you could automate this too, i.e. so that any rebalance-stuck error automatically comes with the best possible diagnostics.
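
A rough sketch of what such automation could look like, assuming the /pools/default/rebalanceProgress REST endpoint that the test's rest_client already polls and a cbcollect_info binary on the PATH; the stall detection is deliberately simplified:

import subprocess
import time
import requests  # assumed available; any HTTP client would do

BASE = "http://127.0.0.1:8091"
AUTH = ("Administrator", "password")

def rebalance_progress():
    # Poll the same progress endpoint the test framework uses.
    r = requests.get(BASE + "/pools/default/rebalanceProgress", auth=AUTH)
    r.raise_for_status()
    return r.json()

def watch_and_collect(stall_seconds=300, poll=10):
    # Collect diagnostics automatically if reported progress stops changing.
    last, since = None, time.time()
    while True:
        progress = rebalance_progress()
        if progress != last:
            last, since = progress, time.time()
        elif time.time() - since > stall_seconds:
            # cbcollect_info writes a zip with logs and /diag output for this node.
            subprocess.check_call(["cbcollect_info", "stuck-rebalance.zip"])
            return
        time.sleep(poll)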
Comment by Andrei Baranouski [ 31/Mar/14 ]
Hi Alk)

http://qa.hq.northscale.net/job/centos_x86--07_01--swap_rebalance_tests-P0/14/console

Please see the logs from Jenkins (with parameters get-logs=True, stop-on-failure=True): http://qa.hq.northscale.net/job/centos_x86--07_01--swap_rebalance_tests-P0/14/artifact/*zip*/archive.zip
Comment by Aleksey Kondratenko [ 31/Mar/14 ]
Andrei, sorry, but the latest archive has diags that are not helpful (no stuck rebalance in progress).

Or was I supposed to look elsewhere?

Feel free to ping me at gtalk.
Comment by Andrei Baranouski [ 31/Mar/14 ]
Apparently Jenkins published the wrong artifacts.

should be:
https://s3.amazonaws.com/bugdb/jira/MB-10686/60dc1ba8/73e45884-827c-4e72-9298-ef2c23f24b66-10.3.2.145-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-10686/60dc1ba8/73e45884-827c-4e72-9298-ef2c23f24b66-10.3.2.146-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-10686/60dc1ba8/73e45884-827c-4e72-9298-ef2c23f24b66-10.3.2.147-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-10686/60dc1ba8/73e45884-827c-4e72-9298-ef2c23f24b66-10.3.2.148-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-10686/60dc1ba8/73e45884-827c-4e72-9298-ef2c23f24b66-10.3.2.149-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-10686/60dc1ba8/73e45884-827c-4e72-9298-ef2c23f24b66-10.3.2.152-diag.txt.gz
Comment by Aleksey Kondratenko [ 31/Mar/14 ]
My bad. We have had the fix since Friday, but I misunderstood it and therefore it wasn't merged soon enough.

Now it's in: http://review.couchbase.org/35056

And in the diags I see the exact same place causing the stuck rebalance that Artem's commit fixes.




[MB-10594] /pools/FAKE/ returns "fake" data Created: 24/Mar/14  Updated: 15/Sep/14  Resolved: 31/Mar/14

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: build 3.0.0-441

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
The cluster does not have a FAKE pool, yet the endpoint returns data:

curl http://Administrator:password@localhost:8091/pools/FAKE/
{"storageTotals":{"ram":{"total":16706072576,"quotaTotal":2097152000,"quotaUsed":0,"used":15074934784,"usedByData":0,"quotaUsedPerNode":0,"quotaTotalPerNode":2097152000},"hdd":{"total":475455516672,"quotaTotal":475455516672,"used":123618434334,"usedByData":0,"free":351837082338}},"serverGroupsUri":"/pools/default/serverGroups?v=52184775","name":"FAKE","alerts":[],"alertsSilenceURL":"/controller/resetAlerts?token=0&uuid=6cbccc29b238a9b7876ec9fc21ae5bf3","nodes":[{"systemStats":{"cpu_utilization_rate":9.183673469387756,"swap_total":17054035968,"swap_used":0,"mem_total":16706072576,"mem_free":10101600256},"interestingStats":{},"uptime":"31805","memoryTotal":16706072576,"memoryFree":10101600256,"mcdMemoryReserved":12745,"mcdMemoryAllocated":12745,"couchApiBase":"http://127.0.0.1:8092/","otpCookie":"nnjyjgyqenvpgvtj","clusterMembership":"active","recoveryType":"none","status":"healthy","otpNode":"ns_1@127.0.0.1","thisNode":true,"hostname":"127.0.0.1:8091","clusterCompatibility":196608,"version":"3.0.0-441-rel-enterprise","os":"x86_64-unknown-linux-gnu","ports":{"sslProxy":11214,"httpsMgmt":18091,"httpsCAPI":18092,"sslDirect":11207,"proxy":11211,"direct":11210}}],"buckets":{"uri":"/pools/FAKE/buckets?v=55449572&uuid=6cbccc29b238a9b7876ec9fc21ae5bf3","terseBucketsBase":"/pools/default/b/","terseStreamingBucketsBase":"/pools/default/bs/"},"remoteClusters":{"uri":"/pools/default/remoteClusters?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3","validateURI":"/pools/default/remoteClusters?just_validate=1"},"controllers":{"addNode":{"uri":"/controller/addNode?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3"},"rebalance":{"uri":"/controller/rebalance?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3","requireDeltaRecoveryURI":"/controller/rebalance?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3&requireDeltaRecovery=true"},"failOver":{"uri":"/controller/failOver?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3"},"startGracefulFailover":{"uri":"/controller/startGracefulFailover?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3"},"reAddNode":{"uri":"/controller/reAddNode?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3"},"ejectNode":{"uri":"/controller/ejectNode?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3"},"setRecoveryType":{"uri":"/controller/setRecoveryType?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3"},"setAutoCompaction":{"uri":"/controller/setAutoCompaction?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3","validateURI":"/controller/setAutoCompaction?just_validate=1"},"replication":{"createURI":"/controller/createReplication?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3","validateURI":"/controller/createReplication?just_validate=1"},"setFastWarmup":{"uri":"/controller/setFastWarmup?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3","validateURI":"/controller/setFastWarmup?just_validate=1"}},"rebalanceStatus":"none","rebalanceProgressUri":"/pools/FAKE/rebalanceProgress","stopRebalanceUri":"/controller/stopRebalance?uuid=6cbccc29b238a9b7876ec9fc21ae5bf3","nodeStatusesUri":"/nodeStatuses","maxBucketCount":10,"autoCompactionSettings":{"parallelDBAndViewCompaction":false,"databaseFragmentationThreshold":{"percentage":30,"size":"undefined"},"viewFragmentationThreshold":{"percentage":30,"size":"undefined"}},"fastWarmupSettings":{"fastWarmupEnabled":true,"minMemoryThreshold":10,"minItemsThreshold":10},"tasks":{"uri":"/pools/FAKE/tasks?v=133172395"},"counte

 Comments   
Comment by Aleksey Kondratenko [ 25/Mar/14 ]
I'm pretty sure I've seen a duplicate of this somewhere.

We should simply return 404 for any pool other than default.
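
A quick regression check for the suggested behaviour (a sketch using Python's requests library; host and credentials are placeholders):

import requests

BASE = "http://localhost:8091"
AUTH = ("Administrator", "password")

def test_unknown_pool_returns_404():
    # The default pool must keep working...
    assert requests.get(BASE + "/pools/default/", auth=AUTH).status_code == 200
    # ...while any other pool name should be rejected rather than echoed back.
    assert requests.get(BASE + "/pools/FAKE/", auth=AUTH).status_code == 404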
Comment by Artem Stemkovski [ 31/Mar/14 ]
http://review.couchbase.org/35036
Comment by Andrei Baranouski [ 15/Sep/14 ]
3.0.0-1208




[MB-12142] Rebalance Exit due to Bad Replicas Error has no support documentation Created: 05/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Parag Agarwal Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Rebalance exits with a Bad Replicas error, which can be caused by ns_server or couchbase-bucket. In such situations, retrying the rebalance fails, and manual intervention is needed to diagnose the problem. For the support team we need to provide documentation as part of our release notes. Please define a process for this and then reassign the bug to Ruth so it can be added to the release notes.

 Comments   
Comment by Chiyoung Seo [ 12/Sep/14 ]
Mike,

Please provide more details on bad replica issues in DCP and assign this back to the doc team.
Comment by Mike Wiederhold [ 12/Sep/14 ]
Bad replicas is an error message that means replication streams could not be created. There may be many reasons for this to happen. One reason is that some of the vbucket sequence numbers maintained internally in Couchbase are invalid. If this happens you will see a message in the memcached logs that looks something like this.

(DCP Producer) some_dcp_stream_name (vb 0) Stream request failed because the snap start seqno (100) <= start seqno (101) <= snap end seqno (100) is required

In order for a DCP producer to accept a request for a DCP stream, the following must be true:

snapshot start seqno <= start seqno <= snapshot end seqno

If the above condition is not true for a stream request then a customer should contact support so that we can resolve the issue using a script to "reset" the sequence numbers. I can provide this script at a later time, but it is worth noting that we do not expect this scenario to happen and have resolved all bugs we have seen related to this error.
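
In other words, a stream request is only accepted when its start seqno falls inside the snapshot it claims to resume from; a small illustrative check (not the ep-engine implementation):

def stream_request_valid(snap_start_seqno, start_seqno, snap_end_seqno):
    # A DCP producer accepts a stream request only if
    # snapshot start seqno <= start seqno <= snapshot end seqno.
    return snap_start_seqno <= start_seqno <= snap_end_seqno

# The failing example from the log above: snap start 100, start 101, snap end 100.
assert stream_request_valid(100, 101, 100) is False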
Comment by Ruth Harris [ 12/Sep/14 ]
Put it into the release notes (not in beta but for GA) for Known Issues MB-12142.
Is this the correct MB issue?




[MB-12170] Memory usage did not go down after flush Created: 10/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Wayne Siu Assignee: Gokul Krishnan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: [info] OS Name : Microsoft Windows Server 2008 R2 Enterprise
[info] OS Version : 6.1.7601 Service Pack 1 Build 7601
[info] HW Platform : PowerEdge M420
[info] CB Version : 2.5.0-1059-rel-enterprise
[info] CB Uptime : 31 days, 10 hours, 3 minutes, 51 seconds
[info] Architecture : x64-based PC
[ok] Installed CPUs : 16
[ok] Installed RAM : 98259 MB
[warn] Server Quota : 81.42% of total RAM. Max recommended is 80.00%
        (Quota: 80000 MB, Total RAM: 98259 MB)
[ok] Erlang VM vsize : 546 MB
[ok] Memcached vsize : 142 MB
[ok] Swap used : 0.00%
[info] Erlang VM scheduler : swt low is not set

Issue Links:
Relates to
relates to MB-9992 Memory is not released after 'flush' Closed
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Original problem was reported by our customer.

Steps to reproduce in their setup:
- Set up a 4-node cluster (probably does not matter) with a 3 GB bucket and a replica count of 1.

- The program writes 10 MB binary objects from 3 threads in parallel, 50 items in each thread.
Run the program (sometimes it crashes, I do not know the reason); simply run it again.
At the end of the run, there is a difference of 500 MB between ep_kv_size and the sum of vb_active_itm_memory and vb_replica_itm_memory (this might depend a lot on the network speed; I am using just a 100 Mbit connection to the server, while production has a faster network of course).
- Do the flush; ep_kv_size keeps the size of the difference even though the bucket is empty.
- Repeat this. On each run, the resident items percentage will go down.
- On the fourth or fifth run, it will throw a hard memory error after inserting only a part of the 150 items.
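
A rough Python equivalent of the loader described above, with the client object left abstract (it is assumed to expose a set(key, value) call, as the SDKs of that era did):

import threading

ITEM_SIZE = 10 * 1024 * 1024       # 10 MB binary payload
ITEMS_PER_THREAD = 50
THREADS = 3

def make_loader(client, thread_id):
    def load():
        payload = b"\x00" * ITEM_SIZE               # all-zero binary blob
        for i in range(ITEMS_PER_THREAD):
            client.set("blob_%d_%d" % (thread_id, i), payload)
    return load

def run_load(client):
    threads = [threading.Thread(target=make_loader(client, t)) for t in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # After the run, compare ep_kv_size against
    # vb_active_itm_memory + vb_replica_itm_memory, then flush and repeat.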




 Comments   
Comment by Wayne Siu [ 10/Sep/14 ]
Raju,
Can you please assign?
Comment by Raju Suravarjjala [ 10/Sep/14 ]
Tony, can you see if you can reproduce this bug? Please note it is 2.5.1 Windows 64bit
Comment by Anil Kumar [ 10/Sep/14 ]
Just an FYI: we previously opened a similar issue on CentOS, but it was resolved as cannot-reproduce.
Comment by Ian McCloy [ 11/Sep/14 ]
It's 2.5.0 not 2.5.1 on Windows 2008 64bit
Comment by Thuan Nguyen [ 11/Sep/14 ]
Followed the instructions from the description:
Steps to reproduce in their setup:
- Setup 4 node cluster (probably does not matter) bucket with 3GB, Replication of 1

- The program write 10MB binary objects from 3 threads parallely, 50 items in each thread.
Run the program (sometimes it crashes, I do not know the reason), simply run it again.
At the end of the run, there is a difference of 500 MB in ep_kv_size to the sum of vb_active_itm_memory and vb_replica_itm_memory (this might depend much on the network speed, I am using just a 100Mbit connection to the server, on production we have a faster network of course)
- Do the flush, ep_kv_size has the size of the difference even though the bucket is empty.
- Repeat this. On each run, the resident items percentage will go down.
- On the fourth or fifth run, it will throw an hard memory error, after insert only a part of the 150 items.


I could not reproduce this bug after 6 flushes.
After each flush, mem use on both active and replica went down to zero.
Comment by Thuan Nguyen [ 11/Sep/14 ]
Using our loader, I could not reproduce this bug. I will use customer loader to test again.
Comment by Raju Suravarjjala [ 12/Sep/14 ]
Gokul: As we discussed can you folks try to reproduce this bug?




[MB-12019] XDCR@next release - Replication Manager #1: barebone Created: 19/Aug/14  Updated: 12/Sep/14

Status: In Progress
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: techdebt-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: sprint1_xdcr
Remaining Estimate: 32h
Time Spent: Not Specified
Original Estimate: 32h

Epic Link: XDCR next release

 Description   
build on top of generic FeedManager with XDCR specifics
1. interface with Distributed Metadata Service
2. interface with NS-server




[MB-12181] XDCR@next release - rethink XmemNozzle's configuration parameters Created: 12/Sep/14  Updated: 12/Sep/14  Resolved: 12/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: feature-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Xiaomei Zhang
Resolution: Done Votes: 0
Labels: sprint1_xdcr
Remaining Estimate: 8h
Time Spent: Not Specified
Original Estimate: 8h

Epic Link: XDCR next release

 Description   
Rethink XmemNozzle's configuration parameters. Some of them should be construction-time parameters, while others are runtime parameters.


 Comments   
Comment by Xiaomei Zhang [ 12/Sep/14 ]
https://github.com/Xiaomei-Zhang/couchbase_goxdcr_impl/commit/44921e06e141f0c9df9cfc4ab43d106643e9b766
https://github.com/Xiaomei-Zhang/couchbase_goxdcr_impl/commit/80a8a059201b9a61bbd1784abef96859670ac233




[MB-12184] Enable logging to a remote server Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Minor
Reporter: James Mauss Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
It would be nice to be able to configure Couchbase Server to log events to a remote syslog-ng (or similar) server.




[MB-12020] XDCR@next release - REST Server Created: 19/Aug/14  Updated: 12/Sep/14

Status: In Progress
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: techdebt-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Yu Sui
Resolution: Unresolved Votes: 0
Labels: sprint1_xdcr
Remaining Estimate: 32h
Time Spent: Not Specified
Original Estimate: 32h

Epic Link: XDCR next release

 Description   
Build on top of the admin port:
1. request/response message format defined in protobuf
2. handlers for requests




[MB-12183] View Query Thruput regression compared with previous and 2.5.1 builds Created: 12/Sep/14  Updated: 12/Sep/14  Resolved: 12/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thomas Anderson Assignee: Harsha Havanur
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 4xnode cluster; 2xSSD

Issue Links:
Duplicate
duplicates MB-11917 One node slow probably due to the Erl... Open
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/leto/597/artifact/172.23.100.29.zip
http://ci.sc.couchbase.com/job/leto/597/artifact/172.23.100.30.zip
http://ci.sc.couchbase.com/job/leto/597/artifact/172.23.100.31.zip
http://ci.sc.couchbase.com/job/leto/597/artifact/172.23.100.32.zip
Is this a Regression?: Yes

 Description   
query thruput, 1 Bucket, 20Mx2KB, nonDGM, 4x1 views, 500 mutations/sec/node.
performance on 2.5.1 - 2185; on 3.0.0-1205 (RC2) 1599; on 3.0.0-1208 (RC3) 1635; on 3.0.0-1209 (RC4) 331.
92% regression with 2.5.1, 72% regression with 3.0.0-1208 (RC3)

 Comments   
Comment by Sriram Melkote [ 12/Sep/14 ]
Sarath looked at it. Data points:

- First run was fine, second run was slow
http://showfast.sc.couchbase.com/#/runs/query_thr_20M_leto_ssd/3.0.0-1209

- CPU utilization in second run was much less in on node 31, indicative of scheduler collapse
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1209_0fb_access

So this is a duplicate of MB-11917
Comment by Thomas Anderson [ 12/Sep/14 ]
Rebooted the cluster and reran with the same parameters. 3.0.0-1209 now shows the same performance as previous 3.0 builds. It is still a 25% regression relative to 2.5.1, but this is now a duplicate of MB-11917 (assigned to 3.0.1): a sporadic Erlang scheduler slowdown on one node in the cluster causing various performance and functional issues.
 
Comment by Thomas Anderson [ 12/Sep/14 ]
Closed as a duplicate of the planned 3.0.1 fix for the Erlang scheduler collapse, MB-11917.




[MB-11428] JSON versions and encodings supported by Couchbase Server need to be defined Created: 16/Jun/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1, 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Matt Ingenthron Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: documentation
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
While JSON is a standard, there are multiple Unicode encodings, and the definition of how to interact with these encodings has changed over time. Also, our dependencies (mochiweb, the view engine's JSON parser) may not actually conform to these standards.

Couchbase Server needs to define and document what it supports with respect to JSON.

See:
http://tools.ietf.org/html/draft-ietf-json-rfc4627bis-10 and
http://tools.ietf.org/html/rfc4627
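
As a concrete example of the ambiguity: RFC 4627 allowed UTF-16/UTF-32 with encoding detection from the first bytes, while most parsers (and the newer draft) effectively assume UTF-8. A small sketch of the strict check a test suite could apply (illustrative only; not what mochiweb or the view engine currently do):

import json

def is_utf8_json(data):
    # Accept a document only if it is valid UTF-8 and parses as JSON.
    try:
        json.loads(data.decode("utf-8", errors="strict"))
        return True
    except (UnicodeDecodeError, ValueError):
        return False

assert is_utf8_json(b'{"name": "caf\xc3\xa9"}')          # UTF-8 encoded JSON
assert not is_utf8_json('{"a":1}'.encode("utf-16"))      # UTF-16 is rejected by this check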


 Comments   
Comment by Cihan Biyikoglu [ 16/Jun/14 ]
making this a documentation item - we should make this public.
Comment by Chiyoung Seo [ 24/Jun/14 ]
Moving this to post 3.0 as the datatype support is not supported in 3.0
Comment by Matt Ingenthron [ 11/Sep/14 ]
This isn't really datatype related, though it's not couchbase-bucket any more either. The view engine and other parts of the server use JSON; what do they expect as input? It's also sort of documentation, but not strictly documentation, since it should either be defined and validated, or determined based on what our dependencies actually do and then verified. In either case, there's probably research and unit-test writing involved, I think.
Comment by Chiyoung Seo [ 12/Sep/14 ]
Assigning to the PM team to figure out the appropriate steps to be taken.




[MB-12178] Fix race condition in checkpoint persistence command Created: 12/Sep/14  Updated: 12/Sep/14  Resolved: 12/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Gokul Krishnan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Thread 11 (Thread 0x43fcd940 (LWP 6218)):

#0 0x00000032e620d524 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000032e6208e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000032e6208cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00002aaaaaf345ca in Mutex::acquire (this=0x1e79ac48) at src/mutex.cc:79
#4 0x00002aaaaaf595bb in lock (this=0x1e79a880, chkid=7, cookie=0x1d396580) at src/locks.hh:48
#5 LockHolder (this=0x1e79a880, chkid=7, cookie=0x1d396580) at src/locks.hh:26
#6 VBucket::addHighPriorityVBEntry (this=0x1e79a880, chkid=7, cookie=0x1d396580) at src/vbucket.cc:234
#7 0x00002aaaaaf1b580 in EventuallyPersistentEngine::handleCheckpointCmds (this=0x1d494a00, cookie=0x1d396580, req=<value optimized out>,
    response=0x40a390 <binary_response_handler>) at src/ep_engine.cc:3795
#8 0x00002aaaaaf20228 in processUnknownCommand (h=0x1d494a00, cookie=0x1d396580, request=0x1d3d6800, response=0x40a390 <binary_response_handler>) at src/ep_engine.cc:949
#9 0x00002aaaaaf2117c in EvpUnknownCommand (handle=<value optimized out>, cookie=0x1d396580, request=0x1d3d6800, response=0x40a390 <binary_response_handler>)
    at src/ep_engine.cc:1050
#10 0x00002aaaaacc4de4 in bucket_unknown_command (handle=<value optimized out>, cookie=0x1d396580, request=0x1d3d6800, response=0x40a390 <binary_response_handler>)
    at bucket_engine.c:2499
#11 0x00000000004122f7 in process_bin_unknown_packet (c=0x1d396580) at daemon/memcached.c:2911
#12 process_bin_packet (c=0x1d396580) at daemon/memcached.c:3238
#13 complete_nread_binary (c=0x1d396580) at daemon/memcached.c:3805
#14 complete_nread (c=0x1d396580) at daemon/memcached.c:3887
#15 conn_nread (c=0x1d396580) at daemon/memcached.c:5744
#16 0x0000000000406355 in event_handler (fd=<value optimized out>, which=<value optimized out>, arg=0x1d396580) at daemon/memcached.c:6012
#17 0x00002b52b162df3c in event_process_active_single_queue (base=0x1d46ec80, flags=<value optimized out>) at event.c:1308
#18 event_process_active (base=0x1d46ec80, flags=<value optimized out>) at event.c:1375
#19 event_base_loop (base=0x1d46ec80, flags=<value optimized out>) at event.c:1572
#20 0x0000000000414e34 in worker_libevent (arg=<value optimized out>) at daemon/thread.c:301
#21 0x00000032e620673d in start_thread () from /lib64/libpthread.so.0
#22 0x00000032e56d44bd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x4a3d7940 (LWP 6377)):

#0 0x00000032e620d524 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000032e6208e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000032e6208cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000415e16 in notify_io_complete (cookie=<value optimized out>, status=ENGINE_SUCCESS) at daemon/thread.c:485
#4 0x00002aaaaaf5a857 in notifyIOComplete (this=0x1e79a880, e=..., chkid=7) at src/ep_engine.h:423
#5 VBucket::notifyCheckpointPersisted (this=0x1e79a880, e=..., chkid=7) at src/vbucket.cc:250
#6 0x00002aaaaaf038fd in EventuallyPersistentStore::flushVBucket (this=0x1d77e000, vbid=109) at src/ep.cc:2033
#7 0x00002aaaaaf2c9e9 in doFlush (this=0x18c70dc0, tid=1046) at src/flusher.cc:222
#8 Flusher::step (this=0x18c70dc0, tid=1046) at src/flusher.cc:152
#9 0x00002aaaaaf36e74 in ExecutorThread::run (this=0x1d4c28c0) at src/scheduler.cc:159
#10 0x00002aaaaaf3746d in launch_executor_thread (arg=<value optimized out>) at src/scheduler.cc:36
#11 0x00000032e620673d in start_thread () from /lib64/libpthread.so.0
#12 0x00000032e56d44bd in clone () from /lib64/libc.so.6
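
The two stacks read like a lock-order inversion: the memcached worker (thread 11) holds its connection/thread lock and waits for the vbucket's high-priority-request mutex, while the flusher (thread 7) holds that vbucket lock inside notifyCheckpointPersisted and waits on the worker thread lock via notify_io_complete. A common remedy, shown here only as an illustrative Python sketch and not as the actual fix in the linked review, is to collect the cookies to wake while holding the lock and deliver the notifications after releasing it:

import threading

class VBucket:
    def __init__(self):
        self._lock = threading.Lock()        # protects the high-priority entry list
        self._waiters = []                   # (checkpoint_id, cookie) pairs

    def add_high_priority_entry(self, chk_id, cookie):
        with self._lock:
            self._waiters.append((chk_id, cookie))

    def notify_checkpoint_persisted(self, persisted_id, notify_io_complete):
        # Gather the cookies to wake while holding the lock...
        with self._lock:
            ready = [c for chk, c in self._waiters if chk <= persisted_id]
            self._waiters = [(chk, c) for chk, c in self._waiters if chk > persisted_id]
        # ...but call back into the front end only after the lock is released,
        # so this thread never holds the vbucket lock while taking the conn lock.
        for cookie in ready:
            notify_io_complete(cookie)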


 Comments   
Comment by Mike Wiederhold [ 12/Sep/14 ]
http://review.couchbase.org/#/c/41363/
Comment by Gokul Krishnan [ 12/Sep/14 ]
Thanks Mike!




[MB-10711] If in case of legacy client, check if document is of JSON type before setting datatype Created: 01/Apr/14  Updated: 12/Sep/14  Resolved: 12/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Task Priority: Critical
Reporter: Abhinav Dangeti Assignee: Abhinav Dangeti
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Abhinav Dangeti [ 01/Apr/14 ]
memcached: http://review.couchbase.org/#/c/35169/
ep-engine: http://review.couchbase.org/#/c/35170/
Comment by Abhinav Dangeti [ 09/Apr/14 ]
Merged.
Comment by Matt Ingenthron [ 16/Jun/14 ]
I had reason to look at this earlier and I noticed that while the C code imported supports UTF-16 and UTF-32, we've used a function that just checks UTF-8. Is this an issue?
Comment by Volker Mische [ 16/Jun/14 ]
I think it depends on whether we officially support JSON that isn't UTF-8 encoded. I strongly lean towards only supporting UTF-8 on the backend (I think the view engine currently also has some UTF-8-only checks, IIRC). The newest JSON RFC (RFC 7159 [1]), which added a lot of information about interoperability, also states that UTF-8 has the broadest support when it comes to JSON parsers.

[1]: http://rfc7159.net/
Comment by Chiyoung Seo [ 24/Jun/14 ]
Moving this to post 3.0 as the datatype is not supported in 3.0
Comment by Volker Mische [ 24/Jun/14 ]
I'm not quite following. This is already implemented and used by the view engine. I thought that only the compression is out, but the datatype in general is in. If it is not in I'd like to know why :)
Comment by Abhinav Dangeti [ 24/Jun/14 ]
Datatype will still be stored and propagated as per the original plan, but memcached's HELLO command has been disabled; therefore a user/client will not be able to set any datatype (other than 0x00), as in pre-3.0.
This was the last I heard from management. We will, however, still continue to do a JSON check on documents before setting the datatype in 3.0.




[MB-11482] remove or disable DataType Created: 19/Jun/14  Updated: 12/Sep/14  Resolved: 09/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Matt Ingenthron Assignee: Matt Ingenthron
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Based on open design questions and the fact that some components involved in this change are not feature complete, we have decided to disable the hello/datatype features. This affects multiple components and will update this issue based on a discussion of how to go about this.

 Comments   
Comment by Matt Ingenthron [ 02/Jul/14 ]
Datatype/hello has been disabled in 3.0 by not advertising the datatype feature even if the client requests HELLO.

Another task, MB-11623, has been opened to verify that this method of disabling is sufficient. See the details there. The return of this feature will be dependent on further definition of the feature from PM/Engineering.




[MB-11589] Sliding endseqno during initial index build or upr reading from disk snapshot results in longer stale=false query latency and index startup time Created: 28/Jun/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Sarath Lakshman Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks MB-11920 DCP based rebalance with views doesn'... Closed
Relates to
relates to MB-11919 3-5x increase in index size during re... Open
relates to MB-12081 Remove counting mutations introduced ... Resolved
relates to MB-11918 Latency of stale=update_after queries... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We have to fix this depending on the development cycles we have left for 3.0

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - July 17

Currently investigating we will decide depending on the scope of changes needed.
Comment by Anil Kumar [ 30/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Raising this issue to "Critical" this needs to be fixed by RC.
Comment by Sriram Melkote [ 31/Jul/14 ]
The issue is that we'll have to change the view dcp client to stream all 1024 vbuckets in parallel, or we'll need an enhancement in ep-engine to stop streaming at the point requested. Neither is a simple change - the reason it's in 3.0 is because Dipti had requested we try to optimize query performance. I'll leave it at Major as I don't want to commit to fixing this in RC and also, the product works with reasonable performance without this fix and so it's not a must have for RC.
Comment by Sriram Melkote [ 31/Jul/14 ]
Mike noted that even streaming all vbuckets in parallel (which was perhaps possible to do in 3.0) won't directly solve the issue as the backfills are scheduled one at a time. ep-engine could hold onto smaller snapshots but that's not something we can consider in 3.0 - so net effect is that we'll have to revisit this in 3.0.1 to design a proper solution.
Comment by Sriram Melkote [ 12/Aug/14 ]
Bringing back to 3.0 as this is the root cause of MB-11920 and MB-11918
Comment by Anil Kumar [ 13/Aug/14 ]
Deferring this to 3.0.1 since making this out of scope for 3.0.
Comment by Sarath Lakshman [ 05/Sep/14 ]
We need to file an EP-Engine dependency ticket to implement parallel streaming support without causing sliding endseq during ondisk snapshot backfill.




[MB-11706] Graceful failover gets to 55% then hangs Created: 11/Jul/14  Updated: 12/Sep/14  Resolved: 17/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, view-engine
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Dave Rigby Assignee: Ketaki Gangal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS, CB 3.0.0 build 918

Attachments: File allstats     PNG File Screen Shot 2014-07-11 at 15.41.06.png    
Issue Links:
Duplicate
duplicates MB-11505 upr client: Implement better error ha... Resolved
Relates to
relates to MB-11755 vbucket-seqno stats requests getting ... Resolved
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
Created 3-node cluster (1GB RAM, 2 CPUs), with 4 buckets. Put one bucket under modest cbworkloadgen workload (10,000 items, 13k op/s).

Selected "Graceful Failover" for one of the nodes. Failover started, got to ~55% then just paused (see screenshot). Waited for maybe 30mins but no progress.

Logs uploaded (via "Collect Information" :)

https://s3.amazonaws.com/customers.couchbase.com/daver_graceful_failover_hung/collectinfo-2014-07-11T144818-ns_1%40192.168.73.101.zip
https://s3.amazonaws.com/customers.couchbase.com/daver_graceful_failover_hung/collectinfo-2014-07-11T144818-ns_1%40192.168.73.102.zip
https://s3.amazonaws.com/customers.couchbase.com/daver_graceful_failover_hung/collectinfo-2014-07-11T144818-ns_1%40192.168.73.103.zip






 Comments   
Comment by Aleksey Kondratenko [ 11/Jul/14 ]
We're waiting for views to become up-to-date.
Comment by Sriram Melkote [ 11/Jul/14 ]
Seems pretty bad bug. Nimish, please help
Comment by Nimish Gupta [ 14/Jul/14 ]
I observed the following issues in the log:

1. We were getting a badarg error while trying to stop the cleaner. It was a race condition, but it will not affect the rebalance and it should not be the root cause of the hanging failover.
    I have fixed the issue, and the code review is in progress (http://review.couchbase.org/#/c/39348/).

2. It looks like the stats call to ep-engine from the UPR client is timing out after 5 seconds:

[couchdb:info,2014-07-11T23:51:57.061,ns_1@192.168.73.101:<0.1751.0>:couch_log:info:39]Set view `beer-sample`, main (prod) group `_design/beer`, signature `c122e6f8f575b247369afbbba52a785c`, terminating with reason: {timeout,
                                                                                                                                  {gen_server,
                                                                                                                                   call,
                                                                                                                                   [<0.1759.0>,
                                                                                                                                    {get_stats,
                                                                                                                                     <<"vbucket-seqno">>,

It would be great if the ep-engine team could look into this timeout issue from the ep-engine side.
Comment by Sriram Melkote [ 15/Jul/14 ]
Dave, this is alleviated as a part of MB-11505. I'm going to keep the bug open but reassign to Ketaki, so we can collect stats timings (ep-engine) and mctimings (memcached).

Ketaki, if you see the message "vbucket-seqno stats timed out" in our log files in any test run, can you please collect ep-engine stats timings and mctimings and please attach to this bug?
Comment by Meenakshi Goel [ 15/Jul/14 ]
Observing these timed out errors in one of the test

[couchdb:error,2014-07-15T2:42:48.999,ns_1@10.3.5.90:<0.15918.0>:couch_log:error:42]upr client (<0.15936.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
[couchdb:error,2014-07-15T2:42:49.000,ns_1@10.3.5.90:<0.9539.0>:couch_log:error:42]upr client (<0.9558.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
[couchdb:error,2014-07-15T2:42:49.000,ns_1@10.3.5.90:<0.8567.0>:couch_log:error:42]upr client (<0.8582.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
[couchdb:error,2014-07-15T2:42:49.079,ns_1@10.3.5.90:<0.7712.0>:couch_log:error:42]upr client (<0.7730.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
Comment by Meenakshi Goel [ 15/Jul/14 ]
Stats:
https://s3.amazonaws.com/bugdb/jira/MB-11706/13f68e9c/stats.tar.gz
Comment by Sriram Melkote [ 15/Jul/14 ]
In attachment allstats, I can see:

10.3.5.90 - get_stats_cmd 1s - 2s : (100.00%) 2
10.3.5.91 - get_stats_cmd 1s - 2s : (100.00%) 3
10.3.5.92 - get_stats_cmd 1s - 2s : (100.00%) 2

I can't run mctimings on it because it says "server was built without timings support"

But I think we already see a problem, with 7 calls taking more than 1 second. Chiyoung/Mike: is it expected for STATS to take this long?
Comment by Sarath Lakshman [ 15/Jul/14 ]
I am preparing a toybuild with a change to use a separate upr connection for stats. We can try this test with the toybuild.
Comment by Sarath Lakshman [ 15/Jul/14 ]
Toybuild is ready
http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent64-3.0.0-toy-sarath-x86_64_3.0.0-707-toy.rpm
Comment by Sriram Melkote [ 15/Jul/14 ]
Meenakshi / Ketaki - can you please try this toy build?
Comment by Anil Kumar [ 15/Jul/14 ]
Triage - July 15th 2014
Comment by Meenakshi Goel [ 16/Jul/14 ]
Please note that I am currently unable to reproduce the "vbucket-seqno stats timed out" errors on the CentOS cluster after running the same test.
Will update after trying a few more runs.
Comment by Nimish Gupta [ 17/Jul/14 ]
I checked the logs and we still see the timeout errors, but we have not seen any problem with rebalance. So I feel that the timeout and rebalance issues are not related. We can open a minor issue for the timeout errors and close/monitor this issue until we see the rebalance issue again.
Comment by Sriram Melkote [ 17/Jul/14 ]
This is fixed, as MB-11505 converted the UPR stats timeout into a soft error, so graceful failover will now succeed despite the timeout. Meenakshi will open a separate issue to track the performance impact of the stats timeout and retry logic.
Comment by Sriram Melkote [ 17/Jul/14 ]
Ketaki - I assume graceful failover with views is covered. Please close the issue after verifying successful run of the same. If we don't test it, then please add this test case.
Comment by Ketaki Gangal [ 17/Jul/14 ]
There is some coverage for graceful failover and more for regular failover.
https://github.com/couchbase/testrunner/blob/master/conf/py-newfailover.conf#L46

I'll add some more tests to make sure we have more coverage in this area.
Comment by Ketaki Gangal [ 19/Aug/14 ]
added tests https://github.com/couchbase/testrunner/blob/master/conf/py-newfailover.conf#L37




[MB-11840] 3.0 (Beta): Views periodically take 2 orders of magnitude longer to complete Created: 29/Jul/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Task Priority: Major
Reporter: Daniel Owen Assignee: Sriram Melkote
Resolution: Unresolved Votes: 0
Labels: customer, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Single Node running version 3.0.0 enterprise edition (build-918). Running on VirtualBox, assigned 8 vCPUs and 8GB memory. (host has 24 cores, 128GB RAM).

Attachments: File backup.tgz     File curlscript.sh     PNG File output-251.png     PNG File output-3.0.png    
Issue Links:
Dependency

 Description   
Hi Alk,

I can demonstrate the behaviour of views periodically taking 2 orders of magnitude longer with 3.0.
(Similar to the issue we were investigating relating to Stats Archiver).

See output-3.0: the x-axis is just a count of view queries. The test ran for ~53 minutes and completed 315408 view queries (~100 per second). The y-axis is view response time (in seconds).

In general the response time is < 0.01 seconds. However, occasionally (9 out of 315408 queries) it takes > 0.1 seconds. This may be considered acceptable in the design of the server, but I wanted to get confirmation.

To replicate the test, run...

 while true; do ./curlscript.sh >> times2.txt 2>&1 ; done

I have provided curlscript.sh as an attached file.

The generated workload is test data from the same customer that hit the Stats Archiver issue.
Create a bucket named "oogway" and then cbtransfer the unpacked backup.tgz file (see attached).
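
The attached curlscript.sh is not reproduced here; a rough Python equivalent of the measurement loop might look like the following (design document, view name and query parameters are placeholders, since only the bucket name is given above):

import time
import requests  # assumed available; curl in a shell loop works just as well

VIEW_URL = ("http://localhost:8092/oogway/_design/DESIGN_DOC/_view/VIEW_NAME"
            "?limit=10&stale=update_after")   # placeholders for the real design doc/view

def time_queries(n=1000):
    slow = 0
    for _ in range(n):
        start = time.time()
        requests.get(VIEW_URL).raise_for_status()
        elapsed = time.time() - start
        if elapsed > 0.1:                      # the outliers discussed above
            slow += 1
        print("%.4f" % elapsed)
    return slow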

 Comments   
Comment by Aleksey Kondratenko [ 29/Jul/14 ]
What I'm supposed to do with that ?
Comment by Aleksey Kondratenko [ 29/Jul/14 ]
CC-ed some folks.
Comment by Sriram Melkote [ 29/Jul/14 ]
Daniel - can you please let me know what is plotted on X and Y axis, and the unit for them?
Comment by Daniel Owen [ 29/Jul/14 ]
Hi Sriram, I have updated the description to contain more information. I'm just currently running a similar experiment on 2.5.1 and will upload when I get the results.
Comment by Daniel Owen [ 29/Jul/14 ]
I have uploaded data for a similar experiment performed on 2.5.1 (build-1083).
Again over ~53 minutes, we performed a total of 308193 queries (~100 per second), and 15 out of 308193 took > 0.1 seconds to complete. In general the response time is < 0.01 seconds.

Note that given the large CPU entitlement, we don't see any regular peaks in view times due to the Stats Archiver (i.e. no regular spikes every 120 seconds); however, we are still seeing very large spikes in view query response times (apparently more frequently than in the 3.0 beta).
Comment by Daniel Owen [ 29/Jul/14 ]
I suspect the 2.5.1 results are worse than 3.0 because 2.5.1 uses Erlang R14B04 and therefore, as highlighted by Dave Rigby, may be impacted by bug OTP-11163.

See https://gist.github.com/chewbranca/07d9a6eed3da7b490b47#scheduler-collapse
Comment by Sriram Melkote [ 29/Jul/14 ]
A few points I'd like to note:

(a) There is no specified guarantee on the time a query will take to respond; 300 ms is not an unusual response time for the odd case.
(b) It appears not to be a regression, based on the 2.5 and 3.0 comparison graphs.
(c) The query layer is heavily in Erlang and we are already rewriting it, so I'm targeting this outside of 3.0.

I'm changing this back to a task as we need to investigate further to see if this behavior is indicative of an underlying bug before proceeding further.

EDIT: Removing comment about OTP-11163 not being a suspect because we're indeed seeing it in MB-11917




[MB-12103] [BUG BASH] - Couchbase view does not handle single quote in key when querying view. Created: 31/Aug/14  Updated: 12/Sep/14  Resolved: 01/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Fernando Assignee: Sriram Melkote
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 14.04

Installed Packages on Ubuntu 14.04

dpkg -l | grep couchbase
ii couchbase-server 3.0.0 amd64 Couchbase Server
ii libcouchbase-dev 2.4.1 amd64 library for the Couchbase protocol, development files
ii libcouchbase2-core 2.4.1 amd64 library for the Couchbase protocol, core files
ii libcouchbase2-libevent 2.4.1 amd64 library for the Couchbase protocol (libevent backend)


Rails Gemfile.lock

    couchbase (1.3.9)
      connection_pool (>= 1.0.0, <= 3.0.0)
      multi_json (~> 1.0)
      yaji (~> 0.3, >= 0.3.2)
    couchbase-model (0.5.4)
      activemodel
      couchbase (~> 1.3.3)

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Repro steps (not sure if this is a Ruby client/Couchbase Model bug or a Couchbase Server bug):

1.) Create a Couchbase Model with
Section.create(name: "Men's", category_id: "1")

2.) Query the view.

Section.by_name_and_category_id(:key =>["Men's","1"], :include_docs => true, stale:false).first

Expected:
You are returned the document you just created.

Actual:
You are returned nil.



The Model:

class Section < Couchbase::Model
  extend Couchbase::EscapeJavascript

  attribute :name
  attribute :category_id

  view :by_name_and_category_id


end

 Comments   
Comment by Fernando [ 31/Aug/14 ]
Forgot to attach the view function:

function(doc, meta) {
    if(doc.name && doc.category_id && doc.type=="section"){
        emit([doc.name, doc.category_id], null);
    }
}
Comment by Fernando [ 31/Aug/14 ]
Related Convo:

http://www.couchbase.com/communities/q-and-a/do-i-need-escape-keys-when-loading-view.

Is this a limitation of querying views? Are special characters not supported? If so, how do you index views with special characters?
Comment by Fernando [ 31/Aug/14 ]
I think I jumped the gun on this one. I'm escaping the params passed into the key, so the key was "Men\'s" instead of "Men's". In that case it makes sense for the view not to return anything.

I was trying to prevent the case where someone passes JavaScript into the view. I don't know what goes on behind the scenes in the view engine; I just didn't want to risk arbitrary JavaScript being passed in.
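
For completeness, a key containing a single quote needs no escaping at the view level; it only has to be valid JSON, URL-encoded on the wire. A hedged sketch of querying the view directly over the view REST API (the bucket and design document names are assumptions; the view name comes from the model above):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Bucket ("default") and design document ("section") names are assumptions here.
base = "http://localhost:8092/default/_design/section/_view/by_name_and_category_id"
query = urlencode({
    "key": json.dumps(["Men's", "1"]),   # plain JSON key; no backslash-escaping needed
    "stale": "false",
})
rows = json.load(urlopen("%s?%s" % (base, query)))["rows"]
print(rows)   # should contain the row emitted for ["Men's", "1"]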
Comment by Sriram Melkote [ 01/Sep/14 ]
Fernando, thanks for keeping an eye on this. Based on the last comments, I'll resolve the issue. If you notice anything amiss in future testing on this topic of escaping, please feel free to reopen the bug. Thanks.
Comment by Sriram Melkote [ 01/Sep/14 ]
Regarding JavaScript, the query parameters are not passed to JavaScript engine. So script execution should not be possible at server side via query parameters.




[MB-10206] Replication from 3.0 node to 2.x node does not uncompress documents (with compressed datatype) results in lost datatype information during failover. Created: 13/Feb/14  Updated: 12/Sep/14  Resolved: 12/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Venu Uppalapati Assignee: Abhinav Dangeti
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Replication from a 3.0 node to a 2.x node does not uncompress documents with the compressed datatype. This bug manifests as two user-visible discrepancies.

Steps to reproduce:
1) Setup 2.x node and add a 3.0 node to make a two node cluster.
2) Issue a SET for a compressed document(datatype 2) against the cluster such that the active copy goes to the 3.0 node i.e, doc key hashes to 3.0 node.
3) Issue a GET for this doc from a 2.x client. You will get the uncompressed document back.
4) Now failover the 3.0 node. Replica on 2.x will be promoted as active.
5) Issue a GET for the same doc from a 2.x client. You will get a compressed copy back. This is the first discrepancy.
6) Add back the 3.0 node and rebalance. Now 3.0 has the compressed doc but with datatype marked as raw binary(datatype 0).
7) Issue a GET for the same doc from a 2.x client. You will get a compressed copy back. This is the second discrepancy.
8) Basically, during a failover/recovery scenario in a mixed cluster we lose all datatype information for a document.

 Comments   
Comment by Cihan Biyikoglu [ 17/Mar/14 ]
seems like we are using a 3.0 feature during upgrade. we don't support that.
thanks
-cihan
Comment by Matt Ingenthron [ 28/May/14 ]
Agreed with Cihan's comment, but we need to define the contract here. Options:
1) Cluster does not respond to HELLO until all nodes are upgraded (&& downgrades are not supported anyway)
2) Client needs to HELLO to all nodes of the cluster and not use datatype JSON & Compression features unless all nodes report supporting it (&& downgrades are not supported anyway)
Comment by Cihan Biyikoglu [ 29/May/14 ]
sending this back to Abhinav after the discussion we had. If option #1 is cheap to implement we should enable that.
Comment by Matt Ingenthron [ 16/Jun/14 ]
Since it would be the client's responsibility to ensure that it either HELLOs or does not HELLO at all nodes, it would be the more expensive path you refer to.

Scenario: three nodes, one supports HELLO/datatype and two do not.
Expected behavior: Client issues a HELLO to all three. Finding that only one of the three supports HELLO, the client must drop the connections and re-establish connections not performing a HELLO operation.

The 'overhead' is small, but non-trivial and would need to be regularly done.

One other option is that all clients have this, but it's off by default and the user has to turn it on. I dislike this a lot since the best UI element is the one you don't have to use. It should just work out of the box in my opinion.
Comment by Chiyoung Seo [ 24/Jun/14 ]
Moving this to post 3.0 as the datatype is not supported in 3.0
Comment by Abhinav Dangeti [ 11/Sep/14 ]
Replication from 3.0 to 2.5 nodes is through TAP, and when through TAP we do inflate compressed documents now.
http://review.couchbase.org/#/c/35601/




[MB-12180] Modularize the DCP code Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We need to modularize the DCP code so that we can write unit tests, to ensure that we have fewer bugs and fewer regressions from future changes.




[MB-12179] Allow incremental pausable backfills Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Task Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   