[MB-12196] [Windows] When I run cbworkloadgen.exe, I see a Warning message Created: 15/Sep/14  Updated: 19/Sep/14  Resolved: 19/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7
Build 1299

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Install 3.0.1_1299 build
Go to the bin directory in the installation directory and run cbworkloadgen.exe
You will see the following warning:
WARNING:root:could not import snappy module. Compress/uncompress function will be skipped.

Expected behavior: The above warning should not appear
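
For context, cbworkloadgen is one of the Python-based tools in bin, and the warning is the classic optional-import pattern. An illustrative sketch of that pattern (not the actual cbworkloadgen/pump source; maybe_compress is a hypothetical helper):

import logging

try:
    import snappy  # python-snappy, optional compression support
    HAVE_SNAPPY = True
except ImportError:
    HAVE_SNAPPY = False
    logging.warning("could not import snappy module. "
                    "Compress/uncompress function will be skipped.")

def maybe_compress(data):
    """Compress with snappy when available, otherwise pass data through unchanged."""
    return snappy.compress(data) if HAVE_SNAPPY else data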


 Comments   
Comment by Bin Cui [ 19/Sep/14 ]
http://review.couchbase.org/#/c/41514/




[MB-12209] [windows] failed to offline upgrade from 2.5.x to 3.0.1-1299 Created: 18/Sep/14  Updated: 19/Sep/14  Resolved: 19/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows server 2008 r2 64-bit

Attachments: Zip Archive 12.11.10.145-9182014-1010-diag.zip     Zip Archive 12.11.10.145-9182014-922-diag.zip    
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Install couchbase server 2.5.1 on one node
Create default bucket
Load 1000 items to bucket
Offline upgrade from 2.5.1 to 3.0.1-1299
After the upgrade, the node is reset to its initial setup


 Comments   
Comment by Thuan Nguyen [ 18/Sep/14 ]
I got the same issue when doing an offline upgrade from 2.5.0 to 3.0.1-1299. Updated the title.
Comment by Thuan Nguyen [ 18/Sep/14 ]
cbcollectinfo of the node that failed to offline upgrade from 2.5.0 to 3.0.1-1299
Comment by Bin Cui [ 18/Sep/14 ]
http://review.couchbase.org/#/c/41473/




[MB-12019] XDCR@next release - Replication Manager #1: barebone Created: 19/Aug/14  Updated: 19/Sep/14  Resolved: 19/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: techdebt-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Xiaomei Zhang
Resolution: Done Votes: 0
Labels: sprint1_xdcr
Remaining Estimate: 32h
Time Spent: Not Specified
Original Estimate: 32h

Epic Link: XDCR next release

 Description   
Build on top of the generic FeedManager with XDCR specifics:
1. interface with Distributed Metadata Service
2. interface with NS-server




[MB-12138] {Windows - DCP}:: View Query fails with error 500 reason: error {"error":"error","reason":"{index_builder_exit,89,<<>>}"} Created: 05/Sep/14  Updated: 19/Sep/14  Resolved: 19/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Nimish Gupta
Resolution: Fixed Votes: 0
Labels: windows, windows-3.0-beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.1-1267, Windows 2012, 64 x, machine:: 172.23.105.112

Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-12138/172.23.105.112-952014-1511-diag.zip
Is this a Regression?: Yes

 Description   


1. Create 1 Node cluster
2. Create default bucket and add 100k items
3. Create views and query them

Seeing the following exception:

http://172.23.105.112:8092/default/_design/ddoc1/_view/default_view0?connectionTimeout=60000&full_set=true&limit=100000&stale=false error 500 reason: error {"error":"error","reason":"{index_builder_exit,89,<<>>}"}

We cannot run any view tests as a result
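
For reference, a minimal sketch (assuming the Python requests library) of the kind of stale=false query the test issues against the node in this report:

import requests

view_url = ("http://172.23.105.112:8092/default/_design/ddoc1"
            "/_view/default_view0")
params = {
    "full_set": "true",
    "limit": 100000,
    "stale": "false",          # force the index to be brought up to date before responding
    "connectionTimeout": 60000,
}
resp = requests.get(view_url, params=params, timeout=120)
print(resp.status_code)        # 500 in this report
print(resp.text)               # {"error":"error","reason":"{index_builder_exit,89,<<>>}"}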


 Comments   
Comment by Anil Kumar [ 16/Sep/14 ]
Nimish/Siri - Any update on this?
Comment by Meenakshi Goel [ 17/Sep/14 ]
Seeing similar issue in Views DGM test http://qa.hq.northscale.net/job/win_2008_x64--69_06_view_dgm_tests-P1/1/console
Test : view.createdeleteview.CreateDeleteViewTests.test_view_ops,ddoc_ops=update,test_with_view=True,num_ddocs=4,num_views_per_ddoc=10,items=200000,active_resident_threshold=10,dgm_run=True,eviction_policy=fullEviction
Comment by Nimish Gupta [ 17/Sep/14 ]
We have found the root cause and are working on the fix.
Comment by Nimish Gupta [ 19/Sep/14 ]
http://review.couchbase.org/#/c/41480




[MB-12197] Bucket deletion failing with error 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue."} Created: 16/Sep/14  Updated: 19/Sep/14  Resolved: 18/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Meenakshi Goel Assignee: Meenakshi Goel
Resolution: Fixed Votes: 0
Labels: windows, windows-3.0-beta, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.1-1299-rel

Attachments: Text File test.txt    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Ref Link:
http://qa.hq.northscale.net/job/win_2008_x64--14_01--replica_read-P0/32/consoleFull
http://qa.hq.northscale.net/job/win_2008_x64--59--01--bucket_flush-P1/14/console
http://qa.hq.northscale.net/job/win_2008_x64--59_01--warmup-P1/6/consoleFull

Test to Reproduce:
newmemcapable.GetrTests.getr_test,nodes_init=4,GROUP=P0,expiration=60,wait_expiration=true,error=Not found for vbucket,descr=#simple getr replica_count=1 expiration=60 flags = 0 docs_ops=create cluster ops = None
flush.bucketflush.BucketFlushTests.bucketflush,items=20000,nodes_in=3,GROUP=P0

*Note that the test itself doesn't fail, but subsequent tests fail with "error 400 reason: unknown ["Prepare join failed. Node is already part of cluster."]" because cleanup wasn't successful.

Logs:
[rebalance:error,2014-09-15T9:36:01.989,ns_1@10.3.121.182:<0.6938.0>:ns_rebalancer:do_wait_buckets_shutdown:307]Failed to wait deletion of some buckets on some nodes: [{'ns_1@10.3.121.182',
                                                         {'EXIT',
                                                          {old_buckets_shutdown_wait_failed,
                                                           ["default"]}}}]

[error_logger:error,2014-09-15T9:36:01.989,ns_1@10.3.121.182:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.6938.0>
    registered_name: []
    exception exit: {buckets_shutdown_wait_failed,
                        [{'ns_1@10.3.121.182',
                             {'EXIT',
                                 {old_buckets_shutdown_wait_failed,
                                     ["default"]}}}]}
      in function ns_rebalancer:do_wait_buckets_shutdown/1 (src/ns_rebalancer.erl, line 308)
      in call from ns_rebalancer:rebalance/5 (src/ns_rebalancer.erl, line 361)
    ancestors: [<0.811.0>,mb_master_sup,mb_master,ns_server_sup,
                  ns_server_cluster_sup,<0.57.0>]
    messages: []
    links: [<0.811.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 46422
    stack_size: 27
    reductions: 5472
  neighbours:

[user:info,2014-09-15T9:36:01.989,ns_1@10.3.121.182:<0.811.0>:ns_orchestrator:handle_info:483]Rebalance exited with reason {buckets_shutdown_wait_failed,
                              [{'ns_1@10.3.121.182',
                                {'EXIT',
                                 {old_buckets_shutdown_wait_failed,
                                  ["default"]}}}]}
[ns_server:error,2014-09-15T9:36:09.645,ns_1@10.3.121.182:ns_memcached-default<0.4908.0>:ns_memcached:terminate:798]Failed to delete bucket "default": {error,{badmatch,{error,closed}}}

Uploading Logs
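
For reference, the error in the title comes back from the bucket-delete REST call; a minimal sketch of that call (assuming the Python requests library; host and credentials are illustrative):

import requests

resp = requests.delete(
    "http://10.3.121.182:8091/pools/default/buckets/default",
    auth=("Administrator", "password"),
)
print(resp.status_code)   # 500 when deletion does not finish within the REST timeout
print(resp.text)          # {"_":"Bucket deletion not yet complete, but will continue."}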

 Comments   
Comment by Meenakshi Goel [ 16/Sep/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-12197/11dd43ca/10.3.121.182-9152014-938-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/e7795065/10.3.121.183-9152014-940-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/6442301b/10.3.121.102-9152014-942-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/10edf209/10.3.121.107-9152014-943-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/9f16f503/10.1.2.66-9152014-945-diag.zip
Comment by Ketaki Gangal [ 16/Sep/14 ]
Assigning to ns_server team for a first look.
Comment by Aleksey Kondratenko [ 16/Sep/14 ]
For cases like this it's very useful to get a sample of backtraces from memcached on the bad node. Is it still running?
Comment by Aleksey Kondratenko [ 16/Sep/14 ]
Eh. It's windows....
Comment by Aleksey Kondratenko [ 17/Sep/14 ]
I've merged the diagnostics commit (http://review.couchbase.org/41463). Please rerun, reproduce and give me a new set of logs.
Comment by Meenakshi Goel [ 18/Sep/14 ]
Tested with 3.0.1-1307-rel, Please find logs below.
https://s3.amazonaws.com/bugdb/jira/MB-12197/c2191900/10.3.121.182-9172014-2245-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/28bc4a83/10.3.121.183-9172014-2246-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/8f1efbe5/10.3.121.102-9172014-2248-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/91a89d6a/10.3.121.107-9172014-2249-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/2d272074/10.1.2.66-9172014-2251-diag.zip
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
BTW I am indeed quite interested if this is specific to windows or not.
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
This continues to be superweird. Possibly another erlang bug. I need somebody to answer the following:

* can we reliably reproduce this on windows?

* 100% of the time?

* if not, (roughly) how often?

* can we reproduce this (at all) on GNU/Linux? How frequently?
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
No need to diagnose it any further. Thanks to Aliaksey we managed to understand this case and fix is going to be merged shortly.
Comment by Venu Uppalapati [ 18/Sep/14 ]
Here is my empirical observation for this issue:
1) I have the following inside a .bat script

C:\"Program Files"\Couchbase\Server\bin\couchbase-cli.exe bucket-delete -c 127.0.0.1:8091 --bucket=default -u Administrator -p password

C:\"Program Files"\Couchbase\Server\bin\couchbase-cli.exe rebalance -c 127.0.0.1:8091 --server-remove=172.23.106.180 -u Administrator -p password

2) I execute this script against a two node cluster with default bucket created, but with no data.

3) I see bucket deletion and rebalance fail in succession. This happened 4 times out of 4 trials.
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
http://review.couchbase.org/41474
Comment by Meenakshi Goel [ 19/Sep/14 ]
Tested with 3.0.1-1309-rel and no longer seeing the issue.
http://qa.hq.northscale.net/job/win_2008_x64--14_01--replica_read-P0/34/console




[MB-10662] _all_docs is no longer supported in 3.0 Created: 27/Mar/14  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Ruth Harris
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-10649 _all_docs view queries fails with err... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
As of 3.0, view engine will no longer support the special predefined view, _all_docs.

It was not a published feature, but as it has been around for a long time, it is possible it was actually utilized in some setups.

We should document that _all_docs queries will not work in 3.0
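
One commonly suggested replacement for _all_docs is a regular view that emits every document id. A minimal sketch, assuming the Python requests library; the design-doc name, view name, host and credentials below are illustrative:

import json
import requests

ddoc = {
    "views": {
        "all": {"map": "function (doc, meta) { emit(meta.id, null); }"}
    }
}

# Create the design document through the view/CAPI port (8092).
requests.put(
    "http://127.0.0.1:8092/default/_design/all_ids",
    headers={"Content-Type": "application/json"},
    data=json.dumps(ddoc),
    auth=("Administrator", "password"),
)

# Query it much the way _all_docs used to be queried.
resp = requests.get(
    "http://127.0.0.1:8092/default/_design/all_ids/_view/all",
    params={"limit": 10},
    auth=("Administrator", "password"),
)
print(resp.json())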

 Comments   
Comment by Cihan Biyikoglu [ 27/Mar/14 ]
Thanks. Are there internal tools depending on this? Do you know if we have deprecated this in the past? I realize it isn't a supported API but want to make sure we keep the door open for feedback during beta from large customers etc.
Comment by Perry Krug [ 28/Mar/14 ]
We have a few (very few) customers who have used this. They've known it is unsupported...but that doesn't ever really stop anyone if it works for them.

Do we have a doc describing what the proposed replacement will look like and will that be available for 3.0?
Comment by Ruth Harris [ 01/May/14 ]
_all_docs is not mentioned anywhere in the 2.2+ documentation. Not sure how to handle this. It's not deprecated because it was never intended for use.
Comment by Perry Krug [ 01/May/14 ]
I think at the very least a prominent release note is appropriate.
Comment by Gerald Sangudi [ 17/Sep/14 ]
For N1QL, please advise customers to do

CREATE PRIMARY INDEX on --bucket-name--.




[MB-11999] Resident ratio of active items drops from 3% to 0.06% during rebalance with delta recovery Created: 18/Aug/14  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Abhinav Dangeti
Resolution: Fixed Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1169

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File vb_active_resident_items_ratio.png     PNG File vb_replica_resident_items_ratio.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares-dev/45/artifact/
Is this a Regression?: No

 Description   
1 of 4 nodes is being re-added after failover.
500M x 2KB items, 10K mixed ops/sec.

Steps:
1. Failover one of nodes.
2. Add it back.
3. Enabled delta recovery.
4. Sleep 20 minutes.
5. Rebalance cluster.

Most importantly it happens due to excessive memory usage.

 Comments   
Comment by Abhinav Dangeti [ 17/Sep/14 ]
http://review.couchbase.org/#/c/41468/
Comment by Abhinav Dangeti [ 18/Sep/14 ]
Merged fix.




[MB-12176] Missing port number on the network ports documentation for 3.0 Created: 12/Sep/14  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Cihan Biyikoglu Assignee: Ruth Harris
Resolution: Fixed Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Ruth Harris [ 16/Sep/14 ]
Network Ports section of the Couchbase Server 3.0 beta doc has been updated with the new ssl port, 11207, and the table with the details for all of the ports has been updated.

http://docs.couchbase.com/prebuilt/couchbase-manual-3.0/Install/install-networkPorts.html
The site (and network ports section) should be refreshed soon.

thanks, Ruth




[MB-12158] erlang gets stuck in gen_tcp:send despite socket being closed (was: Replication queue grows unbounded after graceful failover) Created: 09/Sep/14  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File dcp_proxy.beam    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
After speaking with Mike briefly, sounds like this may be a known issue. My apologies if there is a duplicate issue already filed.

Logs are here:
 https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-176-128-88.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-193-231-33.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-219-111-249.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-219-84-241.us-west-1.compute.amazonaws.com.zip

 Comments   
Comment by Mike Wiederhold [ 10/Sep/14 ]
Perry,

The stats seem to be missing for dcp streams so I cannot look further into this. If you can still reproduce this on 3.0 build 1209 then assign it back to me and include the logs.
Comment by Perry Krug [ 11/Sep/14 ]
Mike, does the cbcollect_info include these stats or do you need me to gather something specifically when the problem occurs?

If not, let's also get them included for future builds...
Comment by Perry Krug [ 11/Sep/14 ]
Hey Mike, I'm having a hard time reproducing this on build 1209 where it seemed rather easy on previous builds. Do you think any of the changes from the "bad_replicas" bug would have affected this? Is it worth reproducing on a previous build where it was easier in order to get the right logs/stats or do you think it may be fixed already?
Comment by Mike Wiederhold [ 11/Sep/14 ]
This very well could be related to MB-12137. I'll take a look at the cluster and if I don't find anything worth investigating further then I think we should close this as cannot reproduce since it doesn't seem to happen anymore on build 1209. If there is still a problem I'm sure it will be reproduced again later in one of our performance tests.
Comment by Mike Wiederhold [ 11/Sep/14 ]
It looks like one of the dcp connections to the failed over node was still active. My guess is that the node went down and came back up quickly. As a result it's possible that ns_server re-established the connection with the downed node. Can you attach the logs and assign this to Alk so he can take a look?
Comment by Perry Krug [ 11/Sep/14 ]
Thanks Mike.

Alk, logs are attached from the first time this was reproduced. Let me know if you need me to do so again.

Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Mike, btw for the future, if you could post exact details (i.e. node and name of connection) of stuff you want me to double-check/explain it could have saved me time.

Also, let me note that it's replica and node master who establishes replication. I.e. we're "pulling" rather than "pushing" replication.

I'll look at all this and see if I can find something.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Sorry, replica instead of master, who initiates replication.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Indeed I'm seeing dcp connection from memcached on .33 to beam of .88. And it appears that something in dcp replicator is stuck. I'll need a bit more time to figure this out.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Looks like socket send gets blocked somehow despite socket actually being closed already.

Might be serious enough to be a show stopper for 3.0.

Do you by any chance still have nodes running? Or if not, can you easily reproduce this? Having direct access to bad node might be very handy to diagnose this further.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Moved back to 3.0. Because if it's indeed an erlang bug it might be very hard to fix, and because it may happen not just during failover.
Comment by Cihan Biyikoglu [ 12/Sep/14 ]
triage - need an update pls.
Comment by Perry Krug [ 12/Sep/14 ]
I'm reproducing now and will post both the logs and the live systems momentarily
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Able to reproduce this condition with erlang outside of our product (which is great news):

* connect gen_tcp socket to nc or irb process listening

* spawn erlang process that will send stuff infinitely on that socket and will eventually block

* from erlang console do gen_tcp:close (i.e. while other erlang process is blocked writing)

* observe how erlang process that's blocked is still blocked

* observe with lsof that socket isn't really closed

* close the socket on the other end (by killing nc)

* observe with lsof that socket is closed

* observe how erlang process is still blocked (!) despite underlying socket fully dead

The fact that it's not a race is really great because dealing with a deterministic bug (even if it's a "feature" from erlang's point of view) is much easier
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Fix is at: http://review.couchbase.org/41396

I need approval to get this in 3.0.0.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Attaching fixed dcp_proxy.beam if somebody wants to be able to test the fix without waiting for build
Comment by Perry Krug [ 12/Sep/14 ]
Awesome as usual Alk, thanks very much.

I'll give this a try on my side for verification.
Comment by Parag Agarwal [ 12/Sep/14 ]
Alk, will this issue occur in TAP as well? during upgrades.
Comment by Mike Wiederhold [ 12/Sep/14 ]
Alk,

I apologize for not including a better description of what happened. In the future I'll make sure to leave better details before assigning bugs to others so that we don't have multiple people duplicating the same work.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
>> Alk, will this issue occur in TAP as well? during upgrades.

No.
Comment by Perry Krug [ 12/Sep/14 ]
As of yet unable to reproduce this on build 1209+dcp_proxy.beam.

Thanks for the quick turnaround Alk.
Comment by Cihan Biyikoglu [ 12/Sep/14 ]
triage discussion:
under load this may happen frequently -
there is a good chance that this recovers itself in a few mins - it should, but we should validate.
if we are in this state, we can restart erlang to get out of the situation - no app unavailability required
fix could be risky to take at this point

decision: not taking this for 3.0
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Mike, need you ACK on this:

Because of dcp nops between replicators, the dcp producer should, after a few minutes, close its side of the socket and release all resources.

Am I right? I said this in the meeting just a few minutes ago and it affected the decision. If I'm wrong (say if you decided to disable nops in the end, or if you know it's broken etc), then we need to know it.
Comment by Perry Krug [ 12/Sep/14 ]
FWIW, I have seen that this does not recover after a few minutes. However, I agree that it is workaround-able both by restarting beam or bringing the node back into the cluster. Unless we think this will happen much more often, I agree it could be deferred out of 3.0.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Well, if it does not recover then it can be argued that we have another bug on the ep-engine side that may lead to similar badness (queue size and resources eaten) _without_ a clean workaround.

Mike, we'll need your input on DCP NOPs.
Comment by Mike Wiederhold [ 12/Sep/14 ]
I was curious about this myself. As far as I know the noop code is working properly and we have some tests to make sure it is. I can work with Perry to try to figure out what is going on on the ep-engine side and see if the noops are actually being sent. I know this sounds unlikely, but I was curious whether or not the noops were making it through to the failed over node for some reason.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
>> I know this sounds unlikely, but I was curious whether or not the noops were making it through to the failed over node for some reason.

I can rule this out. We do have a connection between the destination's beam and the source's memcached. And we _don't_ have the beam's connection to the destination memcached anymore. Erlang is stuck writing to a dead socket. So there's no way you could get nop acks back.
Comment by Perry Krug [ 15/Sep/14 ]
I've confirmed that this state persists for much longer than a few minutes...I've not ever seen it recover itself, and have left it to run for 15-20 minutes at least.

Do you need a live system to diagnose?
Comment by Cihan Biyikoglu [ 15/Sep/14 ]
thanks for the update - Mike, sounds like we should open an issue for DCP to reliably detect these conditions. We should add this in for 3.0.1.
Perry, could you confirm that restarting the erlang process resolves the issue?
thanks
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41410

Mike will open different ticket for NOPs in DCP.




[MB-12054] [windows] [2.5.1] cluster hang when flush beer-sample bucket Created: 22/Aug/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Abhinav Dangeti
Resolution: Cannot Reproduce Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows server 2008 R2

Attachments: Zip Archive 172.23.107.124-8222014-1546-diag.zip     Zip Archive 172.23.107.125-8222014-1547-diag.zip     Zip Archive 172.23.107.126-8222014-1548-diag.zip     Zip Archive 172.23.107.127-8222014-1549-diag.zip    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 2.5.1 on 4 nodes windows server 2008 R2 64-bit
Create a cluster of 4 nodes
Create beer-sample bucket
Enable flush in bucket setting.
Flush the beer-sample bucket. The cluster hangs.

 Comments   
Comment by Abhinav Dangeti [ 11/Sep/14 ]
I wasn't able to reproduce this issue with a 2.5.1 build with 2 nodes.

From your logs on one of the nodes I see some couchNotifier logs, where we are waiting for mcCouch:
..
Fri Aug 22 14:00:03.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
Fri Aug 22 14:21:53.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
Fri Aug 22 14:43:43.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
Fri Aug 22 15:05:33.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
...

This won't be a problem in 3.0.1, as mcCouch has been removed. Please re-open if you see this issue in your testing again.




[MB-11426] API for compact-in-place operation Created: 13/Jun/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Jens Alfke Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
It would be convenient to have an explicit API for compacting the database in place, i.e. to the same file. This is what auto-compact does, but if auto-compact isn't enabled, or if the caller wants to run it immediately instead of on a schedule, then the caller has to use fdb_compact, which compacts to a separate file.

I assume the workaround is to compact to a temporary file, then replace the original file with the temporary. But this is several more steps. Since forestdb already contains the logic to compact in place, it'd be convenient if calling fdb_compact(handle, NULL) would do that.

 Comments   
Comment by Chiyoung Seo [ 10/Sep/14 ]
The change is in gerrit for review:

http://review.couchbase.org/#/c/41337/
Comment by Jens Alfke [ 10/Sep/14 ]
The notes on Gerrit say "a new file name will be automatically created by appending a file revision number to the original file name. …. Note that this new compacted file can be still opened by using the original file name"

I don't understand what's going on here — after the compaction is complete, does the old file still exist or am I responsible for deleting it? When does the file get renamed back to the original filename, or does it ever? Should my code ignore the fact that the file is now named "test.fdb.173" and always open it as "test.fdb"?
Comment by Chiyoung Seo [ 10/Sep/14 ]
>I don't understand what's going on here — after the compaction is complete, does the old file still exist or am I responsible for deleting it?

The old file is automatically removed by ForestDB after the compaction is completed.

>When does the file get renamed back to the original filename, or does it ever?

The file won't be renamed to the original name in the current implementation. But, I will adapt the current implementation so that when the file is closed and its ref counter becomes zero, the file can be renamed to its original name.

>Should my code ignore the fact that the file is now named "test.fdb.173" and always open it as "test.fdb"?

Yes, you can still open "test.fdb.173" by passing "test.fdb" file name.

Note that renaming it to the original file name right after finishing the compaction becomes complicated as other threads might traverse the old file's blocks (through the buffer cache or OS page cache).

Comment by Chiyoung Seo [ 11/Sep/14 ]
I incorporated those answers into the commit message and API header file. Let me know if you have any suggestions / concerns.
Comment by Chiyoung Seo [ 12/Sep/14 ]
The change was merged into the master branch.




[MB-12082] Marketplace AMI - Enterprise Edition and Community Edition - provide AMI id to PM Created: 27/Aug/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: cloud
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Anil Kumar Assignee: Wei-Li Liu
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Need AMI's before 3.0.0 GA

 Comments   
Comment by Wei-Li Liu [ 17/Sep/14 ]
3.0.0 EE AMI: ami-283a9440 Snapshots: snap-308fc192
3.0.0 CE AMI: ami-3237995a




[MB-12186] If flush can not be completed because of a timeout, we should not display a message "Failed to flush bucket" when it's still in progress Created: 15/Sep/14  Updated: 17/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Andrei Baranouski Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1208

Attachments: PNG File MB-12186.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
When I tried to flush a heavily loaded cluster I received a "Failed To Flush Bucket" popup; in fact it had not failed, it simply had not completed within the set period of time (30 sec).

Expected behaviour: a message like "flush is not complete, but continue..."
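
For context, the UI button drives the flush REST endpoint; a minimal sketch of that call (assuming the Python requests library; host and credentials are illustrative):

import requests

resp = requests.post(
    "http://127.0.0.1:8091/pools/default/buckets/default/controller/doFlush",
    auth=("Administrator", "password"),
    timeout=30,          # roughly the window after which the UI reports a failure
)
print(resp.status_code)  # 200 when the flush finishes within the timeout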

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
timeout is timeout. We can say "it timed out" but we cannot be sure if it's continuing or not.
Comment by Andrei Baranouski [ 15/Sep/14 ]
hm, we also get a timeout when bucket removal takes much longer, but in that case we inform the user that the removal is still in progress, right?
Comment by Aleksey Kondratenko [ 17/Sep/14 ]
You're right. I don't think we're entirely precise on the bucket deletion timeout message. It's one of our mid-term goals to be better about these longer-running ops and how their progress or results are exposed to the user. I see not much value in tweaking messages. Instead we'll just make this entire thing work "right".




[MB-12200] Seg fault during indexing on view-toy build testing Created: 16/Sep/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ketaki Gangal Assignee: Harsha Havanur
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: -3.0.0-700-hhs-toy
-Cen 64 Machines
- 7 Node cluster, 2 Buckets, 2 Views

Attachments: Zip Archive 10.6.2.168-9162014-106-diag.zip     Zip Archive 10.6.2.187-9162014-1010-diag.zip     File crash_beam.smp.rtf     File crash_toybuild.rtf    
Issue Links:
Duplicate
is duplicated by MB-11917 One node slow probably due to the Erl... Open
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
1. Load 70M, 100M on either bucket
2. Wait for initial indexing to complete
3. Start updates on the cluster: 1K gets, 7K sets across the cluster

Seeing numerous cores from beam.smp.

Stack is attached.

Adding logs from the nodes.


 Comments   
Comment by Sriram Melkote [ 16/Sep/14 ]
Harsha, this appears to clearly be a NIF related regression. We need to discuss why our own testing didn't find this after you figure out the problem.
Comment by Volker Mische [ 16/Sep/14 ]
Siri, I haven't checked if it's the same issue, but the current patch doesn't pass our unit tests. See my comment at http://review.couchbase.org/41221
Comment by Ketaki Gangal [ 16/Sep/14 ]
Logs https://s3.amazonaws.com/bugdb/bug-12200/bug_12200.tar
Comment by Harsha Havanur [ 17/Sep/14 ]
The issue Volker mentioned is one of queue size. I suspect that if a context sits in the queue beyond 5 seconds, the terminator loop destroys the context, and when the doMapDoc loop dequeues the task it results in a SEGV because the ctx is already destroyed. Trying a fix that both increases the queue size and handles destroyed contexts.
Comment by Sriram Melkote [ 17/Sep/14 ]
Folks, let's follow this on MB-11917 as it's clear now that this bug is caused by the toy build as a result of proposed fix for MB-11917.




[MB-12206] New 3.0 Doc Site, View and query pattern samples unparsed markup Created: 17/Sep/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Ruth Harris
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
On the page

http://draft.docs.couchbase.com/prebuilt/couchbase-manual-3.0/Views/views-querySample.html

The view code examples under 'General advice' are not displayed properly.

 Comments   
Comment by Ruth Harris [ 17/Sep/14 ]
Fixed. Legacy formatting issues from previous source code.




[MB-12167] Remove Minor / Major / Page faults graphs from the UI Created: 10/Sep/14  Updated: 16/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Trivial
Reporter: Ian McCloy Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 1
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Customers often ask what is wrong with their system when they see anything greater than 0 page faults in the UI graphs. What are customers supposed to do with the information? This isn't a useful metric for customers and we shouldn't show it in the UI. If needed for development debugging we can query it from the REST API.
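
A minimal sketch of pulling these numbers from the bucket stats REST endpoint instead of the UI (assuming the Python requests library; the exact sample field names such as "minor_faults"/"major_faults" are assumptions based on the UI labels):

import requests

resp = requests.get(
    "http://127.0.0.1:8091/pools/default/buckets/default/stats",
    auth=("Administrator", "password"),
)
samples = resp.json()["op"]["samples"]
# Field names assumed from the UI labels; adjust to whatever the endpoint actually returns.
print("minor faults:", samples.get("minor_faults", [])[-5:])
print("major faults:", samples.get("major_faults", [])[-5:])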

 Comments   
Comment by Matt Ingenthron [ 10/Sep/14 ]
Just to opine: +1. There are a number of things in the UI that aren't actionable. I know they help us when we look back over time, but as presented it's not useful.
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
So it's essentially an expression of our belief that the majority of our customers are ignorant enough to be confused by "fault" in the name of this stat?

Just want to make sure that there's no misunderstanding on this.

On Matt's point I'd like to say that all our stats are not actionable. They're just information that might end up helpful occasionally. And yes, especially major page faults are a _tremendously_ helpful sign of issues.
Comment by Matt Ingenthron [ 10/Sep/14 ]
I don't think the word "fault" is at issue, but maybe others do. I know there are others that aren't actionable and to be honest, I take issue with them too. This one is just one of the more egregious examples. :) The problem is, in my opinion, it's not clear what one would do with minor page fault data. One can't really know what's good or bad without looking at trends or doing further analysis.

While I'm tossing out opinions, similarly visualizing everything as a queue length isn't always good. To the app, latency and throughput matter-- how many queues and where they are affects this, but doesn't define it. A big queue length with fast storage can still have very good latency/throughput and equally a short queue length with slow or variable (i.e., EC2 EBS) storage can have poor latency/throughput. An app that will slow down with higher latencies won't make the queue length any bigger.

Anyway, pardon the wide opinion here-- I know you know all of this and I look forward to improvements when we get to them.

You raise a good point on major faults though.

If it only helps occasionally, then it's consistent with the request (to remove it from the UI, but still have it in there). I'm merely a user here, so please discount my opinion accordingly!
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
>> If it only helps occasionally, then it's consistent with the request (to remove it from the UI, but still have it in there).

Well, but then it's true for almost all of our stats, isn't it? Doesn't it mean that we need to hide them all then?
Comment by Matt Ingenthron [ 10/Sep/14 ]
>> Well but then it's true for almost all of our stats isn't? Doesn't it mean that we need to hide them all then ?

I don't think so. That's an extreme argument. I'd put ops/s which is directly proportional to application load and minor faults which is affected by other things on the system in very different categories. Do we account for minor faults at a per-bucket level? ;)
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
>> I'd put ops/s which is directly proportional to application load and minor faults which is affected by other things on the system in very different categories.

True.

>> Do we account for minor faults at a per-bucket level? ;)

No. And good point. Indeed, lacking a better UI, we show all system stats (including some high-usefulness things like the count of memcached connections) as part of showing any bucket's stats, despite gathering and storing system stats separately.

In any case, I'm not totally against hiding the page fault stats. It's indeed a minor topic.

But I'd like to see a good reason for that. Because for _anything_ that we do there will always be at least one user that's confused, which isn't IMO a valid reason for "let's hide it".

My team spent some effort getting these stats, and we did it specifically because we knew that major page faults are important to be aware of. And we also know that on Linux even minor page faults might be "major" in terms of latency impact. We've seen it with our own eyes.

I.e. when you're running out of free pages, one could think that Linux is just supposed to grab one of the clean pages from the page cache, but we've seen this take seconds for reasons I'm not quite sure of. It does look like Linux might routinely delay a minor page fault for IO (perhaps due to some locking impacts). And things like a huge-page "minor" page fault may have an even more obviously hard effect (i.e. because you need a physically contiguous run of memory, and getting that might require "memory compaction", locking, etc). And our system, doing constant non-direct-io writes, routinely hits this hard condition, because near-every write from ep-engine or the view engine has to allocate brand new page(s) for that data due to the append-onlyness of our design (forestdb's direct io path plus custom buffer cache management should help dramatically here).

Comment by Patrick Varley [ 10/Sep/14 ]
I think there are three main consumers of stats:

* Customers (cmd_get)
* Support (ep_bg_load_avg)
* Developers of the component (erlang memory atom_used)

As a result we display and collect these stats in different ways, i.e. UI, cbstats, ns_doctor, etc.

A number of our users find the amount of stats in the UI overwhelming; a lot of the time they do not know which ones are important.

Some of our users do not even understand what a virtual memory system is, let alone what a page fault is.

I do not think we should display the page faults in the UI, but we should still collect them. I believe we can make better use of the space in the UI. For example: network usage or byte_written or byte_read, tcp retransmissions, disk performance.
Comment by David Haikney [ 11/Sep/14 ]
+ 1 for removing page faults. The justification:
* We put them front and centre of the UI. Customers see Minor faults, Major Faults and Total faults before # gets, # sets.
* They have not proven useful for support in diagnosing an issue. In fact they cause more "false positive" questions ("my minor faults look high, is that a problem?")
* Overall this constitutes "noise" that our customers can do without. The stats can quite readily be captured elsewhere if we want to record them.

It would be easy to expand this into a wider discussion of how we might like to reorder / expand all of the current graphs in the UI - and that's a useful discussion. But I propose we keep this ticket to the question of removing the page fault stats.
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41333
Comment by Ian McCloy [ 16/Sep/14 ]
Which version of Couchbase Server is this fixed in?




[MB-11939] Bucket configuration dialog should mention that fullEviction policy doesn't retain keys Created: 12/Aug/14  Updated: 16/Sep/14  Resolved: 16/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Pavel Blagodov
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
Current wording:
"Retain metadata in memory even for non-resident items"
"Don't retain metadata in memory for non-resident items"

"Metadata" is kind of ambiguous. Please mention keys explicitly.


 Comments   
Comment by Ilam Siva [ 12/Aug/14 ]
Change Radiobutton options to read:
Value Eviction (radiobutton selected by default)
Full Eviction

Change "Whats this?" hint:
Value Eviction - During eviction, only the value will be evicted (key and metadata will remain in memory)
Full Eviction - During eviction, everything (including key, metadata and value) will be evicted
Value Eviction needs more system memory but provides the best performance. Full Eviction reduces memory overhead requirement.
Comment by Anil Kumar [ 12/Aug/14 ]
Pavel - These changes will go into the Bucket Creation and Bucket Edit UI.
Comment by Pavel Blagodov [ 14/Aug/14 ]
http://review.couchbase.org/40576
Comment by Anil Kumar [ 09/Sep/14 ]
Looks like we made a typo which needs to be corrected.

Change Radiobutton options to read:
Value Ejection (radiobutton selected by default)
Full Ejection

Change "Whats this?" hint:
Value Ejection - During ejection, only the value will be ejected (key and metadata will remain in memory)
Full Ejection - During ejection, everything (including key, metadata and value) will be ejected
Value Ejection needs more system memory but provides the best performance. Full Ejection reduces memory overhead requirement.
Comment by Pavel Blagodov [ 11/Sep/14 ]
http://review.couchbase.org/41358




[MB-12141] Try to delete a Server group that is empty. The error message needs to be descriptive Created: 05/Sep/14  Updated: 16/Sep/14  Resolved: 16/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Pavel Blagodov
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7 64-bit, build 3.0.1_1261

Attachments: PNG File Screen Shot 2014-09-05 at 5.22.08 PM.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Login to the Couchbase console
http://10.2.2.52:8091/ (Administrator/Password)
Click on Server Nodes
Try to create a group and then click to delete it
You will see the error shown in the screenshot
Expected behavior: "Removing Server Group" as the title, and it should say "Are you sure you want to remove the Server group?" or something like that.

 Comments   
Comment by Pavel Blagodov [ 11/Sep/14 ]
http://review.couchbase.org/41359




[MB-12128] Stale=false may not ensure RYOW property (Regression) Created: 03/Sep/14  Updated: 16/Sep/14  Resolved: 16/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Sarath Lakshman Assignee: Sarath Lakshman
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
For performance reasons, we tried to reply to stale=false query readers immediately after an updater internal checkpoint. This may result in sending index updates after partial snapshot reads, so the user may not observe RYOW. To ensure RYOW, we should always return results only after processing a complete UPR snapshot.

We just need to revert this commit to fix the problem: https://github.com/couchbase/couchdb/commit/e866fe9330336ab1bda92743e0bd994530532cc8

We are fairly confident that reverting this change will not break anything. It was added as a pure performance improvement.
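
To make the RYOW property concrete, a sketch of the check a client would expect to pass (assuming the Python requests library; store_document() and the ddoc/view names are hypothetical stand-ins for whatever KV client and views the application uses):

import requests

def ryow_visible(host, bucket, key, value):
    store_document(bucket, key, value)   # hypothetical write through the KV client
    resp = requests.get(
        "http://%s:8092/%s/_design/ddoc/_view/by_id" % (host, bucket),
        params={"stale": "false", "key": '"%s"' % key},
    )
    rows = resp.json().get("rows", [])
    # With correct stale=false semantics the fresh write must already be indexed;
    # the bug could return results built from a partial UPR snapshot, in which
    # case the row may be missing.
    return any(row.get("key") == key for row in rows)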

 Comments   
Comment by Sarath Lakshman [ 04/Sep/14 ]
Added a unit test to prove this case
http://review.couchbase.org/#/c/41192

Here is the change for reverting corresponding commit
http://review.couchbase.org/#/c/41193/
Comment by Wayne Siu [ 04/Sep/14 ]
As discussed in the release meeting on 09.04.14, this is scheduled for 3.0.1.
Comment by Sarath Lakshman [ 16/Sep/14 ]
Merged




[MB-10012] cbrecovery hangs in the case of multi-instance case Created: 24/Jan/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Venu Uppalapati Assignee: Ashvinder Singh
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive cbrecovery1.zip     Zip Archive cbrecovery2.zip     Zip Archive cbrecovery3.zip     Zip Archive cbrecovery4.zip     Zip Archive cbrecovery_source1.zip     Zip Archive cbrecovery_source2.zip     Zip Archive cbrecovery_source3.zip     Zip Archive cbrecovery_source4.zip     PNG File recovery.png    
Issue Links:
Relates to
Triage: Triaged
Operating System: Centos 64-bit

 Description   
2.5.0-1055

during verification of MB-9967 I performed the same steps:
source cluster: 3 nodes, 4 buckets
destination cluster: 3 nodes, 1 bucket
failover 2 nodes on destination cluster(without rebalance)

cbrecovery hangs on

[root@centos-64-x64 ~]# /opt/couchbase/bin/cbrecovery http://172.23.105.158:8091 http://172.23.105.159:8091 -u Administrator -U Administrator -p password -P password -b RevAB -B RevAB -v
Missing vbuckets to be recovered:[{"node": "ns_1@172.23.105.159", "vbuckets": [513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023]}]
2014-01-22 01:27:59,304: mt cbrecovery...
2014-01-22 01:27:59,304: mt source : http://172.23.105.158:8091
2014-01-22 01:27:59,305: mt sink : http://172.23.105.159:8091
2014-01-22 01:27:59,305: mt opts : {'username': '<xxx>', 'username_destination': 'Administrator', 'verbose': 1, 'dry_run': False, 'extra': {'max_retry': 10.0, 'rehash': 0.0, 'data_only': 1.0, 'nmv_retry': 1.0, 'conflict_resolve': 1.0, 'cbb_max_mb': 100000.0, 'try_xwm': 1.0, 'batch_max_bytes': 400000.0, 'report_full': 2000.0, 'batch_max_size': 1000.0, 'report': 5.0, 'design_doc_only': 0.0, 'recv_min_bytes': 4096.0}, 'bucket_destination': 'RevAB', 'vbucket_list': '{"172.23.105.159": [513]}', 'threads': 4, 'password_destination': 'password', 'key': None, 'password': '<xxx>', 'id': None, 'bucket_source': 'RevAB'}
2014-01-22 01:27:59,491: mt bucket: RevAB
2014-01-22 01:27:59,558: w0 source : http://172.23.105.158:8091(RevAB@172.23.105.156:8091)
2014-01-22 01:27:59,559: w0 sink : http://172.23.105.159:8091(RevAB@172.23.105.156:8091)
2014-01-22 01:27:59,559: w0 : total | last | per sec
2014-01-22 01:27:59,559: w0 batch : 1 | 1 | 15.7
2014-01-22 01:27:59,559: w0 byte : 0 | 0 | 0.0
2014-01-22 01:27:59,559: w0 msg : 0 | 0 | 0.0
2014-01-22 01:27:59,697: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210
2014-01-22 01:27:59,719: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210
2014-01-22 01:27:59,724: w2 source : http://172.23.105.158:8091(RevAB@172.23.105.158:8091)
2014-01-22 01:27:59,724: w2 sink : http://172.23.105.159:8091(RevAB@172.23.105.158:8091)
2014-01-22 01:27:59,727: w2 : total | last | per sec
2014-01-22 01:27:59,728: w2 batch : 1 | 1 | 64.0
2014-01-22 01:27:59,728: w2 byte : 0 | 0 | 0.0
2014-01-22 01:27:59,728: w2 msg : 0 | 0 | 0.0
2014-01-22 01:27:59,738: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210
2014-01-22 01:27:59,757: s1 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 513, key: RAB_111001636418, spec: http://172.23.105.159:8091, host:port: 172.23.105.159:11210



 Comments   
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - June 04 2014 Bin, Ashivinder, Venu, Tony
Comment by Cihan Biyikoglu [ 27/Aug/14 ]
does this need to be considered for 3.0 or is this a test issue only?
Comment by Cihan Biyikoglu [ 29/Aug/14 ]
pls rerun on RC2 and validate that this is a test execution issue and not a tools issue.
thanks
Comment by Ashvinder Singh [ 15/Sep/14 ]
could not reproduce this issue.
Comment by Ashvinder Singh [ 15/Sep/14 ]
cannot reproduce




[MB-12164] UI: Cancelling a pending add should not show "reducing capacity" dialog Created: 10/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Improvement Priority: Trivial
Reporter: David Haikney Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
3.0.0 Beta build 2

Steps to reproduce:
In the UI click "Server add".
Add the credentials for a server to be added
In the Pending Rebalance pane click "Cancel"

Actual Behaviour:
See a dialog stating "Warning – Removing this server from the cluster will reduce cache capacity across all data buckets. Are you sure you want to remove this server?"

Expected behaviour:
The dialog is not applicable in this context since cancelling the add of a node that never joined will do nothing to the cluster capacity. Would expect either no dialog or a dialog acknowledging that "This node will no longer be added to the cluster on next rebalance"

 Comments   
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
But it _is_ applicable because you're returning node to "pending remove" state.
Comment by David Haikney [ 10/Sep/14 ]
A node that has never held any data or actively participated in the cluster cannot possibly reduce the cluster's capacity.
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
It looks like I misunderstood this request as referring to cancelling add-back after failover. Which it isn't.

Makes sense now.
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41428




[MB-12147] {UI} :: Memcached Bucket with 0 items indicates NaNB / NaNB for Data/Disk Usage Created: 08/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Parag Agarwal Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Any environment

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
On a 1 node cluster create a memcached bucket with 0 items. The UI says NaNB / NaNB for Data/Disk Usage.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41379




[MB-12156] time of check/time of use race in data path change code of ns_server may lead to deletion of all buckets after adding node to cluster Created: 09/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0, 1.8.1, 2.0, 2.1.0, 2.2.0, 2.1.1, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: No

 Description   
SUBJ.

In the code that changes the data path, we first check whether the node is provisioned (without preventing its provisioned state from changing after that) and then proceed with the change of data path. As part of changing the data path we delete buckets.

So if the node gets added to a cluster after the check but before the data path is actually changed, we'll delete all buckets of the cluster.

As improbable as it may seem, it actually occurred in practice. See CBSE-1387.
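
An illustrative sketch of the time-of-check/time-of-use window (plain Python, not the actual ns_server code; all names are made up):

class Node:
    def __init__(self):
        self.provisioned = False   # flips to True once the node joins a cluster
        self.buckets = []
        self.data_path = "/opt/couchbase/var/lib/couchbase/data"

    def change_data_path(self, new_path):
        if self.provisioned:                 # time of check
            raise RuntimeError("refusing to change data path on a provisioned node")
        # ...window: a concurrent "add node to cluster" can complete here,
        # after which self.buckets refers to the cluster's buckets...
        self.buckets.clear()                 # time of use: deletes whatever exists now
        self.data_path = new_path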


 Comments   
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
Whether it's a must have for 3.0.0 is not for me to decide but here's my thinking.

* the bug was there at least since 2.0.0 and it really requires something outstanding in customer's environment to actually occur

* 3.0.1 is just couple months away

* 3.0.0 is done

But if we're still open to adding this fix to 3.0.0, my team will surely be glad to do it.
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41332
http://review.couchbase.org/41333




[MB-12187] Webinterface is not displaying items above 2.5kb in size Created: 15/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Philipp Fehre Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: MacOS, Webinterface

Attachments: PNG File document_size_couchbase.png    

 Description   
When trying to display a document that is above 2.5KB in size, the web interface blocks the display. 2.5KB seems like a really low limit and is easily reached by regular documents, which makes using the web interface inefficient, especially when a bucket contains many documents close to this limit.
It makes sense to have a limit so that really big documents are not loaded into the interface, but 2.5KB seems like a really low one.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
By design. Older browsers have trouble with larger docs. And there must be a duplicate of this somewhere.




[MB-12188] we should not duplicate log messages if we already have logs with "repeated n times" template Created: 15/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Andrei Baranouski Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File MB-12188.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
please see screenshot,

I think the log entries without "repeated n times" are unnecessary

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
They _are_. The logic (and it's the same logic that many logging products have) is: _if_ in a short period of time (say 5 minutes) you have a bunch of the same messages, it'll log them once. But if the period between messages is larger, then they're logged separately.




Generated at Sat Sep 20 07:24:44 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.